TED Spider

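TED_spider.py is the first web scraper I wrote, back in Saudi Arabia during Ramadan in the summer of 2016. I am writing this post to review it.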

Libraries used

Target URL: https://www.ted.com/talks
sqlite3: database storage
BeautifulSoup: page parsing
urllib.request: sending requests

Fetching the page

urlopen retrieves the page source:

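from urllib.request import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    html = urlopen(url).read()  # raw bytes of the page
    return BeautifulSoup(html, "lxml")  # parse the HTML with the lxml parser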
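
As written, make_soup has no timeout and lets any network error propagate. A slightly more defensive fetch might look like the sketch below. It is not part of the original TED_spider.py: the names build_request and fetch_html, the User-Agent string, and the 10-second timeout are all assumptions chosen for illustration.

```python
from urllib.request import Request, urlopen

# Assumed value for illustration; any descriptive string works here.
USER_AGENT = "Mozilla/5.0 (compatible; TED-spider-example)"

def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10, opener=urlopen):
    """Return the page bytes, or None if the request fails.

    opener is injectable so the function can be exercised without network access.
    """
    try:
        with opener(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as exc:  # URLError, HTTPError and timeouts are all OSError subclasses
        print(f"failed to fetch {url}: {exc}")
        return None
```

The returned bytes can be passed to BeautifulSoup exactly as in make_soup; a None result signals that the page should be skipped or retried.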

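[Screenshot: the talks page inspected in the browser]

As in the screenshot above, inspect the page in the browser to find where the information you want is located.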

The key is to use Beautiful Soup to select exactly the information you want to extract. This requires familiarity with HTML and CSS, and with BS4 (https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html):

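BASE_URL = "https://www.ted.com"  # assumed: site root, not defined in the original snippet

def get_talks(url):
    # Match the talk grid by tag name and exact class string
    talks = make_soup(url).find("div", "row row-sm-4up row-lg-6up row-skinny")
    # List comprehension: each <h4 class="h9 m5"> wraps a relative link to one talk
    talk_links = [BASE_URL + h4.a["href"] for h4 in talks.find_all("h4", "h9 m5")]
    # The index page also carries "posted" and "rated" info
    return talk_links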
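
To experiment with the selection step without hitting the site, the same idea can be tried on a small inline snippet. This is only a sketch: the markup below is hypothetical and merely reuses the class names from get_talks, the function name extract_links is made up, BASE_URL is assumed to be the site root, and a CSS selector via select() is swapped in for find/find_all because it does not depend on the exact order of the class names.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.ted.com"  # assumed site root

# Hypothetical, simplified markup modelled on the class names used in get_talks
SAMPLE_HTML = """
<div class="row row-sm-4up row-lg-6up row-skinny">
  <div class="col"><h4 class="h9 m5"><a href="/talks/example_talk_one">Talk one</a></h4></div>
  <div class="col"><h4 class="h9 m5"><a href="/talks/example_talk_two">Talk two</a></h4></div>
</div>
"""

def extract_links(html):
    """Return absolute talk URLs found in the given HTML, in document order."""
    # html.parser ships with Python, so this example does not need lxml
    soup = BeautifulSoup(html, "html.parser")
    # div with class row-skinny -> h4 with classes h9 and m5 -> link with an href
    anchors = soup.select("div.row-skinny h4.h9.m5 a[href]")
    return [BASE_URL + a["href"] for a in anchors]

print(extract_links(SAMPLE_HTML))
```

If the site changes its class names, only the selector string needs updating; an empty list is returned when nothing matches.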

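Initializing the database

The scraped information is stored in a database, SQLite in this case, which requires some knowledge of SQL and databases: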

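import os
import sqlite3

# Check first: sqlite3.connect() creates the file if it is missing
db_exists = os.path.exists("data/TED.db")
os.makedirs("data", exist_ok=True)  # connect() fails if the directory does not exist
# Open (or create) the database and get a cursor
conn = sqlite3.connect("data/TED.db")
cur = conn.cursor()
if not db_exists:
    # First run: create the table
    cur.execute('''CREATE TABLE TED
    (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        speaker CHAR,
        talk_name CHAR,
        talk_link TEXT,
        watch_times INT,
        place CHAR,
        length CHAR,
        month CHAR,
        brief_description TEXT,
        transcript TEXT,
        similar_topics TEXT
    );''')
    conn.commit()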
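
Once the table exists, each scraped talk becomes one INSERT. The sketch below uses named placeholders so that quotes in titles or transcripts cannot break the SQL. It runs against an in-memory database with a shortened column list; the helper name save_talk and the sample row are made up for illustration.

```python
import sqlite3

# In-memory database so the example leaves no file behind;
# the real script connects to data/TED.db instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS TED (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        speaker CHAR,
        talk_name CHAR,
        talk_link TEXT,
        watch_times INT
    )"""
)

def save_talk(conn, talk):
    """Insert one talk (a dict keyed by column name) and commit."""
    conn.execute(
        "INSERT INTO TED (speaker, talk_name, talk_link, watch_times) "
        "VALUES (:speaker, :talk_name, :talk_link, :watch_times)",
        talk,
    )
    conn.commit()

# Made-up row; note the apostrophe, which the placeholder handles safely
save_talk(conn, {
    "speaker": "Example Speaker",
    "talk_name": "It's an example talk",
    "talk_link": "https://www.ted.com/talks/example_talk",
    "watch_times": 12345,
})

rows = conn.execute("SELECT ID, speaker, talk_name, watch_times FROM TED").fetchall()
print(rows)
```

CREATE TABLE IF NOT EXISTS is used here as an alternative to checking for the file with os.path.exists; either approach works.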

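What remains is the driver code: loop over every talk page, scrape the information, and store it in the database.
The data can then be analyzed in various ways.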
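
The driver loop described above could be sketched roughly as follows. This is not the original script: crawl and get_talk_info are hypothetical names, the ?page=N index URLs are an assumption about how the talk list is paginated, and the demo uses stand-in functions so that nothing touches the network.

```python
import sqlite3
import time

def crawl(index_urls, get_talks, get_talk_info, conn, delay=1.0):
    """Visit each index page, scrape every new talk link, store one row per talk."""
    seen = set()
    for index_url in index_urls:
        for link in get_talks(index_url):
            if link in seen:
                continue  # the same talk can appear on more than one index page
            seen.add(link)
            info = get_talk_info(link)
            conn.execute(
                "INSERT INTO TED (speaker, talk_name, talk_link) VALUES (?, ?, ?)",
                (info["speaker"], info["talk_name"], link),
            )
            time.sleep(delay)  # pause between requests to go easy on the server
        conn.commit()  # commit once per index page
    return len(seen)

# Stand-ins for get_talks and a per-talk scraper (both return made-up data)
def fake_get_talks(index_url):
    page = index_url.rsplit("=", 1)[-1]
    return [
        f"https://www.ted.com/talks/example_{page}",
        "https://www.ted.com/talks/example_shared",
    ]

def fake_get_talk_info(link):
    return {"speaker": "Example Speaker", "talk_name": link.rsplit("/", 1)[-1]}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TED (ID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "speaker CHAR, talk_name CHAR, talk_link TEXT)"
)
# Assumed pagination scheme for the index pages
index_urls = [f"https://www.ted.com/talks?page={n}" for n in (1, 2)]
saved = crawl(index_urls, fake_get_talks, fake_get_talk_info, conn, delay=0)
names = [row[0] for row in conn.execute("SELECT talk_name FROM TED ORDER BY ID")]
print(saved, names)
```

Passing the scraping functions in as arguments keeps the loop itself easy to test; in the real script they would be get_talks and whatever function parses a single talk page.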

todo

Speed things up with multiprocessing.
Learn Scrapy.
Handle logins and get around anti-scraping measures.
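
For the multiprocessing item, one possible shape is a worker pool that maps a per-talk function over the list of links. The sketch below is an assumption about how that might be done, not something the original script contains: scrape_talk is a stand-in that only derives a slug from the link, and scrape_all and the pool size of 4 are illustrative choices.

```python
from multiprocessing import Pool

def scrape_talk(link):
    """Stand-in for the per-talk work; a real worker would fetch and parse the page."""
    slug = link.rsplit("/", 1)[-1]
    return {"talk_link": link, "slug": slug}

def scrape_all(links, processes=4):
    """Scrape links in parallel worker processes; results keep the input order."""
    with Pool(processes=processes) as pool:
        return pool.map(scrape_talk, links)

if __name__ == "__main__":
    # Made-up links for illustration
    links = [f"https://www.ted.com/talks/example_{n}" for n in range(8)]
    for row in scrape_all(links):
        print(row["slug"])
```

The workers only return data, and the parent process would do all the database writes. That avoids sharing one SQLite connection between processes. The `if __name__ == "__main__":` guard is needed on platforms where worker processes re-import the main module.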

[Spider 1] TED Talks