Keywords:
crawler urllib3 BeautifulSoup4
Idea:
I've written crawlers in Python before using urllib. Looking around now, there's urllib3, whose API is simpler and whose performance may be better, so I'll use that. For parsing the pages I'll stick with BeautifulSoup4, which I've also used before.
Process:
1. First, try urllib3 to fetch the Douyu category page
pip install urllib3
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', "https://www.douyu.com/directory")
plain_text = r.data.decode("utf-8")

# Save the page locally so it can be inspected in a browser
with open("content.html", "w", encoding='utf-8') as file:
    file.write(plain_text)
content.html was generated, and opening it in Chrome shows real content, so this part works. Once the crawler actually starts running regularly, Douyu may ban my IP; I'll deal with that if it happens.
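If blocking does happen, one low-effort first step is sending browser-like request headers. A minimal sketch (the User-Agent string below is just an example, and headers alone may not be enough):

import urllib3

http = urllib3.PoolManager()
# Pretend to be a normal browser; the exact UA string here is arbitrary
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = http.request('GET', "https://www.douyu.com/directory", headers=headers)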
2. Get the category info
Open https://www.douyu.com/directory in Chrome and press F12 to bring up the source. Find the block of markup containing the categories.
Use BeautifulSoup to extract that part:
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

def getClassify():
    r = http.request('GET', "https://www.douyu.com/directory")
    plain_text = r.data.decode("utf-8")
    # Keep a local copy of the page for debugging
    with open("content.html", "w", encoding='utf-8') as file:
        file.write(plain_text)
    soup = BeautifulSoup(plain_text, "html5lib")
    # Each category sits in an element with class 'layout-Classify-item'
    classify_list = soup.findAll(attrs={'class': 'layout-Classify-item'})
    for classify in classify_list:
        link_info = classify.find('a')
        link = link_info.get('href')
        name_info = classify.find('strong')
        classify_name = name_info.text
        print(classify_name + ":" + link)

getClassify()
The category info prints as expected, though the first few entries are empty. Looking into it, the "recommended" section uses the same layout-Classify-item class, and without logging in the recommendations are empty. Not a big deal; I'll handle it properly later.
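For reference, a small guard in the loop would already skip those empty entries (a sketch, assuming an entry is only useful when both the name and the link are non-empty):

for classify in classify_list:
    link_info = classify.find('a')
    name_info = classify.find('strong')
    # Skip the not-logged-in "recommended" placeholders with missing data
    if link_info is None or name_info is None:
        continue
    link = link_info.get('href')
    classify_name = name_info.text.strip()
    if not classify_name or not link:
        continue
    print(classify_name + ":" + link)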
3. Get the streamers under the League of Legends category
Open https://www.douyu.com/g_LOL in Chrome and inspect the streamer info.
Again, use bs4 to pull out the relevant parts:
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

def getLOL():
    r = http.request('GET', "https://www.douyu.com/g_LOL")
    plain_text = r.data.decode("utf-8")
    # Keep a local copy of the page for debugging
    with open("content_lol.html", "w", encoding='utf-8') as file:
        file.write(plain_text)
    soup = BeautifulSoup(plain_text, "html5lib")
    # Each room card is an <li> with class 'layout-Cover-item'
    room_list = soup.findAll('li', {'class': 'layout-Cover-item'})
    for room in room_list:
        link_info = room.find('a')
        link = link_info.get('href')
        name_info = room.find(attrs={'class': 'DyListCover-user'})
        user_name = name_info.text
        print(user_name + ":" + link)

getLOL()
The streamer info prints, but only for the first page.
I'll handle page navigation next time.
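As a note for next time: if the list pages turn out to be reachable through a plain query parameter, the loop might look roughly like the sketch below. The 'page' parameter is purely hypothetical here; the site may well load further pages via JavaScript instead, in which case this won't work as-is.

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

def getLOLPage(page):
    # The 'page' query parameter is a guess, not confirmed against the real site
    r = http.request('GET', "https://www.douyu.com/g_LOL", fields={'page': str(page)})
    return BeautifulSoup(r.data.decode("utf-8"), "html5lib")

for page in range(1, 4):  # first three pages, as an example
    soup = getLOLPage(page)
    print(len(soup.findAll('li', {'class': 'layout-Cover-item'})))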