Goal: scrape person relationships from Baidu Baike and store them as triples.
- Import the required packages. We use requests to fetch the page source and XPath (via lxml) to parse it.
import requests
from lxml import etree
import pandas as pd
import time
import random
- Disguise yourself with proxy IPs. The idea is to switch to a random IP for each page, since Baidu has anti-scraping measures. I grabbed five IPs from Kuaidaili; with only about 1,000 pages to crawl in total, that is enough (a quick check that the proxies actually work is sketched after the list below).
proxy_list = [
{"http" : "124.88.67.81:80"},
{"http" : "124.88.67.81:80"},
{"http" : "124.88.67.81:80"},
{"http" : "124.88.67.81:80"},
{"http" : "124.88.67.81:80"}
]
Kuaidaili: https://www.kuaidaili.com/free/ The free ones are fine. Find five IPs yourself and replace the value after "http" in each entry (the five entries above are just placeholders).
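Before kicking off the crawl it can help to verify that the free proxies actually respond. This is a minimal check of my own, not part of the original script, and it assumes https://httpbin.org/ip is reachable as a test endpoint:
import random
import requests

def check_proxies(proxies, test_url='https://httpbin.org/ip', timeout=5):
    # return only the proxies that can complete a simple GET within the timeout
    alive = []
    for proxy in proxies:
        try:
            if requests.get(test_url, proxies=proxy, timeout=timeout).status_code == 200:
                alive.append(proxy)
        except requests.RequestException:
            pass  # dead or banned proxy, skip it
    return alive

proxy_list = check_proxies(proxy_list) or proxy_list  # fall back to the full list if none pass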
- Set the request headers (rotating the request headers frequently is another anti-scraping countermeasure, but I was lazy this time and stuck with just this one; a simple rotation sketch follows the headers below).
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
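If you do want to rotate request headers as well, a minimal sketch is to keep a small pool of User-Agent strings and pick one per request; the pool below is illustrative and not taken from the original post:
import random

user_agent_pool = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
]

def random_headers():
    # pick a different User-Agent for every request
    return {'User-Agent': random.choice(user_agent_pool)}

# usage: requests.get(url, headers=random_headers(), proxies=random.choice(proxy_list))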
- Get the list of all artists to scrape
I use the artist index from Artron (雅昌):
https://amma.artron.net/artronindex_artist.php
url = 'https://amma.artron.net/artronindex_artist.php'
page_text = requests.get(url=url,headers=headers,proxies=random.choice(proxy_list)).content.decode('utf-8')
#parse the page into an XPath-queryable tree
tree = etree.HTML(page_text)
artlist = tree.xpath('//div[@class="sub-Aritst-Area"]/dl//li/a/text()')
artlist_nn=[]
Here I first fetch the page source with requests and then parse it with XPath. Below is a short cheat sheet of XPath usage (a runnable demo follows the list).
- Select all nodes
- //*
- Select all li tags
- //li
- Select child nodes
- Use / for direct children and // for descendants at any depth
- All direct a children of li nodes
- //li/a
- All descendant a nodes under ul
- //ul//a
- Select a parent node's attribute
- Go from a known child up to its parent
- //div[@class="filter-wrap"]/../@class
- //div[@class="filter-wrap"]/parent::*/@class
- Attribute-based selection
- Find every div whose class attribute is "song"
- //div[@class="song"]
- Hierarchy & index selection
- Find the direct child a under the second li of the ul that is a direct child of the div whose class is "tang"
- //div[@class="tang"]/ul/li[2]/a
- Multi-attribute matching
- Find every a tag whose href attribute is empty and whose class attribute is "song"
- //a[@href="" and @class="song"]
- Fuzzy matching
- Find every div whose class attribute contains "so"
- //div[contains(@class,"so")]
- Find every div whose class attribute starts with "ta"
- //div[starts-with(@class,"ta")]
- Getting text
- /text() returns the text directly inside a tag
- //text() returns the text inside a tag and inside all of its descendants
- //div[@class="song"]/p[1]/text()
- //div[@class="tang"]//text()
- Getting attributes
- //div[@class="tang"]//li[2]/a/@href
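To try the cheat sheet above, here is a small self-contained demo that runs the same kinds of expressions against an inline HTML snippet; the markup and class names are made up for illustration and are not the Artron or Baidu Baike pages:
from lxml import etree

demo_html = '''
<div class="tang">
  <ul>
    <li><a href="http://a.example">first</a></li>
    <li><a href="http://b.example">second</a></li>
  </ul>
</div>
<div class="song"><p>hello</p><a href="" class="song">empty link</a></div>
'''
demo_tree = etree.HTML(demo_html)

print(demo_tree.xpath('//li/a/text()'))                        # direct a children of li -> ['first', 'second']
print(demo_tree.xpath('//div[@class="tang"]/ul/li[2]/a/@href'))  # hierarchy + index -> ['http://b.example']
print(demo_tree.xpath('//a[@href="" and @class="song"]/text()')) # multi-attribute match -> ['empty link']
print(demo_tree.xpath('//div[contains(@class,"so")]//text()'))   # fuzzy match, text of all descendants
print(demo_tree.xpath('//div[@class="tang"]/ul/../@class'))      # parent attribute -> ['tang']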
Let me also recommend a handy tool: the XPath Helper extension for Chrome. Hover over the content you want while holding Shift and it generates the XPath expression automatically; you can also type your own expression into it to check whether it matches.
- Now all the artists to scrape are in artlist; next we go scrape Baidu Baike (artlist_nn will end up holding the subset that actually has a relationship module on its page).
# running counter of artists with relation data, and the accumulated triple table
a = 0
dfz = pd.DataFrame(columns=['n', 'r', 'm'])
for i in artlist:
    try:
        #time.sleep(1)
        url = 'https://baike.baidu.com/item/' + i
        page_text = requests.get(url=url, headers=headers, proxies=random.choice(proxy_list)).content.decode('utf-8')
        # parse the page into an XPath-queryable tree
        tree = etree.HTML(page_text)
        # related person names and relation titles from the relationship module
        re = tree.xpath('//div[@class="lemma-relation-module viewport"]/ul/li/a/div/span[@class="name"]/text()')
        na = tree.xpath('//div[@class="lemma-relation-module viewport"]/ul/li/a/div/span[@class="title"]/text()')
        if len(re) != 0:
            artlist_nn.append(i)
            a = a + 1
            print(a)
            # one row per relation: n = current artist, r = related person, m = relation type
            df = pd.DataFrame()
            df['r'] = re
            df['m'] = na
            df['n'] = i
            dfz = pd.concat([dfz, df], axis=0, ignore_index=True)
            #df.to_csv('result/'+i+'.csv',encoding='utf-8')
            # write after every successful page so progress survives interruptions
            dfz.to_csv('result.csv', encoding='utf-8')
    except Exception:
        print('scrape failed:', i)
        continue
Note that some IPs in the proxy pool may get banned, which would normally crash the program, so I wrapped the request in try/except so the loop keeps running after a failed page.
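For reference, a slightly more defensive variant retries a failed page with a different random proxy a few times before giving up. This helper is my own sketch, not part of the original script:
import random
import requests

def fetch_with_retry(url, headers, proxies, retries=3, timeout=10):
    # try up to `retries` random proxies for one URL; return decoded HTML or None
    for _ in range(retries):
        try:
            resp = requests.get(url, headers=headers,
                                proxies=random.choice(proxies), timeout=timeout)
            if resp.status_code == 200:
                return resp.content.decode('utf-8')
        except requests.RequestException:
            continue  # banned or dead proxy, try another one
    return None

# usage inside the loop:
# page_text = fetch_with_retry('https://baike.baidu.com/item/' + i, headers, proxy_list)
# if page_text is None:
#     continue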
Below is the result table (one n / r / m triple per row).
The complete code is attached below:
import requests
from lxml import etree
import pandas as pd
import time
import random
proxy_list = [
{ "http": "http://113.195.23.2:9999" },
{ "http": "http://39.84.114.140:9999" },
{ "http": "http://110.243.7.29:9999" },
{ "http": "http://27.188.65.244:8060" }]
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
url = 'https://amma.artron.net/artronindex_artist.php'
page_text = requests.get(url=url,headers=headers,proxies=random.choice(proxy_list)).content.decode('utf-8')
#parse the page into an XPath-queryable tree
tree = etree.HTML(page_text)
artlist = tree.xpath('//div[@class="sub-Aritst-Area"]/dl//li/a/text()')
#https://baike.baidu.com/item/%E6%9D%8E%E5%8F%AF%E6%9F%93/331468?fr=aladdin
a=0
artlist_nn=[]
dfz=pd.DataFrame(columns=['n','r','m'])
for i in artlist:
    try:
        #time.sleep(1)
        url = 'https://baike.baidu.com/item/' + i
        page_text = requests.get(url=url, headers=headers, proxies=random.choice(proxy_list)).content.decode('utf-8')
        # parse the page into an XPath-queryable tree
        tree = etree.HTML(page_text)
        # related person names and relation titles from the relationship module
        re = tree.xpath('//div[@class="lemma-relation-module viewport"]/ul/li/a/div/span[@class="name"]/text()')
        na = tree.xpath('//div[@class="lemma-relation-module viewport"]/ul/li/a/div/span[@class="title"]/text()')
        if len(re) != 0:
            artlist_nn.append(i)
            a = a + 1
            print(a)
            # one row per relation: n = current artist, r = related person, m = relation type
            df = pd.DataFrame()
            df['r'] = re
            df['m'] = na
            df['n'] = i
            dfz = pd.concat([dfz, df], axis=0, ignore_index=True)
            #df.to_csv('result/'+i+'.csv',encoding='utf-8')
            # write after every successful page so progress survives interruptions
            dfz.to_csv('result.csv', encoding='utf-8')
    except Exception:
        print('scrape failed:', i)
        continue
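Finally, since the goal is to store the relationships as triples, result.csv can be read back and turned into (artist, relation, related person) tuples. A minimal sketch, assuming the n / r / m columns written by the loop above:
import pandas as pd

result = pd.read_csv('result.csv', index_col=0, encoding='utf-8')
# n = the artist whose page was scraped, m = relation type, r = the related person
triples = list(result[['n', 'm', 'r']].itertuples(index=False, name=None))
print(triples[:5])  # each tuple is (artist, relation, related person)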