Here I do a simple scrape of the jokes (段子) board on Jandan (jandan.net). Some posts on Jandan get blocked or hidden, so that case has to be handled separately.
Below is the straightforward way to deal with it; the full code is attached.
import requests
from lxml import etree
url = 'http://jandan.net/duan'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'jandan.net',
    'Pragma': 'no-cache',
    'Referer': 'http://jandan.net/qa',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36',
}
html = requests.get(url, headers=headers)
html.encoding = 'utf-8'  # decode the response as UTF-8 so the Chinese text reads correctly
root = etree.HTML(html.text)
result = root.xpath("//div[@class='row']")
for i in range(len(result)):
    author = result[i].xpath(".//div[@class='author']/strong/text()")
    text = result[i].xpath(".//div[@class='text']")[0]
    # blocked posts carry a p.bad_content notice; the real text sits in the second <p>
    if text.xpath("./p[@class='bad_content']"):
        text = result[i].xpath(".//div[@class='text']/p[2]/text()")
    else:
        text = result[i].xpath(".//div[@class='text']/p/text()")
    print('作者', author[0], '内容', text[0])
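
To make that blocked-post branch easier to follow, here is a minimal, self-contained sketch that runs the same if/else logic on a hand-written HTML snippet instead of a live request. The snippet, user names and text are invented purely to mimic the structure the code above assumes (a p tag with class bad_content followed by a second p holding the real content); the real page markup may differ.

from lxml import etree

# Invented snippet that mimics the assumed structure; not real page content.
sample = """
<div class="row">
  <div class="author"><strong>user_a</strong></div>
  <div class="text"><p>a normal joke</p></div>
</div>
<div class="row">
  <div class="author"><strong>user_b</strong></div>
  <div class="text">
    <p class="bad_content">this post has been hidden</p>
    <p>the joke text that was hidden</p>
  </div>
</div>
"""

root = etree.HTML(sample)
for row in root.xpath("//div[@class='row']"):
    text_div = row.xpath(".//div[@class='text']")[0]
    if text_div.xpath("./p[@class='bad_content']"):
        content = text_div.xpath("./p[2]/text()")  # skip the notice, take the second <p>
    else:
        content = text_div.xpath("./p/text()")
    print(content[0])
# prints "a normal joke" and then "the joke text that was hidden"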
To explain the XPath .//div[@class='author']/strong/text() used above: starting from the current div whose class is row, it finds the descendant div whose class is author, then goes into its strong tag and returns the text inside that tag.
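
As a quick illustration of that expression on its own (the snippet and the name some_user are made up for the example):

from lxml import etree

snippet = '<div class="row"><div class="author"><strong>some_user</strong></div></div>'
row = etree.HTML(snippet).xpath("//div[@class='row']")[0]
# relative lookup starting from the row element, exactly like inside the loop above
print(row.xpath(".//div[@class='author']/strong/text()"))  # ['some_user']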