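- Learn XPath, and use lxml + XPath to extract content.
- Use XPath to extract the reply content from a DXY (丁香园) forum thread.
- Direct link to the DXY thread: http://www.dxy.cn/bbs/thread/626626#626626 .
- Reference: https://blog.csdn.net/naonao77/article/details/88129994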
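1. Learning XPath
XPath is a language for locating information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions. (Tutorial: http://www.w3school.com.cn/xpath/index.asp)
Reference link: Parsing HTML with lxml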
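To make the idea of traversing elements and attributes concrete, here is a minimal sketch using lxml. The sample XML, its element names, and the variable names are made up for illustration and are not part of the original notes.

```python
from lxml import etree

# Hypothetical sample document, invented purely for illustration.
xml = b"""
<bookstore>
  <book category="web"><title lang="en">Learning XML</title><price>39.95</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
</bookstore>
""".strip()

root = etree.fromstring(xml)

# '//' searches the whole document; text() returns the text nodes.
titles = root.xpath('//book/title/text()')

# '@name' selects an attribute value instead of an element.
categories = root.xpath('//book/@category')

# A predicate in square brackets filters the matched elements.
cheap = root.xpath('//book[price < 30]/title/text()')

print(titles)
print(categories)
print(cheap)
```

The same three building blocks (a path, an attribute test, and a predicate) are what the forum scraper in the next section relies on.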
2. Using XPath to extract the reply content from the DXY forum
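Before running against the live site, it may help to see why the script calls `string(.)` on each post cell rather than `text()`. The snippet below is an offline sketch: the HTML is a hypothetical stand-in that only borrows the `postbody` class name from the script, and the real page structure may differ.

```python
from lxml import etree

# Hypothetical markup; only the class name is taken from the script below.
html = """
<table><tr>
  <td class="postbody">First line<br/>with a <b>bold</b> word</td>
</tr></table>
"""

tree = etree.HTML(html)
cell = tree.xpath('//td[@class="postbody"]')[0]

# text() returns only the text nodes that are direct children of the cell,
# so the text inside <b> is missing.
direct_text = cell.xpath('text()')

# string(.) concatenates every descendant text node into one string.
full_text = cell.xpath('string(.)')

print(direct_text)
print(full_text)
```

Because a forum reply usually contains nested tags such as line breaks, links, and bold text, `string(.)` is the more convenient way to get the whole reply as one string.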
import requests
from lxml import etree


def getItem():
    # Browser-like headers so the request resembles a normal page visit
    headers = {
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "text/html,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip,deflate,sdch",
        "Accept-Language": "zh-CN,zh;q=0.8"
    }
    url = 'http://www.dxy.cn/bbs/thread/626626#626626'
    response = requests.get(url, headers=headers)
    html = response.text
    tree = etree.HTML(html)
    # User names and post bodies are matched up by their order on the page
    user = tree.xpath('//div[@class="auth"]/a/text()')
    content = tree.xpath('//td[@class="postbody"]')
    # zip stops at the shorter list, so a count mismatch cannot raise IndexError
    for name, post in zip(user, content):
        # string(.) joins all text inside the cell, including nested tags
        print(name.strip() + ":" + post.xpath('string(.)').strip())


if __name__ == '__main__':
    getItem()
Output: