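Parse a local web page and extract each product's title, image URL, price, review count, and star rating.
The web page looks like this:
Code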
from bs4 import BeautifulSoup

# Raw string, so the backslashes in the Windows path are not read as escape sequences
with open(r'D:\宣宣\homework/index.html', 'r') as wb_data:
    soup = BeautifulSoup(wb_data, 'lxml')  # parse the page content

images = soup.select('body > div > div > div.col-md-9 > div > div > div > img')
titles = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
prices = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
reviews = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
# print(images, titles, prices, reviews, stars, sep='\n--------------\n')

for title, image, price, review, star in zip(titles, images, prices, reviews, stars):
    data = {
        'title': title.get_text(),   # extract the text
        'image': image.get('src'),   # extract the image URL; src is the attribute that holds it
        'price': price.get_text(),
        'review': review.get_text(),
        # count only the filled stars inside the rating paragraph
        'star': len(star.find_all("span", class_='glyphicon glyphicon-star'))
    }
    print(data)
'''
Selectors as copied from the browser, kept for reference:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > img
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p:nth-child(2) > span:nth-child(3)
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p.pull-right
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4.pull-right
'''
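The script above builds five parallel lists and pairs them with zip, which only lines up correctly if every product has all five fields. As a hedged alternative, the sketch below selects each product block once and queries inside it. The inline sample markup, the `thumbnail` class, and the function name `parse_cards` are assumptions made for illustration and are not taken from the actual index.html. It uses the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real index.html may be structured differently.
SAMPLE_HTML = """
<div class="col-md-9">
  <div class="thumbnail">
    <img src="img/pic_0000.jpg">
    <div class="caption">
      <h4 class="pull-right">$24.99</h4>
      <h4><a href="#">EarPod</a></h4>
    </div>
    <div class="ratings">
      <p class="pull-right">65 reviews</p>
      <p>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star-empty"></span>
      </p>
    </div>
  </div>
</div>
"""


def parse_cards(html):
    """Return one dict per product block (assumed to be div.thumbnail)."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for card in soup.select('div.thumbnail'):
        items.append({
            'title': card.select_one('div.caption > h4 > a').get_text(strip=True),
            'image': card.select_one('img').get('src'),
            'price': card.select_one('div.caption > h4.pull-right').get_text(strip=True),
            'review': card.select_one('div.ratings > p.pull-right').get_text(strip=True),
            # span.glyphicon-star matches filled stars only, not glyphicon-star-empty
            'star': len(card.select('div.ratings span.glyphicon-star')),
        })
    return items


for item in parse_cards(SAMPLE_HTML):
    print(item)
```

Because every field is looked up inside its own block, a product with a missing field fails visibly at that product instead of shifting the values of all the products after it.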
Output
Summary
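1. To scrape a web page with Python, you first need a basic understanding of the page itself: how to find, in the browser, the HTML behind a given image or piece of text. The useful information can then be extracted with the CSS selector copied from the browser.
2. For the star rating I use stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)'). The selector copied from the browser was body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p:nth-child(2) > span:nth-child(3). At first I did not remove the trailing span:nth-child(3), and the result was star = 0: that selector picks out a single star span, and find_all then searches inside that span, which contains no further spans. Only later did I understand that, to count all the stars, the selector has to stop at the parent tag, p:nth-child(2). nth-child itself raises an error here (older BeautifulSoup releases implement only nth-of-type), so it has to be changed to nth-of-type(2), which matches every element that is the second child of its type under its parent.
3. Making mistakes over and over, checking against the answer, and reading the documentation deepened my understanding of the code. Getting it to run successfully in the end was a real pleasure, and it keeps me motivated to go on learning.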
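The star-counting point in item 2 can be reproduced on a small example. This is a minimal sketch under stated assumptions: the ratings markup below is hypothetical, written to resemble the structure implied by the selectors, and 'html.parser' stands in for lxml so that the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical ratings block: four filled stars and one empty star.
RATINGS_HTML = """
<div class="ratings">
  <p class="pull-right">18 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""

soup = BeautifulSoup(RATINGS_HTML, 'html.parser')

# Stopping at the parent <p> keeps every star span available for counting.
star_p = soup.select('div.ratings > p:nth-of-type(2)')[0]
star_count = len(star_p.find_all('span', class_='glyphicon glyphicon-star'))

# Selecting a single span and searching inside it finds nothing,
# because the span has no descendant spans.
third_span = soup.select('div.ratings > p:nth-of-type(2) > span:nth-of-type(3)')[0]
count_inside_span = len(third_span.find_all('span', class_='glyphicon glyphicon-star'))

print(star_count)         # 4
print(count_inside_span)  # 0
```

Passing the full string 'glyphicon glyphicon-star' as class_ matches the exact value of the class attribute, so the empty star, whose class is 'glyphicon glyphicon-star-empty', is not counted.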