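Python Web Scraping in Practice: Day 1
Task
Scrape the product information from the website shown in the figure: product name, price, star rating, number of reviews, and image link.
Result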
{'name': 'EarPod', 'price': '$24.99', 'stars': 5, 'reviews': '65', 'imageurl': 'img/pic_0000_073a9256d9624c92a05dc680fc28865f.jpg'}
{'name': 'New Pocket', 'price': '$64.99', 'stars': 4, 'reviews': '12', 'imageurl': 'img/pic_0005_828148335519990171_c234285520ff.jpg'}
{'name': 'New sunglasses', 'price': '$74.99', 'stars': 4, 'reviews': '31', 'imageurl': 'img/pic_0006_949802399717918904_339a16e02268.jpg'}
{'name': 'Art Cup', 'price': '$84.99', 'stars': 3, 'reviews': '6', 'imageurl': 'img/pic_0008_975641865984412951_ade7a767cfc8.jpg'}
{'name': 'iphone gamepad', 'price': '$94.99', 'stars': 4, 'reviews': '18', 'imageurl': 'img/pic_0001_160243060888837960_1c3bcd26f5fe.jpg'}
{'name': 'Best Bed', 'price': '$214.5', 'stars': 4, 'reviews': '18', 'imageurl': 'img/pic_0002_556261037783915561_bf22b24b9e4e.jpg'}
{'name': 'iWatch', 'price': '$500', 'stars': 4, 'reviews': '35', 'imageurl': 'img/pic_0011_1032030741401174813_4e43d182fce7.jpg'}
{'name': 'Park tickets', 'price': '$15.5', 'stars': 4, 'reviews': '8', 'imageurl': 'img/pic_0010_1027323963916688311_09cc2d7648d9.jpg'}
Source code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
info = []
# Open the local page and parse it with BeautifulSoup
with open('/Documents/Code/Plan-for-combating-master/week1/1_2/1_2answer_of_homework/index.html', 'r') as wb_data:
    soup = BeautifulSoup(wb_data, 'lxml')
# The path copied from Chrome
# (body > div:nth-of-type(2) > div > div.col-md-9 > div:nth-of-type(2) > div > div > div.caption > h4:nth-of-type(2) > a)
# returned nothing here; the cause is still an open question. Using div:nth-of-type(1) for the first level works.
images = soup.select('body > div:nth-of-type(1) > div > div.col-md-9 > div:nth-of-type(2) > div > div > img')
names = soup.select(
    'body > div:nth-of-type(1) > div > div.col-md-9 > div:nth-of-type(2) > div > div > div.caption > h4:nth-of-type(2) > a')
prices = soup.select(
    'body > div:nth-of-type(1) > div > div.col-md-9 > div:nth-of-type(2) > div > div > div.caption > h4.pull-right')
reviews = soup.select(
    'body > div:nth-of-type(1) > div > div.col-md-9 > div:nth-of-type(2) > div > div > div.ratings > p.pull-right')
stars = soup.select(
    'body > div:nth-of-type(1) > div > div.col-md-9 > div:nth-of-type(2) > div > div > div.ratings > p:nth-of-type(2)')
print(reviews[0].get_text())
for image, name, price, review, star in zip(images, names, prices, reviews, stars):
    # `class` is a reserved word in Python, so passing it directly as a keyword
    # argument (class='glyphicon glyphicon-star') is a syntax error. Two workarounds:
    #   1. pass a dict through the attrs parameter of find_all():
    #      attrs={'class': 'glyphicon glyphicon-star'}
    #   2. use the class_ keyword: class_='glyphicon glyphicon-star'
    data = {
        'name': name.get_text(),
        'price': price.get_text(),
        # Count only the filled star icons
        'stars': len(star.find_all('span', class_='glyphicon glyphicon-star')),
        # '65 reviews' -> '65'; rstrip(' reviews') strips a set of characters,
        # not a suffix, so take the first whitespace-separated token instead
        'reviews': review.get_text().split()[0],
        'imageurl': image.get('src')
    }
    print(data)
    info.append(data)
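The two ways of filtering by class described in the loop comment can be compared side by side. The snippet below is a minimal sketch: the HTML string is hypothetical markup written to imitate one product's rating block, not the actual homework page, and it uses the built-in `html.parser` instead of `lxml` so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one product's rating block (an assumption,
# not copied from the real index.html)
html = """
<div class="ratings">
  <p class="pull-right">65 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Option 1: a dict passed through attrs
by_attrs = soup.find_all('span', attrs={'class': 'glyphicon glyphicon-star'})

# Option 2: the class_ keyword (trailing underscore avoids the reserved word)
by_class_ = soup.find_all('span', class_='glyphicon glyphicon-star')

# A CSS selector on the single class token is a third possibility
by_css = soup.select('span.glyphicon-star')

print(len(by_attrs), len(by_class_), len(by_css))
```

In this sample all three calls find the three filled stars and skip the two `glyphicon-star-empty` spans. Note that the two-class string is matched against the exact value of the `class` attribute, so it would miss a tag whose classes are written in a different order; the CSS form does not have that limitation.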
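Summary
- When something about BeautifulSoup is unclear, read the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#unicode-dammit
- The CSS selector copied from Chrome DevTools is not always correct. If select() returns nothing, debug the selector level by level to find which tag in the path fails to match.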
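The level-by-level debugging mentioned in the last point can be automated. The helper below is a rough sketch, not part of the original homework: `first_failing_level` is a hypothetical name, the sample HTML is made up, and splitting the selector on `>` is a simplification that only holds for plain child-combinator paths like the ones used above.

```python
from bs4 import BeautifulSoup


def first_failing_level(soup, selector):
    """Return (level, prefix) for the first selector prefix that matches
    nothing, or None if every prefix matches at least one element."""
    # Naive split: assumes the selector only uses the '>' child combinator
    steps = [step.strip() for step in selector.split('>')]
    for level in range(1, len(steps) + 1):
        prefix = ' > '.join(steps[:level])
        if not soup.select(prefix):
            return level, prefix
    return None


# Hypothetical page with a single top-level div under body
html = """
<html><body>
<div class="container"><div class="row"><div class="col-md-9">
<h4><a>EarPod</a></h4>
</div></div></div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

bad = 'body > div:nth-of-type(2) > div > div.col-md-9 > h4 > a'
good = 'body > div:nth-of-type(1) > div > div.col-md-9 > h4 > a'

print(first_failing_level(soup, bad))   # (2, 'body > div:nth-of-type(2)')
print(first_failing_level(soup, good))  # None
```

Running the helper on a selector copied from DevTools points directly at the first level that the parsed document does not contain, which is usually faster than trimming the selector by hand.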