1. Basics
Parsing a web page with BeautifulSoup:
Steps:
- Step 1: Parse the page
BeautifulSoup(html, 'lxml')
- Step 2: Describe where the content to scrape is located
Soup.select( )
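- Step 3: Extract the needed information from the tags
tag.get_text() / tag.get('src'), where tag is an element returned by Soup.select(...)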
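The three steps above can be sketched on a tiny inline page. This is only an illustration: the HTML string and the variable names are made up, and it uses the built-in 'html.parser' so that it runs without the lxml package (the notes themselves use 'lxml').

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only.
html = """
<div class="caption">
  <h4 class="pull-right">$24.99</h4>
  <h4><a href="#">EarPod</a></h4>
</div>
<img src="img/pic_0000.jpg">
"""

# Step 1: parse the page ('html.parser' ships with Python; 'lxml' also works if installed).
soup = BeautifulSoup(html, 'html.parser')

# Step 2: describe where the wanted elements are, using CSS selectors.
title_tags = soup.select('div.caption > h4 > a')
image_tags = soup.select('img')

# Step 3: pull the information out of the matched tags.
title_text = title_tags[0].get_text()   # text inside the <a> tag
image_src = image_tags[0].get('src')    # value of the src attribute

print(title_text, image_src)
```

select() always returns a list, even for a single match, which is why the sketch indexes with [0] before calling get_text() or get().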
2. Writing the program myself
- The Result:
- The Code:
from bs4 import BeautifulSoup

path = '/Users/huoqi/Documents/pythonlearn/combating/week1/1_2/homework1_2/1_2_homework_required/index.html'

# Parse the local HTML file; the with block closes the file afterwards
with open(path, 'r') as wb_data:
    #print(wb_data)
    Soup = BeautifulSoup(wb_data, 'lxml')
    #print(Soup)

# One CSS selector per field of a product card
images = Soup.select('body > div > div > div.col-md-9 > div > div > div > img')
titles = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
prices = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
views = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
stars = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
#print(images, titles, prices, views, stars)

# Combine the parallel lists into one dict per product
for image, title, price, view, star in zip(images, titles, prices, views, stars):
    data = {
        'image': image.get('src'),
        'title': title.get_text(),
        'price': price.get_text(),
        'view': view.get_text(),
        # The number of filled star icons is the rating
        'star': len(star.find_all('span', class_='glyphicon glyphicon-star')),
    }
    print(data)
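One thing to watch in the loop above: zip() stops at the shortest input, so if one selector misses an element, the remaining products are dropped without any error. A minimal sketch of that behaviour, using made-up lists in place of the real Soup.select(...) results:

```python
# Hypothetical lists standing in for the results of Soup.select(...).
titles = ['A', 'B', 'C']
prices = ['$1', '$2']  # pretend the price selector missed one element

# zip() pairs items up only while every list still has items left.
pairs = list(zip(titles, prices))
print(pairs)  # the third title does not appear

# A cheap guard: compare the lengths before zipping.
lengths_match = len(titles) == len(prices)
print(lengths_match)
```

Printing the length of each list before the loop is a quick way to check that every selector matched the same number of elements.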
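3. Reflection and summary
- The len() function returns the number of elements in a list.
- Paths obtained with "Copy selector" should be compared across several elements to find the pattern they share.
- I do not yet understand how to modify these paths and am still thinking about it.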
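On the open question about modifying paths, one possible direction is to shorten a copied path to the part that carries class names and check that it still matches the same elements. The sketch below is only an experiment under assumptions: the HTML fragment is made up to mimic the nesting in the selectors above, 'html.parser' is used instead of 'lxml', and the star count uses the CSS selector span.glyphicon-star as an alternative to the find_all call in the notes.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, made up for illustration only.
html = """
<body><div><div><div class="col-md-9"><div><div><div>
  <div class="ratings">
    <p class="pull-right">65 reviews</p>
    <p>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star-empty"></span>
    </p>
  </div>
</div></div></div></div></div></div></body>
"""
soup = BeautifulSoup(html, 'html.parser')

# Full path as copied from the browser versus a shortened, class-based path.
long_path = 'body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right'
short_path = 'div.ratings > p.pull-right'

long_hits = soup.select(long_path)
short_hits = soup.select(short_path)
same = long_hits == short_hits  # do both paths find the same elements?
print(same, len(short_hits))

# len() on a list of matches gives the count, here the number of filled stars.
star_p = soup.select('div.ratings > p:nth-of-type(2)')[0]
star_count = len(star_p.select('span.glyphicon-star'))
print(star_count)
```

On a real page a shorter selector may match extra elements if the same class is reused elsewhere, so comparing len() of both results is a reasonable check before dropping the long path.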
KEEP FIGHTING!