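These are notes from my first small Python crawler demo, written down for later review.
A Python crawler's work can be split into two parts: 1. requesting the content of the target page; 2. organizing the content that comes back.
As usual, the notes for each step come first, followed by the complete source code.
I. Import the required libraries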
import requests
from lxml import etree
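II. Request the content of the target page
1. Set the request headers
- Some websites have anti-crawler mechanisms. Setting request headers that look like a browser's can get past some of these mechanisms, so the data is returned successfully.
- In most cases, set User-Agent and Referer. If these two are not enough to get past a site's anti-crawler mechanism, use Chrome's developer tools to see which headers a real browser sends, and set more of them.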
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Referer": "https://www.douban.com/"
}
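To see what a site would actually receive, the headers can be inspected without sending anything. The following is a minimal sketch, assuming `Session.prepare_request` from the `requests` library is used to build the request offline; the variable names `session`, `request`, and `prepared` are illustrative and not part of the original notes.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Referer": "https://www.douban.com/",
}

# Build the request without sending it, so no network access is needed.
session = requests.Session()
request = requests.Request(
    "GET",
    "https://movie.douban.com/cinema/nowplaying/shanghai/",
    headers=headers,
)
# prepare_request merges the session's default headers with the custom ones.
prepared = session.prepare_request(request)

# Print every header that would go out with this request.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Headers passed explicitly replace the session defaults of the same name, so the printed User-Agent should be the browser string rather than the default `python-requests/...` value.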
2. Request the page content
douban_url = "https://movie.douban.com/cinema/nowplaying/shanghai/"
response = requests.get(douban_url, headers=headers)
douban_text = response.text
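The request above assumes the server answers with a normal page. A slightly more defensive version might look like the sketch below. `fetch_html` is a hypothetical helper, not something from the original notes, and the 10-second timeout is an arbitrary assumption. The optional `session` argument only needs a `get` method compatible with `requests`, which makes the function easy to try out without network access.

```python
import requests


def fetch_html(url, headers, session=None, timeout=10):
    """Return the page text, raising an exception on HTTP error statuses."""
    # Fall back to a real requests.Session when no session is supplied.
    if session is None:
        session = requests.Session()
    # A timeout keeps the call from hanging indefinitely on a stalled server.
    response = session.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out):
# douban_text = fetch_html(douban_url, headers)
```

Checking the status before parsing means a blocked request shows up as an exception here, rather than as a confusing empty result in the parsing step.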
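III. Organize the requested content
- This step mainly uses lxml together with XPath syntax: each movie's information is collected into a dictionary, and all the movies are stored in a list.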
# Parse the HTML text into an element tree
html_element = etree.HTML(douban_text)
# Take the first <ul class="lists"> on the page
ul = html_element.xpath('//ul[@class="lists"]')[0]
lis = ul.xpath('./li')
movies = []
for li in lis:
    # Each movie's details are read from the data-* attributes of its <li>
    title = li.xpath('./@data-title')[0]
    score = li.xpath('./@data-score')[0]
    star = li.xpath('./@data-star')[0]
    duration = li.xpath('./@data-duration')[0]
    region = li.xpath('./@data-region')[0]
    director = li.xpath('./@data-director')[0]
    actors = li.xpath('./@data-actors')[0]
    # The poster URL is the src of the <img> inside the <li>
    post = li.xpath('.//img/@src')[0]
    movie = {
        "title": title,
        "score": score,
        "star": star,
        "duration": duration,
        "region": region,
        "director": director,
        "actors": actors,
        "post": post
    }
    movies.append(movie)
for movie in movies:
    print(movie)
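The XPath logic can be tried without touching the network. The sketch below runs the same kind of extraction on a small hand-written HTML snippet. The snippet is made up and only mimics the structure the XPath expressions above expect, and `parse_movies` is a hypothetical helper name. It reads attributes with lxml's `Element.get` and a default value, which is an alternative to indexing the XPath result with `[0]`: a missing attribute then yields an empty string instead of raising `IndexError`.

```python
from lxml import etree

# Made-up sample data that imitates the assumed page structure.
sample_html = """
<ul class="lists">
  <li data-title="Movie A" data-score="8.1" data-region="China">
    <img src="https://example.com/a.jpg">
  </li>
  <li data-title="Movie B" data-score="0">
  </li>
</ul>
"""


def parse_movies(html_text):
    """Collect one dictionary per <li> under <ul class="lists">."""
    root = etree.HTML(html_text)
    movies = []
    for li in root.xpath('//ul[@class="lists"]/li'):
        # xpath() returns a list, which is empty when no <img> is present.
        posters = li.xpath('.//img/@src')
        movies.append({
            # Element.get returns the default when the attribute is absent.
            "title": li.get("data-title", ""),
            "score": li.get("data-score", ""),
            "region": li.get("data-region", ""),
            "post": posters[0] if posters else "",
        })
    return movies


movies = parse_movies(sample_html)
for movie in movies:
    print(movie)
```

The second sample item deliberately lacks a region and a poster, to show that incomplete entries are kept with empty values rather than stopping the loop.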
IV. Complete code
# Import the required libraries
import requests
from lxml import etree
# 1. Request the content of the target page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Referer": "https://www.douban.com/"
}
douban_url = "https://movie.douban.com/cinema/nowplaying/shanghai/"
response = requests.get(douban_url, headers=headers)
douban_text = response.text
# 2. Process the fetched data
html_element = etree.HTML(douban_text)
ul = html_element.xpath('//ul[@class="lists"]')[0]
lis = ul.xpath('./li')
movies = []
for li in lis:
    # Each movie's details are read from the data-* attributes of its <li>
    title = li.xpath('./@data-title')[0]
    score = li.xpath('./@data-score')[0]
    star = li.xpath('./@data-star')[0]
    duration = li.xpath('./@data-duration')[0]
    region = li.xpath('./@data-region')[0]
    director = li.xpath('./@data-director')[0]
    actors = li.xpath('./@data-actors')[0]
    # The poster URL is the src of the <img> inside the <li>
    post = li.xpath('.//img/@src')[0]
    movie = {
        "title": title,
        "score": score,
        "star": star,
        "duration": duration,
        "region": region,
        "director": director,
        "actors": actors,
        "post": post
    }
    movies.append(movie)
for movie in movies:
    print(movie)