I haven't written many crawlers lately, mainly because I wasn't sure how to put the scraped data to good use. Today I'm posting a simple one.
import requests
import pymongo
import time
from urllib.parse import quote
client = pymongo.MongoClient('localhost', 27017)
douban = client['douban']
movie = douban['movie']
tag_list = ['热门', '最新', '经典', '可播放', '豆瓣高分', '冷门佳片', '华语', '欧美', '韩国', '日本', '动作',
            '喜剧', '爱情', '科幻', '悬疑', '恐怖', '成长']
url_list = ['https://movie.douban.com/j/search_subjects?type=movie&tag={}&'
            'sort=recommend&page_limit=20&page_start={}'.format(quote(tag), page)
            for tag in tag_list for page in range(0, 500, 20)]
# Implicit string concatenation instead of a backslash continuation, which
# would otherwise embed the next line's leading spaces into the UA string.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
def get_item(url):
    r = requests.get(url, headers=headers)
    wb_data = r.json()
    # 'subjects' comes back as an empty list once page_start runs past
    # the last page for a tag, so the loop simply does nothing then.
    for value in wb_data.get('subjects', []):
        data = {
            'title': value['title'],
            'id': value['id'],
            'url': value['url'],
            'images': value['cover'],
            'rate': value['rate']
        }
        movie.insert_one(data)
for url in url_list:
    get_item(url)
    time.sleep(1)  # throttle to one request per second
print(movie.count_documents({}))
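The `quote` call in the `url_list` comprehension is what lets the Chinese tag names go into the query string: they have to be percent-encoded as UTF-8 bytes. A quick check:

```python
from urllib.parse import quote

# Chinese tag names must be percent-encoded (UTF-8 bytes)
# before they can appear in a URL query string.
print(quote('热门'))  # %E7%83%AD%E9%97%A8
```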
The crawl only yields a few thousand records, and some of them are duplicates. Plenty of shortcomings, so I'll keep learning.
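The duplicates come from the tags overlapping: the same movie appears under several tags, and `insert_one` stores it once per appearance. One way to avoid that would be to upsert on the Douban movie id instead; a minimal sketch (`save_movie` is a hypothetical helper, not part of the script above):

```python
def save_movie(collection, data):
    # Upsert keyed on the Douban id: when the same movie shows up
    # under another tag (or on a re-crawl), this updates the existing
    # document instead of inserting a duplicate.
    collection.update_one({'id': data['id']}, {'$set': data}, upsert=True)
```

Replacing the `movie.insert_one(data)` call with `save_movie(movie, data)` would then keep the collection free of duplicate ids.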