Step 1: Create the scraping project
1. Change into your Desktop folder
cd Desktop
2. Create the Scrapy project (named imovie, to match the imports used later)
scrapy startproject imovie
3. Generate the spider; name it movie
cd imovie
scrapy genspider movie www.dytt8.net
4. Adjust settings.py
Change the User-Agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'
Ignore the robots.txt protocol:
ROBOTSTXT_OBEY = False
Enable the item pipeline:
ITEM_PIPELINES = {
    'imovie.pipelines.ImoviePipeline': 300,
}
Step 2: Initialize the spider
1. In the spider, fill in the site to be crawled (http://****.com) and the page to start from (http://****.com.index.html):
allowed_domains = ['www.dytt8.net']
start_urls = ['https://www.dytt8.net/html/gndy/dyzz/index.html']
2. Define the structured data type for the scraped content in items.py:
import scrapy

class ImovieItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
Step 3: Write the crawling rules
Inspect the site: we need each movie's title, date, and detail-page URL (the URL lets us crawl deeper later).
The page structure maps to these rules (XPath):
Each movie entry: //table
Title: .//a/text()
Date: .//td[@style='padding-left:3px']/font/text()
URL: domain + .//a/@href (the href is relative, so it must be made absolute)
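Rather than concatenating the domain by hand, a relative href can be resolved against the page URL with the standard library's urljoin (a minimal sketch; the href value below is illustrative, not taken from the site):

```python
from urllib.parse import urljoin

# The page we are parsing, and a relative href extracted from it.
base = "https://www.dytt8.net/html/gndy/dyzz/index.html"
relative_href = "/html/gndy/dyzz/20210101/12345.html"  # illustrative value

# urljoin resolves the href against the base URL's scheme and host.
detail_url = urljoin(base, relative_href)
print(detail_url)  # https://www.dytt8.net/html/gndy/dyzz/20210101/12345.html
```

Inside a Scrapy callback, `response.urljoin(href)` does the same resolution against the current response's URL.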
Step 4: Automatic pagination
1. Check whether a "next page" link exists:
if response.xpath("//a[text()='下一页']"):
2. Extract the next page's address (XPath):
//a[text()='下一页']/@href
3. Follow it (make_requests_from_url is deprecated in recent Scrapy, so issue a Request directly):
yield scrapy.Request(next_page, callback=self.parse)
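The follow-until-no-next-page pattern can be illustrated without Scrapy or the network; here a dict of fake pages stands in for HTTP responses (all names and data are illustrative):

```python
# Illustrative stand-in for the site: each page lists items and may link to a next page.
pages = {
    "index.html":   {"items": ["Movie A", "Movie B"], "next": "index_2.html"},
    "index_2.html": {"items": ["Movie C"],            "next": None},
}

def crawl(start):
    """Yield items page by page, following 'next' links until none remain."""
    url = start
    while url is not None:
        page = pages[url]
        yield from page["items"]
        url = page["next"]  # analogous to yielding a Request for the next page

print(list(crawl("index.html")))  # ['Movie A', 'Movie B', 'Movie C']
```

In the real spider, "yield the next item" becomes `yield item` and "move to the next page" becomes yielding a new `scrapy.Request` whose callback is `parse` again.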
Step 5: Store results in the database
1. import sqlite3
2. SQL to create the table:
create table if not exists movies (title text, date text, url text);
3. SQL to insert a row (parameterized, filled from the item's fields):
cur.execute("insert into movies (title, date, url) values (?, ?, ?);",
            (item["title"], item["date"], item["url"]))
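The table creation and parameterized insert can be tried end to end against an in-memory database before wiring them into the pipeline (the sample row is illustrative):

```python
import sqlite3

# In-memory DB for a quick check; the pipeline itself uses data.sqlite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table if not exists movies (title text, date text, url text);")
# The ? placeholders let sqlite3 bind and escape the values safely.
cur.execute("insert into movies (title, date, url) values (?, ?, ?);",
            ("Sample Movie", "2021-01-01", "https://www.dytt8.net/html/sample.html"))
conn.commit()
print(cur.execute("select count(*) from movies;").fetchone()[0])  # 1
```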
4. Verify the database (optional): parse_sqlite.py
import sqlite3
import pandas as pd
conn = sqlite3.connect("data.sqlite")
df = pd.read_sql_query("select * from movies limit 5;", conn)
print(df)
Step 6: Run
scrapy crawl movie
Result:
Everything is saved in the database, ready for the next stage of processing.
Full code:
# movie.py
# -*- coding: utf-8 -*-
import scrapy
from imovie.items import ImovieItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['www.dytt8.net']
    start_urls = ['https://www.dytt8.net/html/gndy/dyzz/index.html']

    def parse(self, response):
        for table in response.xpath("//table"):
            # A fresh item per table, so earlier rows are not overwritten.
            item = ImovieItem()
            try:
                item["title"] = table.xpath(".//a/text()").extract_first()
                item["date"] = table.xpath(
                    ".//td[@style='padding-left:3px']/font/text()"
                ).extract_first().split()[0]
                item["url"] = response.urljoin(table.xpath(".//a/@href").extract_first())
            except AttributeError:
                # Tables without movie data lack these nodes; skip them.
                continue
            yield item
        # Follow the "下一页" (next page) link, if present.
        next_href = response.xpath("//a[text()='下一页']/@href").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)
# items.py
import scrapy

class ImovieItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
# pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3

class ImoviePipeline(object):
    def __init__(self):
        self.conn = sqlite3.connect("data.sqlite")
        cur = self.conn.cursor()
        cur.execute("create table if not exists movies (title text, date text, url text);")
        cur.close()

    def process_item(self, item, spider):
        cur = self.conn.cursor()
        cur.execute("insert into movies (title, date, url) values (?, ?, ?);",
                    (item["title"], item["date"], item["url"]))
        self.conn.commit()
        print("Insert succeeded!")
        cur.close()
        return item

    def close_spider(self, spider):
        # Release the connection when the spider finishes.
        self.conn.close()
# parse_sqlite.py
import sqlite3
import pandas as pd
conn = sqlite3.connect("data.sqlite")
df = pd.read_sql_query("select * from movies;", conn)
print(df)