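My girlfriend has been watching movies every night lately ╮(╯▽╰)╭, leaving me to entertain myself. Hmph, it's only movies, right? I'll give you a whole library of them!
No sooner said than done. A few days ago, with guidance from my friend Cheng (程老哥), I finally understood how data gets passed along when crawling pages that are nested several levels deep. The site I picked today, Sunshine Movies (阳光电影网), has exactly that kind of structure: http://www.ygdy8.com/ and I'm going with my favorite category, Western movies (欧美).
The start page looks like this: http://www.ygdy8.com/html/gndy/oumei/list_7_1.html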
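The number in `list_7_1.html` appears to be the page index, and the spider in this post relies on that for pages up to 163. As a rough sketch of that URL pattern using only the standard library (the helper name `list_page_url` and the sample href are made up for illustration), the list-page URLs can be generated and a site-relative link turned into an absolute one like this:

```python
from urllib.parse import urljoin

BASE = 'http://www.ygdy8.com'
LIST_PATTERN = BASE + '/html/gndy/oumei/list_7_%d.html'


def list_page_url(page):
    """Return the URL of the given list page (hypothetical helper)."""
    return LIST_PATTERN % page


# Pages 1..163, the range the spider in this post walks through.
list_urls = [list_page_url(i) for i in range(1, 164)]
print(list_urls[0])
print(len(list_urls))

# A made-up site-relative href, only to show how urljoin resolves it
# against the page it was found on.
sample_href = '/html/gndy/dyzz/sample.html'
absolute_link = urljoin(list_urls[0], sample_href)
print(absolute_link)
```

Resolving links against the page URL keeps the `http://` scheme and host in one place, which matters because a request URL without a scheme is rejected.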
Enough talk, here's the code first:
```python
# -*- coding: utf-8 -*-
import scrapy

from yangguang.items import YangguangItem


class Ygdy8ComSpider(scrapy.Spider):
    # A plain Spider is used here: CrawlSpider uses parse() internally for
    # its rules, so overriding parse() on a CrawlSpider is discouraged.
    name = "ygdy8.com"
    allowed_domains = ["ygdy8.com"]
    # List pages 1..163 of the Western movies category.
    start_urls = [
        'http://www.ygdy8.com/html/gndy/oumei/list_7_%d.html' % i
        for i in range(1, 164)
    ]

    def parse(self, response):
        """Parse one list page and follow every movie link on it."""
        self.logger.info('Parsing %s', response.url)
        infos = response.xpath('//table[@border="0"]/tr[2]/td[2]/b/a[2]')
        for info in infos:
            item = YangguangItem()
            next_page_link = info.xpath('@href')[0].extract()
            item['next_page_name'] = info.xpath('text()')[0].extract()
            # The href is site-relative. The request URL must be absolute and
            # include http://, otherwise Scrapy raises an error.
            item['full_page_link'] = response.urljoin(next_page_link)
            # As usual, hand the half-filled item to the detail-page
            # callback through meta.
            yield scrapy.Request(url=item['full_page_link'],
                                 meta={'item_1': item},
                                 callback=self.parse_page)

    def parse_page(self, response):
        """Parse a detail page and complete the item passed down via meta."""
        item = response.meta['item_1']
        item['dy_link'] = response.xpath(
            '//table[@border="0"]/tbody/tr/td/a/@href').extract()
        self.logger.info(item)
        yield item
```
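To make the hand-off between the two callbacks easier to see, here is a toy model of it in plain Python. It does not use Scrapy: `ToyRequest`, `ToyResponse`, `run`, and the example.com URLs are all invented for illustration, and the real engine does far more (downloading, deduplication, scheduling). The only point it tries to show is that whatever is put into `meta` on a request is available again on the response handed to that request's callback.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyRequest:
    """Stand-in for a request: a URL, a callback, and a meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class ToyResponse:
    """Stand-in for a response: it exposes the meta of its request."""
    url: str
    meta: dict


def parse_list(response):
    # Level 1: record what the list page knows, then ask for the detail page.
    item = {'next_page_name': 'Some Movie',
            'full_page_link': 'http://example.com/detail/1.html'}
    yield ToyRequest(item['full_page_link'], parse_detail,
                     meta={'item_1': item})


def parse_detail(response):
    # Level 2: pick the item back up from meta and finish filling it in.
    item = response.meta['item_1']
    item['dy_link'] = ['ftp://example.com/some_movie.mkv']
    yield item


def run(start_url, first_callback):
    """Tiny scheduler loop: pass each request's meta on to its callback."""
    queue = [ToyRequest(start_url, first_callback)]
    finished = []
    while queue:
        request = queue.pop(0)
        response = ToyResponse(request.url, request.meta)
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                queue.append(result)      # another page to visit
            else:
                finished.append(result)   # a completed item
    return finished


items = run('http://example.com/list/1.html', parse_list)
print(items)
```

The item that leaves `parse_detail` carries the fields set on the list page plus the one added on the detail page, which is the same shape the spider's final item has.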
items
```python
import scrapy


class YangguangItem(scrapy.Item):
    # Fields filled in by the spider.
    full_page_link = scrapy.Field()   # absolute URL of the movie's detail page
    dy_link = scrapy.Field()          # links extracted from the detail page
    next_page_name = scrapy.Field()   # movie title shown on the list page
```
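pipelines
```python
import pymysql


def dbHandle():
    """Open a new connection to the local MySQL server."""
    conn = pymysql.connect(
        host="localhost",
        user="root",
        password="your_password",  # replace with your own password
        charset="utf8",
        use_unicode=False
    )
    return conn


class YangguangPipeline(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        sql = ("insert into ygdy.dy(dy_link, next_page_name, full_page_link) "
               "values (%s, %s, %s)")
        try:
            # dy_link is a list of hrefs; join it so it is stored as one string.
            cursor.execute(sql, (','.join(item['dy_link']),
                                 item['next_page_name'],
                                 item['full_page_link']))
            dbObject.commit()
        except Exception as e:
            print("Error here >>>>", e, "<<<<<< error here")
            dbObject.rollback()
        finally:
            # A connection is opened for every item, so close it again.
            dbObject.close()
        return item
```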
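The post does not show how the `ygdy.dy` table was created. As a sketch of the same insert, commit, and rollback flow that can be run without a MySQL server, the example below swaps in the standard library's `sqlite3` with an in-memory database. The three text columns and the `save_item` helper are assumptions made for illustration, not the author's schema. Note that `sqlite3` uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumed schema: three plain text columns matching the item fields.
conn.execute(
    'CREATE TABLE dy (dy_link TEXT, next_page_name TEXT, full_page_link TEXT)'
)


def save_item(conn, item):
    """Insert one item; return True on success, False after a rollback."""
    sql = ('INSERT INTO dy (dy_link, next_page_name, full_page_link) '
           'VALUES (?, ?, ?)')
    try:
        # The spider collects dy_link as a list, so flatten it to one string.
        conn.execute(sql, (','.join(item['dy_link']),
                           item['next_page_name'],
                           item['full_page_link']))
        conn.commit()
        return True
    except (sqlite3.Error, KeyError, TypeError) as exc:
        print('insert failed:', exc)
        conn.rollback()
        return False


good = {'dy_link': ['ftp://example.com/a.mkv', 'ftp://example.com/b.mkv'],
        'next_page_name': 'Some Movie',
        'full_page_link': 'http://example.com/detail/1.html'}
bad = {'next_page_name': 'Missing links'}  # no dy_link, so the insert fails

saved_good = save_item(conn, good)
saved_bad = save_item(conn, bad)
rows = conn.execute('SELECT * FROM dy').fetchall()
print(saved_good, saved_bad, rows)
```

Passing the values as parameters instead of formatting them into the SQL string leaves quoting to the driver, which is also what the pipeline's `cursor.execute(sql, (...))` call does.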
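settings
At first nothing was being saved to the database. I asked Cheng again, and it turned out that this block
```python
ITEM_PIPELINES = {
    'yangguang.pipelines.YangguangPipeline': 300,
}
```
is commented out by default in the generated settings.py, so you have to uncomment it there before the pipeline runs.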
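The number 300 is the pipeline's order value: when several pipelines are enabled, Scrapy runs them from the lowest value to the highest, and the values are conventionally kept in the 0 to 1000 range. A small sketch of that ordering in plain Python, where `yangguang.pipelines.CleanupPipeline` is a hypothetical second pipeline that does not exist in this project:

```python
ITEM_PIPELINES = {
    'yangguang.pipelines.YangguangPipeline': 300,
    'yangguang.pipelines.CleanupPipeline': 100,  # hypothetical, for illustration
}

# Sort by the order value to see which pipeline would handle an item first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With only one pipeline enabled, as in this post, the exact number makes no practical difference.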
Finally, here is a look at what ended up in the database:
![Paste_Image.png](http://upload-images.jianshu.io/upload_images/4324326-f4be3cfed4d4663d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
Honey, come ask me for your movies..
![Paste_Image.png](http://upload-images.jianshu.io/upload_images/4324326-da0013c37346bd4d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)