Install Scrapy
Anaconda3-5.0.1-Windows-x86_64 (http://www.scrapyd.cn/download/125.html)
...
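With Anaconda (or plain Python) already set up, Scrapy can also be installed from the command line; either of the following is a common route (conda-forge is just one possible channel):

conda install -c conda-forge scrapy
pip install scrapy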
Create a project
scrapy startproject mingyan
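startproject generates a project skeleton roughly like the following (the exact set of files varies slightly by Scrapy version):

mingyan/
    scrapy.cfg
    mingyan/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py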
Create a spider
scrapy genspider dytt8 http://www.dytt8.net/html/gndy/jddy/20160320/50523.html
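genspider writes a spider skeleton into the project's spiders/ directory, roughly like the sketch below (class name and allowed_domains are derived from the arguments; the exact template depends on the Scrapy version):

import scrapy

class Dytt8Spider(scrapy.Spider):
    name = 'dytt8'
    allowed_domains = ['www.dytt8.net']
    start_urls = ['http://www.dytt8.net/html/gndy/jddy/20160320/50523.html']

    def parse(self, response):
        pass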
Write the project
-- spiders/mingyan.py
import scrapy


class MingyanSpider(scrapy.Spider):  # a spider must inherit from scrapy.Spider
    name = "mingyan"

    def start_requests(self):
        urls = [
            'http://lab.scrapyd.cn/page/1/',
            'http://lab.scrapyd.cn/page/2/',
        ]
        # tag: optional spider argument passed in when starting the crawl
        tag = getattr(self, 'tag', None)
        if tag is not None:  # if a tag was given, crawl that tag's page instead
            # e.g. tag=爱情 -> http://lab.scrapyd.cn/tag/爱情
            urls = ['http://lab.scrapyd.cn/tag/' + tag]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'mingyan-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('saved file: %s' % filename)
        # next page: follow the pagination link (the selector is only an example, adjust it to the page)
        # next_page = response.css('li.next a::attr(href)').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)
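The tag read with getattr above is a spider argument; it is supplied on the command line with -a when starting the crawl, for example:

scrapy crawl mingyan -a tag=爱情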
-- items.py
import scrapy


class MyscrapyItem(scrapy.Item):
    student_id = scrapy.Field()
    student_name = scrapy.Field()
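A minimal sketch of how such an item could be filled and yielded from a spider's parse callback (the CSS selectors and page structure here are hypothetical; only the Item usage is the point):

def parse(self, response):
    item = MyscrapyItem()  # hypothetical usage inside a spider's parse callback
    item['student_id'] = response.css('.id::text').extract_first()      # hypothetical selector
    item['student_name'] = response.css('.name::text').extract_first()  # hypothetical selector
    yield item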
Run the project
scrapy crawl mingyan
scrapy runspider scrapy_cn.py (run a single spider file directly, without a project)
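Scraped items can also be written straight to a feed file with the -o option, e.g.:

scrapy crawl mingyan -o mingyan.json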
Debug
scrapy shell http://lab.scrapyd.cn
>>> response.css('title')
(the matched selectors are printed here as the debug output)
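Other selector expressions can be tried interactively in the same shell; both CSS and XPath are supported, e.g.:

>>> response.css('title::text').extract_first()
>>> response.xpath('//title/text()').extract()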
Download a page
scrapy fetch http://www.scrapyd.cn > 3.html
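Two related conveniences: --nolog keeps Scrapy's log output out of the terminal, and scrapy view opens the downloaded page in a browser:

scrapy fetch --nolog http://www.scrapyd.cn > 3.html
scrapy view http://www.scrapyd.cn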
Python file I/O
filename = 'mingyan-%s.html' % page
with open(filename, 'wb') as f:
    f.write(response.body)  # write the raw response bytes to a local file
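To read the saved page back, the file can be opened in binary mode and decoded (UTF-8 is assumed here; the actual page encoding may differ):

with open(filename, 'rb') as f:
    html = f.read().decode('utf-8', errors='ignore')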