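Last night I stumbled onto Cui Qingcai's blog and decided to try crawling the entire contents of a website. The Fuliba site can no longer be found, so I ended up wandering over to Autohome (汽车之家, http://www.autohome.com.cn/beijing/) instead.
I really like this site. Even women like cars, let alone men. (facepalm)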
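The approach:
1. Use CrawlSpider as the spider base class.
2. Use Rule.
Used together, these two can crawl an entire site.
3. Use LinkExtractor with Rule to match URLs against a pattern.
4. FormRequest: the request class Scrapy uses for form-based logins.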
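Note: the full-site crawl here does nothing more than print every URL containing .html and save it to a JSON file.
We could go further and extract the content behind each URL.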
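To see which URLs that pattern would let through, here is a small sketch using only the standard library. As far as I know, LinkExtractor treats each `allow` entry as a regular expression and searches for it anywhere in the absolute URL, so the sketch does the same with `re.search`. The URLs and the stricter `ALLOW_STRICT` pattern are made up for illustration.

```python
import re

# The same pattern the spider passes to LinkExtractor(allow=...).
ALLOW = re.compile(r"\.html")
# Hypothetical stricter variant: only URLs that END in .html.
ALLOW_STRICT = re.compile(r"\.html$")

# Hypothetical URLs, for illustration only.
urls = [
    "http://www.autohome.com.cn/news/201708/1.html",
    "http://www.autohome.com.cn/news/201708/1.html?page=2",
    "http://www.autohome.com.cn/shanghai/",
]

# search() matches anywhere in the string, so a query string after .html still passes.
loose = [u for u in urls if ALLOW.search(u)]
strict = [u for u in urls if ALLOW_STRICT.search(u)]

print(loose)
print(strict)
```

The loose pattern keeps the URL with the query string as well, which is one reason a crawl like this can collect more pages than expected.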
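The extraction approach: open any one of the URLs, find the content you want, and extract it with XPath.
The usual way to get an XPath: select the line you want in the browser's developer tools, then right-click, Copy --> Copy XPath. Veteran users say it is best to use Chrome's XPath; Firefox sometimes fails to reach the element you want.
Some simple, commonly used XPath rules: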
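//*[@id="post_content"]/p[1]
Meaning: anywhere in the document, find the element whose id is post_content, then take its first p child (p[1]).
If what you need is the text of that element, append a little more, like this:
//*[@id="post_content"]/p[1]/text()
Appending /text() selects the element's text.
To extract an attribute instead, replace text() with @attribute, for example:
//*[@id="post_content"]/p[1]/@src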
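So easy! That's XPath extraction done. Now, how do you use it? Even simpler:
response.xpath('the XPath you copied').extract()[index of the value you want]
Note that what XPath extraction returns is a list by default.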
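To try these rules without running a crawl, here is a minimal standard-library sketch. It is not Scrapy: Python's `xml.etree.ElementTree` only supports a subset of XPath (no `text()` or `@attr` steps, and paths start with `.`), so `.text` stands in for `/text()`. The page fragment is made up. In a spider you would pass the full expression to `response.xpath()` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
<div id="post_content">
  <p>First paragraph</p>
  <p>Second paragraph</p>
</div>
</body></html>"""

root = ET.fromstring(PAGE)

# //*[@id="post_content"]/p[1] -> the first <p> under the element with that id
first_p = root.find(".//*[@id='post_content']/p[1]")
first_text = first_p.text  # stands in for the trailing /text()

# Without [1] every <p> matches, and the result is a list.
all_p = root.findall(".//*[@id='post_content']/p")
texts = [p.text for p in all_p]

# Python list indexes start at 0, while XPath positions start at 1,
# so texts[1] is the same paragraph as p[2] in the XPath.
second_text = texts[1]

print(first_text)
print(texts)
```

The off-by-one between XPath positions and list indexes is worth keeping in mind when mixing `p[1]` in the expression with `[0]` on the extracted list.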
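Those are the basic extraction rules. Easy to follow, right? I think so too; it is much easier than what I studied before. Maybe that's because I'm still a beginner. Hahaha.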
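Appendix:
About the imgurl XPath:
Pick any image and copy its XPath; you will get something like this:
//*[@id="post_content"]/p[2]/img
Look at the page and you will see that each post has several images, and each address sits in the src attribute of an img tag inside a p tag.
Drop the 2 to get:
//*[@id="post_content"]/p/img
Now it matches the img tags under every p tag. Append /@src and you get all the image addresses. (There is no [0] here because we want every address; with it we would get only one.)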
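The difference between `p[2]/img` and `p/img` can be checked with a small standard-library sketch. Again this uses `xml.etree.ElementTree` rather than Scrapy, so `img.get("src")` stands in for the `/@src` step, and the page fragment and image URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
<div id="post_content">
  <p>Intro text</p>
  <p><img src="http://example.com/a.jpg" /></p>
  <p><img src="http://example.com/b.jpg" /></p>
</div>
</body></html>"""

root = ET.fromstring(PAGE)

# With the position: only the <img> inside the second <p>.
one = root.findall(".//*[@id='post_content']/p[2]/img")
# Without the position: the <img> inside every <p>.
every = root.findall(".//*[@id='post_content']/p/img")

# .get("src") plays the role of /@src here.
one_urls = [img.get("src") for img in one]
all_urls = [img.get("src") for img in every]

print(one_urls)
print(all_urls)
```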
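For more on XPath usage and features, I suggest having a look at w3cschool.
Clearly I haven't read much of w3c myself. I should make time for it; it is the fundamentals, after all.
Anyway, that's enough rambling. I feel like a real chatterbox.
Here is the code; the comments inside are fairly detailed.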
Step 1:
The file in the spiders directory:
# -*- coding: utf-8 -*-
# @Time : 2017/8/27 0:43
# @Author : 蛇崽
# @Email : 17193337679@163.com (mainly practice for full-site crawling)
# @File : LongXunDaoHangSpider.py
# CrawlSpider and Rule together can traverse a whole site
from scrapy.spiders import CrawlSpider, Rule
# LinkExtractor works with Rule to match URL patterns
from scrapy.linkextractors import LinkExtractor
# Request issues a request; FormRequest is what Scrapy uses for form logins
# (neither is used by this spider yet)
from scrapy import Request, FormRequest
from allNet.items import LongXunDaoHang


class longxunDaoHang(CrawlSpider):
    name = 'longxun'
    allowed_domains = ['autohome.com.cn']
    start_urls = ['http://www.autohome.com.cn/shanghai/']
    rules = (
        # Follow every link whose URL contains ".html" and pass the page to parse_item
        Rule(LinkExtractor(allow=(r'\.html',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)
        daohang = LongXunDaoHang()
        daohang['categoryLink'] = response.url
        yield daohang
Step 2:
Contents of settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for allNet project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'allNet'
SPIDER_MODULES = ['allNet.spiders']
NEWSPIDER_MODULE = 'allNet.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'allNet (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'allNet.middlewares.AllnetSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'allNet.middlewares.MyCustomDownloaderMiddleware': 543,
# 'allNet.middleware.JsonWritePipline':300,
# }
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'allNet.pipelines.AllnetPipeline': 300,
    'allNet.pipelines.JsonWritePipline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Step 3:
Contents of pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class AllnetPipeline(object):
    def process_item(self, item, spider):
        return item


# Write each item to a JSON file, one object per line
class JsonWritePipline(object):
    def __init__(self):
        self.file = open('汽车之家全站url.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    # Scrapy calls close_spider (not spider_closed) on pipelines when the spider finishes
    def close_spider(self, spider):
        self.file.close()
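Item pipelines are plain Python classes, so the JSON-writing logic can be tried without running a crawl. The sketch below is a stand-in, not the project's pipeline: `JsonLinesWriter`, the temp-file path and the sample item are made up for illustration. It drives `open_spider`, `process_item` and `close_spider` by hand, which are the hook names Scrapy's item-pipeline interface uses.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Hypothetical stand-in for the pipeline: one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file when the crawl starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Closing flushes anything still buffered to disk.
        self.file.close()


# Drive the hooks by hand, in the order Scrapy would call them.
path = os.path.join(tempfile.mkdtemp(), "urls.json")
writer = JsonLinesWriter(path)
writer.open_spider(spider=None)
writer.process_item({"categoryLink": "http://www.autohome.com.cn/汽车/1.html"}, spider=None)
writer.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file in `open_spider` rather than `__init__` is a design choice in this sketch; it keeps file handling tied to the crawl's lifetime.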
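Oddly, I never configured cookies or anything else for Autohome, yet the full-site crawl never raised an error. I wrote the code around midnight last night, assuming Scrapy would finish before long, but it was still crawling this morning, which left me speechless. The computer went to sleep several times overnight and I woke it up each time. The strange part is that the spider is still running but nothing more is being written to the JSON file. Finally, here is a screenshot of the crawl results: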
Homework for myself: extract and store data from the crawled URLs. Put simply: save the content behind each URL. (facepalm.jpg)
The source code will be uploaded to GitHub:
https://github.com/643435675/PyStudy