17 Differences between Scrapy's built-in CrawlSpider and Spider; filtering links with regular expressions

Change into the project folder and run the following on the command line:

scrapy genspider -l

Output:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

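Use the same command to create another spider file for Ganji (ganji.com).

Creating a CrawlSpider
After cd-ing into the project directory, run:

scrapy genspider -t crawl <spider_name> <domain>

For example: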

scrapy genspider -t crawl ganji2 ganji.com

Output:

>>>Created spider 'ganji2' using template 'crawl' in module:
  secondary_zufang.spiders.ganji2

A new file, ganji2.py, appears in the project's spiders directory.

Its contents look like this:


In this file, change start_urls to the page(s) you actually want to crawl.

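Comparing the crawl and basic templates

Create a basic-type spider:

scrapy genspider -t basic tmp example.com

Once it is created, a new file tmp.py appears in the project's spiders directory.
It was also generated from the command line, but compared with the CrawlSpider above it is missing a few things.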


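In a CrawlSpider the callback is parse_item, and you must not override parse: CrawlSpider uses parse internally to apply its rules, so overriding it can break the crawl.
According to the official documentation, CrawlSpider is meant for crawling sites whose links follow regular patterns, which it follows according to a set of rules.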

Debugging with the Scrapy shell

cd into the project folder and run scrapy shell <URL>, for example:

scrapy shell http://bj.ganji.com/wblist/haidian/zufang/

Output:

2018-12-21 12:15:06 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: secondary_zufang)
2018-12-21 12:15:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 (v3.6.3:2c5fed86e0, Oct  3 2017, 00:32:08) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-12-21 12:15:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'secondary_zufang', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'secondary_zufang.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['secondary_zufang.spiders']}
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled item pipelines:
['secondary_zufang.pipelines.SecondaryZufangPipeline']
2018-12-21 12:15:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-21 12:15:06 [scrapy.core.engine] INFO: Spider opened
2018-12-21 12:15:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://bj.ganji.com/robots.txt> (referer: None)
2018-12-21 12:15:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://bj.ganji.com/wblist/haidian/zufang/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10d639860>
[s]   item       {}
[s]   request    <GET http://bj.ganji.com/wblist/haidian/zufang/>
[s]   response   <200 http://bj.ganji.com/wblist/haidian/zufang/>
[s]   settings   <scrapy.settings.Settings object at 0x10e74d710>
[s]   spider     <Ganji2Spider 'ganji2' at 0x10edaad30>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

Next, import the link-extraction module:

from scrapy.linkextractors import LinkExtractor

Enter the following commands to extract every link on the page:

tmp = LinkExtractor(r'')  # an empty regex, which matches every link
tmp.extract_links(response)
[Link(url='http://bj.ganji.com/fang/', text='', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/chuzu/', text='租房', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/ershoufang/', text='二手房', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/shangpucs/', text='商铺', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zhaozu/', text='写字楼', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/changfang/', text='厂房', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/cangkucf/', text='仓库', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/tudi/', text='土地', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/cheku/', text='车位', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/wblist/haidian/zufang/', text='出租房', fragment='', nofollow=True),
 Link(url='http://post.58.com/fang/1/8/s5', text='免费发布信息', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/', text='北京赶集', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/', text='北京租房', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/?pagetype=area', text='区域\n                                    \n                                ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/sub/?pagetype=ditie', text='地铁\n                                    \n                                ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/chaoyang/zufang/', text='朝阳', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/haidian/zufang/', text='海淀', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/dongcheng/zufang/', text='东城', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/xicheng/zufang/', text='西城', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/chongwen/zufang/', text='崇文', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/xuanwu/zufang/', text='宣武', fragment='', nofollow=False),
........ (some similar entries omitted here) ........
 Link(url='http://bj.ganji.com/xiaoqu/huayuandonglu16hao/chuzuxq/', text='\n                                                花园东路16号院...\n                                            ', fragment='', nofollow=False),
 Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0STWzx3zCRPYhCR2qu8NqUfyBP1ZKMCGY1mbpJGLQe4MLWCBtO3CV1GeEvZYetOdm79IubjBATd84ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjWK1hBvMP5Ogvh79ZdJK950gnTnuV4ut2oMzJts5psgWNQ37EDbog7g&pubid=53973391&apptype=10&psid=152852492202554173951819533&entinfo=36506725264296_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/fuxinglu11haoyuan/chuzuxq/', text='\n                                                复兴路11号院...\n                                            ', fragment='', nofollow=False),
 Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0Rjk7py_gFkXIyKx8C3feATyBP1ZKMCGY0_lHXr41EYvY6_kCEzXoV_eEvZYetOdm7tUgy8gGYBrIukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjJlshikCCdcyrXame0WrKfkgnTnuV4ut2oMzJts5psgWsF5zmZtTCDw&pubid=53952061&apptype=10&psid=152852492202554173951819533&entinfo=36506153352577_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0RK-qBSOCO4X9Aqxb55Bt8ryBP1ZKMCGY2UE2j0rkCcdgL7Z5Dw3ipDeEvZYetOdm63BuhRypVvZ4ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjNQr1dZSEQZQ-CW_Mhoakx0gnTnuV4ut2oMzJts5psgWZGAfLB2zanA&pubid=53897059&apptype=10&psid=152852492202554173951819533&entinfo=36504407043592_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/mudanyuandongli/chuzuxq/', text='\n                                                牡丹园东里...\n                                            ', fragment='', nofollow=False),
 Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0TMq88evYBaBSquYHHcYTCiyBP1ZKMCGY1mbpJGLQe4MAc5-aJwKmkPeEvZYetOdm7Aap9GwaXd64ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjENEfUaGPtcBsLmmwuR9Ix0gnTnuV4ut2oMzJts5psgWDhtNAesvD4A&pubid=53903799&apptype=10&psid=152852492202554173951819533&entinfo=36504585975840_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/longxianglu8hao/chuzuxq/', text='\n                                                龙翔路8号院...\n                                            ', fragment='', nofollow=False),
 Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0RClDRDo67SzOLvSyQZZ8HYyBP1ZKMCGY315cEFUaeoIMHPhud0MxuWeEvZYetOdm772LkDJkdp34ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjmuTv-Y3xRZwiPZB47nGe9UgnTnuV4ut2oMzJts5psgUmMYZH-DENLw&pubid=53895758&apptype=10&psid=152852492202554173951819533&entinfo=36504374330393_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/309yiyuanjiashulou/chuzuxq/', text='\n                                                黑山扈路甲17号院...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36525781752728x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/cuiweilu21haoyuan/chuzuxq/', text='\n                                                翠微路21号院...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36514856072971x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/landehuating/chuzuxq/', text='\n                                                兰德华庭...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36364461048994x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/xiaoyingdonglu7haoyuan/chuzuxq/', text='\n                                                小营东路7号院...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36283837290114x.shtml?ding=https://short.58.com/zd_p/4d183517-ac8c-4370-a0c1-282763d4a987/?target=dc-16-xgk_hvimob_89368680324775q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/taipinglu34haoyuan/chuzuxq/', text='\n                                                太平路34号院...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36448703569416x.shtml?ding=https://short.58.com/zd_p/57e5e09c-29bc-4db5-b392-46106dfa6069/?target=dc-16-xgk_hvimob_89556048192579q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/zijinzhangan/chuzuxq/', text='\n                                                紫金长安...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36452227627012x.shtml?ding=https://short.58.com/zd_p/59f9ea44-183d-40e8-8bcb-e3c7f13874c5/?target=dc-16-xgk_hvimob_89513330930473q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/yongwangjiayuansiqu/chuzuxq/', text='\n                                                永旺家园四区...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36414250778516x.shtml?ding=https://short.58.com/zd_p/a859ca84-d600-4400-9a95-38f3d762e828/?target=dc-16-xgk_hvimob_89575314006179q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/huarunxiangshuwanyiqi/chuzuxq/', text='\n                                                橡树湾一期...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36227844908548x.shtml?ding=https://short.58.com/zd_p/9bfed205-196b-4420-9087-e2e1a3269ddd/?target=dc-16-xgk_hvimob_89330655246156q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/yiheshanzhuangbj/chuzuxq/', text='\n                                                颐和山庄...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36455527886345x.shtml?ding=https://short.58.com/zd_p/bce20dd9-651c-428f-a9b5-8c291fc7b376/?target=dc-16-xgk_hvimob_89511130669851q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/beiwujiayuanxili/chuzuxq/', text='\n                                                北坞嘉园西里...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36455272312598x.shtml?ding=https://short.58.com/zd_p/2df7e63a-e90f-4534-9785-a951019f26df/?target=dc-16-xgk_hvimob_89511303873126q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/tujingjiayuan/chuzuxq/', text='\n                                                图景嘉园...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36455485142684x.shtml?ding=https://short.58.com/zd_p/32d7b98e-3e39-40e2-ae4b-7d7c2c76b5b4/?target=dc-16-xgk_hvimob_89511561753965q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/yichengxishanhuafuxiyuan/chuzuxq/', text='\n                                                亿城西山华府禧园...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/35843472234064x.shtml?ding=https://short.58.com/zd_p/2fb60c53-a4f9-4b09-a4e7-e9674705065a/?target=dc-16-xgk_hvimob_81658503385495q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/wanshoulu18haoyuan/chuzuxq/', text='\n                                                万寿路18号院...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36005967713804x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/mashenmiaoxiaoqu/chuzuxq/', text='\n                                                马神庙小区...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36514389880345x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/zhangzhilu5hao/chuzuxq/', text='\n                                                财智会馆...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36210224693127x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/zizaixiangshan/chuzuxq/', text='\n                                                永泰自在香山...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36515288107668x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/wanshuyuan/chuzuxq/', text='\n                                                万树园...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36219286813961x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/fenghuangxiaoqu1/chuzuxq/', text='\n                                                凤凰小区...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36226224648989x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/wufulinglongju/chuzuxq/', text='\n                                                五福玲珑居...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36523006142221x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/lingnanlu30haoyuan/chuzuxq/', text='\n                                                科委宿舍...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36518070380572x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/longxiangzhongqu/chuzuxq/', text='\n                                                龙乡小区(中区)...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36260836748417x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/huangzhuangxiaoqubj/chuzuxq/', text='\n                                                中国科学院黄庄小区...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36496483775505x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/qiangyouqinghexincheng/chuzuxq/', text='\n                                                强佑清河新城...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36515848489356x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/yiyuanjuyiqi/chuzuxq/', text='\n                                                颐源居...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36453349478792x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/jinyumeiheyuan/chuzuxq/', text='\n                                                美和园西区...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36448512145167x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/baoshengli/chuzuxq/', text='\n                                                宝盛里...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36440408016649x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/ruiheyuan/chuzuxq/', text='\n                                                金隅瑞和园...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36521935379723x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/jingouhelu12haoyuan/chuzuxq/', text='\n                                                金沟河路12号院...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36364398864531x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/xuezhiyuan/chuzuxq/', text='\n                                                学知园...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36459015677700x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/mingguangbeili/chuzuxq/', text='\n                                                明光北里...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36522777615364x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/zizhuyuanjia3hao/chuzuxq/', text='\n                                                紫竹院路甲3号院...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36360244840342x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/shanglinxi/chuzuxq/', text='\n                                                上林溪...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36430964540828x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/yongtaidongli/chuzuxq/', text='\n                                                永泰东里...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36485009202206x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36470563064452x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/beiyisanyuan/chuzuxq/', text='\n                                                北医三院家属区小区...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36517073349672x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/taiyueyuan/chuzuxq/', text='\n                                                太月园(南区)...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/36505985388960x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/yifengzhuangyuan/chuzuxq/', text='\n                                                颐丰庄园(西区)...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/zufang/34794182085293x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/xiaoqu/shangdijiayuanbj/chuzuxq/', text='\n                                                上地佳园...\n                                            ', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/haidian/zufang/pn2/', text='2', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/haidian/zufang/pn3/', text='3', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/haidian/zufang/pn70/', text='70', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/hezu/', text='北京合租房', fragment='', nofollow=False),
 Link(url='http://sh.ganji.com/zufang', text='上海租房网', fragment='', nofollow=False),
 Link(url='http://zz.ganji.com/zufang', text='郑州租房网', fragment='', nofollow=False),
 Link(url='http://sy.ganji.com/zufang', text='沈阳租房网', fragment='', nofollow=False),
 Link(url='http://sz.ganji.com/zufang', text='深圳租房网', fragment='', nofollow=False),
 Link(url='http://cd.ganji.com/zufang', text='成都租房网', fragment='', nofollow=False),
 Link(url='http://cq.ganji.com/zufang', text='重庆租房网', fragment='', nofollow=False),
 Link(url='http://qd.ganji.com/zufang', text='青岛租房网', fragment='', nofollow=False),
 Link(url='http://wh.ganji.com/zufang', text='武汉租房网', fragment='', nofollow=False),
 Link(url='http://tj.ganji.com/zufang', text='天津租房网', fragment='', nofollow=False),
 Link(url='http://jn.ganji.com/zufang', text='济南租房网', fragment='', nofollow=False),
 Link(url='http://nj.ganji.com/zufang', text='南京租房网', fragment='', nofollow=False),
 Link(url='http://gz.ganji.com/zufang', text='广州租房网', fragment='', nofollow=False),
 Link(url='http://xa.ganji.com/zufang', text='西安租房网', fragment='', nofollow=False),
 Link(url='http://hf.ganji.com/zufang', text='合肥租房网', fragment='', nofollow=False),
 Link(url='http://sjz.ganji.com/zufang', text='石家庄租房网', fragment='', nofollow=False),
 Link(url='http://dl.ganji.com/zufang', text='大连租房网', fragment='', nofollow=False),
 Link(url='http://hz.ganji.com/zufang', text='杭州租房网', fragment='', nofollow=False),
 Link(url='http://kezhan.58.com/bj/qingnianlvshe/', text='北京青年旅社', fragment='', nofollow=False),
 Link(url='http://bj.58.com/xiaoqu/shenggunanli/', text='胜古南里', fragment='', nofollow=False),
 Link(url='http://bj.ganji.com/wblist/haidian/zufang/m.anjuke.com/bj/loupan/haidian/', text='海淀楼盘', fragment='', nofollow=False),
 Link(url='http://m.anjuke.com/bj/loupan/249388/', text='京汉铂寓', fragment='', nofollow=False),
 Link(url='http://bj.zu.anjuke.com/fangyuan/haidian/', text='海淀租房', fragment='', nofollow=False),
 Link(url='http://bj.58.com/pinpaigongyu/646228473643278336/', text='家乐美地', fragment='', nofollow=False),
 Link(url='http://www.ganji.com/misc/abouts/index.php?act=about', text='关于Ganji', fragment='', nofollow=True),
 Link(url='http://www.ganji.com/tuiguang/index/', text='赶集推广', fragment='', nofollow=True),
 Link(url='http://tuiguang.ganji.com/zhaoshang/agent.htm', text=' 渠道合作 ', fragment='', nofollow=True),
 Link(url='http://help.ganji.com/', text='帮助中心', fragment='', nofollow=True),
 Link(url='http://help.ganji.com/html/sjbmy/', text='手机号被冒用', fragment='', nofollow=True),
 Link(url='http://www.ganji.com/misc/abouts/link.php?act=link', text='友情链接', fragment='', nofollow=True),
 Link(url='http://www.ganji.com/misc/abouts/index.php?act=job', text='招贤纳士', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/quxiandaohang/', text='区县导航', fragment='', nofollow=False),
 Link(url='http://mobile.ganji.com/', text='手机赶集', fragment='', nofollow=True),
 Link(url='http://3g.ganji.com/bj_fang1/', text='租房触屏版', fragment='', nofollow=False)]

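Every link on the page was extracted, which is far more than we need.
Filtering with a regular expression narrows this down to the links we want, but the regex can only be designed once the link pattern is known.


Some of the listing URLs carry key-value parameters after a question mark; ignore that part for now and match only what comes before it. A listing-page URL should have the form r'http://bj.ganji.com/zufang/\d+x.shtml'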
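
A quick way to sanity-check such a pattern before handing it to LinkExtractor is to try it with Python's re module on a few URLs taken from the output above. The sketch below assumes the extractor tests each URL with a search rather than a full match, which is consistent with URLs carrying a ?ding=... query string still being accepted. The escaped variant is an optional tightening, not part of the original command.

```python
import re

# Pattern from the text; the unescaped dots match any single character.
loose = re.compile(r'http://bj.ganji.com/zufang/\d+x.shtml')
# Stricter variant (an optional refinement): escape the literal dots.
strict = re.compile(r'http://bj\.ganji\.com/zufang/\d+x\.shtml')

# Sample URLs copied from the extract_links() output above.
urls = [
    'http://bj.ganji.com/zufang/36525781752728x.shtml',
    'http://bj.ganji.com/zufang/36283837290114x.shtml?ding=https://short.58.com/zd_p/4d183517-ac8c-4370-a0c1-282763d4a987/?target=dc-16-xgk_hvimob_89368680324775q-feykn&end=end',
    'http://bj.ganji.com/xiaoqu/shangdijiayuanbj/chuzuxq/',
    'http://bj.ganji.com/haidian/zufang/pn2/',
]

# search() only needs the pattern to occur somewhere in the URL,
# so the query string after '?' does not prevent a match.
matched = [u for u in urls if strict.search(u)]
for u in matched:
    print(u)
```

On these samples both patterns keep the two listing URLs and drop the community and pagination links. The strict form only differs on URLs where some other character sits in a dot's position.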

Input:

tmp = LinkExtractor(r'http://bj.ganji.com/zufang/\d+x.shtml')
tmp.extract_links(response)  # returns a list

The output contains noticeably fewer links:

[Link(url='http://bj.ganji.com/zufang/36525781752728x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36514856072971x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36364461048994x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36283837290114x.shtml?ding=https://short.58.com/zd_p/4d183517-ac8c-4370-a0c1-282763d4a987/?target=dc-16-xgk_hvimob_89368680324775q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36448703569416x.shtml?ding=https://short.58.com/zd_p/57e5e09c-29bc-4db5-b392-46106dfa6069/?target=dc-16-xgk_hvimob_89556048192579q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36452227627012x.shtml?ding=https://short.58.com/zd_p/59f9ea44-183d-40e8-8bcb-e3c7f13874c5/?target=dc-16-xgk_hvimob_89513330930473q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36414250778516x.shtml?ding=https://short.58.com/zd_p/a859ca84-d600-4400-9a95-38f3d762e828/?target=dc-16-xgk_hvimob_89575314006179q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36227844908548x.shtml?ding=https://short.58.com/zd_p/9bfed205-196b-4420-9087-e2e1a3269ddd/?target=dc-16-xgk_hvimob_89330655246156q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36455527886345x.shtml?ding=https://short.58.com/zd_p/bce20dd9-651c-428f-a9b5-8c291fc7b376/?target=dc-16-xgk_hvimob_89511130669851q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36455272312598x.shtml?ding=https://short.58.com/zd_p/2df7e63a-e90f-4534-9785-a951019f26df/?target=dc-16-xgk_hvimob_89511303873126q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36455485142684x.shtml?ding=https://short.58.com/zd_p/32d7b98e-3e39-40e2-ae4b-7d7c2c76b5b4/?target=dc-16-xgk_hvimob_89511561753965q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/35843472234064x.shtml?ding=https://short.58.com/zd_p/2fb60c53-a4f9-4b09-a4e7-e9674705065a/?target=dc-16-xgk_hvimob_81658503385495q-feykn&end=end', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36005967713804x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36514389880345x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36210224693127x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36515288107668x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36219286813961x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36226224648989x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36523006142221x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36518070380572x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36260836748417x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36496483775505x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36515848489356x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36453349478792x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36448512145167x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36440408016649x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36521935379723x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36364398864531x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36459015677700x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36522777615364x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36360244840342x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36430964540828x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36485009202206x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36470563064452x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36517073349672x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/36505985388960x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True),
 Link(url='http://bj.ganji.com/zufang/34794182085293x.shtml', text='\n                                            \n                                        ', fragment='', nofollow=True)]

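Each of the links listed above was extracted by LinkExtractor. If the same extractor is written into the spider's rules, the spider crawls those links directly.
Open the project in PyCharm and replace the regular expression in the Rule with: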

http://bj.ganji.com/zufang/\d+x.shtml

This step also needs a callback. Here it is set to parse_item, which automatically receives the response for each matched link. To find out which URL a response came from, read response.url.
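
As a quick check of the pattern, the sketch below runs it through Python's `re` module. It applies the pattern with a search rather than a full match, which is how LinkExtractor treats `allow`, and it escapes the dots so that `.` only matches a literal dot. The first URL is taken from the shell output above; the second is one of the promoted links with its query string shortened for illustration.

```python
import re

# Same pattern as above, with the dots escaped so they match literally.
DETAIL_URL = re.compile(r"http://bj\.ganji\.com/zufang/\d+x\.shtml")

candidates = [
    "http://bj.ganji.com/zufang/36005967713804x.shtml",
    # Query string shortened for illustration.
    "http://bj.ganji.com/zufang/36283837290114x.shtml?end=end",
    # The listing page itself should not match.
    "http://bj.ganji.com/wblist/haidian/zufang/",
]

# search() looks for the pattern anywhere in the URL, so a trailing
# query string does not prevent a match.
detail_urls = [url for url in candidates if DETAIL_URL.search(url)]

for url in detail_urls:
    print(url)
```

Because the match is unanchored, promoted links that carry extra query parameters are picked up as well. If they are unwanted, they have to be excluded separately.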

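CrawlSpider key points

CrawlSpider inherits from the basic Spider, so every method and attribute that Spider has is also available on CrawlSpider.
What sets CrawlSpider apart from Spider is an extra rules attribute. It defines how links are extracted, so you can quickly pick out the URLs that match a regular expression and hand them to a callback function.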

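Notes

1. follow is a boolean that specifies whether the links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False. When follow is True, the spider keeps going: every page opened through such a link is itself searched for links that match the rules.
Note: if a rule specifies neither callback nor follow, follow defaults to True. The extracted links are opened again and the rules are applied to those pages in turn; when a link found there matches a rule that has a callback, its response is passed to that callback for extraction.

Of the two rules here, one matches detail-page links and the other matches the next-page link, so the next-page link is followed by default.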

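2. rules: a collection (list) of one or more Rule objects. Each Rule defines a particular behavior for crawling the site. Rule objects are described below. If several rules match the same link, the first one, in the order they are defined in this attribute, is used.
3. LinkExtractor, the class that extracts URLs. Its main parameters are:
allow: URLs that match the regular expression (or list of regular expressions) are extracted; if it is empty, every URL matches.
deny: URLs that match this regular expression (or list of regular expressions) are never extracted.
allow_domains: the domains whose links will be extracted.
deny_domains: the domains whose links will never be extracted.
restrict_xpaths: an XPath expression that works together with allow to filter links. There is also a similar restrict_css.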

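Warning

When writing rules for a CrawlSpider, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the CrawlSpider will fail to run. A related example: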

$ scrapy shell http://bj.ganji.com/fang1/
# ......
# Scrapy log omitted
>>> from scrapy.linkextractors import LinkExtractor
>>> tmp = LinkExtractor(r'')
>>> len(tmp.extract_links(response))
Out: 875
>>> get_links = LinkExtractor(r'http://bj.ganji.com/fang1/\d+x.htm')
>>> len(get_links.extract_links(response))
Out: 89

In practice, things are not this simple.

If you also need to convert the output to JSON, you can use the following statement:
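
The statement itself is missing at this point. One possible approach, offered only as an assumption, is to serialize each scraped item with the standard `json` module; `ensure_ascii=False` keeps Chinese text readable instead of escaping it. The `item` dictionary and its field names are hypothetical.

```python
import json

# Hypothetical item, shaped like what parse_item might yield.
item = {
    "url": "http://bj.ganji.com/zufang/36005967713804x.shtml",
    "title": "海淀 整租",
}

# ensure_ascii=False writes non-ASCII characters as-is rather than \uXXXX.
line = json.dumps(item, ensure_ascii=False)
print(line)
```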

