Change into the project directory and enter the following on the command line:
scrapy genspider -l
Output:
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
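Use the same command to create another spider file for the Ganji site (ganji.com).
Creating a CrawlSpider
After cd-ing into the project directory, enter the following command:
scrapy genspider -t crawl <new spider name> <site domain>
For example: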
scrapy genspider -t crawl ganji2 ganji.com
Output:
>>>Created spider 'ganji2' using template 'crawl' in module:
secondary_zufang.spiders.ganji2
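A new file, ganji2.py, appears in the project's spiders directory.
It contains a CrawlSpider skeleton generated from the crawl template.
The start_urls in it should be edited by hand to point to the pages you want to crawl.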
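Comparing the crawl and basic templates
Create a spider from the basic template:
scrapy genspider -t basic tmp example.com
Once it has been created, a new file named tmp.py appears in the project's spiders directory.
This spider is also created from the command line, but it is missing a few things compared with the CrawlSpider above.
The CrawlSpider template defines a parse_item method instead of parse, and you must not override parse in a CrawlSpider, otherwise the spider may fail.
According to the official documentation, CrawlSpider is meant for crawling sites whose links follow regular patterns.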
Debugging with the shell command
cd into the project folder and enter scrapy shell <URL>
For example:
scrapy shell http://bj.ganji.com/wblist/haidian/zufang/
Output:
2018-12-21 12:15:06 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: secondary_zufang)
2018-12-21 12:15:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 (v3.6.3:2c5fed86e0, Oct 3 2017, 00:32:08) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-12-21 12:15:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'secondary_zufang', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'secondary_zufang.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['secondary_zufang.spiders']}
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled item pipelines:
['secondary_zufang.pipelines.SecondaryZufangPipeline']
2018-12-21 12:15:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-21 12:15:06 [scrapy.core.engine] INFO: Spider opened
2018-12-21 12:15:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://bj.ganji.com/robots.txt> (referer: None)
2018-12-21 12:15:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://bj.ganji.com/wblist/haidian/zufang/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10d639860>
[s] item {}
[s] request <GET http://bj.ganji.com/wblist/haidian/zufang/>
[s] response <200 http://bj.ganji.com/wblist/haidian/zufang/>
[s] settings <scrapy.settings.Settings object at 0x10e74d710>
[s] spider <Ganji2Spider 'ganji2' at 0x10edaad30>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
Next, import the link extraction module:
from scrapy.linkextractors import LinkExtractor
Enter the following commands to extract every link on the page:
tmp = LinkExtractor(r'')  # an empty regex matches any link
tmp.extract_links(response)
[Link(url='http://bj.ganji.com/fang/', text='', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/chuzu/', text='租房', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/ershoufang/', text='二手房', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/shangpucs/', text='商铺', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zhaozu/', text='写字楼', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/changfang/', text='厂房', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/cangkucf/', text='仓库', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/tudi/', text='土地', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/cheku/', text='车位', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/wblist/haidian/zufang/', text='出租房', fragment='', nofollow=True),
Link(url='http://post.58.com/fang/1/8/s5', text='免费发布信息', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/', text='北京赶集', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/', text='北京租房', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/?pagetype=area', text='区域\n \n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/sub/?pagetype=ditie', text='地铁\n \n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/chaoyang/zufang/', text='朝阳', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/haidian/zufang/', text='海淀', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/dongcheng/zufang/', text='东城', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/xicheng/zufang/', text='西城', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/chongwen/zufang/', text='崇文', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/xuanwu/zufang/', text='宣武', fragment='', nofollow=False),
........ (some similar entries omitted here) ........
Link(url='http://bj.ganji.com/xiaoqu/huayuandonglu16hao/chuzuxq/', text='\n 花园东路16号院...\n ', fragment='', nofollow=False),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0STWzx3zCRPYhCR2qu8NqUfyBP1ZKMCGY1mbpJGLQe4MLWCBtO3CV1GeEvZYetOdm79IubjBATd84ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjWK1hBvMP5Ogvh79ZdJK950gnTnuV4ut2oMzJts5psgWNQ37EDbog7g&pubid=53973391&apptype=10&psid=152852492202554173951819533&entinfo=36506725264296_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/fuxinglu11haoyuan/chuzuxq/', text='\n 复兴路11号院...\n ', fragment='', nofollow=False),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0Rjk7py_gFkXIyKx8C3feATyBP1ZKMCGY0_lHXr41EYvY6_kCEzXoV_eEvZYetOdm7tUgy8gGYBrIukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjJlshikCCdcyrXame0WrKfkgnTnuV4ut2oMzJts5psgWsF5zmZtTCDw&pubid=53952061&apptype=10&psid=152852492202554173951819533&entinfo=36506153352577_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0RK-qBSOCO4X9Aqxb55Bt8ryBP1ZKMCGY2UE2j0rkCcdgL7Z5Dw3ipDeEvZYetOdm63BuhRypVvZ4ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjNQr1dZSEQZQ-CW_Mhoakx0gnTnuV4ut2oMzJts5psgWZGAfLB2zanA&pubid=53897059&apptype=10&psid=152852492202554173951819533&entinfo=36504407043592_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/mudanyuandongli/chuzuxq/', text='\n 牡丹园东里...\n ', fragment='', nofollow=False),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0TMq88evYBaBSquYHHcYTCiyBP1ZKMCGY1mbpJGLQe4MAc5-aJwKmkPeEvZYetOdm7Aap9GwaXd64ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjENEfUaGPtcBsLmmwuR9Ix0gnTnuV4ut2oMzJts5psgWDhtNAesvD4A&pubid=53903799&apptype=10&psid=152852492202554173951819533&entinfo=36504585975840_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/longxianglu8hao/chuzuxq/', text='\n 龙翔路8号院...\n ', fragment='', nofollow=False),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0RClDRDo67SzOLvSyQZZ8HYyBP1ZKMCGY315cEFUaeoIMHPhud0MxuWeEvZYetOdm772LkDJkdp34ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjmuTv-Y3xRZwiPZB47nGe9UgnTnuV4ut2oMzJts5psgUmMYZH-DENLw&pubid=53895758&apptype=10&psid=152852492202554173951819533&entinfo=36504374330393_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/309yiyuanjiashulou/chuzuxq/', text='\n 黑山扈路甲17号院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36525781752728x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/cuiweilu21haoyuan/chuzuxq/', text='\n 翠微路21号院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36514856072971x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/landehuating/chuzuxq/', text='\n 兰德华庭...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36364461048994x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/xiaoyingdonglu7haoyuan/chuzuxq/', text='\n 小营东路7号院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36283837290114x.shtml?ding=https://short.58.com/zd_p/4d183517-ac8c-4370-a0c1-282763d4a987/?target=dc-16-xgk_hvimob_89368680324775q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/taipinglu34haoyuan/chuzuxq/', text='\n 太平路34号院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36448703569416x.shtml?ding=https://short.58.com/zd_p/57e5e09c-29bc-4db5-b392-46106dfa6069/?target=dc-16-xgk_hvimob_89556048192579q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/zijinzhangan/chuzuxq/', text='\n 紫金长安...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36452227627012x.shtml?ding=https://short.58.com/zd_p/59f9ea44-183d-40e8-8bcb-e3c7f13874c5/?target=dc-16-xgk_hvimob_89513330930473q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yongwangjiayuansiqu/chuzuxq/', text='\n 永旺家园四区...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36414250778516x.shtml?ding=https://short.58.com/zd_p/a859ca84-d600-4400-9a95-38f3d762e828/?target=dc-16-xgk_hvimob_89575314006179q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/huarunxiangshuwanyiqi/chuzuxq/', text='\n 橡树湾一期...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36227844908548x.shtml?ding=https://short.58.com/zd_p/9bfed205-196b-4420-9087-e2e1a3269ddd/?target=dc-16-xgk_hvimob_89330655246156q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yiheshanzhuangbj/chuzuxq/', text='\n 颐和山庄...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36455527886345x.shtml?ding=https://short.58.com/zd_p/bce20dd9-651c-428f-a9b5-8c291fc7b376/?target=dc-16-xgk_hvimob_89511130669851q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/beiwujiayuanxili/chuzuxq/', text='\n 北坞嘉园西里...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36455272312598x.shtml?ding=https://short.58.com/zd_p/2df7e63a-e90f-4534-9785-a951019f26df/?target=dc-16-xgk_hvimob_89511303873126q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/tujingjiayuan/chuzuxq/', text='\n 图景嘉园...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36455485142684x.shtml?ding=https://short.58.com/zd_p/32d7b98e-3e39-40e2-ae4b-7d7c2c76b5b4/?target=dc-16-xgk_hvimob_89511561753965q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yichengxishanhuafuxiyuan/chuzuxq/', text='\n 亿城西山华府禧园...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/35843472234064x.shtml?ding=https://short.58.com/zd_p/2fb60c53-a4f9-4b09-a4e7-e9674705065a/?target=dc-16-xgk_hvimob_81658503385495q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/wanshoulu18haoyuan/chuzuxq/', text='\n 万寿路18号院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36005967713804x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/mashenmiaoxiaoqu/chuzuxq/', text='\n 马神庙小区...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36514389880345x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/zhangzhilu5hao/chuzuxq/', text='\n 财智会馆...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36210224693127x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/zizaixiangshan/chuzuxq/', text='\n 永泰自在香山...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36515288107668x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/wanshuyuan/chuzuxq/', text='\n 万树园...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36219286813961x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/fenghuangxiaoqu1/chuzuxq/', text='\n 凤凰小区...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36226224648989x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/wufulinglongju/chuzuxq/', text='\n 五福玲珑居...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36523006142221x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/lingnanlu30haoyuan/chuzuxq/', text='\n 科委宿舍...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36518070380572x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/longxiangzhongqu/chuzuxq/', text='\n 龙乡小区(中区)...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36260836748417x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/huangzhuangxiaoqubj/chuzuxq/', text='\n 中国科学院黄庄小区...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36496483775505x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/qiangyouqinghexincheng/chuzuxq/', text='\n 强佑清河新城...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36515848489356x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yiyuanjuyiqi/chuzuxq/', text='\n 颐源居...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36453349478792x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/jinyumeiheyuan/chuzuxq/', text='\n 美和园西区...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36448512145167x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/baoshengli/chuzuxq/', text='\n 宝盛里...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36440408016649x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/ruiheyuan/chuzuxq/', text='\n 金隅瑞和园...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36521935379723x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/jingouhelu12haoyuan/chuzuxq/', text='\n 金沟河路12号院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36364398864531x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/xuezhiyuan/chuzuxq/', text='\n 学知园...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36459015677700x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/mingguangbeili/chuzuxq/', text='\n 明光北里...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36522777615364x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/zizhuyuanjia3hao/chuzuxq/', text='\n 紫竹院路甲3号院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36360244840342x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/shanglinxi/chuzuxq/', text='\n 上林溪...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36430964540828x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yongtaidongli/chuzuxq/', text='\n 永泰东里...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36485009202206x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36470563064452x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/beiyisanyuan/chuzuxq/', text='\n 北医三院家属区小区...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36517073349672x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/taiyueyuan/chuzuxq/', text='\n 太月园(南区)...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36505985388960x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yifengzhuangyuan/chuzuxq/', text='\n 颐丰庄园(西区)...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/34794182085293x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/shangdijiayuanbj/chuzuxq/', text='\n 上地佳园...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/haidian/zufang/pn2/', text='2', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/haidian/zufang/pn3/', text='3', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/haidian/zufang/pn70/', text='70', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/hezu/', text='北京合租房', fragment='', nofollow=False),
Link(url='http://sh.ganji.com/zufang', text='上海租房网', fragment='', nofollow=False),
Link(url='http://zz.ganji.com/zufang', text='郑州租房网', fragment='', nofollow=False),
Link(url='http://sy.ganji.com/zufang', text='沈阳租房网', fragment='', nofollow=False),
Link(url='http://sz.ganji.com/zufang', text='深圳租房网', fragment='', nofollow=False),
Link(url='http://cd.ganji.com/zufang', text='成都租房网', fragment='', nofollow=False),
Link(url='http://cq.ganji.com/zufang', text='重庆租房网', fragment='', nofollow=False),
Link(url='http://qd.ganji.com/zufang', text='青岛租房网', fragment='', nofollow=False),
Link(url='http://wh.ganji.com/zufang', text='武汉租房网', fragment='', nofollow=False),
Link(url='http://tj.ganji.com/zufang', text='天津租房网', fragment='', nofollow=False),
Link(url='http://jn.ganji.com/zufang', text='济南租房网', fragment='', nofollow=False),
Link(url='http://nj.ganji.com/zufang', text='南京租房网', fragment='', nofollow=False),
Link(url='http://gz.ganji.com/zufang', text='广州租房网', fragment='', nofollow=False),
Link(url='http://xa.ganji.com/zufang', text='西安租房网', fragment='', nofollow=False),
Link(url='http://hf.ganji.com/zufang', text='合肥租房网', fragment='', nofollow=False),
Link(url='http://sjz.ganji.com/zufang', text='石家庄租房网', fragment='', nofollow=False),
Link(url='http://dl.ganji.com/zufang', text='大连租房网', fragment='', nofollow=False),
Link(url='http://hz.ganji.com/zufang', text='杭州租房网', fragment='', nofollow=False),
Link(url='http://kezhan.58.com/bj/qingnianlvshe/', text='北京青年旅社', fragment='', nofollow=False),
Link(url='http://bj.58.com/xiaoqu/shenggunanli/', text='胜古南里', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/wblist/haidian/zufang/m.anjuke.com/bj/loupan/haidian/', text='海淀楼盘', fragment='', nofollow=False),
Link(url='http://m.anjuke.com/bj/loupan/249388/', text='京汉铂寓', fragment='', nofollow=False),
Link(url='http://bj.zu.anjuke.com/fangyuan/haidian/', text='海淀租房', fragment='', nofollow=False),
Link(url='http://bj.58.com/pinpaigongyu/646228473643278336/', text='家乐美地', fragment='', nofollow=False),
Link(url='http://www.ganji.com/misc/abouts/index.php?act=about', text='关于Ganji', fragment='', nofollow=True),
Link(url='http://www.ganji.com/tuiguang/index/', text='赶集推广', fragment='', nofollow=True),
Link(url='http://tuiguang.ganji.com/zhaoshang/agent.htm', text=' 渠道合作 ', fragment='', nofollow=True),
Link(url='http://help.ganji.com/', text='帮助中心', fragment='', nofollow=True),
Link(url='http://help.ganji.com/html/sjbmy/', text='手机号被冒用', fragment='', nofollow=True),
Link(url='http://www.ganji.com/misc/abouts/link.php?act=link', text='友情链接', fragment='', nofollow=True),
Link(url='http://www.ganji.com/misc/abouts/index.php?act=job', text='招贤纳士', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/quxiandaohang/', text='区县导航', fragment='', nofollow=False),
Link(url='http://mobile.ganji.com/', text='手机赶集', fragment='', nofollow=True),
Link(url='http://3g.ganji.com/bj_fang1/', text='租房触屏版', fragment='', nofollow=False)]
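Every link on the page was extracted, which is an overwhelming amount.
Filtering with a regular expression leaves only the links we need. The regex can only be designed once the pattern of the target links is clear.
Some of the URLs carry key-value pairs after a question mark. Ignore the query string for now and match only the part before it. The detail-page URLs should have this form: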
r'http://bj.ganji.com/zufang/\d+x.shtml'
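Before handing a pattern to LinkExtractor, it can be checked offline with Python's re module. The sketch below assumes that LinkExtractor applies allow patterns as an unanchored search against the absolute URL, which is my understanding of its behavior. It also escapes the dots, a small tightening of the pattern above, since an unescaped dot matches any character. The sample URLs are copied from the link listing earlier in this post.

```python
import re

# The pattern from the text with the dots escaped (an adjustment, not the author's exact regex).
DETAIL_PAGE = re.compile(r'http://bj\.ganji\.com/zufang/\d+x\.shtml')

# Sample URLs copied from the extracted link list above.
candidates = [
    'http://bj.ganji.com/zufang/36525781752728x.shtml',
    'http://bj.ganji.com/zufang/36283837290114x.shtml?ding=https://short.58.com/zd_p/4d183517-ac8c-4370-a0c1-282763d4a987/?target=dc-16-xgk_hvimob_89368680324775q-feykn&end=end',
    'http://bj.ganji.com/xiaoqu/wanshuyuan/chuzuxq/',
    'http://bj.ganji.com/haidian/zufang/pn2/',
]

# search() is unanchored, so a trailing query string does not prevent a match.
matched = [url for url in candidates if DETAIL_PAGE.search(url)]
for url in matched:
    print(url)
```

Only the two detail-page URLs pass the filter; the community page and the pagination link are rejected.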
Enter:
tmp = LinkExtractor(r'http://bj.ganji.com/zufang/\d+x.shtml')
tmp.extract_links(response)  # returns a list
The output now contains far fewer links:
[Link(url='http://bj.ganji.com/zufang/36525781752728x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36514856072971x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36364461048994x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36283837290114x.shtml?ding=https://short.58.com/zd_p/4d183517-ac8c-4370-a0c1-282763d4a987/?target=dc-16-xgk_hvimob_89368680324775q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36448703569416x.shtml?ding=https://short.58.com/zd_p/57e5e09c-29bc-4db5-b392-46106dfa6069/?target=dc-16-xgk_hvimob_89556048192579q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36452227627012x.shtml?ding=https://short.58.com/zd_p/59f9ea44-183d-40e8-8bcb-e3c7f13874c5/?target=dc-16-xgk_hvimob_89513330930473q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36414250778516x.shtml?ding=https://short.58.com/zd_p/a859ca84-d600-4400-9a95-38f3d762e828/?target=dc-16-xgk_hvimob_89575314006179q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36227844908548x.shtml?ding=https://short.58.com/zd_p/9bfed205-196b-4420-9087-e2e1a3269ddd/?target=dc-16-xgk_hvimob_89330655246156q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36455527886345x.shtml?ding=https://short.58.com/zd_p/bce20dd9-651c-428f-a9b5-8c291fc7b376/?target=dc-16-xgk_hvimob_89511130669851q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36455272312598x.shtml?ding=https://short.58.com/zd_p/2df7e63a-e90f-4534-9785-a951019f26df/?target=dc-16-xgk_hvimob_89511303873126q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36455485142684x.shtml?ding=https://short.58.com/zd_p/32d7b98e-3e39-40e2-ae4b-7d7c2c76b5b4/?target=dc-16-xgk_hvimob_89511561753965q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/35843472234064x.shtml?ding=https://short.58.com/zd_p/2fb60c53-a4f9-4b09-a4e7-e9674705065a/?target=dc-16-xgk_hvimob_81658503385495q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36005967713804x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36514389880345x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36210224693127x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36515288107668x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36219286813961x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36226224648989x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36523006142221x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36518070380572x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36260836748417x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36496483775505x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36515848489356x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36453349478792x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36448512145167x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36440408016649x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36521935379723x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36364398864531x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36459015677700x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36522777615364x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36360244840342x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36430964540828x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36485009202206x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36470563064452x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36517073349672x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36505985388960x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/34794182085293x.shtml', text='\n \n ', fragment='', nofollow=True)]
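In the shell, LinkExtractor only extracts these links. Once the same extractor is written into the spider's rules, the spider crawls the extracted links directly.
Open the project in PyCharm and replace the regex in the Rule with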
http://bj.ganji.com/zufang/\d+x.shtml
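This step also requires a callback, which is set to parse_item here. parse_item automatically receives the response for each matched link. To find out which URL a response came from, use response.url.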
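CrawlSpider key points
CrawlSpider inherits from the base Spider class, so it has every method and attribute that Spider has.
What sets CrawlSpider apart from Spider is the additional rules attribute, which defines how links are extracted. It quickly finds the URLs that match a regex and conveniently passes their responses to a callback.
Notes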
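1. follow
A boolean that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False. When follow is True, the spider keeps going: it opens each extracted link and extracts the links matching the rules from that page's response as well.
Note: if neither callback nor follow is given, follow defaults to True. The extracted links are opened and links are extracted from them again according to the rules. If one of those links matches a rule that has a callback, its response is passed to that callback for extraction.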
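2. rules
A collection (list) of one or more Rule objects. Each Rule defines a particular behavior for crawling the site. Rule objects are described below. If several rules match the same link, the first one, in the order they are defined in this attribute, is used.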
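3. LinkExtractor
The class that extracts URLs from a response. Its main parameters are:
allow: URLs matching this regular expression (or list of regular expressions) are extracted. If it is empty, every URL matches.
deny: URLs matching this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.
allow_domains: the domains whose links will be extracted.
deny_domains: the domains whose links will never be extracted.
restrict_xpaths: an XPath expression that limits extraction to certain regions of the page. It works together with allow to filter links. There is also a similar restrict_css.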
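Warning
When writing CrawlSpider rules, avoid using parse as the callback. CrawlSpider uses the parse method itself to implement its logic, so if you override parse, the CrawlSpider will stop working. A related example: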
$ scrapy shell http://bj.ganji.com/fang1/
# ......
# Scrapy log omitted
>>> from scrapy.linkextractors import LinkExtractor
>>> tmp = LinkExtractor(r'')
>>> len(tmp.extract_links(response))
Out: 875
>>> get_links = LinkExtractor(r'http://bj.ganji.com/fang1/\d+x.htm')
>>> len(get_links.extract_links(response))
Out: 89
In practice, things are not this simple.
Also, if you need to convert the output to JSON, you can use the following statement: