Python 3 Scraping Notes 14: Scrapy Commands

Command format: scrapy <command> [options] [args]

command        purpose                                                      scope
crawl          start crawling with a given spider                           project-only
check          run contract checks on spider code                           project-only
list           list all available spiders in the project, one per line      project-only
edit           edit a spider from the command line                          project-only
parse          fetch a URL and parse it with the specified spider method    project-only
bench          run a quick benchmark of crawl speed                         global
fetch          fetch a URL using the Scrapy downloader                      global
genspider      generate a new spider from a pre-defined template            global
runspider      run a self-contained spider (without creating a project)    global
settings       get Scrapy settings values                                   global
shell          open an interactive shell for a given URL                    global
startproject   create a new project                                         global
version        print the Scrapy version                                     global
view           open a URL in the browser, as Scrapy actually sees it        global

1. Creating a project: startproject

scrapy startproject myproject [project_dir]
This creates a new Scrapy project named myproject under project_dir. If project_dir is omitted, the project directory is named after the project, myproject.

C:\Users\m1812>scrapy startproject mytestproject
New Scrapy project 'mytestproject', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\m1812\mytestproject

You can start your first spider with:
    cd mytestproject
    scrapy genspider example example.com
C:\Users\m1812>cd mytestproject

C:\Users\m1812\mytestproject>tree
Folder PATH listing
Volume serial number is 5680-D4D0
C:.
└─mytestproject
    ├─spiders
    │  └─__pycache__
    └─__pycache__
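You can also pass an explicit target directory as the second argument (the path below is purely illustrative):

scrapy startproject myproject D:\scrapy_projects\myproject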

2. Generating a spider: genspider

From the project directory created above:
scrapy genspider mydomain mydomain.com
scrapy genspider mydomain mydomain.com

C:\Users\m1812\mytestproject>scrapy genspider baidu www.baidu.com
Created spider 'baidu' using template 'basic' in module:
  mytestproject.spiders.baidu



Let's look at genspider's detailed usage:

C:\Users\m1812\mytestproject>scrapy genspider -h
Usage
=====
  scrapy genspider [options] <name> <domain>

Generate new spider using pre-defined templates

Options
=======
--help, -h              show this help message and exit
--list, -l              List available templates
--edit, -e              Edit spider after creating it
--dump=TEMPLATE, -d TEMPLATE
                        Dump template to standard output
--template=TEMPLATE, -t TEMPLATE
                        Uses a custom template.
--force                 If the spider already exists, overwrite it with the
                        template

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Using a template: -t TEMPLATE
The available template types:

C:\Users\m1812\mytestproject>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

Testing template usage:

C:\Users\m1812\mytestproject>scrapy genspider -t crawl zhihu www.zhihu.com
Created spider 'zhihu' using template 'crawl' in module:
  mytestproject.spiders.zhihu
Compare the generated zhihu spider with the baidu one from earlier; see the sketch below.
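For reference, this is roughly what the two templates generate in Scrapy 1.3 (the boilerplate varies slightly across versions): basic produces a bare Spider, while crawl produces a CrawlSpider with link-extraction rules.

# spiders/baidu.py, from the 'basic' template
import scrapy

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["www.baidu.com"]
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass

# spiders/zhihu.py, from the 'crawl' template
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ZhihuSpider(CrawlSpider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # fill in extraction logic for each matched page
        return {}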

3. Running a spider: crawl

scrapy crawl <spider>

C:\Users\m1812\mytestproject>scrapy crawl zhihu
2019-04-06 15:14:18 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: mytestproject)
2019-04-06 15:14:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mytestproject.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'mytestproject', 'SPIDER_MODULES': ['mytestproject.spiders']}
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 15:14:18 [scrapy.core.engine] INFO: Spider opened
2019-04-06 15:14:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 15:14:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 15:14:23 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://www.zhihu.com/robots.txt> (referer: None)
2019-04-06 15:14:28 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://www.zhihu.com/> (referer: None)
2019-04-06 15:14:28 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://www.zhihu.com/>: HTTP status code is not handled or not allowed
2019-04-06 15:14:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 15:14:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 527,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 813,
 'downloader/response_count': 2,
 'downloader/response_status_count/400': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 7, 14, 28, 947408),
 'log_count/DEBUG': 3,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 7, 14, 18, 593508)}
2019-04-06 15:14:28 [scrapy.core.engine] INFO: Spider closed (finished)

Zhihu rejects requests that lack certain request headers, which is why the crawl fails here with HTTP 400.
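One workaround worth trying is overriding the User-Agent setting from the command line via the global -s option (the UA string here is an arbitrary browser-like example; some sites additionally require cookies or other headers):

scrapy crawl zhihu -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64)"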

4. Checking spider code: check

scrapy check [-l] <spider>

C:\Users\m1812\mytestproject>scrapy check

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK

If we deliberately break the code, here by deleting one of the quotes around the URL, running the command again catches the error:

C:\Users\m1812\mytestproject>scrapy check
Traceback (most recent call last):
  File "C:\Users\m1812\Anaconda3\Scripts\scrapy-script.py", line 5, in <module>
    sys.exit(scrapy.cmdline.execute())
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 141, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 238, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 129, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 325, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 45, in from_settings
    return cls(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 23, in __init__
    self._load_all_spiders()
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 32, in _load_all_spiders
    for module in walk_modules(name):
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "C:\Users\m1812\Anaconda3\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 661, in exec_module
  File "<frozen importlib._bootstrap_external>", line 767, in get_code
  File "<frozen importlib._bootstrap_external>", line 727, in source_to_code
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "C:\Users\m1812\mytestproject\mytestproject\spiders\zhihu.py", line 10
    start_urls = [http://www.zhihu.com/']

In practice this command sees limited use.
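Note that what check actually runs are contracts embedded in callback docstrings, which is why the first run above reports "Ran 0 contracts": none are defined. A minimal sketch of a contract-annotated callback (the URL and field names borrow from the quotes.toscrape.com example used later):

def parse(self, response):
    """Contracts evaluated by `scrapy check`:

    @url http://quotes.toscrape.com/
    @returns items 1 10
    @returns requests 0 5
    @scrapes text author tags
    """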

5. Listing available spiders: list

scrapy list

C:\Users\m1812\mytestproject>scrapy list
baidu
zhihu

6. Editing a spider: edit

scrapy edit <spider>
This doesn't seem to work on Windows, and you rarely need it anyway; editing the spider in an IDE such as PyCharm works fine.

7. Fetching a URL: fetch

This is a global command: scrapy fetch [options] <url>
Detailed usage:

C:\Users\m1812\mytestproject>scrapy fetch -h
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider
--headers               print response HTTP headers instead of body
--no-redirect           do not handle HTTP 3xx status codes and print response
                        as-is

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Let's fetch Baidu's homepage; note that the URL must include the http:// scheme.

C:\Users\m1812>scrapy fetch http://www.baidu.com
2019-04-06 15:44:51 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 15:44:51 [scrapy.utils.log] INFO: Overridden settings: {}
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Spider opened
2019-04-06 15:44:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 15:44:51 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 15:44:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 15:44:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 7, 44, 51, 989960),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 7, 44, 51, 759268)}
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Spider closed (finished)
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>鐧惧害涓€涓嬶紝浣犲氨鐭ラ亾</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=鐧惧害涓€涓?class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>鏂伴椈</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>鍦板浘</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>瑙嗛</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>璐村惂</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>鐧诲綍</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">鐧诲綍</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">鏇村浜у搧</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>鍏充簬鐧惧害</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>浣跨敤鐧惧害鍓嶅繀璇?/a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>鎰忚鍙嶉</a>&nbsp;浜琁CP璇?30173鍙?nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

(The garbled characters in the HTML are an artifact of the Windows console's GBK code page rendering UTF-8 bytes; the downloaded page itself is intact.)

Let's try it with logging disabled:

C:\Users\m1812>scrapy fetch --nolog http://www.baidu.com
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>鐧惧害涓€涓嬶紝浣犲氨鐭ラ亾</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=鐧惧害涓€涓?class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>鏂伴椈</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>鍦板浘</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>瑙嗛</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>璐村惂</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>鐧诲綍</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">鐧诲綍</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">鏇村浜у搧</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>鍏充簬鐧惧害</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>浣跨敤鐧惧害鍓嶅繀璇?/a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>鎰忚鍙嶉</a>&nbsp;浜琁CP璇?30173鍙?nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>


Fetching only the headers:

C:\Users\m1812>scrapy fetch --nolog --headers http://www.baidu.com
> User-Agent: Scrapy/1.3.3 (+http://scrapy.org)
> Accept-Language: en
> Accept-Encoding: gzip,deflate
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
< Date: Sat, 06 Apr 2019 07:48:42 GMT
< Server: bfe/1.0.8.18
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:28:12 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/

fetch has several other options as well, e.g. --no-redirect, which leaves HTTP 3xx responses unhandled and prints them as-is.
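Since fetch writes the body to stdout, it combines nicely with shell redirection to save a page for offline inspection:

scrapy fetch --nolog http://www.baidu.com > baidu.html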

8. Opening a URL in the browser as Scrapy sees it: view

This is a global command: scrapy view [options] <url>
It opens the URL in your browser, showing the page as Scrapy actually sees it. A spider sometimes sees a different page than a regular browser session does, so this lets you check whether what the spider sees matches what you expect.

C:\Users\m1812>scrapy view http://www.baidu.com
2019-04-06 16:01:45 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:01:45 [scrapy.utils.log] INFO: Overridden settings: {}
2019-04-06 16:01:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:01:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:01:46 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:01:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 16:01:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 1, 46, 435330),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 8, 1, 46, 78537)}
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Spider closed (finished)

Testing with Taobao, much of the page fails to render, which shows that Taobao loads its content via Ajax; an ordinary request for the page HTML won't capture that data.
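The programmatic counterpart inside a spider is the open_in_browser helper, which dumps a response to a temporary file and opens it, handy when debugging a callback; a minimal sketch:

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # pop this exact response open in the browser, as Scrapy saw it
    open_in_browser(response)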


9. Exploring a URL in an interactive shell: shell

This is a global command: scrapy shell [options] <url>

C:\Users\m1812>scrapy shell http://www.baidu.com
2019-04-06 16:11:41 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:11:41 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2019-04-06 16:11:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:11:42 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:11:42 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:11:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 16:11:42 [traitlets] DEBUG: Using default logger
2019-04-06 16:11:42 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000025F58BBA320>
[s]   item       {}
[s]   request    <GET http://www.baidu.com>
[s]   response   <200 http://www.baidu.com>
[s]   settings   <scrapy.settings.Settings object at 0x0000025F593EE6D8>
[s]   spider     <DefaultSpider 'default' at 0x25f5afd1470>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: scrapy
Out[1]: <module 'scrapy' from 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\__init__.py'>

In [2]: request
Out[2]: <GET http://www.baidu.com>

In [3]: response
Out[3]: <200 http://www.baidu.com>

In [4]: view(response)
Out[4]: True

In [5]: response.text
Out[5]: '<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

In [6]: response.headers
Out[6]:
{b'Cache-Control': b'private, no-cache, no-store, proxy-revalidate, no-transform',
 b'Content-Type': b'text/html',
 b'Date': b'Sat, 06 Apr 2019 08:11:42 GMT',
 b'Last-Modified': b'Mon, 23 Jan 2017 13:28:12 GMT',
 b'Pragma': b'no-cache',
 b'Server': b'bfe/1.0.8.18',
 b'Set-Cookie': b'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/'}

In [7]: response.css('title::text').extract_first()
Out[7]: '百度一下,你就知道'

In [8]: exit()
The result of In [4]: the response opens in the browser.
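The shell is also the quickest place to iterate on selectors, or to load a different page without restarting; an illustrative (not captured) continuation of such a session:

In [1]: response.xpath('//title/text()').extract_first()
Out[1]: '百度一下,你就知道'

In [2]: fetch('http://quotes.toscrape.com')

In [3]: response.css('div.quote span.text::text').extract_first()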

10. Parsing a URL with a specified spider method: parse

scrapy parse <url> [options]
Here we test with the spider from the previous note, which crawls quotes.toscrape.com.

C:\Users\m1812>cd quotetutorial

C:\Users\m1812\quotetutorial>scrapy parse http://quotes.toscrape.com -c parse
2019-04-06 16:24:23 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-06 16:24:23 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'BOT_NAME': 'quotetutorial'}
2019-04-06 16:24:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled item pipelines:
['quotetutorial.pipelines.QuotetutorialPipeline',
 'quotetutorial.pipelines.MongoPipeline']
2019-04-06 16:24:24 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:24:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:24:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:24:24 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-06 16:24:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-06 16:24:25 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 16:24:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 24, 25, 485334),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 8, 24, 24, 258282)}
2019-04-06 16:24:25 [scrapy.core.engine] INFO: Spider closed (finished)

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'author': 'Albert Einstein',
  'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
  'text': '“The world as we have created it is a process of our thinking. It '
          'cannot be changed without changing our thinking.”'},
 {'author': 'J.K. Rowling',
  'tags': ['abilities', 'choices'],
  'text': '“It is our choices, Harry, that show what we truly are, far more '
          'than our abilities.”'},
 {'author': 'Albert Einstein',
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
  'text': '“There are only two ways to live your life. One is as though '
          'nothing is a miracle. The other is as though everything is a '
          'miracle.”'},
 {'author': 'Jane Austen',
  'tags': ['aliteracy', 'books', 'classic', 'humor'],
  'text': '“The person, be it gentleman or lady, who has not pleasure in a '
          'good novel, must be intolerably stupid.”'},
 {'author': 'Marilyn Monroe',
  'tags': ['be-yourself', 'inspirational'],
  'text': "“Imperfection is beauty, madness is genius and it's better to be "
          'absolutely ridiculous than absolutely boring.”'},
 {'author': 'Albert Einstein',
  'tags': ['adulthood', 'success', 'value'],
  'text': '“Try not to become a man of success. Rather become a man of '
          'value.”'},
 {'author': 'André Gide',
  'tags': ['life', 'love'],
  'text': '“It is better to be hated for what you are than to be loved for '
          'what you are not.”'},
 {'author': 'Thomas A. Edison',
  'tags': ['edison', 'failure', 'inspirational', 'paraphrased'],
  'text': "“I have not failed. I've just found 10,000 ways that won't work.”"},
 {'author': 'Eleanor Roosevelt',
  'tags': ['misattributed-eleanor-roosevelt'],
  'text': '“A woman is like a tea bag; you never know how strong it is until '
          "it's in hot water.”"},
 {'author': 'Steve Martin',
  'tags': ['humor', 'obvious', 'simile'],
  'text': '“A day without sunshine is like, you know, night.”'}]

# Requests  -----------------------------------------------------------------
[<GET http://quotes.toscrape.com/page/2/>]

The command prints the Scraped Items and the follow-up Requests.
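parse accepts further options to steer the run, e.g. choosing the spider and the crawl depth explicitly (assuming the spider from the previous note is named quotes):

scrapy parse --spider=quotes -c parse -d 2 http://quotes.toscrape.com/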

11. Getting Scrapy settings: settings

scrapy settings [options]

C:\Users\m1812\quotetutorial>scrapy settings -h
Usage
=====
  scrapy settings [options]

Get settings values

Options
=======
--help, -h              show this help message and exit
--get=SETTING           print raw setting value
--getbool=SETTING       print setting value, interpreted as a boolean
--getint=SETTING        print setting value, interpreted as an integer
--getfloat=SETTING      print setting value, interpreted as a float
--getlist=SETTING       print setting value, interpreted as a list

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Test (MONGO_URI is a custom setting defined in this project's settings.py):

C:\Users\m1812\quotetutorial>scrapy settings --get MONGO_URI
localhost
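The typed getters work the same way; with this project's settings, for instance:

scrapy settings --get BOT_NAME
scrapy settings --getbool ROBOTSTXT_OBEY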

12. Running a spider file: runspider

Unlike crawl, runspider takes the spider's file name (xxx.py) directly, so you need to change into the directory containing the file.
scrapy runspider <spider_file.py>

C:\Users\m1812\quotetutorial>cd quotetutorial

C:\Users\m1812\quotetutorial\quotetutorial>dir
 Volume in drive C has no label.
 Volume Serial Number is 5680-D4D0

 Directory of C:\Users\m1812\quotetutorial\quotetutorial

2019/04/05  22:44    <DIR>          .
2019/04/05  22:44    <DIR>          ..
2019/04/05  20:04               364 items.py
2019/04/05  19:16             1,887 middlewares.py
2019/04/05  22:35             1,431 pipelines.py
2019/04/05  22:44             3,292 settings.py
2019/04/05  22:02    <DIR>          spiders
2017/03/10  23:31                 0 __init__.py
2019/04/06  14:33    <DIR>          __pycache__
               5 File(s)          6,974 bytes
               4 Dir(s)  28,533,673,984 bytes free

C:\Users\m1812\quotetutorial\quotetutorial>cd spiders

C:\Users\m1812\quotetutorial\quotetutorial\spiders>dir
 Volume in drive C has no label.
 Volume Serial Number is 5680-D4D0

 Directory of C:\Users\m1812\quotetutorial\quotetutorial\spiders

2019/04/05  22:02    <DIR>          .
2019/04/05  22:02    <DIR>          ..
2019/04/05  22:02               914 quotes.py
2017/03/10  23:31               161 __init__.py
2019/04/05  22:02    <DIR>          __pycache__
               2 File(s)          1,075 bytes
               3 Dir(s)  28,533,673,984 bytes free
C:\Users\m1812\quotetutorial\quotetutorial\spiders>scrapy runspider quotes.py

The output is the same as with crawl.
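This makes runspider handy for one-off scripts. A minimal self-contained spider (a hypothetical standalone_quotes.py, following the quotes.toscrape.com example) needs no project at all:

# standalone_quotes.py -- run with: scrapy runspider standalone_quotes.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'standalone_quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield one dict per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }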

13. Showing the version: version

Prints Scrapy's version; with -v it also prints the versions of related dependencies:

C:\Users\m1812\quotetutorial>scrapy version -v
Scrapy    : 1.3.3
lxml      : 3.6.4.0
libxml2   : 2.9.4
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.17.0
Twisted   : 17.5.0
Python    : 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2j  26 Sep 2016)
Platform  : Windows-10-10.0.17134-SP0

14. Benchmarking crawl speed: bench

C:\Users\m1812>scrapy bench
2019-04-06 16:43:34 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:43:34 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:43:37 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:43:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:38 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:39 [scrapy.extensions.logstats] INFO: Crawled 109 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:40 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:41 [scrapy.extensions.logstats] INFO: Crawled 205 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 245 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:43 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:44 [scrapy.extensions.logstats] INFO: Crawled 317 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:45 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:46 [scrapy.extensions.logstats] INFO: Crawled 389 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:47 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2019-04-06 16:43:47 [scrapy.extensions.logstats] INFO: Crawled 429 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:48 [scrapy.extensions.logstats] INFO: Crawled 445 pages (at 960 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 182101,
 'downloader/request_count': 445,
 'downloader/request_method_count/GET': 445,
 'downloader/response_bytes': 1209563,
 'downloader/response_count': 445,
 'downloader/response_status_count/200': 445,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 43, 48, 395684),
 'log_count/INFO': 18,
 'request_depth_max': 16,
 'response_received_count': 445,
 'scheduler/dequeued': 445,
 'scheduler/dequeued/memory': 445,
 'scheduler/enqueued': 8901,
 'scheduler/enqueued/memory': 8901,
 'start_time': datetime.datetime(2019, 4, 6, 8, 43, 37, 309871)}
2019-04-06 16:43:48 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

Roughly 2,000+ pages per minute on this machine. Note that bench crawls a simple, locally generated site at maximum speed, so this is an upper bound on raw throughput rather than a realistic crawl rate.
