Crawling Mainstream Search Engines by Keyword

This article was first published on my blog: http://gongyanli.com
Source code: https://github.com/Gladysgong/seCrawler
Jianshu: https://www.jianshu.com/p/4e244563849a
CSDN: https://blog.csdn.net/u012052168/article/details/79762586

seCrawler (Search Engine Crawler)

A Scrapy project that crawls search results from Google, Bing, and Baidu.

reference

Adapted from https://github.com/xtt129/seCrawler and rewritten to also collect the title and abstract of each result.

prerequisite

Python 3.5 and Scrapy are required.

commands

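Run a single command to fetch 50 pages of results from a search engine for a given keyword. The keyword argument is the search term, se selects the engine (bing, baidu, or google), and pages is the number of result pages to crawl. The results are saved to "urls.txt" in the current directory.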

Bing

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=bing -a pages=50

Baidu

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=baidu -a pages=50

Google

scrapy crawl keywordSpider -a keyword=Spider-Man -a se=google -a pages=50
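
The three commands differ only in the se argument, so it may help to see how keyword, se, and pages could be turned into a list of result-page URLs. The sketch below is not taken from the project: the URL templates, the page size of 10, and the names SEARCH_URLS, RESULTS_PER_PAGE, and build_start_urls are assumptions made for illustration. Scrapy passes every -a value to the spider as a string, which is why pages is converted with int().

```python
from urllib.parse import quote_plus

# Assumed URL templates; the project's own templates may differ.
SEARCH_URLS = {
    "bing": "https://www.bing.com/search?q={query}&first={offset}",
    "baidu": "https://www.baidu.com/s?wd={query}&pn={offset}",
    "google": "https://www.google.com/search?q={query}&start={offset}",
}
RESULTS_PER_PAGE = 10  # assumed page size for all three engines


def build_start_urls(keyword, se, pages):
    """Return one search URL per result page for the chosen engine."""
    engine = se.lower()
    if engine not in SEARCH_URLS:
        raise ValueError("unsupported search engine: " + se)
    template = SEARCH_URLS[engine]
    query = quote_plus(keyword)  # URL-encode the keyword
    urls = []
    for page in range(int(pages)):  # -a values arrive as strings
        offset = page * RESULTS_PER_PAGE
        if engine == "bing":
            offset += 1  # Bing's "first" parameter is 1-based
        urls.append(template.format(query=query, offset=offset))
    return urls
```

A spider could assign the returned list to its start_urls attribute in __init__, so that each -a argument on the command line shapes the crawl.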

results

The URL, title, and abstract of each result are stored in urls.txt.
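
One way to write such a file from Scrapy is an item pipeline. The sketch below is a guess at the shape, not the project's code: the class name UrlsFilePipeline, the item keys url, title, and abstract, and the tab-separated one-record-per-line layout are all assumptions. Only the method names open_spider, close_spider, and process_item follow Scrapy's item pipeline interface.

```python
class UrlsFilePipeline:
    """Hypothetical item pipeline that writes each result to a text file."""

    def __init__(self, path="urls.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        fields = (
            item.get("url", ""),
            item.get("title", ""),
            item.get("abstract", ""),
        )
        # Collapse tabs and newlines so each record stays on one line
        cleaned = [" ".join(str(field).split()) for field in fields]
        self.file.write("\t".join(cleaned) + "\n")
        return item
```

A pipeline like this takes effect only after it is registered under ITEM_PIPELINES in settings.py.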

limitation

The project does not provide any workaround for anti-spider measures such as CAPTCHAs or IP bans.

To reduce the chance of triggering these measures, we recommend setting DOWNLOAD_DELAY=10 in settings.py, which adds a delay (in seconds) between two consecutive page requests. See the Scrapy settings documentation for details.
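
As a sketch, the relevant part of settings.py could look like the fragment below. DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings. The second line only restates Scrapy's default and is included as an assumption about what is useful here, not something the project sets.

```python
# settings.py (fragment)

# Wait about 10 seconds between consecutive requests to the same site
DOWNLOAD_DELAY = 10

# Scrapy's default: the actual wait is a random value between
# 0.5 and 1.5 times DOWNLOAD_DELAY, which makes the crawl less regular
RANDOMIZE_DOWNLOAD_DELAY = True
```

The same value can be set for a single run, without editing the file, by adding -s DOWNLOAD_DELAY=10 to the scrapy crawl command.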
