Data Acquisition: Web Crawling in Practice

Introductory articles on web crawling

https://zhuanlan.zhihu.com/p/24669128
https://zhuanlan.zhihu.com/p/24769534
https://zhuanlan.zhihu.com/p/25200262
https://zhuanlan.zhihu.com/p/26257790

User-Agent and dynamic IP (proxy) settings

http://lawtech0902.com/2017/06/11/scrapy-useragent-proxyip/
https://zhuanlan.zhihu.com/p/29733174
https://github.com/hellysmile/fake-useragent
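
The idea behind the links above can be sketched as a Scrapy-style downloader middleware that picks a random User-Agent and proxy for each request. This is only a minimal sketch: the class name, the USER_AGENTS list and the PROXIES list are hypothetical placeholders, and the proxy addresses are not real servers.

```python
import random

# Hypothetical pools; replace with real values. The User-Agent strings could
# also be generated with the fake-useragent package linked above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/66.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
]
PROXIES = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]  # placeholder addresses


class RandomUserAgentProxyMiddleware(object):
    """Rotate the User-Agent header and the proxy on every outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # returning None lets Scrapy continue handling the request
```

To take effect, a middleware like this has to be registered under DOWNLOADER_MIDDLEWARES in the project's settings.py, using the module path where the class actually lives.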

Download delay and disabling cookies

https://blkstone.github.io/2016/03/02/crawler-anti-anti-cheat/
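
In Scrapy, both measures are normally switched on in the project's settings.py. The fragment below is a sketch with illustrative values; the 2-second delay is an assumption, not a recommendation from the linked article.

```python
# settings.py (fragment); the values are illustrative
DOWNLOAD_DELAY = 2               # base wait, in seconds, between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # actual wait is a random value between 0.5x and 1.5x the delay
COOKIES_ENABLED = False          # do not store or send cookies
```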

Handling Ajax with PhantomJS and Selenium

https://my.oschina.net/lewisgong/blog/872257
https://chaycao.github.io/2016/08/19/Scrapy-Selenium-Phantomjs/

Page parsing: Beautiful Soup, XPath, CSS selectors

https://cuiqingcai.com/1319.html

Python

Installing lxml

https://pypi.org/project/lxml/#files
pip install lxml-4.2.1-cp27-cp27m-win_amd64.whl
https://blog.csdn.net/g1apassz/article/details/46574963
https://blog.csdn.net/acingdreamer/article/details/53348649

Upgrading pip

pip install --upgrade pip

Creating and using requirements.txt

https://blog.csdn.net/orangleliu/article/details/60958525

Python path and imports

https://blog.csdn.net/tony_wong/article/details/18044273
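
One common way to make a module importable is to add its directory to sys.path before importing it. The helper below is a minimal sketch; the function name add_to_sys_path is hypothetical, and using the current working directory as the project root is an assumption.

```python
import os
import sys


def add_to_sys_path(directory):
    """Prepend the absolute form of `directory` to sys.path unless it is already there."""
    path = os.path.abspath(directory)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Example: make modules in the current working directory importable.
project_root = add_to_sys_path(os.getcwd())
```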

Scrapy installation error: Microsoft Visual C++ 14.0 is required...

https://blog.csdn.net/nima1994/article/details/74931621?locationNum=10&fps=1

Scrapy shell

https://blog.csdn.net/laoyang360/article/details/52809927

Scrapy runtime error: ImportError: No module named win32api

https://blog.csdn.net/u013687632/article/details/57075514

XPath

https://blog.csdn.net/manongpengzai/article/details/77109600
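
A small XPath query can be tried without installing anything by using the standard library's xml.etree.ElementTree. Note the swap: ElementTree only supports a subset of XPath and needs well-formed markup, so real pages usually call for lxml or Scrapy's response.xpath(...). The PAGE snippet below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <div class="post"><a href="/p/1">First</a></div>
    <div class="post"><a href="/p/2">Second</a></div>
    <div class="ad"><a href="/ad">Sponsored</a></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())
# .//div[@class='post']/a selects every <a> that is a direct child of a
# <div class="post">, wherever that div sits below the root.
links = [(a.text, a.get("href")) for a in root.findall(".//div[@class='post']/a")]
print(links)  # [('First', '/p/1'), ('Second', '/p/2')]
```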

Python logging

https://blog.csdn.net/chosen0ne/article/details/7319306
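
A basic setup with the standard logging module might look like the sketch below. The helper name build_logger, the logger name "crawler" and the format string are hypothetical choices for illustration.

```python
import logging


def build_logger(name, logfile=None, level=logging.INFO):
    """Return a logger that writes to `logfile`, or to stderr when no file is given."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(logfile) if logfile else logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = build_logger("crawler")
log.info("spider started")             # emitted
log.debug("not shown at INFO level")   # filtered out by the logger level
```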

Scrapy link extractor

https://www.jianshu.com/p/ff9125650697

Starting the crawler

From the project's root directory, run the following command to start the spider:
scrapy crawl dmoz
