Crawling all the jokes on Qiushibaike together with their authors (image posts are not crawled)
Result (screenshot omitted)
Thanks to Jianshu author xiyouMc for his suggestions and for his own write-up:
爬取成人网 (crawling an adult site)
xiyouMc's profile page
Scrapy documentation in Chinese
import re

import scrapy


class QiushibaikeSpider(scrapy.Spider):
    # Imports and the class scaffold are filled in here; the original post
    # showed only the snippets inside the class.
    name = 'qiushibaike'

    # Start URL
    start_urls = ["https://www.qiushibaike.com"]

    def clear_span_br(self, txt):
        # Strip the <span> wrapper and the <br> tags from each extracted
        # snippet; left in place, each <br> would split the text into a
        # separate list element.
        p = re.compile(r'((<span>)|(</span>)|(<br>))+')
        return [p.sub(' ', t) for t in txt]

    def parse(self, response):
        page_url = response.xpath('//ul[@class="pagination"]/li/a/@href').extract()
        author = response.xpath('//div[@id="content-left"]/div/div[1]/a[2]/@title').extract()
        content = response.xpath('//div[@class="content"]/span').extract()
        content = self.clear_span_br(content)
        # The last pagination link points to the next page.
        next_page = page_url[-1]
        with open('qiushibaike.txt', 'a') as f:
            while author and content:
                a = author.pop()
                c = content.pop()
                f.write('author:' + a + '\n' + 'content:' + c + '\n')
        # Follow the link to the next page and parse it the same way.
        yield scrapy.Request("https://www.qiushibaike.com" + next_page, callback=self.parse)
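With the spider saved inside a Scrapy project, it runs with scrapy crawl qiushibaike (the spider name is my own choice; the original post doesn't show one). The tag-stripping regex can also be sanity-checked on its own; the sample HTML below is made up for illustration:

import re

# The same pattern used in clear_span_br above.
p = re.compile(r'((<span>)|(</span>)|(<br>))+')

sample = '<span>first line<br>second line</span>'
print(p.sub(' ', sample))  # -> ' first line second line '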
Posts that contain images are skipped: after extracting the image part, I couldn't work out how to match it back to its author. Qiushibaike also seems to have some anti-scraping measures, so I wrote my own headers and put them in settings.py, and I set a download delay and disabled cookies to keep my IP from being banned.
# settings.py: wait one second between requests and don't send cookies.
DOWNLOAD_DELAY = 1
COOKIES_ENABLED = False

# Default headers copied from a real browser session.
DEFAULT_REQUEST_HEADERS = {
    "Host": "www.qiushibaike.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.baidu.com/link?url=SFyp4zmJYyAg7F_zDBbe8h4Yar1zhOtxtmLVNqNCofb8tC0d7jb6Y_-OnEawuX_t&wd=&eqid=b3914e3e0001a74400000003594396a1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "If-None-Match": "0586a81307a61d75056d3c997d786eca66211512",
}
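Two caveats about these headers, based on general HTTP behavior rather than on anything in the original post: the fixed If-None-Match ETag can make the server answer 304 Not Modified with an empty body once its cached copy matches, and advertising br in Accept-Encoding invites a brotli-compressed response, which Scrapy cannot decode unless a brotli library is installed. A quick standalone probe, sketched here with the third-party requests library, shows whether the stale ETag triggers a 304:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
    # The same ETag that settings.py sends; a 304 here means the spider
    # would get an empty body for the identical request.
    "If-None-Match": "0586a81307a61d75056d3c997d786eca66211512",
}
r = requests.get("https://www.qiushibaike.com", headers=headers)
print(r.status_code, len(r.content))  # 304 with 0 bytes signals a cache hit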
Well, that's it: my first time crawling a site with the Scrapy framework.