9. Scrapy Framework - Hands-On - Lieyunwang Crawler (2)
Open cmd, change into the lyw folder, and run scrapy shell https://www.lieyunwang.com/archives/454266
Press Enter, then continue by typing the test code below:
response.xpath("//h1[@class='lyw-article-title']//text()").getall()
title_list = response.xpath("//h1[@class='lyw-article-title']//text()").getall()
"".join(title_list).strip()
response.xpath("//h1[@class='lyw-article-title']/span/text()").get()
response.xpath("//a[contains(@class,'author-name')]/text()").get()
response.xpath("//div[@class='main-text']").get()
response.url
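The shell session above queries the title with `//text()`, while the spider code later in this post uses `/text()`. The difference matters when the `<h1>` also wraps the publish time in a `<span>`. The sketch below illustrates this on a hypothetical HTML fragment (the markup and its values are assumptions, not copied from the real page). To stay runnable without Scrapy it uses the standard library's `xml.etree.ElementTree` rather than Scrapy selectors, so `itertext()` only stands in for `//text()` and the "direct text" list stands in for `/text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modeled on the XPath expressions used in the shell session.
html = """<div>
  <h1 class="lyw-article-title">
    Sample headline
    <span>2020-03-11 10:00</span>
  </h1>
</div>"""

root = ET.fromstring(html)
h1 = root.find(".//h1[@class='lyw-article-title']")

# Rough equivalent of //text(): every text node below <h1>, including the <span>.
all_text = list(h1.itertext())
title_with_time = "".join(all_text).strip()

# Rough equivalent of /text(): only text nodes that are direct children of <h1>.
direct_text = [h1.text or ""] + [child.tail or "" for child in h1]
title = "".join(direct_text).strip()

# Rough equivalent of /span/text(): the publish time on its own.
pub_time = h1.find("span").text

print(repr(title_with_time))
print(repr(title))
print(repr(pub_time))
```

With markup like this, joining the direct text nodes and calling `strip()` yields only the headline, whereas the descendant query also pulls in the time stamp.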
Code for items.py:
import scrapy


class LywItem(scrapy.Item):
    title = scrapy.Field()
    pub_time = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
    origin = scrapy.Field()
Continuing the example above, sample code for lyw_spider.py:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import LywItem


class LywSpiderSpider(CrawlSpider):
    name = 'lyw_spider'
    allowed_domains = ['lieyunwang.com']
    start_urls = ['http://lieyunwang.com/latest/p1.html']

    rules = (
        # List pages: follow pagination links, no callback needed.
        Rule(LinkExtractor(allow=r'/latest/p\d+\.html'), follow=True),
        # Article pages: parse them, but do not follow links found on them.
        Rule(LinkExtractor(allow=r'/archives/\d+'), callback="parse_detail", follow=False),
    )

    # The method name must match the callback string used in the rule above.
    def parse_detail(self, response):
        pub_time = response.xpath("//h1[@class='lyw-article-title']/span/text()").get()
        # /text() selects only the direct text nodes of <h1>, so the <span> is excluded.
        title_list = response.xpath("//h1[@class='lyw-article-title']/text()").getall()
        title = "".join(title_list).strip()
        author = response.xpath("//a[contains(@class,'author-name')]/text()").get()
        content = response.xpath("//div[@class='main-text']").get()
        origin = response.url
        item = LywItem(title=title, pub_time=pub_time, content=content, author=author, origin=origin)
        return item
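To see which links each rule would pick up, the two `allow` patterns can be tried on their own with the standard `re` module. This is only a sketch: `classify` is a hypothetical helper, the third URL is made up, and the dot before `html` is escaped here so that it matches a literal `.`. It mirrors two documented behaviors: `LinkExtractor` matches `allow` patterns against absolute URLs, and `CrawlSpider` uses the first rule that matches a link.

```python
import re

# Patterns taken from the spider's rules (with the dot before "html" escaped).
LIST_PAGE = re.compile(r'/latest/p\d+\.html')
ARTICLE = re.compile(r'/archives/\d+')


def classify(url):
    """Hypothetical helper: report which rule a URL would fall under, in rule order."""
    if LIST_PAGE.search(url):
        return "follow"        # pagination page, followed without a callback
    if ARTICLE.search(url):
        return "parse_detail"  # article page, handed to the callback
    return "ignored"           # matched by neither rule


urls = [
    "https://www.lieyunwang.com/latest/p2.html",
    "https://www.lieyunwang.com/archives/454266",
    "https://www.lieyunwang.com/about",  # made-up URL that matches neither pattern
]
for url in urls:
    print(url, "->", classify(url))
```

Checking patterns this way before a crawl is a cheap way to catch a rule that silently matches nothing.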
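Previous article: Chapter 6, Scrapy Framework (8), 2020-03-10. Link:
https://www.jianshu.com/p/08be0e880cff
Next article: Chapter 6, Scrapy Framework (10), 2020-03-12. Link:
https://www.jianshu.com/p/b4bc58e806f9
The material above was collected from the internet and is for study and discussion only. If it infringes your rights, send me a private message and I will remove it. Thank you.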