4. Scrapy Framework – Hands-On – Gushiwen (gushiwen.org) Crawler (1)
In settings.py, set ROBOTSTXT_OBEY = False and add request headers under DEFAULT_REQUEST_HEADERS.
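The two settings above might look like this in settings.py. This is a minimal sketch: the exact User-Agent string is only an example, not the one used in the original article.

```python
# settings.py (excerpt) -- a minimal sketch; the User-Agent value is an example
ROBOTSTXT_OBEY = False  # don't let robots.txt rules abort the crawl

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # Identify as a regular browser so the site serves normal pages
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}
```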
In gsww_spider, set the initial URL: start_urls = ['https://gushiwen.org/default_1.aspx']
Example code for gsww_spider:
import scrapy
from lxml import etree                              # not used directly; kept for comparing parser types
from scrapy.http.response.html import HtmlResponse  # type of the response passed to parse()
from scrapy.selector.unified import Selector        # type of each item yielded by response.xpath()


class GswwSpiderSpider(scrapy.Spider):
    name = 'gsww_spider'
    allowed_domains = ['gushiwen.org']
    start_urls = ['https://gushiwen.org/default_1.aspx']

    def myprint(self, value):
        # Small debugging helper: print a value between separator lines
        print("=" * 30)
        print(value)
        print("=" * 30)

    def parse(self, response):
        # self.myprint(type(response))  # -> HtmlResponse
        gsw_divs = response.xpath("//div[@class='left']/div[@class='sons']")
        print(type(gsw_divs))  # -> SelectorList
        for gsw_div in gsw_divs:
            self.myprint(type(gsw_div))  # each item is a Selector
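Each gsw_div in the loop is itself a Selector, so further .xpath() calls on it are evaluated relative to that div (the path should start with a dot, e.g. ".//b/text()"). As a rough stand-alone sketch of that per-div sub-selection idea, using only the standard library and invented sample markup (not gushiwen.org's real HTML):

```python
# Stand-alone sketch of sub-selection within each matched div.
# The sample markup below is invented for illustration only.
import xml.etree.ElementTree as ET

SAMPLE = """
<div class="left">
  <div class="sons"><b>Poem A</b></div>
  <div class="sons"><b>Poem B</b></div>
</div>
"""

root = ET.fromstring(SAMPLE)
# Analogous to response.xpath("//div[@class='left']/div[@class='sons']")
sons = root.findall("./div[@class='sons']")
# Analogous to gsw_div.xpath(".//b/text()") inside the loop:
# the leading "." scopes the query to the current div only
titles = [div.find("./b").text for div in sons]
print(titles)  # -> ['Poem A', 'Poem B']
```

Without the leading dot, an XPath starting with // would search the whole document again instead of just the current div, which is a common source of duplicated results.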
Previous article: Chapter 6, The Scrapy Framework (3), 2020-03-05:
https://www.jianshu.com/p/5c752e9f3f61
Next article: Chapter 6, The Scrapy Framework (5), 2020-03-07:
https://www.jianshu.com/p/cd1f301999c5
The material above was collected from the internet and is shared for learning purposes only; if it infringes your rights, please message me privately to have it removed. Thank you.