Sending POST requests: sometimes we need to send a POST request when fetching data. In Scrapy this is done with FormRequest, a subclass of Request. If you want the spider to send a POST request right at the start, override the start_requests(self) method in the spider class instead of relying on the URLs in start_urls.
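As a sketch of what FormRequest does under the hood: it URL-encodes its formdata dict into an application/x-www-form-urlencoded POST body, the same way the standard library's urlencode does (the credentials here are placeholders, not the ones used later):

```python
from urllib.parse import urlencode

# Hypothetical form data; FormRequest encodes its formdata dict
# into an application/x-www-form-urlencoded POST body like this.
formdata = {"email": "user@example.com", "password": "secret"}
body = urlencode(formdata)
print(body)  # email=user%40example.com&password=secret
```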
1. Create the project
D:\学习笔记\Python学习\Python_Crawler>scrapy startproject renrenLogin
New Scrapy project 'renrenLogin', using template directory 'c:\python38\lib\site-packages\scrapy\templates\project', created in:
D:\学习笔记\Python学习\Python_Crawler\renrenLogin
You can start your first spider with:
cd renrenLogin
scrapy genspider example example.com
2. Create the spider
D:\学习笔记\Python学习\Python_Crawler>cd renrenLogin
D:\学习笔记\Python学习\Python_Crawler\renrenLogin>scrapy genspider renren "renren.com"
Created spider 'renren' using template 'basic' in module:
renrenLogin.spiders.renren
3. Implementation
A) settings.py configuration:
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36',
}
B) start.py:
from scrapy import cmdline
cmdline.execute("scrapy crawl renren".split())
C) renren.py:
# -*- coding: utf-8 -*-
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def start_requests(self):
        url = "http://www.renren.com/PLogin.do"
        data = {"email": "kevin19851228@gmail.com", "password": "1qaz@WSX"}
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        # with open('renren.html', 'w', encoding='utf-8') as fp:
        #     fp.write(response.text)
        request = scrapy.Request(url="http://www.renren.com/880151247/profile", callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        with open('dpProfile.html', 'w', encoding='utf-8') as fp:
            fp.write(response.text)
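The reason the profile request issued in parse_page comes back logged-in is that Scrapy's built-in cookie middleware stores the session cookie set by the login response and re-sends it on later requests. The standard library's cookie-aware opener works on the same principle (a sketch only, no network call is made):

```python
import urllib.request
from http.cookiejar import CookieJar

# A cookie jar shared across requests plays the same role as
# Scrapy's CookiesMiddleware: cookies set by one response
# (e.g. the login) are attached to every later request automatically.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
print(len(jar))  # 0 -- empty until a response sets a cookie
```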
4. Notes:
1) To send a POST request, the recommended approach is scrapy.FormRequest, which makes it easy to supply form data;
2) To send a POST request as soon as the spider starts, override the start_requests method and issue the POST request there.
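For comparison, the same login POST can be expressed with only the standard library: giving urllib.request.Request a data payload switches the HTTP method to POST (a sketch only, with placeholder credentials; the request is never actually sent):

```python
import urllib.request
from urllib.parse import urlencode

# Build (but do not send) the equivalent POST request.
payload = urlencode({"email": "user@example.com", "password": "secret"}).encode()
req = urllib.request.Request("http://www.renren.com/PLogin.do", data=payload)
print(req.get_method())  # POST
```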