Sometimes we need to log in to a website before we can get at certain information. Taking the GitHub login page as an example, below is part of the HTML of its login form.
<form action="/session" accept-charset="UTF-8" method="post">
<input name="utf8" type="hidden" value="✓" />
<input type="hidden" name="authenticity_token" value="vr2Ebi0MMmJvjeZEQDEToGr96pQ2CK6TraSsU96M86B9PUI9D+59pAtOG99pv7UouYfN19Ptxwo+PaaVxYnWMQ==" />
<div class="auth-form-header p-0">
<h1>Sign in to GitHub</h1>
</div>
<div id="js-flash-container">
</div>
<div class="auth-form-body mt-3">
<label for="login_field"> Username or email address </label>
<input type="text" name="login" id="login_field" class="form-control input-block" tabindex="1" autocapitalize="off" autocorrect="off" autofocus="autofocus" />
<label for="password"> Password <a class="label-link" href="/password_reset">Forgot password?</a> </label>
<input type="password" name="password" id="password" class="form-control form-control input-block" tabindex="2" />
<input type="submit" name="commit" value="Sign in" tabindex="3" class="btn btn-primary btn-block" data-disable-with="Signing in…" />
</div>
</form>
The form's action attribute specifies the address the form is submitted to; accept-charset specifies the character set the server accepts; method specifies how the data is submitted. The input tags inside the form determine what the form actually submits: each input's name attribute gives the name under which its data is sent, and where the corresponding value comes from depends on the type. Some take it from the value attribute, such as hidden; others require user input, such as text and password. The details can be found in the HTML documentation, so we won't repeat them here.
Below is the process and result of submitting this data:
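Roughly, the exchange looks like the following sketch (headers trimmed, token and cookie values abbreviated, and user%40example.com standing in for a real account):

POST /session HTTP/1.1
Host: github.com
Content-Type: application/x-www-form-urlencoded

utf8=%E2%9C%93&authenticity_token=vr2Ebi0...&login=user%40example.com&password=secret&commit=Sign+in

HTTP/1.1 302 Found
Set-Cookie: user_session=...; path=/; secure; HttpOnly
Location: https://github.com/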
As you can see, the form data is made up of key-value pairs taken from the input fields. In the response, the server returns the user's cookie information, and since a 302 redirect follows, the response also carries the redirect target in the Location header.
From the above, the essence of logging in is simply sending a request containing form data to the target server, usually a POST request.
Scrapy provides FormRequest, a subclass of Request, for constructing and submitting form data. On top of Request's constructor parameters, FormRequest adds formdata, which accepts a dict or an iterable of tuples; whenever you need to make a form request, just pass formdata at construction time.
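As a minimal sketch of the two shapes formdata accepts (the URL and field values here are placeholders):

import scrapy

# formdata as a dict
req = scrapy.FormRequest('https://example.com/session',
                         formdata={'login': 'user', 'password': 'secret'})

# formdata as an iterable of (name, value) tuples,
# which also allows the same field name to repeat
req = scrapy.FormRequest('https://example.com/session',
                         formdata=[('tag', 'python'), ('tag', 'scrapy')])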
Let's log in to GitHub with FormRequest, and judge whether the login succeeded by checking whether www.github.com contains Signed in as.
# scrapy shell https://github.com/login
>>> input_selector = response.css('input')
>>> fd = dict()
>>> for selector in input_selector:
... name = selector.css('input::attr(name)').extract_first()
... value = selector.css('input::attr(value)').extract_first()
... if value is None:
... value = ''
... fd[name] = value
...
>>> fd['login'] = '******@gmail.com'
>>> fd['password'] = '******'
>>> request = scrapy.FormRequest('https://github.com/session', formdata=fd)
>>> fetch(request)
2018-06-03 15:27:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://github.com/> from <POST https://github.com/session>
2018-06-03 15:27:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/> (referer: None)
>>> 'Signed in as' in response.text
True
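Note that nothing special had to be done about the Set-Cookie from the login response: Scrapy's cookies middleware is enabled by default (COOKIES_ENABLED = True), so the session cookie is carried along on the 302 redirect and on every later request in this shell session.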
Besides using FormRequest directly, Scrapy offers an even simpler way to submit a form: the FormRequest class method from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...]). When using it, the first argument only needs to be a Response object; you then supply the account and password in formdata, and the method takes care of the remaining hidden fields for us.
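If a page contained more than one form, a sketch like the following could pin down the right one explicitly (the formcss selector here is an assumption, matching the action attribute shown earlier):

request = scrapy.FormRequest.from_response(
    response,
    formcss='form[action="/session"]',  # assumed selector for the login form
    formdata={'login': 'user', 'password': 'secret'},
)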
Below we log in to GitHub again, this time with FormRequest.from_response():
# scrapy shell https://github.com/login
>>> fd = dict()
>>> fd['login'] = '******@gmail.com'
>>> fd['password'] = '******'
>>> request = scrapy.FormRequest.from_response(response, formdata=fd)
>>> fetch(request)
2018-06-03 15:35:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://github.com/> from <POST https://github.com/session>
2018-06-03 15:35:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/> (referer: None)
>>> 'Signed in as' in response.text
True
Now let's see how the login logic is implemented in an actual project:
# profiles.py
# -*- coding: utf-8 -*-
import scrapy


class ProfilesSpider(scrapy.Spider):
    name = 'profiles'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/']
    login_url = 'https://github.com/login'

    def parse(self, response):  # the default callback once the crawl proper begins
        pass

    def after_login(self, response):
        # call Spider.start_requests() to start crawling from start_urls
        yield from super().start_requests()

    def start_requests(self):
        yield scrapy.Request(self.login_url, callback=self.parse_login)

    def parse_login(self, response):
        fd = dict()
        post_headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
            "Referer": "https://github.com/",
        }
        fd['login'] = '*******@gmail.com'
        fd['password'] = '******'
        yield scrapy.FormRequest.from_response(
            response, formdata=fd, callback=self.after_login, headers=post_headers)
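One possible hardening step, not in the spider above: since a wrong password also comes back as a 200 page, after_login could verify the Signed in as marker before handing control back to the normal crawl:

    def after_login(self, response):
        # assumption: a logged-in GitHub page contains 'Signed in as'
        if 'Signed in as' not in response.text:
            self.logger.error('login failed, stopping')
            return
        yield from super().start_requests()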
Summary
This post briefly covered how to submit form data with FormRequest; logins that involve a CAPTCHA are left for a later look. The next post will examine how to scrape data from dynamic pages.