1. Fetching a server's HTML front-end code
- Install the requests library
pip install requests
- Import it
import requests
- Get the response
url = 'https://www.baidu.com'
response = requests.get(url)
print(response)  # <Response [200]>: 200 is the status code for a successful request
print('Encoding:', response.encoding)
print('Status code:', response.status_code)
print('Response headers:', response.headers)
<Response [200]>
Encoding: ISO-8859-1 ==> the Content-Type header declares no charset, so requests falls back to this default; we need to change the encoding to UTF-8
Status code: 200
Response headers: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Fri, 25 Oct 2019 09:13:20 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:23:50 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
Changing the encoding (a catch-all fix whenever the text comes back garbled):
- Method 1:
response.encoding = 'UTF-8'
data = response.text  # reading response.text without setting the encoding first tends to produce garbled text
print(data)
- Method 2:
data = response.content.decode("utf-8")
print(data)
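To see why the first attempt printed garbled text, here is a minimal offline sketch. It assumes the server sent UTF-8 bytes while the response was decoded as ISO-8859-1; the sample string is the Baidu page title, and no request is actually made.

```python
# Bytes as a server would send them for a UTF-8 page (sample text: the Baidu title).
raw = '百度一下,你就知道'.encode('utf-8')

# ISO-8859-1 maps every byte to some character, so the wrong codec
# never raises an error; it silently produces garbled text instead.
garbled = raw.decode('iso-8859-1')

# Decoding with the codec the page was really written in recovers the text.
fixed = raw.decode('utf-8')

print(garbled == fixed)                      # False
print(garbled.encode('iso-8859-1') == raw)   # True: the bytes themselves were never damaged
```

Because the bytes are intact, both methods above work: Method 1 tells requests which codec to use for response.text, and Method 2 decodes response.content by hand.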
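The complete code (only the parts that are needed):
import requests  # import the requests library
url = 'https://www.baidu.com'  # target address
response = requests.get(url)  # send a GET request to the URL
response.encoding = 'UTF-8'  # change the encoding to UTF-8
data = response.text  # decode the body to text and store it in data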
print(data)
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
Note: don't crawl the Baidu homepage too often. The server can detect it, and you risk being blacklisted.
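One common way to crawl less aggressively is to pause between requests. The sketch below is only an illustration: the helper name fetch_politely, the fetch callback, and the one-second default delay are all assumptions, not something the original code uses.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 1.0) -> Dict[str, str]:
    """Call fetch(url) for each URL, sleeping between consecutive calls.

    `fetch` is any function that takes a URL and returns the page text;
    `delay_seconds` is an assumed value, tune it for the site in question.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results
```

With requests installed, a call could look like fetch_politely(['https://www.baidu.com'], lambda u: requests.get(u).text).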
- Write to a local file
'w': write mode
.write()
with open('baidu.html', 'w', encoding='UTF-8') as f:
    f.write(data)
data_content = response.content
print(data_content)
Now there is a baidu.html file in our folder.
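The same page can be saved from either the decoded text or the raw bytes. The sketch below runs offline: html_text and html_bytes are stand-ins for response.text and response.content, and the temporary folder and file names are assumptions made for the example.

```python
import os
import tempfile

html_text = '<title>百度一下,你就知道</title>'  # stands in for response.text
html_bytes = html_text.encode('utf-8')           # stands in for response.content

folder = tempfile.mkdtemp()
text_path = os.path.join(folder, 'baidu_text.html')
bytes_path = os.path.join(folder, 'baidu_bytes.html')

# Text mode: Python encodes the str using the encoding given here.
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(html_text)

# Binary mode: the bytes are written exactly as received, no encoding argument.
with open(bytes_path, 'wb') as f:
    f.write(html_bytes)

# Both files end up with identical contents on disk.
with open(text_path, 'rb') as a, open(bytes_path, 'rb') as b:
    same = a.read() == b.read()
print(same)  # True
```

Binary mode avoids guessing the encoding at all, which can be handy when the page's encoding is unknown.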
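Difference between response.content and response.text
- response.text
Return type: str
Decoding: requests makes an educated guess at the text encoding based on the HTTP headers
Changing the encoding (by assignment): response.encoding = 'UTF-8'
- response.content
Return type: bytes
Changing the encoding: response.content.decode('utf-8')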
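Since response.content is plain bytes, the decoding step is entirely in your hands. Below is a hedged sketch of a helper that tries UTF-8 first and then falls back to gb18030; the name decode_with_fallback and the choice of fallback codec are assumptions (gb18030 covers GBK, which some Chinese sites still use).

```python
def decode_with_fallback(raw: bytes, encodings=('utf-8', 'gb18030')) -> str:
    """Try each encoding in order and return the first successful decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec does not fit, try the next one
    # Last resort: decode with the first codec and mark undecodable bytes.
    return raw.decode(encodings[0], errors='replace')


# Works for UTF-8 bytes and for GBK bytes alike.
print(decode_with_fallback('百度'.encode('utf-8')) == '百度')  # True
print(decode_with_fallback('百度'.encode('gbk')) == '百度')    # True
```

With a real response, the call would be decode_with_fallback(response.content).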
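2. Getting past anti-crawler measures
- Add headers
headers = {"User-Agent": "..."}
Purpose: mimic a browser visit so the server is fooled into returning the same content a browser would get
Form: a dictionary
[Exercise] Visiting Zhihu
Without the headers parameter, the Zhihu site returns a status code (status_code) of 400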
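To see what the server sees with and without this dictionary, the sketch below prepares two requests without sending them, so it needs no network access. It assumes requests is installed; the URL and User-Agent string are sample values taken from the exercise in this article.

```python
import requests

url = 'https://www.zhihu.com'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"}

session = requests.Session()

# Prepare (but do not send) a request with no custom headers.
default_request = session.prepare_request(requests.Request('GET', url))
default_agent = default_request.headers['User-Agent']

# Prepare the same request with a browser-style User-Agent.
browser_request = session.prepare_request(requests.Request('GET', url, headers=headers))
browser_agent = browser_request.headers['User-Agent']

print(default_agent)  # starts with "python-requests/", which announces a script
print(browser_agent)  # the browser string from the headers dictionary
```

The default value identifies the client as a Python script, which is presumably what lets a server tell it apart from a browser.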
url = 'https://www.zhihu.com'
res = requests.get(url)
print(res.status_code)
400
With the headers parameter, the Zhihu site returns a status code (status_code) of 200
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"}
response = requests.get(url, headers=headers)
print(response.status_code)
200
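The two status codes seen above can be turned into readable text with the standard library. This is a small sketch; the helper name describe_status is hypothetical.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard reason phrase, and whether it is a 2xx success."""
    status = HTTPStatus(code)
    outcome = 'success' if 200 <= code < 300 else 'failure'
    return f'{code} {status.phrase} ({outcome})'


print(describe_status(400))  # 400 Bad Request (failure)
print(describe_status(200))  # 200 OK (success)
```

With a real response, the call would be describe_status(response.status_code).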
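Since Zhihu can now be crawled, the same approach should let us crawl Baidu without being blacklisted
Interesting. Let's try it:
[Exercise] Crawl Baidu repeatedly without the server detecting us and blacklisting us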
import requests
url = 'https://www.baidu.com'
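But wait... where do we find the value for headers?
Right, I forgot to cover that. Read on:
- First, press F12 to open the browser's developer tools (the large panel at the bottom of the page)
- Then select "Network"
Nothing there? OK, let's refresh and see
- Refresh the page and scroll the bottom panel to the top; "www.baidu.com" appears
- Next, click "www.baidu.com" and scroll the left-hand panel to the bottom; "User-Agent:..." appears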
This line is the headers value we need
So how should we write it? First, copy it over:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36
- Then change the format (wrap the text on each side of the colon in double quotes, then put it inside {}):
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}
- Next, the response
response = requests.get(url, headers=headers)
- Finally, print the result
print(response.status_code)
The complete code:
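The reformatting step above, turning a line copied from the developer tools into a dictionary, can also be done in code. This is a minimal sketch; the helper name headers_from_devtools is hypothetical, and it assumes each copied line has the form "Name: value".

```python
def headers_from_devtools(raw_text: str) -> dict:
    """Convert 'Name: value' lines copied from the browser into a headers dict."""
    headers = {}
    for line in raw_text.splitlines():
        line = line.strip()
        # Skip blank lines, lines without a colon, and HTTP/2 pseudo-headers (":authority" etc.).
        if not line or ':' not in line or line.startswith(':'):
            continue
        name, value = line.split(':', 1)  # split on the first colon only
        headers[name.strip()] = value.strip()
    return headers


copied = "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
headers = headers_from_devtools(copied)
print(headers)
```

The resulting dictionary can be passed straight to requests.get(url, headers=headers).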
import requests
url = 'https://www.baidu.com'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}
response = requests.get(url, headers=headers)
data = response.content.decode('utf-8')
print(data)
The result is shown below (the full output was too long to publish on Jianshu, so the CSS and JS portions have been removed):
<!DOCTYPE html>
<!--STATUS OK-->
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta content="always" name="referrer">
<meta name="theme-color" content="#2932e1">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
<link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" />
<link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg">
<link rel="dns-prefetch" href="//s1.bdstatic.com"/>
<link rel="dns-prefetch" href="//t1.baidu.com"/>
<link rel="dns-prefetch" href="//t2.baidu.com"/>
<link rel="dns-prefetch" href="//t3.baidu.com"/>
<link rel="dns-prefetch" href="//t10.baidu.com"/>
<link rel="dns-prefetch" href="//t11.baidu.com"/>
<link rel="dns-prefetch" href="//t12.baidu.com"/>
<link rel="dns-prefetch" href="//b1.bdstatic.com"/>
<title>百度一下,你就知道</title>
<body link="#0000cc">
<div id="wrapper" style="display:none;">
<div id="head"><div class="head_wrapper"><div class="s_form"><div class="s_form_wrapper"><style>.index-logo-srcnew {display: none;}@media (-webkit-min-device-pixel-ratio: 2),(min--moz-device-pixel-ratio: 2),(-o-min-device-pixel-ratio: 2),(min-device-pixel-ratio: 2){.index-logo-src {display: none;}.index-logo-srcnew {display: inline;}}</style><div id="lg"><img hidefocus="true" class='index-logo-src' src="//www.baidu.com/img/bd_logo1.png" width="270" height="129" usemap="#mp"><img hidefocus="true" class='index-logo-srcnew' src="//www.baidu.com/img/bd_logo1.png?qua=high" width="270" height="129" usemap="#mp"><map name="mp"><area style="outline:none;" hidefocus="true" shape="rect" coords="0,0,270,129" href="//www.baidu.com/s?wd=%E4%BB%8A%E6%97%A5%E6%96%B0%E9%B2%9C%E4%BA%8B&tn=SE_PclogoS_8whnvm25&sa=ire_dl_gh_logo&rsv_dl=igh_logo_pcs" onmousedown="return ns_c({fm: 'tab', tab: 'felogo', rsv_platform: 'wwwhome' })" target="_blank" title="点击一下,了解更多"onmousedown="return ns_c({'fm':'behs','tab':'bdlogo'})"></map></div><a href="/" id="result_logo" onmousedown="return c({'fm':'tab','tab':'logo'})"><img class='index-logo-src' src="//www.baidu.com/img/baidu_jgylogo3.gif" alt="到百度首页" title="到百度首页"><img class='index-logo-srcnew' src="//www.baidu.com/img/baidu_resultlogo@2.png" alt="到百度首页" title="到百度首页"></a><form id="form" name="f" action="/s" class="fm"><input type="hidden" name="ie" value="utf-8"><input type="hidden" name="f" value="8"><input type="hidden" name="rsv_bp" value="1"><input type="hidden" name="rsv_idx" value="1"><input type=hidden name=ch value=""><input type=hidden name=tn value="baidu"><input type=hidden name=bar value=""><span class="bg s_ipt_wr"><input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off"></span><span class="bg s_btn_wr"><input type="submit" id="su" value="百度一下" class="bg s_btn"></span><span class="tools"><span id="mHolder"><div id="mCon"><span>输入法</span></div><ul id="mMenu"><li><a href="javascript:;" 
name="ime_hw">手写</a></li><li><a href="javascript:;" name="ime_py">拼音</a></li><li class="ln"></li><li><a href="javascript:;" name="ime_cl">关闭</a></li></ul></span></span><input type="hidden" name="rn" value=""><input type="hidden" name="oq" value=""><input type="hidden" name="rsv_pq" value="aa3d70660005dc89"><input type="hidden" name="rsv_t" value="05d6jzNwf3IFEFG4840Iz375f64UaIKHHcZH5nWYCzAta1OPzOGT+kHmPj8"><input type="hidden" name="rqlang" value="cn"></form><div id="m"></div></div></div><div id="u"><a class="toindex" href="/">百度首页</a><a href="javascript:;" name="tj_settingicon" class="pf">设置<i class="c-icon c-icon-triangle-down"></i></a><a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" name="tj_login" class="lb" onclick="return false;">登录</a></div><div id="u1"><a href="http://news.baidu.com" name="tj_trnews" class="mnav">新闻</a><a href="https://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a><a href="http://map.baidu.com" name="tj_trmap" class="mnav">地图</a><a href="http://v.baidu.com" name="tj_trvideo" class="mnav">视频</a><a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">贴吧</a><a href="http://xueshu.baidu.com" name="tj_trxueshu" class="mnav">学术</a><a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" name="tj_login" class="lb" onclick="return false;">登录</a><a href="http://www.baidu.com/gaoji/preferences.html" name="tj_settingicon" class="pf">设置</a><a href="http://www.baidu.com/more/" name="tj_briicon" class="bri" style="display: block;">更多产品</a></div></div></div>
<div class="s_tab" id="s_tab">
<div class="s_tab_inner">
<b>网页</b>
<a href="//www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=" wdfield="word" onmousedown="return c({'fm':'tab','tab':'news'})" sync="true">资讯</a>
<a href="http://tieba.baidu.com/f?kw=&fr=wwwt" wdfield="kw" onmousedown="return c({'fm':'tab','tab':'tieba'})">贴吧</a>
<a href="http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt" wdfield="word" onmousedown="return c({'fm':'tab','tab':'zhidao'})">知道</a>
<a href="http://music.taihe.com/search?fr=ps&ie=utf-8&key=" wdfield="key" onmousedown="return c({'fm':'tab','tab':'music'})">音乐</a>
<a href="http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=" wdfield="word" onmousedown="return c({'fm':'tab','tab':'pic'})">图片</a>
<a href="http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=" wdfield="word" onmousedown="return c({'fm':'tab','tab':'video'})">视频</a>
<a href="http://map.baidu.com/m?word=&fr=ps01000" wdfield="word" onmousedown="return c({'fm':'tab','tab':'map'})">地图</a>
<a href="http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8" wdfield="word" onmousedown="return c({'fm':'tab','tab':'wenku'})">文库</a>
<a href="//www.baidu.com/more/" onmousedown="return c({'fm':'tab','tab':'more'})">更多»</a>
</div>
</div>
<div class="qrcodeCon">
<div id="qrcode">
<div class="qrcode-item qrcode-item-1">
<div class="qrcode-img"></div>
<div class="qrcode-text">
<p class="title">下载百度APP</p>
<p class="sub-title">有事搜一搜 没事看一看</p>
</div>
</div>
</div>
</div>
<div id="ftCon">
<div class="ftCon-Wrapper"><div id="ftConw"><p id="lh"><a id="setf" href="//www.baidu.com/cache/sethelp/help.html" onmousedown="return ns_c({'fm':'behs','tab':'favorites','pos':0})" target="_blank">把百度设为主页</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_about'})" href="http://home.baidu.com">关于百度</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_about_en'})" href="http://ir.baidu.com">About Baidu</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_tuiguang'})" href="http://e.baidu.com/?refer=888">百度推广</a></p><p id="cp">©2019 Baidu <a href="http://www.baidu.com/duty/" onmousedown="return ns_c({'fm':'behs','tab':'tj_duty'})">使用百度前必读</a> <a href="http://jianyi.baidu.com/" class="cp-feedback" onmousedown="return ns_c({'fm':'behs','tab':'tj_homefb'})">意见反馈</a> 京ICP证030173号 <i class="c-icon-icrlogo"></i> <a id="jgwab" target="_blank" href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001">京公网安备11000002000001号</a> <i class="c-icon-jgwablogo"></i></p></div></div></div>
<div id="wrapper_wrapper">
</div>
</div>
<div class="c-tips-container" id="c-tips-container"></div>
</body>
</html>
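This way we can crawl Baidu much more safely! And as the structure above shows, building the Baidu homepage isn't actually hard; only the CSS and JS parts are a bit tedious. I'm sure you can all build a Baidu homepage on your own. Keep at it!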