2019-02-07 Beginner Python Web Scraping Practice (1)

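Today I tried scraping my own article with Python. It is quite simple with the third-party BeautifulSoup module (see the BS4 documentation and the BeautifulSoup Chinese documentation).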

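Beginners mainly need to understand the tree structure of HTML and the four kinds of objects in BS4: Tag, NavigableString, BeautifulSoup, and Comment.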
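To make the four object kinds concrete, here is a minimal sketch that parses a small hand-written HTML string (the snippet and variable names are my own illustration, not taken from a real page) and looks at the type of each node:

```python
from bs4 import BeautifulSoup, Comment

# A tiny made-up document, only for illustration
html = '<div class="post"><h1>Title</h1><!-- a note --><p>Hello <b>world</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')   # the BeautifulSoup object is the root of the tree

h1 = soup.find('h1')    # a Tag: an element such as <h1>...</h1>
text = h1.string        # a NavigableString: the text inside the tag
# A Comment is a special kind of string node; walk the tree to find the first one
comment = next(node for node in soup.descendants if isinstance(node, Comment))

node_types = [type(soup).__name__, type(h1).__name__,
              type(text).__name__, type(comment).__name__]
print(node_types)
```

Since Comment is a subclass of NavigableString, code that collects text may want to check for comments explicitly.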

When scraping content, you may run into this problem:

urllib.error.HTTPError: HTTP Error 403: Forbidden

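This happens because the request does not simulate a browser's HTTP headers. You can look up a User-Agent value in the Network tab of your browser's developer tools (Inspect Element).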
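As a rough sketch of what is going on: when no User-Agent is given, urllib announces itself as Python-urllib, which some sites seem to reject. The helper names `build_request` and `fetch` below are hypothetical, and the short User-Agent string is just a placeholder; copy a real one from your own browser.

```python
import urllib.error
import urllib.request

# The headers urllib sends by default when we do not set any ourselves
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers)

def build_request(url, user_agent):
    """Return a Request carrying a browser-like User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

def fetch(url, user_agent, timeout=10):
    """Download a page; return None instead of raising on an HTTP error such as 403."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        print('request failed:', err.code, err.reason)
        return None

req = build_request('https://www.jianshu.com/p/67f98683f9fe',
                    'Mozilla/5.0 (Windows NT 10.0; WOW64)')
# urllib stores header names via str.capitalize(), hence 'User-agent'
print(req.get_header('User-agent'))
```

Nothing is downloaded until `fetch` is called, so the request can be inspected first.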

I scraped my own first article. Here is the code:

import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.jianshu.com/p/67f98683f9fe'

# Send a browser-like User-Agent; without it the request gets 403 Forbidden
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.90 Safari/537.36 2345Explorer/9.6.0.18627'
}

req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
soup = BeautifulSoup(resp, 'html.parser')

# Get the article body (kept as an HTML string)
content = soup.find(name='div', attrs={"class": "show-content-free"})
content = str(content)

# Get the title (stripped, since it is also used as the file name)
content_title = soup.find(name='h1', attrs={"class": "title"}).text.strip()

# Get the author
content_author = soup.find(name='span', attrs={"class": "name"}).text.strip()

# Get the publish time
content_publish_time = soup.find(name='span', attrs={"class": "publish-time"}).text.strip()

# Prepend the author and publish time to the body, one per line for readability
content = ('content_author=' + content_author + '\n'
           + 'publish_time=' + content_publish_time + '\n' + content)

# Save to a text file named after the title
filename = content_title + '.txt'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(content)
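One weak spot in the script above: `soup.find(...)` returns None when an element is missing, so `.text` raises AttributeError, and a title containing characters such as `/` or `:` cannot be used as a file name on every system. The following is only a sketch of one way to guard against both; `text_or_default` and `safe_filename` are hypothetical helpers of mine, and the sample HTML is made up.

```python
import re

from bs4 import BeautifulSoup

def text_or_default(soup, name, css_class, default=''):
    """Return the stripped text of the first matching element, or default if absent."""
    node = soup.find(name=name, attrs={'class': css_class})
    return node.text.strip() if node is not None else default

def safe_filename(title, fallback='untitled'):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return (cleaned or fallback) + '.txt'

# Made-up sample; note that it has no publish-time element
sample = '<h1 class="title"> python: scraping 1/2 </h1><span class="name">someone</span>'
soup = BeautifulSoup(sample, 'html.parser')

title = text_or_default(soup, 'h1', 'title')
author = text_or_default(soup, 'span', 'name')
publish_time = text_or_default(soup, 'span', 'publish-time', default='unknown')
print(safe_filename(title), author, publish_time)
```

With these helpers, a missing field results in a placeholder value rather than a crash.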

In later posts I plan to write about how to handle problems such as logins and CAPTCHAs.

Last edited: 2019.02.09 00:03:06
