A First Try at Web Crawling


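Preface: Today someone in a developer chat group asked whether anyone knew how to write a web crawler, and I suddenly felt like trying it myself. I found a crawler tutorial on imooc (慕课网) and followed along while watching. These are my study notes.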

1. Some conceptual screenshots

[Screenshot: Paste_Image.png]

2. Development environment

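I Googled which editor is good for Python development, and most answers recommended PyCharm. I downloaded it from the official site and installed it; it requires activation or a 30-day trial. I searched for how to activate PyCharm, found a registration code, and the installation succeeded.
The web page parser needs Beautiful Soup. Download the latest package from the official site, extract it, change into the extracted directory in a terminal, and run the following command.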

  sudo python setup.py install
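
To confirm that the install worked, one option is to try importing the package from the same Python interpreter. The snippet below is only a minimal sketch: `installed_version` is a helper name I made up for illustration, and it assumes the package exposes a `__version__` attribute (falling back to "unknown" if it does not).

```python
import importlib


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every module defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


# Prints a version string if Beautiful Soup is importable, otherwise None.
print(installed_version("bs4"))
```

If this prints None, the package was probably installed for a different Python interpreter than the one running the script.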

Beautiful Soup usage example

# coding=utf-8
from __future__ import print_function  # makes print() work the same on Python 2 and 3

from bs4 import BeautifulSoup    # import the Beautiful Soup package
import re                        # regular expression module

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# 'html.parser' is Python's built-in parser, so no extra dependency is needed
soup = BeautifulSoup(html_doc, 'html.parser')

print('get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('get the link to lacie')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('regex match')
link_node1 = soup.find('a', href=re.compile(r"ill"))
print(link_node1.name, link_node1['href'], link_node1.get_text())

print('get the text of a p paragraph')
# find() returns only the first match, i.e. the first <p class="story">
link_node2 = soup.find('p', class_="story")
print(link_node2.name, link_node2.get_text())
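
A note on the regex lookup in the example: when a compiled pattern is passed as the `href` filter, Beautiful Soup checks it against each attribute value with the pattern's `search()` method, so the pattern may match anywhere in the URL. The sketch below reproduces that check with the standard `re` module alone, using the three `href` values from `html_doc`; the names `hrefs` and `matching` are just for illustration.

```python
import re

# The same three href values that appear in html_doc.
hrefs = [
    "http://example.com/elsie",
    "http://example.com/lacie",
    "http://example.com/tillie",
]

pattern = re.compile(r"ill")

# search() succeeds if the pattern occurs anywhere in the string,
# so only the URL containing "ill" is kept.
matching = [href for href in hrefs if pattern.search(href)]
print(matching)  # ['http://example.com/tillie']
```

This is why `soup.find('a', href=re.compile(r"ill"))` returns the Tillie link even though the pattern is not anchored to the start of the URL.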

3. Example crawler

a. Steps

[Screenshot: Paste_Image.png]

Last edited: 2017.12.06 03:58:52
