Python_Scrapy - Installing and Using Third-Party Modules

Installing third-party modules

1. Installing and using the requests library

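The requests library essentially mimics what a browser does when it opens a web page: it sends an HTTP request. It lets you quickly download the HTML source of the requested page to your own machine.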

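Installation
Press Win+R, type "cmd" to open the Command Prompt, then enter "pip install requests" to install the module with pip.
Checking the installation
Press Win+R, type "cmd" to open the Command Prompt, then enter "pip list" to see every third-party module installed through pip.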
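Besides reading through the output of pip list, the installation can also be checked from Python itself. The following is a minimal sketch using the standard-library importlib.metadata module (Python 3.8 or later); the helper name installed_version is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a pip package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if requests is installed, otherwise None.
print("requests:", installed_version("requests"))
```

The same helper works for any package name that pip knows, for example "beautifulsoup4" or "lxml".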
Basic usage

First, import the requests package:

import requests

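Next, fetch the HTML source of Baidu's index page and store the response in the variable r.
Note that the http:// prefix must be written out explicitly; unlike a real browser, requests will not fill in the protocol for us.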

r = requests.get("http://www.baidu.com")

Print the downloaded content:

print(r.text)
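A slightly more defensive version of the request above might look like the sketch below. The helper names with_scheme and fetch_html, the 10-second timeout, and the choice to default to http:// are assumptions made for illustration, not part of the original example. requests raises a MissingSchema error when the URL has no scheme, which is why the prefix is added up front.

```python
from urllib.parse import urlparse


def with_scheme(url, default_scheme="http"):
    """Prepend a scheme when the URL lacks one (requests will not guess it)."""
    if urlparse(url).scheme in ("http", "https"):
        return url
    return f"{default_scheme}://{url}"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML source as text."""
    import requests  # third-party; imported here so with_scheme works without it

    r = requests.get(with_scheme(url), timeout=timeout)
    r.raise_for_status()              # raise an exception on 4xx/5xx responses
    r.encoding = r.apparent_encoding  # use the encoding detected from the body
    return r.text


print(with_scheme("www.baidu.com"))  # http://www.baidu.com
# html = fetch_html("www.baidu.com")  # needs network access
```

Setting r.encoding from r.apparent_encoding is optional; it tends to help when the server does not declare a charset and r.text would otherwise come out garbled.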

(Figure: the Baidu page source that was retrieved)

2. Installing and using the bs4 library

bs4 (Beautiful Soup 4) is a library for parsing, traversing, and maintaining a "tag tree".

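Installation
Press Win+R, type "cmd" to open the Command Prompt, then enter "pip install beautifulsoup4" to install the module with pip.
Checking the installation
Press Win+R, type "cmd" to open the Command Prompt, then enter "pip list" to see every third-party module installed through pip.
Basic usage
1. We will use the following piece of HTML as the example (in the code it is held as a string in the variable html)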

<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>

2. Now let's parse this HTML with the bs4 library.

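# Import BeautifulSoup from the bs4 module
from bs4 import BeautifulSoup

# html is a str holding the HTML document shown above
soup = BeautifulSoup(html, 'html.parser')
# Print the parsed document with one tag per line, re-indented
print(soup.prettify())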

'''
OUT:

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
'''

Put simply, bs4 turns the HTML source into a cleanly structured tag tree,
which makes it easy to work with the nodes, tags, and attributes it contains.
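To make "working with nodes, tags, and attributes" concrete, here is a small sketch that pulls a few values out of the example document. It assumes the three names in the story paragraph are <a> links, as the prettified output shows; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# A tag can be reached by name; .string gives the text it contains.
title_text = soup.title.string

# Attributes are read like dictionary keys; class is returned as a list.
first_p_class = soup.p["class"]

# find_all collects every matching tag in document order.
links = [(a["id"], a["href"], a.get_text())
         for a in soup.find_all("a", class_="sister")]

print(title_text)
print(first_p_class)
for link in links:
    print(link)
```

soup.p returns only the first <p> tag; find_all is the call to reach for when every match is needed.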

3. Installing and using a parser for bs4

Here we use the lxml parser.

Installation
pip install lxml
Usage
1. We reuse the HTML document from the previous section
2. Parse it with lxml

import bs4

# First, parse the HTML file with the lxml parser to make the "soup"
with open('Beautiful Soup 爬虫/demo.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')

# Print the result; it is a clearly indented tree structure.
print(soup.prettify())

'''
OUT:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''
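The parser is selected by the second argument to BeautifulSoup, so switching between lxml and the built-in html.parser is a one-word change. As a rough sketch, the helper below (make_soup is a hypothetical name, and the fallback policy is an assumption rather than anything the original prescribes) prefers lxml and falls back to html.parser when lxml is not installed. bs4 signals a missing parser with FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='story'>Once upon a time</p>")
print(soup.p.get_text())  # Once upon a time
```

The two parsers can differ in how they repair incomplete markup (lxml, for instance, wraps a bare fragment in html and body tags), so it is worth sticking to one parser within a project.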
