HTML-Based Web Scraping: Fetching Weather Data

1. Activate the environment you created:

For example, run activate course_py35 to enter the course_py35 environment created earlier.

2. Install BeautifulSoup (BeautifulSoup4 can be installed with pip):

pip install beautifulsoup4

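3. Fetching web pages in Jupyter:

Run the following code to check that BeautifulSoup is installed correctly (if no error is reported, the installation is fine).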

from bs4 import BeautifulSoup

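4. Parse an HTML document with BeautifulSoup:

The general form is soup=BeautifulSoup(page_content,'html.parser'), where page_content is the HTML of the page (a string or bytes), not the name of the page.

5. Print the page with soup.prettify()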

print(soup.prettify())

# The BeautifulSoup method soup.prettify() returns the page source as an indented string, which is much easier to read when printed.
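As a minimal sketch of steps 4 and 5, the snippet below parses a small inline HTML string instead of a downloaded page. The names sample_html, title_text and msg_text, and the HTML itself, are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used here instead of a downloaded page.
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p id='msg'>Hello</p></body></html>"
)

# The first argument is the page content, the second names the parser.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a str with one tag per line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)

# Individual elements can be read directly from the parsed tree.
title_text = soup.title.get_text()
msg_text = soup.find(id="msg").get_text()
print(title_text, msg_text)
```

Trying the calls on a tiny string like this makes it easier to see what each one returns before pointing them at a real page.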

Example: scraping weather data from "NATIONAL WEATHER"

The example San Francisco weather page is at:

http://forecast.weather.gov/MapClick.php?lat=37.77492773500046&lon=-122.41941932299972#.WUnSFhN95E4
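The address above ends with a fragment (the part after #). Browsers do not send the fragment to the server, so it can be dropped when the page is requested from code. The sketch below uses the standard library's urllib.parse to separate it and to read the coordinates from the query string; the variable names are illustrative only.

```python
from urllib.parse import urldefrag, urlsplit, parse_qs

page_url = (
    "http://forecast.weather.gov/MapClick.php"
    "?lat=37.77492773500046&lon=-122.41941932299972#.WUnSFhN95E4"
)

# urldefrag splits the URL into the part sent to the server and the fragment.
clean_url, fragment = urldefrag(page_url)
print(clean_url)
print(fragment)

# The query string carries the coordinates of the forecast location.
query = parse_qs(urlsplit(clean_url).query)
lat = float(query["lat"][0])
lon = float(query["lon"][0])
print(lat, lon)
```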

You can inspect the page source in the browser's developer tools: More tools > Developer tools

1. Fetch the page content with urllib.request

import urllib.request as urlrequest

weather_url='http://forecast.weather.gov/MapClick.php?lat=37.77492773500046&lon=-122.41941932299972'

web_page=urlrequest.urlopen(weather_url).read()

print(web_page)
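Note that urlopen(...).read() returns bytes, so print(web_page) shows a b'...' literal rather than readable text. One possible way to wrap the request is sketched below, with a timeout and decoding to str. The helper names build_request and fetch_html, the 10-second timeout, and the User-Agent value are assumptions for illustration, not something the site is known to require.

```python
import urllib.request as urlrequest

weather_url = (
    "http://forecast.weather.gov/MapClick.php"
    "?lat=37.77492773500046&lon=-122.41941932299972"
)

def build_request(url):
    # The User-Agent value is a made-up placeholder identifying this script.
    return urlrequest.Request(url, headers={"User-Agent": "course-scraper-demo/0.1"})

def fetch_html(url, timeout=10):
    # A timeout keeps the call from hanging forever if the server is slow.
    with urlrequest.urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Needs network access, so it is left commented out here:
# html_text = fetch_html(weather_url)
# print(html_text[:200])
```

BeautifulSoup accepts both bytes and str, so decoding is optional for parsing; it mainly helps when printing or searching the raw text.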

2. Extract the weather information from the page with BeautifulSoup

from bs4 import BeautifulSoup

soup=BeautifulSoup(web_page,'html.parser')

print(soup.find(id='seven-day-forecast-body').get_text())

You can also use prettify to print that part of the page source in a readable, indented form

from bs4 import BeautifulSoup

soup=BeautifulSoup(web_page,'html.parser')

print(soup.find(id='seven-day-forecast-container').prettify())
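One thing to watch for: soup.find returns None when no element matches, so chaining .get_text() or .prettify() directly raises AttributeError if the page layout changes. A small guard is sketched below on a hypothetical, heavily simplified stand-in for the page; the helper name text_by_id is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the downloaded page.
sample_html = """
<div id="seven-day-forecast-container">
  <p class="period-name">Tonight</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

def text_by_id(soup, element_id):
    """Return the stripped text of the element with this id, or None if absent."""
    node = soup.find(id=element_id)
    if node is None:
        return None
    return node.get_text(strip=True)

found = text_by_id(soup, "seven-day-forecast-container")
missing = text_by_id(soup, "seven-day-forecast-body")  # not in the sample
print(found)
print(missing)
```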

3. Extract the weather data completely and in order

soup_forecast=soup.find(id='seven-day-forecast-container')

date_list=soup_forecast.find_all(class_='period-name')

desc_list=soup_forecast.find_all(class_='short-desc')

temp_list=soup_forecast.find_all(class_='temp')

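# zip stops at the shortest list, so a page with fewer than 9 periods does not raise IndexError
for date_tag,desc_tag,temp_tag in zip(date_list,desc_list,temp_list):
    date=date_tag.get_text()
    desc=desc_tag.get_text()
    temp=temp_tag.get_text()
    print("{} {} {}".format(date,desc,temp))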
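The loop above pairs the three lists by position, which only works while each list has an entry for every period. A rough sketch of a more forgiving variant uses itertools.zip_longest to pad the shorter lists. The sample HTML, the texts helper and the "n/a" filler are made up for illustration.

```python
from itertools import zip_longest
from bs4 import BeautifulSoup

# Hypothetical sample: the third period has no temperature element.
sample_html = """
<div id="seven-day-forecast-container">
  <p class="period-name">Tonight</p><p class="short-desc">Clear</p><p class="temp">Low: 52 F</p>
  <p class="period-name">Friday</p><p class="short-desc">Sunny</p><p class="temp">High: 68 F</p>
  <p class="period-name">Friday Night</p><p class="short-desc">Cloudy</p>
</div>
"""
soup_forecast = BeautifulSoup(sample_html, "html.parser").find(
    id="seven-day-forecast-container"
)

def texts(class_name):
    # Collect the text of every element with the given class, in page order.
    return [tag.get_text() for tag in soup_forecast.find_all(class_=class_name)]

# Pad the shorter lists with "n/a" instead of dropping or crashing on a row.
rows = list(
    zip_longest(texts("period-name"), texts("short-desc"), texts("temp"), fillvalue="n/a")
)
for date, desc, temp in rows:
    print("{} | {} | {}".format(date, desc, temp))
```

Padding only helps when the missing value is at the end; if an element is missing in the middle, the later rows still shift out of alignment.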

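Putting it all together, the complete code for this simple scraper is shown below. Note what each step does.

# Import the required packages and modules: urllib.request and BeautifulSoup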

import urllib.request as urlrequest

from bs4 import BeautifulSoup

# Use urllib to fetch the page we want to scrape

weather_url='http://forecast.weather.gov/MapClick.php?lat=37.77492773500046&lon=-122.41941932299972'

web_page=urlrequest.urlopen(weather_url).read()

# Use BeautifulSoup to parse the page and locate the block we want

soup=BeautifulSoup(web_page,'html.parser')

soup_forecast=soup.find(id='seven-day-forecast-container')

# Find the parts of that block we are interested in

date_list=soup_forecast.find_all(class_='period-name')

desc_list=soup_forecast.find_all(class_='short-desc')

temp_list=soup_forecast.find_all(class_='temp')

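# Display the extracted content in a readable way with a for loop
# (zip stops at the shortest list, so fewer than 9 periods does not raise IndexError)
for date_tag,desc_tag,temp_tag in zip(date_list,desc_list,temp_list):
    date=date_tag.get_text()
    desc=desc_tag.get_text()
    temp=temp_tag.get_text()
    print("{} {} {}".format(date,desc,temp))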
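If the forecast is needed as data rather than printed text, the parsing steps can be collected into a function that takes HTML and returns a list of dictionaries. This is only a sketch: parse_forecast and the sample HTML are hypothetical, and the id and class names are the ones used in the steps above.

```python
from bs4 import BeautifulSoup

def parse_forecast(html):
    """Return a list of dicts with 'date', 'desc' and 'temp' keys."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="seven-day-forecast-container")
    if container is None:
        return []  # the page layout differs from what this sketch expects
    dates = container.find_all(class_="period-name")
    descs = container.find_all(class_="short-desc")
    temps = container.find_all(class_="temp")
    return [
        {"date": d.get_text(), "desc": s.get_text(), "temp": t.get_text()}
        for d, s, t in zip(dates, descs, temps)
    ]

# Hypothetical sample input standing in for the downloaded page.
sample_html = """
<div id="seven-day-forecast-container">
  <p class="period-name">Tonight</p>
  <p class="short-desc">Clear</p>
  <p class="temp">Low: 52 F</p>
</div>
"""
forecast = parse_forecast(sample_html)
print(forecast)
```

Keeping the parsing separate from the download makes it possible to test the function on saved or hand-written HTML without touching the network.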

That's all.

Last edited: 2019.11.30 00:10:19

