1. Activate the environment you created:
For example, run activate course_py35 to enter the course_py35 environment created earlier.
2. Install BeautifulSoup (it is published on pip as the beautifulsoup4 package):
pip install beautifulsoup4
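3. Fetching web pages in Jupyter:
Run the following code to check that BeautifulSoup is installed correctly (if no error is reported, the installation is fine).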
from bs4 import BeautifulSoup
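Beyond checking that the import succeeds, it can help to see which version was installed. A minimal sketch, assuming the package exposes its version string as bs4.__version__:

```python
import bs4

# __version__ is a plain string; the exact value depends on your installation
print("BeautifulSoup version:", bs4.__version__)
```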
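4. Parse an HTML document with BeautifulSoup:
The general form is soup=BeautifulSoup(web_page,'html.parser'), where web_page holds the HTML content of the page (a string or bytes) and 'html.parser' selects Python's built-in HTML parser.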
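The general form can be tried without any network access by parsing a short HTML string. This is only a sketch: html_doc and its contents are made-up sample data, not part of the weather page used later.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html_doc = "<html><head><title>Demo page</title></head><body><p id='intro'>Hello</p></body></html>"

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, "html.parser")

# Tags can be reached as attributes (soup.title) or looked up with find()
title_text = soup.title.get_text()
intro_text = soup.find(id="intro").get_text()

print(title_text)
print(intro_text)
```

The same soup object works whether the input is a str like this one or the bytes returned by urlopen(...).read().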
5. Print the page with soup.prettify
print(soup.prettify())
# BeautifulSoup's prettify method returns the page as an indented string, which is easier to read when printed
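To see what prettify changes, compare it with the plain string form of the same document. A small sketch on made-up markup (compact_html is a hypothetical sample):

```python
from bs4 import BeautifulSoup

# Hypothetical one-line sample document
compact_html = "<div><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(compact_html, "html.parser")

compact = str(soup)        # the document as a single unindented string
pretty = soup.prettify()   # one tag or text node per line, indented by nesting depth

print(compact)
print(pretty)
```

Note that prettify adds whitespace, so it is meant for reading the structure rather than for reproducing the page exactly.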
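Example: scraping weather data from "NATIONAL WEATHER" (the U.S. National Weather Service site)
The San Francisco weather page used in this example is at: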
http://forecast.weather.gov/MapClick.php?lat=37.77492773500046&lon=-122.41941932299972#.WUnSFhN95E4
You can inspect the page's HTML in the browser's developer tools: More tools > Developer tools
1. Fetch the page content with urllib.request
import urllib.request as urlrequest
weather_url='http://forecast.weather.gov/MapClick.php?lat=37.77492773500046&lon=-122.41941932299972'
web_page=urlrequest.urlopen(weather_url).read()
print(web_page)
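urlopen(...).read() returns raw bytes and raises an exception if the request fails. The sketch below wraps the call in a helper that sets a timeout, decodes the body, and reports failures. The name fetch_page, the User-Agent value, and the 10-second default are illustrative assumptions, not something the site is known to require.

```python
import urllib.error
import urllib.request


def fetch_page(url, timeout=10):
    """Return the decoded body at url, or None if the request fails."""
    # The User-Agent string is an arbitrary example value
    request = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print("Request failed:", exc)
        return None
```

Calling fetch_page(weather_url) would then give a str that can be passed to BeautifulSoup in the same way as the bytes in web_page.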
2. Extract the weather information from the page with BeautifulSoup
from bs4 import BeautifulSoup
soup=BeautifulSoup(web_page,'html.parser')
print(soup.find(id='seven-day-forecast-body').get_text())
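soup.find returns None when no element matches, so chaining .get_text() directly fails with AttributeError if the page layout changes. A defensive sketch follows; text_by_id is a hypothetical helper and sample_html is a made-up stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
sample_html = "<div id='seven-day-forecast-body'><p>Tonight: Clear</p></div>"
soup = BeautifulSoup(sample_html, "html.parser")


def text_by_id(soup, element_id):
    """Return the text of the element with this id, or None if it is absent."""
    node = soup.find(id=element_id)
    if node is None:
        return None
    # strip=True trims whitespace around each text fragment
    return node.get_text(separator=" ", strip=True)


print(text_by_id(soup, "seven-day-forecast-body"))
print(text_by_id(soup, "no-such-id"))
```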
You can also use prettify to print that part of the page source in a readable, indented form
from bs4 import BeautifulSoup
soup=BeautifulSoup(web_page,'html.parser')
print(soup.find(id='seven-day-forecast-container').prettify())
3. Extract the weather data completely and in order
soup_forecast=soup.find(id='seven-day-forecast-container')
date_list=soup_forecast.find_all(class_='period-name')
desc_list=soup_forecast.find_all(class_='short-desc')
temp_list=soup_forecast.find_all(class_='temp')
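# zip pairs the three lists item by item and stops at the shortest one,
# so the loop cannot raise IndexError if the page lists fewer than 9 periods
for date, desc, temp in zip(date_list, desc_list, temp_list):
    print("{} {} {}".format(date.get_text(), desc.get_text(), temp.get_text()))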
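For later processing it is often more convenient to collect the three fields into records than to print them directly. The sketch below does this on made-up markup: sample_html only imitates the id and class names used above, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the id and class names used in this example
sample_html = """
<div id="seven-day-forecast-container">
  <div><p class="period-name">Tonight</p><p class="short-desc">Clear</p><p class="temp">Low: 55 F</p></div>
  <div><p class="period-name">Friday</p><p class="short-desc">Sunny</p><p class="temp">High: 70 F</p></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
container = soup.find(id="seven-day-forecast-container")

names = container.find_all(class_="period-name")
descs = container.find_all(class_="short-desc")
temps = container.find_all(class_="temp")

# zip stops at the shortest list, so a missing field cannot cause IndexError
forecast = [
    {
        "period": n.get_text(strip=True),
        "desc": d.get_text(strip=True),
        "temp": t.get_text(strip=True),
    }
    for n, d, t in zip(names, descs, temps)
]

for row in forecast:
    # Left-align the first two columns to a fixed width
    print("{period:<10}{desc:<10}{temp}".format(**row))
```

One caveat of pairing separate lists: if a single period lacks one of the fields, the later entries shift out of alignment, so the output is worth checking against the page.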
Putting it all together, the complete code for this simple crawler is shown below; note what each step does
# Import the required modules: urllib.request and BeautifulSoup
import urllib.request as urlrequest
from bs4 import BeautifulSoup
# Fetch the page we want to scrape with urllib
weather_url='http://forecast.weather.gov/MapClick.php?lat=37.77492773500046&lon=-122.41941932299972'
web_page=urlrequest.urlopen(weather_url).read()
# Parse the page with BeautifulSoup and locate the block we are interested in
soup=BeautifulSoup(web_page,'html.parser')
soup_forecast=soup.find(id='seven-day-forecast-container')
# Find the specific pieces of content we want
date_list=soup_forecast.find_all(class_='period-name')
desc_list=soup_forecast.find_all(class_='short-desc')
temp_list=soup_forecast.find_all(class_='temp')
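# Print the extracted fields side by side; zip walks the three lists in step
# and stops at the shortest one, so a short list cannot raise IndexError
for date, desc, temp in zip(date_list, desc_list, temp_list):
    print("{} {} {}".format(date.get_text(), desc.get_text(), temp.get_text()))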
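The find and find_all lookups can also be written as CSS selectors with select and select_one, which some readers find closer to what the browser's developer tools show. A sketch on made-up markup (sample_html is hypothetical and only imitates the names used above):

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the id and class names used in this example
sample_html = """
<div id="seven-day-forecast-container">
  <div><p class="period-name">Tonight</p><p class="short-desc">Clear</p><p class="temp">Low: 55 F</p></div>
  <div><p class="period-name">Friday</p><p class="short-desc">Sunny</p><p class="temp">High: 70 F</p></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "#id .class" matches elements with that class anywhere inside the element with that id
period_names = [
    tag.get_text(strip=True)
    for tag in soup.select("#seven-day-forecast-container .period-name")
]

# select_one returns the first match, or None if nothing matches
first_temp = soup.select_one("#seven-day-forecast-container .temp")

print(period_names)
print(first_temp.get_text(strip=True))
```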
That is all.