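Environment: Win10 + Cmder + Python 3.6.5
Goal
Grab the air quality index table data from http://www.air-level.com/air/xian/. So, young fella, itching to run through the classic three-step crawler routine?
Code
What if I told you three lines of code are all it takes? Would you believe me? (straight face):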
import pandas as pd
df = pd.read_html("http://www.air-level.com/air/xian/", encoding='utf-8', header=0)[0]
df.to_excel('xian_tianqi.xlsx', index=False)
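First, let's look at the data on the web page:
Now let's look at the data in Excel:
Blown away? Honestly, so was I...
Explanation
Part of the read_html() source code is shown below:
# Some code omitted; to see the details, run this in the command line: print(pd.read_html.__doc__)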
def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, attrs=None, parse_dates=False,
              tupleize_cols=None, thousands=',', encoding=None,
              decimal='.', converters=None, na_values=None,
              keep_default_na=True, displayed_only=True):
    r"""Read HTML tables into a ``list`` of ``DataFrame`` objects.

    Parameters
    ----------
    io : str or file-like
        A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only accepts the http, ftp and file url protocols. If you have a
        URL that starts with ``'https'`` you might try removing the ``'s'``.

    flavor : str or None, container of strings
        The parsing engine to use. 'bs4' and 'html5lib' are synonymous with
        each other, they are both there for backwards compatibility. The
        default of ``None`` tries to use ``lxml`` to parse and if that fails it
        falls back on ``bs4`` + ``html5lib``.

    header : int or list-like or None, optional
        The row (or list of rows for a :class:`~pandas.MultiIndex`) to use to
        make the columns headers.
......
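As you can see, the io parameter of read_html() accepts several kinds of input, and a URL is one of them. By default the function first tries lxml to parse the cells of every table tag (falling back on bs4 + html5lib if that fails), and finally returns a list of DataFrame objects. Index into that list to get the DataFrame you want.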
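To make the "list of DataFrame objects" part concrete, here is a minimal sketch that skips the network entirely. The HTML snippet, station names and numbers are all made up for illustration; they are not data from air-level.com. The HTML is wrapped in StringIO because newer pandas versions discourage passing a literal HTML string directly.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; the values are placeholders, not real readings.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Station</th><th>AQI</th></tr>
  <tr><td>Station A</td><td>85</td></tr>
  <tr><td>Station B</td><td>120</td></tr>
</table>
<table>
  <tr><th>City</th><th>Rank</th></tr>
  <tr><td>City X</td><td>1</td></tr>
</table>
</body></html>
"""

# read_html() returns one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML), header=0)
print(len(tables))

# Pick a table by position, just like the [0] in the three-liner above.
aqi_df = tables[0]
print(aqi_df)

# match keeps only the tables whose text matches the given string or regex.
rank_tables = pd.read_html(StringIO(SAMPLE_HTML), match="Rank", header=0)
print(len(rank_tables))
```

If a page has several tables and [0] is not the one you want, printing len(tables) and peeking at each entry is usually the quickest way to find the right index.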
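Finally
read_html() can only parse static pages. For a dynamically loaded page, fetch the HTML by some other means once the page has finished loading, then pass that text (for example response.text) to read_html() to get the table data.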
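A rough sketch of that last step, assuming you already have the rendered HTML as a string. The helper name tables_from_rendered_html and the rendered_html sample are hypothetical and only stand in for whatever your own fetching code produces (response.text, or something like a browser driver's page source); how you obtain that text is up to you and is not shown here.

```python
from io import StringIO

import pandas as pd


def tables_from_rendered_html(html_text, **read_html_kwargs):
    """Parse every <table> in already-rendered HTML text into DataFrames.

    Hypothetical helper: html_text is whatever string your fetching
    code captured after the page finished loading.
    """
    # StringIO makes the text file-like, which read_html() accepts as io.
    return pd.read_html(StringIO(html_text), **read_html_kwargs)


# Stand-in for HTML captured after the scripts have run (made-up values).
rendered_html = """
<div id="app">
  <table>
    <tr><th>Station</th><th>AQI</th></tr>
    <tr><td>Station A</td><td>85</td></tr>
  </table>
</div>
"""

tables = tables_from_rendered_html(rendered_html, header=0)
df = tables[0]
print(df)
```

From here the DataFrame behaves exactly like the one in the three-liner, so df.to_excel(...) works the same way.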