第01章 Pandas基础
 第02章 DataFrame基础运算
第03章创建和持久化DataFrame
第04章开始数据分析
 第05章探索性数据分析
 第06章选取数据子集
 第07章过滤行
 第08章索引对齐

3.1 创建DataFrame

使用平行的列表创建DataFrame

>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]

创建字典：

>>> people = {"first": fname, "last": lname, "birth": birth}

用该字典创建DataFrame：

>>> beatles = pd.DataFrame(people)
>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

原理

当调用DataFrame构造器时，Pandas会创建一个RangeIndex对象：

>>> beatles.index
RangeIndex(start=0, stop=4, step=1)

重新指定索引：

>>> pd.DataFrame(people, index=["a", "b", "c", "d"])
     first       last  birth
a     Paul  McCartney   1942
b     John     Lennon   1940
c  Richard    Starkey   1940
d   George   Harrison   1943

还可以用字典构成的列表构建DataFrame：

>>> pd.DataFrame(
...     [
...         {
...             "first": "Paul",
...             "last": "McCartney",
...             "birth": 1942,
...         },
...         {
...             "first": "John",
...             "last": "Lennon",
...             "birth": 1940,
...         },
...         {
...             "first": "Richard",
...             "last": "Starkey",
...             "birth": 1940,
...         },
...         {
...             "first": "George",
...             "last": "Harrison",
...             "birth": 1943,
...         },
...     ]
... )
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison

使用columns参数指定列的顺序：

>>> pd.DataFrame(
...     [
...         {
...             "first": "Paul",
...             "last": "McCartney",
...             "birth": 1942,
...         },
...         {
...             "first": "John",
...             "last": "Lennon",
...             "birth": 1940,
...         },
...         {
...             "first": "Richard",
...             "last": "Starkey",
...             "birth": 1940,
...         },
...         {
...             "first": "George",
...             "last": "Harrison",
...             "birth": 1943,
...         },
...     ],
...     columns=["last", "first", "birth"],
... )
        last    first  birth
0  McCartney     Paul   1942
1     Lennon     John   1940
2    Starkey  Richard   1940
3   Harrison   George   1943

3.2 写入CSV

将DataFrame写入CSV文件：

使用.to_csv方法将DataFrame写入CSV文件：

>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> from io import StringIO
>>> fout = StringIO()
>>> beatles.to_csv(fout)  # 使用文件名

查看文件内容：

>>> print(fout.getvalue())
,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943

如果读取刚刚保存的CSV，会读入列名为Unnamed: 0的冗余列：

>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
   Unnamed: 0    first       last  birth
0           0     Paul  McCartney   1942
1           1     John     Lennon   1940
2           2  Richard    Starkey   1940
3           3   George   Harrison   1943

使用index_col参数，可以指定列：

>>> _ = fout.seek(0)
>>> pd.read_csv(fout, index_col=0)
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

如果存CSV文件时，不想保存行索引，可以将index参数设为False：

>>> fout = StringIO()
>>> beatles.to_csv(fout, index=False)
>>> print(fout.getvalue())
first,last,birth
Paul,McCartney,1942
John,Lennon,1940
Richard,Starkey,1940
George,Harrison,1943

3.3 读取大CSV文件

Pandas是在内存中处理文件的，通常来讲，内存的大小需要是文件大小的3至10倍。

这里使用的是diamonds数据集。使用nrows参数读取1000行数据。

>>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
>>> diamonds
     carat      cut color clarity  ...  price     x     y     z
0     0.23    Ideal     E     SI2  ...    326  3.95  3.98  2.43
1     0.21  Premium     E     SI1  ...    326  3.89  3.84  2.31
2     0.23     Good     E     VS1  ...    327  4.05  4.07  2.31
3     0.29  Premium     I     VS2  ...    334  4.20  4.23  2.63
4     0.31     Good     J     SI2  ...    335  4.34  4.35  2.75
..     ...      ...   ...     ...  ...    ...   ...   ...   ...
995   0.54    Ideal     D    VVS2  ...   2897  5.30  5.34  3.26
996   0.72    Ideal     E     SI1  ...   2897  5.69  5.74  3.57
997   0.72     Good     F     VS1  ...   2897  5.82  5.89  3.48
998   0.74  Premium     D     VS2  ...   2897  5.81  5.77  3.58
999   1.12  Premium     J     SI2  ...   2898  6.68  6.61  4.03

使用.info方法查看消耗的内存量：

>>> diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float64
cut        1000 non-null object
color      1000 non-null object
clarity    1000 non-null object
depth      1000 non-null float64
table      1000 non-null float64
price      1000 non-null int64
x          1000 non-null float64
y          1000 non-null float64
z          1000 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB

可以看到1000行数据使用了78.2KB内存。如果有10亿行数据，则要占用78GB的内存。

使用dtype参数，设置读取的数值类型：

>>> diamonds2 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "x": np.float32,
...         "y": np.float32,
...         "z": np.float32,
...         "price": np.int16,
...     },
... )
>>> diamonds2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float32
cut        1000 non-null object
color      1000 non-null object
clarity    1000 non-null object
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
x          1000 non-null float32
y          1000 non-null float32
z          1000 non-null float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB

改变了数值类型，对比下新的DataFrame和原先的DataFrame：

>>> diamonds.describe()
             carat        depth  ...            y            z
count  1000.000000  1000.000000  ...  1000.000000  1000.000000
mean      0.689280    61.722800  ...     5.599180     3.457530
std       0.195291     1.758879  ...     0.611974     0.389819
min       0.200000    53.000000  ...     3.750000     2.270000
25%       0.700000    60.900000  ...     5.630000     3.450000
50%       0.710000    61.800000  ...     5.760000     3.550000
75%       0.790000    62.600000  ...     5.910000     3.640000
max       1.270000    69.500000  ...     7.050000     4.330000
>>> diamonds2.describe()
             carat        depth  ...            y            z
count  1000.000000  1000.000000  ...  1000.000000  1000.000000
mean      0.689453    61.718750  ...     5.601562     3.457031
std       0.195312     1.759766  ...     0.611816     0.389648
min       0.199951    53.000000  ...     3.750000     2.269531
25%       0.700195    60.906250  ...     5.628906     3.449219
50%       0.709961    61.812500  ...     5.761719     3.550781
75%       0.790039    62.593750  ...     5.910156     3.640625
max       1.269531    69.500000  ...     7.050781     4.328125

可以看到通过改变数据类型，节省了38%的内存。

使用dtype参数，将数据类型改为category。使用.value_counts先统计数据个数：

>>> diamonds2.cut.value_counts(
Ideal       333
Premium     290
Very Good   226
Good         89
Fair         62
Name: cut, dtype: int64
>>> diamonds2.color.value_counts()
E    240
F    226
G    139
D    129
H    125
I     95
J     46
Name: color, dtype: int64
>>> diamonds2.clarity.value_counts()
SI1     306
VS2     218
VS1     159
SI2     154
VVS2     62
VVS1     58
I1       29
IF       14
Name: clarity, dtype: int64

因为是低基数，将其转换为category，可以节省约37%的内存：

>>> diamonds3 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...      dtype={
...        "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "x": np.float32,
...         "y": np.float32,
...         "z": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
... )
>>> diamonds3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float32
cut        1000 non-null category
color      1000 non-null category
clarity    1000 non-null category
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
x          1000 non-null float32
y          1000 non-null float32
z          1000 non-null float32
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB

使用参数usecols，可以指定加载哪些列。这里忽略了x、y、z三列：

>>> cols = [
...     "carat",
...     "cut",
...     "color",
...     "clarity",
...     "depth",
...     "table",
...     "price",
... ]
>>> diamonds4 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...        "depth": np.float32,
...         "table": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
...     usecols=cols,
... )
>>> diamonds4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
carat      1000 non-null float32
cut        1000 non-null category
color      1000 non-null category
clarity    1000 non-null category
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB

现在使用的内存只有原始的21%。

如果每次只处理数据的一部分，可以使用chunksize参数：

>>> cols = [
...     "carat",
...     "cut",
...     "color",
...     "clarity",
...     "depth",
...     "table",
...     "price",
... ]
>>> diamonds_iter = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
...     usecols=cols,
...     chunksize=200,
... )
>>> def process(df):
...     return f"processed {df.size} items"
>>> for chunk in diamonds_iter:
...     process(chunk)

因为CSV文件不保存数据类型，Pandas需要推断每列的数据类型是什么。如果一列的值都是整数，并且没有缺失值，则Pandas将其认定为int64。如果一列是数值类型，但不是整数，或存在缺失值，Pandas使用的是float64。这两种数据类型占用的内存比较大。例如，如果所有数都在200以下，你可以使用一个小的数据类型，比如np.int16（或np.int8，如果都是正数）。

如果某列都是非数值类型，Pandas会将其转换为object类型。object类型占用内存很多，因为它是将数据以Python字符串存储的，将类型改为category，可以大大节省空间，因为它对每个字符串只存储一次。

如果价格使用int8，会导致丢失信息。你可以使用NumPy的iinfo函数列出NumPy整数类型的范围：

>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)

使用.finfo可以查看浮点数类型的范围：

>>> np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04,
      max=6.55040e+04, dtype=float16)

还可以用.memory_usage方法查询DataFrame或Series使用了多少字节。注意，其中是包含行索引的。另外，传入deep=True，可以查询带有对象类型的Series的内存用量：

>>> diamonds.price.memory_usage()
8080
>>> diamonds.price.memory_usage(index=False)
8000
>>> diamonds.cut.memory_usage()
8080
>>> diamonds.cut.memory_usage(deep=True)
63413

一旦确定了数据类型，可以将其以二进制并带有数据类型的形式保存下来，比如Feather格式，Pandas使用的是pyarrow库。

>>> diamonds4.to_feather("d.arr")
>>> diamonds5 = pd.read_feather("d.arr")

另一种方法是使用Parquet格式。

>>> diamonds4.to_parquet("/tmp/d.pqt")

3.4 使用Excel文件

需要安装xlwt或openpyxl来写入XLSX文件。

使用.to_excel方法，进行存储：

>>> beatles.to_excel("beat.xls")
>>> beatles.to_excel("beat.xlsx")

使用read_excel读取Excel文件：

>>> beat2 = pd.read_excel("/tmp/beat.xls")
>>> beat2
   Unnamed: 0    first       last  birth
0           0     Paul  McCartney   1942
1           1     John     Lennon   1940
2           2  Richard    Starkey   1940
3           3   George   Harrison   1943

用参数index_col，指定行索引：

>>> beat2 = pd.read_excel("/tmp/beat.xls", index_col=0)
>>> beat2
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

Excel保存了数据类型：

>>> beat2.dtypes
first    object
last     object
birth     int64
dtype: object

使用sheet_name参数命名表单：

>>> xl_writer = pd.ExcelWriter("beat2.xlsx")
>>> beatles.to_excel(xl_writer, sheet_name="All")
>>> beatles[beatles.birth < 1941].to_excel(
...     xl_writer, sheet_name="1940"
... )
>>> xl_writer.save()

这个Excel文件包含两个表单，一个名字是All，一个名字是1940。

3.5 使用ZIP文件

如果CSV文件是ZIP文件中的唯一文件，可以直接使用read_csv函数：

>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos
       barrels08  barrelsA08  ...  phevHwy  phevComb
0      15.695714         0.0  ...        0         0
1      29.964545         0.0  ...        0         0
2      12.207778         0.0  ...        0         0
3      29.964545         0.0  ...        0         0
4      17.347895         0.0  ...        0         0
...          ...         ...  ...      ...       ...
41139  14.982273         0.0  ...        0         0
41140  14.330870         0.0  ...        0         0
41141  15.695714         0.0  ...        0         0
41142  15.695714         0.0  ...        0         0
41143  18.311667         0.0  ...        0         0
>>> autos.modifiedOn.dtype
dtype('O')

因为CSV文件中包含日期的列，它是字符串。可以在使用read_csv时使用parse_dates加载文件，另一种方法是加载文件后用to_datetime方法解析：

>>> autos= pd.read_csv(
...     "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
... )
>>> autos.modifiedOn
0       2013-01-0...
1       2013-01-0...
2       2013-01-0...
3       2013-01-0...
4       2013-01-0...
            ...     
41139   2013-01-0...
41140   2013-01-0...
41141   2013-01-0...
41142   2013-01-0...
41143   2013-01-0...
Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]

>>> autos.modifiedOn
0        Tue Jan 01 00:00:00 EST 2013
1        Tue Jan 01 00:00:00 EST 2013
2        Tue Jan 01 00:00:00 EST 2013
3        Tue Jan 01 00:00:00 EST 2013
4        Tue Jan 01 00:00:00 EST 2013
                     ...
39096    Tue Jan 01 00:00:00 EST 2013
39097    Tue Jan 01 00:00:00 EST 2013
39098    Tue Jan 01 00:00:00 EST 2013
39099    Tue Jan 01 00:00:00 EST 2013
39100    Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, Length: 39101, dtype: object
>>> pd.to_datetime(autos.modifiedOn)
0       2013-01-01
1       2013-01-01
2       2013-01-01
3       2013-01-01
4       2013-01-01
           ...
39096   2013-01-01
39097   2013-01-01
39098   2013-01-01
39099   2013-01-01
39100   2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]

如果ZIP文件中有多个文件，可以使用zipfile模块。因为数据集第二行中包含问题，将其存入kag_questions。

>>> import zipfile
>>> with zipfile.ZipFile(
...     "data/kaggle-survey-2018.zip"
... ) as z:
...     print("\n".join(z.namelist()))
...     kag = pd.read_csv(
...         z.open("multipleChoiceResponses.csv")
...     )
...     kag_questions = kag.iloc[0]
...     survey = kag.iloc[1:]
multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
>>> survey.head(2).T
1          2
Time from...          710        434
Q1                 Female       Male
Q1_OTHER_...           -1         -1
Q2                  45-49      30-34
Q3            United S...  Indonesia
...                   ...        ...
Q50_Part_5            NaN        NaN
Q50_Part_6            NaN        NaN
Q50_Part_7            NaN        NaN
Q50_Part_8            NaN        NaN
Q50_OTHER...           -1         -1

如果压缩文件中只有一个文件，则read_csv方法还可以读取GZIP、BZ2和XZ文件。

3.6 使用数据库

创建SQLite数据库，存储Beatles信息：

>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
...     cur = con.cursor()
...     cur.execute("""DROP TABLE Band""")
...     cur.execute(
...         """CREATE TABLE Band(id INTEGER PRIMARY KEY,
...         fname TEXT, lname TEXT, birthyear INT)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         0, 'Paul', 'McCartney', 1942)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         1, 'John', 'Lennon', 1940)"""
...     )
...     _ = con.commit()

从DataFrame读取数据库，这里使用的是SQLAlchemy：

>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
...     "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
...     "Band", sa_connection, index_col="id"
... )
>>> beat
   fname      lname  birthyear
id                            
0   Paul  McCartney       1942
1   John     Lennon       1940

使用SQL语句读取数据。可以使用SQLite或SQLAlchemy连接：

>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
  fname  birthyear
0  Paul       1942
1  John       1940

3.7 读取JSON

JSON数据的编码和加载：

>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paull', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}

使用.read_json读取json数据，使用字典中的键名作为列名。

>>> beatles = pd.read_json(encoded)
>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

读取JSON时，Pandas支持一些特定的方式：

columns —— （默认）将列名映射为列中的值的列表；
records —— 行的列表。每行是一个字典，一行映射到一个值；
split —— columns映射到列名，index映射到行索引值，data映射到每行数据组成的列表；
index —— 将索引映射到行，每行是一个列映射到值的字典；
values —— 数据行构成的列表（每行也是列表）。不包含列和行索引的值；
table —— 将schema映射到DataFrame的纲要，data映射为字典的列表。

参考下面的代码：

>>> records = beatles.to_json(orient="records")
>>> records
'[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
>>> pd.read_json(records, orient="records")
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison
>>> split = beatles.to_json(orient="split")
>>> split
'{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
>>> pd.read_json(split, orient="split")
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> index = beatles.to_json(orient="index")
>>> index
'{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
>>> pd.read_json(index, orient="index")
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison
>>> values = beatles.to_json(orient="values")
>>> values
'[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
>>> pd.read_json(values, orient="values")
         0          1     2
0     Paul  McCartney  1942
1     John     Lennon  1940
2  Richard    Starkey  1940
3   George   Harrison  1943
>>> (
...     pd.read_json(values, orient="values").rename(
...         columns=dict(
...             enumerate(["first", "last", "birth"])
...         )
...     )
... )
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> table = beatles.to_json(orient="table")
>>> table
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
>>> pd.read_json(table, orient="table")
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

如果要将数据转换为字典，可以使用.to_dict方法，方便添加数据：

>>> output = beat.to_dict()
>>> output
{'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
>>> output["version"] = "0.4.1"
>>> json.dumps(output)
'{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'

3.8 读取HTML表格

可以使用Pandas读取HTML中的表格：

Wikipedia中的表格

# 译者注：en.wikipedia.org 需要科学上网
>>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
>>> dfs = pd.read_html(url)
>>> len(dfs)
51

一共读取了51个df，检查一下第一个：

>>> dfs[0]
  The Beatles discography The Beatles discography.1
0  The Beat...             The Beat...
1  Studio a...                      23
2  Live albums                       5
3  Compilat...                      53
4  Video al...                      15
5  Music vi...                      64
6          EPs                      21
7      Singles                      63
8     Mash-ups                       2
9     Box sets                      15

.read_html有一个match参数，可以是字符串或正则表达式。还有一个attrs参数，用于定位HTML标签。

检查下HTML的table元素：

<table class="wikitable plainrowheaders" style="text-align:center;">
  <caption>List of studio albums,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[A]</a></sup> with selected chart positions and certifications
  </caption>
  <tbody>
    <tr>
      <th scope="col" rowspan="2" style="width:20em;">Title
      </th>
      <th scope="col" rowspan="2" style="width:20em;">Release
       ...

虽然没有属性，但可以使用字符串List of studio albums来匹配，缺失值na_values用"—"表示：

>>> url = https://en.wikipedia.org/wiki/The_Beatles_discography
>>> dfs = pd.read_html(
...     url, match="List of studio albums", na_values="—"
... )
>>> len(dfs)
1
>>> dfs[0].columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

使用前两行作为列名，但还是很乱：

>>> url = https://en.wikipedia.org/wiki/The_Beatles_discography
>>> dfs = pd.read_html(
...     url,
...     match="List of studio albums",
...     na_values="—",
...     header=[0, 1],
... )
>>> len(dfs)
1
>>> dfs[0]
          Title      Release  ... Peak chart positions Certifications
          Title      Release  ...             US[8][9] Certifications
0   Please P...  Released...  ...          NaN          BPI: Gol...
1   With the...  Released...  ...          NaN          BPI: Gol...
2   Introduc...  Released...  ...            2          RIAA: Pl...
3   Meet the...  Released...  ...            1          MC: Plat...
4   Twist an...  Released...  ...          NaN          MC: 3× P...
..          ...          ...  ...          ...                  ...
22  The Beat...  Released...  ...            1          BPI: 2× ...
23  Yellow S...  Released...  ...            2          BPI: Gol...
24   Abbey Road  Released...  ...            1          BPI: 2× ...
25    Let It Be  Released...  ...            1          BPI: Gol...
26  "—" deno...  "—" deno...  ...  "—" deno...          "—" deno...
>>> dfs[0].columns
MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
  codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])

对于这种情况，最容易的方法是更新列索引：

>>> df = dfs[0]
>>> df.columns = [
...     "Title",
...     "Release",
...     "UK",
...     "AUS",
...     "CAN",
...     "FRA",
...     "GER",
...     "NOR",
...     "US",
...     "Certifications",
... ]
>>> df
          Title      Release  ...           US Certifications
0   Please P...  Released...  ...          NaN  BPI: Gol...
1   With the...  Released...  ...          NaN  BPI: Gol...
2   Introduc...  Released...  ...            2  RIAA: Pl...
3   Meet the...  Released...  ...            1  MC: Plat...
4   Twist an...  Released...  ...          NaN  MC: 3× P...
..          ...          ...  ...          ...          ...
22  The Beat...  Released...  ...            1  BPI: 2× ...
23  Yellow S...  Released...  ...            2  BPI: Gol...
24   Abbey Road  Released...  ...            1  BPI: 2× ...
25    Let It Be  Released...  ...            1  BPI: Gol...
26  "—" deno...  "—" deno...  ...  "—" deno...  "—" deno...

继续清理Release这列，在HTML中，代码是这样的：

<th scope="row" rowspan="2">
  <i><a href="/wiki/A_Hard_Day%27s_Night_(album)" title="A Hard Day's Night (album)">A Hard Day's Night</a></i>
  <img alt="double-dagger" src="//upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" decoding="async" width="9" height="14" data-file-width="9" data-file-height="14">
</th>

将release列分别存入release_date 和label两列：

>>> res = (
...     df.pipe(
...         lambda df_: df_[
...             ~df_.Title.str.startswith("Released")
...         ]
...     )
...     .assign(
...         release_date=lambda df_: pd.to_datetime(
...             df_.Release.str.extract(
...                 r"Released: (.*) Label"
...             )[0].str.replace(r"\[E\]", "")
...         ),
...         label=lambda df_: df_.Release.str.extract(
...             r"Label: (.*)"
...         ),
...     )
...     .loc[
...         :,
...         [
...             "Title",
...             "UK",
...             "AUS",
...             "CAN",
...             "FRA",
...             "GER",
...             "NOR",
...             "US",
...             "release_date",
...             "label",
...         ],
...     ]
... )
>>> res
          Title   UK  ... release_date        label
0   Please P...    1  ...   1963-03-22  Parlopho...
1   With the...    1  ...   1963-11-22  Parlopho...
2   Introduc...  NaN  ...   1964-01-10  Vee-Jay ...
3   Meet the...  NaN  ...   1964-01-20  Capitol ...
4   Twist an...  NaN  ...   1964-02-03  Capitol ...
..          ...  ...  ...          ...          ...
21  Magical ...   31  ...   1967-11-27  Parlopho...
22  The Beat...    1  ...   1968-11-22        Apple
23  Yellow S...    3  ...   1969-01-13  Apple (U...
24   Abbey Road    1  ...   1969-09-26        Apple
25    Let It Be    1  ...   1970-05-08        Apple

直接读取线上的csv文件：

>>> url = https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv
>>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
>>> len(dfs)
1
>>> dfs[0]
    Unnamed: 0 quadrant     x     y
0          NaN        I  10.0  8.04
1          NaN        I  14.0  9.96
2          NaN        I   6.0  7.24
3          NaN        I   9.0  8.81
4          NaN        I   4.0  4.26
..         ...      ...   ...   ...
39         NaN       IV   8.0  6.58
40         NaN       IV   8.0  7.91
41         NaN       IV   8.0  8.47
42         NaN       IV   8.0  5.25
43         NaN       IV   8.0  6.89

《Pandas 1.x Cookbook · 第二版》第03章创建和持久化DataFrame

《Pandas 1.x Cookbook · 第二版》第03章创建和持久化DataFrame

3.1 创建DataFrame

原理

更多

3.2 写入CSV

更多

3.3 读取大CSV文件

更多

3.4 使用Excel文件

更多

3.5 使用ZIP文件

更多

3.6 使用数据库

3.7 读取JSON

更多

3.8 读取HTML表格

更多

推荐阅读更多精彩内容

《Pandas 1.x Cookbook · 第二版》第03章 创建和持久化DataFrame

3.1 创建DataFrame

原理

更多

3.2 写入CSV

更多

3.3 读取大CSV文件

更多

3.4 使用Excel文件

更多

3.5 使用ZIP文件

更多

3.6 使用数据库

3.7 读取JSON

更多

3.8 读取HTML表格

更多

推荐阅读更多精彩内容

《Pandas 1.x Cookbook · 第二版》第03章创建和持久化DataFrame