Chapter 03: Creating and Persisting DataFrames
3.1 Creating a DataFrame
Create a DataFrame from parallel lists:
>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]
Create a dictionary from the lists:
>>> people = {"first": fname, "last": lname, "birth": birth}
Create a DataFrame from the dictionary:
>>> beatles = pd.DataFrame(people)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
How it works
When you call the DataFrame constructor, pandas creates a RangeIndex object:
>>> beatles.index
RangeIndex(start=0, stop=4, step=1)
To supply your own index instead:
>>> pd.DataFrame(people, index=["a", "b", "c", "d"])
first last birth
a Paul McCartney 1942
b John Lennon 1940
c Richard Starkey 1940
d George Harrison 1943
There's more
You can also build a DataFrame from a list of dictionaries:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ]
... )
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
Use the columns parameter to specify the column order:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ],
... columns=["last", "first", "birth"],
... )
last first birth
0 McCartney Paul 1942
1 Lennon John 1940
2 Starkey Richard 1940
3 Harrison George 1943
3.2 Writing CSV
Use the .to_csv method to write a DataFrame to a CSV file:
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> from io import StringIO
>>> fout = StringIO()
>>> beatles.to_csv(fout)  # could also pass a file path here
Inspect the file contents:
>>> print(fout.getvalue())
,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943
There's more
If you read back the CSV you just wrote, it includes a redundant column named Unnamed: 0:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
Use the index_col parameter to specify which column to treat as the index:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout, index_col=0)
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
If you don't want to store the row index in the CSV, set the index parameter to False:
>>> fout = StringIO()
>>> beatles.to_csv(fout, index=False)
>>> print(fout.getvalue())
first,last,birth
Paul,McCartney,1942
John,Lennon,1940
Richard,Starkey,1940
George,Harrison,1943
3.3 Reading Large CSV Files
Pandas processes files in memory. As a rule of thumb, you need 3 to 10 times as much memory as the size of the file you are working with.
This recipe uses the diamonds dataset. The nrows parameter limits the read to the first 1,000 rows:
>>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
>>> diamonds
carat cut color clarity ... price x y z
0 0.23 Ideal E SI2 ... 326 3.95 3.98 2.43
1 0.21 Premium E SI1 ... 326 3.89 3.84 2.31
2 0.23 Good E VS1 ... 327 4.05 4.07 2.31
3 0.29 Premium I VS2 ... 334 4.20 4.23 2.63
4 0.31 Good J SI2 ... 335 4.34 4.35 2.75
.. ... ... ... ... ... ... ... ... ...
995 0.54 Ideal D VVS2 ... 2897 5.30 5.34 3.26
996 0.72 Ideal E SI1 ... 2897 5.69 5.74 3.57
997 0.72 Good F VS1 ... 2897 5.82 5.89 3.48
998 0.74 Premium D VS2 ... 2897 5.81 5.77 3.58
999 1.12 Premium J SI2 ... 2898 6.68 6.61 4.03
Use the .info method to see how much memory the data consumes:
>>> diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float64
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float64
table 1000 non-null float64
price 1000 non-null int64
x 1000 non-null float64
y 1000 non-null float64
z 1000 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB
1,000 rows use about 78.2 KB of memory, so a billion rows would take roughly 78 GB.
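A quick sketch of that extrapolation, assuming memory use scales linearly with the number of rows (diamonds is the 1,000-row frame loaded above):
>>> n = diamonds.memory_usage().sum()  # bytes for 1,000 rows (~80 KB)
>>> n * 1_000_000  # scaled to one billion rows: on the order of 80 GB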
Use the dtype parameter to specify smaller numeric types when reading:
>>> diamonds2 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... },
... )
>>> diamonds2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB
With the numeric types changed, compare the new DataFrame against the original:
>>> diamonds.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689280 61.722800 ... 5.599180 3.457530
std 0.195291 1.758879 ... 0.611974 0.389819
min 0.200000 53.000000 ... 3.750000 2.270000
25% 0.700000 60.900000 ... 5.630000 3.450000
50% 0.710000 61.800000 ... 5.760000 3.550000
75% 0.790000 62.600000 ... 5.910000 3.640000
max 1.270000 69.500000 ... 7.050000 4.330000
>>> diamonds2.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689453 61.718750 ... 5.601562 3.457031
std 0.195312 1.759766 ... 0.611816 0.389648
min 0.199951 53.000000 ... 3.750000 2.269531
25% 0.700195 60.906250 ... 5.628906 3.449219
50% 0.709961 61.812500 ... 5.761719 3.550781
75% 0.790039 62.593750 ... 5.910156 3.640625
max 1.269531 69.500000 ... 7.050781 4.328125
Changing the numeric types saved about 38% of the memory.
The dtype parameter can also convert columns to category. First, use .value_counts to check the cardinality of the string columns:
>>> diamonds2.cut.value_counts()
Ideal 333
Premium 290
Very Good 226
Good 89
Fair 62
Name: cut, dtype: int64
>>> diamonds2.color.value_counts()
E 240
F 226
G 139
D 129
H 125
I 95
J 46
Name: color, dtype: int64
>>> diamonds2.clarity.value_counts()
SI1 306
VS2 218
VS1 159
SI2 154
VVS2 62
VVS1 58
I1 29
IF 14
Name: clarity, dtype: int64
Because these columns have low cardinality, converting them to category saves roughly another 37% of the memory:
>>> diamonds3 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... )
>>> diamonds3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB
Use the usecols parameter to specify which columns to load. Here we skip the x, y, and z columns:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds4 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... )
>>> diamonds4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB
Memory usage is now only about 21% of the original.
If you only need to work on part of the data at a time, use the chunksize parameter:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds_iter = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... chunksize=200,
... )
>>> def process(df):
... return f"processed {df.size} items"
>>> for chunk in diamonds_iter:
... process(chunk)
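Each chunk is itself a DataFrame, so you can aggregate incrementally without ever holding the whole file in memory. A minimal sketch (assuming the same data/diamonds.csv file) that computes the overall mean price chunk by chunk:
>>> total, count = 0, 0
>>> for chunk in pd.read_csv(
...     "data/diamonds.csv", usecols=["price"], chunksize=200
... ):
...     total += chunk.price.sum()
...     count += len(chunk)
>>> mean_price = total / count  # mean over all rows, no full load required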
Because CSV files store no type information, pandas has to infer the type of each column. If every value in a column is an integer and none are missing, pandas uses int64. If the column is numeric but not integral, or has missing values, pandas uses float64. Both of these types take up a lot of memory. For example, if all the values are below 200, you can use a smaller type such as np.int16 (or np.uint8 if they are all non-negative).
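If the data is already loaded, one way to shrink a column is pd.to_numeric with its downcast parameter; a minimal sketch:
>>> small = pd.to_numeric(diamonds.price, downcast="unsigned")
>>> small.dtype  # the smallest unsigned type that holds every value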
If a column is non-numeric, pandas converts it to an object column. object columns consume a lot of memory because every value is stored as a Python string; converting them to category can save a great deal of space, since each distinct string is then stored only once.
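A column that is already loaded as object can be converted with .astype; a short sketch showing that category keeps one copy of each distinct string:
>>> cut_cat = diamonds.cut.astype("category")
>>> cut_cat.cat.categories  # the five unique strings, each stored once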
There's more
Storing the price as int8 would lose information. You can use NumPy's iinfo function to list the limits of the NumPy integer types:
>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)
Use np.finfo to inspect the limits of the floating-point types:
>>> np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04,
max=6.55040e+04, dtype=float16)
You can also ask a DataFrame or Series how many bytes it is using with the .memory_usage method. Note that this includes the index. To get the true memory usage of an object-typed Series, pass deep=True:
>>> diamonds.price.memory_usage()
8080
>>> diamonds.price.memory_usage(index=False)
8000
>>> diamonds.cut.memory_usage()
8080
>>> diamonds.cut.memory_usage(deep=True)
63413
Once you have settled on the data types, you can save the data in a binary format that preserves them, such as Feather (pandas uses the pyarrow library for this):
>>> diamonds4.to_feather("d.arr")
>>> diamonds5 = pd.read_feather("d.arr")
Another option is the Parquet format:
>>> diamonds4.to_parquet("/tmp/d.pqt")
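Reading it back is symmetric (a one-line sketch; pandas reads Parquet through pyarrow or fastparquet):
>>> diamonds6 = pd.read_parquet("/tmp/d.pqt")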
3.4 Working with Excel Files
You need to install a third-party engine, xlwt (for .xls) or openpyxl (for .xlsx), to write Excel files.
Use the .to_excel method to save the DataFrame:
>>> beatles.to_excel("beat.xls")
>>> beatles.to_excel("beat.xlsx")
Use read_excel to read it back:
>>> beat2 = pd.read_excel("beat.xls")
>>> beat2
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
Use the index_col parameter to specify the row index:
>>> beat2 = pd.read_excel("beat.xls", index_col=0)
>>> beat2
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
The data types survive the round trip through Excel:
>>> beat2.dtypes
first object
last object
birth int64
dtype: object
There's more
Use the sheet_name parameter to write multiple named sheets:
>>> xl_writer = pd.ExcelWriter("beat2.xlsx")
>>> beatles.to_excel(xl_writer, sheet_name="All")
>>> beatles[beatles.birth < 1941].to_excel(
... xl_writer, sheet_name="1940"
... )
>>> xl_writer.save()
The resulting Excel file contains two sheets, one named All and the other named 1940.
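To load one of those sheets later, pass its name to read_excel; a short sketch:
>>> pd.read_excel("beat2.xlsx", sheet_name="1940", index_col=0)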
3.5 Working with ZIP Files
If the CSV is the only file in the ZIP archive, you can pass the archive straight to the read_csv function:
>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos
barrels08 barrelsA08 ... phevHwy phevComb
0 15.695714 0.0 ... 0 0
1 29.964545 0.0 ... 0 0
2 12.207778 0.0 ... 0 0
3 29.964545 0.0 ... 0 0
4 17.347895 0.0 ... 0 0
... ... ... ... ... ...
41139 14.982273 0.0 ... 0 0
41140 14.330870 0.0 ... 0 0
41141 15.695714 0.0 ... 0 0
41142 15.695714 0.0 ... 0 0
41143 18.311667 0.0 ... 0 0
>>> autos.modifiedOn.dtype
dtype('O')
The column holding dates was loaded as strings. You can parse it at load time with the parse_dates parameter of read_csv, or parse it after loading with the to_datetime function:
>>> autos = pd.read_csv(
... "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
... )
>>> autos.modifiedOn
0 2013-01-0...
1 2013-01-0...
2 2013-01-0...
3 2013-01-0...
4 2013-01-0...
...
41139 2013-01-0...
41140 2013-01-0...
41141 2013-01-0...
41142 2013-01-0...
41143 2013-01-0...
Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]
Without parse_dates, the column comes in as strings, which you can convert after the fact:
>>> autos.modifiedOn
0 Tue Jan 01 00:00:00 EST 2013
1 Tue Jan 01 00:00:00 EST 2013
2 Tue Jan 01 00:00:00 EST 2013
3 Tue Jan 01 00:00:00 EST 2013
4 Tue Jan 01 00:00:00 EST 2013
...
39096 Tue Jan 01 00:00:00 EST 2013
39097 Tue Jan 01 00:00:00 EST 2013
39098 Tue Jan 01 00:00:00 EST 2013
39099 Tue Jan 01 00:00:00 EST 2013
39100 Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, Length: 39101, dtype: object
>>> pd.to_datetime(autos.modifiedOn)
0 2013-01-01
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
...
39096 2013-01-01
39097 2013-01-01
39098 2013-01-01
39099 2013-01-01
39100 2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]
If the ZIP archive contains more than one file, use the zipfile module. Because the second row of this dataset holds the survey questions, we pull it out into kag_questions and keep the remaining rows as survey:
>>> import zipfile
>>> with zipfile.ZipFile(
... "data/kaggle-survey-2018.zip"
... ) as z:
... print("\n".join(z.namelist()))
... kag = pd.read_csv(
... z.open("multipleChoiceResponses.csv")
... )
... kag_questions = kag.iloc[0]
... survey = kag.iloc[1:]
multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
>>> survey.head(2).T
1 2
Time from... 710 434
Q1 Female Male
Q1_OTHER_... -1 -1
Q2 45-49 30-34
Q3 United S... Indonesia
... ... ...
Q50_Part_5 NaN NaN
Q50_Part_6 NaN NaN
Q50_Part_7 NaN NaN
Q50_Part_8 NaN NaN
Q50_OTHER... -1 -1
There's more
If the compressed archive holds just a single file, read_csv can also read GZIP, BZ2, and XZ files directly.
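pandas infers the compression from the file suffix in both directions, so a round trip needs no extra parameters; a minimal sketch:
>>> beatles.to_csv("beat.csv.gz", index=False)  # compressed on write
>>> pd.read_csv("beat.csv.gz")  # decompressed on read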
3.6 Working with Databases
Create a SQLite database to store the Beatles data:
>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
... cur = con.cursor()
... cur.execute("""DROP TABLE Band""")
... cur.execute(
... """CREATE TABLE Band(id INTEGER PRIMARY KEY,
... fname TEXT, lname TEXT, birthyear INT)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 0, 'Paul', 'McCartney', 1942)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 1, 'John', 'Lennon', 1940)"""
... )
... _ = con.commit()
Read from the database into a DataFrame, here through SQLAlchemy:
>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
... "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
... "Band", sa_connection, index_col="id"
... )
>>> beat
fname lname birthyear
id
0 Paul McCartney 1942
1 John Lennon 1940
You can also read data with a SQL query, using either the SQLite connection or the SQLAlchemy connection:
>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
fname birthyear
0 Paul 1942
1 John 1940
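When a query needs user-supplied values, pass them through the params argument instead of formatting them into the SQL string; a sketch using sqlite3's qmark placeholders:
>>> pd.read_sql(
...     "SELECT fname, birthyear FROM Band WHERE birthyear >= ?",
...     con,
...     params=(1941,),
... )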
3.7 Reading JSON
Encode and load JSON data:
>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}
Use read_json to read JSON; the dictionary keys become the column names:
>>> beatles = pd.read_json(encoded)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
When reading JSON, pandas supports several data orientations:
- columns — (the default) maps column names to lists of the values in each column;
- records — a list of rows, where each row is a dictionary mapping column names to values;
- split — maps columns to the column names, index to the index values, and data to a list of rows;
- index — maps index values to rows, where each row is a dictionary mapping column names to values;
- values — a list of rows (each itself a list); column names and index values are not included;
- table — maps schema to the DataFrame schema and data to a list of dictionaries.
The following code demonstrates each orientation:
>>> records = beatles.to_json(orient="records")
>>> records
'[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
>>> pd.read_json(records, orient="records")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> split = beatles.to_json(orient="split")
>>> split
'{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
>>> pd.read_json(split, orient="split")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> index = beatles.to_json(orient="index")
>>> index
'{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
>>> pd.read_json(index, orient="index")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> values = beatles.to_json(orient="values")
>>> values
'[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
>>> pd.read_json(values, orient="values")
0 1 2
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> (
... pd.read_json(values, orient="values").rename(
... columns=dict(
... enumerate(["first", "last", "birth"])
... )
... )
... )
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> table = beatles.to_json(orient="table")
>>> table
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
>>> pd.read_json(table, orient="table")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
There's more
To convert the data to a dictionary, which makes it easy to attach extra data before serializing, use the .to_dict method:
>>> output = beat.to_dict()
>>> output
{'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
>>> output["version"] = "0.4.1"
>>> json.dumps(output)
'{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'
3.8 Reading HTML Tables
You can use pandas to read the tables in an HTML page:
# Translator's note: en.wikipedia.org may be inaccessible in some regions
>>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
>>> dfs = pd.read_html(url)
>>> len(dfs)
51
That loads 51 DataFrames; inspect the first one:
>>> dfs[0]
The Beatles discography The Beatles discography.1
0 The Beat... The Beat...
1 Studio a... 23
2 Live albums 5
3 Compilat... 53
4 Video al... 15
5 Music vi... 64
6 EPs 21
7 Singles 63
8 Mash-ups 2
9 Box sets 15
read_html has a match parameter that accepts a string or a regular expression, and an attrs parameter for locating a table by its HTML attributes.
Inspect the table element in the HTML:
<table class="wikitable plainrowheaders" style="text-align:center;">
<caption>List of studio albums,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[A]</a></sup> with selected chart positions and certifications
</caption>
<tbody>
<tr>
<th scope="col" rowspan="2" style="width:20em;">Title
</th>
<th scope="col" rowspan="2" style="width:20em;">Release
...
The table has no distinguishing attributes, but we can match it with the string List of studio albums. Missing values are marked with "—", so pass that as na_values:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
... url, match="List of studio albums", na_values="—"
... )
>>> len(dfs)
1
>>> dfs[0].columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Using the first two rows as column names helps, but the result is still messy:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
... url,
... match="List of studio albums",
... na_values="—",
... header=[0, 1],
... )
>>> len(dfs)
1
>>> dfs[0]
Title Release ... Peak chart positions Certifications
Title Release ... US[8][9] Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
>>> dfs[0].columns
MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])
In this case, the easiest fix is to overwrite the column index:
>>> df = dfs[0]
>>> df.columns = [
... "Title",
... "Release",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "Certifications",
... ]
>>> df
Title Release ... US Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
Next, clean up the Release column. In the HTML, its cells look like this:
<th scope="row" rowspan="2">
<i><a href="/wiki/A_Hard_Day%27s_Night_(album)" title="A Hard Day's Night (album)">A Hard Day's Night</a></i>
<img alt="double-dagger" src="//upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" decoding="async" width="9" height="14" data-file-width="9" data-file-height="14">
</th>
Split the Release column into separate release_date and label columns:
>>> res = (
... df.pipe(
... lambda df_: df_[
... ~df_.Title.str.startswith("Released")
... ]
... )
... .assign(
... release_date=lambda df_: pd.to_datetime(
... df_.Release.str.extract(
... r"Released: (.*) Label"
... )[0].str.replace(r"\[E\]", "")
... ),
... label=lambda df_: df_.Release.str.extract(
... r"Label: (.*)"
... ),
... )
... .loc[
... :,
... [
... "Title",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "release_date",
... "label",
... ],
... ]
... )
>>> res
Title UK ... release_date label
0 Please P... 1 ... 1963-03-22 Parlopho...
1 With the... 1 ... 1963-11-22 Parlopho...
2 Introduc... NaN ... 1964-01-10 Vee-Jay ...
3 Meet the... NaN ... 1964-01-20 Capitol ...
4 Twist an... NaN ... 1964-02-03 Capitol ...
.. ... ... ... ... ...
21 Magical ... 31 ... 1967-11-27 Parlopho...
22 The Beat... 1 ... 1968-11-22 Apple
23 Yellow S... 3 ... 1969-01-13 Apple (U...
24 Abbey Road 1 ... 1969-09-26 Apple
25 Let It Be 1 ... 1970-05-08 Apple
There's more
GitHub renders CSV files as HTML tables, so you can scrape a CSV hosted there straight from its web page:
>>> url = "https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv"
>>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
>>> len(dfs)
1
>>> dfs[0]
Unnamed: 0 quadrant x y
0 NaN I 10.0 8.04
1 NaN I 14.0 9.96
2 NaN I 6.0 7.24
3 NaN I 9.0 8.81
4 NaN I 4.0 4.26
.. ... ... ... ...
39 NaN IV 8.0 6.58
40 NaN IV 8.0 7.91
41 NaN IV 8.0 8.47
42 NaN IV 8.0 5.25
43 NaN IV 8.0 6.89