第01章 Pandas基础
第02章 DataFrame基础运算
第03章 创建和持久化DataFrame
第04章 开始数据分析
第05章 探索性数据分析
第06章 选取数据子集
第07章 过滤行
第08章 索引对齐
2.1 从DataFrame中选择多列
使用列名列表提取DataFrame的多列:
>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
... [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
... ]
>>> movie_actor_director.head()
actor_1_name actor_2_name actor_3_name director_name
0 CCH Pounder Joel Dav... Wes Studi James Ca...
1 Johnny Depp Orlando ... Jack Dav... Gore Ver...
2 Christop... Rory Kin... Stephani... Sam Mendes
3 Tom Hardy Christia... Joseph G... Christop...
4 Doug Walker Rob Walker NaN Doug Walker
# 提取单列时,列表和键名提取出来的数据类型不同。
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'> # DataFrame类型
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'> # Series类型
也可以使用loc
提取多列。
>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>
预先将列名存储在列表中,可以提高代码的可读性。
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movie_actor_director = movies[cols]
如果没有使用列表,则会报KeyError
错误。
>>> movies[
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
Traceback (most recent call last):
...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
2.2 使用方法提取多列
缩短列名之后查看每种数据类型的个数:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return (
... str(col)
... .replace("facebook_likes", "fb")
... .replace("_for_reviews", "")
... )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64 13
int64 3
object 12
dtype: int64
使用.select_dtypes
方法提取整型的列:
>>> movies.select_dtypes(include="int").head()
num_voted_users cast_total_fb movie_fb
0 886204 4834 33000
1 471220 48350 0
2 275868 11700 85000
3 1144337 106759 164000
4 8 143 0
选择所有数值类型的列:
>>> movies.select_dtypes(include="number").head()
num_critics duration ... aspect_ratio movie_fb
0 723.0 178.0 ... 1.78 33000
1 302.0 169.0 ... 2.35 0
2 602.0 148.0 ... 2.35 85000
3 813.0 164.0 ... 2.35 164000
4 NaN NaN ... NaN 0
选择整型和字符串的列:
>>> movies.select_dtypes(include=["int", "object"]).head()
color direc/_name ... conte/ating movie_fb
0 Color James Cameron ... PG-13 33000
1 Color Gore Verbinski ... PG-13 0
2 Color Sam Mendes ... PG-13 85000
3 Color Christopher Nolan ... PG-13 164000
4 NaN Doug Walker ... NaN 0
提取所有非浮点类型的列:
>>> movies.select_dtypes(exclude="float").head()
color director_name ... content_rating movie_fb
0 Color James Ca... ... PG-13 33000
1 Color Gore Ver... ... PG-13 0
2 Color Sam Mendes ... PG-13 85000
3 Color Christop... ... PG-13 164000
4 NaN Doug Walker ... NaN 0
使用.filter
方法筛选所有列名中包含fb
的列:
>>> movies.filter(like="fb").head()
director_fb actor_3_fb ... actor_2_fb movie_fb
0 0.0 855.0 ... 936.0 33000
1 563.0 1000.0 ... 5000.0 0
2 0.0 161.0 ... 393.0 85000
3 22000.0 23000.0 ... 23000.0 164000
4 131.0 NaN ... 12.0 0
items
参数可以用来选择多列:
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movies.filter(items=cols).head()
actor_1_name ... director_name
0 CCH Pounder ... James Cameron
1 Johnny Depp ... Gore Verbinski
2 Christoph Waltz ... Sam Mendes
3 Tom Hardy ... Christopher Nolan
4 Doug Walker ... Doug Walker
regex
参数可以用来进行正则匹配,下面的代码提取出了列名中包含数字的列:
>>> movies.filter(regex=r"\d").head()
actor_3_fb actor_2_name ... actor_3_name actor_2_fb
0 855.0 Joel Dav... ... Wes Studi 936.0
1 1000.0 Orlando ... ... Jack Dav... 5000.0
2 161.0 Rory Kin... ... Stephani... 393.0
3 23000.0 Christia... ... Joseph G... 23000.0
4 NaN Rob Walker ... NaN 12.0
2.3 按列名进行排列
对列进行排序的原则:
- 将列分为分类型和连续型;
- 按照分类型和连续型对列分组;
- 分类型排在连续型的前面;
下面是个例子。先读取数据,缩短列名:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return col.replace("facebook_likes", "fb").replace(
... "_for_reviews", ""
... )
>>> movies = movies.rename(columns=shorten)
对下面的列名进行
>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
'movie_fb'],
dtype='object')
>>> cat_core = [
... "movie_title",
... "title_year",
... "content_rating",
... "genres",
... ]
>>> cat_people = [
... "director_name",
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... ]
>>> cat_other = [
... "color",
... "country",
... "language",
... "plot_keywords",
... "movie_imdb_link",
... ]
>>> cont_fb = [
... "director_fb",
... "actor_1_fb",
... "actor_2_fb",
... "actor_3_fb",
... "cast_total_fb",
... "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
... "num_voted_users",
... "num_user",
... "num_critic",
... ]
>>> cont_other = [
... "imdb_score",
... "duration",
... "aspect_ratio",
... "facenumber_in_poster",
... ]
将上面所有列表连起来,组成最终的列的顺序,并确认没有遗漏任何列:
>>> new_col_order = (
... cat_core
... + cat_people
... + cat_other
... + cont_fb
... + cont_finance
... + cont_num_reviews
... + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True
将新的列数组传给movies,得到排好列的对象:
>>> movies[new_col_order].head()
movie_title title_year ... aspect_ratio facenumber_in_poster
0 Avatar 2009.0 ... 1.78 0.0
1 Pirates ... 2007.0 ... 2.35 0.0
2 Spectre 2015.0 ... 2.35 1.0
3 The Dark... 2012.0 ... 2.35 0.0
4 Star War... NaN ... NaN 0.0
2.4 对DataFrame进行概括性分析
查看数据集的属性:shape、size、ndim。
>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
.count
方法可以统计所有的非缺失值:
>>> movies.count()
color 4897
director_name 4814
num_critic_for_reviews 4867
duration 4901
director_facebook_likes 4814
...
title_year 4810
actor_2_facebook_likes 4903
imdb_score 4916
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
.min
、.max
、.mean
、.median
、.std
方法,返回的是数值列的统计信息:
>>> movies.min()
num_critic_for_reviews 1.00
duration 7.00
director_facebook_likes 0.00
actor_3_facebook_likes 0.00
actor_1_facebook_likes 0.00
...
title_year 1916.00
actor_2_facebook_likes 0.00
imdb_score 1.60
aspect_ratio 1.18
movie_facebook_likes 0.00
Length: 16, dtype: float64
.describe
是一个非常强大的方法,可以返回描述性统计信息和分位数,如果想在屏幕中显示更多信息,可以用.T
进行矩阵转置:
>>> movies.describe().T
count mean ... 75% max
num_criti... 4867.0 137.988905 ... 191.00 813.0
duration 4901.0 107.090798 ... 118.00 511.0
director_... 4814.0 691.014541 ... 189.75 23000.0
actor_3_f... 4893.0 631.276313 ... 633.00 23000.0
actor_1_f... 4909.0 6494.488491 ... 11000.00 640000.0
... ... ... ... ... ...
title_year 4810.0 2002.447609 ... 2011.00 2016.0
actor_2_f... 4903.0 1621.923516 ... 912.00 137000.0
imdb_score 4916.0 6.437429 ... 7.20 9.5
aspect_ratio 4590.0 2.222349 ... 2.35 16.0
movie_fac... 4916.0 7348.294142 ... 2000.00 349000.0
.describe
方法中通过percentiles
参数,可以得到任意分位数:
>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
count mean ... 99% max
num_criti... 4867.0 137.988905 ... 546.68 813.0
duration 4901.0 107.090798 ... 189.00 511.0
director_... 4814.0 691.014541 ... 16000.00 23000.0
actor_3_f... 4893.0 631.276313 ... 11000.00 23000.0
actor_1_f... 4909.0 6494.488491 ... 44920.00 640000.0
... ... ... ... ... ...
title_year 4810.0 2002.447609 ... 2016.00 2016.0
actor_2_f... 4903.0 1621.923516 ... 17000.00 137000.0
imdb_score 4916.0 6.437429 ... 8.50 9.5
aspect_ratio 4590.0 2.222349 ... 4.00 16.0
movie_fac... 4916.0 7348.294142 ... 93850.00 349000.0
如果在上述的描述性方法中,将参数skipna
设为False
,则可以将所有列都体现出来:
>>> movies.min(skipna=False)
num_critic_for_reviews NaN
duration NaN
director_facebook_likes NaN
actor_3_facebook_likes NaN
actor_1_facebook_likes NaN
...
title_year NaN
actor_2_facebook_likes NaN
imdb_score 1.6
aspect_ratio NaN
movie_facebook_likes 0.0
Length: 16, dtype: float64
2.5 DataFrame的链式方法
使用.isnull
方法,判断每个值是否是缺失值:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return col.replace("facebook_likes", "fb").replace(
... "_for_reviews", ""
... )
>>> movies = movies.rename(columns=shorten)
>>> movies.isnull().head()
color director_name ... aspect_ratio movie_fb
0 False False ... False False
1 False False ... False False
2 False False ... False False
3 False False ... False False
4 True False ... True False
.sum
方法可以对True
和False
求和,True
是1,False
是0,这样就能清楚地看到每列有多少缺失值:
>>> (movies.isnull().sum().head())
color 19
director_name 102
num_critic 49
duration 15
director_fb 102
dtype: int64
再进一步,两个.sum
方法连用,可以知道总共有多少缺失值:
>>> movies.isnull().sum().sum()
2654
如果仅仅想知道是否有缺失值,使用.any()
更为便捷:
>>> movies.isnull().any().any()
True
原理
.isnull
方法将原始的DataFrame转换为了相同大小的布尔值矩阵:
>>> movies.isnull().dtypes.value_counts()
bool 28
dtype: int64
更多
如果object
类型的数据存在缺失值,则在做聚合运算(.min
、.max
、.sum
)时,返回为空:
>>> movies[["color", "movie_title", "color"]].max()
Series([], dtype: float64)
如果想让返回值不为空,需要对缺失值进行填充:
>>> movies.select_dtypes(["object"]).fillna("").max()
color Color
director_name Étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name Óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
出于可读性考虑,链式方法通常用括号括起来,这样对每个方法做注释和调试时非常方便:
>>> (movies.select_dtypes(["object"]).fillna("").max())
color Color
director_name Étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name Óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
2.6 DataFrame运算
DataFrame的列的类型可能是数值,也可能是对象,直接+5的话,会报类型错误:
>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
...
TypeError: can only concatenate str (not "int") to str
用.filter
方法筛选出所有列名以'UGDS_'开头的列,该列是按照种族分类的本科生:
>>> colleges = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
Pandas采取的是“四舍六入五成双” 的 Banker's Rounding 规则 (五后有数入、五后无数凑偶数)。观察UGDS_BLACK
是如何变化的:
>>> name = "Northwest-Shoals Community College"
>>> college_ugds.loc[name]
UGDS_WHITE 0.7912
UGDS_BLACK 0.1250
UGDS_HISP 0.0339
UGDS_ASIAN 0.0036
UGDS_AIAN 0.0088
UGDS_NHPI 0.0006
UGDS_2MOR 0.0012
UGDS_NRA 0.0033
UGDS_UNKN 0.0324
Name: Northwest-Shoals Community College, dtype: float64
>>> college_ugds.loc[name].round(2)
UGDS_WHITE 0.79
UGDS_BLACK 0.12
UGDS_HISP 0.03
UGDS_ASIAN 0.00
UGDS_AIAN 0.01
UGDS_NHPI 0.00
UGDS_2MOR 0.00
UGDS_NRA 0.00
UGDS_UNKN 0.03
Name: Northwest-Shoals Community College, dtype: float64
如果圆整之前加0.0001
,看看变化:
>>> (college_ugds.loc[name] + 0.0001).round(2)
UGDS_WHITE 0.79
UGDS_BLACK 0.13
UGDS_HISP 0.03
UGDS_ASIAN 0.00
UGDS_AIAN 0.01
UGDS_NHPI 0.00
UGDS_2MOR 0.00
UGDS_NRA 0.00
UGDS_UNKN 0.03
Name: Northwest-Shoals Community College, dtype: float64
在开始圆整探险之前,将college_ugds
中的每个数加0.00501
:
>>> college_ugds + 0.00501
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.03831 0.94031 ... 0.01091 0.01881
Universit... 0.59721 0.26501 ... 0.02291 0.01501
Amridge U... 0.30401 0.42421 ... 0.00501 0.27651
Universit... 0.70381 0.13051 ... 0.03821 0.04001
Alabama S... 0.02081 0.92581 ... 0.02931 0.01871
... ... ... ... ... ...
SAE Insti... NaN NaN ... NaN NaN
Rasmussen... NaN NaN ...
使用//
,将DataFrame中的值圆整为百分比的整数:
>>> (college_ugds + 0.00501) // 0.01
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 3.0 94.0 ... 1.0 1.0
Universit... 59.0 26.0 ... 2.0 1.0
Amridge U... 30.0 42.0 ... 0.0 27.0
Universit... 70.0 13.0 ... 3.0 4.0
Alabama S... 2.0 92.0 ... 2.0 1.0
... ... ... ... ... ...
SAE Insti... NaN NaN ... NaN NaN
Rasmussen... NaN NaN ... NaN NaN
National ... NaN NaN ...
如果将其除以100,则:
>>> college_ugds_op_round =(
... (college_ugds + 0.00501) // 0.01 / 100
... )
>>> college_ugds_op_round.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.03 0.94 ... 0.01 0.01
Universit... 0.59 0.26 ... 0.02 0.01
Amridge U... 0.30 0.42 ... 0.00 0.27
Universit... 0.70 0.13 ... 0.03 0.04
Alabama S... 0.02 0.92 ... 0.02 0.01
下面使用round
方法,因为是bankers规则,加个0.00001:
>>> college_ugds_round = (college_ugds + 0.00001).round(2)
判断两个结果是否相同:
>>> college_ugds_op_round.equals(college_ugds_round)
True
原理
浮点运算会产生误差:
>>> 0.045 + 0.005
0.049999999999999996
加0.00001之后,则变为:
>>> 0.045 + 0.005 + 0.00001
0.05001
更多
和Series相同,DataFrame也有对应的函数运算方法:
>>> college2 = (
... college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True
2.7 比较缺失值
Pandas使用np.nan
表示缺失值,这个对象很独特:
>>> np.nan == np.nan
False
>>> None == None
True
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
先加载数据:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
==
是对DataFrame中每个元素进行比较:
>>> college_ugds == 0.0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
但是,如果DataFrame中有缺失值,用==
就会出现问题:
>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... True True ... True True
Universit... True True ... True True
Amridge U... True True ... True True
Universit... True True ... True True
Alabama S... True True ... True True
看起来好像没问题,但是如果用.all
方法,就会发现问题:
>>> college_self_compare.all()
UGDS_WHITE False
UGDS_BLACK False
UGDS_HISP False
UGDS_ASIAN False
UGDS_AIAN False
UGDS_NHPI False
UGDS_2MOR False
UGDS_NRA False
UGDS_UNKN False
dtype: bool
这是因为缺失值不能相互比较,如果像下面用`== np.nan判断有没有缺失值,就会得到0:
>>> (college_ugds == np.nan).sum()
UGDS_WHITE 0
UGDS_BLACK 0
UGDS_HISP 0
UGDS_ASIAN 0
UGDS_AIAN 0
UGDS_NHPI 0
UGDS_2MOR 0
UGDS_NRA 0
UGDS_UNKN 0
dtype: int64
统计缺失值个数的方法是使用.isna
:
>>> college_ugds.isna().sum()
UGDS_WHITE 661
UGDS_BLACK 661
UGDS_HISP 661
UGDS_ASIAN 661
UGDS_AIAN 661
UGDS_NHPI 661
UGDS_2MOR 661
UGDS_NRA 661
UGDS_UNKN 661
dtype: int64
比较两个DataFrame的正确方法是使用.equals
方法:
>>> college_ugds.equals(college_ugds)
True
更多
.eq
方法等价于==
:
>>> college_ugds.eq(0.0019) # same as college_ugds == .0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
Amridge U... False False ... False False
Universit... False False ... False False
Alabama S... False False ... False False
... ... ... ... ... ...
SAE Insti... False False ... False False
Rasmussen... False False ... False False
National ... False False ... False False
Bay Area ... False False ... False False
Excel Lea... False False ... False False
pandas.testing
包中有个断言方法assert_frame_equal
,可以用于判断两个DataFrame是否相同,如果不同返回AssertionError
,如果相同返回None
:
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True
2.8 转置DataFrame运算的方向
DataFrame的许多方法都使用了axis
参数,这个参数控制了运算方向。axis
参数可以是index
(0
)或columns
(1
)。字符串更清晰,建议使用字符串。
提取并筛选数据。
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
.count()
方法默认axis=0
,等价于college_ugds.count(axis=0)
和 college_ugds.count(axis='index')
:
>>> college_ugds.count()
UGDS_WHITE 6874
UGDS_BLACK 6874
UGDS_HISP 6874
UGDS_ASIAN 6874
UGDS_AIAN 6874
UGDS_NHPI 6874
UGDS_2MOR 6874
UGDS_NRA 6874
UGDS_UNKN 6874
dtype: int64
将axis
参数改为columns
,可以得到每行非空值的数量:
>>> college_ugds.count(axis="columns").head()
INSTNM
Alabama A & M University 9
University of Alabama at Birmingham 9
Amridge University 9
University of Alabama in Huntsville 9
Alabama State University 9
dtype: int64
计算每行是否是百分之百:
>>> college_ugds.sum(axis="columns").head()
INSTNM
Alabama A & M University 1.0000
University of Alabama at Birmingham 0.9999
Amridge University 1.0000
University of Alabama in Huntsville 1.0000
Alabama State University 1.0000
dtype: float64
计算每列的中位数:
>>> college_ugds.median(axis="index")
UGDS_WHITE 0.55570
UGDS_BLACK 0.10005
UGDS_HISP 0.07140
UGDS_ASIAN 0.01290
UGDS_AIAN 0.00260
UGDS_NHPI 0.00000
UGDS_2MOR 0.01750
UGDS_NRA 0.00000
UGDS_UNKN 0.01430
dtype: float64
更多
使用cumsum
可以沿着列的方向进行累计求和,能从另一个视角观察白人和黑人所占比例:
>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9686 ... 0.9862 1.0000
Universit... 0.5922 0.8522 ... 0.9899 0.9999
Amridge U... 0.2990 0.7182 ... 0.7285 1.0000
Universit... 0.6988 0.8243 ... 0.9650 1.0000
Alabama S... 0.0158 0.9366 ... 0.9863 1.0000
2.9 校园的多样性
US News 的多样性指数TOP10高校如下:
>>> pd.read_csv(
... "data/college_diversity.csv", index_col="School"
... )
Diversity Index
School
Rutgers University--Newark Newark, NJ 0.76
Andrews University Berrien Springs, MI 0.74
Stanford University Stanford, CA 0.74
University of Houston Houston, TX 0.74
University of Nevada--Las Vegas Las Vegas, NV 0.74
University of San Francisco San Francisco, CA 0.74
San Francisco State University San Francisco, CA 0.73
University of Illinois--Chicago Chicago, IL 0.73
New Jersey Institute of Technology Newark, NJ 0.72
Texas Woman's University Denton, TX 0.72
对于我们的问题,先读取数据:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
计算每行有多少缺失值,并从大到小排列:
>>> (
... college_ugds.isnull()
... .sum(axis="columns")
... .sort_values(ascending=False)
... .head()
... )
INSTNM
Excel Learning Center-San Antonio South 9
Philadelphia College of Osteopathic Medicine 9
Assemblies of God Theological Seminary 9
Episcopal Divinity School 9
Phillips Graduate Institute 9
dtype: int64
如果一行的九列都是缺失值,则使用.dropna
方法删掉该行:
>>> college_ugds = college_ugds.dropna(how="all")
>>>; college_ugds.isnull().sum()
UGDS_WHITE 0
UGDS_BLACK 0
UGDS_HISP 0
UGDS_ASIAN 0
UGDS_AIAN 0
UGDS_NHPI 0
UGDS_2MOR 0
UGDS_NRA 0
UGDS_UNKN 0
dtype: int64
如果某个种族的比例超过15%,则进行统计:
>>> college_ugds.ge(0.15)
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False True ... False False
Universit... True True ... False False
Amridge U... True True ... False True
Universit... True False ... False False
Alabama S... False True ... False False
... ... ... ... ... ...
Hollywood... True True ... False False
Hollywood... False True ... False False
Coachella... True False ... False False
Dewey Uni... False False ... False False
Coastal P... True True ... False False
使用sum
方法对每行进行统计:
>>> diversity_metric = college_ugds.ge(0.15).sum(
... axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University 1
University of Alabama at Birmingham 2
Amridge University 3
University of Alabama in Huntsville 1
Alabama State University 1
dtype: int64
使用.value_counts
查看该序列是如何分布的:
>>> diversity_metric.value_counts()
1 3042
2 2884
3 876
4 63
0 7
5 2
dtype: int64
惊讶地发现,有两所学校有5个种族的比例超过了15%。对diversity_metric
进行排列:
>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin 5
Central Texas Beauty College-Temple 5
Sullivan and Cogliano Training Center 4
Ambria College of Nursing 4
Berkeley College-New York 4
dtype: int64
查看排名最高的两所学校:
>>> college_ugds.loc[
... [
... "Regency Beauty Institute-Austin",
... "Central Texas Beauty College-Temple",
... ]
... ]
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Regency B... 0.1867 0.2133 ... 0.0 0.2667
Central T... 0.1616 0.2323 ... 0.0 0.1515
再来看看 US News中的排名前五的学校的表现:
>>> us_news_top = [
... "Rutgers University-Newark",
... "Andrews University",
... "Stanford University",
... "University of Houston",
... "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark 4
Andrews University 3
Stanford University 3
University of Houston 3
University of Nevada-Las Vegas 3
dtype: int64
更多
查看最不具有多样性的学校的前十名:
>>> (
... college_ugds.max(axis=1)
... .sort_values(ascending=False)
... .head(10)
... )
INSTNM
Dewey University-Manati 1.0
Yeshiva and Kollel Harbotzas Torah 1.0
Mr Leon's School of Hair Design-Lewiston 1.0
Dewey University-Bayamon 1.0
Shepherds Theological Seminary 1.0
Yeshiva Gedolah Kesser Torah 1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias 1.0
Yeshiva Shaar Hatorah 1.0
Bais Medrash Elyon 1.0
Yeshiva of Nitra Rabbinical College 1.0
dtype: float64
是否存在所有种族比例都超过1%的学校:
>>> (college_ugds > 0.01).all(axis=1).any()
True
第01章 Pandas基础
第02章 DataFrame基础运算
第03章 创建和持久化DataFrame
第04章 开始数据分析
第05章 探索性数据分析
第06章 选取数据子集
第07章 过滤行
第08章 索引对齐