Week 4
数据排序
-
按值排序
DataFrame.sort_values(by = ['列一','列二'],axis = 0, ascending = Ture, inplace=False )
返回一个
dataframe
类型-
by
决定要依据哪一列(行)排序,如果是某一列
by = '列名'
如果是很多列
by = ['列一','列二']
-
axis
决定是上下排序还是左右排序,默认为上下排序axis = 0 按 index 排序,上下排序
axis = 1 按 columns 排序, 左右排序
-
ascending 决定是升序还是降序,默认是升序
ascending = True 升序
ascending = False 降序
inplace 决定是否替代原数据, 默认为否
-
-
按索引排序
DataFrame.sort_index(axis = 0, ascending = Ture)
-
axis
决定是上下排序还是左右排序,默认为上下排序axis = 0 按 index 排序,上下排序
axis = 1 按 columns 排序, 左右排序
-
ascending 决定是升序还是降序,默认是升序
ascending = True 升序
ascending = False 降序
import pandas as pd marketing = pd.read_excel('DirectMarketing.xlsx') marketing = marketing[['Age','Salary','AmountSpent']] print(marketing.sort_values(by = ['Salary'])) print(marketing.sort_values(by = ['Salary'], ascending = False))
-
数据计数
series.nunique()
-
返回
series
的唯一值的数量(出现了多少种不同的值)series.value_counts()
-
返回每个唯一值对应的数据个数※
marketing['Catalogs'].value_counts()
数据分组
dataframe.groupby()
可以用自然的方式将数据分成有共同点的几组,完成类似于“有几个小孩的人收入的平均数情况”等数据分析
-
分组的依据
列名为分组键:以某一列的值作为分组的依据
df.groupby('列名')
列名组成的列表为分组键:以某几列的值作为分组的依据
df.groupby(['列名1','列名2',...])
函数为分组键
任何被当做分组键的函数都会在各个索引值上被调用一次,其返回值就会被用作分组名称。df.groupby(func)
-
分组的结果
groupby()
函数返回的是一个GruopBy
类型对象,此时数据并没有做分离操作,直到用 GroupBy 类型对象执行某些数据操作之后才分离 -
groupBy
可用的操作方法:内建方法(优化过)
count 分组中非 NA 值的数量 sum 非 NA 值的和 mean 非 NA 值的平均值 median 非 NA 值的算数中位数 std, var 无偏(分母为 n-1)标准差和方差 min, max 非 NA 值的最小值和最大值 prod 非 NA 值的积 first, last 第一个和最后一个非 NA 值 size 计算组的尺寸 例子
import pandas as pd marketing = pd.read_excel('DirectMarketing.xlsx') #每个年龄组的总消费额 print(marketing.groupby('Age').sum()['AmountSpent'])
-
选择一组数据
groupby.getgroup('xxx')
可以根据某元素内容选择出某一组数据
-
自定义的聚合函数,通过传入 GroupBy.aggregate()
或 GroupBy.agg()
来实现
dataframe.groupby('列名').agg(func_name)
数据合并
dataframe1.merge(dataframe2,how='outer')
可以根据一个或多个键(key)将不同DataFrame中的行连接起来
唯一的要求是,在每个数据集中,有一相同的列。
返回的是一个 dataframe
类型的对象
-
how
参数可以选择合并的方式,
`inner`:内连接(取key的交集)
`outer`:外连接(取 key 的并集)
默认以重叠的列名当做连接键
默认做
inner
连接(取key的交集)Join()
concat()
-
split-apply-combine
Challenge
Sorting
Reads in the excel file DirectMarketing.xlsx, and aselect only the Age, Salary and AmountSpent columns. It should then sort the rows so that they are in increasing order of Salary. Using head(), display the first 5 rows.
读取文件,选择特定的列,根据 Salary 的值升序排列,截取前五行
根据值进行升序排序用DataFrame.sort_values()
import pandas as pd
marketing = pd.read_excel('DirectMarketing.xlsx')
marketing = marketing[['Age','Salary','AmountSpent']]
marketing.sort_values(by = 'Salary',inplace = True)
print(marketing.head(5))
Group by: salary per number of children
Reads in the excel file DirectMarketing.xlsx and group the data by number of children, and print out the mean salary for each group.
读取文件,根据children
的数据分组,打印每一组的salary
的mean
分组函数dataframe.groupby('列名')
按组求平均值Grouphy.mean()
从 Dataframe
的 某一列提取出series
:dataframe['列名']
import pandas as pd
marketing = pd.read_excel('DirectMarketing.xlsx')
child_groups = marketing.groupby('Children')
print(child_groups.mean()['Salary'])
Inner join: soccer and cricket
将两张表内连接,改变 dataframe
列的顺序
dataframe1.merge(dataframe2,how = 'inner')
import pandas as pd
soccer = pd.DataFrame({'People': ['Stephen', 'Alison', 'Jess'], 'SoccerTeam': ['Chelsea', 'Chelsea', 'Arsenal']})
cricket = pd.DataFrame({'People': ['Stephen', 'Tina', 'Jess', 'Will'], 'CricketTeam': ['Sixers', 'Sixers', 'Thunder', 'Thunder']})
sport = cricket.merge(soccer, how='inner')
print(sport[['People','SoccerTeam','CricketTeam']])
Outer join: soccer and cricket
将两张表外连接
dataframe1.merge(dataframe2,how = 'outer')
import pandas as pd
soccer = pd.DataFrame({'People': ['Stephen', 'Alison', 'Jess'], 'SoccerTeam': ['Chelsea', 'Chelsea', 'Arsenal']})
cricket = pd.DataFrame({'People': ['Stephen', 'Tina', 'Jess', 'Will'], 'CricketTeam': ['Sixers', 'Sixers', 'Thunder', 'Thunder']})
sport = soccer.merge(cricket, how = 'outer')
print(sport)
Mean marks
Add in an additional column to the combined dataset called mean_marks which gives the average mark for each person. Print the data in order of descending mean mark.
合并表格,计算每行的平均数,并在后面添加一列数据,最后根据平均数据倒序排序
import pandas as pd
BUSS6002 = pd.DataFrame({'People': ['Stephen', 'Tina', 'Alison', 'Jess'], 'BUSS6002_mark': [89, 63, 84, 70]})
QBUS6840 = pd.DataFrame({'People': ['Stephen', 'Cooper', 'Jess', 'Will', 'Chris'], 'QBUS6840_mark': [83, 71, 74, 68, 88]})
INFO6018 = pd.DataFrame({'People': ['Cooper', 'Jess', 'Alison', 'Chris'], 'INFO6018_mark': [68, 71, 86, 91]})
marks = BUSS6002.merge(QBUS6840,how = 'outer')
marks = marks.merge(INFO6018,how = 'outer')
marks['mean_mark'] = marks.mean(1)
print(marks.sort_values( by= 'mean_mark', ascending = False))