Data Analysis Project Practice Notes (2): Data Restructuring and Basic Processing
Tools: Jupyter
Libraries: Pandas, NumPy
References:
https://pandas.pydata.org/docs/user_guide/10min.html
https://www.osgeo.cn/numpy/user/absolute_beginners.html
1. Merging
concat
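import pandas as pd
result = pd.concat([left, right], axis=1)  # side by side, aligned on the row index
result = pd.concat([up, down], axis=0)  # stacked vertically, rows of down after rows of up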
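The two concat directions can be tried on toy data. This is a minimal sketch: the frames and their values are made up for illustration, and only the names left, right, up and down come from the notes above.

```python
import pandas as pd

# Hypothetical toy frames (values are illustrative only).
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"C": [5, 6], "D": [7, 8]})

# axis=1 places the frames side by side, aligned on the row index.
up = pd.concat([left, right], axis=1)

# A second block with the same columns, to be stacked underneath.
down = pd.DataFrame({"A": [9], "B": [10], "C": [11], "D": [12]})

# axis=0 stacks rows; ignore_index=True renumbers the rows from 0.
result = pd.concat([up, down], axis=0, ignore_index=True)

print(result.shape)  # (3, 4)
```

Without ignore_index=True the original row labels are kept, so the stacked result can contain duplicate index values.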
join
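up = left.join(right)  # left join on the index by default
result = pd.concat([up, down])  # DataFrame.append was removed in pandas 2.0; use pd.concat
merge
down = pd.merge(left, right, left_index=True, right_index=True)  # inner join on the index by default
result = pd.concat([up, down])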
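join and merge differ in their default join type, which only shows when the indexes do not match exactly. The sketch below uses made-up frames (the index labels and values are assumptions) to show the difference.

```python
import pandas as pd

# Hypothetical frames: "z" exists only in left.
left = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"B": [4, 5]}, index=["x", "y"])

# DataFrame.join defaults to a left join on the index:
# row "z" is kept and its B value becomes NaN.
joined = left.join(right)

# pd.merge defaults to an inner join: row "z" is dropped.
merged = pd.merge(left, right, left_index=True, right_index=True)

# Passing how="left" makes merge keep every row of left as well.
merged_left = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(len(joined), len(merged), len(merged_left))  # 3 2 3
```

When both frames share exactly the same index, as in the notes above, the two calls give the same rows.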
2. Converting between Series and DataFrame
Series to DataFrame
result_df = result_s.to_frame()
DataFrame to Series
result_s = result_df.stack()  # Series indexed by (row label, column label)
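A round trip on a small Series shows what each conversion returns. This is a sketch with made-up values; squeeze is not used in the notes above and is shown only as an alternative when the frame has a single column.

```python
import pandas as pd

# Hypothetical one-column data.
result_s = pd.Series([1, 2, 3], name="value")

# Series -> DataFrame: one column, named after the Series.
result_df = result_s.to_frame()

# DataFrame -> Series via stack(): the column labels move into the index,
# so the result carries a two-level (row, column) MultiIndex.
stacked = result_df.stack()

# Alternative for a one-column frame: squeeze keeps the plain row index.
squeezed = result_df.squeeze("columns")

print(stacked.index.nlevels)      # 2
print(squeezed.equals(result_s))  # True
```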
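3. GroupBy analysis
Basic grouping
df_sex = df_result.groupby("Sex")  # group the rows by the "Sex" column
list(df_sex)[1]  # second (label, sub-DataFrame) pair; the label is the group key
Summary statistics for each group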
df_sex.describe()
Viewing a single statistic
df_sex["Age"].mean()
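The grouping steps above can be reproduced on a few rows. The frame below is a hypothetical miniature stand-in for df_result with Titanic-style columns; the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for df_result.
df_result = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Survived": [0, 1, 1, 0],
})

df_sex = df_result.groupby("Sex")

# Iterating a GroupBy yields (label, sub-DataFrame) pairs, sorted by label.
for label, group in df_sex:
    print(label, len(group))

# get_group selects one group by its label.
females = df_sex.get_group("female")

# Selecting a column before aggregating gives a Series indexed by Sex.
age_mean = df_sex["Age"].mean()
print(age_mean["female"], age_mean["male"])  # 32.0 28.5
```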
Using agg() to compute several statistics at once (the result is indexed by the grouping key)
mean_sex_agg = df_result.groupby("Sex").agg({"Fare": "mean", "Survived": "sum"}).rename(columns = {
"Fare": "Fare_mean", "Survived": "Survived_sum"
})
mean_sex_agg
agg specifies the aggregation for each column; rename renames the resulting columns
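pandas also offers named aggregation, which sets the output column names inside agg and makes the separate rename step unnecessary. The sketch below is an alternative to the snippet above, not the approach used in these notes, and the data is invented.

```python
import pandas as pd

# Hypothetical stand-in for df_result.
df_result = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.0, 8.0, 8.75],
    "Survived": [0, 1, 1, 0],
})

# Named aggregation: output_name=(input_column, function).
mean_sex_agg = df_result.groupby("Sex").agg(
    Fare_mean=("Fare", "mean"),
    Survived_sum=("Survived", "sum"),
)

print(mean_sex_agg.loc["female", "Fare_mean"])    # 39.5
print(mean_sex_agg.loc["male", "Survived_sum"])   # 0
```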
Grouping by multiple keys
mean_fare_pclass_sex = df_result.groupby(["Sex","Pclass"])["Fare"].mean()
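Grouping by two keys returns a Series with a two-level index. The sketch below, on invented data, shows how to read one value from it and how unstack turns it into a Sex by Pclass table; unstack is an addition for illustration and is not used in the notes.

```python
import pandas as pd

# Hypothetical stand-in for df_result.
df_result = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Pclass": [3, 1, 3, 1, 1],
    "Fare": [8.0, 70.0, 10.0, 50.0, 90.0],
})

mean_fare_pclass_sex = df_result.groupby(["Sex", "Pclass"])["Fare"].mean()

# One value is addressed by a (Sex, Pclass) tuple.
print(mean_fare_pclass_sex.loc[("female", 1)])  # 80.0

# unstack moves the Pclass level into the columns.
table = mean_fare_pclass_sex.unstack("Pclass")
print(table.loc["male", 3])  # 8.0
```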
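Merging two grouped results that share the same grouping key
sex_fare_survived = pd.merge(mean_fare_sex, mean_sex_survived, on="Sex")
on names the grouping key, which is the basis for the merge. It must not be omitted: the two results share no ordinary column, so merge has no default key to fall back on.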
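The notes do not show how mean_fare_sex and mean_sex_survived were built. The sketch below assumes they are the per-sex mean of Fare and the per-sex mean of Survived, computed from an invented frame, and merges them on the shared index name.

```python
import pandas as pd

# Hypothetical stand-in for df_result.
df_result = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.0, 8.0, 8.75],
    "Survived": [0, 1, 1, 0],
})

# Assumed definitions: both are Series indexed by "Sex".
mean_fare_sex = df_result.groupby("Sex")["Fare"].mean()
mean_sex_survived = df_result.groupby("Sex")["Survived"].mean()

# on="Sex" refers to the index level that both results share.
sex_fare_survived = pd.merge(mean_fare_sex, mean_sex_survived, on="Sex")

print(sex_fare_survived)
```

Because both results have the same index, pd.concat([mean_fare_sex, mean_sex_survived], axis=1) is another way to line them up.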
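4. A final worked example (unresolved: the logic looks questionable)
Problem: compute the survival rate (survivors / total people) for the group with the most survivors
Answer:
age_survived = df_result.groupby("Age")["Survived"].sum()  # survivors in each age group
age_survived[age_survived == age_survived.max()]  # age group(s) with the most survivors
age_survived_ratio = age_survived.max() / df_result["Survived"].sum()  # that group's share of all survivors
age_survived_ratio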
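The doubt about the logic may come from the denominator: the answer divides by the total number of survivors, while the problem statement says "total people". Which total is intended is not stated, so the sketch below computes three candidate readings side by side on invented data. It does not claim to be the intended solution.

```python
import pandas as pd

# Hypothetical stand-in for df_result.
df_result = pd.DataFrame({
    "Age": [20, 20, 20, 30, 30, 40],
    "Survived": [1, 1, 0, 1, 0, 1],
})

# Survivors and head count per age group.
by_age = df_result.groupby("Age")["Survived"].agg(survivors="sum", total="count")

top_age = by_age["survivors"].idxmax()  # age with the most survivors
top = by_age.loc[top_age]

rate_in_group = top["survivors"] / top["total"]                    # survivors / people of that age
share_of_survivors = top["survivors"] / df_result["Survived"].sum()  # what the answer above computes
share_of_passengers = top["survivors"] / len(df_result)            # survivors / all people

print(top_age, rate_in_group, share_of_survivors, share_of_passengers)
```

On this toy data the three readings give 2/3, 1/2 and 1/3, so the choice of denominator changes the result.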