#2.1.5 Working With Missing Data.md

1.NaN (null value) and None (missing value)

Missing data can take a few different forms:

  • In Python, the None keyword and type indicates no value.
  • The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.

In general terms, both NaN and None can be called null values.
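
As a minimal sketch of this point (the values are made up for illustration), pandas treats both None and NaN as null, and stores None as NaN inside a float Series:

```python
import numpy
import pandas

# Hypothetical toy data: one None and one NaN among real values.
ages = pandas.Series([29.0, None, numpy.nan, 2.0])

# In a float Series, None is converted to NaN on construction.
print(ages.dtype)

# pandas.isnull() flags both kinds of missing value.
null_mask = pandas.isnull(ages)
print(null_mask.tolist())
```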

2.Detecting missing/null values: pandas.isnull(XXX)

If we want to see which values are NaN, we can use the pandas.isnull() function, which takes a pandas Series and returns a Series of True and False values, the same way that NumPy did when we compared arrays.

input
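import pandas  # titanic_survival is assumed to be an already loaded DataFrame
age = titanic_survival["age"]
print(age.loc[10:20])
age_is_null = pandas.isnull(age)    # True where the value is NaN or None, False otherwise
age_null_true = age[age_is_null]    # keep only the entries whose age is missing
age_null_count = len(age_null_true)
print(age_null_count)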
output
10    47.0 
11    18.0 
12    24.0 
13    26.0 
14    80.0 
15     NaN 
16    24.0 
17    50.0 
18    32.0 
19    36.0 
20    37.0 
Name: age, dtype: float64 
264
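
The counting step can be tried on a small stand-in for the "age" column. The Series below is invented for illustration; the second count relies on True being treated as 1 when a boolean Series is summed:

```python
import numpy
import pandas

# Hypothetical stand-in for titanic_survival["age"].
age = pandas.Series([47.0, 18.0, numpy.nan, 24.0, numpy.nan])

age_is_null = pandas.isnull(age)          # boolean Series, True where missing
age_null_count = len(age[age_is_null])    # filter first, then count
same_count = int(age_is_null.sum())       # summing the mask gives the same number

print(age_null_count, same_count)
```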

3.Doing arithmetic when null values are present

# Compute the mean age when null values are present
age_is_null = pandas.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][age_is_null == False]   # keep only the non-null ages
correct_mean_age1 = sum(good_ages) / len(good_ages)

# Use Series.mean(), which skips missing values by default
correct_mean_age2 = titanic_survival["age"].mean()
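
To see why the filtering matters, here is a small sketch on made-up ages. It compares a naive mean, the filtered mean, and Series.mean():

```python
import numpy
import pandas

# Hypothetical ages with one missing value.
age = pandas.Series([20.0, 30.0, numpy.nan, 40.0])

naive_mean = sum(age) / len(age)                   # NaN propagates through the sum
good_ages = age[pandas.isnull(age) == False]       # drop the missing entry
manual_mean = sum(good_ages) / len(good_ages)      # 90 / 3
series_mean = age.mean()                           # skips NaN by default

print(naive_mean, manual_mean, series_mean)
```

Any arithmetic that touches a NaN yields NaN, which is why the naive version is unusable here.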

4.Using a dictionary to compute the mean fare for each passenger class

input

passenger_classes = [1, 2, 3]  # Titanic passenger classes are 1, 2 and 3
fares_by_class = {}   # start with an empty dictionary
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival['pclass'] == this_class]  # all rows for this class
    mean_fares = pclass_rows['fare'].mean()   # mean fare for this class
    fares_by_class[this_class] = mean_fares  # store the result under the class number
print(fares_by_class)
output
{1: 87.508991640866881, 2: 21.179196389891697, 3: 13.302888700564973}

5.Using DataFrame.pivot_table()

Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean.

The problem from section 4 can also be solved with DataFrame.pivot_table():

  • The first parameter of the method, index, tells the method which column to group by.
  • The second parameter, values, is the column that we want to apply the calculation to.
  • aggfunc specifies the calculation we want to perform. The default for the aggfunc parameter is actually the mean.
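
The three parameters can be sketched on a tiny invented table (the numbers are not from the Titanic data). Recent pandas versions return a DataFrame here, so the printed layout may differ slightly from the Series-style output shown in this section:

```python
import pandas

# Hypothetical miniature of the Titanic table.
toy = pandas.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 10.0, 14.0],
})

# index: column to group by; values: column to aggregate;
# aggfunc: the calculation to apply ("mean" is also the default).
mean_fares = toy.pivot_table(index="pclass", values="fare", aggfunc="mean")
total_fares = toy.pivot_table(index="pclass", values="fare", aggfunc="sum")

print(mean_fares)
print(total_fares)
```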

input1
import numpy
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=numpy.mean)
print(passenger_class_fares)
output1
pclass
1.0    87.508992
2.0    21.179196
3.0    13.302889
Name: fare, dtype: float64

input2

passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=numpy.mean)

print(passenger_age)

output2
pclass 
1.0    39.159918 
2.0    29.506705 
3.0    24.816367 
Name: age, dtype: float64
input3
import numpy
port_stats = titanic_survival.pivot_table(index='embarked', values=["fare", "survived"], aggfunc=numpy.sum)
print(port_stats)

output3
                fare  survived
embarked
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0

6.Dropping missing values: DataFrame.dropna()

The method DataFrame.dropna() will drop any rows that contain missing values.

drop_na_rows = titanic_survival.dropna(axis=0)  # drop every row that contains a missing value
drop_na_columns = titanic_survival.dropna(axis=1)  # drop every column that contains a missing value
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])  # drop rows with a missing value in "age" or "sex"
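
The effect of axis and subset is easier to see on a small invented frame, where every row has at least one gap:

```python
import numpy
import pandas

# Hypothetical toy data: each row is missing something.
toy = pandas.DataFrame({
    "age": [22.0, numpy.nan, 35.0],
    "sex": ["male", "female", "female"],
    "cabin": [numpy.nan, "C85", numpy.nan],
})

rows_kept = toy.dropna(axis=0)                            # no row is complete, so none survive
cols_kept = toy.dropna(axis=1)                            # only "sex" has no missing values
subset_kept = toy.dropna(axis=0, subset=["age", "sex"])   # gaps in "cabin" are ignored

print(rows_kept.shape, list(cols_kept.columns), list(subset_kept.index))
```

Restricting the check with subset keeps far more data than a blanket dropna() when some columns are sparsely filled.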

7.DataFrame.loc[4] vs. DataFrame.iloc[4]

input

# We have already sorted new_titanic_survival by age
first_five_rows_1 = new_titanic_survival.iloc[5]   # the row at integer position 5 (the sixth row in the current order)
first_five_rows_2 = new_titanic_survival.loc[5]   # the row whose index label is 5
row_index_25_survived = new_titanic_survival.loc[25, 'survived']  # the value at index label 25 in the 'survived' column
print(first_five_rows_1)
print('------------------------------------------')
print(first_five_rows_2)
output
pclass                          3
survived                        0
name         Connors, Mr. Patrick
sex                          male
age                          70.5
sibsp                           0
parch                           0
ticket                     370369
fare                         7.75
cabin                         NaN
embarked                        Q
boat                          NaN
body                          171
home.dest                     NaN
Name: 727, dtype: object
------------------------------------------
pclass                         1
survived                       1
name         Anderson, Mr. Harry
sex                         male
age                           48
sibsp                          0
parch                          0
ticket                     19952
fare                       26.55
cabin                        E12
embarked                       S
boat                           3
body                         NaN
home.dest           New York, NY
Name: 5, dtype: object
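
The difference between position and label only shows up once the rows have been reordered. A minimal sketch, assuming a made-up four-row frame sorted by age in descending order:

```python
import pandas

# Hypothetical toy data with the default index labels 0..3.
toy = pandas.DataFrame(
    {"name": ["A", "B", "C", "D"], "age": [30, 70, 5, 48]}
)

# Sorting reorders the rows but keeps the original index labels.
by_age = toy.sort_values("age", ascending=False)

first_by_position = by_age.iloc[0]["name"]   # row at position 0 (the oldest)
first_by_label = by_age.loc[0]["name"]       # row whose index label is 0
age_at_label_3 = by_age.loc[3, "age"]        # label 3, column "age"

print(first_by_position, first_by_label, age_at_label_3)
```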

8.Resetting the index: DataFrame.reset_index(drop=True)

input

titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:5,0:3])
output
   pclass  survived                                               name
0     1.0       1.0               Barkworth, Mr. Algernon Henry Wilson
1     1.0       1.0  Cavendish, Mrs. Tyrell William (Julia Florence...
2     3.0       0.0                                Svensson, Mr. Johan
3     1.0       0.0                          Goldschmidt, Mr. George B
4     1.0       0.0                            Artagaveytia, Mr. Ramon
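
What drop=True changes can be checked on a small invented frame. Without it, reset_index() keeps the old labels in a new column named "index":

```python
import pandas

# Hypothetical toy data, sorted so the index labels are out of order.
toy = pandas.DataFrame({"age": [30, 70, 5]})
by_age = toy.sort_values("age", ascending=False)   # index labels are now 1, 0, 2

dropped = by_age.reset_index(drop=True)   # old labels are discarded
kept = by_age.reset_index()               # old labels move into an "index" column

print(list(by_age.index), list(dropped.index), list(kept.columns))
```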

9.Apply Functions Over a DataFrame

DataFrame.apply() iterates through each column in a DataFrame and applies a function to each one. The function we write takes one parameter, and apply() passes each column to that parameter as a pandas Series. With axis=1, the same method applies the function to each row instead.

input
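def null_count(column):
    column_null = pandas.isnull(column)  # boolean mask: True where the value is missing
    null = column[column_null]           # keep only the missing entries
    return len(null)                     # number of missing values in this column
column_null_count = titanic_survival.apply(null_count)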
print(column_null_count)
output
pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
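
The same pattern can be run end to end on a small invented frame. The helper name count_missing is hypothetical; the last line shows a built-in way to get the same counts for this particular calculation:

```python
import numpy
import pandas

# Hypothetical toy data with a different number of gaps per column.
toy = pandas.DataFrame({
    "age": [22.0, numpy.nan, numpy.nan],
    "fare": [7.25, 71.28, numpy.nan],
    "sex": ["male", "female", "female"],
})

def count_missing(column):
    # apply() passes each column in as a pandas Series.
    return len(column[pandas.isnull(column)])

counts = toy.apply(count_missing)    # result is indexed by column name
builtin_counts = toy.isnull().sum()  # equivalent without a custom function

print(counts)
print(builtin_counts)
```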

10.Applying a Function to a Row

input

def age_label(row):
    age = row['age']
    if pandas.isnull(age):
        return 'unknown'
    elif age < 18:
        return 'minor'
    else:
        return 'adult'
# Use axis=1 so that the apply() method applies your function over the rows
age_labels = titanic_survival.apply(age_label, axis=1)
print(age_labels[0:5])
output
0    adult
1    minor
2    minor
3    adult
4    adult
dtype: object
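
To make the role of the axis argument concrete, the sketch below applies one function both ways to made-up "sibsp" and "parch" values. The helper name total is hypothetical:

```python
import pandas

# Hypothetical toy data: siblings/spouses and parents/children aboard.
toy = pandas.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 2, 1]})

def total(values):
    # values is a Series: a whole column for axis=0, a single row for axis=1.
    return values.sum()

col_totals = toy.apply(total)          # default axis=0: one result per column
row_totals = toy.apply(total, axis=1)  # axis=1: one result per row

print(col_totals)
print(row_totals)
```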

11.Calculating Survival Percentage by Age Group

Now that we have age labels for everyone, let's make a pivot table to find the probability of survival for each age group. We have added an "age_labels" column to the dataframe containing the age_labels variable from the previous step.

input
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived")
print(age_group_survival)
output
age_labels
adult      0.387892
minor      0.525974
unknown    0.277567
Name: survived, dtype: float64
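
The step of adding the "age_labels" column is not shown above. One way it might be done, sketched on invented data (the rows and survival values are made up), is to assign the result of apply() to a new column before pivoting:

```python
import numpy
import pandas

# Hypothetical toy data: five passengers, one with an unknown age.
toy = pandas.DataFrame({
    "age": [29.0, 2.0, numpy.nan, 40.0, 10.0],
    "survived": [1, 1, 0, 0, 1],
})

def age_label(row):
    age = row["age"]
    if pandas.isnull(age):
        return "unknown"
    return "minor" if age < 18 else "adult"

# Store the labels as a column so pivot_table() can group by them.
toy["age_labels"] = toy.apply(age_label, axis=1)

# The default aggfunc is the mean; for 0/1 values that is the share who survived.
survival = toy.pivot_table(index="age_labels", values="survived")
print(survival)
```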