1.NaN (null value) and None (missing value)

Missing data can take a few different forms:

In Python, the None keyword and type indicates no value.
The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.

In general terms, both NaN and None can be called null values.
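As a rough illustration of how the two null values behave, here is a minimal sketch. The variable names and the three-element Series are made up for the example and are not part of the Titanic data.

```python
import math

import pandas

# None is Python's "no value" object; NaN is a special float value.
nothing = None
not_a_number = float("nan")

print(type(nothing).__name__)        # NoneType
print(type(not_a_number).__name__)   # float

# NaN never compares equal to anything, including itself,
# so use math.isnan() or pandas.isnull() to test for it.
print(not_a_number == not_a_number)  # False
print(math.isnan(not_a_number))      # True

# pandas treats both as missing; in a numeric Series, None is stored as NaN.
ages = pandas.Series([22.0, None, float("nan")])
print(ages.dtype)                    # float64
print(pandas.isnull(ages).tolist())  # [False, True, True]
```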

2.Detecting missing values: pandas.isnull(XXX)

If we want to see which values are NaN, we can use the pandas.isnull() function, which takes a pandas Series and returns a Series of True and False values, the same way NumPy does when we compare arrays.

input

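import pandas

# titanic_survival is assumed to be a DataFrame that has already been loaded
age = titanic_survival["age"]
print(age.loc[10:20])
age_is_null = pandas.isnull(age)    # True where the value is NaN or None, False otherwise
age_null_true = age[age_is_null]    # keep only the entries where age is missing
age_null_count = len(age_null_true)
print(age_null_count)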

output

10    47.0 
11    18.0 
12    24.0 
13    26.0 
14    80.0 
15     NaN 
16    24.0 
17    50.0 
18    32.0 
19    36.0 
20    37.0 
Name: age, dtype: float64 
264

3.Doing arithmetic when null values are present


# Compute the mean age when the column contains null values
age_is_null = pandas.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][~age_is_null]   # keep only the non-null ages
correct_mean_age1 = sum(good_ages) / len(good_ages)

# Use Series.mean(), which skips null values by default
correct_mean_age2 = titanic_survival["age"].mean()
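To see why the nulls have to be filtered out first, here is a small sketch on a made-up three-value Series (the numbers are hypothetical, not taken from the Titanic data). It compares a naive mean, the filtered mean, and Series.mean().

```python
import pandas

# Hypothetical ages; the second value is missing.
ages = pandas.Series([20.0, float("nan"), 40.0])

# Python's built-in sum() propagates NaN, so a naive mean is NaN too.
naive_mean = sum(ages) / len(ages)

# Filtering out the nulls first gives the mean of the known values.
known_ages = ages[~pandas.isnull(ages)]
manual_mean = sum(known_ages) / len(known_ages)

# Series.mean() skips NaN by default (skipna=True).
series_mean = ages.mean()

print(naive_mean)   # nan
print(manual_mean)  # 30.0
print(series_mean)  # 30.0
```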

4.Using a dictionary to compute the mean fare for each passenger class

input


passenger_classes = [1, 2, 3]  # the Titanic had three passenger classes: 1, 2 and 3

fares_by_class = {}   # create an empty dictionary

for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival['pclass'] == this_class]  # all rows for this class
    mean_fares = pclass_rows['fare'].mean()   # mean fare for this class
    fares_by_class[this_class] = mean_fares  # store the result in the dictionary
print(fares_by_class)

output

{1: 87.508991640866881, 2: 21.179196389891697, 3: 13.302888700564973}

5.Using DataFrame.pivot_table()

Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean.

The problem from section 4 can also be solved with DataFrame.pivot_table().

The first parameter of the method, index, tells the method which column to group by.
The second parameter, values, is the column that we want to apply the calculation to.
aggfunc specifies the calculation we want to perform. The default for the aggfunc parameter is actually the mean.
input1


import numpy

passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=numpy.mean)
print(passenger_class_fares)

output1


pclass
1.0    87.508992
2.0    21.179196
3.0    13.302889
Name: fare, dtype: float64

input2


passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=numpy.mean)
print(passenger_age)

output2

pclass 
1.0    39.159918 
2.0    29.506705 
3.0    24.816367 
Name: age, dtype: float64

input3

import numpy
port_stats = titanic_survival.pivot_table(index='embarked', values=["fare", "survived"], aggfunc=numpy.sum)
print(port_stats)

output3

                fare  survived
embarked
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0

6.Dropping missing values: DataFrame.dropna()

By default, the method DataFrame.dropna() drops every row that contains a missing value.

drop_na_rows = titanic_survival.dropna(axis=0)  # drop every row that contains a missing value
drop_na_columns = titanic_survival.dropna(axis=1) # drop every column that contains a missing value
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])  # drop rows with a missing value in 'age' or 'sex'
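The effect of axis and subset is easier to see on a few rows. The sketch below uses a hypothetical three-row frame (not the real data set) and the same three calls.

```python
import pandas

# Hypothetical rows; NaN marks a missing value.
passengers = pandas.DataFrame({
    "age": [22.0, float("nan"), 35.0],
    "sex": ["male", "female", "female"],
    "cabin": [float("nan"), "C85", float("nan")],
})

# axis=0: drop every row that has at least one missing value.
complete_rows = passengers.dropna(axis=0)
# axis=1: drop every column that has at least one missing value.
complete_columns = passengers.dropna(axis=1)
# subset: only look at the listed columns when deciding which rows to drop.
known_age_and_sex = passengers.dropna(axis=0, subset=["age", "sex"])

print(len(complete_rows))                # 0
print(complete_columns.columns.tolist()) # ['sex']
print(known_age_and_sex.index.tolist())  # [0, 2]
```

Note that the surviving rows keep their original index labels (0 and 2), which is why section 8 resets the index afterwards.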

7.DataFrame.loc[4] vs. DataFrame.iloc[4]

input


# We have already sorted new_titanic_survival by age
first_five_rows_1 = new_titanic_survival.iloc[5]   # row at position 5 (the sixth row, counting from 0)
first_five_rows_2 = new_titanic_survival.loc[5]   # row whose index label is 5
row_index_25_survived = new_titanic_survival.loc[25, 'survived']  # 'survived' value of the row whose index label is 25
print(first_five_rows_1)
print('------------------------------------------')
print(first_five_rows_2)

output

pclass                          3
survived                        0
name         Connors, Mr. Patrick
sex                          male
age                          70.5
sibsp                           0
parch                           0
ticket                     370369
fare                         7.75
cabin                         NaN
embarked                        Q
boat                          NaN
body                          171
home.dest                     NaN
Name: 727, dtype: object
------------------------------------------
pclass                         1
survived                       1
name         Anderson, Mr. Harry
sex                         male
age                           48
sibsp                          0
parch                          0
ticket                     19952
fare                       26.55
cabin                        E12
embarked                       S
boat                           3
body                         NaN
home.dest           New York, NY
Name: 5, dtype: object
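The difference between position and label shows up as soon as a frame is sorted. Here is a minimal sketch with a hypothetical three-row frame; the names and ages are invented for the example.

```python
import pandas

# Hypothetical data, sorted by age so labels and positions no longer match.
passengers = pandas.DataFrame(
    {"name": ["Ann", "Bob", "Cid"], "age": [30.0, 50.0, 40.0]}
)
by_age = passengers.sort_values("age", ascending=False)

# After sorting, the index labels are in the order 1, 2, 0.
print(by_age.index.tolist())   # [1, 2, 0]

# iloc selects by position: position 0 is the first row shown (Bob).
print(by_age.iloc[0]["name"])  # Bob
# loc selects by index label: label 0 is the original first row (Ann).
print(by_age.loc[0]["name"])   # Ann
# loc also accepts a column label to pick a single cell.
print(by_age.loc[2, "age"])    # 40.0
```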

8.Resetting the index: DataFrame.reset_index(drop=True)

input


titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:5,0:3])

output

   pclass  survived                                               name
0     1.0       1.0               Barkworth, Mr. Algernon Henry Wilson
1     1.0       1.0  Cavendish, Mrs. Tyrell William (Julia Florence...
2     3.0       0.0                                Svensson, Mr. Johan
3     1.0       0.0                          Goldschmidt, Mr. George B
4     1.0       0.0                            Artagaveytia, Mr. Ramon
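What drop=True changes can be checked on a small frame. This sketch assumes a hypothetical frame whose index is already out of order, as it would be after sorting.

```python
import pandas

# Hypothetical frame whose index is out of order after sorting.
by_age = pandas.DataFrame(
    {"name": ["Bob", "Cid", "Ann"], "age": [50.0, 40.0, 30.0]},
    index=[1, 2, 0],
)

# drop=True discards the old labels instead of keeping them as a column.
renumbered = by_age.reset_index(drop=True)
print(renumbered.index.tolist())    # [0, 1, 2]
print(renumbered.columns.tolist())  # ['name', 'age']

# Without drop=True the old labels are kept in a new "index" column.
kept = by_age.reset_index()
print(kept.columns.tolist())        # ['index', 'name', 'age']
```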

9.Apply Functions Over a DataFrame

DataFrame.apply() iterates over each column in a DataFrame and calls a function on it. The function we write takes one parameter, and apply() passes each column to that parameter as a pandas Series.
A DataFrame can use apply() to run a function on every column (or every row).

input

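def null_count(column):                  # counts the missing values, so it is named null_count
    column_null = pandas.isnull(column)  # boolean mask: True where the value is missing
    null = column[column_null]           # keep only the missing values
    return len(null)                     # number of missing values in this column
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)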

output

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
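The per-column counting can be reproduced on a toy frame. In this sketch the `passengers` frame is hypothetical. It also shows `isnull().sum()`, an alternative not used in the original notes, which should give the same counts.

```python
import pandas

# Hypothetical frame with two missing ages and one missing fare.
passengers = pandas.DataFrame({
    "age": [22.0, float("nan"), float("nan")],
    "fare": [7.25, 71.28, float("nan")],
})

def null_count(column):
    # apply() passes each column in as a Series.
    return len(column[pandas.isnull(column)])

# One result per column, indexed by column name.
via_apply = passengers.apply(null_count)

# Alternative: True counts as 1 when a boolean frame is summed per column.
via_sum = passengers.isnull().sum()

print(via_apply)
print(via_sum)
```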

10.Applying a Function to a Row

input


def age_label(row):
    age = row['age']
    if pandas.isnull(age):
        return 'unknown'
    elif age < 18:
        return 'minor'
    else:
        return 'adult'
# use axis=1 so that the apply() method applies your function over the rows
age_labels = titanic_survival.apply(age_label, axis=1)
print(age_labels[0:5])

output

0    adult
1    minor
2    minor
3    adult
4    adult
dtype: object

11.Calculating Survival Percentage by Age Group

Now that we have age labels for everyone, let's make a pivot table to find the probability of survival for each age group.
We have added an "age_labels" column to the dataframe containing the age_labels variable from the previous step.

input

age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived")
print(age_group_survival)

output

age_labels
adult      0.387892
minor      0.525974
unknown    0.277567
Name: survived, dtype: float64
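The whole chain, from labelling rows to the survival rate per group, can be sketched end to end on invented data. The four-row `passengers` frame below is hypothetical, and assigning the labels with `passengers["age_labels"] = ...` is one plausible way to add the column; the original notes do not show how it was done.

```python
import pandas

# Hypothetical passengers: two adults, one minor, one with unknown age.
passengers = pandas.DataFrame({
    "age": [30.0, 10.0, float("nan"), 45.0],
    "survived": [1, 1, 0, 0],
})

def age_label(row):
    age = row["age"]
    if pandas.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

# Store the labels as a new column so pivot_table can group by it.
passengers["age_labels"] = passengers.apply(age_label, axis=1)

# The mean of a 0/1 column is the fraction of survivors in each group.
survival_by_group = passengers.pivot_table(index="age_labels", values="survived")
print(survival_by_group)
```

Because survived only holds 0 and 1, the default mean aggregation is exactly the survival rate for each label.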

#2.1.5 Working With Missing Data.md
