pandas 2

Learn to handle missing data using pandas and a data set on Titanic survival.

Introduction

import pandas as pd
titanic_survival = pd.read_csv("titanic_survival.csv")


Finding the Missing Data

The pandas library uses NaN, which stands for "not a number", to indicate a missing value.

To see which values are NaN, we can use the pd.isnull() function, which takes a pandas Series and returns a Series of True and False values, the same way NumPy did when we compared arrays.

sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)

We can use the resulting Series to select only the rows that have null values.

sex_null_true = sex[sex_is_null]

We'll use this structure to look at the null values for the "age" column.
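The masking pattern above can be tried without the CSV file. This is a minimal sketch that assumes a small hand-made Series in place of titanic_survival["sex"]; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_survival["sex"]; the values are made up.
sex = pd.Series(["female", "male", np.nan, "male", np.nan])

# Boolean mask: True wherever the value is missing.
sex_is_null = pd.isnull(sex)
print(sex_is_null.tolist())

# Indexing with the mask keeps only the rows where the mask is True.
sex_null_true = sex[sex_is_null]
print(sex_null_true.index.tolist())
```

The selected rows keep their original index labels, so the result also shows where the missing values sit in the column.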

Instructions

Count how many values in the "age" column have null values:

  • Use pd.isnull() on the age variable to create a Series of True and False values.

  • Use the resulting Series to select only the elements in age that are null, and assign the result to age_null_true.

  • Assign the length of age_null_true to age_null_count.

Print age_null_count to see how many null values are in the "age" column.

age = titanic_survival["age"]
print(age.loc[10:20])
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)
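As a cross-check on the count, the True values in the mask can also be summed directly, since True counts as 1. The sketch below assumes a made-up age Series rather than the real column, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for titanic_survival["age"].
age = pd.Series([29.0, np.nan, 2.0, np.nan, 25.0, np.nan])

# The approach from the exercise: filter with the mask, then take the length.
age_null_true = age[pd.isnull(age)]
age_null_count = len(age_null_true)

# True counts as 1 when summed, so summing the mask gives the same number.
age_null_count_short = int(age.isnull().sum())

print(age_null_count, age_null_count_short)
```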


Easier Ways to Do Math

Luckily, missing data is so common that many pandas methods automatically filter it out. For example, if we use the Series.mean() method to calculate the mean of a column, missing values will not be included in the calculation.

To calculate the mean age, we can replace the manual filtering code with one line:

correct_mean_age = titanic_survival["age"].mean()

# The longer approach, for comparison: filter out the null ages by hand.
age_is_null = pd.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][~age_is_null]
correct_mean_age = sum(good_ages) / len(good_ages)

# The same one-line approach works for the "fare" column.
correct_mean_fare = titanic_survival["fare"].mean()
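To see why the filtering matters, it may help to compare Series.mean() with a naive calculation on a tiny example. This is only a sketch with invented values; skipna is a real parameter of Series.mean() and defaults to True.

```python
import numpy as np
import pandas as pd

# Made-up values; two of the four entries are missing.
ages = pd.Series([20.0, np.nan, 40.0, np.nan])

# Series.mean() skips NaN by default (skipna=True): (20 + 40) / 2
mean_skipping_nan = ages.mean()

# Python's built-in sum() does not skip NaN, so the result is NaN.
naive_mean = sum(ages) / len(ages)

# With skipna=False, pandas also lets the NaN propagate.
mean_with_nan = ages.mean(skipna=False)

print(mean_skipping_nan, naive_mean, mean_with_nan)
```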


Calculating Summary Statistics

Let's calculate more summary statistics for the data.

The pclass column indicates the cabin class for each passenger, which was either first class (1), second class (2), or third class (3).

passenger_classes = [1, 2, 3]

You'll use the list passenger_classes, which contains these values, in the following exercise.

Instructions

Use a for loop to iterate over passenger_classes. Within the for loop:

  • Select just the rows in titanic_survival where the pclass value is equal to the current loop value (this_class).

for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]

  • Select just the fare column for the current subset of rows.

    pclass_fares = pclass_rows["fare"]

  • Use the Series.mean() method to calculate the mean of this subset.

    fare_for_class = pclass_fares.mean()

  • Add the mean of the class to the fares_by_class dictionary with this_class as the key.

    fares_by_class[this_class] = fare_for_class

Once the loop completes, the dictionary fares_by_class should have 1, 2, and 3 as keys, with the average fares as the corresponding values.

passenger_classes = [1, 2, 3]
fares_by_class = {}

for this_class in passenger_classes:
    # Rows for passengers in the current class
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    # Fares for just those rows
    pclass_fares = pclass_rows["fare"]
    # Series.mean() leaves missing fares out of the calculation
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
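The loop above can be checked on a small hand-made table. The sketch below assumes a hypothetical six-row DataFrame named titanic_sample with invented fares. It also compares the loop with pandas' groupby, a different technique from the one used in the exercise, which should give the same per-class means.

```python
import pandas as pd

# Hypothetical miniature of titanic_survival; the fares are made up.
titanic_sample = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [100.0, 200.0, 20.0, 30.0, 8.0, None],
})

# Loop version, following the exercise.
fares_by_class = {}
for this_class in [1, 2, 3]:
    pclass_rows = titanic_sample[titanic_sample["pclass"] == this_class]
    fares_by_class[this_class] = pclass_rows["fare"].mean()

# groupby version: split by pclass, then average the fare in each group.
fares_by_class_grouped = titanic_sample.groupby("pclass")["fare"].mean().to_dict()

print(fares_by_class)
print(fares_by_class_grouped)
```

In both versions the missing third-class fare is left out of the average, so class 3 reports the mean of its one known fare.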