Pythonic Data Cleaning With NumPy and Pandas【一】

Pythonic Data Cleaning With NumPy and Pandas

生词释意

a handful of columns 少量字段
roughly 初略的大体的
enforce 强迫实施执行

github库

https://github.com/realpython/python-data-cleaning

数据集

BL-Flickr-Images-Book.csv – A CSV file containing information about books from the British Library
university_towns.txt – A text file containing names of college towns in every US state
olympics.csv – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics

这篇文章中主要有以下几个content

数据集读取
按需删除字段
清理字段

>>> import pandas as pd
>>> import numpy as np

Dropping Columns in a `DataFrame`

Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.

In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

Pandas provides a handy way of removing unwanted columns or rows from a DataFramewith the drop() function. Let’s look at a simple example where we drop a number of columns from a DataFrame.

读取数据集

First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder named Datasets in our current working directory:

When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.

我们使用head()方法查看数据集的前几列基本信息。只有少量的字段对数据是有用的。

通过以下方式删除

>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']

>>> df.drop(to_drop, inplace=True, axis=1)

Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.

我们把需要删除的列，单独以列表的形式，传递给drop方法，即可删除

When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:

重新查看列数

使用定位函数查看

>>> df.loc[206]
Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object

Tidying up Fields in the Data 整理字段

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:

>>> df.get_dtype_counts()
object    6

get_dtype_counts 返回此对象中唯一dtypes的计数.
如果一列中含有多个类型,则该列的类型会是object,同样字符串类型的列也会被当成object类型.

One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:

出版日期应该强制转换为数字型，方便后续做计算

df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905           1888
1929    1839, 38-54
2836        [1897?]
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

一本确定的书，仅有一个确定的出版日期，因此我们需要做以下操作：

Remove the extra dates in square brackets, wherever present: 1879 [1878]
移除中括号内额外的日期

Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54

Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]

完全清除不确定的日期，用NumPy的NaN类型替代

Convert the string nan to NumPy’s NaN value

转换string nan为 NumPy’s NaN

统计数据每列为空的数据个数的统计

df.isnull().sum()

查看数据的类型统计

df.get_dtype_counts()

dataframe 的时候发现所有string 类型的 column 都是object类型

原文中还有一部分关于数据清理的操作，下篇文章继续翻译和解读。