Pythonic Data Cleaning With NumPy and Pandas
生词释意
a handful of columns 少量字段
roughly 初略的 大体的
enforce 强迫实施 执行
github库
https://github.com/realpython/python-data-cleaning
数据集
- BL-Flickr-Images-Book.csv – A CSV file containing information about books from the British Library
- university_towns.txt – A text file containing names of college towns in every US state
- olympics.csv – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics
这篇文章中主要有以下几个content
数据集读取
按需删除字段
清理字段
>>> import pandas as pd
>>> import numpy as np
Dropping Columns in a DataFrame
Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.
In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.
Pandas provides a handy way of removing unwanted columns or rows from a DataFrame
with the drop()
function. Let’s look at a simple example where we drop a number of columns from a DataFrame
.
读取数据集
First, let’s create a DataFrame
out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass a relative path to pd.read_csv
, meaning that all of the datasets are in a folder named Datasets
in our current working directory:
When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.
我们使用head()方法查看数据集的前几列基本信息。只有少量的字段对数据是有用的。
通过以下方式删除
>>> to_drop = ['Edition Statement',
... 'Corporate Author',
... 'Corporate Contributors',
... 'Former owner',
... 'Engraver',
... 'Contributors',
... 'Issuance type',
... 'Shelfmarks']
>>> df.drop(to_drop, inplace=True, axis=1)
Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.
我们把需要删除的列,单独以列表的形式,传递给drop方法,即可删除
When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:
重新查看列数
使用定位函数查看
>>> df.loc[206]
Place of Publication London
Date of Publication 1879 [1878]
Publisher S. Tinsley & Co.
Title Walter Forbes. [A novel.] By A. A
Author A. A.
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object
Tidying up Fields in the Data 整理字段
So far, we have removed unnecessary columns and changed the index of our DataFrame
to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication
and Place of Publication
.
Upon inspection, all of the data types are currently the object
dtype, which is roughly analogous to str
in native Python.
It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:
>>> df.get_dtype_counts()
object 6
get_dtype_counts 返回此对象中唯一dtypes的计数.
如果一列中含有多个类型,则该列的类型会是object,同样字符串类型的列也会被当成object类型.
One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:
出版日期应该强制转换为数字型,方便后续做计算
df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905 1888
1929 1839, 38-54
2836 [1897?]
2854 1865
2956 1860-63
2957 1873
3017 1866
3131 1899
4598 1814
4884 1820
Name: Date of Publication, dtype: object
A particular book can have only one date of publication. Therefore, we need to do the following:
一本确定的书,仅有一个确定的出版日期,因此我们需要做以下操作:
Remove the extra dates in square brackets, wherever present: 1879 [1878]
移除中括号内额外的日期
Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
完全清除不确定的日期,用NumPy的NaN类型替代
Convert the string nan to NumPy’s NaN value
转换string nan为 NumPy’s NaN
统计数据每列为空的数据个数的统计
df.isnull().sum()
查看数据的类型统计
df.get_dtype_counts()
dataframe 的时候 发现所有string 类型的 column 都是object类型
原文中还有一部分关于数据清理的操作,下篇文章继续翻译和解读。