Abstract:18年复活节前的五天,kaggle举办了数据预处理的五个挑战。这里做每天学习到的技术要点的回顾。这篇是最后一天的内容,主要是有关替换文本信息中同一信息但是格式不统一的冗余数据。
有些文本数据里面会有拼写错误,多余的空格等情况,如果直接给这些原本有相同意义的数据分类,会让机器学习算法觉得他们是不同的数据,可能会阻碍正确的信息分类。
当然可以手动修改,但是随着数据量的越来越大,自动修改才是王道。
环境设置
需要用到的特殊的包是fuzzywuzzy。chardet在上一片第四天的文章中已经着重介绍过。
# modules we'll use
import pandas as pd
import numpy as np
# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import chardet
查看重复项的成因
导入数据后,找出“City”这一列,看看有多少例情况:
cities = suicide_attacks['City'].unique()
# sort them alphabetically and then take a closer look
cities.sort()
cities
结果会得到一个全是城市名字的列表:
array(['ATTOCK', 'Attock ', 'Bajaur Agency', 'Bannu', 'Bhakkar ', 'Buner', 'Chakwal ', 'Chaman', 'Charsadda', 'Charsadda ', 'D. I Khan', 'D.G Khan', 'D.G Khan ', 'D.I Khan', 'D.I Khan ', 'Dara Adam Khel', 'Dara Adam khel', 'Fateh Jang', 'Ghallanai, Mohmand Agency ', 'Gujrat', 'Hangu', 'Haripur', 'Hayatabad', 'Islamabad', 'Islamabad ', 'Jacobabad', 'KURRAM AGENCY', 'Karachi', 'Karachi ', 'Karak', 'Khanewal', 'Khuzdar', 'Khyber Agency', 'Khyber Agency ', 'Kohat', 'Kohat ', 'Kuram Agency ', 'Lahore', 'Lahore ', 'Lakki Marwat', 'Lakki marwat', 'Lasbela', 'Lower Dir', 'MULTAN', 'Malakand ', 'Mansehra', 'Mardan', 'Mohmand Agency', 'Mohmand Agency ', 'Mohmand agency', 'Mosal Kor, Mohmand Agency', 'Multan', 'Muzaffarabad', 'North Waziristan', 'North waziristan', 'Nowshehra', 'Orakzai Agency', 'Peshawar', 'Peshawar ', 'Pishin', 'Poonch', 'Quetta', 'Quetta ', 'Rawalpindi', 'Sargodha', 'Sehwan town', 'Shabqadar-Charsadda', 'Shangla ', 'Shikarpur', 'Sialkot', 'South Waziristan', 'South waziristan', 'Sudhanoti', 'Sukkur', 'Swabi ', 'Swat', 'Swat ', 'Taftan', 'Tangi, Charsadda District', 'Tank', 'Tank ', 'Taunsa', 'Tirah Valley', 'Totalai', 'Upper Dir', 'Wagah', 'Zhob', 'bannu', 'karachi', 'karachi ', 'lakki marwat', 'peshawar', 'swat'], dtype=object)
可以观察到的是,其中有不少重复的地名,只是因为后面里面多一个空格或者大小写不一样被认为是不同的地方。
首先先去除大小写的困扰并删除字串首尾的多余空白。这两个简单操作可以去除英语字符里一大部分的不一致现象。
# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove trailing white spaces
suicide_attacks['City'] = suicide_attacks['City'].str.strip()
用fuzzywuzzy替换相似项
这时的城市清单是这样:
array(['attock', 'bajaur agency', 'bannu', 'bhakkar', 'buner', 'chakwal', 'chaman', 'charsadda', 'd. i khan', 'd.g khan', 'd.i khan', 'dara adam khel', 'fateh jang', 'ghallanai, mohmand agency', 'gujrat', 'hangu', 'haripur', 'hayatabad', 'islamabad', 'jacobabad', 'karachi', 'karak', 'khanewal', 'khuzdar', 'khyber agency', 'kohat', 'kuram agency', 'kurram agency', 'lahore', 'lakki marwat', 'lasbela', 'lower dir', 'malakand', 'mansehra', 'mardan', 'mohmand agency', 'mosal kor, mohmand agency', 'multan', 'muzaffarabad', 'north waziristan', 'nowshehra', 'orakzai agency', 'peshawar', 'pishin', 'poonch', 'quetta', 'rawalpindi', 'sargodha', 'sehwan town', 'shabqadar-charsadda', 'shangla', 'shikarpur', 'sialkot', 'south waziristan', 'sudhanoti', 'sukkur', 'swabi', 'swat', 'taftan', 'tangi, charsadda district', 'tank', 'taunsa', 'tirah valley', 'totalai', 'upper dir', 'wagah', 'zhob'], dtype=object)
可以看到'd. i khan' 和 'd.i khan' 因为中间一个字符(空格)的区别被分成两类。于是需要用模糊匹配来找到相近的文本并把它替换掉。
模糊匹配的原理是:当你给出一个字符串,电脑会把它和数据中的字符比较并打分。比较相似度越高的项分数越高,最高100%。相似度越高,就意味着把其中一项改成另外一项需要变动的字符越少。比如“apple“和”snapple“相差两次变动,而”in“和”on“相差一次变动。语句如下:
# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
# take a look at them
matches
这个cell得到的是城市清单里和“d.i.khan"相似度最高的10项以及起相似度分数,由高到低排。
[('d. i khan', 100), ('d.i khan', 100), ('d.g khan', 88), ('khanewal', 50), ('sudhanoti', 47), ('hangu', 46), ('kohat', 46), ('dara adam khel', 45), ('chaman', 43), ('mardan', 43)]
下面就要编程替换相似度高于90的项(88的那货是另外一个城市,不是格式错误)。
当你需要重复实现一个通用功能的时候,编写一个funtion,以后可以直接调用,省事,也省的出错
# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
# get a list of unique strings
strings = df[column].unique()
# get the top 10 closest matches to our input string
matches = fuzzywuzzy.process.extract(string_to_match, strings,
limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
# only get matches with a ratio > 90
close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
# get the rows of all the close matches in our dataframe
rows_with_matches = df[column].isin(close_matches)
# replace all rows with close matches with the input matches
df.loc[rows_with_matches, column] = string_to_match
# let us know the function's done
print("All done!")
这里要说明的是:
-
close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
是List Comprehension,其效果等同于:
close_matches = []
for matches in matches:
if matches[1] >= min_ratio:
close_matches.append(matches[0])
- pandas.DataFrame.isin(Value)会返回一个真值表,数据在value里的位置为1。
调用函数就可以完成所有和d.i khan相似的替换:
# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")