Abstract: In the five days before Easter 2018, Kaggle ran a five-day data-cleaning challenge. This series reviews the key techniques learned each day. This post covers Day 4, which is about character encodings.
Introduction
Python's default encoding is UTF-8, so importing data stored in other encodings raises errors. Today's topic is understanding how encoding and decoding work, and how to get data in other encodings into Python and saved in the default UTF-8.
Setting up the environment
We need chardet, a character encoding detector.
# modules we'll use
import pandas as pd
import numpy as np
# helpful character encoding module
import chardet
# set seed for reproducibility
np.random.seed(0)
Understanding encoding and decoding
Text data generally comes in two formats. One is the string:
# start with a string
before = "This is the euro symbol: €"
The other is bytes:
# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors = "replace")
After encoding to UTF-8, the string has become bytes, printed with a leading "b":
b'This is the euro symbol: \xe2\x82\xac'
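A quick check of the two types (a minimal sketch reusing the before and after variables from above):
# str in, bytes out
print(type(before))  # <class 'str'>
print(type(after))   # <class 'bytes'>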
Decoding with UTF-8 turns it back into the original string:
# convert it back to utf-8
print(after.decode("utf-8"))
This is the euro symbol: €
The thing to watch out for: data encoded as UTF-8 can't be decoded with a different codec (such as ASCII). ASCII is the earliest encoding and can only represent English text and a limited set of symbols.
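To see this concretely, here is a minimal sketch reusing the after bytes from above: decoding UTF-8 bytes with the ASCII codec fails, because the euro sign's bytes fall outside ASCII's range.
# try to decode UTF-8 bytes with the wrong codec
try:
    print(after.decode("ascii"))
except UnicodeDecodeError as err:
    print(err)
# 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)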
# Your turn! Try encoding and decoding different symbols to ASCII and
# see what happens. I'd recommend $, #, 你好 and नमस्ते but feel free to
# try other characters. What happens? When would this cause problems?
mytext = "£,#,nihao,你好"
encode_utf = mytext.encode("utf-8", errors = "replace")
encode_ascii = mytext.encode("ascii", errors = "replace")
print(encode_utf)
print(encode_ascii)
print(encode_utf.decode("utf-8"))
print(encode_ascii.decode("ascii"))
The outputs, in order:
b'\xc2\xa3,#,nihao,\xe4\xbd\xa0\xe5\xa5\xbd'
b'?,#,nihao,??'
£,#,nihao,你好
?,#,nihao,??
As you can see, after encoding and decoding with the wrong codec, the original information is gone.
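Note that the silent loss above comes from errors = "replace", which swaps in "?" for anything the codec can't represent. The default handler, errors = "strict", refuses to lose data and raises instead, which at least surfaces the problem. A minimal sketch:
# with the default errors="strict", a bad encode raises instead of losing data
try:
    "你好".encode("ascii")
except UnicodeEncodeError as err:
    print(err)
# 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)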
Reading data in other encodings
Back to the problem from the introduction. If the encoding doesn't match, pd.read_csv raises an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte
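For reference, the call that triggers this is just a plain pd.read_csv with no encoding argument; a sketch, assuming the same Kickstarter CSV used below:
# naive read: pandas assumes UTF-8 and chokes on the Windows-1252 bytes
try:
    kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv")
except UnicodeDecodeError as err:
    print(err)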
So we need to find out what encoding the original data uses and tell Python to read it that way.
But there are countless encodings, and trying them one by one gets tiring. Fortunately the chardet library can guess automatically, just like the automatic date-format detection from yesterday: not guaranteed to be 100% correct, but it saves effort. We detect on only the first 10,000 bytes of the file, because (1) scanning the whole file is too slow, and (2) the error message says the problem is at position 11, which 10,000 bytes easily covers.
PS, from the Python tutorial on reading and writing files:
open() returns a file object, and is most commonly used with two arguments: open(filename, mode). The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; 'r' will be assumed if it's omitted.
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201612.csv", 'rb') as rawdata:
result = chardet.detect(rawdata.read(10000))
# check what the character encoding might be
print(result)
The result: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
So chardet detects the encoding as Windows-1252, with 73% confidence.
We tell Python to read it that way, then promptly convert it to UTF-8 (the default) and save it.
# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')
# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv")
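As a sanity check, the saved copy can now be re-read without any encoding argument, since it is UTF-8; a minimal sketch:
# re-read the saved copy; no encoding argument needed now
check = pd.read_csv("ks-projects-201612-utf8.csv")
print(check.head())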
One final note: sometimes chardet reads the first 10,000 bytes and reports a result (say, ascii), but passing that encoding to Python still raises an error. In that case, try more bytes; with a larger sample, chardet may come back with a different guess.
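A sketch of that fallback, using an arbitrary larger sample size of 100,000 bytes (pick whatever your memory and patience allow, up to the whole file):
# guess again with a bigger sample when the first guess doesn't work
with open("../input/kickstarter-projects/ks-projects-201612.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
print(result)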