Unicode in Python3

Unicode and UTF-8

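First, let's get two facts straight:

Everything in a computer (files, network traffic) is stored as bytes. Bytes have no inherent meaning, though; for bytes to represent text, we need an encoding scheme (ASCII, for example).
But the world has many writing systems, and ASCII covers only basic Western (English) characters; it cannot represent others, such as Chinese. So different countries developed their own encoding schemes for their own characters. The result was chaos: the same byte decodes to different characters under different encoding schemes.
Unicode was created to solve this problem: every character is assigned exactly one code point. Encodings such as UTF-8 and UTF-16 then map Unicode code points to bytes.
So you can think of it this way: **Unicode is a character set, while UTF-8 and the like are encoding schemes.**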
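
As a rough illustration of the difference between a code point and its encoded bytes, the sketch below looks at the two characters of "你好". The variable names are only for illustration; the calls used (`ord`, `str.encode`, `bytes.decode`, `bytes.hex`) are standard Python 3 built-ins.

```python
text = "你好"

# A code point is an abstract number that Unicode assigns to a character.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)  # ['0x4f60', '0x597d']

# An encoding turns those code points into concrete bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
print(utf8_bytes.hex())   # e4bda0e5a5bd (3 bytes per character here)
print(utf16_bytes.hex())  # 604f7d59     (2 bytes per character here)

# The same bytes mean different things under different legacy encodings.
legacy = "你".encode("gbk")
as_gbk = legacy.decode("gbk")
as_latin1 = legacy.decode("latin-1")
print(as_gbk, as_latin1)
```

The same two code points produce different byte sequences under UTF-8 and UTF-16, and the last lines show why bytes without a known encoding are ambiguous.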

Unicode in Python3

Python 3 has three relevant types: str, bytes, and bytearray. A str stores Unicode code points, while a bytes object stores raw bytes. Python 3 never converts implicitly between bytes and str. (Python 2 did, and that was often a source of bugs.)

data type for text or bytes.jpg

encode vs decode.jpg

text vs bytes.jpg

>>> "Hello" + b"World"                                                                  
Traceback (most recent call last):                                                      
  File "<stdin>", line 1, in <module>                                                   
TypeError: Can't convert 'bytes' object to str implicitly                               
>>> "Hello" == b"Hello"                                                                 
False                                                                                   
>>> d = {"Hello": "World"}                                                              
>>> d[b"Hello"]                                                                         
Traceback (most recent call last):                                                      
  File "<stdin>", line 1, in <module>                                                   
KeyError: b'Hello'
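
The failures above go away once the conversion is made explicit. A minimal sketch, assuming the bytes are known to be ASCII (the variable names are made up for the example):

```python
greeting = "Hello"
raw = b"World"

# Decode the bytes to str before concatenating.
combined = greeting + raw.decode("ascii")
print(combined)  # HelloWorld

# Comparison only succeeds when both sides have the same type.
same = greeting.encode("ascii") == b"Hello"
print(same)  # True

# Dictionary keys must match in type as well as content.
d = {"Hello": "World"}
value = d[b"Hello".decode("ascii")]
print(value)  # World
```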

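Also, when you read a file in text mode (that is, without the b flag), Python 3 decodes it for you using the system's default encoding (Python 2 did not; it gave you bytes). You can specify the encoding explicitly with the encoding parameter.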

>>> import sys                                                                     
>>> sys.getfilesystemencoding()                                                    
'mbcs'
>>> open('hello.txt').read()  # hello.txt is encoded in UTF-8
'浣犲ソ'                                                                              
>>> open('hello.txt', encoding="utf8").read()                                      
'你好'
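
The session above depends on the machine's default encoding, so it may not reproduce everywhere. The sketch below simulates the same situation in a portable way: it writes a UTF-8 file into a temporary directory, then reads it back once with a deliberately wrong encoding (GBK, chosen here as an assumption to imitate a Chinese Windows default) and once with the right one.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given is platform dependent.
print(locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.txt")

    # Write the file as UTF-8, stating the encoding explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write("你好")

    # Binary mode returns the raw bytes, with no decoding at all.
    with open(path, "rb") as f:
        raw = f.read()

    # Decoding with the wrong encoding yields mojibake instead of an error.
    with open(path, encoding="gbk") as f:
        garbled = f.read()

    # Decoding with the encoding the file was written in recovers the text.
    with open(path, encoding="utf-8") as f:
        correct = f.read()

print(raw)
print(garbled)
print(correct)
```

Passing `encoding` on every text-mode `open()` call keeps the result independent of the machine the code runs on.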

Tips

As soon as you read bytes in, decode them to Unicode. Inside your program, use only Unicode, and encode back to bytes only when you output.
There are a few caveats, though:

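Some libraries already hand you Unicode as input, or require your output to be Unicode.
The encoding to use when decoding input bytes to Unicode is supplied by the input itself (HTML and HTTP both declare it), although the declared encoding may of course be wrong.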
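
One possible shape for this advice is sketched below. The helper names (`declared_charset`, `decode_input`, `shout`) and the UTF-8 fallback are assumptions made for illustration; only `email.message.Message.get_content_charset` comes from the standard library. It decodes at the input edge using the charset declared in a Content-Type value, works on str in the middle, and encodes at the output edge.

```python
from email.message import Message


def declared_charset(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_input(raw: bytes, content_type: str) -> str:
    """Decode at the edge, trusting the declaration only as far as it works."""
    charset = declared_charset(content_type)
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # The declaration was wrong or unknown: fall back (an assumption here).
        return raw.decode("utf-8", errors="replace")


def shout(raw: bytes, content_type: str) -> bytes:
    """Bytes in, str in the middle, bytes out."""
    text = decode_input(raw, content_type)  # bytes -> str at the input edge
    result = text.upper()                   # all processing happens on str
    return result.encode("utf-8")           # str -> bytes at the output edge


print(shout("héllo".encode("latin-1"), "text/plain; charset=latin-1"))
```

Note that a wrong declaration does not always raise an error; some encodings will decode almost any byte sequence and quietly produce garbage, so the fallback only catches part of the problem.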

Outline excerpted from Pragmatic Unicode

All input and output of your program is bytes.

The world needs more than 256 symbols to communicate text.
Your program has to deal with both bytes and Unicode.
A stream of bytes can't tell you its encoding.
Encoding specifications can be wrong.

Unicode sandwich: keep all text in your program as Unicode, and convert as close to the edges as possible.

Know what your strings are: you should be able to explain which of your strings are Unicode, which are bytes, and for your byte strings, what encoding they use.
Test your Unicode support. Use exotic strings throughout your test suites to be sure you're covering all the cases.
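
As a small sketch of the last point, a test suite could push a handful of non-ASCII samples through the code under test. The sample list and the `roundtrip` helper below are hypothetical stand-ins; in practice the function under test would be your own input and output path.

```python
# Hypothetical sample strings that go beyond ASCII.
EXOTIC_SAMPLES = [
    "你好",          # CJK characters
    "héllo wörld",   # accented Latin letters
    "Ω≈ç√∫",         # symbols outside Latin-1
    "😀",            # a code point outside the Basic Multilingual Plane
]


def roundtrip(text: str, encoding: str = "utf-8") -> str:
    """Stand-in for real I/O: encode to bytes, then decode back."""
    return text.encode(encoding).decode(encoding)


def all_samples_survive(encoding: str = "utf-8") -> bool:
    """Return True if every sample comes back unchanged."""
    return all(roundtrip(s, encoding) == s for s in EXOTIC_SAMPLES)


print(all_samples_survive("utf-8"))   # True
print(all_samples_survive("utf-16"))  # True
```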
