Encoding Issues in Python 2 and Python 3

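Python 3 has two types that represent sequences of characters: bytes and str. Instances of bytes contain raw bytes; instances of str contain Unicode characters.
Python 2 also has two such types: str and unicode. Unlike in Python 3, instances of str contain raw bytes, while instances of unicode contain Unicode characters.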

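There are many ways to represent Unicode characters as binary data (that is, raw bytes); each of them is an encoding. The most common encoding is UTF-8. Keep in mind, though, that str instances in Python 3 and unicode instances in Python 2 are not tied to any particular binary encoding. To convert Unicode characters to binary data, you must use the encode method. To convert binary data to Unicode characters, you must use the decode method.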
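As a minimal Python 3 sketch of that round trip (the sample string and the choice of Latin-1 as a second encoding are only for illustration):

```python
# Python 3: str holds Unicode characters, bytes holds raw bytes.
text = 'café'                      # 4 Unicode characters

data = text.encode('utf-8')        # str -> bytes
print(data)                        # b'caf\xc3\xa9'
print(len(text), len(data))        # 4 5

# The same characters have a different binary form under another encoding.
latin1 = text.encode('latin-1')
print(latin1)                      # b'caf\xe9'

# Decoding must use the encoding that produced the bytes.
assert data.decode('utf-8') == text
assert latin1.decode('latin-1') == text
```

The str value itself never changes; only its byte representation depends on the encoding passed to encode.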

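When writing Python programs, do the encoding and decoding at the outermost boundary of your interfaces. The core of the program should use the Unicode character type (str in Python 3, unicode in Python 2) and should make no assumptions about character encoding. This lets the program accept input in many different text encodings while guaranteeing that its output uses exactly one encoding (ideally UTF-8).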
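One way to picture this layering in Python 3 is the sketch below. The function names (read_text, shout, write_text) and the in-memory streams are hypothetical, chosen only to make the example self-contained; real programs would typically decode at file or socket boundaries.

```python
import io

def read_text(stream, encoding):
    """Input boundary (hypothetical helper): decode raw bytes into str once."""
    return stream.read().decode(encoding)

def shout(text):
    """Core logic: works on str only and knows nothing about encodings."""
    return text.upper()

def write_text(stream, text):
    """Output boundary (hypothetical helper): always encode as UTF-8."""
    stream.write(text.encode('utf-8'))

# Two inputs carrying the same text in different encodings.
sources = [
    (io.BytesIO('naïve'.encode('latin-1')), 'latin-1'),
    (io.BytesIO('naïve'.encode('utf-16')), 'utf-16'),
]

out = io.BytesIO()
for stream, encoding in sources:
    write_text(out, shout(read_text(stream, encoding)) + '\n')

# Whatever the input encoding was, the output is UTF-8.
result = out.getvalue()
print(result.decode('utf-8'))
```

Only read_text and write_text ever mention an encoding, so supporting another input encoding does not touch the core logic.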

In day-to-day development, two situations come up often:

You need raw bytes that are encoded as UTF-8.
You need to operate on Unicode characters that have no specific encoding.

The helper functions below convert between these two cases, so that the converted input matches the type the code expects.

Python 3

Accepts a `str` or `bytes` and returns a `str`:

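def to_str(bytes_or_str):
    """Return a str; bytes input is assumed to be UTF-8 encoded."""
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str  # already a str, pass through unchanged
    return value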

Accepts a `str` or `bytes` and returns `bytes`:

def to_bytes(bytes_or_str):
    """Return bytes; str input is encoded as UTF-8."""
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str  # already bytes, pass through unchanged
    return value
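A quick check of how the two Python 3 helpers behave might look like the following. The helpers are repeated in compact form so the snippet runs on its own; the sample values are arbitrary.

```python
def to_str(bytes_or_str):
    # Same logic as the helper above, repeated so this snippet is runnable.
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode('utf-8')
    return bytes_or_str

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode('utf-8')
    return bytes_or_str

print(repr(to_str(b'foo')))     # 'foo'
print(repr(to_str('bar')))      # 'bar'
print(repr(to_bytes(b'foo')))   # b'foo'
print(repr(to_bytes('bar')))    # b'bar'

# Non-ASCII text survives a round trip through UTF-8.
assert to_str(to_bytes('你好')) == '你好'

# Bytes that are not valid UTF-8 are rejected rather than silently mangled.
try:
    to_str(b'\xff')
except UnicodeDecodeError:
    print('not valid UTF-8')
```

Because both helpers assume UTF-8, data in any other encoding should be decoded explicitly at the boundary before it reaches them.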

Python 2

Accepts a `str` or `unicode` and returns a `str` (raw bytes in Python 2):

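def to_py2_str(unicode_or_str):
    """Python 2 only: return a str (raw bytes), encoding unicode as UTF-8."""
    if isinstance(unicode_or_str, unicode):
        value = unicode_or_str.encode('utf-8')
    else:
        value = unicode_or_str  # already a byte str, pass through unchanged
    return value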

Accepts a `str` or `unicode` and returns a `unicode`:

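def to_unicode(unicode_or_str):
    """Python 2 only: return unicode; str input is assumed to be UTF-8."""
    if isinstance(unicode_or_str, str):
        value = unicode_or_str.decode('utf-8')
    else:
        value = unicode_or_str  # already unicode, pass through unchanged
    return value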