Python Chinese Encoding Problem

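I was stuck for a long time on a Chinese encoding error. A senior labmate gave me a txt file containing Chinese text that had come from a Mac, and it turned out that the file begins with a BOM character! I'm writing this down as a reminder to myself.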

The earliest version looked like this:

label_list = []
# Plain 'utf-8' keeps a leading BOM as the character '\ufeff' in the first line.
with open('input/out_classes.txt', 'r', encoding='utf-8') as cf:
    for line in cf:
        label = line.strip('\r\n')
        label_list.append(label)
        print(label)

The error was:

UnicodeEncodeError: 'gbk' codec can't encode character '\ufeff' in position 0
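
The message names 'gbk' even though the file was opened as UTF-8. A likely reading is that reading the file succeeded, and the failure came from `print`, which encodes text for a console that uses GBK. The sketch below reproduces that under this assumption; the sample bytes and variable names are made up for illustration.

```python
# Bytes of a UTF-8 file that starts with a BOM (EF BB BF), followed by "中文".
raw = b'\xef\xbb\xbf' + '中文'.encode('utf-8')

# Plain 'utf-8' decoding succeeds, but keeps the BOM as the character U+FEFF.
text = raw.decode('utf-8')
starts_with_bom = text.startswith('\ufeff')

# GBK has no mapping for U+FEFF, so encoding the decoded text fails,
# which is what happens when print() writes to a GBK console.
try:
    text.encode('gbk')
    gbk_failed = False
except UnicodeEncodeError:
    gbk_failed = True
```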

Stack Overflow explains it this way:

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you.

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')
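
The quoted snippet is Python 2. A rough Python 3 equivalent of the BOM-related parts might look like the following. It is only a sketch, and the names `decoded_plain` and `decoded_sig` are mine, not from the original answer.

```python
import codecs

u = 'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM (EF BB BF prepended)
e16 = u.encode('utf-16')      # encode with BOM, native byte order

# Decoding BOM-prefixed UTF-8 with plain 'utf-8' keeps U+FEFF as a character;
# 'utf-8-sig' strips it.
decoded_plain = e8s.decode('utf-8')
decoded_sig = e8s.decode('utf-8-sig')

print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-8  w/ BOM decoded with utf-8     %r' % decoded_plain)
print('utf-8  w/ BOM decoded with utf-8-sig %r' % decoded_sig)
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
```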

In the end I changed it to this:

label_list = []
# 'utf-8-sig' strips a leading BOM if the file has one.
with open('input/out_classes.txt', 'r', encoding='utf-8-sig') as cf:
    for line in cf:
        label = line.strip('\r\n')
        label_list.append(label)
        print(label)
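
As a quick check that `utf-8-sig` is safe whether or not the file has a BOM, here is a small self-contained sketch. The helper `read_labels`, the temporary file names, and the sample labels are hypothetical and only for illustration.

```python
import os
import tempfile

def read_labels(path):
    """Read one label per line, dropping a leading BOM if present."""
    labels = []
    with open(path, 'r', encoding='utf-8-sig') as cf:
        for line in cf:
            labels.append(line.strip('\r\n'))
    return labels

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, 'with_bom.txt')
    without_bom = os.path.join(tmp, 'without_bom.txt')

    # Writing with 'utf-8-sig' prepends a BOM; plain 'utf-8' does not.
    with open(with_bom, 'w', encoding='utf-8-sig') as f:
        f.write('猫\n狗\n')
    with open(without_bom, 'w', encoding='utf-8') as f:
        f.write('猫\n狗\n')

    labels_with_bom = read_labels(with_bom)
    labels_without_bom = read_labels(without_bom)
```

Both files should yield the same list of labels, so the same reading code can be used for files from either source.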
