Python Chinese Encoding Problem

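I was stuck on a Chinese encoding error for a long time. A senior labmate had given me a txt file containing Chinese text, copied over from a Mac, and it turned out that the file starts with a BOM character! Recording it here as a reminder to myself…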

  • The earliest version looked like this:
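label_list = []
# Read one class label per line; plain 'utf-8' keeps a leading BOM as '\ufeff'
with open('input/out_classes.txt', 'r', encoding='utf-8') as cf:
    for line in cf:
        label = line.strip('\r\n')  # drop the line ending only
        label_list.append(label)
        print(label)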
  • The error was:

UnicodeEncodeError: 'gbk' codec can't encode character '\ufeff' in position 0
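
The traceback names 'gbk' rather than 'utf-8', which suggests the file was read successfully and the failure came from print() writing to a console whose encoding is GBK. That is an assumption about the environment (GBK is the usual default on Chinese-locale Windows). A minimal sketch that reproduces just that step, without the file:

```python
# U+FEFF is what a UTF-8 BOM turns into when the file is decoded as plain 'utf-8'.
bom = '\ufeff'

# Assumption: the console encoding is GBK, so print() must encode text to GBK.
try:
    bom.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError as exc:
    gbk_ok = False
    print(exc)

# The same character encodes without trouble in UTF-8.
utf8_bytes = bom.encode('utf-8')
print(utf8_bytes)
```

So the first label carries an invisible '\ufeff' prefix, and printing it is what fails.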

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. In UTF-8 it carries no byte-order information and serves only as a signature at the start of the file. If you decode the data using the right codec, Python will remove it for you.

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')
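
The snippet above targets Python 2. Since the file-reading code in this post is Python 3, here is a rough Python 3 sketch of the same comparison. The variable names are illustrative, and the UTF-16 sample is built with an explicit little-endian BOM so the result does not depend on the machine's byte order:

```python
import codecs

text = 'ABC'

e8 = text.encode('utf-8')          # no BOM
e8s = text.encode('utf-8-sig')     # BOM (EF BB BF) followed by the text
e16 = codecs.BOM_UTF16_LE + text.encode('utf-16le')  # explicit LE BOM + text

print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print()

# Plain 'utf-8' keeps the BOM as the character U+FEFF; 'utf-8-sig' strips it.
kept = e8s.decode('utf-8')
stripped = e8s.decode('utf-8-sig')
# 'utf-8-sig' also accepts input that has no BOM at all.
no_bom_input = e8.decode('utf-8-sig')
# 'utf-16' consumes the BOM; 'utf-16le' treats it as ordinary data.
u16 = e16.decode('utf-16')
u16le = e16.decode('utf-16le')

for name, value in [('kept', kept), ('stripped', stripped),
                    ('no_bom_input', no_bom_input),
                    ('u16', u16), ('u16le', u16le)]:
    print('%-12s %s' % (name, ascii(value)))
```

ascii() is used for printing so that the demo itself does not trip over a GBK console.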
  • In the end I changed it to this:
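label_list = []
# 'utf-8-sig' strips a leading BOM if present and otherwise behaves like 'utf-8'
with open('input/out_classes.txt', 'r', encoding='utf-8-sig') as cf:
    for line in cf:
        label = line.strip('\r\n')  # drop the line ending only
        label_list.append(label)
        print(label)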
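
To check the fix without the original file, the sketch below writes a small stand-in file that starts with a UTF-8 BOM and reads it back both ways. The helper read_labels and the sample labels are hypothetical, made up for this illustration:

```python
import os
import tempfile


def read_labels(path, encoding='utf-8-sig'):
    """Hypothetical helper: return one label per line, without line endings."""
    with open(path, 'r', encoding=encoding) as cf:
        return [line.strip('\r\n') for line in cf]


# Build a stand-in for out_classes.txt: UTF-8 BOM followed by two labels.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'out_classes.txt')
    with open(path, 'wb') as f:
        f.write(b'\xef\xbb\xbf' + '猫\n狗\n'.encode('utf-8'))

    with_bom = read_labels(path, encoding='utf-8')  # BOM stays on first label
    without_bom = read_labels(path)                 # BOM stripped

# ascii() keeps the output printable even on a GBK console.
print(ascii(with_bom))
print(ascii(without_bom))
```

Because 'utf-8-sig' also reads files that have no BOM, it seems a reasonable default whenever the origin of a UTF-8 text file is uncertain.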