In [1]: from unidecode import unidecode
In [2]: s = u'我把大门我的妈孙冬梅'
In [3]: s
Out[3]: u'\u6211\u628a\u5927\u95e8\u6211\u7684\u5988\u5b59\u51ac\u6885'
In [4]: unidecode(s)
Out[4]: 'Wo Ba Da Men Wo De Ma Sun Dong Mei '
In [20]: s = '我把大门我的妈孙冬梅'
In [21]: s
Out[21]: '\xe6\x88\x91\xe6\x8a\x8a\xe5\xa4\xa7\xe9\x97\xa8\xe6\x88\x91\xe7\x9a\x84\xe5\xa6\x88\xe5\xad\x99\xe5\x86\xac\xe6\xa2\x85'
In [26]: unidecode(s.decode('utf8'))
Out[26]: 'Wo Ba Da Men Wo De Ma Sun Dong Mei '
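Note: if s is already a unicode object, you can call unidecode(s) directly. Otherwise it has to be decoded first, and utf8 is usually the right encoding. (The sessions above are Python 2; in Python 3 a str is already Unicode, and only bytes need decoding.)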
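The decode step can be wrapped in a small helper. This is a minimal sketch, assuming Python 3, where `bytes` plays the role of the Python 2 byte string; `to_text` is a hypothetical name and not part of unidecode.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as Unicode text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = "我把大门".encode("utf-8")  # a byte string, like s in the second session
text = to_text(raw)              # Unicode text, ready to hand to unidecode()
print(text == "我把大门")         # True
```

Text that is already Unicode passes through unchanged, so the helper can be applied to either kind of input.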
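This covers only the most basic case. Each character is transliterated on its own, so special cases such as polyphonic characters (characters with more than one reading) cannot be resolved.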
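Why a per-character lookup cannot handle polyphonic characters can be seen in a toy sketch. The table below is invented for illustration and is not the real unidecode data; `TOY_TABLE` and `toy_transliterate` are hypothetical names.

```python
# Toy table for illustration only; these entries are NOT taken from unidecode.
TOY_TABLE = {
    "重": "Zhong ",  # also read "chong", as in the city name 重庆 (Chongqing)
    "庆": "Qing ",
    "量": "Liang ",
}


def toy_transliterate(text):
    # Each character is looked up in isolation; neighbouring characters are never consulted.
    return "".join(TOY_TABLE.get(char, "") for char in text)


print(repr(toy_transliterate("重量")))  # 'Zhong Liang '
print(repr(toy_transliterate("重庆")))  # 'Zhong Qing '
```

With one fixed entry per character, 重 comes out the same way in both words, although the usual readings differ (zhongliang versus Chongqing).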
How it works:
import warnings

Cache = {}  # section number -> tuple of replacements, or None if no data file exists

def _unidecode(string):
    retval = []

    for char in string:
        codepoint = ord(char)

        if codepoint < 0x80:  # Basic ASCII
            retval.append(str(char))
            continue

        if codepoint > 0xeffff:
            continue  # Characters in Private Use Area and above are ignored

        if 0xd800 <= codepoint <= 0xdfff:
            warnings.warn("Surrogate character %r will be ignored. "
                          "You might be using a narrow Python build." % (char,),
                          RuntimeWarning, 2)

        section = codepoint >> 8    # Chop off the last two hex digits
        position = codepoint % 256  # Last two hex digits

        try:
            table = Cache[section]
        except KeyError:
            try:
                mod = __import__('unidecode.x%03x' % (section), globals(), locals(), ['data'])
            except ImportError:
                Cache[section] = None
                continue  # No match: ignore this character and carry on.

            Cache[section] = table = mod.data

        if table and len(table) > position:
            retval.append(table[position])

    return ''.join(retval)
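As the code shows, unidecode takes each character's Unicode code point and looks up the ASCII replacement stored at the corresponding position. The high bits of the code point (section) select a data file, and the low 8 bits (position) index into it. The call mod = __import__('unidecode.x%03x'%(section), globals(), locals(), ['data']) loads one of the data files in the unidecode package, each of which contains a tuple of ASCII replacement strings.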
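To make the section/position arithmetic concrete, here is a minimal sketch that computes where the replacement for one character would be looked up. `locate` is a hypothetical helper written for this note; it only reproduces the arithmetic and the module-name format from the function above and does not import unidecode.

```python
def locate(char):
    """Return (data module name, index) for the lookup of `char`."""
    codepoint = ord(char)
    section = codepoint >> 8    # high bits select the data module
    position = codepoint % 256  # low 8 bits index into that module's tuple
    return "unidecode.x%03x" % section, position


print(locate("我"))  # ('unidecode.x062', 17)
```

For 我 (U+6211) the section is 0x62 and the position is 0x11, so the replacement is entry 17 of the tuple in unidecode.x062.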