<a id="org653e67d"></a>
Outline
<a id="orgaa23837"></a>
Approach
<a id="org5477300"></a>
Chinese Unicode
<a id="org04343d8"></a>
How Unicode and UTF-8 relate
<a id="org6082b9e"></a>
Common special characters
<a id="org536118c"></a>
Filtering special characters
<a id="org228c7a2"></a>
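Approach
There are a great many special characters. After a lot of searching I found no definitive Unicode range for them, and even if one existed it would be hard to guarantee that it covered every case. The problem therefore has to be approached by negation: instead of listing what to remove, the goal is to keep only the characters that the operating system accepts in a file name.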
<a id="org8e4ca1e"></a>
Unicode ranges for Chinese characters
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-left" />
<col class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Character set</th>
<th scope="col" class="org-left">Characters</th>
<th scope="col" class="org-left">Unicode range</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Basic ideographs</td>
<td class="org-left">20902</td>
<td class="org-left">4E00-9FA5</td>
</tr>
<tr>
<td class="org-left">Basic ideographs supplement</td>
<td class="org-left">74</td>
<td class="org-left">9FA6-9FEF</td>
</tr>
<tr>
<td class="org-left">Extension A</td>
<td class="org-left">6582</td>
<td class="org-left">3400-4DB5</td>
</tr>
<tr>
<td class="org-left">Extension B</td>
<td class="org-left">42711</td>
<td class="org-left">20000-2A6D6</td>
</tr>
<tr>
<td class="org-left">Extension C</td>
<td class="org-left">4149</td>
<td class="org-left">2A700-2B734</td>
</tr>
<tr>
<td class="org-left">Extension D</td>
<td class="org-left">222</td>
<td class="org-left">2B740-2B81D</td>
</tr>
<tr>
<td class="org-left">Extension E</td>
<td class="org-left">5762</td>
<td class="org-left">2B820-2CEA1</td>
</tr>
<tr>
<td class="org-left">Extension F</td>
<td class="org-left">7473</td>
<td class="org-left">2CEB0-2EBE0</td>
</tr>
<tr>
<td class="org-left">Kangxi radicals</td>
<td class="org-left">214</td>
<td class="org-left">2F00-2FD5</td>
</tr>
<tr>
<td class="org-left">Radicals supplement</td>
<td class="org-left">115</td>
<td class="org-left">2E80-2EF3</td>
</tr>
<tr>
<td class="org-left">Compatibility ideographs</td>
<td class="org-left">477</td>
<td class="org-left">F900-FAD9</td>
</tr>
<tr>
<td class="org-left">Compatibility supplement</td>
<td class="org-left">542</td>
<td class="org-left">2F800-2FA1D</td>
</tr>
<tr>
<td class="org-left">PUA (GBK) components</td>
<td class="org-left">81</td>
<td class="org-left">E815-E86F</td>
</tr>
<tr>
<td class="org-left">Components extension</td>
<td class="org-left">452</td>
<td class="org-left">E400-E5E8</td>
</tr>
<tr>
<td class="org-left">PUA supplement</td>
<td class="org-left">207</td>
<td class="org-left">E600-E6CF</td>
</tr>
<tr>
<td class="org-left">CJK strokes</td>
<td class="org-left">36</td>
<td class="org-left">31C0-31E3</td>
</tr>
<tr>
<td class="org-left">Ideographic description characters</td>
<td class="org-left">12</td>
<td class="org-left">2FF0-2FFB</td>
</tr>
<tr>
<td class="org-left">Bopomofo</td>
<td class="org-left">43</td>
<td class="org-left">3105-312F</td>
</tr>
<tr>
<td class="org-left">Bopomofo extended</td>
<td class="org-left">22</td>
<td class="org-left">31A0-31BA</td>
</tr>
<tr>
<td class="org-left">〇</td>
<td class="org-left">1</td>
<td class="org-left">3007</td>
</tr>
</tbody>
</table>
Of these, only the basic ideographs set (4E00-9FA5) needs to be considered here.
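Restricting the check to that one set amounts to a single range comparison. The following is a minimal sketch only; `isBasicHan` and the two constant names are hypothetical and do not appear in the original code.

```lua
-- Bounds taken from the first row of the table above (4E00-9FA5).
local BASIC_HAN_FIRST = 0x4E00
local BASIC_HAN_LAST  = 0x9FA5

-- Hypothetical helper: returns true when the code point cp lies inside
-- the basic ideographs range.
local function isBasicHan(cp)
    return cp >= BASIC_HAN_FIRST and cp <= BASIC_HAN_LAST
end

print(isBasicHan(0x4E2D))  -- U+4E2D (中), inside the range
print(isBasicHan(0x3002))  -- U+3002 (ideographic full stop), outside the range
```

Anything in the other rows of the table, such as the extension sets or the radicals, is rejected by this check.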
<a id="org0b2350d"></a>
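Getting the Unicode code point from a character's UTF-8 encoding
The relationship between UTF-8 and Unicode is well documented elsewhere, so it is not repeated here. In short, every Chinese character in the basic range is encoded in UTF-8 as three bytes of the form 1110xxxx 10xxxxxx 10xxxxxx. The 16 x bits hold exactly the two bytes of the Unicode code point, so extracting those 16 bits yields the character's code point.
Lua before 5.3 has no built-in bitwise operators, so arithmetic stands in for them. For a lead byte in 0xE0-0xEF, b1 % 0xe0 gives the same result as b1 & 0x0f (it strips the 1110 prefix); for a continuation byte in 0x80-0xBF, b2 % 0x80 gives the same result as b2 & 0x3f; * 2 ^ 12 is a left shift by 12 bits, and so on.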
local b1 = string.byte(str, curIndex)
local b2 = string.byte(str, curIndex + 1)
local b3 = string.byte(str, curIndex + 2)
local unic = (b1 % 0xe0) * 2 ^ 12 + (b2 % 0x80) * 2 ^ 6 + (b3 % 0x80);
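As a worked example of the arithmetic above, the sketch below decodes the character 中, whose UTF-8 bytes are E4 B8 AD. `decode3` is a hypothetical helper name, and the multipliers 4096 and 64 are simply 2 ^ 12 and 2 ^ 6 written out; it assumes the input really is a well-formed three-byte sequence.

```lua
-- Hypothetical helper: decodes the three-byte UTF-8 sequence that starts
-- at index i of str and returns its Unicode code point.
local function decode3(str, i)
    local b1, b2, b3 = string.byte(str, i, i + 2)
    -- b1 % 0xE0 keeps the low 4 bits of a lead byte in 0xE0..0xEF;
    -- b % 0x80 keeps the low 6 bits of a continuation byte in 0x80..0xBF.
    return (b1 % 0xE0) * 4096 + (b2 % 0x80) * 64 + (b3 % 0x80)
end

-- "\228\184\173" is the byte sequence E4 B8 AD, i.e. the character 中.
local cp = decode3("\228\184\173", 1)
print(string.format("U+%04X", cp))  --> U+4E2D
```

Step by step: 0xE4 % 0xE0 = 4, 0xB8 % 0x80 = 56 and 0xAD % 0x80 = 45, so the result is 4 * 4096 + 56 * 64 + 45 = 0x4E2D.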
<a id="orgd97edca"></a>
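Special characters to filter out
- ASCII characters that Windows does not allow in file names, plus whitespace, matched by the Lua pattern [\\\\/:*?\"<>|%s+ ] (inside the set, + is a literal, so plus signs are removed as well)
- Characters whose UTF-8 encoding is two bytes long
- Three-byte characters outside the basic ideographs range 4E00-9FA5
- Characters whose UTF-8 encoding is four bytes or longer
The special characters on this page can be used for testing: https://wenku.baidu.com/view/fddf6408844769eae009ed14.html?re=view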
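The categories above are distinguished by the first byte of each character alone. The following is a sketch of that classification, assuming well-formed UTF-8 input; `utf8SeqLen` is a hypothetical helper name rather than part of the original code.

```lua
-- Hypothetical helper: returns the length in bytes of the UTF-8 sequence
-- introduced by lead byte b, or nil when b cannot start a sequence
-- (a continuation byte or a value that is invalid as a lead byte).
local function utf8SeqLen(b)
    if b <= 0x7F then
        return 1                            -- 0xxxxxxx: ASCII
    elseif b >= 0xC0 and b <= 0xDF then
        return 2                            -- 110xxxxx
    elseif b >= 0xE0 and b <= 0xEF then
        return 3                            -- 1110xxxx
    elseif b >= 0xF0 and b <= 0xF7 then
        return 4                            -- 11110xxx
    end
    return nil
end

print(utf8SeqLen(string.byte("a")))                 --> 1
print(utf8SeqLen(string.byte("\228\184\173")))      --> 3  (中)
print(utf8SeqLen(string.byte("\240\159\152\128")))  --> 4  (U+1F600, an emoji)
```

Only the three-byte case needs a closer look at the code point; two-byte and four-byte sequences can be skipped as soon as the lead byte has been seen.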
<a id="orgbda98be"></a>
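Implementation
-- Filter out special characters, keeping ASCII and basic Chinese ideographs.
-- logger is assumed to be defined elsewhere in the project.
function filterInvalidChars(str)
    local result = ''
    local curIndex = 1
    -- Check each character in turn and append it to result if it is acceptable.
    -- A while loop also handles the empty string and the final character.
    while curIndex <= #str do
        local curByte = string.byte(str, curIndex)
        if curByte > 0 and curByte <= 127 then
            -- One-byte (ASCII) character: keep it for the final gsub pass
            result = result..string.sub(str, curIndex, curIndex)
            curIndex = curIndex + 1
        elseif curByte >= 192 and curByte <= 223 then
            -- Two-byte character: drop it
            curIndex = curIndex + 2
        elseif curByte >= 224 and curByte <= 239 then
            -- Three-byte character: keep it only if it is a basic ideograph
            local b1 = curByte
            local b2 = string.byte(str, curIndex + 1)
            local b3 = string.byte(str, curIndex + 2)
            -- b2 or b3 is nil when the input ends in the middle of a sequence
            if b2 and b3 then
                local unic = (b1 % 0xe0) * 2 ^ 12 + (b2 % 0x80) * 2 ^ 6 + (b3 % 0x80)
                if unic >= 0x4e00 and unic <= 0x9FA5 then
                    result = result..string.sub(str, curIndex, curIndex + 2)
                end
            end
            curIndex = curIndex + 3
        elseif curByte >= 240 and curByte <= 247 then
            -- Four-byte character: drop it
            curIndex = curIndex + 4
        else
            -- Unexpected lead byte: log it and return the input unchanged
            logger:error('filter invalid chars error: '..str)
            return str
        end
    end
    -- Remove the ASCII characters matched by the pattern; the parentheses
    -- discard the substitution count that string.gsub also returns
    return (string.gsub(result, '[\\\\/:*?\"<>|%s+ ]', ''))
end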
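To see what the final substitution step does on its own, the sketch below applies the same pattern to a made-up file name. The sample string is purely illustrative. Note that `string.gsub` returns two values: the resulting string and the number of substitutions made.

```lua
-- The same pattern the filter uses for its last pass. Inside a set, the
-- characters * ? + are literals, so a plus sign is removed as well.
local pattern = '[\\\\/:*?\"<>|%s+ ]'

-- Illustrative input only; it is not taken from the original post.
local cleaned, count = string.gsub('a/b:c d+e?.txt', pattern, '')
print(cleaned)  --> abcde.txt
print(count)    --> 5
```

The five removed characters are the slash, the colon, the space, the plus sign and the question mark; the dot is not in the set and survives.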