Extracting ZIP Files in Java Without Garbled Chinese File Names (GBK and UTF-8)


This article is also published on CSDN.

Tool: zip4j

GitHub: zip4j

Version: 2.2.8

Maven:


<dependency>
    <groupId>net.lingala.zip4j</groupId>
    <artifactId>zip4j</artifactId>
    <version>2.2.8</version>
</dependency>

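The current state of ZIP extraction

ZIP archives produced on different platforms with different tools fall into roughly two groups:
1. Archives created on Windows with tools such as WinRAR, HaoZip (好压), KuaiZip (快压), or Baidu Compression (百度压缩)
    Characteristic: file names are stored in GBK (on a Chinese-locale system)
2. Archives created on Linux, macOS, and similar systems
    Characteristic: file names are stored in UTF-8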
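
To make the problem concrete, here is a minimal sketch of what happens when the raw file-name bytes are decoded with the wrong charset. The class name `FileNameEncodingDemo` and the sample name are illustrative only, and the example assumes the JDK in use ships the GBK charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FileNameEncodingDemo {
    /** Decodes the raw file-name bytes of a ZIP header with the given charset. */
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // "\u4e2d\u6587" is the two-character word for "Chinese", escaped so the
        // source compiles regardless of the compiler's source encoding.
        String name = "\u4e2d\u6587.txt";

        byte[] gbkBytes = name.getBytes(gbk);                      // what a GBK archiver stores
        byte[] utf8Bytes = name.getBytes(StandardCharsets.UTF_8);  // what a UTF-8 archiver stores

        System.out.println(gbkBytes.length);   // 8: two bytes per Chinese character plus ".txt"
        System.out.println(utf8Bytes.length);  // 10: three bytes per Chinese character plus ".txt"

        // Decoding with the charset that was used for encoding restores the name.
        System.out.println(decode(gbkBytes, gbk).equals(name));                     // true
        // Decoding GBK bytes as UTF-8 yields replacement characters instead.
        System.out.println(decode(gbkBytes, StandardCharsets.UTF_8).equals(name));  // false
    }
}
```

The ZIP entry itself carries only the bytes, so the reader has to know, or guess, which of the two charsets applies.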

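Previous approach

The usual fix is to specify the file-name charset when extracting. The crudest version is:

ZipFile zip = new ZipFile(dest);
// Hard-code GBK, since most users are on Windows anyway.
// Note: setFileNameCharset(String) is the zip4j 1.x API; in 2.x the equivalent is setCharset(Charset).
zip.setFileNameCharset("GBK");
zip.extractAll(Constants.GOODS_ITEM_IMG_PATH);

However, more and more people now work on macOS, so a hard-coded charset no longer meets the need.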

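New approach

Reading the ZIP specification, we find the Info-ZIP Unicode Path Extra Field (0x7075), an extra field that solves our problem. In the author's tests, archivers that store file names in GBK, such as WinRAR and Baidu Compression, also record the UTF-8 version of the file name in this field. Because the field is optional, tools that already store file names in UTF-8, such as the macOS and Deepin archivers, do not populate it.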
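
For orientation, here is a small sketch of how such a record can be located in the raw extra-field bytes of a ZIP header. An extra field is a sequence of records, each consisting of a 2-byte header ID, a 2-byte data size (both little-endian), and the data. The class and method names (`ExtraFieldScanner`, `findRecord`) are hypothetical; zip4j does this parsing internally and exposes the result through `getExtraDataRecords()`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ExtraFieldScanner {
    /** Header ID of the Info-ZIP Unicode Path Extra Field. */
    static final int UNICODE_PATH_ID = 0x7075;

    /**
     * Returns the data of the first record with the given header ID,
     * or null if no such record exists or the extra field is truncated.
     */
    static byte[] findRecord(byte[] extraField, int headerId) {
        ByteBuffer buf = ByteBuffer.wrap(extraField).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int id = buf.getShort() & 0xFFFF;    // 2-byte header ID
            int size = buf.getShort() & 0xFFFF;  // 2-byte data size
            if (size > buf.remaining()) {
                return null;                     // record claims more data than is present
            }
            if (id == headerId) {
                byte[] data = new byte[size];
                buf.get(data);
                return data;
            }
            buf.position(buf.position() + size); // skip records we are not interested in
        }
        return null;
    }
}
```

A call such as `ExtraFieldScanner.findRecord(extra, ExtraFieldScanner.UNICODE_PATH_ID)` returns the bytes that follow the record's ID and size, which is the same data the solution code obtains from `ExtraDataRecord.getData()`.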

Solution code

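    // Imports needed by the two methods below (zip4j 2.x and Spring):
    // import java.io.File;
    // import java.nio.ByteBuffer;
    // import java.nio.charset.StandardCharsets;
    // import net.lingala.zip4j.ZipFile;
    // import net.lingala.zip4j.exception.ZipException;
    // import net.lingala.zip4j.model.ExtraDataRecord;
    // import net.lingala.zip4j.model.FileHeader;
    // import org.springframework.web.multipart.MultipartFile;

    String extractAll(MultipartFile file) throws Exception {
        String path = RECEIVABLE_SHEET_PATH;
        File dir = new File(path);
        // mkdirs() returns false when the directory already exists, so check exists() first
        if (!dir.exists() && !dir.mkdirs()) {
            return "Upload failed: could not create the temporary directory";
        }
        File dest = new File(path + "/" + file.getOriginalFilename());
        file.transferTo(dest);

        /* Extract */
        try {
            ZipFile zip = new ZipFile(dest);
            // Charset used for entries that carry no Unicode Path extra field
            zip.setCharset(StandardCharsets.UTF_8);
            System.out.println("begin unpack zip file....");

            for (FileHeader header : zip.getFileHeaders()) {
                // Prefer the UTF-8 name from the extra field when it is present
                String extractedFile = getFileNameFromExtraData(header);
                try {
                    zip.extractFile(header, path, extractedFile);
                    System.out.println("Extracted: " + extractedFile);
                } catch (ZipException e) {
                    // Log the failure and continue with the remaining entries
                    System.out.println("Failed to extract: " + extractedFile);
                    e.printStackTrace();
                }
            }
            System.out.println("unpack zip file success");
        } catch (ZipException e) {
            e.printStackTrace();
            return "Extraction failed";
        }
        return "success";
    }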
     
    public static String getFileNameFromExtraData(FileHeader fileHeader) {
        if (fileHeader.getExtraDataRecords() != null) {
            for (ExtraDataRecord extraDataRecord : fileHeader.getExtraDataRecords()) {
                // 0x7075 = Info-ZIP Unicode Path Extra Field
                if (extraDataRecord.getHeader() != 0x7075) {
                    continue;
                }
                byte[] bytes = extraDataRecord.getData();
                // Layout: Version (1 byte) + NameCRC32 (4 bytes) + UnicodeName (UTF-8)
                if (bytes == null || bytes.length <= 5) {
                    continue; // truncated or empty field, ignore it
                }
                ByteBuffer buffer = ByteBuffer.wrap(bytes);
                byte version = buffer.get();
                if (version != 1) {
                    continue; // per the spec, do not use the field if the version is not recognized
                }
                buffer.getInt(); // skip NameCRC32; it is not verified here
                System.out.println("Using: fileHeader.getExtraDataRecords()");
                return new String(bytes, 5, buffer.remaining(), StandardCharsets.UTF_8);
            }
        }
        System.out.println("Using: fileHeader.getFileName()");
        return fileHeader.getFileName();
    }
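
The helper above reads the NameCRC32 value but never checks it, whereas the specification quoted later in this article says the field should be ignored when the checksum does not match the header file name. The following is a sketch of that check in plain Java, not part of the original solution. It assumes the undecoded file-name bytes from the header (`rawHeaderName`) are available; how to obtain them from zip4j is not covered here. The class `UnicodePathField` and its `build` helper are hypothetical and exist only to make the example self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class UnicodePathField {
    /**
     * Parses the data of a 0x7075 record. Returns the UTF-8 name, or null when the
     * field is truncated, has an unknown version, or its NameCRC32 does not match
     * the CRC32 of the raw header file name.
     */
    static String parse(byte[] data, byte[] rawHeaderName) {
        if (data == null || data.length < 5) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.get() != 1) {
            return null;                                // unknown version
        }
        long storedCrc = buf.getInt() & 0xFFFFFFFFL;    // NameCRC32, little-endian
        CRC32 crc = new CRC32();
        crc.update(rawHeaderName);
        if (crc.getValue() != storedCrc) {
            return null;                                // header name changed after the field was written
        }
        return new String(data, 5, data.length - 5, StandardCharsets.UTF_8);
    }

    /** Builds record data for the demo: version 1, CRC32 of the raw name, UTF-8 name. */
    static byte[] build(byte[] rawHeaderName, String unicodeName) {
        CRC32 crc = new CRC32();
        crc.update(rawHeaderName);
        byte[] utf8 = unicodeName.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(5 + utf8.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);
        buf.putInt((int) crc.getValue());
        buf.put(utf8);
        return buf.array();
    }
}
```

When `parse` returns null, the caller would fall back to the header file name, mirroring the fallback to `fileHeader.getFileName()` in the helper above.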

ZIP specification reference

Third party mappings commonly used are:

       0x07c8        Macintosh
       0x2605        ZipIt Macintosh
       0x2705        ZipIt Macintosh 1.3.5+
       0x2805        ZipIt Macintosh 1.3.5+
       0x334d        Info-ZIP Macintosh
       0x4341        Acorn/SparkFS
       0x4453        Windows NT security descriptor (binary ACL)
       0x4704        VM/CMS
       0x470f        MVS
       0x4b46        FWKCS MD5 (see below)
       0x4c41        OS/2 access control list (text ACL)
       0x4d49        Info-ZIP OpenVMS
       0x4f4c        Xceed original location extra field
       0x5356        AOS/VS (ACL)
       0x5455        extended timestamp
       0x554e        Xceed unicode extra field
       0x5855        Info-ZIP UNIX (original, also OS/2, NT, etc)
       0x6375        Info-ZIP Unicode Comment Extra Field
       0x6542        BeOS/BeBox
       0x7075        Info-ZIP Unicode Path Extra Field
       0x756e        ASi UNIX
       0x7855        Info-ZIP UNIX (new)
       0xa220        Microsoft Open Packaging Growth Hint
       0xfd4a        SMS/QDOS

-Info-ZIP Unicode Path Extra Field (0x7075):

       Stores the UTF-8 version of the file name field as stored in the
       local header and central directory header. (Last Revision 20070912)

               Value         Size        Description
               -----         ----        -----------
       (UPath) 0x7075        Short       tag for this extra block type ("up")
               TSize         Short       total data size for this block
               Version       1 byte      version of this extra field, currently 1
               NameCRC32     4 bytes     File Name Field CRC32 Checksum
               UnicodeName   Variable    UTF-8 version of the entry File Name

       Currently Version is set to the number 1.  If there is a need
       to change this field, the version will be incremented.  Changes
       may not be backward compatible so this extra field should not be
       used if the version is not recognized.

       The NameCRC32 is the standard zip CRC32 checksum of the File Name
       field in the header.  This is used to verify that the header
       File Name field has not changed since the Unicode Path extra field
       was created.  This can happen if a utility renames the File Name but
       does not update the UTF-8 path extra field.  If the CRC check fails,
       this UTF-8 Path Extra Field should be ignored and the File Name field
       in the header should be used instead.

       The UnicodeName is the UTF-8 version of the contents of the File Name
       field in the header.  As UnicodeName is defined to be UTF-8, no UTF-8
       byte order mark (BOM) is used.  The length of this field is determined
       by subtracting the size of the previous fields from TSize.  If both
       the File Name and Comment fields are UTF-8, the new General Purpose
       Bit Flag, bit 11 (Language encoding flag (EFS)), can be used to
       indicate that both the header File Name and Comment fields are UTF-8
       and, in this case, the Unicode Path and Unicode Comment extra fields
       are not needed and should not be created.  Note that, for backward
       compatibility, bit 11 should only be used if the native character set
       of the paths and comments being zipped up are already in UTF-8. It is
       expected that the same file name storage method, either general
       purpose bit 11 or extra fields, be used in both the Local and Central
       Directory Header for a file.
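
The last paragraph of the excerpt explains why UTF-8 archivers can omit the extra field: they may instead set bit 11 of the general purpose bit flag. As a rough sketch, the flag can be tested as shown below. The class name `GeneralPurposeFlags` is hypothetical, and `readLocalHeaderFlag` assumes it is given the first bytes of a local file header, where the flag sits at offset 6 after the 4-byte signature and the 2-byte "version needed to extract" field.

```java
final class GeneralPurposeFlags {
    /** Bit 11: language encoding flag (EFS), file name and comment are UTF-8. */
    static final int EFS_BIT = 1 << 11; // 0x0800

    /** Returns true when bit 11 of the general purpose bit flag is set. */
    static boolean isUtf8(int generalPurposeBitFlag) {
        return (generalPurposeBitFlag & EFS_BIT) != 0;
    }

    /** Reads the little-endian 2-byte flag at offset 6 of a local file header. */
    static int readLocalHeaderFlag(byte[] localHeader) {
        return (localHeader[6] & 0xFF) | ((localHeader[7] & 0xFF) << 8);
    }
}
```

For an entry with the flag set, the header file name can be decoded as UTF-8 directly. For an entry without it, the Unicode Path extra field, if present, is the more reliable source of the name.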

References

unzip not correct with cjk filename. #45

Garbled chinese character #73
