10. Writing a BOM to a file correctly

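0. What is a BOM (byte order mark)?

A BOM can be thought of as a marker that identifies a Unicode encoding. The BOM character is \uFEFF (U+FEFF), and it is encoded into a different byte sequence under each encoding, as shown in the figure below: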

[Figure: BOM encoding table]
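The byte sequences can also be reproduced directly by encoding U+FEFF with each charset. The following is a minimal sketch; the class name BomTable and the helper bomHex are made up for illustration, and it assumes the JDK in use ships the UTF-32 charsets (OpenJDK does).

```java
import java.nio.charset.Charset;

public class BomTable {
    /** Encodes U+FEFF with the named charset and returns the bytes as hex, e.g. "EF BB BF". */
    static String bomHex(String charsetName) {
        // The BE/LE charset variants do not add a BOM of their own,
        // so the result is exactly the encoding of the single char U+FEFF.
        byte[] bytes = "\uFEFF".getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE"};
        for (String name : names) {
            System.out.println(name + " -> " + bomHex(name));
        }
    }
}
```

The explicit BE/LE charset names are used on purpose: the plain "UTF-16" charset in Java prepends its own BOM when encoding, which would double the mark.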

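1. What the BOM is for

  • Indicates the byte order, big-endian or little-endian (relevant to 16-bit and 32-bit encodings)
  • Indicates that the text stream is Unicode-encoded
  • Indicates which Unicode encoding is in use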
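These three roles are what a reader relies on when sniffing an encoding from the first bytes of a stream. A minimal sketch follows, assuming the five common BOMs only; BomDetector, detect and startsWith are hypothetical names, not part of any library.

```java
public class BomDetector {
    /** Returns the encoding implied by a leading BOM, or null if no BOM is present. */
    static String detect(byte[] data) {
        // The UTF-32 checks come first: the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE).
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xff) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69};
        System.out.println(detect(sample));
    }
}
```

One caveat: a UTF-16LE stream whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by its first four bytes, so the result is a best guess rather than a proof.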

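2. The BOM byte sequence in UTF-8

  • String.valueOf('\ufeff').getBytes("utf-8") returns the UTF-8 encoding of the BOM: 0xef, 0xbb, 0xbf
  • If a string begins with the character \ufeff, String#getBytes("utf-8") produces a UTF-8 byte array that starts with the BOM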
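Both points can be checked with a few lines of code. This is a small sketch; Utf8BomBytes and encodeWithBom are illustrative names, and StandardCharsets.UTF_8 is used instead of the "utf-8" string only to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BomBytes {
    /** Prepends U+FEFF to the text and encodes the result as UTF-8. */
    static byte[] encodeWithBom(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("hi");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        // The first three bytes are the BOM, followed by the bytes of the text.
        System.out.println(hex.toString().trim());
    }
}
```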

3. Examples of writing a BOM in Java:

1) Using PrintStream#write(int b): this method writes a single byte, namely the low-order byte of its argument

  • Source:
    /**
     * Writes the specified byte to this stream.  If the byte is a newline and
     * automatic flushing is enabled then the <code>flush</code> method will be
     * invoked.
     *
     * <p> Note that the byte is written as given; to write a character that
     * will be translated according to the platform's default character
     * encoding, use the <code>print(char)</code> or <code>println(char)</code>
     * methods.
     *
     * @param  b  The byte to be written
     * @see #print(char)
     * @see #println(char)
     */
    public void write(int b) {
        try {
            synchronized (this) {
                ensureOpen();
                out.write(b);
                if ((b == '\n') && autoFlush)
                    out.flush();
            }
        }
        catch (InterruptedIOException x) {
            Thread.currentThread().interrupt();
        }
        catch (IOException x) {
            trouble = true;
        }
    }
  • Demo:
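// write(int) keeps only the low-order byte of its argument,
// so each char below contributes just its low 8 bits.
PrintStream out = System.out;
out.write('\ufeef'); // emits 0xef
out.write('\ufebb'); // emits 0xbb
out.write('\ufebf'); // emits 0xbf

// Equivalent and clearer alternative (use one form, not both):
// pass the byte values directly.
out.write(0xef); // emits 0xef
out.write(0xbb); // emits 0xbb
out.write(0xbf); // emits 0xbf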
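To confirm that the two forms emit the same bytes, the output can be captured in memory instead of going to System.out. A minimal sketch, where LowByteWrite and writeInts are hypothetical helper names:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Arrays;

public class LowByteWrite {
    /** Writes each value with PrintStream#write(int) and returns the bytes that were emitted. */
    static byte[] writeInts(int... values) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink);
        for (int v : values) {
            out.write(v); // only the low-order 8 bits of v reach the sink
        }
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fromChars = writeInts('\ufeef', '\ufebb', '\ufebf');
        byte[] fromBytes = writeInts(0xef, 0xbb, 0xbf);
        System.out.println(Arrays.equals(fromChars, fromBytes));
    }
}
```

Passing the byte values directly is probably the more readable choice, since the char form relies on the high byte being silently discarded.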

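2) Using PrintStream#print(char c): this method writes a char, which is translated into bytes according to the stream's character encoding (the platform default unless one was specified), so the result is 0xef, 0xbb, 0xbf only when that encoding is UTF-8

  • Source
    /**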
     * Prints a character.  The character is translated into one or more bytes
     * according to the platform's default character encoding, and these bytes
     * are written in exactly the manner of the
     * <code>{@link #write(int)}</code> method.
     *
     * @param      c   The <code>char</code> to be printed
     */
    public void print(char c) {
        write(String.valueOf(c));
    }
  • Demo
PrintStream out = System.out;
out.print('\ufeff'); // encoded with the stream's charset; 0xef 0xbb 0xbf when that charset is UTF-8
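When the target is a file rather than System.out, the encoding can be fixed explicitly so the result does not depend on the platform default. The following is a sketch under that assumption; BomFileWriter and writeWithBom are illustrative names, and the temporary file exists only for the demonstration.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.file.Files;

public class BomFileWriter {
    /** Writes the text to the file as UTF-8, preceded by a BOM. */
    static void writeWithBom(File file, String text) throws IOException {
        // Naming the charset makes print(char) encode U+FEFF as EF BB BF
        // regardless of the platform default encoding.
        try (PrintStream out = new PrintStream(file, "UTF-8")) {
            out.print('\ufeff');
            out.print(text);
            // PrintStream swallows IOExceptions, so check its error flag explicitly.
            if (out.checkError()) {
                throw new IOException("failed to write " + file);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("bom-demo", ".txt");
        tmp.deleteOnExit();
        writeWithBom(tmp, "hello");
        byte[] bytes = Files.readAllBytes(tmp.toPath());
        System.out.printf("%02x %02x %02x%n",
                bytes[0] & 0xff, bytes[1] & 0xff, bytes[2] & 0xff);
    }
}
```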

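3) Using StringWriter#write(int c): this writes a char, like PrintStream#print. No encoding happens at this point; the BOM character only becomes bytes when the resulting string is later encoded.

  • Source
    /**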
     * Write a single character.
     */
    public void write(int c) {
        buf.append((char) c);
    }
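  • Demo

A possible demo for this case, written as a sketch: StringWriterBom and buildWithBom are hypothetical names, and the final encoding step to UTF-8 is an assumption about how the string would be used.

```java
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

public class StringWriterBom {
    /** Builds a string whose first char is U+FEFF, followed by the given text. */
    static String buildWithBom(String text) {
        StringWriter writer = new StringWriter();
        writer.write(0xfeff); // appended as the char U+FEFF, not as a byte
        writer.write(text);
        return writer.toString();
    }

    public static void main(String[] args) {
        String s = buildWithBom("hi");
        // The BOM turns into EF BB BF only when the string is encoded as UTF-8.
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02x %02x %02x%n",
                bytes[0] & 0xff, bytes[1] & 0xff, bytes[2] & 0xff);
    }
}
```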
