Intermediate 12 - Strings: Principles and Practice

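Writing is a hallmark of human civilization, and in Java writing is represented by strings. String is the most important reference type in Java.
Much of what the internet does comes down to processing strings, so handling strings well is a basic requirement for any language used on web servers: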

PHP
Python
Ruby
Perl
Java

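1. String immutability

Strings are immutable, which makes them safe to share between threads and safe to store. The drawback is that every modification has to create a new String object.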

A string is a container of characters:

String str = "abc";

// is equivalent to:
char[] data = {'a', 'b', 'c'};
String str = new String(data);

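Why strings are immutable:

The String class is final, so it cannot be subclassed
None of String's public methods modify its internal char[] container (a byte[] since Java 9)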
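
The two points above can be checked with a small sketch. The class name ImmutabilityDemo is made up for illustration; the String methods it calls are standard. Each "modifying" call returns a new String and leaves the original as it was.

```java
// Hypothetical demo class; only the String methods are real API.
class ImmutabilityDemo {
    // Returns true if the original string is untouched after "modifying" calls.
    static boolean originalUnchanged() {
        String original = "abc";
        String upper = original.toUpperCase();        // new String "ABC"
        String joined = original.concat("def");       // new String "abcdef"
        String replaced = original.replace('a', 'x'); // new String "xbc"
        return original.equals("abc")
                && upper.equals("ABC")
                && joined.equals("abcdef")
                && replaced.equals("xbc");
    }

    public static void main(String[] args) {
        System.out.println(originalUnchanged()); // true
    }
}
```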

A new String object is always created:

String s = "0";

for (int i = 1; i < 10; i++) {
    s = s + i; // IDEA warns that concatenating like this inside a for loop is inefficient, because each iteration creates a new String object
}

// In IDEA, the alt + enter shortcut converts the String into a StringBuilder
StringBuilder s = new StringBuilder("0");

for (int i = 1; i < 10; i++) {
    s.append(i);
}

What if you really do want to modify a string?

2. StringBuilder and StringBuffer

2.1 StringBuilder

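A mutable sequence of characters. Its API is compatible with StringBuffer, but it does not guarantee synchronization (it is not thread-safe). In single-threaded code, StringBuilder is all you need.
Its contents can keep changing through append, and it does not override equals or hashCode, so it is not a safe HashMap key: a lookup with a different builder holding the same characters misses. For reference, the hashCode contract that a reliable key type follows is:

The same object must always return the same hashCode
If equals returns true for two objects, they must return the same hashCode
Two unequal objects may still return the same hashCode

String is immutable and derives equals and hashCode from its contents, so it is a safe key.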
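
A minimal sketch of the difference, assuming nothing beyond java.util.HashMap. The class and method names are hypothetical. StringBuilder inherits identity-based equals and hashCode from Object, so a second builder with the same characters does not find the entry, whereas an equal String does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class comparing StringBuilder and String as map keys.
class MapKeyDemo {
    // Looks up with a different StringBuilder that holds the same characters.
    static Integer lookupByStringBuilder() {
        Map<StringBuilder, Integer> map = new HashMap<>();
        map.put(new StringBuilder("key"), 1);
        return map.get(new StringBuilder("key")); // identity-based equals: no match
    }

    // Looks up with a different String object that holds the same characters.
    static Integer lookupByString() {
        Map<String, Integer> map = new HashMap<>();
        map.put("key", 1);
        return map.get(new String("key")); // content-based equals: match
    }

    public static void main(String[] args) {
        System.out.println(lookupByStringBuilder()); // null
        System.out.println(lookupByString());        // 1
    }
}
```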

2.2 StringBuffer

A thread-safe mutable sequence of characters. Synchronization has a cost, so it is slower than StringBuilder.
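
As a rough illustration of what the synchronization buys, the sketch below appends from two threads to one StringBuffer. The class name and the thread setup are assumptions made for the example. Because append is synchronized, no append is lost and the final length is the sum of both threads' appends. Running the same code against a StringBuilder gives no such guarantee.

```java
// Hypothetical demo class; two threads share one StringBuffer.
class StringBufferDemo {
    // Appends perThread characters from each of two threads and returns the final length.
    static int appendFromTwoThreads(int perThread) {
        StringBuffer buffer = new StringBuffer();
        Runnable task = () -> {
            for (int i = 0; i < perThread; i++) {
                buffer.append('x'); // synchronized, so concurrent appends are not lost
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return buffer.length();
    }

    public static void main(String[] args) {
        System.out.println(appendFromTwoThreads(10_000)); // 20000
    }
}
```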

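3. Strings and encodings

Characters in the human world are converted into bytes in the computer world through a mapping. That mapping is the character set, together with its encoding.
If the world had only one character set, garbled text (mojibake) would never occur.

Encoding: characters -> bytes
Decoding: bytes -> characters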
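
A small sketch of both directions, using the standard String.getBytes and new String(byte[], Charset) calls. The class and method names are made up for illustration, and the sample text is 你好 written as escapes so the source file's own encoding does not matter. Decoding with a charset other than the one used for encoding is how garbled text arises.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for encode/decode round trips.
class EncodeDecodeDemo {
    // Encoding: characters -> bytes
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding: bytes -> characters, with the same charset used to encode
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decoding with a mismatched charset, which yields garbled text
    static String decodeWrong(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "\u4F60\u597D"; // U+4F60 U+597D
        byte[] bytes = encode(text);
        System.out.println(bytes.length);                     // 6
        System.out.println(decode(bytes).equals(text));       // true
        System.out.println(decodeWrong(bytes).equals(text));  // false
    }
}
```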

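3.1 The Unicode character set

A Unicode code point fits in an int (4 bytes). GBK stores a Chinese character in 2 bytes, while UTF-8 stores a common Chinese character in 3 bytes.
A code point is a character's number. Unicode characters can be written in several forms. In Java, for example, a character can be written as '\uXXXX' (four hexadecimal digits). See the Character class.

Storing every code point in a fixed 4 bytes is inefficient and wastes a great deal of memory, so the UTF-8 and UTF-16 encoding forms were created to transform Unicode into formats that are more practical to store and transmit.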
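
A short sketch of code points in Java, using only standard String and Character methods. The class and method names are hypothetical. It shows the escape form of 我 (U+6211) and a code point, U+1F60D, that is too large for a single char.

```java
// Hypothetical demo class; the name is illustrative only.
class CodePointDemo {
    // Returns the code point that starts at index 0 of the string.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        char wo = '\u6211'; // escape form of U+6211
        System.out.println(Integer.toHexString(wo)); // 6211

        // U+1F60D (an emoji) built from its code point number
        String emoji = new String(Character.toChars(0x1F60D));
        System.out.println(Integer.toHexString(firstCodePoint(emoji))); // 1f60d
        System.out.println(Character.charCount(firstCodePoint(emoji))); // 2
    }
}
```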

3.2 UTF-16

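This is how Java programs represent characters (char and Character) internally.

UTF-16 is one encoding scheme for the Unicode character set. It maps Unicode's abstract code points to sequences of 16-bit integers (code units) for storage or transmission. A Unicode code point needs one or two 16-bit code units, so UTF-16 is a variable-length encoding.

UTF stands for "Unicode/UCS Transformation Format", that is, a format into which Unicode characters are transformed.

In UTF-16, common characters take 2 bytes (the BMP, or Basic Multilingual Plane: one 16-bit code unit), and less common characters take 4 bytes (the supplementary planes: two 16-bit code units).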
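
The one-or-two code unit rule can be observed with Character.toChars and the UTF_16BE charset (the big-endian variant is used here so that no byte order mark is added to the output). This is a sketch. The class and method names are assumptions for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for UTF-16 code units and surrogate pairs.
class Utf16Demo {
    // Number of 16-bit code units (Java chars) needed for one code point.
    static int codeUnits(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    // Number of bytes the code point occupies when encoded as UTF-16 (big-endian, no BOM).
    static int utf16Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(0x4E2D));   // 1 (BMP character)
        System.out.println(utf16Bytes(0x4E2D));  // 2
        System.out.println(codeUnits(0x1F60D));  // 2 (supplementary plane)
        System.out.println(utf16Bytes(0x1F60D)); // 4

        char[] pair = Character.toChars(0x1F60D);
        System.out.println(Integer.toHexString(pair[0])); // d83d (high surrogate)
        System.out.println(Integer.toHexString(pair[1])); // de0d (low surrogate)
    }
}
```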

3.3 UTF-8

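The default encoding on Mac and Linux is UTF-8, while the default Chinese encoding on Windows is GBK (Windows predates UTF-8). Unless you have a reason not to, switch all your encodings to UTF-8, which is understood everywhere.

UTF-8 is also a variable-length encoding of Unicode. It encodes every valid code point in the Unicode character set in 1 to 4 bytes, and it is the most widely supported multilingual solution today.

Compared with UTF-16, ASCII characters take only half the space in UTF-8, but some characters take 50% more: CJK characters such as Chinese take 3 bytes in UTF-8 versus 2 bytes in UTF-16.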
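
A quick size comparison between the two encodings, as a sketch with hypothetical class and method names. The sample characters are A, 中 (U+4E2D), and the emoji U+1F60D, written as escapes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing encoded sizes in UTF-8 and UTF-16.
class EncodingSizeDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte order mark inflates the count.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";
        String chinese = "\u4E2D";     // U+4E2D
        String emoji = "\uD83D\uDE0D"; // U+1F60D as a surrogate pair

        System.out.println(utf8Length(ascii) + " vs " + utf16Length(ascii));     // 1 vs 2
        System.out.println(utf8Length(chinese) + " vs " + utf16Length(chinese)); // 3 vs 2
        System.out.println(utf8Length(emoji) + " vs " + utf16Length(emoji));     // 4 vs 4
    }
}
```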

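3.4 How strings relate to byte[] and char[]

When it is stored or transmitted, a string is ultimately a byte array, byte[].
Java's char can be thought of as a wider data type layered over byte[] to make computation and expression easier: in Java, one char conveniently corresponds to one common Unicode character. Actual storage and network transmission use byte[], that is, a byte stream. The byte is the universal unit in computing, whereas char can mean different things in different languages.

A Java char is a fixed 2 bytes and represents Unicode code points 0 to 65535. That is enough for common Chinese and English characters, but not for emoji: their code points exceed 16 bits and do not fit in 2 bytes. They need 4 bytes, that is, 2 Java chars.

So the Java string "晓风ABC😍" has an internal char[] of length 7 (the one emoji needs 2 chars).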
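
The count of 7 can be checked against the number of actual characters with String.codePointCount. This is a sketch with a hypothetical class name. The string is the same one as above, written with escapes (晓 is U+6653, 风 is U+98CE, the emoji is U+1F60D).

```java
// Hypothetical demo class: chars versus code points.
class CharCountDemo {
    // The sample string from the text, written as escapes.
    static final String SAMPLE = "\u6653\u98CEABC\uD83D\uDE0D";

    // Number of UTF-16 code units, which is what length() reports.
    static int charCount() {
        return SAMPLE.length();
    }

    // Number of Unicode code points, which is what a reader would count.
    static int codePointCount() {
        return SAMPLE.codePointCount(0, SAMPLE.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount());      // 7
        System.out.println(codePointCount()); // 6
    }
}
```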

Note, however, that after UTF-8 encoding, 我 and 中 below each become 3 bytes while A becomes 1 byte, so together the UTF-8 encoded byte[] has length 7.

char[] chars = { '我', '中', 'A' };
byte[] bytes = { -26, -120, -111, -28, -72, -83, 65 };

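Why? Isn't a char a fixed 2 bytes? How can it turn into 3 bytes?

In the Unicode character set, the 16-bit code point of 我 is 6211, 中 is 4E2D, and A is 0041. None of them exceeds 16 bits, so 2 bytes are enough to store each code point, and a Java char can represent each one. The difference comes purely from UTF-8's own encoding strategy: code points from U+0800 to U+FFFF are spread across 3 bytes, while ASCII characters take 1 byte.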
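
To make that encoding strategy concrete, here is a hand-written sketch of the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx), checked against the JDK's own encoder. The class and method names are made up. The method handles only the U+0800 to U+FFFF range and does not reject surrogate values, which a real encoder would.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: manual UTF-8 encoding for the 3-byte range only.
class Utf8ManualDemo {
    // Spreads the 16 bits of the code point over 4 + 6 + 6 payload bits.
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("outside the 3-byte UTF-8 range");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110xxxx: top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10xxxxxx: low 6 bits
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeThreeByte(0x6211))); // [-26, -120, -111]
        System.out.println(Arrays.toString(encodeThreeByte(0x4E2D))); // [-28, -72, -83]
        // Same result as the JDK encoder for U+6211
        System.out.println(Arrays.equals(
                encodeThreeByte(0x6211),
                "\u6211".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```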

4. Practice

4.1 Implementing a StringBuilder

package com.github.hcsp.string;

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class MyStringBuilder {

    private char[] value;

    private int count;

    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    public MyStringBuilder() {
        value = new char[16];
    }

    public MyStringBuilder(int capacity) {
        value = new char[capacity];
    }

    private void ensureCapacityInternal(int minimumCapacity) {
        // overflow-conscious code
        if (minimumCapacity - value.length > 0) {
            value = Arrays.copyOf(value,
                    newCapacity(minimumCapacity));
        }
    }

    private int newCapacity(int minCapacity) {
        // overflow-conscious code
        int newCapacity = (value.length << 1) + 2;
        if (newCapacity - minCapacity < 0) {
            newCapacity = minCapacity;
        }
        return (newCapacity <= 0 || MAX_ARRAY_SIZE - newCapacity < 0)
                ? hugeCapacity(minCapacity)
                : newCapacity;
    }

    private int hugeCapacity(int minCapacity) {
        if (Integer.MAX_VALUE - minCapacity < 0) { // overflow
            throw new OutOfMemoryError();
        }
        return Math.max(minCapacity, MAX_ARRAY_SIZE);
    }

    // Append a single character at the end
    public MyStringBuilder append(char ch) {
        ensureCapacityInternal(count + 1);
        value[count++] = ch;
        return this;
    }

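    // Append a string at the end; its data is obtained by decoding the bytes array with the charsetName charset
    // Think about the relationship between bytes and strings (a string is ultimately stored as bytes)
    // and look up the relevant API
    public MyStringBuilder append(byte[] bytes, String charsetName) throws UnsupportedEncodingException {
        // The given byte array was produced by encoding with the specified charset,
        // so decode it with the same charset; the resulting String holds UTF-16 chars whatever the source charset was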
        String string = new String(bytes, charsetName);
        append(string);
        return this;
    }

    // Append a string at the end
    public MyStringBuilder append(String str) {
        if (str == null) {
            return this;
        }
        int len = str.length();
        ensureCapacityInternal(count + len);
        str.getChars(0, len, value, count);
        count += len;
        return this;
    }

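    // Insert ch at position index (everything at index and after shifts right by one)
    public MyStringBuilder insert(int index, char ch) {
        if ((index < 0) || (index > count)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        ensureCapacityInternal(count + 1);
        System.arraycopy(value, index, value, index + 1, count - index);
        value[index] = ch;
        count += 1;
        return this;
    }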

    // Delete the character at index (everything after index shifts left by one)
    public MyStringBuilder deleteCharAt(int index) {
        if ((index < 0) || (index >= count)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        System.arraycopy(value, index + 1, value, index, count - index - 1);
        count--;
        return this;
    }

    public int length() {
        return count;
    }

    @Override
    public String toString() {
        return new String(value, 0, count);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {

        MyStringBuilder sb = new MyStringBuilder();

        for (int i = 0; i < 10; i++) {
            sb.append('a');
        }

        System.out.println(sb.length());

        String str = sb.toString();
        System.out.println(str);
        System.out.println(str.length());
        
        sb.append("今天天气不错".getBytes("GBK"), "GBK");
        System.out.println(sb.toString());

        sb.insert(2, '哈');
        System.out.println(sb.toString());

        sb.deleteCharAt(2);
        System.out.println(sb.toString());

    }
}
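
The growth policy in newCapacity above (double the current length and add 2, or jump straight to the requested minimum if that is larger) can be traced in isolation. The sketch below is a simplified stand-in with made-up names. It deliberately leaves out the overflow handling that the real method performs.

```java
// Hypothetical demo class: simplified capacity growth without overflow checks.
class GrowthDemo {
    // Next capacity for a buffer of the given current size that must hold at least min chars.
    static int nextCapacity(int current, int min) {
        int doubled = (current << 1) + 2; // same formula as newCapacity
        return Math.max(doubled, min);
    }

    public static void main(String[] args) {
        int capacity = 16; // default initial capacity
        for (int i = 0; i < 3; i++) {
            capacity = nextCapacity(capacity, capacity + 1);
            System.out.println(capacity); // 34, then 70, then 142
        }
        System.out.println(nextCapacity(16, 100)); // 100
    }
}
```

Doubling keeps the number of array copies logarithmic in the final length, which is why repeated append calls stay cheap on average.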

4.2 Reading a GBK-encoded file

package com.github.hcsp.string;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class GbkFileReader {
    public static void main(String[] args) throws IOException {
        // Test 1
        File projectDir = new File(System.getProperty("basedir", System.getProperty("user.dir")));
        System.out.println(new GbkFileReader().readFileWithGBK(new File(projectDir, "gbk.txt")));

        // Test 2
        File tmp = File.createTempFile("tmp", "");
        String text = "窗前明月光\n疑似地上霜\n举头望明月\n低头思故乡";
        Files.write(tmp.toPath(), text.getBytes("GBK"));
        System.out.println(new GbkFileReader().readFileWithGBK(tmp));
    }

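    // On the Java platform, char[] and String are both represented in UTF-16.
    // So here we first read the file's raw byte[], and decoding it with GBK maps the bytes back to Unicode code points,
    // which are then stored in Java chars in UTF-16 form.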
    public String readFileWithGBK(File file) throws IOException {
        return new String(Files.readAllBytes(file.toPath()), "GBK");
    }
}
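
As an alternative sketch, the same file could be read line by line with Files.readAllLines and an explicit Charset. The class and method names here are hypothetical, and the code assumes the running JDK ships the GBK charset. Unlike new String(bytes, "GBK"), which substitutes a replacement character for bytes it cannot decode, readAllLines throws an exception on malformed input.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical demo class: reading a GBK file as lines.
class GbkLinesDemo {
    // Assumes the GBK charset is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Decodes the file as GBK and splits it into lines.
    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file, GBK);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes text to a temporary file encoded as GBK and returns its path.
    static Path writeSample(String text) {
        try {
            Path tmp = Files.createTempFile("gbk-demo", ".txt");
            Files.write(tmp, text.getBytes(GBK));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample("\u4F60\u597D\nABC"); // U+4F60 U+597D, newline, ABC
        System.out.println(readLines(file).size());   // 2
    }
}
```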

5. Libraries

StringUtils