Character Set

A character set is a mapping between characters and numbers (ASCII codes, code points).

ASCII

American Standard Code for Information Interchange
Maps English letters, digits, and punctuation to ASCII codes (e.g. A -> 0x41, a -> 0x61)
0 - 31 and 127 are control characters
32 - 126 are printable characters
Full range: 0x00 - 0x7F (7 bits)
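
The ranges above can be checked directly in Java, where casting a char to int yields its numeric code. This is only a minimal sketch: the class name `AsciiDemo` and the helpers `isAscii` and `isPrintableAscii` are illustrative names, not part of any library.

```java
public class AsciiDemo {
    // Hypothetical helper: true for any code in the 7-bit ASCII range 0x00-0x7F.
    public static boolean isAscii(int code) {
        return code >= 0x00 && code <= 0x7F;
    }

    // Hypothetical helper: true for the printable range 32-126.
    public static boolean isPrintableAscii(int code) {
        return code >= 32 && code <= 126;
    }

    public static void main(String[] args) {
        // Casting a char to int gives its numeric code.
        System.out.printf("A -> 0x%02X%n", (int) 'A');
        System.out.printf("a -> 0x%02X%n", (int) 'a');
        // 127 (DEL) is a control character, 32 (space) is printable.
        System.out.println(isPrintableAscii(127));
        System.out.println(isPrintableAscii(' '));
    }
}
```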

[Figure: ASCII table]

GB

Chinese national standards, each extending the previous one: GB2312 -> GBK -> GB18030

Unicode

Maps every character in every language to a unique number (a code point)
Code point range: 0x0000-0x10FFFF
17 planes, 65536 code points in each plane
Common Chinese characters are in the BMP, in the CJK Unified Ideographs block (0x4E00-0x9FFF)

Plane	Name	Range
Plane 0	Basic Multilingual Plane (BMP)	0x0000-0xFFFF
Plane 1	Supplementary Multilingual Plane	0x10000-0x1FFFF
Plane 2	Supplementary Ideographic Plane	0x20000-0x2FFFF
Plane 3	Tertiary Ideographic Plane	0x30000-0x3FFFF
Plane 4-13	unassigned	0x40000-0xDFFFF
Plane 14	Supplementary Special-purpose Plane	0xE0000-0xEFFFF
Plane 15	Supplementary Private Use Area-A	0xF0000-0xFFFFF
Plane 16	Supplementary Private Use Area-B	0x100000-0x10FFFF
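
Because every plane holds exactly 0x10000 code points, the plane number in the table is simply the code point shifted right by 16 bits. The sketch below illustrates this; `PlaneDemo` and `planeOf` are made-up names for the example, while `Character.isValidCodePoint` and `Character.isBmpCodePoint` are standard JDK methods.

```java
public class PlaneDemo {
    // Each plane holds 0x10000 (65536) code points, so the plane index
    // is the code point shifted right by 16 bits.
    public static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x4E00, 0x20021, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X is in plane %d (BMP: %b)%n",
                    cp, planeOf(cp), Character.isBmpCodePoint(cp));
        }
    }
}
```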

Character Encoding

A character set translates characters into numbers; a character encoding translates those numbers into binary.
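
The distinction can be seen in a few lines of Java: one character has a single code point, but a different number of bytes in each encoding. This is a small sketch; the class `CodePointVsBytes` and the helper `encodedLength` are names invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    // Hypothetical helper: number of bytes str occupies in the given encoding.
    public static int encodedLength(String str, Charset charset) {
        return str.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u4E00"; // CJK ideograph "one", code point 0x4E00
        // Character set: character -> number (code point).
        System.out.printf("code point: U+%04X%n", s.codePointAt(0));
        // Character encoding: the same number -> different byte sequences.
        System.out.println("UTF-8 bytes:    " + encodedLength(s, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes: " + encodedLength(s, StandardCharsets.UTF_16BE));
    }
}
```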

UTF-8

Unicode Transformation Format, 8-bit
Variable-width (1-4 bytes) character encoding
Backward compatible with ASCII
Chinese characters in the BMP fall in 0x0800-0xFFFF, so they are encoded in 3 bytes

Code Point Range	UTF-8 binary
0x00-0x7F	0xxxxxxx
0x80-0x07FF	110xxxxx 10xxxxxx
0x0800-0xFFFF	1110xxxx 10xxxxxx 10xxxxxx
0x10000-0x10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
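
The table above translates almost line by line into code: pick the row by range, then fill the x bits from the code point, six bits per continuation byte. The following is a sketch for illustration only (in practice `String.getBytes` does this); `Utf8Sketch` and `encode` are hypothetical names, and rejecting the surrogate range 0xD800-0xDFFF is an added assumption not shown in the table.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
    // Encodes one code point by filling the x bits of the UTF-8 bit patterns.
    public static byte[] encode(int cp) {
        // Assumption: surrogates (0xD800-0xDFFF) are not encodable on their own.
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp <= 0xFFFF) {   // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xBC, 0x4E00, 0x20021};
        for (int cp : samples) {
            byte[] manual = encode(cp);
            // Cross-check against the JDK's own UTF-8 encoder.
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X manual=%s matches JDK=%b%n",
                    cp, Arrays.toString(manual), Arrays.equals(manual, jdk));
        }
    }
}
```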

UTF-16

Unicode Transformation Format, 16-bit
Variable-width (2 or 4 bytes) character encoding
Characters in the BMP are encoded in 2 bytes
Characters in Planes 1-16 are encoded in 4 bytes (a surrogate pair)
Byte Order Mark (BOM): the code point 0xFEFF written at the start of the stream
- bytes FE FF: big-endian (BE)
- bytes FF FE: little-endian (LE)
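
Byte order and the BOM are easy to observe with the JDK's standard charsets. In the sketch below, `UTF_16BE` and `UTF_16LE` fix the byte order and write no BOM, while the plain `UTF_16` charset encodes big-endian and prepends a big-endian BOM. `BomDemo` and `hex` are names made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Hypothetical helper: formats bytes as space-separated two-digit hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "A";
        // Explicit byte order: no BOM is written.
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
        // Unspecified byte order: Java writes a big-endian BOM first.
        System.out.println("UTF-16:   " + hex(s.getBytes(StandardCharsets.UTF_16)));
    }
}
```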

Plane	Code Point Range	UTF-16 binary
Plane0(BMP)	0x0000-0xFFFF	xxxxxxxx xxxxxxxx
Plane1	0x10000-0x1FFFF	11011000 00xxxxxx 110111xx xxxxxxxx
Plane2	0x20000-0x2FFFF	11011000 01xxxxxx 110111xx xxxxxxxx
...	...	110110pp ppxxxxxx 110111xx xxxxxxxx (pppp = plane number - 1)
Plane15	0xF0000-0xFFFFF	11011011 10xxxxxx 110111xx xxxxxxxx
Plane16	0x100000-0x10FFFF	11011011 11xxxxxx 110111xx xxxxxxxx
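
The bit patterns in the table follow from one rule: subtract 0x10000 from the code point, which leaves a 20-bit value, then put the top 10 bits after the prefix 110110 and the low 10 bits after the prefix 110111. A minimal sketch of that rule, assuming the illustrative names `SurrogateSketch` and `toSurrogates` (the JDK equivalent is `Character.toChars`):

```java
import java.util.Arrays;

public class SurrogateSketch {
    // Splits a supplementary code point (planes 1-16) into a surrogate pair.
    public static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 | (v >> 10));   // 110110 + top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));  // 110111 + low 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int cp = 0x20021;
        char[] pair = toSurrogates(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(cp)));
    }
}
```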

UTF-32

Unicode Transformation Format, 32-bit
Fixed-width (4 bytes) character encoding

Code Point Range	UTF-32 binary
0x0000-0x10FFFF	00000000 000xxxxx xxxxxxxx xxxxxxxx
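
Since UTF-32 stores the code point itself in 4 bytes, encoding amounts to writing one 32-bit integer. The sketch below assumes big-endian byte order, which is what `ByteBuffer` uses by default; `Utf32Sketch` and `encodeBE` are hypothetical names.

```java
import java.nio.ByteBuffer;

public class Utf32Sketch {
    // Writes the code point as one 32-bit big-endian integer.
    public static byte[] encodeBE(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + cp);
        }
        // A new ByteBuffer is big-endian unless its order is changed.
        return ByteBuffer.allocate(4).putInt(cp).array();
    }

    public static void main(String[] args) {
        for (byte b : encodeBE(0x20021)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```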

Practice in Java

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CharacterTest {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final Charset UTF16_BE = Charset.forName("UTF-16BE");
    private static final Charset UTF32 = Charset.forName("UTF-32");
    private static final List<Charset> charsets = new ArrayList<>();

    static {
        charsets.add(UTF8);
        charsets.add(UTF16_BE);
        charsets.add(UTF32);
    }

    public static void main(String[] args) {
        printCharacter("A");
        printCharacter("¼");
        printCharacter("一");
        printCharacter("𠀡");
    }

    /**
     * Prints the code points of str, then the bytes of str encoded in
     * UTF-8, UTF-16BE and UTF-32, in hexadecimal and in binary.
     *
     * @param str the string to inspect
     */
    private static void printCharacter(String str) {
        str.codePoints().forEach((s) ->
                System.out.format("%10s%40s\n", "Code Point", "0x " + Integer.toHexString(s).toUpperCase() + " "));

        charsets.forEach((charset -> {
            ByteBuffer byteBuffer = charset.encode(str);
            System.out.format("%10s%40s\n", charset.name(), byteBufferToHexString(byteBuffer));
            System.out.format("%10s%40s\n", charset.name(), byteBufferToBinaryString(byteBuffer));

        }));
        System.out.println();
    }

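    /**
     * Converts the bytes of a ByteBuffer to a hexadecimal string.
     *
     * @param byteBuffer the buffer to read; it is rewound first
     * @return the bytes as space-separated two-digit hex values, prefixed with "0x "
     */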
    private static String byteBufferToHexString(ByteBuffer byteBuffer) {
        StringBuilder hexString = new StringBuilder("0x ");
        byteBuffer.rewind();

        while (byteBuffer.hasRemaining()) {
            int i = Byte.toUnsignedInt(byteBuffer.get());
            hexString.append(padZeros(Integer.toHexString(i), 2));
        }
        return hexString.toString();
    }


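    /**
     * Converts the bytes of a ByteBuffer to a binary string.
     *
     * @param byteBuffer the buffer to read; it is rewound first
     * @return the bytes as space-separated eight-digit binary values, prefixed with "0b "
     */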
    private static String byteBufferToBinaryString(ByteBuffer byteBuffer) {
        StringBuilder binaryString = new StringBuilder("0b ");
        byteBuffer.rewind();

        while (byteBuffer.hasRemaining()) {
            int i = Byte.toUnsignedInt(byteBuffer.get());
            binaryString.append(padZeros(Integer.toBinaryString(i), 8));
        }
        return binaryString.toString();
    }

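    /**
     * Left-pads str with zeros to a width of len, upper-cases it,
     * and appends a trailing space.
     *
     * @param str the digits to pad
     * @param len the minimum width before the trailing space
     * @return the padded string followed by a single space
     */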
    private static String padZeros(String str, int len) {
        int numOfZeros = len - str.length();
        StringBuilder stringBuilder = new StringBuilder();
        for (int i = 0; i < numOfZeros; i++) {
            stringBuilder.append("0");
        }
        stringBuilder.append(str.toUpperCase()).append(" ");
        return stringBuilder.toString();
    }
}

Characters encoded in UTF-8/16/32

Character	Code Point	UTF-8	UTF-16BE	UTF-32
A	0x41	01000001 0x41	00000000 01000001 0x00 41	00000000 00000000 00000000 01000001 0x00 00 00 41
¼	0xBC	11000010 10111100 0xC2 BC	00000000 10111100 0x00 BC	00000000 00000000 00000000 10111100 0x00 00 00 BC
一	0x4E00	11100100 10111000 10000000 0xE4 B8 80	01001110 00000000 0x4E 00	00000000 00000000 01001110 00000000 0x00 00 4E 00
𠀡	0x20021	11110000 10100000 10000000 10100001 0xF0 A0 80 A1	11011000 01000000 11011100 00100001 0xD8 40 DC 21	00000000 00000010 00000000 00100001 0x00 02 00 21
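
Going the other way, the bytes in the last row decode back to the same code point. The sketch below also shows a consequence of UTF-16 inside Java itself: strings are sequences of 16-bit code units, so a character outside the BMP has length 2 but counts as one code point. `DecodeDemo` and `firstCodePoint` are names invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decodes bytes with the given charset and returns the first code point.
    public static int firstCodePoint(byte[] bytes, Charset charset) {
        return new String(bytes, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        // Bytes of code point 0x20021 taken from the table.
        byte[] utf8 = {(byte) 0xF0, (byte) 0xA0, (byte) 0x80, (byte) 0xA1};
        byte[] utf16be = {(byte) 0xD8, 0x40, (byte) 0xDC, 0x21};
        System.out.printf("UTF-8    -> U+%X%n", firstCodePoint(utf8, StandardCharsets.UTF_8));
        System.out.printf("UTF-16BE -> U+%X%n", firstCodePoint(utf16be, StandardCharsets.UTF_16BE));

        // One code point, but two UTF-16 code units in a Java String.
        String s = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.length() + " code units, "
                + s.codePointCount(0, s.length()) + " code point");
    }
}
```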

