class Student (2): Memory Allocation

A problem-driven way of learning

Abstract: The class Student series analyzes one very simple piece of code, question by question, in order to deepen my own understanding of it.

As the title says, here is a very simple piece of code:

class Student {
    int age;
    String name;

    static Student demo() {
        Student xm = new Student();
        xm.age += 10;
        xm.name = "小ming😊";
        return xm;
    }

    public static void main(String[] args) {
        Student.demo();
    }
}

1. When the demo method runs, what memory is allocated, where, and how much?

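A new stack frame is pushed onto the method stack.
The stack frame contains:
- Local variable table, holding xm, a reference to the object on the heap: 8 bytes
  
  A reference is usually one machine word: 4 bytes on a 32-bit machine and 8 bytes on a 64-bit machine (the rest of this discussion assumes a 64-bit machine). A 64-bit address space covers 4G * 4G = 2^64 bytes (16 EiB), far more memory than is normally used, so some JVM implementations compress references and store them in less space. The discussion below ignores this implementation-specific optimization.
- Return address (8 bytes)
- Operand stack
  - Exists only in stack-based virtual machines, e.g. HotSpot and the usual desktop and server JVM implementations (the maximum depth is fixed at compile time and varies with the code: x bytes)
  - Dalvik and ART, used on Android, are register-based and have no operand stack
- A reference to the runtime constant pool (used to resolve the methods and constants the code refers to)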
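A Student object is allocated on the heap. It consists of four parts:
- Object header (a pointer to the class, GC information, lock state and related information) (about 16 bytes)
- field: age, int / primitive type, 4 bytes
- field: name, String / reference type, one machine word (8 bytes)
- Alignment padding (objects are usually aligned to 4 or 8 bytes; with 8-byte alignment the 28 bytes above need 4 bytes of padding to reach 32)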
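The arithmetic behind the object size can be written out as a small sketch. The class name and constants below are made up for illustration, and the numbers are the assumptions used in this post (64-bit VM, 16-byte header, uncompressed 8-byte references, 8-byte alignment), not something the JVM specification guarantees.

```java
// Back-of-the-envelope size estimate for one Student instance.
// Assumed, not guaranteed: 16-byte header, 8-byte references, 8-byte alignment.
final class StudentSizeEstimate {
    static final int HEADER_BYTES = 16;   // class pointer, GC and lock information
    static final int INT_BYTES = 4;       // field: age
    static final int REFERENCE_BYTES = 8; // field: name
    static final int ALIGNMENT = 8;

    // Rounds size up to the next multiple of alignment (alignment must be a power of two).
    static int alignUp(int size, int alignment) {
        return (size + alignment - 1) & -alignment;
    }

    // Header plus fields, before padding.
    static int rawSize() {
        return HEADER_BYTES + INT_BYTES + REFERENCE_BYTES;
    }

    // Size after alignment padding has been added.
    static int alignedSize() {
        return alignUp(rawSize(), ALIGNMENT);
    }

    public static void main(String[] args) {
        System.out.println("raw     = " + rawSize());                   // 28
        System.out.println("aligned = " + alignedSize());               // 32
        System.out.println("padding = " + (alignedSize() - rawSize())); // 4
    }
}
```

A real VM may place the padding between fields rather than at the end, and compressed references change the numbers, so a layout tool such as OpenJDK's JOL is the way to confirm the actual figures on a given JVM.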
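"小ming😊" in the constant pool
- Java strings use UTF-16 (which is why a char is 16 bits), so "小ming" takes 5 * 2 = 10 bytes
- The emoji lies outside the Unicode Basic Multilingual Plane, so it needs two chars, 2 * 2 = 4 bytes (see the Java String Javadoc and the code point API)

That is 14 bytes of character data in total; the String object itself adds its own header and fields on top.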
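The counts can be checked directly. This is a minimal sketch; the class and method names are made up for illustration, and the string is written with Unicode escapes only so that the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

final class StringUnits {
    // Same text as the literal assigned to xm.name in demo().
    static final String NAME = "\u5C0Fming\uD83D\uDE0A";

    // Number of 16-bit char code units.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (what a reader would call characters).
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Bytes of UTF-16 data, encoded without a byte order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(charCount(NAME));      // 7: five BMP chars plus a surrogate pair
        System.out.println(codePointCount(NAME)); // 6
        System.out.println(utf16Bytes(NAME));     // 14
    }
}
```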

Exercise: where does the 10 live?

2. How much memory does the string take? Which encoding is used?

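This was already analyzed above. Java uses UTF-16; a character outside the Basic Multilingual Plane is a single code point represented by two char values (a surrogate pair).
Elsewhere, most systems now use UTF-8 as the default encoding. UTF-8 is a variable-length encoding: its first 128 code points are compatible with ASCII, and a common Chinese character takes three bytes.
Character and Unicode encoding is a fairly complex topic and my understanding of it is shallow, so I will not show off here; interested readers can dig deeper.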
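To make the variable-length point concrete, here is a small sketch that encodes pieces of the same string as UTF-8 and counts the bytes. The class and method names are hypothetical, and Unicode escapes stand in for the non-ASCII characters so the file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Widths {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("m"));            // 1: ASCII
        System.out.println(utf8Bytes("\u5C0F"));       // 3: a common Chinese character
        System.out.println(utf8Bytes("\uD83D\uDE0A")); // 4: the emoji, a supplementary character
        System.out.println(utf8Bytes("\u5C0Fming\uD83D\uDE0A")); // 11 = 3 + 4 + 4
    }
}
```

For this particular string UTF-8 needs 11 bytes against 14 for UTF-16; text that is mostly Chinese would tip the other way, since each such character costs 3 bytes in UTF-8 and 2 in UTF-16.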

3. What does the memory layout on the method stack look like? And on the heap?

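Note: memory layout and memory size are related but distinct concepts. Layout is the richer of the two: it covers, for example, whether the memory is contiguous or scattered and how different regions relate to one another. With the same memory consumption, different layouts can have very different performance.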

Memory layout (the diagram is a bit dated, but it will do)

Structure of a stack frame

Extensions:

1. String's lazy, cached hashCode

    /** Cache the hash code for the string */
    private int hash; // Default to 0

    /**
     * Returns a hash code for this string. The hash code for a
     * {@code String} object is computed as
     * <blockquote><pre>
     * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * </pre></blockquote>
     * using {@code int} arithmetic, where {@code s[i]} is the
     * <i>i</i>th character of the string, {@code n} is the length of
     * the string, and {@code ^} indicates exponentiation.
     * (The hash value of the empty string is zero.)
     *
     * @return  a hash code value for this object.
     */
    public int hashCode() {
        int h = hash;
        final int len = length();
        if (h == 0 && len > 0) {
            for (int i = 0; i < len; i++) {
                h = 31 * h + charAt(i);
            }
            hash = h;
        }
        return h;
    }

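One very interesting point: String's hashCode is quite different from the compute-on-every-call code we usually write. It uses an auxiliary member variable as a cache, and the value is computed lazily.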

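As I understand it, this rests on a few considerations:
1. String objects are immutable, which is the precondition for caching hashCode
2. For very long strings, caching avoids a severe bad case (computing a String's hash is proportional to its length, O(n))
3. Memory is relatively cheap and not a sensitive resource here; the object header already takes 16 bytes, so the cost of one more int is negligible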
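The same pattern can be applied to a class of our own when it is immutable and its hash is expensive to compute. Below is a minimal sketch; Fingerprint is a hypothetical class invented for this example and is not taken from the JDK.

```java
import java.util.Arrays;

// Hypothetical immutable value class that caches its hash code in the style of String.
final class Fingerprint {
    private final byte[] data; // defensively copied, never exposed or modified
    private int hash;          // 0 means "not computed yet"

    Fingerprint(byte[] data) {
        this.data = data.clone();
    }

    @Override
    public int hashCode() {
        int h = hash; // read the field once into a local
        if (h == 0) {
            // O(n) work; repeated only if the real hash happens to be 0.
            h = Arrays.hashCode(data);
            hash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Fingerprint
                && Arrays.equals(data, ((Fingerprint) o).data);
    }
}
```

Because the data never changes, two threads that race on the first call compute the same value, so no lock is needed. If the class were mutable, the cached value could go stale, which is why immutability comes first in the list above.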

2. Unicode, Character, String

String.java

/**
 * <p>A {@code String} represents a string in the UTF-16 format
 * in which <em>supplementary characters</em> are represented by <em>surrogate
 * pairs</em> (see the section <a href="Character.html#unicode">Unicode
 * Character Representations</a> in the {@code Character} class for
 * more information).
 * Index values refer to {@code char} code units, so a supplementary
 * character uses two positions in a {@code String}.
 * <p>The {@code String} class provides methods for dealing with
 * Unicode code points (i.e., characters), in addition to those for
 * dealing with Unicode code units (i.e., {@code char} values).
 */

Character.java

/**
 * <p><a name="BMP">The set of characters from U+0000 to U+FFFF</a> is
 * sometimes referred to as the <em>Basic Multilingual Plane (BMP)</em>.
 * <a name="supplementary">Characters</a> whose code points are greater
 * than U+FFFF are called <em>supplementary character</em>s.  The Java
 * platform uses the UTF-16 representation in {@code char} arrays and
 * in the {@code String} and {@code StringBuffer} classes. In
 * this representation, supplementary characters are represented as a pair
 * of {@code char} values, the first from the <em>high-surrogates</em>
 * range, (&#92;uD800-&#92;uDBFF), the second from the
 * <em>low-surrogates</em> range (&#92;uDC00-&#92;uDFFF).
 *
 * <p>A {@code char} value, therefore, represents Basic
 * Multilingual Plane (BMP) code points, including the surrogate
 * code points, or code units of the UTF-16 encoding. An
 * {@code int} value represents all Unicode code points,
 * including supplementary code points. The lower (least significant)
 * 21 bits of {@code int} are used to represent Unicode code
 * points and the upper (most significant) 11 bits must be zero.
 * Unless otherwise specified, the behavior with respect to
 * supplementary characters and surrogate {@code char} values is
 * as follows:
 *
 * <ul>
 * <li>The methods that only accept a {@code char} value cannot support
 * supplementary characters. They treat {@code char} values from the
 * surrogate ranges as undefined characters. For example,
 * {@code Character.isLetter('\u005CuD840')} returns {@code false}, even though
 * this specific value if followed by any low-surrogate value in a string
 * would represent a letter.
 *
 * <li>The methods that accept an {@code int} value support all
 * Unicode characters, including supplementary characters. For
 * example, {@code Character.isLetter(0x2F81A)} returns
 * {@code true} because the code point value represents a letter
 * (a CJK ideograph).
 */

A character within the Basic Multilingual Plane (BMP) can be represented by a single char; a character outside it needs two chars.
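The Character class exposes this distinction directly. The sketch below uses only standard Character methods; the class name and the two constants are made up for illustration, and 0x1F60A is the code point of the emoji used in demo().

```java
final class CodePointShape {
    static final int XIAO = 0x5C0F;   // the Chinese character from the example string, inside the BMP
    static final int SMILE = 0x1F60A; // the emoji, a supplementary character

    // Number of char values needed to represent the code point: 1 or 2.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    // True when the code point is encoded as a high surrogate followed by a low surrogate.
    static boolean isSurrogatePair(int codePoint) {
        char[] units = Character.toChars(codePoint);
        return units.length == 2
                && Character.isHighSurrogate(units[0])
                && Character.isLowSurrogate(units[1]);
    }

    public static void main(String[] args) {
        System.out.println(Character.isBmpCodePoint(XIAO));            // true
        System.out.println(charsNeeded(XIAO));                         // 1
        System.out.println(Character.isSupplementaryCodePoint(SMILE)); // true
        System.out.println(charsNeeded(SMILE));                        // 2
        System.out.println(isSurrogatePair(SMILE));                    // true
    }
}
```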

Further thinking

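How do you correctly truncate text that contains emoji characters?
Why is alignment needed? Besides object alignment, where else is alignment used?
Why do most of the classes we define compute hashCode directly? In which scenarios does a String-style lazily cached hashCode fit?
Why is String designed to be immutable? How can a custom class be made immutable? What are the benefits of immutable objects?
What is in a stack frame's local variable table? Does it contain this?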