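Summary: This post is part of the class Student series. The idea is to take a very simple piece of code, analyze it through a set of questions, and deepen my own understanding of the code along the way.
As the title says, here is the very simple piece of code: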
class Student {
    int age;      // primitive field, defaults to 0
    String name;  // reference field, defaults to null

    static Student demo() {
        Student xm = new Student();
        xm.age += 10;
        xm.name = "小ming😊";
        return xm;
    }

    public static void main(String[] args) {
        Student.demo();
    }
}
1. When demo() runs, where is memory allocated, what is allocated, and how large is each piece?
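- A new stack frame is pushed onto the thread's method stack. The frame contains:
  - Local variable table: holds one reference, xm, pointing to the object on the heap: 8 bytes
    - A reference is generally one machine word: 4 bytes on a 32-bit machine and 8 bytes on a 64-bit machine (the rest of this discussion assumes 64-bit). A 64-bit address space covers 2^32 * 2^32 = 2^64 bytes (16 EiB), far more memory than is normally used, so some JVM implementations compress references to store them in less space. The discussion below ignores this implementation-specific optimization.
  - Return address: 8 bytes
  - Operand stack: x bytes (the maximum depth is determined at compile time and varies with the code)
    - Only exists in stack-based virtual machines, for example HotSpot and the usual desktop and server JVM implementations
    - Dalvik and ART, used on Android, are register-based and have no operand stack
  - A reference to the run-time constant pool (used to resolve constants and methods)
- A Student object is allocated on the heap. It consists of four parts:
  - Object header (pointer to the class, GC information, lock state, and related data): about 16 bytes
  - Field age (int, a value type): 4 bytes
  - Field name (String, a reference type): one machine word, 8 bytes
  - Alignment padding (objects are generally aligned to 4 or 8 bytes): the Student object needs 4 bytes, because with 8-byte alignment 16 + 4 + 8 = 28 is rounded up to 32
- "小ming😊" in the constant pool
  - Java strings use UTF-16 encoding (which is why a char is 16 bits), so "小ming" = 5 * 2 bytes
  - The emoji lies outside the Basic Multilingual Plane, so it needs two char values: 2 * 2 bytes (see the Java String documentation and the code point API)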
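The size arithmetic for the Student object can be written down as a small sketch. The class and method names here are hypothetical, and the header and reference sizes are passed in as parameters because they depend on the JVM and its settings. The second call assumes a 12-byte header and 4-byte references, which is what a 64-bit HotSpot typically uses when reference compression is enabled.

```java
/** Back-of-the-envelope size estimate for the Student object (illustrative only). */
final class ObjectSizeSketch {

    /** Rounds size up to the next multiple of alignment (alignment must be a power of two). */
    static int alignUp(int size, int alignment) {
        return (size + alignment - 1) & -alignment;
    }

    /**
     * Header + one int field + one reference field, padded to an 8-byte boundary.
     * headerBytes and referenceBytes are assumptions supplied by the caller.
     */
    static int estimateStudentSize(int headerBytes, int referenceBytes) {
        int raw = headerBytes + Integer.BYTES + referenceBytes; // age is an int: 4 bytes
        return alignUp(raw, 8);
    }

    public static void main(String[] args) {
        // Assumption used in the text: 16-byte header, 8-byte references.
        System.out.println(estimateStudentSize(16, 8)); // 16 + 4 + 8 = 28, padded to 32
        // Assumption: compressed references, 12-byte header, 4-byte references.
        System.out.println(estimateStudentSize(12, 4)); // 12 + 4 + 4 = 20, padded to 24
    }
}
```

This is only an estimate. A tool that inspects the running JVM is needed to confirm the actual layout.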
Food for thought: where does the 10 live?
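2. How much memory does the string occupy, and how is it encoded?
As analyzed above, Java uses UTF-16. A character outside the Basic Multilingual Plane is a single code point represented by two char values (a surrogate pair).
Elsewhere, UTF-8 is by now the default encoding in most places. UTF-8 is a variable-length encoding: its first 128 characters are compatible with ASCII, and a common Chinese character takes three bytes.
Character and Unicode encoding is a fairly complex topic and my understanding of it is shallow, so I will not go further here. Interested readers can dig deeper.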
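The counts above can be checked with a few standard String calls. This is a minimal sketch. The class name is hypothetical, and the demo string is written with Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

/** Counts code units, code points, and encoded bytes for the name used in demo(). */
final class StringEncodingSketch {

    // The name from demo(): one CJK character, "ming", and one emoji (U+1F60A)
    // written as its surrogate pair.
    static final String NAME = "\u5C0Fming\uD83D\uDE0A";

    public static void main(String[] args) {
        // UTF-16 code units (char values): 5 + 2 = 7
        System.out.println(NAME.length());
        // Unicode code points: 6, because the emoji counts once
        System.out.println(NAME.codePointCount(0, NAME.length()));
        // UTF-16 bytes without a byte-order mark: 7 * 2 = 14
        System.out.println(NAME.getBytes(StandardCharsets.UTF_16BE).length);
        // UTF-8 bytes: 3 (CJK character) + 4 (ASCII letters) + 4 (emoji) = 11
        System.out.println(NAME.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

These numbers cover the character data only. The String object and its backing array add their own headers on top.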
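3. What does the memory layout look like on the method stack? And on the heap?
Note: memory layout and memory size are related but different concepts. Layout carries more information, for example whether the memory is contiguous or scattered and how different pieces of memory relate to one another. For the same memory consumption, different layouts can differ greatly in performance.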
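One way to picture the difference between layout and size is to compare an int[] with an Integer[] holding the same values. The sketch below is illustrative only: the class and method names are hypothetical, it measures nothing, and where objects actually end up in memory depends on the JVM. The point is only that the first array holds its values inline, while the second holds references to separately allocated objects.

```java
/** Same logical data, two different layouts (illustrative, not a benchmark). */
final class LayoutSketch {

    /** Sums values stored inline in the array. */
    static long sumPacked(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Sums values reached through one reference per element. */
    static long sumBoxed(Integer[] values) {
        long total = 0;
        for (Integer v : values) {
            total += v; // unboxing follows the reference to the Integer object
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] packed = new int[n];
        Integer[] boxed = new Integer[n];
        for (int i = 0; i < n; i++) {
            packed[i] = i;
            boxed[i] = i; // autoboxing: the array slot stores a reference
        }
        System.out.println(sumPacked(packed));
        System.out.println(sumBoxed(boxed));
    }
}
```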
Extensions:
1. String's lazy, cached hashCode
    /** Cache the hash code for the string */
    private int hash; // Default to 0

    /**
     * Returns a hash code for this string. The hash code for a
     * {@code String} object is computed as
     * <blockquote><pre>
     * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * </pre></blockquote>
     * using {@code int} arithmetic, where {@code s[i]} is the
     * <i>i</i>th character of the string, {@code n} is the length of
     * the string, and {@code ^} indicates exponentiation.
     * (The hash value of the empty string is zero.)
     *
     * @return a hash code value for this object.
     */
    public int hashCode() {
        int h = hash;
        final int len = length();
        if (h == 0 && len > 0) {
            for (int i = 0; i < len; i++) {
                h = 31 * h + charAt(i);
            }
            hash = h;
        }
        return h;
    }
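One interesting point: String's hashCode is quite different from the compute-on-every-call code we usually write. It caches the result in an auxiliary field and computes it lazily.
As I understand it, this rests on a few considerations:
1. String objects are immutable, which is the precondition for caching the hash code
2. For very long strings, caching avoids a severe bad case (the cost of hashing a String is proportional to its length, O(n))
3. Memory is comparatively cheap here: the object header already takes 16 bytes, so one more int is negligible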
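The same pattern can be applied to a user-defined immutable class. The sketch below is a minimal example under stated assumptions: IntSequence is a hypothetical class, not taken from the JDK, and it reuses String's convention that a cached value of 0 means "not computed yet".

```java
import java.util.Arrays;

/** Hypothetical immutable class that caches its hash code lazily, String-style. */
final class IntSequence {
    private final int[] values; // never modified after construction
    private int hash;           // 0 means "not computed yet"

    IntSequence(int... values) {
        this.values = values.clone(); // defensive copy keeps the object immutable
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof IntSequence
                && Arrays.equals(values, ((IntSequence) other).values);
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && values.length > 0) {
            h = Arrays.hashCode(values); // O(n), done on first use
            hash = h;                    // later calls return the cached value
        }
        return h;
    }

    public static void main(String[] args) {
        IntSequence a = new IntSequence(1, 2, 3);
        IntSequence b = new IntSequence(1, 2, 3);
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```

As in the String code above, a value whose real hash happens to be 0 is recomputed on every call. No locking is used: the field is a single int and the data never changes, so two threads racing here can only compute and store the same value.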
2. Unicode, Character, String
String.java
/**
* <p>A {@code String} represents a string in the UTF-16 format
* in which <em>supplementary characters</em> are represented by <em>surrogate
* pairs</em> (see the section <a href="Character.html#unicode">Unicode
* Character Representations</a> in the {@code Character} class for
* more information).
* Index values refer to {@code char} code units, so a supplementary
* character uses two positions in a {@code String}.
* <p>The {@code String} class provides methods for dealing with
* Unicode code points (i.e., characters), in addition to those for
* dealing with Unicode code units (i.e., {@code char} values).
*/
Character.java
/**
* <p><a name="BMP">The set of characters from U+0000 to U+FFFF</a> is
* sometimes referred to as the <em>Basic Multilingual Plane (BMP)</em>.
* <a name="supplementary">Characters</a> whose code points are greater
* than U+FFFF are called <em>supplementary character</em>s. The Java
* platform uses the UTF-16 representation in {@code char} arrays and
* in the {@code String} and {@code StringBuffer} classes. In
* this representation, supplementary characters are represented as a pair
* of {@code char} values, the first from the <em>high-surrogates</em>
* range, (\uD800-\uDBFF), the second from the
* <em>low-surrogates</em> range (\uDC00-\uDFFF).
*
* <p>A {@code char} value, therefore, represents Basic
* Multilingual Plane (BMP) code points, including the surrogate
* code points, or code units of the UTF-16 encoding. An
* {@code int} value represents all Unicode code points,
* including supplementary code points. The lower (least significant)
* 21 bits of {@code int} are used to represent Unicode code
* points and the upper (most significant) 11 bits must be zero.
* Unless otherwise specified, the behavior with respect to
* supplementary characters and surrogate {@code char} values is
* as follows:
*
* <ul>
* <li>The methods that only accept a {@code char} value cannot support
* supplementary characters. They treat {@code char} values from the
* surrogate ranges as undefined characters. For example,
* {@code Character.isLetter('\u005CuD840')} returns {@code false}, even though
* this specific value if followed by any low-surrogate value in a string
* would represent a letter.
*
* <li>The methods that accept an {@code int} value support all
* Unicode characters, including supplementary characters. For
* example, {@code Character.isLetter(0x2F81A)} returns
* {@code true} because the code point value represents a letter
* (a CJK ideograph).
*/
A character within the Basic Multilingual Plane (BMP) can be represented by a single char. A character outside it needs two char values.
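The surrogate-pair rule from the Javadoc can be observed directly with the static methods on Character. This is a small sketch. The class and the hexUnits helper are hypothetical names introduced for illustration.

```java
/** Shows how one code point maps to one or two UTF-16 code units. */
final class SurrogateSketch {

    /** Returns the UTF-16 code units of a code point as space-separated hex. */
    static String hexUnits(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cjk = 0x5C0F;    // a BMP character: one char
        int emoji = 0x1F60A; // a supplementary character: two chars

        System.out.println(hexUnits(cjk));   // 5C0F
        System.out.println(hexUnits(emoji)); // D83D DE0A

        char[] pair = Character.toChars(emoji);
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == emoji); // true

        // The two examples from the Character Javadoc quoted above.
        System.out.println(Character.isLetter((char) 0xD840)); // false: a lone surrogate
        System.out.println(Character.isLetter(0x2F81A));       // true: a CJK ideograph
    }
}
```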
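Further questions
- How do you correctly truncate text that contains emoji?
- Why is alignment needed? Besides object alignment, where else is alignment used?
- Why do most of the classes we define compute hashCode directly? In which scenarios does a String-style lazily cached hashCode make sense?
- Why is String designed to be immutable? How can a custom class be made immutable? What are the benefits of immutable objects?
- What is in a stack frame's local variable table? Does it contain this?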