HashMap小探(一) — 基本属性

hashmap里面的重要字段及方法：

capacity & size

capacity是指当前hashmap的容量，注意是当前，因为hashmap本身存在扩容。

size则是指当前hashmap中的实际元素个数。

注意这里两个值都不是table数组的长度，而是hashmap 数组 + 链表作为一个整体的元素。

其在hashmap.java中的定义如下：

    /**
     * The number of key-value mappings contained in this map.
     */
    transient int size;

    .......
        
    final int capacity() {
        return (table != null) ? table.length :
            (threshold > 0) ? threshold :
            DEFAULT_INITIAL_CAPACITY;
    }

    // 这里实现用到的table\threshold\DEFAULT_INITIAL_CAPACITY详见下文

下面代码说明：

        hashmap map = new hashmap();
        map.put("jin", "alan");

        // 获取map的类-类型
        Class mapClazz = map.getClass();

        /*
            capacity，当前容量，是个方法，其实现是：
        */
        // 通过反射获取其当前的capacity,默认会是16
        Method capacityMethod = mapClazz.getDeclaredMethod("capacity");
        // 设置访问权限
        capacityMethod.setAccessible(true);

        System.out.println("map capacity = " + capacityMethod.invoke(map));


        // 通过反射获取其 size，也就是目前hashmap中有多少元素，刚刚put了一个，所以会是1
        // 也可以用Method，因为有一个 int size()方法
        Field sizeField = mapClazz.getDeclaredField("size");
        sizeField.setAccessible(true);
        System.out.println("map size = " + sizeField.get(map));

通过反射取到其方法、字段，执行后得到对应的值，印证了注释中预估的结果。

但是需要注意，如果刚开始指定了大小，那么threshold的值就是tableSizeFor的值，也就是初始化的hashmap容量（2^n），jdk中源代码如下：

    /**
     * Constructs an empty <tt>hashmap</tt> with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     * @throws IllegalArgumentException if the initial capacity is negative
     *         or the load factor is nonpositive
     */
    public hashmap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);
        this.loadFactor = loadFactor;
        this.threshold = tableSizeFor(initialCapacity);
    }

看最后一行代码对threshold的赋值就知道了，下面写测试代码验证下：

        Map giveSizeMap = new hashmap(14);
        System.out.println("giveSizeMap capacity = " + capacityMethod.invoke(giveSizeMap)); //16
        System.out.println("giveSizeMap size = " + sizeField.get(giveSizeMap)); // 0
        System.out.println("giveSizeMap loadFactor = " + loadFactorField.get(giveSizeMap)); // 0.75
        System.out.println("giveSizeMap threldold =" + thresholdField.get(giveSizeMap)); // ?  = 16

不出所料，打印出来threshold的值是16.

但是下面往这个map中添加值，如下：

        giveSizeMap.put("1", "a");
 
        System.out.println("now giveSizeMap capacity = " + capacityMethod.invoke(giveSizeMap)); //16
        System.out.println("now giveSizeMap size = " + sizeField.get(giveSizeMap)); // 0
        System.out.println("now giveSizeMap loadFactor = " + loadFactorField.get(giveSizeMap)); // 0.75
        System.out.println("now giveSizeMap threldold =" + thresholdField.get(giveSizeMap)); // 12

就会看到，一但向其中put数据，threshold的值就会变成 capacity * loadFactor，具体put的实现下面会详细分析。

loadFactor & threshold

loadFactor的意思是装载因子，是一个小于等于1的小数，其作用就是定义扩容阈值百分比。默认值0.75。

threshold就是具体的扩容阈值了，threshold = loadFactor * capacity。

loadFactor的值为什么是0.75呢？capacity的值是2的N次方，所以当N大1时，后续3/4 * capacity，得到的会一直是整数。

默认的threshold = 16 * 0.75 = 12，当hashmap中的size值大于threshold时，会触发扩容。

下面使用代码验证hashmap的扩容过程。

        hashmap map3 = new hashmap(3);

        // 通过反射获取其当前的capacity,默认会是16
        Method capacityMethod = mapClazz.getDeclaredMethod("capacity");
        // 设置访问权限
        capacityMethod.setAccessible(true);
        
        // size
        Field sizeField = mapClazz.getDeclaredField("size");
        sizeField.setAccessible(true);

        // loadFactor
        Field loadFactorField = mapClazz.getDeclaredField("loadFactor");
        loadFactorField.setAccessible(true);

        // threshold
        Field thresholdField = mapClazz.getDeclaredField("threshold");
        thresholdField.setAccessible(true);

        // 使用map3 做扩容的验证过程
        map3.put("1","a");
        map3.put("2","a");
        map3.put("3","a");

        // 获取capacity、size、threshold、loadFactor
        System.out.println("map capacity = " + capacityMethod.invoke(map3)); //4
        System.out.println("map size = " + sizeField.get(map3)); // 3
        System.out.println("map loadFactor = " + loadFactorField.get(map3)); // 0.75
        System.out.println("map threldold ="+ thresholdField.get(map3)); // 4 * 0.75 = 3

        // 再增加一个元素,大于threshold，触发扩容
        System.out.println("capacity resize");
        map3.put("4","a");
        System.out.println("map capacity = " + capacityMethod.invoke(map3)); //8
        System.out.println("map size = " + sizeField.get(map3)); // 4
        System.out.println("map loadFactor = " + loadFactorField.get(map3)); // 0.75
        System.out.println("map threldold ="+ thresholdField.get(map3)); // 8 * 0.75 = 6

table

    /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
     */
    transient Node<K,V>[] table;

table其实就是hashmap的存储底层，也就是一个数组，table.length总是2的N次方，不过<key，value>被封装到了Node对象里面，详细的可以看下面关于Node的内容。

获取table的长度

        // 获取table的长度
        Field tableField = mapClazz.getDeclaredField("table");
        tableField.setAccessible(true);
    
        System.out.println("map table length = " + ((HashMap.Entry[])tableField.get(map3)).length);

测试后可以发现，table.length与capacity的值总是一样，也就是 table是所有数据的存储容器。

tableSizeFor

这个方法用来计算扩容时hashmap的容量值，可以看到，hashmap的扩容并不会以用户指定的值为准，而是去寻找一个大于指定值的最小2的N次方。而计算的实现就是用这个方法来做的，先看代码：

    /**
     * Returns a power of two size for the given target capacity.
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;   // 第一步：先把指定的值减1
        // 第二步，向右移位 并与当前值按位或；其目的对于一个数字的二进制，从第一个不为0的位开始，把后面的所有位都设置成1。
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        // 第三步，计算极限值，最后+1 返回
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

第一步为什么减1呢？先卖个关子，看核心的第二步

        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;

先说下右移位，也就是将二进制数向右移对应的位数。

按位或，就是将两个参与运算的二进制数，任意为1的位设置为1，否则设置为0.

先根据第一行举个例子，假设n=2，其对应的二级制数为 10，那么会有如下操作：

0010 >>> 1 =  0001 (右移1位)   ---- 第一行，右移1位

0010 | 0001 = 0011 （按位或），

0011 >>> 2 = 0000 (右移2位)    ---- 第二行，右移2位

0011 | 0000 = 0011 (按位或)

下面右移4位、8位、16位，最终得到的结果都是 0011 ，也就是十进制的3

最后第三步，当小于极限值的时候，需要+ 1，3+ 1 = 4

考虑到第一步，n = cap -1，也就是cap=3

也就是，十进制数3 的最近满足需要的容量是 4

这个不过瘾，n来个大点儿的数，比如18，二进制数是：0001 0010

0001 0010 >>> 1 = 0000 1001       ---- 第一行，右移1位
0001 0010 | 0000 1001 = 0001 1011  

0001 1011 >>> 2 = 0000 0110   ---- 第二行，右移两位
0001 1011 | 0000 0110 = 0001 1111   
到这里已经把第一位不为0，后面所有的0都设置为了1，再移位4、8、16，都是一样的结果

所以这里的值是 0001 1111 = 16+8+4+2+1 = 31

设置的值：cap = n + 1 = 19

其对应的实际容量是 31 + 1= 32

到这里，第二步和第三步都已经解释清楚了，那么第一步的 n = cap -1 是干啥的呢？

当cap本身就是2的N次方时，上面的算法是有点问题的，比如传入的capacity是4，那么得到的结果本应该是4，结果得到了8

假设没有 n = cap - 1 这一行
n = cap = 4，换算成二进制就是 0100，下面进行换算：
0100 >>> 1 = 0010
0010 | 0100 = 0110

0110 >> 2 = 0001
0001 | 0110 = 0111  --- 都变成了1，结束

执行第三步，实际的容量是 4+2+1  +1 = 8，其实是多设置了一倍。

所以刚开始就把n的值设置为4 - 1 =3，最终得到的容量会是4，完美解决。这就是n = cap - 1 的由来。

所以这里tableSizeFor只是做单一的事情：寻找距离最近的2的N次方。

而实际上，如果设置了cap=4，得到的实际capacity值也是4的话，马上就会触发一次扩容，反而性能不好。不过这里jdk的开发人员这么做，只是为了一个方法只做一件事情，无可厚非。所以在初始化设置就需要参考下一节的内容。

初始化时指定容量

如果在初始化hashmap时传入预估的容量值，hashmap会把值设置为大于指定值最近的一个2的N次方，比如1 —>1, 3—>4, 6—>8, 55—>64。下面进行代码测试：

        hashmap map = new hashmap(1); // 指定初始化值

        // 获取map的类-类型
        Class mapClazz = map.getClass();

        Method capacityMethod = mapClazz.getDeclaredMethod("capacity");
        // 设置访问权限
        capacityMethod.setAccessible(true);

        System.out.println("map capacity =" + capacityMethod.invoke(map1));  // 1

        hashmap map3 = new hashmap(3);
        System.out.println("map capacity = " + capacityMethod.invoke(map3)); // 4

        hashmap map55 = new hashmap(55);
        System.out.println("map capacity = " + capacityMethod.invoke(map55)); // 64

执行后结果印证了tableSizeFor的作用。另外注意，这里只用了一个反射获取对应的method，通过invoke时传入不同的对象来获取对应的值。

关于初始化值的设置：

如果知道了对应的数据个数为7，那么是不是直接传入7就可以了呢？

直接设置7，tableSizeFor得到的值是8，而threshold的值是6，7 >6，所以数据设置之后马上就会触发一次扩容，hashmap的capacity从8变成16，会有一定的性能损耗。

所以设置的初始值应该是：数据长度 / loadFactor + 1, 也就是 7 / 0.75 + 1 = 10,那么tableSizeFor得到的就是16，不需要进行扩容。

hashmap的hash

先看下JDK1.8里面的实现：

    /**
     * Computes key.hashCode() and spreads (XORs) higher bits of hash
     * to lower.  Because the table uses power-of-two masking, sets of
     * hashes that vary only in bits above the current mask will
     * always collide. (Among known examples are sets of Float keys
     * holding consecutive whole numbers in small tables.)  So we
     * apply a transform that spreads the impact of higher bits
     * downward. There is a tradeoff between speed, utility, and
     * quality of bit-spreading. Because many common sets of hashes
     * are already reasonably distributed (so don't benefit from
     * spreading), and because we use trees to handle large sets of
     * collisions in bins, we just XOR some shifted bits in the
     * cheapest possible way to reduce systematic lossage, as well as
     * to incorporate impact of the highest bits that would otherwise
     * never be used in index calculations because of table bounds.
     */
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

可以看到，这里的hash取值，将key.hashCode右移16位然后与其本身进行按位异或（(h = key.hashCode()) ^ (h >>> 16)），目的是对hashCode进行混淆，避免出现一些key对应的hashcode高位不同、低位相同，但是在取模（需要找hashmap中数组的下标）的时候低位相同就可能出现碰撞的问题。

这里JDK1.8 跟JDK1.7的差别还是蛮大的，看下jdk7的代码：

    /**
     * Applies a supplemental hash function to a given hashCode, which
     * defends against poor quality hash functions.  This is critical
     * because hashmap uses power-of-two length hash tables, that
     * otherwise encounter collisions for hashCodes that do not differ
     * in lower bits. Note: Null keys always map to hash 0, thus index 0.
     */
    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    /**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length-1);
    }

这里的参数h是指key的hashcode(), 调用时是这样：int hash = hash(key.hashCode());

而一系列针对h的右移位、按位异或(^)，不过是跟jdk8中的 (h = key.hashCode()) ^ (h >>> 16) 一样的目的，扰乱hashcode的正常值，尽可能的避免hash碰撞。

注意jdk中的 indexFor方法，这个方法是获取hash值对应的数组下标的，不过这里为什么用了一个按位与(&)操作呢？

jdk7中indexFor为什么使用按位与

按位与在这里的作用，其实与取模的结果是一样的，indexFor方法本来就是为了根据hash值获取对应的数组下标。那么为什么不用取模呢？性能，一切为了性能！

在1.7的HashTable中，其实取下标是用了取模的，模是hashTable的长度：

int index = (hash & 0x7FFFFFFF) % tab.length; （这里&只是为了保证为正数，毕竟首位为0 代表正数，首位为1代表负数）。

不过HashTable在初始化的时候默认值是11，每次扩容是 2N +1 ，也就是始终都是奇数个，使用取模来处理更简单，而且按位与也无法保证结果与取模相同。

而hashmap则始终是2的N次方长度，这就为&操作提供了便利,记住下面这个公式：

X % 2^n = X & (2^n – 1)

2^n表示2的n次方，也就是说，一个数对2^n取模 == 一个数和(2^n – 1)做按位与运算 。

假设n为3，则2^3 = 8，表示成2进制就是1000。2^3 -1 = 7 ，即0111。

假设x=18，也就是 0001 0010，一起看下结果：
对 2的n次方 取模： 18 % 8 = 2
对 2的n次方-1 按位与： 0001 0010 & 0111 = 0000 0010 = 2

不过这个要求是， tab.length必须是2的N次方，而hashmap的capacity设计的就是2的N次方长度，此时按位与刚好满足需要，又可以提高性能。

至于HashTable为什么这么设计，个人猜想是本来内部就采用了synchronized机制，性能偏低，实现层面就简单点好了。另外，hashTable也确实不常用…...

JDK8中消失的indexFor

在jdk8的hashmap源码中，indexFor方法消失了。不过同样的代码其实包含在了具体的使用场景，比如get方法中引用的getNode实现：

first = tab[(n - 1) & hash]

其实就是将indexFor这个一行代码的方法合并到具体实现中了。