布隆过滤器基本原理及源码分析

Bloom-Filter

即布隆过滤器，1970年由Bloom中提出。它可以用于检索一个元素是否在一个集合中。
Bloom Filter（BF）是一种空间效率很高的随机数据结构，它利用位数组很简洁地表示一个集合，并能判断一个元素是否属于这个集合。它是一个判断元素是否存在集合的快速的概率算法。Bloom Filter有可能会出现错误判断，但不会漏掉判断。也就是Bloom Filter判断元素不再集合，那肯定不在。如果判断元素存在集合中，有一定的概率判断错误。因此，Bloom Filter不适合那些“零错误”的应用场合。而在能容忍低错误率的应用场合下，Bloom Filter比其他常见的算法（如hash，折半查找）极大节省了空间。
它的优点是空间效率和查询时间都远远超过一般的算法，缺点是有一定的误识别率和删除困难。

基本原理

布隆过滤器的原理是，当一个元素被加入集合时，通过K个散列函数将这个元素映射成一个位数组中的K个点，把它们置为1。检索时，我们只要看看这些点是不是都是1就（大约）知道集合中有没有它了：如果这些点有任何一个0，则被检元素一定不在；如果都是1，则被检元素很可能在。这就是布隆过滤器的基本思想。
Bloom Filter跟单哈希函数Bit-Map不同之处在于：Bloom Filter使用了k个哈希函数(源代码只用一个hash函数,见下文)，每个字符串跟k个bit对应。从而降低了冲突的概率。

image.png

应用时首先要先由用户决定要add的元素数n和希望的误差率P。这也是一个设计完整的布隆过滤器需要用户输入的仅有的两个参数，之后的所有参数将由系统计算。

Google guava源码分析

  /**
   * Creates a {@link BloomFilter} with the expected number of insertions and expected false 
   */
  public static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp) {
    return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);//默认策略为MURMUR128_MITZ_64
  }

  @VisibleForTesting//仅在包内使用
  static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    if (expectedInsertions == 0) {
      expectedInsertions = 1;
    }
   //根据预insert 的个数和fpp计算bit数
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
//根据预insert 的个数和bit 数预hash function 数
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
  }

  /**
   * Creates a {@link BloomFilter} with the expected number of insertions and a default expected
   * false positive probability of 3%. FPP
* 默认误预测率为3%
   */
  public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
    return create(funnel, expectedInsertions, 0.03); // FPP, for 3%, we always get 5 hash functions
  }

bit 数计算

/**
   * Computes m (total bits of Bloom filter) which is expected to achieve, for the specified
   * expected insertions, the required false positive probability.
   *
   * <p>See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the
   * formula.
   *
   * @param n expected insertions (must be positive)
   * @param p false positive rate (must be 0 < p < 1)
   */
  @VisibleForTesting
  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

bit数估算公式

HashFunctions 个数计算

static int optimalNumOfHashFunctions(long expectEntries, long bitSize) {
        return Math.max(1, (int) Math.round((double) bitSize / expectEntries * Math.log(2)));//k=(ln2)*(m/n)
    }

hash次数计算公式

多次hash计算细节
以MURMUR128_MITZ_64为例

MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();//hash函数只计算一次
      long hash1 = lowerEight(bytes);//取低8位
      long hash2 = upperEight(bytes);//取高8位

      boolean bitsChanged = false;
      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
//确保combined（每次+高8位 hash2）为正数并且是可索引的
//combinedHash & Long.MAX_VALUE 会将负数的补码直接变成正数 以byte举例 127&-1 转成二进制补码为
0b01111111&0b11111111=0b01111111(即为正原码,正数原码，反码，补码相同)
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);//%bitSize即分桶操作
        combinedHash += hash2;
      }
      return bitsChanged;
    }

复杂度

空间复杂度方面，Bloom Filter不会动态增长，运行过程中维护的始终只是m位的bitset，所以空间复杂度只有O(m)。

时间复杂度方面，Bloom Filter的插入与属于操作主要都是在计算k个hash，所以都是O(k)。

public class BloomFilterTester {
    public static void main(String[] args) throws IOException {
        BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(),
                50, 0.001);
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        bloomFilter.writeTo(byteArrayOutputStream);
        System.out.println("初始化bits:"+byteArrayOutputStream.toByteArray().length);
        for (int i = Integer.MAX_VALUE; i > Integer.MAX_VALUE-200; i--) {
            bloomFilter.put(i);
        }

        byteArrayOutputStream = new ByteArrayOutputStream();
        //序列化
        bloomFilter.writeTo(byteArrayOutputStream);
        System.out.println("写入后bits:"+byteArrayOutputStream.toByteArray().length);
        ByteArrayInputStream byteArrayInputStream=new ByteArrayInputStream(byteArrayOutputStream.toByteArray());
        //序列化
        BloomFilter bloomFilter1= BloomFilter.readFrom(byteArrayInputStream,Funnels.integerFunnel());
        System.out.println("反序列化后："+bloomFilter.approximateElementCount()+"-"+bloomFilter1.approximateElementCount());
    }
}

结果占位不变

初始化bits:102
写入后bits:102
反序列化后：196-196

基数估算

public long approximateElementCount() {
    long bitSize = bits.bitSize();/** Number of bits */ bit 的个数 m
    long bitCount = bits.bitCount(); //Number of set bits (1s).被置为1的bit数X

    /**
     * Each insertion is expected to reduce the # of clear bits by a factor of
     * `numHashFunctions/bitSize`. So, after n insertions, expected bitCount is `bitSize * (1 - (1 -
     * numHashFunctions/bitSize)^n)`. Solving that for n, and approximating `ln x` as `x - 1` when x
     * is close to 1 (why?), gives the following formula.
     */
    double fractionOfBitsSet = (double) bitCount / bitSize;
//Math.log(1+x) == Math.log1p(x)
    return DoubleMath.roundToLong(
        -Math.log1p(-fractionOfBitsSet) * bitSize / numHashFunctions, RoundingMode.HALF_UP);
  }

基数估算公式

错误率

错误率有两种：

FP = false positive
FN = false negative
对应Bloom Filter的情况下，FP就是「集合里没有某元素，查找结果是有该元素」，FN就是「集合里有某元素，查找结果是没有该元素」。FN显然总是0，FP会随着Bloom Filter中插入元素的数量而增加——极限情况就是所有bit都为1，这时任何元素都会被认为在集合里。

FP的推导并不复杂，wiki上有非常详细的过程，这里就简单地抄个结果，其中n是当前集合里元素的数量(并不是期望个数)：

错误率计算公式

从这个公式里可以看出 n = 0时，FP = 0；n趋于无穷大时，FP趋于1

参考

https://my.oschina.net/LucasZhu/blog/1813110
https://www.cnblogs.com/z941030/p/9218356.html