Refs:
Problem:
Reservoir Sampling is an algorithm for sampling elements from a stream of data. Imagine you are given a really large stream of data elements, for example:
- Queries on google searches in May
- Products bought at Walmart during the Christmas season
- Names in a phone book
Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?
简而言之就是data stream,最好是one pass,只遍历数据一遍就抽样。
解法:stream里面抽k个样,维持一个reservoir (k大小),之后每进(k+i)一个数A,以 k/(k+i)的比例抽样留下A,然后再在reservoir里面均匀抽样某一个数被A替换。这样子无论是新树A还是之前的所有数,在reservoir中survive的probability都是 k/(k+i). (之前的数: 1/(k+i-1) * k/(k+i) )