Reservoir Sampling

Problem:

  • Choose <code>k</code> entries from <code>n</code> numbers. Make sure each number is selected with the probability of <code>k/n</code>

Basic idea:

  • Choose <code>1, 2, 3, ..., k</code> first and put them into the reservoir.
  • For <code>k+1</code>, pick it with a probability of <code>k/(k+1)</code>, and randomly replace a number in the reservoir.
  • For <code>k+i</code>, pick it with a probability of <code>k/(k+i)</code>, and randomly replace a number in the reservoir.
  • Repeat until <code>k+i</code> reaches <code>n</code>

Proof:

  • For <code>k+i</code>, the probability that it is selected and will replace a number in the reservoir is <code>k/(k+i)</code>
  • For a number in the reservoir before (let's say <code>X</code>), the probability that it keeps staying in the reservoir is
    • <code>P(X was in the reservoir last time)</code> × <code>P(X is not replaced by k+i)</code>
    • <code>P(X was in the reservoir last time)</code> × (<code>1</code> - <code>P(k+i is selected and replaces X)</code>)
    • = <code>k/(k+i-1)</code> × (<code>1</code> - <code>k/(k+i)</code> × <code>1/k</code>)
    • = <code>k/(k+i)</code>
  • When <code>k+i</code> reaches <code>n</code>, the probability of each number staying in the reservoir is <code>k/n</code>

Example

  • Choose <code>3</code> numbers from <code>[111, 222, 333, 444]</code>. Make sure each number is selected with a probability of <code>3/4</code>
  • First, choose <code>[111, 222, 333]</code> as the initial reservior
  • Then choose <code>444</code> with a probability of <code>3/4</code>
  • For <code>111</code>, it stays with a probability of
    • <code>P(444 is not selected)</code> + <code>P(444 is selected but it replaces 222 or 333)</code>
    • = <code>1/4</code> + <code>3/4</code>*<code>2/3</code>
    • = <code>3/4</code>
  • The same case with <code>222</code> and <code>333</code>
  • Now all the numbers have the probability of <code>3/4</code> to be picked
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容