2020-08-30-Resampling

Resampling

João Neto

July 2016

Refs:

  • Ziefler - Randomization & Bootstrap Methods using R (2011)

  • Allen Downey - There is only One Test (post 1, post 2, youtube)

Introduction

Downey talks how the standard statistical tests can be seen as analytical solutions of simplified problems back when simulation was not available in pre-computer times.

When we can rely on simulation, we can follow the method presented in this diagram:

image.jpeg

Herein, the observed effect δ∗ is the value computed by a chosen test statistic over the observed data.

The null hypothesis H0 is the model asserting the observed effect δ∗ was due to chance.

The test statistic is a chosen measure of the difference between the data (either observed or simulated) with respect to H0.

The probability we which to compute is P(δ∗|H0). If P(δ∗|H0) is small, it suggests that the effect is probably real and not due to chance.

The Monte Carlo p-value (this is similar but not equal to the standard p-value of Frequentist Statistics) is the probability of having effect δ∗ or something more extreme under the assumption that H0 holds, ie, p(δ∗or more extreme effects|H0). Ie, it is the ratio of effects as extreme as the number observed effect, r, over the total number of simulated effects, n. This proportion tends to under-estimate the p-value, so Davison & Hinkley propose the following correction:

The next R function codes this:

compute.p.value <- function(results, observed.effect, precision=3) {

  # n = #experiences
  n <- length(results)
  # r = #replications at least as extreme as observed effect
  r <- sum(abs(results) >= observed.effect)  
  # compute Monte Carlo p-value with correction (Davison & Hinkley, 1997)
  list(mc.p.value=round((r+1)/(n+1), precision), r=r, n=n)
}

Therefore, the procedure consists of:

  1. Define the Null Hypothesis H0 (assume the effect was due to chance)
  2. Choose a test statistic measurement
  3. Create a stochastic model of H0 in order to produce simulated data
  4. Produce simulated data
  5. compute the MC p-value and assess H0
    The simulation assumes all data permutations are equally probable under H0 (ie, exchangeability)

If the simulation cannot be done - because it’s too slow -, we must search for analytic shortcuts or other methods (but beware of their own simplifying assumptions).

Before we see some egs, let’s add the next function to present the results in a histogram format:

present_results <- function(results, observed.effect,  label="") {

  lst <- compute.p.value(results, observed.effect)

  hist(results, breaks=50, prob=T, main=label,
       sub=paste0("MC p-value for H0: ", lst$mc.p.value),
       xlab=paste("found", lst$r, "as extreme effects for", lst$n, "replications"))
  abline(v=observed.effect, lty=2, col="red")
}

Example 1 - Replacing t-tests

Let’s see this technique used to perform a permutation test that replaces a t-test (eg taken from this youtube lecture):

data <- list(experiment = c(27,20,21,26,27,31,24,21,20,19,23,24,28,19,24,29,18,20,17,31,20,25,28,21,27),
             control    = c(21,22,15,12,21,16,19,15,22,24,19,23,13,22,20,24,18,20))

Our H0 model assumes that the data of both experiment and control are equal. The entire data will be resampled to produce artificial datasets to be compared with the real data; this is the stocasthic model following H0. According to H0 there’s no problem in mixing experiment and control.

The function resampling performs permutation tests on experiment/control datasets (it can be used on other egs):

resampling <- function(n, data, test.statistic) {

  all.data <- c(data$experiment, data$control)

  # get n random permutations of indexes with experiment size
  permutations <- replicate(n, sample(1:length(all.data), length(data$experiment)))

  # apply the test statistics for each permutation, and return all results
  apply(permutations, 2, function(permutation) {
    # all.data[ permutation] is a sample experiment
    # all.data[-permutation] is a sample control
    test.statistic(all.data[permutation], all.data[-permutation])
  })
}

We must also choose a test statistic.

We’ll pick two test statistics to check two different hypothesis:

  • check if is there a difference of means, ie, is the experience an improvement over the control data? (herein, a higher value is better). Which is to ask if under the Null Hypothesis Model, H0, what is the probability that the effect was due to chance?

  • check if the variances of both datasets are the same

diff.means <- function(x,y) mean(x) - mean(y) 
diff.vars  <- function(x,y) var(x)  - var(y)  

Now we apply the simulation and present the results:

n.resamplings <- 1e4

stats <- resampling(n.resamplings, data, diff.means)
present_results(stats, diff.means(data$experiment, data$control), 
                label="Difference of Means")
image.png
stats <- resampling(n.resamplings, data, diff.vars)
present_results(stats, diff.vars(data$experiment, data$control), 
                label="Difference of Variance")
image.png

So our conclusion, concerning the difference of means, is that H0 has strong evidence against it, ie, the observed effect is most probably not due to chance.

Regarding the difference of variance, the simulation favors H0, ie, the difference of variances is probably due to chance.

Example 2 - Replacing χ2-tests

Suppose you run a casino and you suspect that a customer has replaced a die provided by the casino with a ``crooked die’’; that is, one that has been tampered with to make one of the faces more likely to come up than the others. You apprehend the alleged cheater and confiscate the die, but now you have to prove that it is crooked. You roll the die 60 times and get the following results:

<center style="box-sizing: border-box; color: rgb(51, 51, 51); font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">

| | value | frequency |
| 1 | 1 | 8.00 |
| 2 | 2 | 9.00 |
| 3 | 3 | 19.00 |
| 4 | 4 | 6.00 |
| 5 | 5 | 8.00 |
| 6 | 6 | 10.00 |

</center>

What is the probability of seeing results like this by chance? – ref

observed <- c(8,9,19,6,8,10)

data <- list(observed = observed, 
             expected = rep(round(sum(observed)/6),6)) # the most probable result

Our chosen H0 states that the dice is fair

The test statistic is χ2:

The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories – wikipedia

chiSquared <- function(expected, observed) {
  sum((observed-expected)^2/expected)
}

Let’s produce the stochastic model for H0:

resampling <- function(n, data, test.statistic) {

  n.throws <- sum(data$observed)

  get_throws <- function() {
    throws <- c(1:6,sample(1:6, n.throws, rep=TRUE)) # add 1:6 to prevent zeros
    as.numeric(table(throws)) - 1                    # -1 removes those extra
  }

  samples <- replicate(n, get_throws()) # get n dice frequency throws

  apply(samples, 2, function(a.sample) {test.statistic(data$expected, a.sample)})
}

Now we are ready to perform the simulation:

n.resamplings <- 1e4
stats <- resampling(n.resamplings, data, chiSquared)
present_results(stats, chiSquared(data$expected, data$observed))
image.png

There are some evidence that the dice might not be fair.

We can check another test statistic, say chiModule (sum the absolute differences instead of summing the squares). While there is no analytic solution, and so no classical test, here we just need to replace the test statistic chiSquared with this one:

chiModule <- function(expected, observed) {
  sum(abs(observed-expected)/expected)
}

stats <- resampling(n.resamplings, data, chiModule)
present_results(stats, chiModule(data$expected, data$observed))
image.png

This is an expected result, since the module does not punish extreme values as the square version does. This means that the 19’s threes are not so important here. That’s why this second simulation is not that certain about rejecting H0.

Bootstrap

The basic idea of bootstrapping is that inference about a population from sample data (sample -> population) can be modeled by resampling the sample data and performing inference on (resample -> sample). As the population is unknown, the true error in a sample statistic against its population value is unknowable. In bootstrap resamples, the ‘population’ is in fact the sample, and this is known; hence the quality of inference from resample data -> ‘true’ sample is measurable – wikipedia

The bootstrap uses Monte Carlo simulations to resample many datasets based on the original data. These resamples are used to study the variation of a given test statistic.

The bootstrap assumes that the different samples from the observed data are independent of one another.

Here’s a simple eg: one knows a sample of size 30 from a population with N(0,1) distribution. In practice we don’t know the population distribution (otherwise, the bootstrap would not be needed), but let’s assume that in order to compare results. Say, we wish to find out about the variation of its mean:

set.seed(333)
my.sample <- rnorm(30)

test.statistic <- mean
n.resamplings  <- 5e4

# execute bootstrap (resamplig from just the original sample):
boot.samples <- replicate(n.resamplings, test.statistic(sample(my.sample, replace=TRUE)))
# compare it with samples taken from the population:
real.samples <- replicate(n.resamplings, test.statistic(sample(rnorm(30), replace=TRUE)))

plot( density(real.samples), ylim=c(0,2.5), main="mean distributions")
lines(density(boot.samples), col="red")
abline(v=0, lty=2) # true value
legend("topright", c("from population", "from bootstrap", "true mean"), col=c(1,2,1), lty=c(1,1,2))
image.png

This can also be done with the boot package (more info):

library(boot)

# boot() needs a function applying the statistic to the original data over i, a vector of indexes
f <- function(data,i) { test.statistic(data[i]) }

boot.stat    <- boot(my.sample, f, n.resamplings)
boot.samples <- boot.stat$t # recover the bootstrap samples
boot.ci(boot.stat)          # compute confidence intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 50000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot.stat)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   (-0.3822,  0.3421 )   (-0.3830,  0.3448 )  
## 
## Level     Percentile            BCa          
## 95%   (-0.3837,  0.3442 )   (-0.3876,  0.3392 )  
## Calculations and Intervals on Original Scale

Bayesian Bootstrap

In standard bootstrapping observations are sampled with replacement. This implies that observation weights follow multinomial distribution. In Bayesian bootstrap multinomial distribution is replaced by Dirichlet distribution – ref

library(gtools) # use: rdirichlet

set.seed(333)
n.resamplings <- 1000

mean.bb <- function(x, n) {
  apply( rdirichlet(n, rep(1, length(x))), 1, weighted.mean, x = x )
}

boot.bayes <- mean.bb(my.sample, n.resamplings)
plot(density(real.samples), ylim=c(0,2.5))
lines(density(boot.bayes), col="red")
image.png
quantile(boot.bayes, c(0.025, 0.975)) # find credible intervals
##      2.5%     97.5% 
## -0.370165  0.331055

(Rubin (1981) introduced the Bayesian bootstrap. In contrast to the frequentist bootstrap which simulates the sampling distribution of a statistic estimating a parameter, the Bayesian bootstrap simulates the posterior distribution.

The data, X, are assumed to be independent and identically distributed (IID), and to be a representative sample of the larger (bootstrapped) population. Given that the data has N rows in one bootstrap replication, the row weights are sampled from a Dirichlet distribution with all N concentration parameters equal to 1 (a uniform distribution over an open standard N-1 simplex). The distributions of a parameter inferred from considering many samples of weights are interpretable as posterior distributions on that parameter – LaplacesDemon helpfile

Using bayesboot

This package from Rasmus Baath implements a Bayesian bootstrapping described here:

library(bayesboot)

boot.bayes2 <- bayesboot(my.sample, test.statistic)

plot(density(real.samples), ylim=c(0,2.5))
lines(density(boot.bayes2$V1), col="red")
image.png
summary(boot.bayes2)
## Bayesian bootstrap
## 
## Number of posterior draws: 4000 
## 
## Summary of the posterior (with 95% Highest Density Intervals):
##  statistic        mean        sd    hdi.low hdi.high
##         V1 -0.02366064 0.1829075 -0.3882545 0.325694
## 
## Quantiles:
##  statistic      q2.5%       q25%      median       q75%    q97.5%
##         V1 -0.3846715 -0.1452675 -0.02108793 0.09834785 0.3320632
## 
## Call:
##  bayesboot(data = my.sample, statistic = test.statistic)

To compare a statistic between two groups, we bootstrap each and compute the difference to calculate the posterior difference (eg from here):

# Heights of the last ten American presidents in cm (Kennedy to Obama).
heights <- c(183, 192, 182, 183, 177, 185, 188, 188, 182, 185)

# The heights of opponents of American presidents (first time they were elected).
# From Richard Nixon to John McCain
heights_opponents <- c(182, 180, 180, 183, 177, 173, 188, 185, 175)

# Running the Bayesian bootstrap for both datasets
b_presidents <- bayesboot(heights,           test.statistic)
b_opponents  <- bayesboot(heights_opponents, test.statistic)

# Calculating the posterior difference and converting back to a 
# bayesboot object for pretty plotting.
b_diff <- as.bayesboot(b_presidents - b_opponents)
plot(b_diff)
image.png

It seems the presidential winner tends to have more height than his opponent.

Example 1 - Obtaining a Confidence Interval

Let’s use the same data as in the first eg:

data <- list(experiment = c(27,20,21,26,27,31,24,21,20,19,23,24,28,19,24,29,18,20,17,31,20,25,28,21,27),
             control    = c(21,22,15,12,21,16,19,15,22,24,19,23,13,22,20,24,18,20))

This next resampling function selects bootstrap samples from the data and produces a population of difference of means.

resampling <- function(n, data, test.statistic) {

  size.experiment <- length(data$experiment)
  size.control    <- length(data$control)

  one.bootstrap <- function() {
    boot.experiment <- sample(data$experiment, size.experiment, replace=TRUE)
    boot.control    <- sample(data$control,    size.control,    replace=TRUE)
    test.statistic(boot.experiment, boot.control)
  }

  replicate(n, one.bootstrap())
}

Now let’s execute the bootstrap and reuse the previous present_results. Notice that now the shown Monte Carlo p-value does not make sense in this context, and should be around 50, ie, the observed difference of means should be around the median of the bootstrap empirical distribution:

n.resamplings <- 1e4
stats <- resampling(n.resamplings, data, diff.means)
present_results(stats, diff.means(data$experiment, data$control))
image.png
quantile(x=stats, probs = c(.025,.975)) # 95% confidence interval
##     2.5%    97.5% 
## 2.159944 6.660111

Concerning the confidence interval, since zero is not included, we could say that H0 - ie, the difference of means is due to chance - is not backed by evidence.

Let’s compare the bootstrap’s confidence interval with the classic t-test and the bayesian approach:

# Using the t-test should produce similar results
t.test(data$experiment, data$control)$conf.int
## [1] 1.957472 6.798084
## attr(,"conf.level")
## [1] 0.95
# Using the bayesian version of the t-test
# devtools::install_github("rasmusab/bayesian_first_aid")
library(BayesianFirstAid)
bayes.t.test(data$experiment, data$control, n.iter=1e4) 
## 
##  Bayesian estimation supersedes the t test (BEST) - two sample
## 
## data: data$experiment (n = 25) and data$control (n = 18)
## 
##   Estimates [95% credible interval]
## mean of data$experiment: 24 [22, 25]
## mean of data$control: 19 [17, 21]
## difference of the means: 4.2 [1.6, 6.8]
## sd of data$experiment: 4.2 [3.1, 5.7]
## sd of data$control: 3.8 [2.5, 5.4]
## 
## The difference of the means is greater than 0 by a probability of 0.999 
## and less than 0 by a probability of 0.001

Or using bayesboot package:

library(bayesboot)

experiment.means <- bayesboot(data$experiment, mean, R=1e4)
control.means    <- bayesboot(data$control,    mean, R=1e4)
stats            <- (experiment.means - control.means)$V1
quantile(x=stats, probs = c(.025,.975)) # 95% confidence interval
##     2.5%    97.5% 
## 2.226231 6.617550
present_results(stats, diff.means(data$experiment, data$control))
image.png

If we wish to compute the MC p-value we could consider the entire data, bootstrap it, and then split the simulated data accordingly to the sizes of the experiment and control datasets before applying the chosen test statistic:

resampling <- function(n, data, test.statistic) {

  all.data        <- c(data$experiment, data$control)
  size.all.data   <- length(all.data)
  size.experiment <- length(data$experiment)

  one.bootstrap <- function() {
    boot.all.data <- sample(all.data, size.all.data, replace=TRUE)
    test.statistic(boot.all.data[1:size.experiment],    # split bootstrap data
                   boot.all.data[(size.experiment+1):size.all.data])
  }

  replicate(n, one.bootstrap())
}

Now, the Monte Carlo p-value makes sense. The bootstrap procedure mirrors the difference of means between the observed experiment and control data:

n.resamplings <- 1e4
stats <- resampling(n.resamplings, data, diff.means)
present_results(stats, diff.means(data$experiment, data$control))
image.png

Example 2 – Pearson correlation of two samples

We wish to compute a sampling distribution of the correlation between these observed LSAT and GPA scores:

data <- list(LSAT = c(576, 635,558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594),
             GPA  = c(3.39,3.3,2.81,3.03,3.44,3.07,3.0,3.43,3.36,3.13,3.12,2.74,2.76,2.88,2.96))

The Pearson’s correlation coefficient for a sample (xi,yi),i=1…n can be calculated as follows:

rxy=1n−1∑i=1nxi−x¯¯¯sxyi−y¯¯¯sy #公式有问题

This is computed by:

# pre: x, y are samples with the same length
pearson.coor.sample <- function(x,y) {
  sum( ((x-mean(x))/sd(x)) * ((y-mean(y))/sd(y))) / (length(x)-1)
}

And this is the resampling function:

# resampling for Pearson correlation
resampling <- function(n, data, test.statistic) {

  size.sample <- length(data[[1]])

  one.bootstrap <- function() {
    # select a subset, but the original pairs must be kept together
    permutation <- sample(1:size.sample, size.sample, replace=TRUE)
    x <- data[[1]][permutation]
    y <- data[[2]][permutation]
    test.statistic(x,y)
  }

  replicate(n, one.bootstrap())
}

Let’s simulate it and then compare the results with the frequentist and bayesian alternatives:

n.resamplings <- 1e4
stats <- resampling(n.resamplings, data, pearson.coor.sample)
# again, the shown MC p-value is not a p-value, but should be around 50%
present_results(stats, pearson.coor.sample(data$LSAT, data$GPA))
image.png
mean(stats)
## [1] 0.7727508
quantile(x = stats, probs = c(.025,.975)) # 95% confidence interval
##      2.5%     97.5% 
## 0.4632874 0.9620066

# Using cor.test to find the correlation betweem paired samples
cor.test(data$LSAT, data$GPA)$conf.int
## [1] 0.4385108 0.9219648
## attr(,"conf.level")
## [1] 0.95

# The bayesian version, not surprisingly, it's closer to the resampling results
bayes.cor.test(data$LSAT, data$GPA, n.iter=n.resamplings) 
## 
##  Bayesian First Aid Pearson's Correlation Coefficient Test
## 
## data: data$LSAT and data$GPA (n = 15)
## Estimated correlation:
##   0.76 
## 95% credible interval:
##   0.46 0.96 
## The correlation is more than 0 by a probability of >0.999 
## and less than 0 by a probability of <0.001

Choosing between resampling and bootstrap tests

Here’s a quote from Ziefler’s book (page 174):

The randomization/permutation test and the bootstrap test were introduced in the previous two chapters as methods to test for group differences. Which method should be used? From a statistical theory point of view, the difference between the two methods is that the randomization method is conditioned on the marginal distribution under the null hypothesis. This means each permutation of the data will have the same marginal distribution. The bootstrap method allows the marginal distribution to vary, meaning the marginal distribution changes with each replicate data set. If repeated samples were drawn from a larger population, variation would be expected in the marginal distribution, even under the null hypothesis of no difference. This variation in the marginal distribution is not expected; however, there is only one sample from which groups are being randomly assigned so long as the null hypothesis is true. Thus the choice of method comes down to whether one should condition on the marginal distribution or not. […]

The choice of analysis method rests solely on the scope of inferences the researcher wants to make. If inferences to the larger population are to be made, then the bootstrap method should be used, as it is consistent with the idea of sample variation due to random sampling. In general, there is more variation in a test statistic due to random sampling than there is due to random assignment. That is, the standard error is larger under the bootstrap. Thus, the price a researcher pays to be able to make broader inferences is that all things being equal, the bootstrap method will generally produce a higher p-value than the randomization method.

原文链接:https://www.di.fc.ul.pt/~jpn/r/bootstrap/resamplings.html

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 214,588评论 6 496
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,456评论 3 389
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 160,146评论 0 350
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,387评论 1 288
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,481评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,510评论 1 293
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,522评论 3 414
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,296评论 0 270
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,745评论 1 307
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,039评论 2 330
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,202评论 1 343
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,901评论 5 338
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,538评论 3 322
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,165评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,415评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,081评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,085评论 2 352