Week 2: Random Variables and Central Limit Theorem
Part 3: Central Limit Theorem in Practice - The t-distribution in Practice
Reference book: *Data Analysis for the Life Sciences*
Reference video:
Central Limit Theorem in Practice
Let’s use our data to see how well the central limit theorem approximates sample averages from our data. We will leverage our entire population dataset to compare the results we obtain by actually sampling from the distribution to what the CLT predicts.
dat <- read.csv("mice_pheno.csv") #file was previously downloaded
head(dat)
library(dplyr)
controlPopulation <- filter(dat, Sex == "F" & Diet == "chow") %>%
  select(Bodyweight) %>% unlist
hfPopulation <- filter(dat, Sex == "F" & Diet == "hf") %>%
  select(Bodyweight) %>% unlist
Compute the population standard deviations as well. We do not use the R function sd because it divides by the sample size minus 1, giving the sample estimate, and here we want the population value.
x <- controlPopulation
N <- length(x)
populationvar <- mean((x-mean(x))^2)
identical(var(x), populationvar)
# [1] FALSE
identical(var(x)*(N-1)/N, populationvar)
# [1] TRUE
So to be mathematically correct, we do not use sd or var. Instead, we use the popvar and popsd functions from rafalib:
library(rafalib)
sd_hf <- popsd(hfPopulation)
sd_control <- popsd(controlPopulation)
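If rafalib is unavailable, popsd is easy to reproduce in base R using the identity shown above; a minimal sketch (the toy vector x is made up for illustration):

```r
# Population SD without rafalib: rescale var's (N-1) denominator to N.
popsd_manual <- function(x) {
  N <- length(x)
  sqrt(var(x) * (N - 1) / N)
}
x <- c(2, 4, 6, 8)
popsd_manual(x)
# [1] 2.236068
sqrt(mean((x - mean(x))^2))  # same value, straight from the definition
# [1] 2.236068
```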
- sapply + replicate instead of a for loop
Ns <- c(3,12,25,50)
B <- 10000 #number of simulations
res <- sapply(Ns, function(n) {
  replicate(B, mean(sample(hfPopulation, n)) - mean(sample(controlPopulation, n)))
})
# For comparison, a for-loop version; it handles only one sample size (n = 3),
# so covering all four Ns would mean running it 4 times.
# Note: this overwrites the matrix res from the sapply call above,
# so rerun the sapply version before the qqplots below.
n <- 10000
res <- vector("numeric", n)
for (i in 1:n) {
  hf_sam <- sample(hfPopulation, 3)
  con_sam <- sample(controlPopulation, 3)
  res[i] <- mean(hf_sam) - mean(con_sam)
}
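The sapply + replicate pattern generalizes to any population; a minimal self-contained sketch with a simulated population (the toy numbers and variable names are assumptions for illustration only, so they do not clobber res above):

```r
set.seed(1)
pop <- rnorm(1000, mean = 10, sd = 2)   # toy population, not the mouse data
toy_Ns <- c(3, 12)
toy_res <- sapply(toy_Ns, function(n) replicate(100, mean(sample(pop, n))))
dim(toy_res)  # one column of 100 simulated averages per sample size
# [1] 100   2
```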
- qqplot
mypar(2,2)  # set up a 2x2 plotting layout
# seq(along = x) generates the index vector of x; along.with also works
for (i in seq(along = Ns)) {
  # signif rounds to the given number of significant digits
  titleavg <- signif(mean(res[,i]), 3)
  titlesd <- signif(popsd(res[,i]), 3)
  # string formatting: paste defaults to sep = " ", paste0 to sep = ""
  title <- paste("N=", Ns[i], "Avg=", titleavg, "SD=", titlesd)
  qqnorm(res[,i], main = title)
  qqline(res[,i], col = 2)
}
Ns <- c(3,12,25,50)
B <- 10000 #number of simulations
## function to compute a t-stat
computetstat <- function(n) {
  y <- sample(hfPopulation, n)
  x <- sample(controlPopulation, n)
  (mean(y) - mean(x)) / sqrt(var(y)/n + var(x)/n)
}
res <- sapply(Ns, function(n) {
  replicate(B, computetstat(n))
})
mypar(2,2)
for (i in seq(along = Ns)) {
  qqnorm(res[,i], main = Ns[i])
  qqline(res[,i], col = 2)
}
This simulation only shows that the sample sizes used here are large enough in this case; it is not a general proof.
- Exercises
# fetch the data
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filenames <- "femaleMiceWeights.csv"
if (!file.exists(filenames)) download(url, destfile=filenames)
dat <- read.csv(filenames)
> 1
set.seed(1)
n <- 100
p <- 1/6
zs <- replicate(10000, {
  x <- sample(1:6, n, replace=TRUE)
  (mean(x==6) - p) / sqrt(p*(1-p)/n)
})
qqnorm(zs)
abline(0,1)
mean(abs(zs) > 2)
# [1] 0.0431
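For comparison, the CLT predicts this proportion exactly: under a standard normal, P(|Z| > 2) is about 0.045, close to the simulated 0.0431:

```r
# Theoretical two-tail probability beyond |z| = 2 under the standard normal
2 * (1 - pnorm(2))
# [1] 0.04550026
```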
> 2
mypar(4,2)
# each row of origin holds one (p, n) configuration
origin <- matrix(c(0.5,0.5,0.01,0.01,5,30,30,100), ncol=2)
for (i in 1:4) {
  p <- origin[i,1]
  sides <- 1/p
  n <- origin[i,2]
  zs <- replicate(10000, {
    x <- sample(1:sides, n, replace=TRUE)
    (mean(x==1) - p) / sqrt(p*(1-p)/n)
  })
  title <- paste("p=", p, "N=", n)
  hist(zs)  # nclass sets the number of bins
  qqnorm(zs, main=title)
  abline(0,1)
}
# B) p=0.5 and n=30
> 3
X <- filter(dat, Diet=="chow") %>%
  select(Bodyweight) %>%
  unlist
mean(X)
# [1] 23.81333
> 4
# D) X¯ follows a normal distribution with mean µX and standard deviation σX/√12, where σX is the population standard deviation.
> 5
# 0
> 6
X <- filter(dat, Diet=="chow") %>%
  select(Bodyweight) %>%
  unlist
sd(X)
# [1] 3.022541
> 7
(1 - pnorm(5.21 / sd(X) * sqrt(12))) * 2
# [1] 2.356222e-09
> 8
Y <- filter(dat, Diet=="hf") %>% select(Bodyweight) %>% unlist
sqrt(var(X)/12 + var(Y)/12)
# [1] 1.469867
> 9
(mean(Y) - mean(X)) / sqrt(var(X)/12 + var(Y)/12)
# equivalent to
t.test(X,Y)$stat
# [1] 2.055174
> 10
# A) Normal with mean 0 and standard deviation 1
> 11
Z <- (mean(Y) - mean(X)) / sqrt(var(X)/12 + var(Y)/12)
(1 - pnorm(Z)) * 2
# [1] 0.0398622
> 12
t.test(X,Y)$p.value
# [1] 0.05299888
> 13
# B) These are two different assumptions. The t-distribution accounts for the variability introduced by estimating the standard error, and thus large values are more probable under its null distribution.
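The heavier tails of the t-distribution can be checked directly: for the same cutoff, the t-based p-value is always larger than the normal one (the df of 22 below is an assumption, 12 + 12 - 2; t.test actually uses the Welch approximation):

```r
z <- 2.0552                       # roughly the t-statistic from exercise 9
p_normal <- 2 * (1 - pnorm(z))    # normal approximation
p_t <- 2 * (1 - pt(z, df = 22))   # t-distribution, df = 22 assumed
p_t > p_normal
# [1] TRUE
```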
Appendix: the contents of the list returned by t.test:
t-tests in Practice
library(dplyr)
dat <- read.csv("femaleMiceWeights.csv")
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
diff <- mean(treatment) - mean(control)
print(diff)
# [1] 3.020833
We learned that diff, referred to as the observed effect size, is a random variable. The standard error of this random variable is the population standard deviation divided by the square root of the sample size.
sd(control)/sqrt(length(control))
Use the sample standard deviation as an estimate of the population standard deviation.
The variance of the difference of two independent random variables is the sum of their variances:
se_diff <- sqrt(
  var(treatment)/length(treatment) +
  var(control)/length(control)
)
# [1] 1.469867
Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1.
tstat <- diff / se_diff
# [1] 2.055174
# at its core this is just
(mean(treatment) - mean(control)) /
  sqrt(var(treatment)/length(treatment) +
       var(control)/length(control))
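The claim that dividing by the SE yields a random variable with SE 1 can be checked by simulation; a sketch using a toy standard-normal population (an assumption for illustration):

```r
set.seed(1)
toy_n <- 30
avgs <- replicate(5000, mean(rnorm(toy_n)))  # sample averages, SE = 1/sqrt(toy_n)
sd(avgs)                      # close to 1/sqrt(30), about 0.18
sd(avgs / (1 / sqrt(toy_n)))  # close to 1 after dividing by the SE
```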
The CLT tells us that for large sample sizes, both sample averages mean(treatment) and mean(control) are approximately normal. Statistical theory tells us that the difference of two normally distributed random variables is again normal, so the CLT tells us that tstat is approximately normal with mean 0 (under the null hypothesis) and SD 1 (we divided by its SE).
- Q: How often is a standard normal random variable as extreme as the observed tstat — that is, smaller (more negative) than -abs(tstat) or larger than abs(tstat)? We call these two regions "tails" and calculate their size:
righttail <- 1 - pnorm(abs(tstat))
lefttail <- pnorm(-abs(tstat))
pval <- lefttail + righttail
print(pval)
# [1] 0.0398622
In this case, the p-value is smaller than 0.05 and using the conventional cutoff of 0.05, we would call the difference statistically significant.
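By symmetry of the normal density the two tails are equal, so the same p-value can be written as a one-liner (tstat hard-coded here to the value from the notes, for a self-contained check):

```r
tstat <- 2.0552  # value from the notes above, hard-coded for illustration
two_tails <- pnorm(-abs(tstat)) + (1 - pnorm(abs(tstat)))
shortcut <- 2 * (1 - pnorm(abs(tstat)))
isTRUE(all.equal(two_tails, shortcut))
# [1] TRUE
```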
The t-distribution in Practice
If the distribution of the population is normal, then we can work out the exact distribution of the t-statistic without the need for the CLT.
library(rafalib)
mypar(1,2)
qqnorm(treatment)
qqline(treatment,col=2)
qqnorm(control)
qqline(control,col=2)
If we use this approximation, then statistical theory tells us that the distribution of the random variable tstat follows a t-distribution.
The t-distribution has a location parameter like the normal and another parameter called degrees of freedom.
The t-distribution is a family of curves whose shape depends on n (more precisely, on the degrees of freedom df). The smaller df is, the flatter the t curve; the larger df is, the closer the t curve gets to the standard normal (z) distribution.
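That convergence to the normal shows up directly in the quantiles: as df grows, the 97.5% t quantile shrinks toward the normal cutoff of about 1.96:

```r
qt(0.975, df = 3)    # heavy tails: well above 1.96
# [1] 3.182446
qt(0.975, df = 100)  # already close to the normal quantile
qnorm(0.975)
# [1] 1.959964
```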
The CLT approximation treats the denominator of tstat as practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is itself a random variable. The smaller the sample size, the more the denominator varies.
The test based on the CLT approximation is more likely to incorrectly reject the null hypothesis (false positive), while the t-distribution is more likely to incorrectly accept the null hypothesis (false negative).