Simulating the Null Hypothesis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('coffee_dataset.csv')
sample_data = full_data.sample(200)

If you wanted to know whether the average height of coffee drinkers is the same as that of non-coffee drinkers, what would the null and alternative hypotheses be? Place them in the cell below, and use your answer to answer the first quiz question below.

Since the question has no directional component, a two-sided ("not equal to") alternative seems most reasonable.

$H_0: \mu_{coff} - \mu_{no} = 0$

$H_1: \mu_{coff} - \mu_{no} \neq 0$

$\mu_{coff}$ and $\mu_{no}$ are the population mean heights for coffee drinkers and non-coffee drinkers, respectively.

If you wanted to know whether the average height of coffee drinkers is less than that of non-coffee drinkers, what would the null and alternative hypotheses be? Place them in the cell below, and use your answer to answer the second quiz question below.

In this case the question has a direction: is the average height of coffee drinkers less than that of non-coffee drinkers? Below is one way to write the null and alternative hypotheses. Because the mean for coffee drinkers is listed first, the alternative states that the difference is negative.

$H_0: \mu_{coff} - \mu_{no} \geq 0$

$H_1: \mu_{coff} - \mu_{no} < 0$

$\mu_{coff}$ and $\mu_{no}$ are the population mean heights for coffee drinkers and non-coffee drinkers, respectively.

For 10,000 iterations: bootstrap the sample data, calculate the mean height for coffee drinkers and for non-coffee drinkers, and calculate the difference in means for each bootstrap sample. At the end of the iterations you should have three arrays: one for each mean and one for the difference in means. Use the resulting sampling distributions to answer the third quiz question below.

nocoff_means, coff_means, diffs = [], [], []

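for _ in range(10000):
    # resample the 200 rows with replacement to form one bootstrap sample
    bootsamp = sample_data.sample(200, replace=True)
    # mean height within each group of the bootstrap sample
    coff_mean = bootsamp[bootsamp['drinks_coffee'] == True]['height'].mean()
    nocoff_mean = bootsamp[bootsamp['drinks_coffee'] == False]['height'].mean()
    # store both group means and their difference (coffee minus no coffee)
    coff_means.append(coff_mean)
    nocoff_means.append(nocoff_mean)
    diffs.append(coff_mean - nocoff_mean)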

np.std(nocoff_means) # the standard deviation of the sampling distribution for nocoff

np.std(coff_means) # the standard deviation of the sampling distribution for coff

np.std(diffs) # the standard deviation for the sampling distribution for difference in means

plt.hist(nocoff_means, alpha = 0.5);
plt.hist(coff_means, alpha = 0.5); # both sampling distributions look approximately normal

plt.hist(diffs, alpha = 0.5); # again normal - this is by the central limit theorem
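
The spread of the bootstrap differences can be cross-checked against the usual standard-error formula for a difference in means, which is what the central limit theorem would lead you to expect. Below is a minimal sketch of that check. It does not use coffee_dataset.csv: the `coff` and `nocoff` arrays, their sizes, and their means are made-up stand-ins, so the printed numbers say nothing about the real data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in data (NOT drawn from coffee_dataset.csv)
coff = rng.normal(68.0, 3.0, size=120)    # heights of coffee drinkers
nocoff = rng.normal(66.5, 3.0, size=80)   # heights of non-coffee drinkers

heights = np.concatenate([coff, nocoff])
drinks = np.concatenate([np.ones(coff.size, dtype=bool),
                         np.zeros(nocoff.size, dtype=bool)])

n = heights.size
n_boot = 5000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)      # resample rows with replacement
    h, d = heights[idx], drinks[idx]
    boot_diffs[i] = h[d].mean() - h[~d].mean()

# Standard deviation of the bootstrap sampling distribution of the difference
boot_se = boot_diffs.std()

# Analytic standard error: sqrt(s1^2 / n1 + s2^2 / n2)
analytic_se = np.sqrt(coff.var(ddof=1) / coff.size
                      + nocoff.var(ddof=1) / nocoff.size)

print(boot_se, analytic_se)
```

The two values should be close but not identical, since the bootstrap resamples whole rows and the group sizes vary from one bootstrap sample to the next.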

null_vals = np.random.normal(0, np.std(diffs), 10000) # 10,000 draws from a normal distribution centered on the null value (0), using the bootstrap standard deviation of the differences

plt.hist(null_vals); #Here is the sampling distribution of the difference under the null
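
One way the simulated null values might then be used is to compare them with the observed difference in sample means, once for each pair of hypotheses above. The sketch below uses made-up numbers: `obs_diff` and `boot_std` are hypothetical placeholders, not results computed from the dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

obs_diff = 1.3    # hypothetical observed difference in sample means
boot_std = 0.5    # hypothetical std of the bootstrap differences

# Draws from the sampling distribution centered on the null value of 0
null_vals = rng.normal(0, boot_std, 10000)

# Two-sided alternative (difference != 0):
# proportion of null draws at least as far from 0 as the observed difference
p_two_sided = np.mean(np.abs(null_vals) >= abs(obs_diff))

# One-sided alternative (difference < 0):
# proportion of null draws at or below the observed difference
p_less = np.mean(null_vals <= obs_diff)

print(p_two_sided, p_less)
```

With a positive observed difference, the one-sided proportion for the "less than" alternative is large, so that alternative would not be supported, while the two-sided proportion can still be small.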
