Statistics Fundamentals in Python: Univariate Discrete Distributions

Framework


Univariate distributions can be divided into discrete distributions, whose observations take on only integer values (e.g., the number of machines), and continuous distributions, whose observations are real-valued (e.g., height and weight).

Characterizing a Distribution

Distribution Center

Common metrics of the distribution center include the mean, median, and mode.

a) Mean
The mean can be computed by calling np.mean(). One thing to notice is that real-life datasets often contain missing values. If we call np.mean() directly on such data, the result will be nan. In this case, an alternative is np.nanmean().

# create a numpy array
import numpy as np

x = np.arange(10)
np.mean(x)
>4.5
# now add some missing values into the array
xWithNan = np.hstack((x,np.nan,np.nan,np.nan))
# call np.nanmean to address this situation
np.nanmean(xWithNan)
>4.5

b) Median
The median is the value that lies in the middle when the data are ranked in ascending order. In contrast to the mean, the median is not affected by outliers.

x = np.array([2,3,1,4,6,5,7])
np.median(x)
>4.0
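Since the median depends only on rank order, a single extreme value barely moves it, while the mean shifts substantially. A minimal sketch (the value 100 is an arbitrary outlier added for illustration):

```python
import numpy as np

x = np.array([2, 3, 1, 4, 6, 5, 7])
x_outlier = np.append(x, 100)   # append a single extreme value

print(np.mean(x), np.median(x))                  # 4.0 4.0
print(np.mean(x_outlier), np.median(x_outlier))  # 16.0 4.5
```

The mean quadruples, while the median only moves from 4.0 to 4.5.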

c) Mode
The mode is the most frequently occurring value in a distribution. To compute the mode, we usually call stats.mode().

from scipy import stats
data = [1,3,4,4,4,7,7,8]
stats.mode(data)
>ModeResult(mode=array([4]), count=array([3]))

Quantifying Variability

a) Range
In python, we can call np.ptp() ("peak to peak") to compute the range of the data. The range is easily affected by outliers, so it is not a robust measure of data variability.

x = np.arange(10)
np.ptp(x)
>9

b) Percentiles
Percentiles, in python, correspond to the cumulative distribution function (CDF). The CDF is the integral of the probability density function (PDF) and thus specifies the percentage of data that lie below a given value. With the CDF, we can simplify the calculation of how likely it is to find a value x within the range a to b: the probability of finding a value between a and b is the integral of the PDF over that range, and can therefore be found as the difference of the corresponding CDF values: P(a < x \leq b) = CDF(b) - CDF(a).

For example, suppose the height of American men follows a normal distribution with mean 170 cm and standard deviation 4 cm. We want to find the probability that a randomly selected American man is between 170 cm and 175 cm tall.

from scipy import stats
# create a norm distribution
mu,sigma = 170,4
normDist = stats.norm(mu,sigma)

# the probability that a man's height is between 170 and 175 is given by the difference of the corresponding CDF values
normDist.cdf(175)-normDist.cdf(170)
>0.39435022633314465

There are several important and frequently encountered centiles:

  • To find the range which includes 95% of the data, one has to find the 2.5th and 97.5th percentiles of the sample data.
  • The 50th percentile is the median.
  • Also important are the quartiles, i.e., the 25th and the 75th percentiles. The difference between them is called the inter-quartile range (IQR).
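These centiles can all be computed with np.percentile (the sample array below is made up for illustration):

```python
import numpy as np

x = np.arange(1, 101)                        # 1, 2, ..., 100
# median and quartiles
q25, q50, q75 = np.percentile(x, [25, 50, 75])
iqr = q75 - q25                              # inter-quartile range
# range containing the central 95% of the data
low, high = np.percentile(x, [2.5, 97.5])
print(q50, iqr)
```

Note that np.percentile(x, 50) agrees with np.median(x).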

c) Standard Deviation and Variance
We usually need to distinguish between the sample variance, i.e., the variance of the sampled data, and the population variance, i.e., the variance of the full population. The variance is given by the following formula, where \bar{x} is the sample mean:
var = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n}
This formula systematically underestimates the population variance and is therefore referred to as a biased estimator of the population variance. The reason for this bias is that the sample mean is always chosen in a way that minimizes the variance of the given sample data, so the formula underestimates the variance of the population.
The solution to this problem is to use the following formula to compute the sample variance:
var = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}
This gives us an unbiased estimator of the population variance.
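The bias of the n-denominator formula can be illustrated by a quick simulation: averaged over many samples, the ddof=0 variance tends toward (n-1)/n times the true variance, while the ddof=1 version tends toward the true variance (sample size, number of trials, and seed below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
# draw many samples of size n from a population with variance 1
samples = rng.normal(0, 1, size=(trials, n))

biased = samples.var(axis=1, ddof=0).mean()    # tends toward (n-1)/n = 0.8
unbiased = samples.var(axis=1, ddof=1).mean()  # tends toward 1.0
print(biased, unbiased)
```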

In python, we call np.var to compute the variance and np.std to compute the standard deviation. Note: to get the sample (unbiased) standard deviation/variance, we have to set the delta degrees of freedom (ddof) to 1; the default is 0.

data = np.arange(5,18)

# compute the sample std
np.std(data,ddof=1)
>3.8944404818493075

# biased (population) estimator of the standard deviation
np.std(data,ddof=0)
>3.7416573867739413

d) Standard Error
The standard error is a statistical term that measures, using the standard deviation, how accurately a sample distribution represents a population. In statistics, the sample mean deviates from the actual mean of the population, and this deviation is called the standard error of the mean (SEM).

How can we understand the standard error?
The standard error refers to the standard deviation of a sample statistic, such as the mean or median. For example, the standard error of the mean refers to the standard deviation of the distribution of sample means taken from a population. So the smaller the standard error, the more representative the sample is of the population.
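This interpretation can be checked by simulation: the standard deviation of many sample means comes out close to sigma/sqrt(n) (the population parameters, number of samples, and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, n = 1.0, 25
# 10,000 sample means, each computed from a sample of size n
means = rng.normal(0, sigma, size=(10_000, n)).mean(axis=1)

print(means.std(ddof=1))    # empirical SEM, close to 0.2
print(sigma / np.sqrt(n))   # theoretical SEM = 0.2
```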

Here's the formula for calculating it:
SEM = \frac{s}{\sqrt{n}} = \sqrt{\frac{\sum_{i=1}^n(x_i-\text{sample mean})^2}{n-1}}\frac{1}{\sqrt{n}}

import numpy as np
x = np.arange(7,14)
# get the sample standard deviation
s = np.std(x,ddof=1)
# compute the standard error of the mean
s/np.sqrt(len(x))
>0.816496580927726

We can also compute it by calling stats.sem:

from scipy import stats
stats.sem(x)
>0.816496580927726

e) Confidence Interval
A confidence interval refers to a range of values that, with a specified probability, contains the estimated parameter.

ci = \bar{x} \pm z_{1-\alpha/2} \cdot std

  • where std is the standard deviation
  • z_{1-\alpha/2} is the percentile point function (PPF) of the standard normal distribution, evaluated at 1-\alpha/2.

For example, for a two-sided 95% confidence interval, we evaluate the PPF of the standard normal distribution at 0.025 and 0.975 to get the lower and upper limits of the confidence interval.

Suppose that the weight of men follows a normal distribution with mean 120 and standard deviation 10. If we randomly select one man from the population, what is the confidence interval at a 5% significance level?

from scipy import stats
mu,sigma = 120,10
normDist = stats.norm(mu,sigma)
alpha = 0.05
normDist.ppf([alpha/2,1-alpha/2])
>array([100.40036015, 139.59963985])

Notes

  • To calculate the confidence interval for the mean value, the standard deviation has to be replaced by the standard error.
  • If the distribution of the dataset is skewed, the CI estimated by the equation above is not correct.
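Following the first note, here is a sketch of a 95% confidence interval for the mean, where the standard error replaces the standard deviation (the data array is made up for illustration):

```python
import numpy as np
from scipy import stats

data = np.arange(5, 18)
alpha = 0.05
mean, sem = np.mean(data), stats.sem(data)

# 95% CI for the mean: mean +/- z_{0.975} * SEM
ci = stats.norm.interval(1 - alpha, loc=mean, scale=sem)
print(ci)
```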

Parameters Describing the Form of a Distribution

In scipy.stats, continuous distributions are determined by location and scale. For example, for the normal distribution, location and scale are given by the mean and standard deviation of the distribution; for the uniform distribution, they are given by the start and width of the range over which the distribution is nonzero.

a) Location
A location parameter determines the position of a distribution, such as the mean, median, and mode.

b) Scale
The scale parameter describes the width of a probability distribution. Therefore, if the scale parameter is large, then the distribution will be more spread out; if scale is small then it will be more concentrated.
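In scipy.stats, these two parameters appear directly as the loc and scale keyword arguments. A minimal sketch for the normal distribution (the parameter values are arbitrary):

```python
from scipy import stats

# for the normal distribution: loc = mean, scale = standard deviation
narrow = stats.norm(loc=0, scale=1)
wide = stats.norm(loc=0, scale=3)

# the larger scale spreads the probability mass out:
# less mass falls inside the same interval around the center
print(narrow.cdf(1) - narrow.cdf(-1))  # ~0.68
print(wide.cdf(1) - wide.cdf(-1))      # ~0.26
```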

c) Shape Parameters
Across all distributions, only two shape parameters are used: skewness and kurtosis.

Skewness
Skewness measures a dataset's asymmetry relative to the normal distribution. If the curve is shifted to the left or right, it is said to be skewed. A standard normal distribution has a skewness of zero.

Takeaways:

  • Skewness, in statistics, measures the asymmetry observed in a certain probability distribution.
  • Distributions can exhibit right (positive) or left (negative) skewness; the normal distribution has a skewness of zero.
  • Investors consider the skewness of a return distribution because it, like kurtosis, takes the extreme values of the dataset into account rather than just the average.

Skewed distributions can be classified as positive skewness and negative skewness.

  • Positive skewness. Positive skewness means the tail on the right side of the distribution is longer or fatter. A positively skewed dataset is characterized by a mean greater than the median.
  • Negative skewness. Negative skewness means the tail on the left side of the distribution is longer or fatter. A negatively skewed dataset is characterized by a mean smaller than the median.
    The following picture shows the difference:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def skewness(ax):
    """Normal distribution, positive skewness and negative skewness"""
    t = np.linspace(-10, 10, 201) # generate the desired x-values
    normal = stats.norm.pdf(t, 0, 1.6)
    chi2 = stats.chi2.pdf(t, 3)        # right-skewed
    minus_chi2 = stats.chi2.pdf(-t, 3) # mirrored: left-skewed

    ax.plot(t, normal, '--', label='normal')
    ax.plot(t, chi2, label='positive skew')
    ax.plot(t, minus_chi2, label='negative skew')
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
    ax.set_xlim([-8,8])

    ax.margins(0,0)
    ax.legend()

fig,ax = plt.subplots()
skewness(ax)

Measuring Skewness
There are several coefficients of skewness for a distribution: Pearson's first coefficient of skewness, also called Pearson mode skewness, and Pearson's second coefficient of skewness, also called Pearson median skewness.

  • a) Pearson's Mode Skewness
    Sk_1 = \frac{\bar{x} - M_0}{s}

where \bar{x} is the mean, M_0 is the mode, and s is the standard deviation

  • b) Pearson's Median Skewness
    The second coefficient of skewness subtracts the median M_d from the mean, multiplies the difference by 3, and divides the product by the standard deviation.
    Sk_2 = \frac{3(\bar{x} - M_d)}{s}
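Both Pearson coefficients can be computed directly from their definitions (the data list is made up for illustration):

```python
import numpy as np
from collections import Counter

data = np.array([1, 3, 4, 4, 4, 7, 7, 8])
mean, s = np.mean(data), np.std(data, ddof=1)
mode = Counter(data.tolist()).most_common(1)[0][0]   # most frequent value: 4
median = np.median(data)

sk1 = (mean - mode) / s         # Pearson's mode skewness
sk2 = 3 * (mean - median) / s   # Pearson's median skewness
print(sk1, sk2)
```

Both come out positive here, consistent with the mean (4.75) exceeding the median (4).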

Compute Skew

from scipy.stats import skew
x = np.linspace(-3,3,200)
skew(x,bias=True)
>-3.2329954388862355e-16

What does skewness tell us? What message can we get from a skewed dataset?
Contextually, skewness has some implications in business. Investors usually examine skewness when judging a return distribution because it, like kurtosis, considers not only the average return but also the extreme values of the distribution. Short- and medium-term investors especially focus on the skewness of a distribution because they tend to make profits in the short term and are less likely to hold a position long enough to be confident that the average works itself out.

Also, if we aim to build a classification model on a skewed dataset, accuracy is not an appropriate metric for model assessment because the result will be biased. Instead, we need to analyze the confusion matrix of the classifier.

Kurtosis
Kurtosis measures extreme values in either tail of a distribution; we can also define it as the tail-heaviness of a distribution. Distributions with large kurtosis have data exceeding the tails of the normal distribution, while distributions with small kurtosis have data that are generally less extreme than the tails of the normal distribution.

Type of Kurtosis
There are three categories of kurtosis that a set of data can display. All measures of kurtosis are compared to the standard normal distribution, whose (Pearson) kurtosis is 3. Distributions with negative or positive excess kurtosis are called platykurtic or leptokurtic distributions, respectively.
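Note that scipy's kurtosis returns the excess (Fisher) kurtosis by default, so a normal sample gives roughly 0; passing fisher=False gives the Pearson definition, roughly 3 (sample size and seed below are arbitrary):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

print(kurtosis(x))                 # Fisher (excess) kurtosis, ~0
print(kurtosis(x, fisher=False))   # Pearson kurtosis, ~3
```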


What can kurtosis tell us?
For investors, high kurtosis of a return distribution implies that they may experience occasional extreme returns, more extreme than the three standard deviations from the mean predicted by the normal distribution. This phenomenon is known as kurtosis risk.

Compute Kurtosis
To compute the kurtosis, we can import kurtosis from scipy.stats.

from scipy.stats import kurtosis

x = np.linspace(-3,3,200)
# compute the excess kurtosis (Fisher definition, the scipy default)
kurtosis(x,bias=True)
>-1.20

Summary

So we know that location and scale can determine a distribution. Typical location parameters include the mean, median, and mode; the scale, e.g., the standard deviation for a normal distribution, describes the width of a distribution.
All distributions share two shape parameters, skewness and kurtosis. Skewness measures the asymmetry observed in a probability distribution: if the tail on the right side of the distribution is longer, the distribution is positively skewed; if the tail on the left side is longer, the opposite is the case.
With these fundamentals of distributions, we can move on to a series of frequently encountered distributions in statistics. They are mainly categorized into discrete distributions and continuous distributions.

Discrete Distributions

In python, the most elegant way of working with distributions, either discrete or continuous, is a two-step procedure:

  • In the first step, we create a distribution, e.g., a normal distribution (stats.norm()).
  • Then decide which function of this distribution we want to use and calculate its value for the desired x-input.
    Example: U.S. men's height follows a normal distribution with mean 168 cm and standard deviation 10 cm. What is the chance that a man is taller than 165 cm?
# create a normal distribution
from scipy import stats
mu,std = 168,10
normDist = stats.norm(mu,std)
# decide which function to use based on the requirement.
# Here, we want the probability that a man is taller than 165 cm,
# so we use the survival function sf(x) = 1 - cdf(x).
normDist.sf(165)
>0.6179114221889526

Bernoulli Distribution

The Bernoulli distribution is the simplest case of a univariate distribution and also the basis of the binomial distribution. It has only two states, labelled n=0 and n=1, in which n=1 occurs with probability p and n=0 occurs with probability q=1-p. So the probability mass function can be written as:
p(n) = \begin{cases} 1-p&\text{for } n=0\\ p& \text{for } n = 1 \end{cases}

which can also be written as:

p(n) = p^n (1-p)^{1-n}

Here, we can easily see that the probability of n=0 is immediately determined once we know the probability of the event n=1. So there is only one parameter, p.

Exercise for Bernoulli Distribution

Let's directly start with a question about this distribution:
If we flip a coin, what is the probability that the coin lands heads? In discrete distributions, we can call the probability mass function to compute the probability that the coin lands heads in one experiment.

# create a bernoulli distribution
from scipy import stats
p = 0.5
berDist = stats.bernoulli(p)

# compute the probability of heads (n=1)
berDist.pmf(1)
>0.5

# simulate 10 Bernoulli trials
bnumbers = berDist.rvs(10)
np.where(bnumbers>0,"head","tail")
>array(['head', 'head', 'head', 'head', 'tail', 'tail', 'tail', 'tail',
       'head', 'head'], dtype='<U4')

Binomial Distribution

The binomial distribution builds directly on the Bernoulli distribution: we conduct the Bernoulli experiment multiple times and ask how many times the event occurs. For example, if we flip a coin 10 times and want to know how many times the coin lands heads, the case is a binomial distribution.

To better understand the distribution, there are several typical questions:

  • Out of ten tosses, how many times will this coin land heads?
  • From the children born in a given hospital on a given day, how many of them will
    be girls?
  • How many students in a given classroom will have green eyes?
  • How many mosquitoes, out of a swarm, will die when sprayed with insecticide?

We conduct n repeated experiments where the probability of success is given by the parameter p and add up the number of successes. This number of successes is represented by the random variable X. The value of X is then between 0 and n.

When a variable X has a binomial distribution with parameters n and p, we write X~B(n,p). The probability mass function is given by:

p(k) = \begin{cases} {n\choose k}p^{k}(1-p)^{n-k} & 0\leq k \leq n \\ 0 & \text{otherwise} \end{cases}
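As a sanity check, the pmf formula can be evaluated by hand and compared with scipy (the values of n, p, and k below are arbitrary):

```python
from math import comb
from scipy import stats

n, p, k = 10, 0.5, 3
# binomial coefficient times success/failure probabilities
manual = comb(n, k) * p**k * (1 - p)**(n - k)   # 120 / 1024 = 0.1171875
print(manual, stats.binom.pmf(k, n, p))
```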

Exercise for Binomial Distribution

  • Out of ten tosses, suppose we want to compute the probability that the coin lands heads zero times, once, twice, three times, ..., ten times:
# create a binomial distribution
from scipy import stats
n,p = 10,0.5
biDist  = stats.binom(n,p)
x = np.arange(0,11)
biDist.pmf(x)
>array([0.00097656, 0.00976563, 0.04394531, 0.1171875 , 0.20507813,
       0.24609375, 0.20507813, 0.1171875 , 0.04394531, 0.00976563,
       0.00097656])
  • Exercise 2


    (exercise statement given in an image)
# create binom distribution
n,p = 10,0.262
binomDist = stats.binom(n,p)
# compute the probability using the pmf function
binomDist.pmf(8)
>0.0005441712184877579
  • Exercise 3
Suppose we have a board game that depends on the roll of a die and attaches special importance to rolling a 6. In a particular game, the die is rolled 235 times, and 6 comes up 51 times. If the die is fair, we would expect 6 to come up 235/6 ≈ 39.17 times. Is the proportion of 6's significantly higher than would be expected by chance, under the null hypothesis of a fair die?

Solution

  • The probability of rolling a 6 is the same in every experiment, 1/6, and we conduct n repeated experiments, so this is a case of the binomial distribution.
  • The experiment has an observed result, 51 sixes. Is this a statistically significant result? Being significant means the result is unlikely to be attained by chance. To confirm this, we need the p-value, i.e., the probability that the number of sixes is greater than or equal to 51.
  • Comparing the p-value to the significance level (we set 0.05 here), we can judge whether this is a fair die.
# first create a binomial distribution
num,p = 235, 1/6
binomDist = stats.binom(num,p)

# compute the probability that the number of sixes is greater than or equal to 51
# (equivalently: binomDist.pmf(np.arange(51,236)).sum())
alpha = 0.05
pval = 1-binomDist.cdf(51)+binomDist.pmf(51)
if pval<alpha:
    print("This is not a fair die")
else:
    print("We cannot reject the null hypothesis that the die is fair")
>This is not a fair die

So a fair die is unlikely to produce as many sixes as that. We can also get the p-value directly by calling stats.binom_test:

stats.binom_test(51,n=235,p=1/6,alternative="greater")
>0.02654424571169947
  • Exercise 4
A car manufacturer claims that no more than 10% of their cars are unsafe. 15 cars are inspected for safety and 3 are found to be unsafe. Test the manufacturer's claim:
from scipy import stats
n,p = 15,0.1
binomDist  = stats.binom(n,p)
pval = binomDist.sf(3)+binomDist.pmf(3)
if pval<0.05:
    print("The manufacturer's claim is incorrect")
else:
    print("The null hypothesis cannot be rejected at the 5% level of significance \
because the returned p-value is greater than the critical value of 5%. \
This means that 3 is within the car manufacturer's expectation.")

>The null hypothesis cannot be rejected at the 5% level of significance 
because the returned p-value is greater than the critical value of 5%. 
This means that 3 is within the car manufacturer's expectation

Poisson Distribution


The Poisson distribution is very similar to the binomial distribution; both examine the number of times an event occurs. The difference is subtle: whereas the binomial looks at how many successes you register over a fixed total number of trials, the Poisson distribution measures how many times a discrete event occurs over a continuous interval of space or time.

Typical questions for the Poisson distribution include:

  • How many children will be delivered at the hospital today?
  • How many products will I sell after airing a new television commercial?
  • How many mosquito bites did you get today after having sprayed with insecticide?
  • How many defects will there be per 100m of rope sold?
  • How many calls received over a fixed period of time?

For the Poisson distribution, the only parameter is not the number of events that occur but the average (expected) number of events within our experiment, \lambda:

P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}
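As with the binomial pmf, the formula can be checked by hand against scipy (the values of lambda and k below are arbitrary):

```python
from math import exp, factorial
from scipy import stats

lam, k = 15, 10
# evaluate the Poisson pmf directly from its definition
manual = exp(-lam) * lam**k / factorial(k)
print(manual, stats.poisson.pmf(k, lam))
```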

Exercise

Consider that the number of visits on a web page is known to follow a Poisson distribution with mean 15 visits per hour. Calculate the probability of getting 10 or fewer visits per hour.

from scipy import stats

lam = 15
poissonDist = stats.poisson(lam)

poissonDist.cdf(10)
>0.11846441152901499
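A closing sketch of the connection between the two distributions: for large n and small p, the binomial distribution with lambda = np is closely approximated by the Poisson distribution (the parameters below are arbitrary):

```python
import numpy as np
from scipy import stats

n, p = 1000, 0.015     # large n, small p
lam = n * p            # lambda = np = 15
k = np.arange(0, 31)

# compare the two pmfs over the bulk of the support
diff = np.abs(stats.binom.pmf(k, n, p) - stats.poisson.pmf(k, lam))
print(diff.max())      # the two pmfs nearly coincide
```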