Statistics Fundamentals in Python: Univariate Discrete Distributions

Framework


Univariate distributions can be divided into discrete distributions, whose observations take on only integer values (e.g., the number of machines), and continuous distributions, whose observations are real-valued (e.g., height and weight).

Characterizing a Distribution

Distribution Center

Common metrics of the distribution center include the mean, median, and mode.

a) Mean
The mean can be computed by calling np.mean(). One thing to notice is that real-life datasets often contain missing values. If we call np.mean() directly on such data, the result will be nan. In this case, an alternative is np.nanmean().

# create a numpy array
import numpy as np

x = np.arange(10)
np.mean(x)
>4.5
# now add some missing values into the array
xWithNan = np.hstack((x,np.nan,np.nan,np.nan))
# call np.nanmean to address this situation
np.nanmean(xWithNan)
>4.5

b) Median
The median is the value that lies in the middle when the data are ranked in ascending order. In contrast to the mean, the median is not affected by outliers.

x = np.array([2,3,1,4,6,5,7])
np.median(x)
>4.0
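Since the median depends only on rank order, a single extreme value barely moves it, while the mean shifts substantially. A minimal sketch (the value 100 is an arbitrary outlier added for illustration):

```python
import numpy as np

x = np.array([2, 3, 1, 4, 6, 5, 7])
x_outlier = np.append(x, 100)   # append a single extreme value

print(np.mean(x), np.median(x))                  # 4.0 4.0
print(np.mean(x_outlier), np.median(x_outlier))  # 16.0 4.5
```

The mean quadruples, while the median only moves from 4.0 to 4.5.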

c) Mode
The mode is the most frequently occurring value in a distribution. To compute the mode, we usually call stats.mode().

from scipy import stats
data = [1,3,4,4,4,7,7,8]
stats.mode(data)
>ModeResult(mode=array([4]), count=array([3]))

Quantifying Variability

a) Range
In python, we can call np.ptp() ("peak to peak") to compute the range of the data. The range is easily affected by outliers, so it is not a robust measure of data variability.

x = np.arange(10)
np.ptp(x)
>9

b) Percentiles
Percentiles, in python, correspond to the cumulative distribution function (CDF). The CDF is the integral of the probability density function (PDF) and thus specifies the percentage of data that lie below a given value. With the CDF, we can simplify the calculation of how likely it is to find a value x within the range a to b: the probability of finding a value between a and b is the integral of the PDF over that range, and can therefore be found as the difference of the corresponding CDF values: P(a < x \leq b) = CDF(b) - CDF(a).

For example, suppose the height of American men follows a normal distribution with mean 170 cm and standard deviation 4 cm. We want to find the probability that a randomly selected American man is between 170 cm and 175 cm tall.

from scipy import stats
# create a norm distribution
mu,sigma = 170,4
normDist = stats.norm(mu,sigma)

# the probability that a man's height is between 170 and 175 is given by the difference of the corresponding CDF values
normDist.cdf(175)-normDist.cdf(170)
>0.39435022633314465

There are several important and frequently encountered centiles:

  • To find the range which includes 95% of the data, one has to find the 2.5th and 97.5th percentiles of the sample data.
  • The 50th percentile is the median.
  • Also important are the quartiles, i.e., the 25th and the 75th percentiles. The difference between them is called the inter-quartile range (IQR).
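These centiles can all be computed with np.percentile (the sample array below is made up for illustration):

```python
import numpy as np

x = np.arange(1, 101)                        # 1, 2, ..., 100
# median and quartiles
q25, q50, q75 = np.percentile(x, [25, 50, 75])
iqr = q75 - q25                              # inter-quartile range
# range containing the central 95% of the data
low, high = np.percentile(x, [2.5, 97.5])
print(q50, iqr)
```

Note that np.percentile(x, 50) agrees with np.median(x).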

c) Standard Deviation and Variance
We usually need to distinguish between the sample variance, i.e., the variance of the sampled data, and the population variance, i.e., the variance of the full population. The variance is given by the following formula, where \bar{x} is the sample mean:
var = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n}
This formula systematically underestimates the population variance and is therefore referred to as a biased estimator of the population variance. The reason for this bias is that the sample mean is always chosen in a way that minimizes the variance of the given sample data, so the formula underestimates the variance of the population.
The solution to this problem is to use the following formula to compute the sample variance:
var = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}
This gives us an unbiased estimator of the population variance.
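The bias of the n-denominator formula can be illustrated by a quick simulation: averaged over many samples, the ddof=0 variance tends toward (n-1)/n times the true variance, while the ddof=1 version tends toward the true variance (sample size, number of trials, and seed below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
# draw many samples of size n from a population with variance 1
samples = rng.normal(0, 1, size=(trials, n))

biased = samples.var(axis=1, ddof=0).mean()    # tends toward (n-1)/n = 0.8
unbiased = samples.var(axis=1, ddof=1).mean()  # tends toward 1.0
print(biased, unbiased)
```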

In python, we call np.var to compute the variance and np.std to compute the standard deviation. Note: to get the sample (unbiased) standard deviation/variance, we have to set the delta degrees of freedom (ddof) to 1; the default is 0.

data = np.arange(5,18)

# compute the sample std
np.std(data,ddof=1)
>3.8944404818493075

# biased (population) estimator of the standard deviation
np.std(data,ddof=0)
>3.7416573867739413

d) Standard Error
The standard error is a statistical term that measures, using the standard deviation, how accurately a sample distribution represents a population. In statistics, the sample mean deviates from the actual mean of the population, and this deviation is called the standard error of the mean (SEM).

How can we understand the standard error?
The standard error refers to the standard deviation of a sample statistic, such as the mean or median. For example, the standard error of the mean refers to the standard deviation of the distribution of sample means taken from a population. So the smaller the standard error, the more representative the sample is of the population.
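This interpretation can be checked by simulation: the standard deviation of many sample means comes out close to sigma/sqrt(n) (the population parameters, number of samples, and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, n = 1.0, 25
# 10,000 sample means, each computed from a sample of size n
means = rng.normal(0, sigma, size=(10_000, n)).mean(axis=1)

print(means.std(ddof=1))    # empirical SEM, close to 0.2
print(sigma / np.sqrt(n))   # theoretical SEM = 0.2
```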

Here's the formula for calculating it:
SEM = \frac{s}{\sqrt{n}} = \sqrt{\frac{\sum_{i=1}^n(x_i-\text{sample mean})^2}{n-1}}\frac{1}{\sqrt{n}}

import numpy as np
x = np.arange(7,14)
# get the sample standard deviation
s = np.std(x,ddof=1)
# compute the standard error of the mean
s/np.sqrt(len(x))
>0.816496580927726

We can also compute it by calling stats.sem:

from scipy import stats
stats.sem(x)
>0.816496580927726

e) Confidence Interval
A confidence interval refers to a range of values that, with a specified probability, contains the estimated parameter.

ci = \bar{x} \pm z_{1-\alpha/2} \cdot std

  • where std is the standard deviation
  • z_{1-\alpha/2} is the percentile point function (PPF) of the standard normal distribution, evaluated at 1-\alpha/2.

For example, for a two-sided 95% confidence interval, we evaluate the PPF of the standard normal distribution at 0.025 and 0.975 to get the lower and upper limits of the confidence interval.

Suppose that the weight of men follows a normal distribution with mean 120 and standard deviation 10. If we randomly select one man from the population, what is the confidence interval at a 5% significance level?

from scipy import stats
mu,sigma = 120,10
normDist = stats.norm(mu,sigma)
alpha = 0.05
normDist.ppf([alpha/2,1-alpha/2])
>array([100.40036015, 139.59963985])

Notes

  • To calculate the confidence interval for the mean value, the standard deviation has to be replaced by the standard error.
  • If the distribution of the dataset is skewed, the CI estimated by the equation above is not correct.
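Following the first note, here is a sketch of a 95% confidence interval for the mean, where the standard error replaces the standard deviation (the data array is made up for illustration):

```python
import numpy as np
from scipy import stats

data = np.arange(5, 18)
alpha = 0.05
mean, sem = np.mean(data), stats.sem(data)

# 95% CI for the mean: mean +/- z_{0.975} * SEM
ci = stats.norm.interval(1 - alpha, loc=mean, scale=sem)
print(ci)
```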

Parameters Describing the Form of a Distribution

In scipy.stats, continuous distributions are determined by location and scale. For example, for the normal distribution, location and scale are given by the mean and standard deviation of the distribution; for the uniform distribution, they are given by the start and width of the range over which the distribution is nonzero.

a) Location
A location parameter determines the position of a distribution, such as the mean, median, and mode.

b) Scale
The scale parameter describes the width of a probability distribution. Therefore, if the scale parameter is large, then the distribution will be more spread out; if scale is small then it will be more concentrated.
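In scipy.stats, these two parameters appear directly as the loc and scale keyword arguments. A minimal sketch for the normal distribution (the parameter values are arbitrary):

```python
from scipy import stats

# for the normal distribution: loc = mean, scale = standard deviation
narrow = stats.norm(loc=0, scale=1)
wide = stats.norm(loc=0, scale=3)

# the larger scale spreads the probability mass out:
# less mass falls inside the same interval around the center
print(narrow.cdf(1) - narrow.cdf(-1))  # ~0.68
print(wide.cdf(1) - wide.cdf(-1))      # ~0.26
```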

c) Shape Parameters
Across all distributions, only two shape parameters are used: skewness and kurtosis.

Skewness
Skewness measures a dataset's asymmetry relative to the normal distribution. If the curve is shifted to the left or right, it is said to be skewed. A standard normal distribution has a skewness of zero.

Takeaways:

  • Skewness, in statistics, measures the asymmetry observed in a certain probability distribution.
  • Distributions can exhibit right (positive) or left (negative) skewness; the normal distribution has a skewness of zero.
  • Investors consider the skewness of a return distribution because it, like kurtosis, takes the extreme values of the dataset into account rather than just the average.

Skewed distributions can be classified as positive skewness and negative skewness.

  • Positive skewness. Positive skewness means the tail on the right side of the distribution is longer or fatter. A positively skewed dataset is characterized by a mean greater than the median.
  • Negative skewness. Negative skewness means the tail on the left side of the distribution is longer or fatter. A negatively skewed dataset is characterized by a mean smaller than the median.
    The following picture shows the difference:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def skewness(ax):
    """Normal distribution, positive skewness and negative skewness"""
    t = np.linspace(-10, 10, 201) # generate the desired x-values
    normal = stats.norm.pdf(t, 0, 1.6)
    chi2 = stats.chi2.pdf(t, 3)        # right-skewed
    minus_chi2 = stats.chi2.pdf(-t, 3) # mirrored: left-skewed

    ax.plot(t, normal, '--', label='normal')
    ax.plot(t, chi2, label='positive skew')
    ax.plot(t, minus_chi2, label='negative skew')
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
    ax.set_xlim([-8,8])

    ax.margins(0,0)
    ax.legend()

fig,ax = plt.subplots()
skewness(ax)

Measuring Skewness
There are several coefficients of skewness for a distribution: Pearson's first coefficient of skewness, also called Pearson mode skewness, and Pearson's second coefficient of skewness, also called Pearson median skewness.

  • a) Pearson's Mode Skewness
    Sk_1 = \frac{\bar{x} - M_0}{s}

where \bar{x} is the mean, M_0 is the mode, and s is the standard deviation

  • b) Pearson's Median Skewness
    The second coefficient of skewness subtracts the median M_d from the mean, multiplies the difference by 3, and divides the product by the standard deviation.
    Sk_2 = \frac{3(\bar{x} - M_d)}{s}
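Both Pearson coefficients can be computed directly from their definitions (the data list is made up for illustration):

```python
import numpy as np
from collections import Counter

data = np.array([1, 3, 4, 4, 4, 7, 7, 8])
mean, s = np.mean(data), np.std(data, ddof=1)
mode = Counter(data.tolist()).most_common(1)[0][0]   # most frequent value: 4
median = np.median(data)

sk1 = (mean - mode) / s         # Pearson's mode skewness
sk2 = 3 * (mean - median) / s   # Pearson's median skewness
print(sk1, sk2)
```

Both come out positive here, consistent with the mean (4.75) exceeding the median (4).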

Compute Skew

from scipy.stats import skew
x = np.linspace(-3,3,200)
skew(x,bias=True)
>-3.2329954388862355e-16

What does skewness tell us? What message can we get from a skewed dataset?
Contextually, skewness has some implications in business. Investors usually examine skewness when judging a return distribution because it, like kurtosis, considers not only the average return but also the extreme values of the distribution. Short- and medium-term investors especially focus on the skewness of a distribution because they tend to make profits in the short term and are less likely to hold a position long enough to be confident that the average works itself out.

Also, if we aim to build a classification model on a skewed dataset, accuracy is not an appropriate metric for model assessment because the result will be biased. Instead, we need to analyze the confusion matrix of the classifier.

Kurtosis
Kurtosis measures extreme values in either tail of a distribution; we can also define it as the tail-heaviness of a distribution. Distributions with large kurtosis have data exceeding the tails of the normal distribution, while distributions with small kurtosis have data that are generally less extreme than the tails of the normal distribution.

Type of Kurtosis
There are three categories of kurtosis that a set of data can display. All measures of kurtosis are compared to the standard normal distribution, whose (Pearson) kurtosis is 3. Distributions with negative or positive excess kurtosis are called platykurtic or leptokurtic distributions, respectively.
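Note that scipy's kurtosis returns the excess (Fisher) kurtosis by default, so a normal sample gives roughly 0; passing fisher=False gives the Pearson definition, roughly 3 (sample size and seed below are arbitrary):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

print(kurtosis(x))                 # Fisher (excess) kurtosis, ~0
print(kurtosis(x, fisher=False))   # Pearson kurtosis, ~3
```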


What can kurtosis tell us?
For investors, high kurtosis of a return distribution implies that they may experience occasional extreme returns, more extreme than the three standard deviations from the mean predicted by the normal distribution. This phenomenon is known as kurtosis risk.

Compute Kurtosis
To compute the kurtosis, we can import kurtosis from scipy.stats.

from scipy.stats import kurtosis

x = np.linspace(-3,3,200)
# compute the excess kurtosis (Fisher definition, the scipy default)
kurtosis(x,bias=True)
>-1.20

Summary

So we know that location and scale can determine a distribution. Typical location parameters include the mean, median, and mode; the scale, e.g., the standard deviation for a normal distribution, describes the width of a distribution.
All distributions share two shape parameters, skewness and kurtosis. Skewness measures the asymmetry observed in a probability distribution: if the tail on the right side of the distribution is longer, the distribution is positively skewed; if the tail on the left side is longer, the opposite is the case.
With these fundamentals of distributions, we can move on to a series of frequently encountered distributions in statistics. They are mainly categorized into discrete distributions and continuous distributions.

Discrete Distributions

In python, the most elegant way of working with distributions, either discrete or continuous, is a two-step procedure:

  • In the first step, we create a distribution, e.g., a normal distribution (stats.norm()).
  • Then decide which function of this distribution we want to use and calculate its value for the desired x-input.
    Example: U.S. men's height follows a normal distribution with mean 168 cm and standard deviation 10 cm. What is the chance that a man is taller than 165 cm?
# create a normal distribution
from scipy import stats
mu,std = 168,10
normDist = stats.norm(mu,std)
# decide which function to use based on the requirement.
# Here, we want the probability that a man is taller than 165 cm,
# so we use the survival function sf(x) = 1 - cdf(x).
normDist.sf(165)
>0.6179114221889526

Bernoulli Distribution

The Bernoulli distribution is the simplest case of a univariate distribution and also the basis of the binomial distribution. It has only two states, labelled n=0 and n=1, in which n=1 occurs with probability p and n=0 occurs with probability q=1-p. So the probability mass function can be written as:
p(n) = \begin{cases} 1-p&\text{for } n=0\\ p& \text{for } n = 1 \end{cases}

which can also be written as:

p(n) = p^n (1-p)^{1-n}

Here, we can easily see that the probability of n=0 is immediately determined once we know the probability of the event n=1. So there is only one parameter, p.

Exercise for Bernoulli Distribution

Let's directly start with a question about this distribution:
If we flip a coin, what is the probability that the coin lands heads? In discrete distributions, we can call the probability mass function to compute the probability that the coin lands heads in one experiment.

# create a bernoulli distribution
from scipy import stats
p = 0.5
berDist = stats.bernoulli(p)

# compute the probability of heads (n=1)
berDist.pmf(1)
>0.5

# simulate 10 Bernoulli trials
bnumbers = berDist.rvs(10)
np.where(bnumbers>0,"head","tail")
>array(['head', 'head', 'head', 'head', 'tail', 'tail', 'tail', 'tail',
       'head', 'head'], dtype='<U4')

Binomial Distribution

The binomial distribution builds directly on the Bernoulli distribution: we conduct the Bernoulli experiment multiple times and ask how many times the event occurs. For example, if we flip a coin 10 times and want to know how many times the coin lands heads, the case is a binomial distribution.

To better understand the distribution, there are several typical questions:

  • Out of ten tosses, how many times will this coin land heads?
  • From the children born in a given hospital on a given day, how many of them will
    be girls?
  • How many students in a given classroom will have green eyes?
  • How many mosquitoes, out of a swarm, will die when sprayed with insecticide?

We conduct n repeated experiments where the probability of success is given by the parameter p and add up the number of successes. This number of successes is represented by the random variable X. The value of X is then between 0 and n.

When a variable X has a binomial distribution with parameters n and p, we write X~B(n,p). The probability mass function is given by:

p(k) = \begin{cases} {n\choose k}p^{k}(1-p)^{n-k} & 0\leq k \leq n \\ 0 & \text{otherwise} \end{cases}
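As a sanity check, the pmf formula can be evaluated by hand and compared with scipy (the values of n, p, and k below are arbitrary):

```python
from math import comb
from scipy import stats

n, p, k = 10, 0.5, 3
# binomial coefficient times success/failure probabilities
manual = comb(n, k) * p**k * (1 - p)**(n - k)   # 120 / 1024 = 0.1171875
print(manual, stats.binom.pmf(k, n, p))
```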

Exercise for Binomial Distribution

  • Out of ten tosses, suppose we want to compute the probability that the coin lands heads zero times, once, twice, three times, ..., ten times:
# create a binomial distribution
from scipy import stats
n,p = 10,0.5
biDist  = stats.binom(n,p)
x = np.arange(0,11)
biDist.pmf(x)
>array([0.00097656, 0.00976563, 0.04394531, 0.1171875 , 0.20507813,
       0.24609375, 0.20507813, 0.1171875 , 0.04394531, 0.00976563,
       0.00097656])
  • Exercise 2


    (exercise statement given in an image)
# create binom distribution
n,p = 10,0.262
binomDist = stats.binom(n,p)
# compute the probability using the pmf function
binomDist.pmf(8)
>0.0005441712184877579
  • Exercise 3
Suppose we have a board game that depends on the roll of a die and attaches special importance to rolling a 6. In a particular game, the die is rolled 235 times, and 6 comes up 51 times. If the die is fair, we would expect 6 to come up 235/6 ≈ 39.17 times. Is the proportion of 6's significantly higher than would be expected by chance, under the null hypothesis of a fair die?

Solution

  • The probability of rolling a 6 is the same in every experiment, 1/6, and we conduct n repeated experiments, so this is a case of the binomial distribution.
  • The experiment has an observed result, 51 sixes. Is this a statistically significant result? Being significant means the result is unlikely to be attained by chance. To confirm this, we need the p-value, i.e., the probability that the number of sixes is greater than or equal to 51.
  • Comparing the p-value to the significance level (we set 0.05 here), we can judge whether this is a fair die.
# first create a binomial distribution
num,p = 235, 1/6
binomDist = stats.binom(num,p)

# compute the probability that the number of sixes is greater than or equal to 51
# (equivalently: binomDist.pmf(np.arange(51,236)).sum())
alpha = 0.05
pval = 1-binomDist.cdf(51)+binomDist.pmf(51)
if pval<alpha:
    print("This is not a fair die")
else:
    print("We cannot reject the null hypothesis that the die is fair")
>This is not a fair die

So a fair die is unlikely to produce as many sixes as that. We can also get the p-value directly by calling stats.binom_test:

stats.binom_test(51,n=235,p=1/6,alternative="greater")
>0.02654424571169947
  • Exercise 4
A car manufacturer claims that no more than 10% of their cars are unsafe. 15 cars are inspected for safety and 3 are found to be unsafe. Test the manufacturer's claim:
from scipy import stats
n,p = 15,0.1
binomDist  = stats.binom(n,p)
pval = binomDist.sf(3)+binomDist.pmf(3)
if pval<0.05:
    print("The manufacturer's claim is incorrect")
else:
    print("The null hypothesis cannot be rejected at the 5% level of significance \
because the returned p-value is greater than the critical value of 5%. \
This means that 3 is within the car manufacturer's expectation.")

>The null hypothesis cannot be rejected at the 5% level of significance 
because the returned p-value is greater than the critical value of 5%. 
This means that 3 is within the car manufacturer's expectation

Poisson Distribution


The Poisson distribution is very similar to the binomial distribution; both examine the number of times an event occurs. The difference is subtle: whereas the binomial looks at how many successes you register over a fixed total number of trials, the Poisson distribution measures how many times a discrete event occurs over a continuous interval of space or time.

Typical questions for the Poisson distribution include:

  • How many children will be delivered at the hospital today?
  • How many products will I sell after airing a new television commercial?
  • How many mosquito bites did you get today after having sprayed with insecticide?
  • How many defects will there be per 100m of rope sold?
  • How many calls received over a fixed period of time?

For the Poisson distribution, the only parameter is not the number of events that occur but the average (expected) number of events within our experiment, \lambda:

P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}
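As with the binomial pmf, the formula can be checked by hand against scipy (the values of lambda and k below are arbitrary):

```python
from math import exp, factorial
from scipy import stats

lam, k = 15, 10
# evaluate the Poisson pmf directly from its definition
manual = exp(-lam) * lam**k / factorial(k)
print(manual, stats.poisson.pmf(k, lam))
```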

Exercise

Consider that the number of visits on a web page is known to follow a Poisson distribution with mean 15 visits per hour. Calculate the probability of getting 10 or fewer visits per hour.

from scipy import stats

lam = 15
poissonDist = stats.poisson(lam)

poissonDist.cdf(10)
>0.11846441152901499
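A closing sketch of the connection between the two distributions: for large n and small p, the binomial distribution with lambda = np is closely approximated by the Poisson distribution (the parameters below are arbitrary):

```python
import numpy as np
from scipy import stats

n, p = 1000, 0.015     # large n, small p
lam = n * p            # lambda = np = 15
k = np.arange(0, 31)

# compare the two pmfs over the bulk of the support
diff = np.abs(stats.binom.pmf(k, n, p) - stats.poisson.pmf(k, lam))
print(diff.max())      # the two pmfs nearly coincide
```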