Statistics for Data Analysis

1. Conditional probability

The conditional probability of A given B is P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.


2. Bayes' theorem

Bayes' theorem states that P(A | B) = P(B | A) · P(A) / P(B), where:

P(A | B) is a conditional probability: the likelihood of event A occurring given that B is true.

P(B | A) is also a conditional probability: the likelihood of event B occurring given that A is true.

P(A) and P(B) are the probabilities of observing A and B independently of each other; this is known as the marginal probability.

Interpretations of Bayes' theorem:

Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

P(H | E) = P(E | H) · P(H) / P(E)

where:

H: stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, and the task is to determine which is the most probable.

P(H): the prior probability, is the estimate of the probability of the hypothesis H before the data E, the current evidence, is observed.

P(H | E): the posterior probability, is the probability of H given E, i.e., after E is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.

P(E | H): is the probability of observing E given H, and is called the likelihood. As a function of E with H fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, E, while the posterior probability is a function of the hypothesis, H.

P(E): is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis H does not appear anywhere in the symbol, unlike for all the other factors), so this factor does not enter into determining the relative probabilities of different hypotheses.

For different values of H, only the factors P(H) and P(E | H), both in the numerator, affect the value of P(H | E) – the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

Sometimes, Bayes' theorem is written as

P(H | E) = P(H) · P(E | H) / P(E)

where the factor P(E | H) / P(E) can be interpreted as the impact of E on the probability of H.
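As a quick numerical illustration of the theorem, here is a minimal Python sketch of a screening-test calculation. All the numbers (prior 1%, sensitivity 95%, false-positive rate 5%) are made up for illustration:

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem: P(H|E) = P(E|H)P(H) / P(E), with P(E) expanded
    over H and not-H by the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A positive test result updates a 1% prior to roughly a 16% posterior:
p = posterior(0.01, 0.95, 0.05)
print(round(p, 4))
```

The small posterior despite the 95% sensitivity shows why the prior P(H) matters: P(E) is dominated by false positives from the large not-H group.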


3. Binomial distribution

The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question and each with its own Boolean-valued outcome, a random variable containing a single bit of information: success/yes/true/one (with probability p) or failure/no/false/zero (with probability q = 1 − p).

In general, if the random variable X follows the binomial distribution with parameters n ∈ ℕ and p ∈ [0,1], we write X ~ B(n, p). The probability of getting exactly k successes in n trials is given by the probability mass function

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)

for k = 0, 1, 2, …, n, where C(n, k) = n! / (k! (n − k)!) is the binomial coefficient.

The cumulative distribution function can be expressed as

F(k; n, p) = P(X ≤ k) = Σ_{i=0}^{⌊k⌋} C(n, i) · p^i · (1 − p)^(n − i)

where ⌊k⌋ is the "floor" under k, i.e. the greatest integer less than or equal to k.

Mean: E(X) = np; Variance: Var(X) = npq = np(1 − p); Mode: ⌊(n + 1)p⌋.
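A minimal Python sketch of the pmf (with made-up parameters n = 10, p = 0.3) that also verifies the mean and variance formulas numerically:

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * w for k, w in enumerate(pmf))
var = sum((k - mean)**2 * w for k, w in enumerate(pmf))
print(round(mean, 6), round(var, 6))   # matches np = 3.0 and np(1-p) = 2.1
```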

Covariance between two binomials:

If two binomially distributed random variables X and Y are observed together, estimating their covariance can be useful. The covariance is Cov(X,Y) = E(XY) - μX * μY

In the case n = 1 (the case of Bernoulli trials), XY is non-zero only when both X and Y are one, and μX and μY are equal to the two probabilities pX and pY. Defining pB as the probability of both happening at the same time, this gives

Cov(X, Y) = pB − pX · pY

and for n independent pairwise trials

Cov(X, Y) = n (pB − pX · pY).

In a bivariate setting involving random variables X and Y, there is a particular expectation that is often of interest. It is called covariance and is given by Cov(X,Y) = E((X − E(X))(Y − E(Y))), where the expectation is taken over the bivariate distribution of X and Y. Alternatively, Cov(X,Y) = E(XY) − E(X)E(Y).

Moreover, a scaled version of covariance is the correlation ρ, which is given by

ρ = Corr(X,Y) = Cov(X,Y) / (sqrt(Var(X)) · sqrt(Var(Y))), where Var(X) = σx^2.

The correlation ρ is the population analogue of the sample correlation coefficient r that is used to describe the degree of linear relationship involving paired data.
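The covariance and correlation formulas above can be sketched in Python on some made-up paired data (the numbers are illustration only):

```python
import math

# Toy paired data, chosen to be nearly linear.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Cov(X, Y) = E[(X - E X)(Y - E Y)], population form (divide by n)
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# rho = Cov(X, Y) / (sqrt(Var X) * sqrt(Var Y)); Var X = Cov(X, X)
rho = cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))
print(round(rho, 4))   # close to 1: the points lie near a positively sloped line
```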


Confidence Interval

Assume that the total number of successes X ~ B(n, p), with np ≥ 5 and n(1 − p) ≥ 5, so that the normal approximation to the binomial is reasonable.

In practice, p is unknown. Under the normal approximation, we have X ~ N(np, np(1 − p)), and we define p̂ = X/n as the proportion of successes. Since p̂ is a linear function of a normal random variable, it follows that p̂ ~ N(p, p(1 − p)/n), and the probability statement is

P(−z_{α/2} ≤ (p̂ − p) / sqrt(p(1 − p)/n) ≤ z_{α/2}) ≈ 1 − α.

Let z_{α/2} denote the (1 − α/2)·100-th percentile of the standard normal distribution. An approximate (1 − α)·100% confidence interval (approximate because we use the normal approximation to the binomial and substitute p̂ for p) is given by

p̂ ± z_{α/2} · sqrt(p̂(1 − p̂)/n).
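The interval for a proportion can be sketched in a few lines of Python; the 40-out-of-100 data is made up for illustration:

```python
import math

def proportion_ci(x, n, z=1.96):
    """Approximate 95% CI for p: p_hat +/- z * sqrt(p_hat(1-p_hat)/n).
    Requires np >= 5 and n(1-p) >= 5 for the normal approximation."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

lo, hi = proportion_ci(40, 100)   # 40 successes in 100 trials (made-up)
print(round(lo, 4), round(hi, 4))
```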


4. Normal distribution

The normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.

If X ~ N(μ,σ^2), then E(X) = μ, and Var(X) = σ^2 

σ^2 is the variance and not the standard deviation.

A random variable Z ~ N(0,1) is referred to as standard normal, and it has the simplified pdf

φ(z) = (1 / sqrt(2π)) · e^(−z²/2).

The relationship between an arbitrary normal random variable X ~ N(μ, σ^2) and the standard normal distribution is expressed via (X-μ) / σ ~ N(0,1) 

Confidence Interval

In this case, we assume X1, X2, ..., Xn are iid N(μ, σ^2), where μ is unknown and, for ease of development, σ is taken as known. (In the real world we essentially never know σ while μ is unknown; this assumption is relaxed below.)

X-bar ~ N(μ, σ^2/n)

Converting to the standard normal distribution, and using the fact that Φ(1.96) = 0.975:

P(−1.96 ≤ (X̄ − μ) / (σ/√n) ≤ 1.96) = 0.95

Rearranging terms, we finally obtain a 95% confidence interval for μ:

x̄ ± 1.96 · σ/√n

More generally, let z_{α/2} denote the (1 − α/2)·100-th percentile of the standard normal distribution; a (1 − α)·100% confidence interval for μ is given by

x̄ ± z_{α/2} · σ/√n.

Here we use the observed value x̄; it is understood that confidence intervals are functions of observed statistics.

When n is large, it turns out that the sample standard deviation s provides a good estimate of σ. Therefore, we are able to provide a confidence interval for μ in the more realistic case where σ is unknown: we simply replace σ in the above formulas with s.
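A minimal Python sketch of this σ-unknown interval, with made-up measurements and s substituted for σ:

```python
import math
import statistics

def mean_ci(sample, z=1.96):
    """Approximate 95% CI for mu: x_bar +/- z * s / sqrt(n), using the
    sample standard deviation s in place of sigma (reasonable for large n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation
    half = z * s / math.sqrt(n)
    return x_bar - half, x_bar + half

data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]   # made-up measurements
lo, hi = mean_ci(data)
print(round(lo, 3), round(hi, 3))
```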


5. Descriptive statistics

Descriptive statistics concern the presentation of data (numerical or graphical) in a way that makes the data easier to digest.

Dotplot: for univariate data

        outliers: values that are unusually large or small

        centrality: values in the middle portion of the dotplot

        dispersion: spread or variation in the data

Histograms: for univariate data when the dataset size n is fairly large

        modality: a histogram with two distinct humps is referred to as bimodal

        skewness:

        symmetry:

        How to choose the intervals on the x-axis: choose the number of intervals roughly equal to sqrt(n), where n is the number of observations.

        For intervals of unequal length, plot relative frequency divided by interval length on the vertical axis, instead of frequency.

Boxplot: for univariate data; most appropriately used when the data are divided into groups.

        the middle line is the sample median (Q2); the top edge is the 3/4 quantile (Q3); the bottom edge is the 1/4 quantile (Q1)

        interquartile range (IQR) : Q3-Q1, known as ΔQ

        upper limit: Q3 + 1.5ΔQ (some variants use the 90th percentile instead)

        lower limit: Q1 − 1.5ΔQ (some variants use the 10th percentile instead)

        values outside these limits are outliers.

        whiskers (vertical dashed lines) extend to the most extreme data values within these limits, and circles mark the outliers.

Scatterplot: it is appropriate for paired data

        extrapolation: be cautious about predictions based on extrapolated data. A positive linear trend may appear in the observed range of the two variables X and Y, but the same relationship need not hold outside that range. (Interpret the data together with real-world context.)

Correlation Coefficient: measures the degree of linear association between x and y

A numerical descriptive statistic for investigating paired data is the sample correlation coefficient r, defined by

r = Σ(xi − x̄)(yi − ȳ) / sqrt(Σ(xi − x̄)² · Σ(yi − ȳ)²)

Properties:

    -1 <= r <= 1

    when r is close to 1, the points are clustered about a line with positive slope

    when r is close to -1, the points are clustered about a line with negative slope

    when r is close to 0, the points lack a linear relationship; there may, however, still be a quadratic (or other nonlinear) relationship

    when x and y are correlated (r not close to 0), this merely indicates the presence of a linear association. For example, weight and height are positively correlated, and it would obviously be wrong to state that one causes the other.

    In order to establish a cause-and-effect relationship, we should run a controlled study.

    Finally, r is easy to calculate.


6. Law of large numbers

The average of the results obtained from a large number of trials should be close to the expected value, and tends to become closer as more trials are performed.

7. Central limit theorem

For the CLT, we assume that the random variables X1, X2, ..., Xn are iid* from a population with mean μ and variance σ^2. The CLT states that as n → ∞, the distribution of (X̄ − μ) / (σ/√n) converges to the distribution of a standard normal random variable.

*iid: independent and identically distributed, which means X's are independent of one another and arise from the same probability distribution.

Take a random sample of n observations from a population with mean μ and standard deviation σ. When n is sufficiently large, the sampling distribution of x̄ is approximately normal with mean μ_x̄ = μ and standard deviation σ_x̄ = σ/√n; the larger the sample size, the better the normal approximation to the sampling distribution of x̄.

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. 

Requirements:

1. The population itself is not required to be normally distributed.

2. Each sample must be reasonably large, though not excessively so; n ≥ 30 is a common rule of thumb.

The CLT is the theoretical guarantee that we can infer a population's statistical parameters from only a sample of it.
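The claim above is easy to see in a simulation; this sketch standardizes means of uniform(0,1) draws (a decidedly non-normal population) and checks that they behave like N(0,1):

```python
import random
import statistics

# Simulate the CLT: standardized means of n uniform(0,1) draws should be
# approximately N(0, 1) even though the population is not normal.
random.seed(0)                       # fixed seed for reproducibility
mu, sigma2 = 0.5, 1 / 12             # mean and variance of uniform(0, 1)
n, reps = 50, 2000
z = [(statistics.mean(random.random() for _ in range(n)) - mu)
     / (sigma2 ** 0.5 / n ** 0.5)
     for _ in range(reps)]
print(round(statistics.mean(z), 2), round(statistics.stdev(z), 2))
```

The printed sample mean and standard deviation of the standardized values should land near 0 and 1 respectively.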


8. Linear regression (gradient descent and the cost function from the machine-learning perspective are not covered here)

Linear regression predicts the value of a variable Y (the dependent variable) from a variable X (the independent variable), provided there is a linear relationship between X and Y:

Y = b0 + b1·X + e

(Recall that the fitted equation with the estimated coefficients and without the error term, ŷ = b0 + b1·x, is called the least squares line.)

(Figure: a shallow-sloped estimated regression line, ŷ.)

SSTO (a.k.a. SST), total sum of squares: the sum of squared differences between the data points yi and the sample mean ȳ, Σ(yi − ȳ)².

SSE, error sum of squares: the sum of squared differences between the data points yi and the estimated regression line, Σ(yi − ŷi)².

SSR, regression sum of squares: quantifies how far the estimated sloped regression line ŷi is from the horizontal "no relationship" line, the sample mean ȳ: Σ(ŷi − ȳ)².

SST = SSR + SSE

In the example above, most of the variation in the response y (SSTO = 1827.6) is due simply to random variation (SSE = 1708.5), not to the regression of y on x (SSR = 119.1).
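The decomposition SSTO = SSR + SSE can be checked numerically. A minimal Python sketch on made-up data (not the SSTO = 1827.6 example, whose raw data isn't given here):

```python
# Least squares fit on made-up paired data, verifying SSTO = SSR + SSE
# and r^2 = SSR / SSTO.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
# Least squares slope and intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ssto = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
print(round(ssto, 4), round(ssr + sse, 4), round(ssr / ssto, 4))
```

For a least squares fit the identity holds exactly (up to floating-point error), and the last printed value is r^2.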

Coefficient of Determination, r^2 = SSR/SSTO = 1 − SSE/SSTO

It lies between 0 and 1:

    If r^2 = 1, all data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!

    If r^2 = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!

    r^2 ×100 percent of the variation in y is 'explained by' the variation in predictor x.

    SSE is the amount of variation that is left unexplained by the model.

    R-squared Cautions:

        1. The coefficient of determination r^2 and the correlation coefficient r quantify the strength of a linear relationship. It is possible that r^2 = 0% and r = 0, suggesting there is no linear relation between x and y, and yet a perfect curved (or "curvilinear" relationship) exists.

        [Most commonly misinterpreted point] 2. A large r^2 value should not be interpreted as meaning that the estimated regression line fits the data well.

        For example, even if the R-squared value is 92%, so that only 8% of the variation in the US population remains to be explained after taking the year into account in a linear way, the plot may suggest that a curve describes the relationship even better. (The large value does suggest that taking the year into account is better than not doing so; it just doesn't tell us that we could still do better.)

        3. The coefficient of determination r^2 and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).

        4. Correlation (or association) does not imply causation.

VIF, variance inflation factor: 1/(1 − R_j^2), where R_j^2 comes from regressing explanatory variable j on the other explanatory variables.

    VIF checks for collinearity between explanatory variables. A value over 5 is generally considered problematic.
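In the two-predictor case the VIF reduces to 1/(1 − r^2) for the simple regression of one predictor on the other. A Python sketch with made-up, deliberately collinear predictors:

```python
# VIF for one explanatory variable regressed on another (two-predictor case).
# x2 is deliberately almost a linear function of x1, so the VIF is large.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0]   # roughly 2 * x1 (made-up)

def r_squared(x, y):
    # r^2 for a simple linear regression of y on x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

vif = 1 / (1 - r_squared(x1, x2))
print(round(vif, 1))   # far above 5, flagging severe collinearity
```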


9. Hypothesis test

H0: null hypothesis; H1: alternative hypothesis.

Testing begins by assuming that H0 is true, and data is collected in an attempt to establish the truth of H1.

H0 is usually what you would typically expect (i.e., H0 represents the status quo).

In the inference step, we calculate a p-value, defined as the probability of observing data as extreme or more extreme (in the direction of H1) than what we observed, given that H0 is true.

Significance level: α, usually 0.01 or 0.05

    If the p-value is less than α, reject H0;

    If the p-value is larger than α, fail to reject H0.
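These steps can be sketched as a one-sample z-test (σ assumed known); all the numbers are made up for illustration:

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# One-sample z-test: H0: mu = 50 vs H1: mu > 50 (made-up numbers).
x_bar, mu0, sigma, n = 52.0, 50.0, 6.0, 36
z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = 1 - normal_cdf(z)             # one-sided, in the direction of H1
alpha = 0.05
print(round(z, 2), round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Here z = 2.0, so the one-sided p-value is about 0.023 and H0 is rejected at α = 0.05.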

......


10. Model Selection: AIC, BIC, Normality, Homoscedasticity, Outlier Detection

When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model.

AIC, Akaike information criterion: AIC = 2k − 2 ln(L), where k is the number of parameters in the model (the number of degrees of freedom being used up) and ln(L) is the log-likelihood, a measure of how well the model fits the data. Lower AIC is better; 2k is the penalty term.

AIC measures both goodness of fit and complexity (number of terms).

Compare AIC with the proportion of variance explained, R^2: R^2 only measures goodness of fit.

However, because of collinearity, one variable sometimes "steals" significance from another term. AIC doesn't care which terms are significant; it just looks at how well the model fits as a whole.

BIC, Bayesian information criterion: BIC = ln(n)·k − 2 ln(L), where n is the number of observations (the sample size) and k is the number of parameters (df).

BIC is similar to AIC but imposes a larger penalty for complexity; lower BIC is better, and BIC favors simpler models among a set of candidates. Moreover, because the penalty grows with n, when n is large BIC is more inclined to drop variables that are unimportant.


When selecting models, one criterion (AIC/BIC) is not sufficient to cover all aspects of the model.

We also need to check for influential outliers, homoscedasticity (equal variance), and normality.

Residual analysis is used to check the properties mentioned above.


To check normality: use the Shapiro–Wilk test.

It is a hypothesis test whose null hypothesis is "your data are normally distributed."

A large p-value means we fail to reject H0: there is no evidence against normality. A small p-value means we reject H0: there is evidence of non-normality.


To check homoscedasticity: use the Levene test.

It is again a hypothesis test, with null hypothesis: all input samples are from populations with equal variances.
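As a sketch of what the test computes, here is the mean-centered variant of Levene's W statistic in pure Python (scipy.stats.levene defaults to the median-centered Brown–Forsythe variant; the sample data are made up):

```python
def levene_w(*groups):
    """Mean-centered Levene statistic:
    W = (N-k)/(k-1) * sum_i n_i (Zbar_i - Zbar)^2 / sum_ij (Z_ij - Zbar_i)^2,
    where Z_ij = |Y_ij - mean of group i|."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    z = [[abs(y - sum(g) / len(g)) for y in g] for g in groups]
    z_bars = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zb - z_grand) ** 2 for zi, zb in zip(z, z_bars))
    within = sum((zij - zb) ** 2 for zi, zb in zip(z, z_bars) for zij in zi)
    return (n_total - k) / (k - 1) * between / within

a = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]      # made-up samples with
b = [6.0, 5.7, 6.3, 6.1, 5.9, 6.2]      # similar spreads
print(round(levene_w(a, b), 3))         # small W: no evidence against H0
```

W is compared against an F(k − 1, N − k) critical value; a large W leads to rejecting the equal-variance null.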


Outlier Detection: statistical methods only; approaches from the data-mining side are not covered here.

noise: it is random error or variance in a measured variable

    noise should be removed before outlier detection.

outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism. It violates the mechanism that generates the normal data.

Parametric methods I: detection of univariate outliers based on the normal distribution

    The μ ± 3σ region contains 99.7% of the data; outliers lie outside this region.
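A minimal Python sketch of the 3σ rule on made-up data with one planted outlier (note that with μ and σ estimated from the same data, the sample must be reasonably large or a single outlier can mask itself by inflating σ):

```python
import statistics

# Made-up data: tight cluster around 10 plus one planted outlier.
data = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0, 9.9, 10.1,
        10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 20.0]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)        # population sd, matching mu +/- 3 sigma
outliers = [x for x in data if abs(x - mu) > 3 * sigma]
print(outliers)
```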

Parametric methods II: detection of multivariate outliers.

    bottom line: transform the multivariate outlier detection task into a univariate outlier detection problem

        use the χ²-statistic (chi-square statistic):

        χ² = Σi (Oi − Ei)² / Ei

        where O is the observed value, E is the expected value, and i is the i-th position in the contingency table.

        If the χ²-statistic is large, then object Oi is an outlier.

        A low value for chi-square means there is a high correlation between your two sets of data. In theory, if your observed and expected values were equal (“no difference”) then chi-square would be zero — an event that is unlikely to happen in real life. You could take your calculated chi-square value and compare it to a critical value from a chi-square table. If the chi-square value is more than the critical value, then there is a significant difference.

        A chi-square statistic is one way to show a relationship between two categorical variables. In statistics, there are two types of variables: numerical (countable) variables and non-numerical (categorical) variables. The chi-squared statistic is a single number that tells you how much difference exists between your observed counts and the counts you would expect if there were no relationship at all in the population.
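The statistic itself is one line of Python; the observed/expected counts below are made up for illustration:

```python
# Chi-square statistic: X^2 = sum_i (O_i - E_i)^2 / E_i
observed = [18, 22, 30, 30]     # made-up observed counts
expected = [25, 25, 25, 25]     # expected counts under "no difference"
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))           # 4.32
```

With df = 3, the 0.05 critical value is 7.815, so this made-up example would not be flagged as a significant difference.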

[Omit] Parametric methods III: using a mixture of parametric distributions

Outlier detection is a big topic that could be expanded into an article of its own; I'll stop here for this statistics overview.


Statistics notation:

Note that statistics are quantities that we can calculate, and therefore do not depend on unknown parameters. Moreover, statistics have associated probability distributions, and we are sometimes interested in the distributions of statistics.

Capital letters, in statistics, usually denote random variables

MLE: maximum likelihood estimate

MSE: mean squared error

RMSE: root mean squared error

r^2: coefficient of determination

SE: standard error

SEM: standard error of the mean

SS: sum of squares

SSE: error (residual) sum of squares of the prediction function

SSR: regression sum of squares

SST: total sum of squares

