5.31 Probability distributions

Learn about probability distributions while analyzing bikesharing data.

dataset

In this mission, we'll construct probability distributions.

In the last mission, we defined p as the probability of an outcome occurring, and q as the probability of it not occurring, where q=1−p.

These types of probabilities are known as binomial -- there are two values, which add to 1 collectively.

Find the probability of there being more than 5000 riders in a single day (using the cntcolumn).
Assign the result to prob_over_5000.

import pandas 
bikes  = pandas.read_csv('bike_rental_day.csv')

prob_over_5000 = bikes[bikes['vnt']>5000].shape[0]/bikes.shape[0]
# outcome .39

we know that the probability is about .39 that there are more than 5000 riders in a single day.

we figured out how to find the probability of k outcomes out of N events occurring.

mission was:

(pk∗qN−k) ∗ N! / k!(N−k)!

create a function that can compute the probability of k outcomes out of N events occurring.

Use the function to find the probability of each number of outcomes in outcome_counts occurring.

An outcome is a day where there are more than 5000 riders, with p=.39.
You should have a list with 31 items, where the first item is the probability of 0 days out of 30 with more than 5000 riders, the second is the probability of 1 day out of 30, and so on, up to 30 days out of 30.
Assign the list to outcome_probs

import math
# each item in this list represents one k ,
# starting from 0 and going up to and including 30
outcome_counts = list(range(31))

def find_probability(N,k,p,q):
    # find the probability of any single combination
    term_1 = p ** k
    term_2 = q ** (N-k)
    comob_prob = term_1 * term_2
    # find the number of combinations.
    numerator = math.factorial(N)
    denominator = math.factorial(k) * math.factorial(N-k)
    combo_count = numerator/denominator
    return  combo_prob * combo_count

outcome_probs = [find_probability(30, k, .39, .61) for k in outcome_counts]

# len(outcome_probs)   31

plotting the distribution

You may have noticed that outcome_counts in the previous screen was 31 items long when N was only 30. This is because we need to account for 0. There's a chance of having k=0, where the outcome we want doesn't happen at all. This data point needs to be on our charts. We'll always want to add 1 to N when figuring out how many points to plot.

import matplotlib.pyplot as plt

# The most likely number of days 
# is between 10 and 15.
plt.bar(outcome_counts, outcome_probs)
plt.show()

Simplifying the computation by binom.pmf function

We can instead use the binom.pmf function from SciPy to do this faster.

Here's a usage example:

from scipy import linspace
from scipy.stats import binom

# Create a range of numbers from 0 to 30, 
# with 31 elements (each number has one entry).
outcome_counts = linspace(0,30,31)

# Create the binomial probabilities,
# one for each entry in outcome_counts.
dist = binom.pmf(outcome_counts,30,0.39)

The pmf function in SciPy is an implementation of the mathematical probability mass function. The pmf will give us the probability of each k in our outcome_counts list occurring.

A binomial distribution only needs two parameters. A parameter is the statistical term for a number that summarizes data for the entire population. For a binomial distribution, the parameters are:

  • N, the total number of events,
  • p, the probability of the outcome we're interested in seeing.

The SciPy function pmf matches this and takes in the following parameters:

x: the list of outcomes,
n: the total number of events,
p: the probability of the outcome we're interested in seeing.

Because we only need two parameters to describe a distribution, it doesn't matter whether we want to know if it will be sunny 5 days out of 5, or if 5 out of 5 coin flips will turn up heads. As long as the outcome that we care about has the same probability (p), and N is the same, the binomial distribution will look the same.

  • Generate a binomial distribution, and then find the probabilities for each value in outcome_counts.
  • Use N=30, and p=.39, as we're doing this for the bikesharing data.
  • Plot the resulting data as a bar chart.
import matplotlib.pyplot as plt
import scipy
from scipy import linsapce
from scipy.stats import binom

# Create a range of numbers from 0 to 30, 
# with 31 elements (each number has one entry).
# 功能:linspace是Matlab中的一个指令,
# 用于产生x1,x2之间的N点行矢量。其中x1、x2、N分别为起始值、终止值、元素个数。
# 若缺省N,默认点数为100。

outcome_counts = linspace(0,30,31) #nd.array
# outcome_counts = list(range(31)) #list
# list  nd.array  both works.


x = outcome_counts #list or ndarray
n = 30 #N
p = .39
outcome_probs = binom.pmf(x,n,p)


plt.bar(outcome_counts,outcome_probs)
plt.show()

How to think about a probability distribution

If we repeatedly look at 30 days of bikesharing data, we'll find that 10 of the days had more than 5000 riders about 12.4% of the time. We'll find that 12 of the days had more than 5000 riders about 14.6% of the time.

Computing the mean of a probability distribution

Sometimes we'll want to be able to tell people the expected value of a probability distribution -- the most likely result of a single sample that we look at.

To compute this, we just multiply N by p.

dist_mean = 30 * .39
# mean of distribution 11.7

Computing the standard deviation

Just as we can compute the mean, we can also compute the standard deviation of a probability distribution. This helps us find how much the actual values will vary from the mean when we take a sample.

Going back to the bikesharing example, we know that the actual values will be around 11.7 (from the last screen). But, we'll need a standard deviation to explain how much the actual values can vary from this expectation.

the formula for standard deviation of a probability distribution is

distribution_standard_deviation 
= math.sqrt(N∗p∗q)

Compute the standard deviation for the bikesharing data, where N=30, and p=.39.
Assign the result to dist_stdev.

import  math
dist_stdev = math.sqrt(30*.39*.61)

Adifferent plot

Just like we did with histograms and sampling a few missions ago, we can vary the parameters to change the distribution. Let's see what the plot would look like with only 10 events, or 100 events.

Generate a binomial distribution, with N=10, and p=.39.

Find the probabilities for each value in outcome_counts.
Plot the resulting data as a bar chart.

Generate a binomial distribution, with N=100, and p=.39.

Find the probabilities for each value in outcome_counts.
Plot the resulting data as a bar chart.

import matplotlib.pyplot as plt
import  scipy
from scipy import linspace
from scipy.stats import binom

outcome_counts = linspace(0,10,11)
outcome_probs = binom.pmf(outcome_counts, 10 ,.39)
plt.bar(outcome_counts,outcome_probs)
plt.show()

outcome_counts = linspace(0,100,101)
outcome_probs = binom.pmf(outcome_counts,100,.39)
plt.bar(outcome_counts,outcome_probs)
plt.show()

The normal distribution

From the last screen, the more events we looked at, the closer our distribution was to being normal. With N=10, we saw some rightward skew, but when we got up to N=100, the skew disappeared.

This is because the distribution got narrower relative to the x-axis range the more examples you add. With N=10, there's a reasonable chance that 8 to 10 days could have over 5000 riders. But, when we get up to N=100, it's statistically almost impossible that more than 60 days have over 5000 riders. This makes the distribution narrower.

As the distribution gets narrower, it gets more similar to the normal distribution. In the code cell, we plot a line chart instead of a bar chart and it looks almost exactly like a normal distribution.

# Create a range of numbers from 0 to 100, 
# with 101 elements (each number has one entry).
outcome_counts = scipy.linspace(0,100,101)

# Create a probability mass function
#along the outcome_counts.
outcome_probs = binom.pmf(outcome_counts,100,0.39)

# Plot a line, not a bar chart.
plt.plot(outcome_counts, outcome_probs)
plt.show()

Cumulative density function

So far, we've looked at the probability that single values of k will occur. What we can look at instead is the probability that k or less will occur. These probabilities can be generated by the cumulative density function.

Let's say we flip a coin 3 times -- N=3, and p=.5. When this happens, here are the probabilities:

k    probability

0    .125
1    .375
2    .375
3    .125

A cumulative distribution would look like this:

k    probability

0    .125
1    .5
2    .875
3    1

For each k, we fill in the probability that we'll see k outcomes or less. By the end of the distribution, we should get 1, because all the probabilities add to 1 (if we flip 3 coins, either 0, 1, 2, or 3 of them must be heads).

We can calculate this with binom.cdf in scipy.

from scipy import linspace
from scipy.stats import binom

# Create a range of numbers from 0 to 30, 
# with 31 elements (each number has one entry).
outcome_counts = linspace(0,30,31)

# Create the cumulative binomial probabilities, 
# one for each entry in outcome_counts.
dist = binom.cdf(outcome_counts,30,0.39)

Create a cumulative distribution where N=30 and p=.39 and generate a line plot of the distribution.

outcome_counts = linspace(0,30,31)
outcome_probs = binom.cdf(outcome_counts,30,0.39)
plt.plot(outcome_counts, outcome_probs)
plt.show()

Calculating z-scores

This is because every normal distribution, as we learned in an earlier mission, has the same properties when it comes to what percentage of the data is within a certain number of standard deviations of the mean. You can look these up in a standard normal table. About 68% of the data is within 1 standard deviation of the mean, 95% is within 2, and 99% is within 3.

We can calculate the mean (μ) and standard deviation (σ) of a binomial probability distribution using the formulas from earlier:


1.png

If we want to figure out the number of standard deviations from the mean a value is, we just do:
(k-μ)/σ

If we wanted to find how many standard deviations from the mean 16 days is:

(16 -μ) /σ
= 16-(30*.39)/(30*.39*.61)1/2(math.squar)
= 4.3/2.67 
= 1.61

This tells us that 16 days is approximately 1.61 standard deviations above the mean. In percentage form, this captures about 44.63% of our data. If we also include 1.61 standard deviations below the mean(both sides of distribution), this'll include 89.26% of our data.

There's a 5.37% chance that a value is 1.61 standard deviations or more above the mean (to the right of the mean), and there's a 5.37% chance that a value is 1.61 standard deviations below the mean (to the left).

Faster way to calculate likelihood

We don't want to have to use a z-score table every time we want to see how likely or unlikely a probability is. A much faster way is to use the cumulative distribution fuction (cdf) like we did earlier. This won't give us the exact same values as using z-scores, because the distribution isn't exactly normal, but it will give us the actual amount of probability in a distribution to the left of a given k.

To use it, we can run:

# The sum of all the probabilities to the left of k, including k.
left = binom.cdf(k,N,p)

# The sum of all probabilities to the right of k.
right = 1 - left

This will return the sum of all the probabilities to the left of and including k. If we subtract this value from 1, we get the sum of all the probabilities to the right of k.

Find the probability to the left of (including 16) when and .

Assign the result to left_16.

Find the probability to the right of when and .

Assign the result to right_16.

left_16 = None
right_16 = None
left_16 = binom.cdf(16,30,0.39)
right_16 = 1 - left_16
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,186评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,858评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,620评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,888评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,009评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,149评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,204评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,956评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,385评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,698评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,863评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,544评论 4 335
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,185评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,899评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,141评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,684评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,750评论 2 351

推荐阅读更多精彩内容