5.7 Introduction to Statistics

At its core, statistics is about counting and measuring.

In order to do both effectively, we have to define scales on which to base our counts. A scale represents the possible values that a variable can have.

Equal Interval Scales

Equal interval scales are always consistent: the distance between any two adjacent values is the same everywhere on the scale.

Think of the speed of a car. No matter what speed you're traveling at, a difference of five miles per hour is always five miles per hour.

The difference between 60 and 55 miles per hour will always be equal to the difference between 10 and 5 miles per hour.

Logarithmic Scales

Each step on a logarithmic scale represents a different order of magnitude. The Richter scale that measures the strength of earthquakes, for example, is a logarithmic scale.

The difference in shaking amplitude between a 5 and a 6 on the Richter scale is greater than the difference between a 4 and a 5. This is because each number on the Richter scale represents 10 times the shaking amplitude of the previous number.

A 6 on the Richter scale is 10 times more powerful (technically, powerful is the wrong term, but it makes thinking about this easier) than a 5, which is 10 times more powerful than a 4. A 6 is 100 times more powerful than a 4.
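
To make the arithmetic concrete, here is a minimal sketch of the 10-times-per-step rule described above. The function name amplitude_ratio is an illustrative assumption, not part of the original exercise.

```python
def amplitude_ratio(magnitude_a, magnitude_b):
    """How many times larger the shaking amplitude at magnitude_a is than at magnitude_b."""
    # Each whole step on the scale multiplies the amplitude by 10.
    return 10 ** (magnitude_a - magnitude_b)


print(amplitude_ratio(6, 5))  # 10
print(amplitude_ratio(6, 4))  # 100
```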

We can calculate the mean of the values on an equal interval scale by adding those values, and then dividing by the total number of values.

We could do the same for the values on a non-equal interval scale, but the results wouldn't be meaningful, due to the differences between units.

  • Compute the mean of car_speeds, and assign the result to mean_car_speed.
  • Compute the mean of earthquake_intensities, and assign the result to mean_earthquake_intensities. Note that this value will not be meaningful, because we shouldn't average values on a logarithmic scale this way.
car_speeds = [10,20,30,50,20]
earthquake_intensities = [2,7,4,5,8]
mean_car_speed = sum(car_speeds) / len(car_speeds)
mean_earthquake_intensities = sum(earthquake_intensities) / len(earthquake_intensities)
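
One way to see why the plain mean of the magnitudes is misleading is to average on the underlying amplitude scale instead. The sketch below assumes, as described above, that each whole step is a factor of 10 in shaking amplitude. The names relative_amplitudes and amplitude_based_mean are hypothetical, and this is an illustration rather than a standard seismological method.

```python
import math

earthquake_intensities = [2, 7, 4, 5, 8]

# Plain mean of the scale readings (the value the exercise warns about).
naive_mean = sum(earthquake_intensities) / len(earthquake_intensities)

# Convert each reading to a relative amplitude, average those, then convert back.
relative_amplitudes = [10 ** m for m in earthquake_intensities]
mean_amplitude = sum(relative_amplitudes) / len(relative_amplitudes)
amplitude_based_mean = math.log10(mean_amplitude)

print(naive_mean)
print(round(amplitude_based_mean, 2))
```

The amplitude-based figure lands above 7, well above the plain mean, because the two largest earthquakes dominate the total.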

2. Discrete and Continuous Scales

Scales can be either discrete or continuous.

Think of someone marking down the number of inches a snail crawls every day. The snail could crawl 1 inch, 2 inches, 1.5 inches, 1.51 inches, or any other number, and it would be a valid observation. This is because inches are on a continuous scale, and even fractions of an inch are possible.

Now think of someone counting the number of cars in a parking lot each day. 1 car, 2 cars, and 10 cars are valid measurements, but 1.5 cars isn't valid.

Half of a car isn't a meaningful quantity, because cars are discrete. You can't have 52% of a car - you either have a car, or you don't.

You can still average items on discrete scales, though. You could say "1.75 cars use this parking lot each day, on average." Any daily value for number of cars, however, would need to be a whole number.
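
As a small illustration of that point, the sketch below averages a made-up set of whole-number car counts. The list daily_car_counts is hypothetical data chosen only for this example.

```python
# Hypothetical daily counts: every observation is a whole number of cars.
daily_car_counts = [1, 2, 3, 1]

mean_cars = sum(daily_car_counts) / len(daily_car_counts)

print(mean_cars)  # 1.75
# The individual observations are still discrete, even though the mean is not.
print(all(isinstance(count, int) for count in daily_car_counts))  # True
```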

  • Make a line plot with day_numbers on the x axis and snail_crawl_length on the y axis.
  • Make a line plot with day_numbers on the x axis and cars_in_parking_lot on the y axis.

day_numbers = [1,2,3,4,5,6,7]
snail_crawl_length = [.5,2,5,10,1,.25,4]
cars_in_parking_lot = [5,6,4,2,1,7,8]

import matplotlib.pyplot as plt

plt.plot(day_numbers, snail_crawl_length)
plt.show()

plt.plot(day_numbers, cars_in_parking_lot)
plt.show()

tips

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.show()

3. Understanding Scale Starting Points

Some scales use the zero value in different ways. Think of the number of cars in a parking lot.

Zero cars in the lot means that there are absolutely no cars at all, so absolute zero is at 0 cars. You can't have negative cars.

Now, think of degrees Fahrenheit.

Zero degrees doesn't mean that there isn't any warmth; the degree scale can also be negative, and absolute zero (when there is no warmth at all) is at -459.67 degrees.

Scales whose absolute zero point isn't at 0 don't enable us to take meaningful ratios. Car counts do have their absolute zero at 0, so if four cars parked in the lot yesterday and eight park today, I can safely say that twice as many cars are in the lot today.

However, if it was 32 degrees Fahrenheit yesterday, and it's 64 degrees today, I can't say that it's twice as warm today as yesterday.
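
The temperature case can be checked numerically. The sketch below shifts both readings so that absolute zero sits at 0 (adding 459.67, the offset given above) and only then takes the ratio. The variable names are illustrative.

```python
ABSOLUTE_ZERO_OFFSET = 459.67  # absolute zero lies this many degrees below 0 Fahrenheit

yesterday_f = 32
today_f = 64

# Ratio of the raw readings: looks like "twice as warm", but is not meaningful.
naive_ratio = today_f / yesterday_f

# Ratio after moving absolute zero to 0.
shifted_ratio = (today_f + ABSOLUTE_ZERO_OFFSET) / (yesterday_f + ABSOLUTE_ZERO_OFFSET)

print(naive_ratio)
print(round(shifted_ratio, 3))
```

Measured from absolute zero, today comes out only about 6.5% warmer than yesterday, not twice as warm.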

  • Convert the values in fahrenheit_degrees so that absolute zero is at the value 0. If you think this is already the case, don't change anything. Assign the result to degrees_zero.
  • Convert the values in yearly_town_population so that absolute zero is at the value 0. If you think this is already the case, don't change anything. Assign the result to population_zero.
fahrenheit_degrees = [32, 64, 78, 102]
yearly_town_population = [100,102,103,110,105,120]
population_zero = yearly_town_population
degrees_zero = [f + 459.67 for f in fahrenheit_degrees]

tips
[f + 459.67 for f in fahrenheit_degrees]


4. Working With Ordinal Scales

So far, we've looked at equal interval and discrete scales, where all of the values are numbers. We can also have ordinal scales, where items are ordered by rank.

For example, we could ask people how many cigarettes they smoke per day, and the answers could be "none," "a few," "some," or "a lot." These answers don't map exactly to numbers of cigarettes, but we know that "a few" is more than "none."

This is an ordinal rating scale. We can assign numbers to the answers in a logical order to make them easier to work with.

For example, we could map 0 to "none," 1 to "a few," 2 to "some," and so on.

  • In the following code block, assign a number to each survey response that corresponds with its position on the scale ("none" is 0, and so on).
  • Compute the average value of all the survey responses, and assign it to average_smoking.

# Results from our survey on how many cigarettes people smoke per day
survey_responses = ["none", "some", "a lot", "none", "a few", "none", "none"]

survey_scale = ["none", "a few", "some", "a lot"]
survey_numbers = [survey_scale.index(response) for response in survey_responses]
average_smoking = sum(survey_numbers) / len(survey_numbers)

tips
[survey_scale.index(response) for response in survey_responses]


5. Grouping Values with Categorical Scales

We can also have categorical scales, which group values into general categories.

One example is gender, which can be male or female.

Unlike ordinal scales, categorical scales don't have an order. In our gender example, for instance, one category isn't greater than or less than the other.

Categories are common in data science. You'll typically use them to split data into groups.

  • Compute the average savings for everyone who is "male". Assign the result to male_savings.
  • Compute the average savings for everyone who is "female". Assign the result to female_savings.
# Let's say that these lists are both columns in a matrix.
# Index 0 is the first row in both, and so on.
gender = ["male", "female", "female", "male", "male", "female"]
savings = [1200, 5000, 3400, 2400, 2800, 4100]

male_savings_list = [savings[i] for i in range(0, len(gender)) if gender[i] == "male"]

female_savings_list = [savings[i] for i in range(0, len(gender)) if gender[i] == "female"]

male_savings = sum(male_savings_list) / len(male_savings_list)
female_savings = sum(female_savings_list) / len(female_savings_list)
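
The same split-then-average pattern extends to any number of categories. The following sketch groups values with a dictionary. The helper name mean_by_category is an assumption made for illustration, not something from the original exercise.

```python
from collections import defaultdict


def mean_by_category(categories, values):
    """Return a dict mapping each category to the mean of its values."""
    groups = defaultdict(list)
    # Pair each category label with the value in the same row.
    for category, value in zip(categories, values):
        groups[category].append(value)
    return {category: sum(vals) / len(vals) for category, vals in groups.items()}


gender = ["male", "female", "female", "male", "male", "female"]
savings = [1200, 5000, 3400, 2400, 2800, 4100]

savings_by_gender = mean_by_category(gender, savings)
print(savings_by_gender)
```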

6. Visualizing Counts with Frequency Histograms

Remember how statistics is all about counting? A frequency histogram is a type of plot that helps us visualize counts of data.

These plots tally how many times each value occurs in a list, then graph the values on the x-axis and the counts on the y-axis.
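
The tallying step can be sketched without any plotting. This example uses collections.Counter from the Python standard library on a made-up list; the name observed_values is hypothetical.

```python
from collections import Counter

# Hypothetical observations; any list of hashable values would do.
observed_values = [3, 1, 3, 2, 3, 1]

# Counter maps each distinct value to the number of times it occurs.
counts = Counter(observed_values)

# The values would go on the x-axis and the counts on the y-axis.
for value in sorted(counts):
    print(value, counts[value])
```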

Frequency histograms give us a better understanding of where values fall within a data set.

  • Plot a histogram of student_scores.
# Let's say that we watch cars drive by and calculate average speed in miles per hour
average_speed = [10, 20, 25, 27, 28, 22, 15, 18, 17]
import matplotlib.pyplot as plt
plt.hist(average_speed)
plt.show()

# Let's say we measure student test scores from 0-100
student_scores = [15, 80, 95, 100, 45, 75, 65]

plt.hist(student_scores)
plt.show()

7. Aggregating Values with Histogram Bins

You may have noticed that the code on the last screen plotted all of the values.

In contrast, histograms use bins to count values. Bins aggregate values into predefined "buckets."

Here's how they work. If the x-axis ranges from 0 to 10 and we have 10 bins, the first bin would be for values from 0 to 1, the second would be for values from 1 to 2, and so on.

If we have five bins, the first bin would be for values from 0 to 2, the second would be for values from 2 to 4, and so on.

Each value in the list that falls within the bin would increase the bin's count by one. The result looks like a bar chart. Bins give us a better understanding of the shape and distribution of the data than graphing each count individually.
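
The counting rule described above can be sketched in plain Python. This is a minimal illustration that assumes equal-width bins and that every value lies between low and high; the function name bin_counts is hypothetical, and a plotting library may handle edge cases differently.

```python
def bin_counts(values, n_bins, low, high):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Find the bin whose range contains v; a value equal to `high`
        # is placed in the last bin rather than past the end.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


average_speed = [10, 20, 25, 27, 28, 22, 15, 18, 17]

# Two bins spanning the data: 10 to 19 and 19 to 28.
print(bin_counts(average_speed, 2, min(average_speed), max(average_speed)))  # [4, 5]
```

With two bins, four speeds land in the lower bin and five in the upper one, so the nine values are summarized by just two counts.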

Now that you know about bins, we'd like to point something out about what you saw on the previous screen. matplotlib's default number of bins for a plot is 10. We had fewer values than that, so matplotlib displayed all of the values.

Let's experiment a bit with using different numbers of bins to gain a better understanding of how they work.

  • Plot a histogram of average_speed with only 2 bins.
average_speed = [10, 20, 25, 27, 28, 22, 15, 18, 17]
import matplotlib.pyplot as plt
plt.hist(average_speed, bins=6)
plt.show()

# As you can see, matplotlib places each value in the bin whose range contains it.
# If we have fewer bins, each bin will tend to have a higher count
# (because there will be fewer bins to group all of the values into).
# If there are more bins, the count for each one will tend to decrease,
# because each one will contain fewer values.
plt.hist(average_speed, bins=4)
plt.show()

plt.hist(average_speed, bins=2)
plt.show()