At its core, statistics is about counting and measuring.
In order to do both effectively, we have to define scales on which to base our counts. A scale represents the possible values that a variable can have.
Equal Interval Scales
On an equal interval scale, the distance between any two adjacent values is always the same.
Think of the speed of a car. No matter what speed you're traveling at, a difference of five miles per hour is always five miles per hour.
The difference between 60 and 55 miles per hour will always be equal to the difference between 10 and 5 miles per hour.
Logarithmic Scales
Each step on a logarithmic scale represents a different order of magnitude. The Richter scale that measures the strength of earthquakes, for example, is a logarithmic scale.
The difference between a 5 and a 6 on the Richter scale is more than the difference between a 4 and 5. This is because each number on the Richter scale represents 10 times the shaking amplitude of the previous number.
A 6 on the Richter scale is 10 times more powerful (technically, powerful is the wrong term, but it makes thinking about this easier) than a 5, which is 10 times more powerful than a 4. A 6 is 100 times more powerful than a 4.
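As a rough numeric illustration of the paragraph above, the sketch below treats each whole-number step as a tenfold increase in shaking amplitude. The helper name amplitude_ratio is hypothetical, and the model ignores the finer details of how real earthquake magnitudes are computed.

```python
def amplitude_ratio(magnitude_a, magnitude_b):
    """Return how many times larger the shaking amplitude at magnitude_a is
    than at magnitude_b, assuming each whole step is a factor of 10."""
    return 10 ** (magnitude_a - magnitude_b)


six_vs_five = amplitude_ratio(6, 5)
five_vs_four = amplitude_ratio(5, 4)
six_vs_four = amplitude_ratio(6, 4)

# The absolute gap in (relative) amplitude grows as we move up the scale,
# even though each step is "one unit" on the logarithmic scale.
gap_five_to_six = 10 ** 6 - 10 ** 5
gap_four_to_five = 10 ** 5 - 10 ** 4

print(six_vs_five, five_vs_four, six_vs_four)
print(gap_five_to_six > gap_four_to_five)
```

Equal steps on the scale correspond to equal ratios, not equal differences, which is why the gap between 5 and 6 is larger than the gap between 4 and 5.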
We can calculate the mean of the values on an equal interval scale by adding those values, and then dividing by the total number of values.
We could do the same for the values on a non-equal interval scale, but the results wouldn't be meaningful, due to the differences between units.
- Compute the mean of car_speeds, and assign the result to mean_car_speed.
- Compute the mean of earthquake_intensities, and assign the result to mean_earthquake_intensities. Note that this value will not be meaningful, because we shouldn't average values on a logarithmic scale this way.
car_speeds = [10,20,30,50,20]
earthquake_intensities = [2,7,4,5,8]
mean_car_speed = sum(car_speeds) / len(car_speeds)
mean_earthquake_intensities = sum(earthquake_intensities) / len(earthquake_intensities)
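One way to see why the second mean is not meaningful is to convert each value back to a linear quantity before averaging. The sketch below assumes each value is the base-10 logarithm of a relative shaking amplitude (a simplification), and the names naive_mean, linear_amplitudes, and mean_on_linear_scale are hypothetical.

```python
import math

earthquake_intensities = [2, 7, 4, 5, 8]

# Naive mean taken directly on the logarithmic values.
naive_mean = sum(earthquake_intensities) / len(earthquake_intensities)

# Assumption: each value is log10 of a relative amplitude,
# so 10 ** value puts it back on a linear scale.
linear_amplitudes = [10 ** value for value in earthquake_intensities]
mean_amplitude = sum(linear_amplitudes) / len(linear_amplitudes)

# Convert the averaged amplitude back to the logarithmic scale.
mean_on_linear_scale = math.log10(mean_amplitude)

print(naive_mean)
print(round(mean_on_linear_scale, 2))
```

The two results differ by about two whole steps on the scale, because the largest earthquakes dominate once the values are expressed linearly.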
2. Discrete and Continuous Scales
Scales can be either discrete or continuous.
Think of someone marking down the number of inches a snail crawls every day. The snail could crawl 1 inch, 2 inches, 1.5 inches, 1.51 inches, or any other number, and it would be a valid observation. This is because inches are on a continuous scale, and even fractions of an inch are possible.
Now think of someone counting the number of cars in a parking lot each day. 1 car, 2 cars, and 10 cars are valid measurements, but 1.5 cars isn't valid.
Half of a car isn't a meaningful quantity, because cars are discrete. You can't have 52% of a car - you either have a car, or you don't.
You can still average items on discrete scales, though. You could say "1.75 cars use this parking lot each day, on average." Any daily value for number of cars, however, would need to be a whole number.
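The point about averaging discrete values can be checked with a few lines of Python. The daily counts below are hypothetical, chosen only so that the mean comes out to the 1.75 cars mentioned above.

```python
# Hypothetical daily counts of cars in a lot (discrete: whole numbers only).
daily_car_counts = [1, 2, 2, 2]

# Every individual observation is a whole number...
all_whole = all(isinstance(count, int) for count in daily_car_counts)

# ...but the mean of those observations does not have to be.
mean_cars = sum(daily_car_counts) / len(daily_car_counts)

print(all_whole, mean_cars)
```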
- Make a line plot with day_numbers on the x axis and snail_crawl_length on the y axis.
- Make a line plot with day_numbers on the x axis and cars_in_parking_lot on the y axis.
day_numbers = [1,2,3,4,5,6,7]
snail_crawl_length = [.5,2,5,10,1,.25,4]
cars_in_parking_lot = [5,6,4,2,1,7,8]
import matplotlib.pyplot as plt
plt.plot(day_numbers, snail_crawl_length)
plt.show()
plt.plot(day_numbers, cars_in_parking_lot)
plt.show()
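tips
import numpy as np
import matplotlib.pyplot as plt
# Sample the sine function at evenly spaced x values from 0 up to (but not including) 5
x = np.arange(0, 5, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.show()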
3. Understanding Scale Starting Points
Some scales use the zero value in different ways. Think of the number of cars in a parking lot.
Zero cars in the lot means that there are absolutely no cars at all, so absolute zero is at 0 cars. You can't have negative cars.
Now, think of degrees Fahrenheit.
Zero degrees doesn't mean that there isn't any warmth; the degree scale can also be negative, and absolute zero (when there is no warmth at all) is at -459.67 degrees.
Scales with absolute zero points that aren't at 0 don't enable us to take meaningful ratios. For example, if four cars parked in the lot yesterday and eight park today, I can safely say that twice as many cars are in the lot today.
However, if it was 32 degrees Fahrenheit yesterday, and it's 64 degrees today, I can't say that it's twice as warm today as yesterday.
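The ratio argument can be made concrete with a short sketch. It shifts each reading so that absolute zero sits at 0 before dividing; the helper name ratio_from_absolute_zero is hypothetical.

```python
ABSOLUTE_ZERO_F = -459.67  # absolute zero in degrees Fahrenheit


def ratio_from_absolute_zero(new_value, old_value, zero_point):
    """Ratio of two readings after shifting the scale so zero_point maps to 0."""
    return (new_value - zero_point) / (old_value - zero_point)


# Car counts already have their absolute zero at 0, so the plain ratio is meaningful.
cars_ratio = ratio_from_absolute_zero(8, 4, 0)

# Dividing raw Fahrenheit readings suggests "twice as warm"...
naive_temperature_ratio = 64 / 32

# ...but measured from absolute zero, the increase is only a few percent.
temperature_ratio = ratio_from_absolute_zero(64, 32, ABSOLUTE_ZERO_F)

print(cars_ratio, naive_temperature_ratio, round(temperature_ratio, 3))
```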
- Convert the values in fahrenheit_degrees so that absolute zero is at the value 0. If you think this is already the case, don't change anything. Assign the result to degrees_zero.
- Convert the values in yearly_town_population so that absolute zero is at the value 0. If you think this is already the case, don't change anything. Assign the result to population_zero.
fahrenheit_degrees = [32, 64, 78, 102]
yearly_town_population = [100,102,103,110,105,120]
population_zero = yearly_town_population
degrees_zero = [f + 459.67 for f in fahrenheit_degrees]
tips
[f + 459.67 for f in fahrenheit_degrees]
4. Working With Ordinal Scales
So far, we've looked at equal interval and discrete scales, where all of the values are numbers. We can also have ordinal scales, where items are ordered by rank.
For example, we could ask people how many cigarettes they smoke per day, and the answers could be "none," "a few," "some," or "a lot." These answers don't map exactly to numbers of cigarettes, but we know that "a few" is more than "none."
This is an ordinal rating scale. We can assign numbers to the answers in a logical order to make them easier to work with.
For example, we could map 0 to "none," 1 to "a few," 2 to "some," and so on.
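The mapping described above could be written as a dictionary from each answer to its rank. This is only a sketch: the names smoking_scale and answers are hypothetical, and the sample answers are made up for illustration.

```python
# Mapping from each answer to its rank on the ordinal scale.
smoking_scale = {"none": 0, "a few": 1, "some": 2, "a lot": 3}

# Hypothetical answers, used only for illustration.
answers = ["some", "none", "a lot", "a few"]

# Translate each answer into its rank.
ranks = [smoking_scale[answer] for answer in answers]

# Because the ranks are ordered, we can sort the answers from least to most.
answers_in_order = sorted(answers, key=smoking_scale.get)

print(ranks)
print(answers_in_order)
```

The numbers capture order only. The gap between "none" and "a few" is not necessarily the same size as the gap between "some" and "a lot."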
- In the following code block, assign a number to each survey response that corresponds with its position on the scale ("none" is 0, and so on).
- Compute the average value of all the survey responses, and assign it to average_smoking.
# Results from our survey on how many cigarettes people smoke per day
survey_responses = ["none", "some", "a lot", "none", "a few", "none", "none"]
survey_scale = ["none", "a few", "some", "a lot"]
survey_numbers = [survey_scale.index(response) for response in survey_responses]
average_smoking = sum(survey_numbers) / len(survey_numbers)
tips
[survey_scale.index(response) for response in survey_responses]
5. Grouping Values with Categorical Scales
We can also have categorical scales, which group values into general categories.
One example is gender, which can be male or female.
Unlike ordinal scales, categorical scales don't have an order. In our gender example, for instance, one category isn't greater than or less than the other.
Categories are common in data science. You'll typically use them to split data into groups.
- Compute the average savings for everyone who is "male". Assign the result to male_savings.
- Compute the average savings for everyone who is "female". Assign the result to female_savings.
# Let's say that these lists are both columns in a matrix.
# Index 0 is the first row in both, and so on.
gender = ["male", "female", "female", "male", "male", "female"]
savings = [1200, 5000, 3400, 2400, 2800, 4100]
male_savings_list = [savings[i] for i in range(0, len(gender)) if gender[i] == "male"]
female_savings_list = [savings[i] for i in range(0, len(gender)) if gender[i] == "female"]
male_savings = sum(male_savings_list) / len(male_savings_list)
female_savings = sum(female_savings_list) / len(female_savings_list)
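The same split can be written so that it works for any number of categories. The sketch below is one possible alternative, not the lesson's own solution; the names savings_by_gender and average_by_gender are hypothetical.

```python
from collections import defaultdict

gender = ["male", "female", "female", "male", "male", "female"]
savings = [1200, 5000, 3400, 2400, 2800, 4100]

# Collect the savings values that belong to each category.
savings_by_gender = defaultdict(list)
for category, amount in zip(gender, savings):
    savings_by_gender[category].append(amount)

# Average each group separately.
average_by_gender = {
    category: sum(amounts) / len(amounts)
    for category, amounts in savings_by_gender.items()
}

print(average_by_gender)
```

Pairing the two lists with zip avoids indexing by position, and the dictionary picks up new categories without any extra code.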
6. Visualizing Counts with Frequency Histograms
Remember how statistics is all about counting? A frequency histogram is a type of plot that helps us visualize counts of data.
These plots tally how many times each value occurs in a list, then graph the values on the x-axis and the counts on the y-axis.
Frequency histograms give us a better understanding of where values fall within a data set.
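The tallying step behind a frequency histogram can be sketched without any plotting library. The values below are hypothetical, and the text "bars" are only a stand-in for a real plot.

```python
from collections import Counter

# Hypothetical values, used only for illustration.
values = [10, 20, 20, 30, 10, 20]

# Count how many times each distinct value occurs.
value_counts = Counter(values)

# Print one row per value, with one "#" per occurrence.
for value in sorted(value_counts):
    print(value, "#" * value_counts[value])
```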
- Plot a histogram of student_scores.
# Let's say that we watch cars drive by and calculate average speed in miles per hour
average_speed = [10, 20, 25, 27, 28, 22, 15, 18, 17]
import matplotlib.pyplot as plt
plt.hist(average_speed)
plt.show()
# Let's say we measure student test scores from 0-100
student_scores = [15, 80, 95, 100, 45, 75, 65]
plt.hist(student_scores)
plt.show()
7. Aggregating Values with Histogram Bins
You may have noticed that the histograms on the last screen appeared to plot every value individually.
Normally, though, histograms use bins to count values. Bins aggregate values into predefined "buckets."
Here's how they work. If the x-axis ranges from 0 to 10 and we have 10 bins, the first bin would be for values from 0 to 1, the second would be for values from 1 to 2, and so on.
If we have five bins, the first bin would be for values from 0 to 2, the second would be for values from 2 to 4, and so on.
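The bin boundaries described above can be computed directly. This is a minimal sketch that assumes equal-width bins; the helper name bin_edges is hypothetical.

```python
def bin_edges(low, high, num_bins):
    """Return the num_bins + 1 boundaries of equal-width bins from low to high."""
    width = (high - low) / num_bins
    return [low + i * width for i in range(num_bins + 1)]


ten_bins = bin_edges(0, 10, 10)
five_bins = bin_edges(0, 10, 5)

print(ten_bins)
print(five_bins)
```

Each pair of neighboring edges marks one bin, so 10 bins need 11 edges and five bins need six.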
Each value in the list that falls within the bin would increase the bin's count by one. The result looks like a bar chart. Bins give us a better understanding of the shape and distribution of the data than graphing each count individually.
Now that you know about bins, we'd like to point something out about what you saw on the previous screen. matplotlib's default number of bins for a plot is 10. We had fewer values than that, so matplotlib displayed all of the values.
Let's experiment a bit with using different numbers of bins to gain a better understanding of how they work.
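To see how the number of bins changes the counts, here is a pure-Python sketch of the counting step. It assumes equal-width bins spanning the smallest to the largest value, with the largest value counted in the last bin. That should match what plt.hist does by default, but this is an illustration rather than matplotlib's implementation, and the helper name bin_counts is hypothetical.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for value in values:
        if width == 0:
            index = 0  # all values are identical, so use the first bin
        else:
            index = int((value - low) / width)
        # The largest value sits on the final edge, so it belongs to the last bin.
        index = min(index, num_bins - 1)
        counts[index] += 1
    return counts


average_speed = [10, 20, 25, 27, 28, 22, 15, 18, 17]

for num_bins in (6, 4, 2):
    print(num_bins, bin_counts(average_speed, num_bins))
```

With fewer bins, the same nine values are spread over fewer buckets, so the individual counts get larger.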
- Plot a histogram of average_speed with only 2 bins.
average_speed = [10, 20, 25, 27, 28, 22, 15, 18, 17]
import matplotlib.pyplot as plt
plt.hist(average_speed, bins=6)
plt.show()
# As you can see, matplotlib groups the values in the list into the nearest bins.
# If we have fewer bins, each bin will have a higher count
# (because there will be fewer bins to group all of the values into).
# If there are more bins, the total for each one will decrease,
# because each one will contain fewer values.
plt.hist(average_speed, bins=4)
plt.show()
plt.hist(average_speed, bins=2)
plt.show()