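Preface
When working with data, you often run into two terms, scaling and normalization. They are frequently used interchangeably, which used to confuse me whenever I tried to understand certain operations. So here I use the scale() function in R to explain what the two terms actually mean.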
Main text
One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data. Let's talk a little more in-depth about each of these options.
The conclusion first: scaling changes the range of your data, while normalization changes the shape of its distribution.
Concepts
Scale
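Scaling means transforming your data so that it fits within a specified range, such as 1-100 or 0-1. You need scaling when you use methods that depend on the magnitude of the values, that is, on distances between data points, such as SVM or KNN.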
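As a rough illustration of rescaling to a chosen range, here is a minimal min-max sketch. The helper `min_max_scale()` and the sample vector `heights_cm` are hypothetical names introduced for this example; they are not part of base R or of the original post.

```r
# Hypothetical helper (not part of base R): rescale a numeric vector
# linearly so that its minimum maps to `lower` and its maximum to `upper`.
min_max_scale <- function(v, lower = 0, upper = 1) {
  rng <- range(v, na.rm = TRUE)
  if (rng[1] == rng[2]) stop("cannot rescale a constant vector")
  (v - rng[1]) / (rng[2] - rng[1]) * (upper - lower) + lower
}

# Made-up example data
heights_cm <- c(150, 160, 170, 180, 190)

scaled_01  <- min_max_scale(heights_cm)                          # range 0-1
scaled_100 <- min_max_scale(heights_cm, lower = 1, upper = 100)  # range 1-100

print(scaled_01)
print(scaled_100)
```

Only the range changes here: the points keep their relative spacing, so the shape of the distribution is untouched.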
Normalization
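Scaling only changes the range of your data; normalization is a more radical transformation.
The point of normalization is to transform your data so that it approximately follows a normal distribution, which downstream analyses that assume normality require (t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes).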
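To make "changing the shape" concrete, the sketch below applies a log transform to simulated right-skewed data. The log transform is only one possible normalizing transformation, chosen here as an assumption for illustration, and `skewness()` is a hypothetical helper defined inline rather than a base R function.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Simulated right-skewed, strictly positive data (log-normal)
skewed <- rlnorm(1000, meanlog = 0, sdlog = 1)

# A log transform is one common way to pull such data toward a normal shape;
# it is only valid for positive values.
normalized <- log(skewed)

# Hypothetical helper: sample skewness (third standardized moment).
# Values near 0 indicate a roughly symmetric distribution.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_before <- skewness(skewed)
skew_after  <- skewness(normalized)

cat("skewness before:", skew_before, "\n")
cat("skewness after: ", skew_after, "\n")
```

The skewness drops toward zero after the transform, which is a change in shape that no amount of shifting or rescaling could produce.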
Working in R
First, look up the usage of the scale function in the R console:
?scale
## This displays the following description
The value of center determines how column centering is performed. If center is a numeric-alike vector with length equal to the number of columns of x, then each column of x has the corresponding value from center subtracted from it. If center is TRUE then centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns, and if center is FALSE, no centering is done.
The value of scale determines how column scaling is performed (after centering). If scale is a numeric-alike vector with length equal to the number of columns of x, then each column of x is divided by the corresponding value from scale. If scale is TRUE then scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise. If scale is FALSE, no scaling is done.
The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
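The distinction between the root mean square and the standard deviation is easy to miss, so here is a small check of the documented behaviour. The matrix `m` is made-up data; the last call uses the form suggested in the help text quoted above.

```r
# Made-up example matrix with two columns
m <- matrix(c(1, 2, 3, 4, 5,
              10, 20, 30, 40, 50), ncol = 2)

# center = FALSE, scale = TRUE: columns are divided by their root mean square
rms_scaled <- scale(m, center = FALSE, scale = TRUE)

# Root mean square as defined in the help text: sqrt(sum(x^2) / (n - 1))
rms <- apply(m, 2, function(col) sqrt(sum(col^2) / (length(col) - 1)))

# To divide by the standard deviation without centering, pass it explicitly
sd_scaled <- scale(m, center = FALSE, scale = apply(m, 2, sd, na.rm = TRUE))

print(attr(rms_scaled, "scaled:scale"))  # the root mean squares
print(attr(sd_scaled, "scaled:scale"))   # the standard deviations
```

The two printed vectors differ, which confirms that scale = TRUE only divides by the standard deviation when the columns are centered first.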
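The help page also shows that the function is called as scale(x, center = TRUE, scale = TRUE), where x is a numeric matrix and center and scale each accept TRUE, FALSE, or a numeric vector. An example makes this concrete.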
> x <- matrix(1:20, ncol = 4)
> x
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
> scale(x, center = T, scale = T)
[,1] [,2] [,3] [,4]
[1,] -1.2649111 -1.2649111 -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555 -0.6324555 -0.6324555
[3,] 0.0000000 0.0000000 0.0000000 0.0000000
[4,] 0.6324555 0.6324555 0.6324555 0.6324555
[5,] 1.2649111 1.2649111 1.2649111 1.2649111
attr(,"scaled:center")
[1] 3 8 13 18
attr(,"scaled:scale")
[1] 1.581139 1.581139 1.581139 1.581139
> scale(x, center = T, scale = F)
[,1] [,2] [,3] [,4]
[1,] -2 -2 -2 -2
[2,] -1 -1 -1 -1
[3,] 0 0 0 0
[4,] 1 1 1 1
[5,] 2 2 2 2
attr(,"scaled:center")
[1] 3 8 13 18
> scale(x, center = T, scale = F)/sd(scale(x, center = T, scale = F)[1:5])
[,1] [,2] [,3] [,4]
[1,] -1.2649111 -1.2649111 -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555 -0.6324555 -0.6324555
[3,] 0.0000000 0.0000000 0.0000000 0.0000000
[4,] 0.6324555 0.6324555 0.6324555 0.6324555
[5,] 1.2649111 1.2649111 1.2649111 1.2649111
attr(,"scaled:center")
[1] 3 8 13 18
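Dividing by the standard deviation of the first column alone, as in the last command, works only because all four columns of x happen to have the same standard deviation. A more general sketch handles each column separately with sweep(); the matrix `y` below is made-up data with different spreads per column.

```r
# Made-up matrix whose columns have different means and spreads
y <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(10, 30, 20, 80, 60),
           c = c(0.1, 0.5, 0.2, 0.9, 0.3))

col_means <- colMeans(y)
col_sds   <- apply(y, 2, sd)

# Step 1: center each column by subtracting its own mean
centered <- sweep(y, 2, col_means, FUN = "-")

# Step 2: divide each centered column by its own standard deviation
manual <- sweep(centered, 2, col_sds, FUN = "/")

# Compare with the built-in result, ignoring the "scaled:*" attributes
builtin <- scale(y, center = TRUE, scale = TRUE)
print(all.equal(manual, builtin, check.attributes = FALSE))
```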
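From this we can see that scale() actually does two things, centering and scaling. Centering subtracts each column's mean; scaling then divides the centered data by that column's standard deviation. Together they give the z-score, (x - mean(x)) / sd(x), so each column ends up with mean 0 and standard deviation 1. Note that this standardizes the data but does not turn it into a normal distribution: the shape of the distribution stays the same. The plots below show each step of the transformation.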
data <- runif(100, min = 10, max = 100)
plot(1:100, data)
plot(1:100, scale(data, center = T, scale = F))
plot(1:100, scale(data, center = T, scale = T))
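The plots can be backed up with a quick numerical check. This is a minimal sketch that regenerates the uniform data with a fixed seed (the seed is an addition for reproducibility, not part of the original code) and inspects what scale() did to it.

```r
set.seed(1)  # assumption: fixed seed so the numbers are reproducible
data <- runif(100, min = 10, max = 100)

# scale() returns a one-column matrix; flatten it to a plain vector
z <- as.vector(scale(data, center = TRUE, scale = TRUE))

# After centering and scaling: mean 0 and standard deviation 1
cat("mean:", mean(z), " sd:", sd(z), "\n")

# The transformation is linear with positive slope, so the standardized
# values are perfectly correlated with the originals: every point keeps
# its relative position, and uniform data stays uniform.
cat("correlation with original:", cor(data, z), "\n")
```

Only the axis labels of the three plots differ; the scatter of points looks the same in each, which is exactly what a perfect correlation implies.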
Conclusion
The center and scale arguments of R's scale() function have to be used correctly in order to process your data properly.