In machine learning (e.g., KNN, SVM, DT) and quantitative genetics, variables often come in different units, so their numeric ranges can differ greatly, and an algorithm may end up giving variables with large values disproportionately large weight. For this reason, all variables are usually preprocessed beforehand.
The two approaches you will see most often are Normalization and Standardization. (Figure: data before and after preprocessing.)
Normalization and Standardization
Normalization rescales every variable to the range [0, 1].
Formula: X' = (X - X_min) / (X_max - X_min)
Standardization rescales every variable to have mean 0 and standard deviation 1 (unit variance).
Formula: X' = (X - μ) / σ, where μ is the variable's mean and σ is its standard deviation.
(Figure: the same data after Normalization and after Standardization.)
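To make the two formulas concrete, here is a minimal sketch on a made-up three-value feature (the array x below is purely illustrative):

# a tiny worked example of both formulas
import numpy as np
x = np.array([2.0, 4.0, 6.0])
# Normalization: (x - min) / (max - min)  ->  [0.0, 0.5, 1.0]
x_norm = (x - x.min()) / (x.max() - x.min())
# Standardization: (x - mean) / std  ->  approximately [-1.22, 0.00, 1.22]
x_stand = (x - x.mean()) / x.std()

Note that np.std defaults to the population standard deviation (ddof=0), which is also what sklearn's StandardScaler uses.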
When to Use Which
Normalization:
Normalization is a good choice when you know your data does not follow a Gaussian distribution. It is useful for algorithms that make no assumptions about the distribution of the data, such as K-nearest neighbors and neural networks. Note, however, that normalization is easily affected by outliers.
Standardization:
Standardization can be helpful when the data follows a Gaussian distribution, although this does not have to be strictly true. Also, unlike normalization, standardization has no bounding range, so even if there are outliers in the data, they distort standardization far less than they distort normalization (see the sketch below).
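As a quick illustration of the outlier point (the data below is made up purely for this sketch): with one extreme value, MinMaxScaler crushes all the normal values into a thin sliver near 0, while StandardScaler keeps them near the mean and simply assigns the outlier a large z-score:

# effect of a single outlier on both scalers (toy data)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
print(MinMaxScaler().fit_transform(x).ravel())
# -> [0.     0.0101 0.0202 0.0303 1.    ]  normal values squeezed near 0
print(StandardScaler().fit_transform(x).ravel())
# -> roughly [-0.54 -0.51 -0.49 -0.46  2.00]  the bulk stays near the mean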
At the end of the day, though, the choice depends on your problem and on the machine learning algorithm you are using. There is no hard rule that tells you when to normalize and when to standardize your data. A good starting point is to fit your model to the raw, normalized, and standardized versions of the data and compare their performance.
In practice, it is best to fit the scaler on the training data and then use it to transform the testing data. This avoids any data leakage during the model testing process. Also, scaling of the target values is generally not required.
Python Code
Normalization using sklearn
# data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler
# fit scaler on training data
norm = MinMaxScaler().fit(X_train)
# transform training data
X_train_norm = norm.transform(X_train)
# transform testing data
X_test_norm = norm.transform(X_test)
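Note that because norm was fit only on the training data, values in X_test_norm can fall slightly outside [0, 1] whenever the test set contains values beyond the training minimum or maximum; this is expected and is the price of avoiding data leakage.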
Standardization using sklearn (the one I use most often)
# data standardization with sklearn
from sklearn.preprocessing import StandardScaler
# copy of datasets
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()
# numerical features
num_cols = ['Item_Weight','Item_Visibility','Item_MRP','Outlet_Establishment_Year']
# apply standardization on numerical features
for i in num_cols:
    # fit on the training data column
    scale = StandardScaler().fit(X_train_stand[[i]])
    # transform the training data column
    X_train_stand[i] = scale.transform(X_train_stand[[i]])
    # transform the testing data column
    X_test_stand[i] = scale.transform(X_test_stand[[i]])
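The per-column loop above works fine; as an alternative, sklearn's ColumnTransformer can standardize the whole list of numerical columns in one step. A minimal sketch, assuming the same X_train, X_test, and num_cols as above:

# equivalent standardization via ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
ct = ColumnTransformer([('stand', StandardScaler(), num_cols)],
                       remainder='passthrough')
# fit on the training data only, then transform both sets
X_train_ct = ct.fit_transform(X_train)
X_test_ct = ct.transform(X_test)

One caveat: the output is a NumPy array with the scaled columns first and the passthrough columns after them, so the column order differs from the original DataFrame.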
Scaling combined with KNN, SVR, and DT
All of the following examples are implemented in Python.
K-Nearest Neighbours
# training a KNN model
from sklearn.neighbors import KNeighborsRegressor
# measuring RMSE score
from sklearn.metrics import mean_squared_error
# numpy and pandas are used below for RMSE and the result tables
import numpy as np
import pandas as pd
# knn
knn = KNeighborsRegressor(n_neighbors=7)
rmse = []
# raw, normalized and standardized training and testing data
trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]
# model fitting and measuring RMSE
for i in range(len(trainX)):
    # fit
    knn.fit(trainX[i], y_train)
    # predict
    pred = knn.predict(testX[i])
    # RMSE
    rmse.append(np.sqrt(mean_squared_error(y_test, pred)))
# visualizing the result
df_knn = pd.DataFrame({'RMSE':rmse},index=['Original','Normalized','Standardized'])
df_knn
Support Vector Regressor
# training an SVR model
from sklearn.svm import SVR
# measuring RMSE score
from sklearn.metrics import mean_squared_error
# SVR
svr = SVR(kernel='rbf',C=5)
rmse = []
# raw, normalized and standardized training and testing data
trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]
# model fitting and measuring RMSE
for i in range(len(trainX)):
    # fit
    svr.fit(trainX[i], y_train)
    # predict
    pred = svr.predict(testX[i])
    # RMSE
    rmse.append(np.sqrt(mean_squared_error(y_test, pred)))
# visualizing the result
df_svr = pd.DataFrame({'RMSE':rmse},index=['Original','Normalized','Standardized'])
df_svr
Decision Tree
# training a Decision Tree model
from sklearn.tree import DecisionTreeRegressor
# measuring RMSE score
from sklearn.metrics import mean_squared_error
# Decision tree
dt = DecisionTreeRegressor(max_depth=10,random_state=27)
rmse = []
# raw, normalized and standardized training and testing data
trainX = [X_train,X_train_norm,X_train_stand]
testX = [X_test,X_test_norm,X_test_stand]
# model fitting and measuring RMSE
for i in range(len(trainX)):
    # fit
    dt.fit(trainX[i], y_train)
    # predict
    pred = dt.predict(testX[i])
    # RMSE
    rmse.append(np.sqrt(mean_squared_error(y_test, pred)))
# visualizing the result
df_dt = pd.DataFrame({'RMSE':rmse},index=['Original','Normalized','Standardized'])
df_dt
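Finally, since the three result tables share the same index, they can be concatenated for a side-by-side comparison (a small convenience sketch using the df_knn, df_svr, and df_dt frames built above):

# side-by-side RMSE comparison of the three models
df_all = pd.concat([df_knn, df_svr, df_dt], axis=1)
df_all.columns = ['KNN', 'SVR', 'Decision Tree']
df_all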