1. Business Understanding
The goal is to predict diabetes from a set of possible risk factors.
2. Data Understanding and Preparation
The data comes with the MASS package as two data frames, Pima.tr and Pima.te, which share the same structure. We combine them and then split the result into training and test sets.
library(MASS)
data("Pima.tr")
data("Pima.te")
Pima <- rbind(Pima.tr,Pima.te)
str(Pima)
'data.frame': 532 obs. of 8 variables:
 $ glu  : int 86 195 77 165 107 97 83 193 142 128 ...
 $ skin : int 28 33 41 43 25 27 31 16 15 37 ...
 $ ped  : num 0.364 0.163 0.156 0.259 0.133 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
The seven predictors are npreg, glu, bp, skin, bmi, ped and age; type is the response variable. We explore the relationship between each predictor and the response with box plots.
library(reshape2)
library(ggplot2)
Pima.melt <- melt(Pima,id.vars = "type")
head(Pima.melt)
type variable value
1 No npreg 5
2 Yes npreg 7
3 No npreg 5
4 No npreg 0
5 No npreg 0
6 Yes npreg 5
ggplot(data = Pima.melt,aes(x=type,y=value))+geom_boxplot()+facet_wrap(~variable,ncol = 3)
Because all facets share the same y-axis scale, the relationship between some features and the response is hard to see, so we standardize the data first. Note that scale() returns a matrix, which we convert back to a data frame.
Pima.scale <- data.frame(scale(Pima[,-8]))
Pima.scale$type <- Pima$type
Pima.scale.melt <- melt(Pima.scale,id.vars = "type")
ggplot(data = Pima.scale.melt,aes(x=type,y=value))+geom_boxplot()+facet_wrap(~variable,ncol = 3)
After standardization the box plots are much easier to read. Next we look at the correlations between the predictors.
Pima.cor <- cor(Pima[,-8])
library(corrplot)
corrplot.mixed(Pima.cor)
Two pairs of variables stand out as clearly correlated; the sketch below shows one way to list them programmatically.
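One way (a sketch; the 0.5 threshold is arbitrary and can be adjusted) to pull out the most correlated predictor pairs:
high.cor <- which(abs(Pima.cor) > 0.5 & upper.tri(Pima.cor), arr.ind = TRUE)   # indices of strongly correlated pairs
data.frame(var1 = rownames(Pima.cor)[high.cor[, 1]],
           var2 = colnames(Pima.cor)[high.cor[, 2]],
           cor  = round(Pima.cor[high.cor], 3))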
Split the data into training and test sets in roughly a 70/30 ratio.
set.seed(222)
ind <- sample(2,nrow(Pima.scale),replace = TRUE,prob = c(0.7,0.3))
train <- Pima.scale[ind==1,]
test <- Pima.scale[ind==2,]
prop.table(table(train$type))
No Yes
0.6667 0.3333
prop.table(table(test$type))
No Yes
0.6687 0.3312
The class proportions in the training and test sets are well balanced.
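As an alternative (a sketch; the idx, train2 and test2 names are made up for illustration), createDataPartition() from the caret package draws a stratified sample, so the class proportions stay balanced by construction:
library(caret)
set.seed(222)
idx <- createDataPartition(Pima.scale$type, p = 0.7, list = FALSE)   # stratified 70% sample of row indices
train2 <- Pima.scale[idx, ]
test2  <- Pima.scale[-idx, ]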
3. Modeling
library(caret)
grid1 <- expand.grid(.k = seq(2,20,by=1))
control <- trainControl(method = "cv")
set.seed(123)
knn.train <- train(type~.,data=train,method="knn",trControl=control,tuneGrid=grid1)
knn.train
k-Nearest Neighbors
372 samples
7 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 335, 335, 336, 334, 334, 335, ...
Resampling results across tuning parameters:
k Accuracy Kappa
2 0.7233 0.3645
3 0.7310 0.3649
4 0.7176 0.3325
5 0.7501 0.4086
6 0.7554 0.4239
7 0.7633 0.4352
8 0.7447 0.3905
9 0.7581 0.4078
10 0.7716 0.4491
11 0.7740 0.4477
12 0.7606 0.4121
13 0.7711 0.4333
14 0.7687 0.4284
15 0.7714 0.4331
16 0.7687 0.4298
17 0.7823 0.4618
18 0.7716 0.4362
19 0.7687 0.4220
20 0.7687 0.4245
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 17.
Besides identifying k = 17, the output reports the accuracy and the Kappa statistic for each candidate k, and cross-validation also yields standard deviations for these estimates.
Accuracy is the percentage of observations the model classifies correctly. The Kappa statistic, also known as Cohen's kappa, is commonly used to measure how consistently two raters classify the same observations; it adjusts accuracy by removing the agreement that would be expected by chance alone.
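It also helps to look at the tuning results graphically; caret provides a plot method for train objects that draws the cross-validated accuracy against k.
plot(knn.train)   # cross-validated accuracy for each candidate k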
library(class)
knn.test <- knn(train[,-8],test[,-8],train[,8],k=17)
table(knn.test,test$type)
knn.test No Yes
No 89 23
Yes 18 30
Compute the accuracy:
(89+30)/(length(test$type))
[1] 0.7438
Compute the Kappa statistic:
prob.agree <- (89+30)/length(test$type)                                      # observed agreement (the accuracy)
prob.chance <- ((89+23)/length(test$type)) * ((89+18)/length(test$type))     # chance agreement based on the "No" class
kappa <- (prob.agree - prob.chance)/(1 - prob.chance)
kappa
[1] 0.5182
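The chance-agreement term above is a simplified hand calculation that only accounts for the "No" class. For reference, here is a minimal sketch of a general two-class Cohen's kappa computed straight from the confusion table (the cohen.kappa helper name is made up, and its value can differ somewhat from the simplified calculation); confusionMatrix() from caret, used in section 4 below, reports the same statistic automatically.
cohen.kappa <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement over both classes
  (po - pe) / (1 - pe)
}
cohen.kappa(table(knn.test, test$type))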
The 74% accuracy on the test set is slightly lower than the roughly 78% achieved during cross-validation on the training set. The Kappa statistic indicates only moderate agreement, so it is worth checking whether weighted nearest neighbors can do better.
Weighted nearest neighbors increases the influence of neighbors close to the observation and decreases the influence of those farther away, as the small sketch below illustrates.
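A toy sketch (with made-up distances) of what the weighting does: the rectangular kernel gives every neighbor the same vote, while the triangular kernel lets the weight fall off linearly with distance.
d <- c(0.2, 0.5, 0.9, 1.3, 1.8)     # hypothetical distances to the 5 nearest neighbors
w.rect <- rep(1, length(d))         # rectangular: all neighbors count equally
w.tri  <- pmax(1 - d/max(d), 0)     # triangular: closer neighbors get larger weights
rbind(rectangular = w.rect, triangular = w.tri)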
To use weighted nearest neighbors we need the train.kknn() function from the kknn package.
Several kernels are available for weighting the neighbors: rectangular (unweighted), triangular, epanechnikov, biweight, triweight, cosine, inversion, gaussian, rank and optimal.
The function requires us to specify the maximum number of neighbors kmax, the distance (1 for absolute/Manhattan distance, 2 for Euclidean distance) and the kernel(s) to evaluate.
library(kknn)
kknn.train <- train.kknn(type~.,data = train,kmax = 25,distance = 2,kernel = c("rectangular","triangular","epanechnikov"))
plot(kknn.train)
kknn.train
Call:
train.kknn(formula = type ~ ., data = train, kmax = 25, distance = 2, kernel = c("rectangular", "triangular", "epanechnikov"))
Type of response variable: nominal
Minimal misclassification: 0.2204
Best kernel: rectangular
Best k: 17
kknn.pre <- predict(kknn.train,newdata = test)
table(kknn.pre,test$type)
kknn.pre No Yes
No 89 23
Yes 18 30
Weighted KNN does not improve the test accuracy, so let's try a support vector classifier from the e1071 package.
library(e1071)
linear.tune <- tune.svm(type~.,data = train,kernel="linear",cost = c(0.001,0.01,0.1,1,5,10))
summary(linear.tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
cost
0.01
- best performance: 0.2038
- Detailed performance results:
cost error dispersion
1 1e-03 0.3332 0.08046
2 1e-02 0.2038 0.08909
3 1e-01 0.2145 0.09143
4 1e+00 0.2227 0.10206
5 5e+00 0.2200 0.10418
6 1e+01 0.2200 0.10418
The lowest cross-validated misclassification error is 0.2038, achieved at cost = 0.01.
best.linear <- linear.tune$best.model
tune.test <- predict(best.linear,newdata=test)
table(tune.test,test$type)
tune.test No Yes
No 91 21
Yes 16 32
(91+32)/length(test$type)
[1] 0.7688
From these results, the linear support vector classifier performs slightly better than KNN on both the training and the test data. We can now tune other kernels and parameters to see whether the classifier improves further.
Polynomial kernel
set.seed(123)
poly.tune <- tune.svm(type~.,data = train,kernel="polynomial",degree = c(3,4,5),coef0 = c(0.1,0.5,1,2,3,4))
summary(poly.tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
degree coef0
3 0.1
- best performance: 0.2203
- Detailed performance results:
degree coef0 error dispersion
1 3 0.1 0.2203 0.07207
2 4 0.1 0.2474 0.07761
3 5 0.1 0.2608 0.08146
4 3 0.5 0.2284 0.05476
5 4 0.5 0.2555 0.06379
6 5 0.5 0.2555 0.05643
7 3 1.0 0.2418 0.05847
8 4 1.0 0.2637 0.05712
9 5 1.0 0.2797 0.05795
10 3 2.0 0.2472 0.06177
11 4 2.0 0.2634 0.06655
12 5 2.0 0.2956 0.06526
13 3 3.0 0.2526 0.05659
14 4 3.0 0.2767 0.07292
15 5 3.0 0.3173 0.07616
16 3 4.0 0.2553 0.05380
17 4 4.0 0.2847 0.07139
18 5 4.0 0.3090 0.08261
best.poly <- poly.tune$best.model
poly.test <- predict(best.poly,newdata=test)
table(poly.test,test$type)
poly.test No Yes
No 95 25
Yes 12 28
(96+28)/length(test$type)
[1] 0.775
This model is very close to the linear classifier, better by less than one percentage point.
Radial basis kernel
set.seed(123)
rbf.tune <- tune.svm(type~.,data = train,kernel="radial",gamma = c(0.1,0.5,1,2,3,4))
summary(rbf.tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
gamma
0.1
- best performance: 0.2176
- Detailed performance results:
gamma error dispersion
1 0.1 0.2176 0.07112
2 0.5 0.2770 0.06471
3 1.0 0.2850 0.05404
4 2.0 0.3255 0.07611
5 3.0 0.3335 0.06680
6 4.0 0.3335 0.06043
best.rbf <- rbf.tune$best.model
rbf.test <- predict(best.rbf,newdata=test)
table(rbf.test,test$type)
rbf.test No Yes
No 91 22
Yes 16 31
(91+31)/length(test$type)
[1] 0.7625
The nonlinear model with the radial basis kernel has almost the same accuracy as the linear model.
Finally, we try kernel = "sigmoid".
set.seed(123)
sigmoid.tune <- tune.svm(type~.,data = train,kernel="sigmoid",gamma = c(0.1,0.5,1,2,3,4),coef0 = c(0.1,0.5,1,2,3,4))
summary(sigmoid.tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
gamma coef0
0.1 2
- best performance: 0.2122
- Detailed performance results:
gamma coef0 error dispersion
1 0.1 0.1 0.2477 0.06425
2 0.5 0.1 0.3282 0.07592
3 1.0 0.1 0.3471 0.08787
4 2.0 0.1 0.3417 0.08288
5 3.0 0.1 0.3255 0.07471
6 4.0 0.1 0.3336 0.07862
7 0.1 0.5 0.2448 0.03811
8 0.5 0.5 0.3094 0.05560
9 1.0 0.5 0.3579 0.08863
10 2.0 0.5 0.3496 0.07714
11 3.0 0.5 0.3388 0.08018
12 4.0 0.5 0.3417 0.07783
13 0.1 1.0 0.2447 0.03925
14 0.5 1.0 0.3144 0.08107
15 1.0 1.0 0.3147 0.06340
16 2.0 1.0 0.3553 0.08192
17 3.0 1.0 0.3550 0.08607
18 4.0 1.0 0.3390 0.08845
19 0.1 2.0 0.2122 0.06160
20 0.5 2.0 0.3331 0.07392
21 1.0 2.0 0.3280 0.04768
22 2.0 2.0 0.3174 0.07044
23 3.0 2.0 0.3470 0.07020
24 4.0 2.0 0.3524 0.08106
25 0.1 3.0 0.3335 0.06043
26 0.5 3.0 0.3413 0.07951
27 1.0 3.0 0.3254 0.04201
28 2.0 3.0 0.2989 0.10412
29 3.0 3.0 0.3336 0.07987
30 4.0 3.0 0.3417 0.08923
31 0.1 4.0 0.3335 0.06043
32 0.5 4.0 0.3226 0.06815
33 1.0 4.0 0.3444 0.05540
34 2.0 4.0 0.3469 0.06984
35 3.0 4.0 0.3712 0.09065
36 4.0 4.0 0.3175 0.08330
best.sigmoid <- sigmoid.tune$best.model
sigmoid.test <- predict(best.sigmoid,newdata=test)
table(sigmoid.test,test$type)
sigmoid.test No Yes
No 83 17
Yes 24 36
(83+36)/length(test$type)
[1] 0.7438
4. Model Selection
We compare the linear, radial and sigmoid support vector models on the test set with confusionMatrix() from the caret package, treating "Yes" as the positive class.
confusionMatrix(tune.test,test$type,positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 91 21
Yes 16 32
Accuracy : 0.769
95% CI : (0.696, 0.832)
No Information Rate : 0.669
P-Value [Acc > NIR] : 0.00377
Kappa : 0.465
Mcnemar's Test P-Value : 0.51080
Sensitivity : 0.604
Specificity : 0.850
Pos Pred Value : 0.667
Neg Pred Value : 0.812
Prevalence : 0.331
Detection Rate : 0.200
Detection Prevalence : 0.300
Balanced Accuracy : 0.727
'Positive' Class : Yes
confusionMatrix(rbf.test,test$type,positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 91 22
Yes 16 31
Accuracy : 0.762
95% CI : (0.689, 0.826)
No Information Rate : 0.669
P-Value [Acc > NIR] : 0.00634
Kappa : 0.448
Mcnemar's Test P-Value : 0.41730
Sensitivity : 0.585
Specificity : 0.850
Pos Pred Value : 0.660
Neg Pred Value : 0.805
Prevalence : 0.331
Detection Rate : 0.194
Detection Prevalence : 0.294
Balanced Accuracy : 0.718
'Positive' Class : Yes
confusionMatrix(sigmoid.test,test$type,positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 83 17
Yes 24 36
Accuracy : 0.744
95% CI : (0.669, 0.809)
No Information Rate : 0.669
P-Value [Acc > NIR] : 0.0249
Kappa : 0.44
Mcnemar's Test P-Value : 0.3487
Sensitivity : 0.679
Specificity : 0.776
Pos Pred Value : 0.600
Neg Pred Value : 0.830
Prevalence : 0.331
Detection Rate : 0.225
Detection Prevalence : 0.375
Balanced Accuracy : 0.727
'Positive' Class : Yes
Comparing the three models, the first one, the linear support vector classifier, still appears to be the best choice. The sketch below collects the test-set accuracies of all the models fitted above into one table.
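A compact recap (a sketch that reuses the prediction objects created above; the acc helper is made up for this example):
acc <- function(pred) mean(pred == test$type)   # test-set accuracy of a prediction vector
data.frame(model = c("knn", "kknn", "svm.linear", "svm.poly", "svm.radial", "svm.sigmoid"),
           accuracy = c(acc(knn.test), acc(kknn.pre), acc(tune.test),
                        acc(poly.test), acc(rbf.test), acc(sigmoid.test)))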