Q1
一、某地29名13岁儿童身高(cm),体重(kg)和肺活量(L)数据见data,求:
(1)由身高,体重推算肺活量的回归方程;
(2)求出的方程是否有意义;
(3)剩余标准差
(1).
> setwd("C:/Users/My/Desktop/hw6/")
> data6_1 <- read.csv("homework-6.1-data.csv",header = T, sep = ",")
> data6_1
number Height weight vital.capacity
1 1 135.1 32.0 1.75
2 2 139.9 30.4 2.00
3 3 163.6 46.2 2.75
4 4 146.5 33.5 2.50
5 5 156.2 37.1 2.75
6 6 156.4 35.5 2.00
7 7 167.8 41.5 2.75
8 8 149.7 31.0 1.50
9 9 145.0 33.0 2.50
10 10 148.5 37.2 2.25
11 11 165.5 49.5 3.00
12 12 135.0 27.6 1.25
13 13 153.3 41.0 2.75
14 14 152.0 32.0 1.75
15 15 160.5 47.2 2.25
16 16 153.0 32.0 1.75
17 17 147.6 40.5 2.00
18 18 157.5 43.3 2.25
19 19 155.1 44.7 2.75
20 20 160.5 37.5 2.00
21 21 143.0 31.5 1.75
22 22 149.4 33.9 2.25
23 23 160.8 40.4 2.75
24 24 159.0 38.5 2.50
25 25 158.2 37.5 2.00
26 26 150.0 36.0 1.75
27 27 144.5 34.7 2.25
28 28 154.6 39.5 2.50
29 29 156.5 32.0 1.75
> lm.reg <- lm(data6_1$vital.capacity~data6_1$Height+data6_1$weight)
> summary(lm.reg)
Call:
lm(formula = data6_1$vital.capacity ~ data6_1$Height + data6_1$weight)
Residuals:
Min 1Q Median 3Q Max
-0.54117 -0.25524 -0.00266 0.22039 0.55425
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.565664 1.240127 -0.456 0.65208
data6_1$Height 0.005017 0.010575 0.474 0.63920
data6_1$weight 0.054061 0.015984 3.382 0.00228 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3137 on 26 degrees of freedom
Multiple R-squared: 0.546, Adjusted R-squared: 0.511
F-statistic: 15.63 on 2 and 26 DF, p-value: 3.485e-05
所以由身高(x1),体重(x2)推算的肺活量(y)的回归方程为
y= -0.5657 + 0.0050x1 + 0.0541
(2)因为F=15.63,p=3.485e-05<0.01,所以方程有意义。
(3)剩余标准差: 0.3137
Q2
二、某农场通过试验取得早稻收获量与春季降雨量和春季温度的数据如下:
收获量y(kg/mm2) 降雨量x1(mm) 温度x2(℃)
建立早稻收获量对春季降雨量和春季温度的二元线性回归方程,计算各回归系数的置信区间,并对回归模型的线性关系和回归系数进行检验(α=0.05)。
> data_2 <- read.table("homework-6.2-data.txt",header = T)
> fit2 <- lm(y~x1+x2,data = data_2)
> summary(fit2)
Call:
lm(formula = y ~ x1 + x2, data = data_2)
Residuals:
1 2 3 4 5 6 7
-275.101 90.464 216.483 140.280 150.676 -316.599 -6.203
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.591 505.004 -0.001 0.9991
x1 22.387 9.601 2.332 0.0801 .
x2 327.672 98.798 3.317 0.0295 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 261.4 on 4 degrees of freedom
Multiple R-squared: 0.9913, Adjusted R-squared: 0.987
F-statistic: 228.4 on 2 and 4 DF, p-value: 7.532e-05
所以回归方程: y = -0.591 + 22.387x1 + 327.672x2
> confint(fit2, level = 0.95)
2.5 % 97.5 %
(Intercept) -1402.707516 1401.52552
x1 -4.268921 49.04184
x2 53.364699 601.97873
降雨量x1的置信区间为(-4.268921,49.04184),含义是在温度不变的条件下,降雨量每变动1mm,收获量的平均变动在-4.268921到49.04184 kg/mm2之间。
温度x2的置信区间为(53.364699,601.97873),含义是在降雨量不变的条件下,温度每变动1℃,收获量的平均变动在53.364699到601.97873 kg/mm2之间。
线性关系检验是检验因变量y与k个自变量之间的关系是否显著,也称总体显著性检验。根据以上R输出结果,检验统计量F= 228.4,显著水平P= 7.532e-05< 0.05,拒绝H0,即收获量y与降雨量x1和温度x2之间的线性关系显著。
要判断每个自变量对因变量的影响是否都显著,需要对各回归系数βi分别进行t检验。根据R输出结果,降雨量x1和温度x2的回归系数相应的显著水平分别为0.0801和0.0295,只有温度对应的显著性水平小于0.05通过检验,这表明影响收获量的自变量中,只有温度对收获量的影响显著,而降雨量对收获量的影响不显著。
Q3_1
三、某葡萄酒爱好者想探索葡萄酒的品质与哪些因素相关。他有一个数据集包含了(1 -固定酸度,2 -挥发性酸度,3 -柠檬酸,4 -残余糖,5 -氯化物,6 -自由二氧化硫量,7 -二氧化硫总量,8 -密度,9 - pH值,10 -硫酸盐,11 -酒精浓度,和12 -品质(0 - 10分)。
1.查看数据集的前五行和数据集的总结
2.通过直方图展示固定酸度的分布和展示挥发性酸度与品质的散点图
3.计算这些变量与品质的相关性
4.通过方差分析不同品质的葡萄酒的酒精浓度是否有差异
5.通过多元线性回归建立一个品质预测模型,并说明哪些变量与品质显著相关。
> read.csv("homework-6.3-winequality-red.csv",header = T)
> head(data_3,n = 5)
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 7.4 0.70 0.00 1.9 0.076
2 7.8 0.88 0.00 2.6 0.098
3 7.8 0.76 0.04 2.3 0.092
4 11.2 0.28 0.56 1.9 0.075
5 7.4 0.70 0.00 1.9 0.076
free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
1 11 34 0.9978 3.51 0.56 9.4
2 25 67 0.9968 3.20 0.68 9.8
3 15 54 0.9970 3.26 0.65 9.8
4 17 60 0.9980 3.16 0.58 9.8
5 11 34 0.9978 3.51 0.56 9.4
quality
1 5
2 5
3 5
4 6
5 5
# 对数据集的汇总
> summary(data_3)
fixed.acidity volatile.acidity citric.acid residual.sugar
Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
chlorides free.sulfur.dioxide total.sulfur.dioxide density
Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
pH sulphates alcohol quality
Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
Median :3.310 Median :0.6200 Median :10.20 Median :6.000
Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Q3_2
# 通过直方图展示固定酸度的分布
> hist(data_3$fixed.acidity,main = "the distribution of fixed acidity",xlab = "fixed.acidity")
# 展示挥发性酸度与品质的散点图
> plot(data_3$quality,data_3$volatile.acidity,main = "quality vs volatile acidity", ylab = "volatile.acidity", xlab = "quality" )
Q3-3
#计算变量与品质的相关性
> apply(data_3,2,function(x)cor(x, data_3$quality))
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
0.12405165 -0.39055778 0.22637251 0.01373164 -0.12890656
free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
-0.05065606 -0.18510029 -0.17491923 -0.05773139 0.25139708
alcohol quality
0.47616632 1.00000000
Q3_4
> summary(aov(alcohol~quality, data = data_3))
Df Sum Sq Mean Sq F value Pr(>F)
quality 1 411.5 411.5 468.3 <2e-16 ***
Residuals 1597 1403.3 0.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# p < 0.05, 不同品质的葡萄酒的酒精浓度有较显著的差异
Q3_5
> summary(lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,data = data_3))
Call:
lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
density + pH + sulphates + alcohol, data = data_3)
Residuals:
Min 1Q Median 3Q Max
-2.68911 -0.36652 -0.04699 0.45202 2.02498
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.197e+01 2.119e+01 1.036 0.3002
fixed.acidity 2.499e-02 2.595e-02 0.963 0.3357
volatile.acidity -1.084e+00 1.211e-01 -8.948 < 2e-16 ***
citric.acid -1.826e-01 1.472e-01 -1.240 0.2150
residual.sugar 1.633e-02 1.500e-02 1.089 0.2765
chlorides -1.874e+00 4.193e-01 -4.470 8.37e-06 ***
free.sulfur.dioxide 4.361e-03 2.171e-03 2.009 0.0447 *
total.sulfur.dioxide -3.265e-03 7.287e-04 -4.480 8.00e-06 ***
density -1.788e+01 2.163e+01 -0.827 0.4086
pH -4.137e-01 1.916e-01 -2.159 0.0310 *
sulphates 9.163e-01 1.143e-01 8.014 2.13e-15 ***
alcohol 2.762e-01 2.648e-02 10.429 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.648 on 1587 degrees of freedom
Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
与红酒品质显著相关的变量有:2 -挥发性酸度,5 -氯化物,6 -自由二氧化硫量,7 -二氧化硫总量,9 - pH值,10 -硫酸盐,11 -酒精浓度。
Q4
> data_4 <- read.table("homework-6.4-data.txt",header = T, sep = ",")
> data_4
Income Age y
1 45000 2 0
2 40000 4 0
3 60000 3 1
4 50000 2 1
5 55000 2 0
6 50000 5 1
7 35000 7 1
8 65000 2 1
9 53000 2 0
10 48000 1 0
11 37000 5 1
12 31000 7 1
13 40000 4 1
14 75000 2 0
15 43000 9 1
16 49000 2 0
17 37500 4 1
18 71000 1 0
19 34000 5 0
20 27000 6 0
> fit <- glm(y~Income+Age, family = binomial(), data = data_4)
> summary(fit)
Call:
glm(formula = y ~ Income + Age, family = binomial(), data = data_4)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5635 -0.8045 -0.1397 0.9535 1.7915
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.047e+00 4.674e+00 -1.508 0.132
Income 7.382e-05 6.371e-05 1.159 0.247
Age 9.879e-01 5.274e-01 1.873 0.061 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 21.082 on 17 degrees of freedom
AIC: 27.082
Number of Fisher Scoring iterations: 5
> coef(fit)
(Intercept) Income Age
-7.047061e+00 7.381679e-05 9.878861e-01
所以回归系数: β0=−7.047,β1=0.00007382,β2=0.9879
回归方程为:
> predictdata <- data.frame(Income=c(45000), Age = c(5))
> predictdata$prob <- predict(fit,newdata = predictdata,type = "response")
> predictdata
Income Age prob
1 45000 5 0.7710279
所以预测的概率为 0.7710279
Q5_1
> data_5 <- read.csv("homework-6.5-Drivers.csv",header = T, sep = ",")
> head(data_5)
x1 x2 x3 y
1 1 17 1 1
2 1 44 0 0
3 1 48 1 0
4 1 55 0 0
5 1 75 1 1
6 0 35 0 1
> log.glm <- glm(y~x1+x2+x3,data = data_5, family = "binomial")
> summary(log.glm)
Call:
glm(formula = y ~ x1 + x2 + x3, family = "binomial", data = data_5)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5636 -0.9131 -0.7892 0.9637 1.6000
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.597610 0.894831 0.668 0.5042
x1 -1.496084 0.704861 -2.123 0.0338 *
x2 -0.001595 0.016758 -0.095 0.9242
x3 0.315865 0.701093 0.451 0.6523
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 62.183 on 44 degrees of freedom
Residual deviance: 57.026 on 41 degrees of freedom
AIC: 65.026
Number of Fisher Scoring iterations: 4
由此得到初步的logistic回归模型:
Q5_2
在上述模型中, 由于参数β2,β3没有通过检验, 即只有视力状况对是否发生事故有显著影响
去除非影响因素,可类似于线性模型, 用step( )做变量筛选.
# 通过step()函数进行变量筛选
> log.step <- step(log.glm)
Start: AIC=65.03
y ~ x1 + x2 + x3
Df Deviance AIC
- x2 1 57.035 63.035
- x3 1 57.232 63.232
<none> 57.026 65.026
- x1 1 61.936 67.936
Step: AIC=63.03
y ~ x1 + x3
Df Deviance AIC
- x3 1 57.241 61.241
<none> 57.035 63.035
- x1 1 61.991 65.991
Step: AIC=61.24
y ~ x1
Df Deviance AIC
<none> 57.241 61.241
- x1 1 62.183 64.183
# 变量筛选后进行汇总
> summary(log.step)
Call:
glm(formula = y ~ x1, family = "binomial", data = data_5)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4490 -0.8782 -0.8782 0.9282 1.5096
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6190 0.4688 1.320 0.1867
x1 -1.3728 0.6353 -2.161 0.0307 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 62.183 on 44 degrees of freedom
Residual deviance: 57.241 on 43 degrees of freedom
AIC: 61.241
Number of Fisher Scoring iterations: 4
得到的新的回归方程为:
Q5_3
# 输入两位需要投保的司机的数据
> predictdata <- data.frame(person=c("A","B"),x1=c(0,1),x2=c(50,20),x3=c(1,0))
> predictdata
person x1 x2 x3
1 A 0 50 1
2 B 1 20 0
# 应用上面拟合出的回归方程进行预测
> predictdata$prob <- predict(object = log.step, newdata = predictdata, type = "response")
> predictdata
person x1 x2 x3 prob
1 A 0 50 1 0.65
2 B 1 20 0 0.32
所以A、B两者明年出事故的概率分别为0.65和0.32
只有B出事故的概率低于40 %, 所以只有B可以投保
Q6
> data_6 <- read.csv("homework-6.6-data.csv",header = T, sep = ",")
> head(data_6)
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
1 842302 M 17.99 10.38 122.80 1001.0 0.11840
2 842517 M 20.57 17.77 132.90 1326.0 0.08474
3 84300903 M 19.69 21.25 130.00 1203.0 0.10960
4 84348301 M 11.42 20.38 77.58 386.1 0.14250
5 84358402 M 20.29 14.34 135.10 1297.0 0.10030
6 843786 M 12.45 15.70 82.57 477.1 0.12780
compactness_mean concavity_mean concave.points_mean symmetry_mean
1 0.27760 0.3001 0.14710 0.2419
2 0.07864 0.0869 0.07017 0.1812
3 0.15990 0.1974 0.12790 0.2069
4 0.28390 0.2414 0.10520 0.2597
5 0.13280 0.1980 0.10430 0.1809
6 0.17000 0.1578 0.08089 0.2087
fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se
1 0.07871 1.0950 0.9053 8.589 153.40 0.006399
2 0.05667 0.5435 0.7339 3.398 74.08 0.005225
3 0.05999 0.7456 0.7869 4.585 94.03 0.006150
4 0.09744 0.4956 1.1560 3.445 27.23 0.009110
5 0.05883 0.7572 0.7813 5.438 94.44 0.011490
6 0.07613 0.3345 0.8902 2.217 27.19 0.007510
compactness_se concavity_se concave.points_se symmetry_se fractal_dimension_se
1 0.04904 0.05373 0.01587 0.03003 0.006193
2 0.01308 0.01860 0.01340 0.01389 0.003532
3 0.04006 0.03832 0.02058 0.02250 0.004571
4 0.07458 0.05661 0.01867 0.05963 0.009208
5 0.02461 0.05688 0.01885 0.01756 0.005115
6 0.03345 0.03672 0.01137 0.02165 0.005082
radius_worst texture_worst perimeter_worst area_worst smoothness_worst
1 25.38 17.33 184.60 2019.0 0.1622
2 24.99 23.41 158.80 1956.0 0.1238
3 23.57 25.53 152.50 1709.0 0.1444
4 14.91 26.50 98.87 567.7 0.2098
5 22.54 16.67 152.20 1575.0 0.1374
6 15.47 23.75 103.40 741.6 0.1791
compactness_worst concavity_worst concave.points_worst symmetry_worst
1 0.6656 0.7119 0.2654 0.4601
2 0.1866 0.2416 0.1860 0.2750
3 0.4245 0.4504 0.2430 0.3613
4 0.8663 0.6869 0.2575 0.6638
5 0.2050 0.4000 0.1625 0.2364
6 0.5249 0.5355 0.1741 0.3985
fractal_dimension_worst
1 0.11890
2 0.08902
3 0.08758
4 0.17300
5 0.07678
6 0.12440
> fit <- glm(data_6$diagnosis ~ data_6$radius_mean + data_6$texture_mean + data_6$perimeter_mean + data_6$area_mean + data_6$smoothness_mean + data_6$compactness_mean + data_6$concavity_mean + data_6$concave.points_mean + data_6$symmetry_mean + data_6$fractal_dimension_mean, family = "binomial")
Warning message:
glm.fit:拟合機率算出来是数值零或一
> summary(fit)
Call:
glm(formula = data_6$diagnosis ~ data_6$radius_mean + data_6$texture_mean +
data_6$perimeter_mean + data_6$area_mean + data_6$smoothness_mean +
data_6$compactness_mean + data_6$concavity_mean + data_6$concave.points_mean +
data_6$symmetry_mean + data_6$fractal_dimension_mean, family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.95590 -0.14839 -0.03943 0.00429 2.91690
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.35952 12.85259 -0.573 0.5669
data_6$radius_mean -2.04930 3.71588 -0.551 0.5813
data_6$texture_mean 0.38473 0.06454 5.961 2.5e-09 ***
data_6$perimeter_mean -0.07151 0.50516 -0.142 0.8874
data_6$area_mean 0.03980 0.01674 2.377 0.0174 *
data_6$smoothness_mean 76.43227 31.95492 2.392 0.0168 *
data_6$compactness_mean -1.46242 20.34249 -0.072 0.9427
data_6$concavity_mean 8.46870 8.12003 1.043 0.2970
data_6$concave.points_mean 66.82176 28.52910 2.342 0.0192 *
data_6$symmetry_mean 16.27824 10.63059 1.531 0.1257
data_6$fractal_dimension_mean -68.33703 85.55666 -0.799 0.4244
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 751.44 on 568 degrees of freedom
Residual deviance: 146.13 on 558 degrees of freedom
AIC: 168.13
Number of Fisher Scoring iterations: 9
Q6_2
> reduced_feature <- c("texture_mean","area_mean","smoothness_mean","concave.points_mean")
> formula <- paste("diagnosis",paste(reduced_feature,collapse = "+") ,sep = "~")
> model <- glm(formula, data_6, family = "binomial")
Warning message:
glm.fit:拟合機率算出来是数值零或一
> summary(model)
Call:
glm(formula = formula, family = "binomial", data = data_6)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.31798 -0.15623 -0.04212 0.01662 2.84201
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -23.677816 3.882774 -6.098 1.07e-09 ***
texture_mean 0.362687 0.060544 5.990 2.09e-09 ***
area_mean 0.010342 0.002002 5.165 2.40e-07 ***
smoothness_mean 59.471304 25.965153 2.290 0.022 *
concave.points_mean 76.571210 16.427864 4.661 3.15e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 751.44 on 568 degrees of freedom
Residual deviance: 156.44 on 564 degrees of freedom
AIC: 166.44
Number of Fisher Scoring iterations: 8
Q6_3
> anova(fit, model, test = "Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: data_6$diagnosis
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 568 751.44
data_6$radius_mean 1 421.43 567 330.01 < 2.2e-16 ***
data_6$texture_mean 1 38.89 566 291.12 4.489e-10 ***
data_6$perimeter_mean 1 72.23 565 218.90 < 2.2e-16 ***
data_6$area_mean 1 7.53 564 211.37 0.0060655 **
data_6$smoothness_mean 1 42.14 563 169.22 8.481e-11 ***
data_6$compactness_mean 1 0.04 562 169.18 0.8359415
data_6$concavity_mean 1 13.22 561 155.96 0.0002771 ***
data_6$concave.points_mean 1 6.72 560 149.24 0.0095192 **
data_6$symmetry_mean 1 2.46 559 146.78 0.1166476
data_6$fractal_dimension_mean 1 0.65 558 146.13 0.4213577
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Q6_4
> trainData <- data_6[1:398,]
> testData <- data_6[399:569,]
> reduced_feature <- c("texture_mean","area_mean","smoothness_mean","concave.points_mean")
> formula <- paste("diagnosis",paste(reduced_feature,collapse = "+"),sep = "~")
> model <- glm(formula, trainData, family = "binomial")
> summary(model)
Call:
glm(formula = formula, family = "binomial", data = trainData)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.39278 -0.14454 -0.02447 0.03635 2.60665
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -27.47397 4.74798 -5.786 7.19e-09 ***
texture_mean 0.46244 0.08434 5.483 4.19e-08 ***
area_mean 0.01082 0.00235 4.606 4.11e-06 ***
smoothness_mean 90.11221 30.96961 2.910 0.003618 **
concave.points_mean 59.01212 17.51779 3.369 0.000755 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 544.93 on 397 degrees of freedom
Residual deviance: 108.30 on 393 degrees of freedom
AIC: 118.3
Number of Fisher Scoring iterations: 8
> predictdata <- predict(model, testData,type = "response")
> pred_num <- ifelse(predictdata > 0.5,1,0)
> y_pred <- factor(pred_num,levels = c(0,1))
> y_act <- factor(ifelse(testData$diagnosis=="B",0,1))
> sum(y_pred==y_act)
[1] 155
> 155/171
[1] 0.9064327