【R<-练习】多元线性回归与逻辑回归

Q1

一、某地29名13岁儿童身高(cm),体重(kg)和肺活量(L)数据见data,求:
(1)由身高,体重推算肺活量的回归方程;
(2)求出的方程是否有意义;
(3)剩余标准差

(1).

> setwd("C:/Users/My/Desktop/hw6/")
>  data6_1 <- read.csv("homework-6.1-data.csv",header = T, sep = ",")
>  data6_1
   number Height weight vital.capacity
1       1  135.1   32.0           1.75
2       2  139.9   30.4           2.00
3       3  163.6   46.2           2.75
4       4  146.5   33.5           2.50
5       5  156.2   37.1           2.75
6       6  156.4   35.5           2.00
7       7  167.8   41.5           2.75
8       8  149.7   31.0           1.50
9       9  145.0   33.0           2.50
10     10  148.5   37.2           2.25
11     11  165.5   49.5           3.00
12     12  135.0   27.6           1.25
13     13  153.3   41.0           2.75
14     14  152.0   32.0           1.75
15     15  160.5   47.2           2.25
16     16  153.0   32.0           1.75
17     17  147.6   40.5           2.00
18     18  157.5   43.3           2.25
19     19  155.1   44.7           2.75
20     20  160.5   37.5           2.00
21     21  143.0   31.5           1.75
22     22  149.4   33.9           2.25
23     23  160.8   40.4           2.75
24     24  159.0   38.5           2.50
25     25  158.2   37.5           2.00
26     26  150.0   36.0           1.75
27     27  144.5   34.7           2.25
28     28  154.6   39.5           2.50
29     29  156.5   32.0           1.75
> lm.reg <- lm(data6_1$vital.capacity~data6_1$Height+data6_1$weight)
> summary(lm.reg)

Call:
lm(formula = data6_1$vital.capacity ~ data6_1$Height + data6_1$weight)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54117 -0.25524 -0.00266  0.22039  0.55425 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    -0.565664   1.240127  -0.456  0.65208   
data6_1$Height  0.005017   0.010575   0.474  0.63920   
data6_1$weight  0.054061   0.015984   3.382  0.00228 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3137 on 26 degrees of freedom
Multiple R-squared:  0.546, Adjusted R-squared:  0.511 
F-statistic: 15.63 on 2 and 26 DF,  p-value: 3.485e-05

所以由身高(x1),体重(x2)推算的肺活量(y)的回归方程为

y= -0.5657 + 0.0050x1 + 0.0541

(2)因为F=15.63,p=3.485e-05<0.01,所以方程有意义。

(3)剩余标准差: 0.3137

Q2

二、某农场通过试验取得早稻收获量与春季降雨量和春季温度的数据如下:
收获量y(kg/mm2) 降雨量x1(mm) 温度x2(℃)

建立早稻收获量对春季降雨量和春季温度的二元线性回归方程,计算各回归系数的置信区间,并对回归模型的线性关系和回归系数进行检验(α=0.05)。

> data_2 <- read.table("homework-6.2-data.txt",header = T)
> fit2 <- lm(y~x1+x2,data = data_2)
> summary(fit2)

Call:
lm(formula = y ~ x1 + x2, data = data_2)

Residuals:
       1        2        3        4        5        6        7 
-275.101   90.464  216.483  140.280  150.676 -316.599   -6.203 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -0.591    505.004  -0.001   0.9991  
x1            22.387      9.601   2.332   0.0801 .
x2           327.672     98.798   3.317   0.0295 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 261.4 on 4 degrees of freedom
Multiple R-squared:  0.9913,    Adjusted R-squared:  0.987 
F-statistic: 228.4 on 2 and 4 DF,  p-value: 7.532e-05

所以回归方程: y = -0.591 + 22.387x1 + 327.672x2

> confint(fit2, level = 0.95)
                   2.5 %     97.5 %
(Intercept) -1402.707516 1401.52552
x1             -4.268921   49.04184
x2             53.364699  601.97873

降雨量x1的置信区间为(-4.268921,49.04184),含义是在温度不变的条件下,降雨量每变动1mm,收获量的平均变动在-4.268921到49.04184 kg/mm2之间。
温度x2的置信区间为(53.364699,601.97873),含义是在降雨量不变的条件下,温度每变动1℃,收获量的平均变动在53.364699到601.97873 kg/mm2之间。

线性关系检验是检验因变量y与k个自变量之间的关系是否显著,也称总体显著性检验。根据以上R输出结果,检验统计量F= 228.4,显著水平P= 7.532e-05< 0.05,拒绝H0,即收获量y与降雨量x1和温度x2之间的线性关系显著。

要判断每个自变量对因变量的影响是否都显著,需要对各回归系数βi分别进行t检验。根据R输出结果,降雨量x1和温度x2的回归系数相应的显著水平分别为0.0801和0.0295,只有温度对应的显著性水平小于0.05通过检验,这表明影响收获量的自变量中,只有温度对收获量的影响显著,而降雨量对收获量的影响不显著。

Q3_1

三、某葡萄酒爱好者想探索葡萄酒的品质与哪些因素相关。他有一个数据集包含了(1 -固定酸度,2 -挥发性酸度,3 -柠檬酸,4 -残余糖,5 -氯化物,6 -自由二氧化硫量,7 -二氧化硫总量,8 -密度,9 - pH值,10 -硫酸盐,11 -酒精浓度,和12 -品质(0 - 10分)。
1.查看数据集的前五行和数据集的总结
2.通过直方图展示固定酸度的分布和展示挥发性酸度与品质的散点图
3.计算这些变量与品质的相关性
4.通过方差分析不同品质的葡萄酒的酒精浓度是否有差异
5.通过多元线性回归建立一个品质预测模型,并说明哪些变量与品质显著相关。

数据从此处下载

> read.csv("homework-6.3-winequality-red.csv",header = T)
> head(data_3,n = 5)

  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1           7.4             0.70        0.00            1.9     0.076
2           7.8             0.88        0.00            2.6     0.098
3           7.8             0.76        0.04            2.3     0.092
4          11.2             0.28        0.56            1.9     0.075
5           7.4             0.70        0.00            1.9     0.076
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
1                  11                   34  0.9978 3.51      0.56     9.4
2                  25                   67  0.9968 3.20      0.68     9.8
3                  15                   54  0.9970 3.26      0.65     9.8
4                  17                   60  0.9980 3.16      0.58     9.8
5                  11                   34  0.9978 3.51      0.56     9.4
  quality
1       5
2       5
3       5
4       6
5       5

# 对数据集的汇总
> summary(data_3)
 fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
 Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
 1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
 Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
 Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
 3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
 Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
   chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
 Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
 1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
 Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
 Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
 3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
 Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
       pH          sulphates         alcohol         quality     
 Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
 1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
 Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
 Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
 3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
 Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000 

Q3_2

# 通过直方图展示固定酸度的分布
> hist(data_3$fixed.acidity,main = "the distribution of fixed acidity",xlab = "fixed.acidity")
直方图
# 展示挥发性酸度与品质的散点图
> plot(data_3$quality,data_3$volatile.acidity,main = "quality vs volatile acidity", ylab = "volatile.acidity", xlab = "quality" )
散点图

Q3-3

#计算变量与品质的相关性
> apply(data_3,2,function(x)cor(x, data_3$quality))
       fixed.acidity     volatile.acidity          citric.acid       residual.sugar            chlorides 
          0.12405165          -0.39055778           0.22637251           0.01373164          -0.12890656 
 free.sulfur.dioxide total.sulfur.dioxide              density                   pH            sulphates 
         -0.05065606          -0.18510029          -0.17491923          -0.05773139           0.25139708 
             alcohol              quality 
          0.47616632           1.00000000 

Q3_4

> summary(aov(alcohol~quality, data = data_3))
              Df Sum Sq Mean Sq F value Pr(>F)    
quality        1  411.5   411.5   468.3 <2e-16 ***
Residuals   1597 1403.3     0.9                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# p < 0.05, 不同品质的葡萄酒的酒精浓度有较显著的差异


Q3_5

> summary(lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,data = data_3))

Call:
lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
    residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    density + pH + sulphates + alcohol, data = data_3)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.68911 -0.36652 -0.04699  0.45202  2.02498 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
density              -1.788e+01  2.163e+01  -0.827   0.4086    
pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.648 on 1587 degrees of freedom
Multiple R-squared:  0.3606,    Adjusted R-squared:  0.3561 
F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

与红酒品质显著相关的变量有:2 -挥发性酸度,5 -氯化物,6 -自由二氧化硫量,7 -二氧化硫总量,9 - pH值,10 -硫酸盐,11 -酒精浓度。

Q4

> data_4 <- read.table("homework-6.4-data.txt",header = T, sep = ",")
> data_4
   Income Age y
1   45000   2 0
2   40000   4 0
3   60000   3 1
4   50000   2 1
5   55000   2 0
6   50000   5 1
7   35000   7 1
8   65000   2 1
9   53000   2 0
10  48000   1 0
11  37000   5 1
12  31000   7 1
13  40000   4 1
14  75000   2 0
15  43000   9 1
16  49000   2 0
17  37500   4 1
18  71000   1 0
19  34000   5 0
20  27000   6 0
> fit <- glm(y~Income+Age, family = binomial(), data = data_4)
> summary(fit)

Call:
glm(formula = y ~ Income + Age, family = binomial(), data = data_4)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5635  -0.8045  -0.1397   0.9535   1.7915  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -7.047e+00  4.674e+00  -1.508    0.132  
Income       7.382e-05  6.371e-05   1.159    0.247  
Age          9.879e-01  5.274e-01   1.873    0.061 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.726  on 19  degrees of freedom
Residual deviance: 21.082  on 17  degrees of freedom
AIC: 27.082

Number of Fisher Scoring iterations: 5

> coef(fit)
  (Intercept)        Income           Age 
-7.047061e+00  7.381679e-05  9.878861e-01

所以回归系数: β0=−7.047,β1=0.00007382,β2=0.9879
回归方程为:


> predictdata <- data.frame(Income=c(45000), Age = c(5))
> predictdata$prob <- predict(fit,newdata = predictdata,type = "response") 
> predictdata
  Income Age      prob
1  45000   5 0.7710279

所以预测的概率为 0.7710279

Q5_1

> data_5 <- read.csv("homework-6.5-Drivers.csv",header = T, sep = ",")
> head(data_5)
  x1 x2 x3 y
1  1 17  1 1
2  1 44  0 0
3  1 48  1 0
4  1 55  0 0
5  1 75  1 1
6  0 35  0 1
> log.glm <- glm(y~x1+x2+x3,data = data_5, family = "binomial")
> summary(log.glm)

Call:
glm(formula = y ~ x1 + x2 + x3, family = "binomial", data = data_5)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5636  -0.9131  -0.7892   0.9637   1.6000  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.597610   0.894831   0.668   0.5042  
x1          -1.496084   0.704861  -2.123   0.0338 *
x2          -0.001595   0.016758  -0.095   0.9242  
x3           0.315865   0.701093   0.451   0.6523  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 62.183  on 44  degrees of freedom
Residual deviance: 57.026  on 41  degrees of freedom
AIC: 65.026

Number of Fisher Scoring iterations: 4

由此得到初步的logistic回归模型:


image.png

Q5_2

在上述模型中, 由于参数β2,β3没有通过检验, 即只有视力状况对是否发生事故有显著影响
去除非影响因素,可类似于线性模型, 用step( )做变量筛选.

# 通过step()函数进行变量筛选
> log.step <- step(log.glm)
Start:  AIC=65.03
y ~ x1 + x2 + x3

       Df Deviance    AIC
- x2    1   57.035 63.035
- x3    1   57.232 63.232
<none>      57.026 65.026
- x1    1   61.936 67.936

Step:  AIC=63.03
y ~ x1 + x3

       Df Deviance    AIC
- x3    1   57.241 61.241
<none>      57.035 63.035
- x1    1   61.991 65.991

Step:  AIC=61.24
y ~ x1

       Df Deviance    AIC
<none>      57.241 61.241
- x1    1   62.183 64.183

# 变量筛选后进行汇总
> summary(log.step)

Call:
glm(formula = y ~ x1, family = "binomial", data = data_5)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4490  -0.8782  -0.8782   0.9282   1.5096  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.6190     0.4688   1.320   0.1867  
x1           -1.3728     0.6353  -2.161   0.0307 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 62.183  on 44  degrees of freedom
Residual deviance: 57.241  on 43  degrees of freedom
AIC: 61.241

Number of Fisher Scoring iterations: 4

得到的新的回归方程为:


Q5_3

# 输入两位需要投保的司机的数据
> predictdata <- data.frame(person=c("A","B"),x1=c(0,1),x2=c(50,20),x3=c(1,0))
> predictdata
  person x1 x2 x3
1      A  0 50  1
2      B  1 20  0
# 应用上面拟合出的回归方程进行预测
> predictdata$prob <- predict(object = log.step, newdata = predictdata, type = "response")
> predictdata
  person x1 x2 x3 prob
1      A  0 50  1 0.65
2      B  1 20  0 0.32

所以A、B两者明年出事故的概率分别为0.65和0.32
只有B出事故的概率低于40 %, 所以只有B可以投保

Q6

> data_6 <- read.csv("homework-6.6-data.csv",header = T, sep = ",")
> head(data_6)
        id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
1   842302         M       17.99        10.38         122.80    1001.0         0.11840
2   842517         M       20.57        17.77         132.90    1326.0         0.08474
3 84300903         M       19.69        21.25         130.00    1203.0         0.10960
4 84348301         M       11.42        20.38          77.58     386.1         0.14250
5 84358402         M       20.29        14.34         135.10    1297.0         0.10030
6   843786         M       12.45        15.70          82.57     477.1         0.12780
  compactness_mean concavity_mean concave.points_mean symmetry_mean
1          0.27760         0.3001             0.14710        0.2419
2          0.07864         0.0869             0.07017        0.1812
3          0.15990         0.1974             0.12790        0.2069
4          0.28390         0.2414             0.10520        0.2597
5          0.13280         0.1980             0.10430        0.1809
6          0.17000         0.1578             0.08089        0.2087
  fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se
1                0.07871    1.0950     0.9053        8.589  153.40      0.006399
2                0.05667    0.5435     0.7339        3.398   74.08      0.005225
3                0.05999    0.7456     0.7869        4.585   94.03      0.006150
4                0.09744    0.4956     1.1560        3.445   27.23      0.009110
5                0.05883    0.7572     0.7813        5.438   94.44      0.011490
6                0.07613    0.3345     0.8902        2.217   27.19      0.007510
  compactness_se concavity_se concave.points_se symmetry_se fractal_dimension_se
1        0.04904      0.05373           0.01587     0.03003             0.006193
2        0.01308      0.01860           0.01340     0.01389             0.003532
3        0.04006      0.03832           0.02058     0.02250             0.004571
4        0.07458      0.05661           0.01867     0.05963             0.009208
5        0.02461      0.05688           0.01885     0.01756             0.005115
6        0.03345      0.03672           0.01137     0.02165             0.005082
  radius_worst texture_worst perimeter_worst area_worst smoothness_worst
1        25.38         17.33          184.60     2019.0           0.1622
2        24.99         23.41          158.80     1956.0           0.1238
3        23.57         25.53          152.50     1709.0           0.1444
4        14.91         26.50           98.87      567.7           0.2098
5        22.54         16.67          152.20     1575.0           0.1374
6        15.47         23.75          103.40      741.6           0.1791
  compactness_worst concavity_worst concave.points_worst symmetry_worst
1            0.6656          0.7119               0.2654         0.4601
2            0.1866          0.2416               0.1860         0.2750
3            0.4245          0.4504               0.2430         0.3613
4            0.8663          0.6869               0.2575         0.6638
5            0.2050          0.4000               0.1625         0.2364
6            0.5249          0.5355               0.1741         0.3985
  fractal_dimension_worst
1                 0.11890
2                 0.08902
3                 0.08758
4                 0.17300
5                 0.07678
6                 0.12440

> fit <- glm(data_6$diagnosis ~ data_6$radius_mean + data_6$texture_mean + data_6$perimeter_mean + data_6$area_mean + data_6$smoothness_mean + data_6$compactness_mean + data_6$concavity_mean + data_6$concave.points_mean + data_6$symmetry_mean + data_6$fractal_dimension_mean, family = "binomial")
Warning message:
glm.fit:拟合機率算出来是数值零或一 
> summary(fit)

Call:
glm(formula = data_6$diagnosis ~ data_6$radius_mean + data_6$texture_mean + 
    data_6$perimeter_mean + data_6$area_mean + data_6$smoothness_mean + 
    data_6$compactness_mean + data_6$concavity_mean + data_6$concave.points_mean + 
    data_6$symmetry_mean + data_6$fractal_dimension_mean, family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.95590  -0.14839  -0.03943   0.00429   2.91690  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -7.35952   12.85259  -0.573   0.5669    
data_6$radius_mean             -2.04930    3.71588  -0.551   0.5813    
data_6$texture_mean             0.38473    0.06454   5.961  2.5e-09 ***
data_6$perimeter_mean          -0.07151    0.50516  -0.142   0.8874    
data_6$area_mean                0.03980    0.01674   2.377   0.0174 *  
data_6$smoothness_mean         76.43227   31.95492   2.392   0.0168 *  
data_6$compactness_mean        -1.46242   20.34249  -0.072   0.9427    
data_6$concavity_mean           8.46870    8.12003   1.043   0.2970    
data_6$concave.points_mean     66.82176   28.52910   2.342   0.0192 *  
data_6$symmetry_mean           16.27824   10.63059   1.531   0.1257    
data_6$fractal_dimension_mean -68.33703   85.55666  -0.799   0.4244    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 751.44  on 568  degrees of freedom
Residual deviance: 146.13  on 558  degrees of freedom
AIC: 168.13

Number of Fisher Scoring iterations: 9


Q6_2

> reduced_feature <- c("texture_mean","area_mean","smoothness_mean","concave.points_mean")
> formula <- paste("diagnosis",paste(reduced_feature,collapse = "+") ,sep = "~")
> model <- glm(formula, data_6, family = "binomial")
Warning message:
glm.fit:拟合機率算出来是数值零或一 
> summary(model)

Call:
glm(formula = formula, family = "binomial", data = data_6)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.31798  -0.15623  -0.04212   0.01662   2.84201  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -23.677816   3.882774  -6.098 1.07e-09 ***
texture_mean          0.362687   0.060544   5.990 2.09e-09 ***
area_mean             0.010342   0.002002   5.165 2.40e-07 ***
smoothness_mean      59.471304  25.965153   2.290    0.022 *  
concave.points_mean  76.571210  16.427864   4.661 3.15e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 751.44  on 568  degrees of freedom
Residual deviance: 156.44  on 564  degrees of freedom
AIC: 166.44

Number of Fisher Scoring iterations: 8

Q6_3

> anova(fit, model, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: data_6$diagnosis

Terms added sequentially (first to last)


                              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                            568     751.44              
data_6$radius_mean             1   421.43       567     330.01 < 2.2e-16 ***
data_6$texture_mean            1    38.89       566     291.12 4.489e-10 ***
data_6$perimeter_mean          1    72.23       565     218.90 < 2.2e-16 ***
data_6$area_mean               1     7.53       564     211.37 0.0060655 ** 
data_6$smoothness_mean         1    42.14       563     169.22 8.481e-11 ***
data_6$compactness_mean        1     0.04       562     169.18 0.8359415    
data_6$concavity_mean          1    13.22       561     155.96 0.0002771 ***
data_6$concave.points_mean     1     6.72       560     149.24 0.0095192 ** 
data_6$symmetry_mean           1     2.46       559     146.78 0.1166476    
data_6$fractal_dimension_mean  1     0.65       558     146.13 0.4213577    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Q6_4

> trainData <- data_6[1:398,]
> testData <- data_6[399:569,]
> reduced_feature <- c("texture_mean","area_mean","smoothness_mean","concave.points_mean")

> formula <- paste("diagnosis",paste(reduced_feature,collapse = "+"),sep = "~")
> model <- glm(formula, trainData, family = "binomial")
> summary(model)

Call:
glm(formula = formula, family = "binomial", data = trainData)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.39278  -0.14454  -0.02447   0.03635   2.60665  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -27.47397    4.74798  -5.786 7.19e-09 ***
texture_mean          0.46244    0.08434   5.483 4.19e-08 ***
area_mean             0.01082    0.00235   4.606 4.11e-06 ***
smoothness_mean      90.11221   30.96961   2.910 0.003618 ** 
concave.points_mean  59.01212   17.51779   3.369 0.000755 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 544.93  on 397  degrees of freedom
Residual deviance: 108.30  on 393  degrees of freedom
AIC: 118.3

Number of Fisher Scoring iterations: 8

> predictdata <- predict(model, testData,type = "response")
> pred_num <- ifelse(predictdata > 0.5,1,0)
> y_pred <- factor(pred_num,levels = c(0,1))
> y_act <- factor(ifelse(testData$diagnosis=="B",0,1))
> sum(y_pred==y_act)
[1] 155
> 155/171
[1] 0.9064327

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,928评论 6 493
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,192评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 159,468评论 0 349
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,186评论 1 286
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,295评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,374评论 1 292
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,403评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,186评论 0 269
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,610评论 1 306
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,906评论 2 328
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,075评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,755评论 4 337
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,393评论 3 320
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,079评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,313评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,934评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,963评论 2 351

推荐阅读更多精彩内容