Question1.
农场调查了A、B、C等3个品种草莓的维生素C含量,测定结果如表1所示。
(1)检验维生素C含量对于品种A、B、C是否是正态的;如果是,是否满足方差齐次性要求;
(2)分析品种A、B、C的维生素C含量是否存在差异;如果是,那么分析各个品种两两之间存在差异与否。
注: 正态性检验可用shapiro.test,本题数据可以直接从附件strawberry.txt读取。
解答:
采用shapiro.test进行正态性检验,原假设为‘A品种草莓的维生素C含量服从正态分布’,备择假设为‘A品种草莓的维生素C含量不服从正态分布’。
# 设置工作目录
> setwd("C:/Users/My/Desktop/")
# 读取数据并保存到strawberry中
> strawberry <- read.table("strawberry.txt",header = T)
# 查看strawberry文件的内容
> strawberry
breed vc
1 a 117
2 a 99
3 a 107
4 a 112
5 a 113
6 a 106
7 b 81
8 b 77
9 b 79
10 b 76
11 b 85
12 b 87
13 b 74
14 b 45
15 b 72
16 b 80
17 c 80
18 c 82
19 c 78
20 c 84
21 c 89
22 c 53
23 c 86
24 c 88
# 对A草莓的维生素C的含量进行正态性检验
> shapiro.test(strawberry$vc[strawberry$breed=='a'])
Shapiro-Wilk normality test
data: strawberry$vc[strawberry$breed == "a"]
W = 0.96749, p-value = 0.8752
p-value 值为 0.8752 > 0.05, 所以接受原假设,A种草莓维C含量符合正态分布
同样对B草莓和C草莓进行正态性检验
> shapiro.test(strawberry$vc[strawberry$breed == 'b'])
Shapiro-Wilk normality test
data: strawberry$vc[strawberry$breed == "b"]
W = 0.75208, p-value = 0.003783
# p 值结果为 0.003783 < 0.05, 所以B草莓维生素C含量不符合正态分布
# 对草莓C 进行状态性检验
> shapiro.test(strawberry$vc[strawberry$breed == 'c'])
Shapiro-Wilk normality test
data: strawberry$vc[strawberry$breed == "c"]
W = 0.72984, p-value = 0.004876
p < 0.05, 所以C草莓的维生素C含量不符合正态分布
对方差齐性做检验
> bartlett.test(vc~breed, data = strawberry)
Bartlett test of homogeneity of variances
data: vc by breed
Bartlett's K-squared = 2.0697, df = 2, p-value = 0.3553
# p-value= 0.3553 > 0.05, 所以拒绝原假设,所以满足方差齐次要求
采用方差分析的方法分析不同品种的维生素C含量是否相同,原假设为‘三种品种的均值相同’,备择假设为‘至少有一个品种的均值与其他品种不同’,由于数据不符合正态分布,我们采用kruskal.test。
> attach(strawberry)
> kruskal.test(vc~breed)
Kruskal-Wallis rank sum test
data: vc by breed
Kruskal-Wallis chi-squared = 14.344, df = 2, p-value = 0.0007676
#p-value < 0.05, 所以品种A、B、C的维生素C含量存在差异
# 分析各个品种两两之间存在差异与否
> pairwise.wilcox.test(vc,breed,p.adjust.method = "bonferroni")
Pairwise comparisons using Wilcoxon rank sum test
data: vc and breed
a b
b 0.00075 -
c 0.00200 0.39216
P value adjustment method: bonferroni
Warning message:
In wilcox.test.default(xi, xj, paired = paired, ...) :
无法精確計算带连结的p值
可以看出A与B、A与C之间成wilcox.test 检验的p值远小于0.05,而B与C之间则大于0.05,所以A品种的维生素C含量与其他品种不同,B、C品种则没有明显差异。
Question2:
we have data collected from the flammability test. a cloth has been cut into 55 pieces. there are 5 labs, and each lab received 11 pieces. these measure correspond to the length in the chart(fabric.csv). if the tests each of these labs have been set up same way would expect results to be similar. since all these measurements were taken from the same bolt of cloth.
1.use box plots to see the data outline
2.are the variances equal across groups according to the Bartlett Test?
3.are the mean lengths the same across groups?
4.if not, which lab is the difference? (pair wise comparison tests)
解:
> fabric <- read.table("fabric.txt",header = T);fabric
Lab1 Lab2 Lab3 Lab4 Lab5
1 3.4 2.7 3.3 3.3 4.1
2 3.1 3.4 3.3 3.2 4.1
3 3.4 3.6 3.5 3.4 3.7
4 3.7 3.2 3.5 2.7 4.2
5 3.1 4.0 2.8 2.7 3.1
6 4.2 4.1 2.8 3.3 3.5
7 3.7 3.8 3.2 2.9 2.8
8 3.9 3.8 2.8 3.2 3.5
9 3.1 4.3 3.8 2.9 3.7
10 3.0 3.4 3.5 2.6 3.5
11 2.9 3.3 3.8 2.8 3.9
> boxplot(fabric)
What we see in the boxplot is there is some variation. The boxplot may suggest there is difference in the mean lengths across the labs So let's test that now use one-way ANOVA, cause there is one factor. we preprocess the data by rearranging this data to two columns(dependent variable, levels)
# 将数据重新排列成两列
> stfabric<- stack(fabric)
# 修改列名为“len”和“lab”
> names(stfabric) <- c("len","lab")
> stfabric
len lab
1 3.4 Lab1
2 3.1 Lab1
3 3.4 Lab1
4 3.7 Lab1
5 3.1 Lab1
6 4.2 Lab1
7 3.7 Lab1
8 3.9 Lab1
9 3.1 Lab1
10 3.0 Lab1
11 2.9 Lab1
12 2.7 Lab2
13 3.4 Lab2
14 3.6 Lab2
15 3.2 Lab2
16 4.0 Lab2
17 4.1 Lab2
18 3.8 Lab2
19 3.8 Lab2
20 4.3 Lab2
21 3.4 Lab2
22 3.3 Lab2
23 3.3 Lab3
24 3.3 Lab3
25 3.5 Lab3
26 3.5 Lab3
27 2.8 Lab3
28 2.8 Lab3
29 3.2 Lab3
30 2.8 Lab3
31 3.8 Lab3
32 3.5 Lab3
33 3.8 Lab3
34 3.3 Lab4
35 3.2 Lab4
36 3.4 Lab4
37 2.7 Lab4
38 2.7 Lab4
39 3.3 Lab4
40 2.9 Lab4
41 3.2 Lab4
42 2.9 Lab4
43 2.6 Lab4
44 2.8 Lab4
45 4.1 Lab5
46 4.1 Lab5
47 3.7 Lab5
48 4.2 Lab5
49 3.1 Lab5
50 3.5 Lab5
51 2.8 Lab5
52 3.5 Lab5
53 3.7 Lab5
54 3.5 Lab5
55 3.9 Lab5
# 加载stfabric数据
> attach(stfabric)
> str(stfabric)
'data.frame': 55 obs. of 2 variables:
$ len: num 3.4 3.1 3.4 3.7 3.1 4.2 3.7 3.9 3.1 3 ...
$ lab: Factor w/ 5 levels "Lab1","Lab2",..: 1 1 1 1 1 1 1 1 1 1 ...
# 检验各组数据的正态性
> shapiro.test(stfabric[stfabric$lab=="Lab1",1])$p.value
[1] 0.3230597
> shapiro.test(stfabric[stfabric$lab=="Lab2",1])$p.value
[1] 0.9332868
> shapiro.test(stfabric[stfabric$lab=="Lab3",1])$p.value
[1] 0.1293238
> shapiro.test(stfabric[stfabric$lab=="Lab4",1])$p.value
[1] 0.2042771
> shapiro.test(stfabric[stfabric$lab=="Lab5",1])$p.value
[1] 0.4806684
# 其p-value均小于0.05,表明各组数据均服从正态分布
# 检验数据的方差齐性
> bartlett.test(len~lab)
Bartlett test of homogeneity of variances
data: len by lab
Bartlett's K-squared = 2.4179, df = 4, p-value = 0.6594
# 其中p-value < 0.05, 证明具有方差齐性
p-value is higher than 0.05 in each group. the result means that the variances are equal across groups according to the Bartlett Test.
# Are the mean lengths the same across groups? (4’)
# H0: There is no difference means across labs H1: there is at least one difference
# One way anova
> av1 <- aov(len~lab,data = stfabric)
> summary(av1)
Df Sum Sq Mean Sq F value Pr(>F)
lab 4 2.969 0.7423 4.676 0.00277 **
Residuals 50 7.936 0.1587
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#the F value is 4.5. P value is 0.003, which is highly significant. what we conclude is there is a difference in the means across labs. we reject the null hypothesis
which lab is the difference? (pair wise comparison tests) next obvious question to ask is where the difference lie? here we use the pairwise comparison test TukeyHSD, which gives a reasonable intervals and only works on balanced designs.
# 探究各变量之间的不同,进行各组均值差异的成对检验
> tk <- TukeyHSD(av1)
> tk
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = len ~ lab, data = stfabric)
$`lab`
diff lwr upr p adj
Lab2-Lab1 0.19090909 -0.2898210 0.67163915 0.7932570
Lab3-Lab1 -0.10909091 -0.5898210 0.37163915 0.9673122
Lab4-Lab1 -0.40909091 -0.8898210 0.07163915 0.1299798
Lab5-Lab1 0.23636364 -0.2443664 0.71709370 0.6359580
Lab3-Lab2 -0.30000000 -0.7807301 0.18073006 0.4044590
Lab4-Lab2 -0.60000000 -1.0807301 -0.11926994 0.0076519
Lab5-Lab2 0.04545455 -0.4352755 0.52618461 0.9988350
Lab4-Lab3 -0.30000000 -0.7807301 0.18073006 0.4044590
Lab5-Lab3 0.34545455 -0.1352755 0.82618461 0.2654246
Lab5-Lab4 0.64545455 0.1647245 1.12618461 0.0034711
置信区间包含0组别说明差异不显著(p>0.05),所以由图可知,Lab4-Lab2, Lab5-Lab4 之间有着显著差异,并且其p-adj值均小于0.05
3.Question
用大麦生产啤酒的过程中,需要研究烘烤方式(A)和大麦水分(B)对糖化时间的影响,选择两种烘烤方式、四种大麦水分,共八种处理方式,每一次处理重复三次,结果如表2所示。
(1)根据课件双因素方法分析给出各组假设,列出方差分析表(给出表达式即可)并指明表中各值的意义;
(2)试分析烘烤方式A、水分B以及二者相互作用对糖化时间的影响;
(3)对AiBj条件下平均时间做多重比较,并指出与A1B3组有显著差异的条件;
(4)用图形展示交互效应并作说明。
注:假设数据满足正态性和方差齐次性;
本题数据可直接从附件beer.txt读取。
解:
(2)
> setwd("C:/Users/My/Desktop")
> beer <- read.table("beer.txt",header = T)
> head(beer)
A B time
1 A1 B1 12.0
2 A1 B1 13.0
3 A1 B1 14.5
4 A1 B2 9.5
5 A1 B2 10.0
6 A1 B2 12.5
> attach(beer)
下面对数据进行两因素的方差分析,原假设为‘因素A、B及其相互作用不影响糖化时间’,备择假设为‘存在某个因素影响糖化时间’。
# 双因素ANOVA分析
> ff <- aov(time~A*B,data = beer)
> summary(ff)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 5.51 5.51 4.008 0.0625 .
B 3 228.86 76.29 55.482 1.12e-08 ***
A:B 3 107.61 35.87 26.088 2.12e-06 ***
Residuals 16 22.00 1.38
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
以上结果可以看出,烘烤方式A对应的p值大于0.05,不对糖化时间造成显著影响;而水分B以及A、B的相互作用对糖化时间有显著影响。
# 多重比较
> multi.compr <- TukeyHSD(ff)
> AB <- multi.compr$`A:B`
> AB[AB[,4]<0.05,]
diff lwr upr p adj
A2:B1-A1:B1 -7.500000 -10.81475692 -4.185243 1.615920e-05
A2:B3-A1:B1 4.166667 0.85190974 7.481424 9.013947e-03
A1:B4-A1:B1 4.833333 1.51857641 8.148090 2.329391e-03
A1:B2-A2:B1 5.000000 1.68524308 8.314757 1.667245e-03
A2:B2-A2:B1 8.333333 5.01857641 11.648090 4.122553e-06
A1:B3-A2:B1 9.500000 6.18524308 12.814757 7.018864e-07
A2:B3-A2:B1 11.666667 8.35190974 14.981424 3.850844e-08
A1:B4-A2:B1 12.333333 9.01857641 15.648090 1.726279e-08
A2:B4-A2:B1 10.500000 7.18524308 13.814757 1.733446e-07
A2:B2-A1:B2 3.333333 0.01857641 6.648090 4.821463e-02
A1:B3-A1:B2 4.500000 1.18524308 7.814757 4.572530e-03
A2:B3-A1:B2 6.666667 3.35190974 9.981424 6.934742e-05
A1:B4-A1:B2 7.333333 4.01857641 10.648090 2.146571e-05
A2:B4-A1:B2 5.500000 2.18524308 8.814757 6.207373e-04
A2:B3-A2:B2 3.333333 0.01857641 6.648090 4.821463e-02
A1:B4-A2:B2 4.000000 0.68524308 7.314757 1.265663e-02
从上述结果中可以看出,和A1B3存在差异的的组有A2B1和A1B2
(3)
> par(mfrow=c(1,2))
> attach(beer)
> interaction.plot(A,B,time,type = "b",col = c("red","blue","green"))
> interaction.plot(B,A,time,type = "b",col = c("red","blue","green"))
从上图可以看出,固定一个因素(A或者B)时候,另外一个因素的变化线段彼此之间不平行,会受到前者(B或A)的影响。故而A与B存在一定的交互作用。