今天在处理土壤酶活数据时发现8个技术重复中存在离群值,于是想通过R将这些离群值删除。原始数据如下:
sample_id | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 |
---|---|---|---|---|---|---|---|---|
BT1 | 16.388 | 4.088 | 5.138 | 4.873 | 5.686 | 4.115 | 4.626 | 4.461 |
BT2 | 7.525 | 3.808 | 3.917 | 4.102 | 4.166 | 4.104 | 3.373 | 5.581 |
BT3 | 11.897 | 3.928 | 4.505 | 4.465 | 4.901 | 4.089 | 3.979 | 4.996 |
BS1 | -0.851 | -0.895 | -0.892 | -0.734 | -0.855 | -0.889 | -0.971 | -0.892 |
BS2 | -0.678 | -0.735 | -0.919 | -0.941 | -0.902 | -0.859 | -0.804 | -0.907 |
BS3 | -0.761 | -0.811 | -0.901 | -0.833 | -0.874 | -0.87 | -0.883 | -0.895 |
10T1 | 2.748 | 2.748 | 2.989 | 2.292 | 2.678 | 3.037 | 4.328 | 2.207 |
10T2 | 3.2 | 4.7 | 2.719 | 4.17 | 3.282 | 3.932 | 4.118 | 3.654 |
10T3 | 2.959 | 3.705 | 2.84 | 3.215 | 2.965 | 3.467 | 4.202 | 2.915 |
10S1 | 0.154 | 0.723 | 1.961 | 0.267 | 0.052 | 2.049 | -0.053 | 0.184 |
10S2 | -0.008 | 0.232 | 0.318 | 0.252 | 1.179 | -0.056 | 0.415 | 0.163 |
10S3 | 0.072 | 0.475 | 1.134 | 0.258 | 0.612 | 0.992 | 0.018 | 0.173 |
4T1 | 6.281 | 9.823 | 8.209 | 5.994 | 5.735 | 5.356 | 4.573 | 6.228 |
4T2 | 7.042 | 6.291 | 6.272 | 5.743 | 4.587 | 6.156 | 3.504 | 5.678 |
4T3 | 6.628 | 8.017 | 7.204 | 5.839 | 5.135 | 5.727 | 4.019 | 5.923 |
4S1 | -0.705 | -0.661 | -0.081 | -0.334 | 1.331 | -0.566 | -0.306 | 1.138 |
4S2 | -0.76 | -0.5 | -0.69 | -0.716 | -0.774 | -0.604 | -0.802 | -0.743 |
4S3 | -0.729 | -0.577 | -0.384 | -0.522 | 0.278 | -0.582 | -0.511 | 0.196 |
一眼就能看出BT1和BT3样品的第一个数据明显偏离了整体的均值,那么如何处理呢?
## 定义一个将数据离群值替换为NA的函数
fun.outlier <- function(x,time.iqr=1.5) {
outlier.low <- quantile(x,probs=c(0.25))-IQR(x)*time.iqr
outlier.high <- quantile(x,probs=c(0.75))+IQR(x)*time.iqr
x[which(x>outlier.high | x<outlier.low)]<-NA
x
}
## 导入数据文件,数据转成数据框的格式
enzyme_data <- as.data.frame(t(read.csv("enzyme_activity.csv",header = TRUE, row.names = 1)))
outlier.enzyme <- enzyme_data %>% plyr::summarise( BT1=fun.outlier(BT1),
BT2=fun.outlier(BT2),
BT3=fun.outlier(BT3),
BS1=fun.outlier(BS1),
BS2=fun.outlier(BS2),
BS3=fun.outlier(BS3),
10T1=fun.outlier(10T1),
10T2=fun.outlier(10T2),
10T3=fun.outlier(10T3),
10S1=fun.outlier(10S1),
10S2=fun.outlier(10S2),
10S3=fun.outlier(10S3),
4T1=fun.outlier(4T1),
4T2=fun.outlier(4T2),
4T3=fun.outlier(4T3),
4S1=fun.outlier(4S1),
4S2=fun.outlier(4S2),
4S3=fun.outlier(4S3)
)
## 求替换后的各组的平均值
enzyme_m <- sapply(outlier.enzyme, mean, na.rm = TRUE)
enzy_act_o <- as.data.frame(enzyme_m)
head(enzy_act_o)
enzyme_m
# BT1 4.712429
# BT2 3.911667
# BT3 4.409000
# BS1 -0.879000
# BS2 -0.843125
# BS3 -0.853500
## 将结果保存为对应csv文件
write.csv(enzy_act_o, file = "enzyme_xyloside_mean.csv")
参考资料:
https://blog.csdn.net/qq_43872984/article/details/105479445