1.读取raw count表达矩阵和分组信息
rawcount <- read.table("data/raw_counts.txt",row.names = 1,
sep = "\t", header = T)
group <- read.table("data/filereport_read_run_PRJNA229998_tsv.txt",
header = T,sep = "\t", quote = "\"")
提取表达矩阵对应的样本表型信息
library(stringr)
group <- group[match(colnames(rawcount), group$run_accession),
c("run_accession","sample_title")]
group$sample_title <- str_split_fixed(group$sample_title,pattern = "_", n=2)[,2]
group
write.table(group,file = "data/group.txt",row.names = F,sep = "\t",quote = F)
这里主要是因为raw count表达矩阵和分组信息并非一一对应,因此需要用match在分组信息里面挑选
2.表达矩阵预处理
keep <- rowSums(rawcount>0) >= floor(0.75*ncol(rawcount))
table(keep)##
filter_count <- rawcount[keep,]
filter_count[1:4,1:4]
dim(filter_count)
这里的预处理主要是处理raw count表达矩阵的含有0的行,代码的意思是如果一行的表达量大于0的个数有0.75倍的行数就保留这行,反之就删除这行。
计算counts per millio(cpm) 表达矩阵,这里用到的是edgeR包的cpm函数
library(edgeR)
express_cpm <- cpm(filter_count)
express_cpm[1:6,1:6]
# 保存表达矩阵和分组结果
save(filter_count, express_cpm, group, file = "data/Step01-airwayData.Rdata")