Differential Expression of RNA-seq data
本文将要介绍的是在R中进行RNA-seq 数据基因表达差异分析的实战代码.
原文地址:https://bioinformatics-core-shared-training.github.io/RNAseq-R/rna-seq-de.nb.html
书接上文(用R也可以完成的RNA-Seq下游分析-2)
在预处理生成了表达矩阵且标准化后,接下来我们要做的就是差异分析了。
本次流程需要的包
同时载入上次预处理完的数据preprocessing.Rdata
library(edgeR)
library(limma)
library(Glimma)
library(gplots)
library(org.Mm.eg.db)
load("preprocessing.Rdata")
在edgeR
包的分析流程中,构建了表达矩阵后,便需要使用model.matrix
函数创建保存有分组信息的矩阵design matrix。该矩阵以的方式存储了分组信息。本次分析中的分组信息包括了:小鼠状态与细胞类型,现在我们先在两种因素并不存在相互作用的假设下拟合线性模型。
> group <- as.character(group)
> group
[1] "basal.virgin" "basal.virgin" "basal.pregnant" "basal.pregnant" "basal.lactate"
[6] "basal.lactate" "luminal.virgin" "luminal.virgin" "luminal.pregnant" "luminal.pregnant"
[11] "luminal.lactate" "luminal.lactate"
group的分组信息包括了状态和细胞类型,因此我们使用strsplit
函数,以‘.’为分隔把细胞类型信息和状态信息分别提取出来。
> type <- sapply(strsplit(group, ".", fixed=T), function(x) x[1])
> status <- sapply(strsplit(group, ".", fixed=T), function(x) x[2])
> type
[1] "basal" "basal" "basal" "basal" "basal" "basal" "luminal" "luminal" "luminal"
[10] "luminal" "luminal" "luminal"
> status
[1] "virgin" "virgin" "pregnant" "pregnant" "lactate" "lactate" "virgin" "virgin"
[9] "pregnant" "pregnant" "lactate" "lactate"
成功提取后,构建分组信息矩阵
> design <- model.matrix(~ type + status)
> design
(Intercept) typeluminal statuspregnant statusvirgin
1 1 0 0 1
2 1 0 0 1
3 1 0 1 0
4 1 0 1 0
5 1 0 0 0
6 1 0 0 0
7 1 1 0 1
8 1 1 0 1
9 1 1 1 0
10 1 1 1 0
11 1 1 0 0
12 1 1 0 0
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$type
[1] "contr.treatment"
attr(,"contrasts")$status
[1] "contr.treatment"
Estimating the dispersion
> dgeObj <- estimateCommonDisp(dgeObj)
#Then we estimate gene-wise dispersion estimates, allowing a possible trend with averge count size:
> dgeObj <- estimateGLMTrendedDisp(dgeObj)
> dgeObj <- estimateTagwiseDisp(dgeObj)
#Plot the estimated dispersions
> plotBCV(dgeObj)
Testing for differential expression
> fit <- glmFit(dgeObj, design)
# 看一眼系数
> head(coef(fit))
(Intercept) typeluminal statuspregnant statusvirgin
497097 -11.187922 -7.58804851 -0.7085514 -0.09305118
20671 -12.715063 -1.85287334 0.2269001 0.49554506
27395 -11.221391 0.56368066 -0.1415910 -0.29221577
18777 -10.146793 0.08280255 -0.1845489 -0.48795441
21399 -9.909825 -0.24195503 0.1753606 0.13494615
58175 -16.310131 3.09936215 1.1975518 0.84742701
# Conduct likelihood ratio tests for luminal vs basal and show the top genes:
> lrt.BvsL <- glmLRT(fit, coef=2)
# 查看前10个最显著的差异表达基因
> topTags(lrt.BvsL)
Coefficient: typeluminal
logFC logCPM LR PValue FDR
110308 -8.940579 10.264297 24.89789 6.044844e-07 0.0004377961
50916 -8.636503 5.749781 24.80037 6.358512e-07 0.0004377961
12293 -8.362247 6.794788 24.68526 6.749827e-07 0.0004377961
56069 -8.419433 6.124377 24.41532 7.764861e-07 0.0004377961
24117 -9.290691 6.757163 24.32506 8.137331e-07 0.0004377961
12818 -8.216790 8.172247 24.24233 8.494462e-07 0.0004377961
22061 -8.034712 7.255370 24.16987 8.820135e-07 0.0004377961
12797 -9.001419 9.910795 24.12854 9.011487e-07 0.0004377961
50706 -7.697022 10.809629 24.06926 9.293193e-07 0.0004377961
237979 -8.167451 5.215921 24.03528 9.458678e-07 0.0004377961
最后,保存差异分析的结果(实际分析时可不保存,可以直接代码分析)
save(lrt.BvsL,dgeObj,group,file="DE.Rdata")