之前和毕业师姐一起做差异基因分析的时候用到过DESeq2,但一直没有系统的学习过很多地方都不懂,所以从今天开始打算在课题之余把官方文档中差异分析部分系统学习一遍。
本文完全用做个人学习记忆,摘抄自DESeq2官方说明文档
参考链接:http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
Analyzing RNA-seq data with DESeq2(一)
Analyzing RNA-seq data with DESeq2(二)
Analyzing RNA-seq data with DESeq2(三)
Analyzing RNA-seq data with DESeq2(四)
Analyzing RNA-seq data with DESeq2(五)
Standard workflow
Quick start
Here we show the most basic steps for a differential expression analysis. There are a variety of steps upstream of DESeq2 that result in the generation of counts or estimated counts for each sample, which we will discuss in the sections below.
This code chunk assumes that you have a count matrix called cts and a table of sample information called coldata. The design indicates how to model the samples, here, that we want to measure the effect of the condition, controlling for batch differences. The two factor variables batch(批次) and condition(条件) should be columns of coldata
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design= ~ batch + condition) # condition : the variables which will be used in modeling
dds <- DESeq(dds)
resultsNames(dds) # lists the coefficients
res <- results(dds, name="condition_trt_vs_untrt")# or to shrink log fold changes association with condition:
res <- lfcShrink(dds, coef="condition_trt_vs_untrt", type="apeglm")
此代码块假定有一个称为cts的计数矩阵和一个称为coldata的样本信息表。 design指出了如何对样本建模,在这里,我们要测量条件的影响,控制批次差异。 批次和条件这两个因素变量应为Coldata列。
The DESeqDataSet
The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet, which will usually be represented in the code here as an object dds.
A DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into an formula if it is not already). The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model.
个人理解:提供一个condition,用于后续的比较分析,这个condition中包含了变量信息。比如说我要做两个处理间的差异分析,但是又存在多次重复处理,为了避免重复那就需要告诉这两个不同处理是什么。
> #from matrix:从数值矩阵得到,比如用featurecount统计得到的数值矩阵
> a <- as.matrix(read.table("6sample_count_matrix.txt",sep="\t",header = T,row.names=1))
> head(a)
WT1 WT2 WT3 acc1.51 acc1.52 acc1.53
AT1G01010 226 183 247 514 353 383
AT1G01020 386 364 453 369 483 443
AT1G01030 77 65 87 52 62 74
AT1G01040 1322 1258 1621 1333 1526 1641
AT1G01050 1531 1518 2029 1573 1888 1987
AT1G01060 124 151 823 146 197 297
> b <- read.table("coldata.txt",header = T,row.names = 1)
> b
condition
WT1 WT
WT2 WT
WT3 WT
acc1.51 acc1.5
acc1.52 acc1.5
acc1.53 acc1.5
> all(rownames(b) == colnames(a)) #make sure that it's true
[1] TRUE
> dds3 <- DESeqDataSetFromMatrix(countData = a,
+ DataFrame(b),
+ design = ~ condition)
DESeq2提供了四种方式来构建DESeqDataSet
- From transcript abundance files and tximport
- From a count matrix
- From htseq-count files
- From a SummarizedExperiment object
这里只学习两种方法:
- From a count matrix
- From htseq-count files
Count matrix input
Here we show the most basic steps for a differential expression analysis.
To use DESeqDataSetFromMatrix, the user should provide the counts matrix, the information about the samples (the columns of the count matrix) as a DataFrame or data.frame, and the design formula.
这里选用pasilla包中的数据作为练习
BiocManager::install("pasilla") ##需要安装bioconductor
library("pasilla")
pasCts <- system.file("extdata",
"pasilla_gene_counts.tsv",
package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
"pasilla_sample_annotation.csv",
package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
#> head(cts)
# untreated1 untreated2 untreated3 untreated4 treated1 treated2 treated3
#FBgn0000003 0 0 0 0 0 0 1
#FBgn0000008 92 161 76 70 140 88 70
#FBgn0000014 5 1 0 0 4 0 0
#FBgn0000015 0 2 1 2 1 0 0
#FBgn0000017 4664 8714 3564 3150 6205 3072 3334
#FBgn0000018 583 761 245 310 722 299 308
coldata <- read.csv(pasAnno, row.names=1)
#> coldata
# condition type number.of.lanes total.number.of.reads exon.counts
#treated1fb treated single-read 5 35158667 15679615
#treated2fb treated paired-end 2 12242535 (x2) 15620018
#treated3fb treated paired-end 2 12443664 (x2) 12733865
#untreated1fb untreated single-read 2 17812866 14924838
#untreated2fb untreated single-read 6 34284521 20764558
#untreated3fb untreated paired-end 2 10542625 (x2) 10283129
#untreated4fb untreated paired-end 2 12214974 (x2) 11653031
coldata <- coldata[,c("condition","type")]
#> coldata
# condition type
#treated1fb treated single-read
#treated2fb treated paired-end
#treated3fb treated paired-end
#untreated1fb untreated single-read
#untreated2fb untreated single-read
#untreated3fb untreated paired-end
#untreated4fb untreated paired-end
coldata$condition <- factor(coldata$condition)
coldata$type <- factor(coldata$type)
rownames(coldata) <- sub("fb", "", rownames(coldata))
#> coldata
# condition type
#treated1 treated single-read
#treated2 treated paired-end
#treated3 treated paired-end
#untreated1 untreated single-read
#untreated2 untreated single-read
#untreated3 untreated paired-end
#untreated4 untreated paired-end
all(rownames(coldata) %in% colnames(cts))
##[1] TRUE
all(rownames(coldata) == colnames(cts))
## [1] FALSE
cts <- cts[, rownames(coldata)]
#> head(cts)
# treated1 treated2 treated3 untreated1 untreated2 untreated3 untreated4
#FBgn0000003 0 0 1 0 0 0 0
#FBgn0000008 140 88 70 92 161 76 70
#FBgn0000014 4 0 0 5 1 0 0
#FBgn0000015 1 0 0 0 2 1 2
#FBgn0000017 6205 3072 3334 4664 8714 3564 3150
#FBgn0000018 722 299 308 583 761 245 310
all(rownames(coldata) == colnames(cts))
## [1] TRUE
可以看到这时cts(count matrix)的列名和coldata(column data)的行名顺序和名称是一致的
> pasCts
[1] "/home/R-3.6.1/library/pasilla/extdata/pasilla_gene_counts.tsv"
[XXX@localhost extdata]$ less pasilla_gene_counts.tsv
gene_id untreated1 untreated2 untreated3 untreated4 treated1 treated2 treated3
FBgn0000003 0 0 0 0 0 0 1
FBgn0000008 92 161 76 70 140 88 70
FBgn0000014 5 1 0 0 4 0 0
FBgn0000015 0 2 1 2 1 0 0
FBgn0000017 4664 8714 3564 3150 6205 3072 3334
FBgn0000018 583 761 245 310 722 299 308
FBgn0000022 0 1 0 0 0 0 0
FBgn0000024 10 11 3 3 10 7 5
FBgn0000028 0 1 0 0 0 1 1
FBgn0000032 1446 1713 615 672 1698 696 757
FBgn0000036 2 1 0 0 1 0 1
> pasAnno
[1] "/home/R-3.6.1/library/pasilla/extdata/pasilla_sample_annotation.csv"
[XXX@localhost extdata]$ less pasilla_sample_annotation.csv
"file","condition","type","number of lanes","total number of reads","exon counts"
"treated1fb","treated","single-read",5,"35158667",15679615
"treated2fb","treated","paired-end",2,"12242535 (x2)",15620018
"treated3fb","treated","paired-end",2,"12443664 (x2)",12733865
"untreated1fb","untreated","single-read",2,"17812866",14924838
"untreated2fb","untreated","single-read",6,"34284521",20764558
"untreated3fb","untreated","paired-end",2,"10542625 (x2)",10283129
"untreated4fb","untreated","paired-end",2,"12214974 (x2)",11653031
到这里读取数据的准备已经完成,并且我们需要记住的是:
计数矩阵(count matrix)的列和列数据(coldata)的行(有关示例的信息)的顺序是相同的。DESeq 2不会猜测计数矩阵的哪一列属于列数据的哪一行,必须按照一致的顺序提供给DESeq 2。
因此数据构建准备时,我们要实现:
顺序一致、名称一致!!!
顺序一致、名称一致!!!
顺序一致、名称一致!!!
Construct a DESeqDataSet
数据文件已经准备好了,那接下来就可以构建DESeqDataSet了。
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~ condition)
dds
## class: DESeqDataSet
## dim: 14599 7
## metadata(1): version
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): treated1 treated2 ... untreated3 untreated4
## colData names(2): condition type
head(dds@assays@data$counts)
## treated1 treated2 treated3 untreated1 untreated2 untreated3 untreated4
##FBgn0000003 0 0 1 0 0 0 0
##FBgn0000008 140 88 70 92 161 76 70
##FBgn0000014 4 0 0 5 1 0 0
##FBgn0000015 1 0 0 0 2 1 2
##FBgn0000017 6205 3072 3334 4664 8714 3564 3150
##FBgn0000018 722 299 308 583 761 245 310
原始count数据查看
如果想添加一部分 feature data进去,可以使用DataFrame函数 frame:框架
featureData <- data.frame(gene=rownames(cts))
mcols(dds) <- DataFrame(mcols(dds), featureData) #mcols(x) <- value: Get or set the metadata columns.
mcols(dds)
## DataFrame with 14599 rows and 1 column
## gene
## <character>
## FBgn0000003 FBgn0000003
## FBgn0000008 FBgn0000008
## FBgn0000014 FBgn0000014
## FBgn0000015 FBgn0000015
## FBgn0000017 FBgn0000017
## ... ...
## FBgn0261571 FBgn0261571
## FBgn0261572 FBgn0261572
## FBgn0261573 FBgn0261573
## FBgn0261574 FBgn0261574
## FBgn0261575 FBgn0261575
> dds
class: DESeqDataSet
dim: 14599 7
metadata(1): version
assays(1): counts
rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
rowData names(1): gene
colnames(7): treated1 treated2 ... untreated3 untreated4
colData names(2): condition type
#注意这时rowData names部分变成了gene
If you have additional feature data, it can be added to the DESeqDataSet by adding to the metadata columns of a newly constructed object.
(Here we add redundant data just for demonstration, as the gene names are already the rownames of the dds.)
今天先到这里了,改天见哦!