之前和毕业师姐一起做差异基因分析的时候用到过DESeq2，但一直没有系统的学习过很多地方都不懂，所以从今天开始打算在课题之余把官方文档中差异分析部分系统学习一遍。

本文完全用做个人学习记忆，摘抄自DESeq2官方说明文档
参考链接：http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Analyzing RNA-seq data with DESeq2（一）
Analyzing RNA-seq data with DESeq2（二）
Analyzing RNA-seq data with DESeq2（三）
Analyzing RNA-seq data with DESeq2（四）
Analyzing RNA-seq data with DESeq2（五）

Standard workflow

Quick start

Here we show the most basic steps for a differential expression analysis. There are a variety of steps upstream of DESeq2 that result in the generation of counts or estimated counts for each sample, which we will discuss in the sections below.

This code chunk assumes that you have a count matrix called cts and a table of sample information called coldata. The design indicates how to model the samples, here, that we want to measure the effect of the condition, controlling for batch differences. The two factor variables batch(批次) and condition(条件) should be columns of coldata

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design= ~ batch + condition) # condition : the variables which will be used in modeling
dds <- DESeq(dds)
resultsNames(dds) # lists the coefficients
res <- results(dds, name="condition_trt_vs_untrt")# or to shrink log fold changes association with condition:
res <- lfcShrink(dds, coef="condition_trt_vs_untrt", type="apeglm")

此代码块假定有一个称为cts的计数矩阵和一个称为coldata的样本信息表。 design指出了如何对样本建模，在这里，我们要测量条件的影响，控制批次差异。批次和条件这两个因素变量应为Coldata列。

The DESeqDataSet

The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet, which will usually be represented in the code here as an object dds.

A DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into an formula if it is not already). The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model.

个人理解：提供一个condition，用于后续的比较分析，这个condition中包含了变量信息。比如说我要做两个处理间的差异分析，但是又存在多次重复处理，为了避免重复那就需要告诉这两个不同处理是什么。

> #from matrix：从数值矩阵得到，比如用featurecount统计得到的数值矩阵
> a <- as.matrix(read.table("6sample_count_matrix.txt",sep="\t",header = T,row.names=1))
> head(a)
           WT1  WT2  WT3 acc1.51 acc1.52 acc1.53
AT1G01010  226  183  247     514     353     383
AT1G01020  386  364  453     369     483     443
AT1G01030   77   65   87      52      62      74
AT1G01040 1322 1258 1621    1333    1526    1641
AT1G01050 1531 1518 2029    1573    1888    1987
AT1G01060  124  151  823     146     197     297
> b <- read.table("coldata.txt",header = T,row.names = 1)
> b
        condition
WT1            WT
WT2            WT
WT3            WT
acc1.51    acc1.5
acc1.52    acc1.5
acc1.53    acc1.5
> all(rownames(b) == colnames(a)) #make sure that it's true
[1] TRUE
> dds3 <- DESeqDataSetFromMatrix(countData = a,
+                                DataFrame(b),
+                                design = ~ condition)

DESeq2提供了四种方式来构建DESeqDataSet

From transcript abundance files and tximport

From a count matrix

From htseq-count files

From a SummarizedExperiment object

这里只学习两种方法：

From a count matrix
From htseq-count files

Count matrix input

Here we show the most basic steps for a differential expression analysis.
To use DESeqDataSetFromMatrix, the user should provide the counts matrix, the information about the samples (the columns of the count matrix) as a DataFrame or data.frame, and the design formula.

这里选用pasilla包中的数据作为练习

BiocManager::install("pasilla") ##需要安装bioconductor
library("pasilla")
pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
#> head(cts)
#            untreated1 untreated2 untreated3 untreated4 treated1 treated2 treated3
#FBgn0000003          0          0          0          0        0        0        1
#FBgn0000008         92        161         76         70      140       88       70
#FBgn0000014          5          1          0          0        4        0        0
#FBgn0000015          0          2          1          2        1        0        0
#FBgn0000017       4664       8714       3564       3150     6205     3072     3334
#FBgn0000018        583        761        245        310      722      299      308

coldata <- read.csv(pasAnno, row.names=1)
#> coldata
#            condition        type number.of.lanes total.number.of.reads exon.counts
#treated1fb     treated single-read               5              35158667    15679615
#treated2fb     treated  paired-end               2         12242535 (x2)    15620018
#treated3fb     treated  paired-end               2         12443664 (x2)    12733865
#untreated1fb untreated single-read               2              17812866    14924838
#untreated2fb untreated single-read               6              34284521    20764558
#untreated3fb untreated  paired-end               2         10542625 (x2)    10283129
#untreated4fb untreated  paired-end               2         12214974 (x2)    11653031

coldata <- coldata[,c("condition","type")]
#> coldata
#             condition        type
#treated1fb     treated single-read
#treated2fb     treated  paired-end
#treated3fb     treated  paired-end
#untreated1fb untreated single-read
#untreated2fb untreated single-read
#untreated3fb untreated  paired-end
#untreated4fb untreated  paired-end

coldata$condition <- factor(coldata$condition)
coldata$type <- factor(coldata$type)

rownames(coldata) <- sub("fb", "", rownames(coldata))
#> coldata
#           condition        type
#treated1     treated single-read
#treated2     treated  paired-end
#treated3     treated  paired-end
#untreated1 untreated single-read
#untreated2 untreated single-read
#untreated3 untreated  paired-end
#untreated4 untreated  paired-end

all(rownames(coldata) %in% colnames(cts))
##[1] TRUE
all(rownames(coldata) == colnames(cts))
## [1] FALSE
cts <- cts[, rownames(coldata)]
#> head(cts)
#            treated1 treated2 treated3 untreated1 untreated2 untreated3 untreated4
#FBgn0000003        0        0        1          0          0          0          0
#FBgn0000008      140       88       70         92        161         76         70
#FBgn0000014        4        0        0          5          1          0          0
#FBgn0000015        1        0        0          0          2          1          2
#FBgn0000017     6205     3072     3334       4664       8714       3564       3150
#FBgn0000018      722      299      308        583        761        245        310

all(rownames(coldata) == colnames(cts))
## [1] TRUE

可以看到这时cts（count matrix）的列名和coldata（column data）的行名顺序和名称是一致的

> pasCts
[1] "/home/R-3.6.1/library/pasilla/extdata/pasilla_gene_counts.tsv"

[XXX@localhost extdata]$ less pasilla_gene_counts.tsv
gene_id untreated1      untreated2      untreated3      untreated4      treated1        treated2        treated3
FBgn0000003     0       0       0       0       0       0       1
FBgn0000008     92      161     76      70      140     88      70
FBgn0000014     5       1       0       0       4       0       0
FBgn0000015     0       2       1       2       1       0       0
FBgn0000017     4664    8714    3564    3150    6205    3072    3334
FBgn0000018     583     761     245     310     722     299     308
FBgn0000022     0       1       0       0       0       0       0
FBgn0000024     10      11      3       3       10      7       5
FBgn0000028     0       1       0       0       0       1       1
FBgn0000032     1446    1713    615     672     1698    696     757
FBgn0000036     2       1       0       0       1       0       1

> pasAnno
[1] "/home/R-3.6.1/library/pasilla/extdata/pasilla_sample_annotation.csv"

[XXX@localhost extdata]$ less pasilla_sample_annotation.csv
"file","condition","type","number of lanes","total number of reads","exon counts"
"treated1fb","treated","single-read",5,"35158667",15679615
"treated2fb","treated","paired-end",2,"12242535 (x2)",15620018
"treated3fb","treated","paired-end",2,"12443664 (x2)",12733865
"untreated1fb","untreated","single-read",2,"17812866",14924838
"untreated2fb","untreated","single-read",6,"34284521",20764558
"untreated3fb","untreated","paired-end",2,"10542625 (x2)",10283129
"untreated4fb","untreated","paired-end",2,"12214974 (x2)",11653031

到这里读取数据的准备已经完成，并且我们需要记住的是：
计数矩阵（count matrix）的列和列数据（coldata）的行(有关示例的信息)的顺序是相同的。DESeq 2不会猜测计数矩阵的哪一列属于列数据的哪一行，必须按照一致的顺序提供给DESeq 2。

因此数据构建准备时，我们要实现：

顺序一致、名称一致！！！

Construct a DESeqDataSet

数据文件已经准备好了，那接下来就可以构建DESeqDataSet了。

library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds
## class: DESeqDataSet 
## dim: 14599 7 
## metadata(1): version
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): treated1 treated2 ... untreated3 untreated4
## colData names(2): condition type

head(dds@assays@data$counts)
##            treated1 treated2 treated3 untreated1 untreated2 untreated3 untreated4
##FBgn0000003        0        0        1          0          0          0          0
##FBgn0000008      140       88       70         92        161         76         70
##FBgn0000014        4        0        0          5          1          0          0
##FBgn0000015        1        0        0          0          2          1          2
##FBgn0000017     6205     3072     3334       4664       8714       3564       3150
##FBgn0000018      722      299      308        583        761        245        310
原始count数据查看

如果想添加一部分 feature data进去，可以使用DataFrame函数 frame:框架

featureData <- data.frame(gene=rownames(cts))
mcols(dds) <- DataFrame(mcols(dds), featureData) #mcols(x) <- value: Get or set the metadata columns.
mcols(dds)

## DataFrame with 14599 rows and 1 column
##                    gene
##             <character>
## FBgn0000003 FBgn0000003
## FBgn0000008 FBgn0000008
## FBgn0000014 FBgn0000014
## FBgn0000015 FBgn0000015
## FBgn0000017 FBgn0000017
## ...                 ...
## FBgn0261571 FBgn0261571
## FBgn0261572 FBgn0261572
## FBgn0261573 FBgn0261573
## FBgn0261574 FBgn0261574
## FBgn0261575 FBgn0261575

> dds
class: DESeqDataSet 
dim: 14599 7 
metadata(1): version
assays(1): counts
rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
rowData names(1): gene
colnames(7): treated1 treated2 ... untreated3 untreated4
colData names(2): condition type
#注意这时rowData names部分变成了gene

If you have additional feature data, it can be added to the DESeqDataSet by adding to the metadata columns of a newly constructed object.
(Here we add redundant data just for demonstration, as the gene names are already the rownames of the dds.)

今天先到这里了，改天见哦！

大家一起学习讨论鸭！

来一杯！