Analyzing RNA-seq data with DESeq2(一)

之前和毕业师姐一起做差异基因分析的时候用到过DESeq2,但一直没有系统的学习过很多地方都不懂,所以从今天开始打算在课题之余把官方文档中差异分析部分系统学习一遍。

本文完全用做个人学习记忆,摘抄自DESeq2官方说明文档
参考链接:http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Analyzing RNA-seq data with DESeq2(一)
Analyzing RNA-seq data with DESeq2(二)
Analyzing RNA-seq data with DESeq2(三)
Analyzing RNA-seq data with DESeq2(四)
Analyzing RNA-seq data with DESeq2(五)


Standard workflow

Quick start

Here we show the most basic steps for a differential expression analysis. There are a variety of steps upstream of DESeq2 that result in the generation of counts or estimated counts for each sample, which we will discuss in the sections below.

This code chunk assumes that you have a count matrix called cts and a table of sample information called coldata. The design indicates how to model the samples, here, that we want to measure the effect of the condition, controlling for batch differences. The two factor variables batch(批次) and condition(条件) should be columns of coldata

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design= ~ batch + condition) # condition : the variables which will be used in modeling
dds <- DESeq(dds)
resultsNames(dds) # lists the coefficients
res <- results(dds, name="condition_trt_vs_untrt")# or to shrink log fold changes association with condition:
res <- lfcShrink(dds, coef="condition_trt_vs_untrt", type="apeglm")

此代码块假定有一个称为cts的计数矩阵和一个称为coldata的样本信息表。 design指出了如何对样本建模,在这里,我们要测量条件的影响,控制批次差异。 批次和条件这两个因素变量应为Coldata列。


The DESeqDataSet

The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet, which will usually be represented in the code here as an object dds.

A DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into an formula if it is not already). The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model.

个人理解:提供一个condition,用于后续的比较分析,这个condition中包含了变量信息。比如说我要做两个处理间的差异分析,但是又存在多次重复处理,为了避免重复那就需要告诉这两个不同处理是什么。

> #from matrix:从数值矩阵得到,比如用featurecount统计得到的数值矩阵
> a <- as.matrix(read.table("6sample_count_matrix.txt",sep="\t",header = T,row.names=1))
> head(a)
           WT1  WT2  WT3 acc1.51 acc1.52 acc1.53
AT1G01010  226  183  247     514     353     383
AT1G01020  386  364  453     369     483     443
AT1G01030   77   65   87      52      62      74
AT1G01040 1322 1258 1621    1333    1526    1641
AT1G01050 1531 1518 2029    1573    1888    1987
AT1G01060  124  151  823     146     197     297
> b <- read.table("coldata.txt",header = T,row.names = 1)
> b
        condition
WT1            WT
WT2            WT
WT3            WT
acc1.51    acc1.5
acc1.52    acc1.5
acc1.53    acc1.5
> all(rownames(b) == colnames(a)) #make sure that it's true
[1] TRUE
> dds3 <- DESeqDataSetFromMatrix(countData = a,
+                                DataFrame(b),
+                                design = ~ condition)

DESeq2提供了四种方式来构建DESeqDataSet

  1. From transcript abundance files and tximport
  2. From a count matrix
  3. From htseq-count files
  4. From a SummarizedExperiment object

这里只学习两种方法

  1. From a count matrix
  2. From htseq-count files

Count matrix input

Here we show the most basic steps for a differential expression analysis.
To use DESeqDataSetFromMatrix, the user should provide the counts matrix, the information about the samples (the columns of the count matrix) as a DataFrame or data.frame, and the design formula.

这里选用pasilla包中的数据作为练习

BiocManager::install("pasilla") ##需要安装bioconductor
library("pasilla")
pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
#> head(cts)
#            untreated1 untreated2 untreated3 untreated4 treated1 treated2 treated3
#FBgn0000003          0          0          0          0        0        0        1
#FBgn0000008         92        161         76         70      140       88       70
#FBgn0000014          5          1          0          0        4        0        0
#FBgn0000015          0          2          1          2        1        0        0
#FBgn0000017       4664       8714       3564       3150     6205     3072     3334
#FBgn0000018        583        761        245        310      722      299      308

coldata <- read.csv(pasAnno, row.names=1)
#> coldata
#            condition        type number.of.lanes total.number.of.reads exon.counts
#treated1fb     treated single-read               5              35158667    15679615
#treated2fb     treated  paired-end               2         12242535 (x2)    15620018
#treated3fb     treated  paired-end               2         12443664 (x2)    12733865
#untreated1fb untreated single-read               2              17812866    14924838
#untreated2fb untreated single-read               6              34284521    20764558
#untreated3fb untreated  paired-end               2         10542625 (x2)    10283129
#untreated4fb untreated  paired-end               2         12214974 (x2)    11653031

coldata <- coldata[,c("condition","type")]
#> coldata
#             condition        type
#treated1fb     treated single-read
#treated2fb     treated  paired-end
#treated3fb     treated  paired-end
#untreated1fb untreated single-read
#untreated2fb untreated single-read
#untreated3fb untreated  paired-end
#untreated4fb untreated  paired-end

coldata$condition <- factor(coldata$condition)
coldata$type <- factor(coldata$type)

rownames(coldata) <- sub("fb", "", rownames(coldata))
#> coldata
#           condition        type
#treated1     treated single-read
#treated2     treated  paired-end
#treated3     treated  paired-end
#untreated1 untreated single-read
#untreated2 untreated single-read
#untreated3 untreated  paired-end
#untreated4 untreated  paired-end

all(rownames(coldata) %in% colnames(cts))
##[1] TRUE
all(rownames(coldata) == colnames(cts))
## [1] FALSE
cts <- cts[, rownames(coldata)]
#> head(cts)
#            treated1 treated2 treated3 untreated1 untreated2 untreated3 untreated4
#FBgn0000003        0        0        1          0          0          0          0
#FBgn0000008      140       88       70         92        161         76         70
#FBgn0000014        4        0        0          5          1          0          0
#FBgn0000015        1        0        0          0          2          1          2
#FBgn0000017     6205     3072     3334       4664       8714       3564       3150
#FBgn0000018      722      299      308        583        761        245        310

all(rownames(coldata) == colnames(cts))
## [1] TRUE

可以看到这时cts(count matrix)的列名和coldata(column data)的行名顺序和名称是一致的
> pasCts
[1] "/home/R-3.6.1/library/pasilla/extdata/pasilla_gene_counts.tsv"

[XXX@localhost extdata]$ less pasilla_gene_counts.tsv
gene_id untreated1      untreated2      untreated3      untreated4      treated1        treated2        treated3
FBgn0000003     0       0       0       0       0       0       1
FBgn0000008     92      161     76      70      140     88      70
FBgn0000014     5       1       0       0       4       0       0
FBgn0000015     0       2       1       2       1       0       0
FBgn0000017     4664    8714    3564    3150    6205    3072    3334
FBgn0000018     583     761     245     310     722     299     308
FBgn0000022     0       1       0       0       0       0       0
FBgn0000024     10      11      3       3       10      7       5
FBgn0000028     0       1       0       0       0       1       1
FBgn0000032     1446    1713    615     672     1698    696     757
FBgn0000036     2       1       0       0       1       0       1
> pasAnno
[1] "/home/R-3.6.1/library/pasilla/extdata/pasilla_sample_annotation.csv"

[XXX@localhost extdata]$ less pasilla_sample_annotation.csv
"file","condition","type","number of lanes","total number of reads","exon counts"
"treated1fb","treated","single-read",5,"35158667",15679615
"treated2fb","treated","paired-end",2,"12242535 (x2)",15620018
"treated3fb","treated","paired-end",2,"12443664 (x2)",12733865
"untreated1fb","untreated","single-read",2,"17812866",14924838
"untreated2fb","untreated","single-read",6,"34284521",20764558
"untreated3fb","untreated","paired-end",2,"10542625 (x2)",10283129
"untreated4fb","untreated","paired-end",2,"12214974 (x2)",11653031

到这里读取数据的准备已经完成,并且我们需要记住的是
计数矩阵(count matrix)的列和列数据(coldata)的行(有关示例的信息)的顺序是相同的。DESeq 2不会猜测计数矩阵的哪一列属于列数据的哪一行,必须按照一致的顺序提供给DESeq 2。

因此数据构建准备时,我们要实现:

顺序一致、名称一致!!!

顺序一致、名称一致!!!

顺序一致、名称一致!!!


Construct a DESeqDataSet

数据文件已经准备好了,那接下来就可以构建DESeqDataSet了。

library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds
## class: DESeqDataSet 
## dim: 14599 7 
## metadata(1): version
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): treated1 treated2 ... untreated3 untreated4
## colData names(2): condition type

head(dds@assays@data$counts)
##            treated1 treated2 treated3 untreated1 untreated2 untreated3 untreated4
##FBgn0000003        0        0        1          0          0          0          0
##FBgn0000008      140       88       70         92        161         76         70
##FBgn0000014        4        0        0          5          1          0          0
##FBgn0000015        1        0        0          0          2          1          2
##FBgn0000017     6205     3072     3334       4664       8714       3564       3150
##FBgn0000018      722      299      308        583        761        245        310
原始count数据查看

如果想添加一部分 feature data进去,可以使用DataFrame函数 frame:框架

featureData <- data.frame(gene=rownames(cts))
mcols(dds) <- DataFrame(mcols(dds), featureData) #mcols(x) <- value: Get or set the metadata columns.
mcols(dds)

## DataFrame with 14599 rows and 1 column
##                    gene
##             <character>
## FBgn0000003 FBgn0000003
## FBgn0000008 FBgn0000008
## FBgn0000014 FBgn0000014
## FBgn0000015 FBgn0000015
## FBgn0000017 FBgn0000017
## ...                 ...
## FBgn0261571 FBgn0261571
## FBgn0261572 FBgn0261572
## FBgn0261573 FBgn0261573
## FBgn0261574 FBgn0261574
## FBgn0261575 FBgn0261575

> dds
class: DESeqDataSet 
dim: 14599 7 
metadata(1): version
assays(1): counts
rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
rowData names(1): gene
colnames(7): treated1 treated2 ... untreated3 untreated4
colData names(2): condition type
#注意这时rowData names部分变成了gene

If you have additional feature data, it can be added to the DESeqDataSet by adding to the metadata columns of a newly constructed object.
(Here we add redundant data just for demonstration, as the gene names are already the rownames of the dds.)


今天先到这里了,改天见哦!

大家一起学习讨论鸭!

来一杯!
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
禁止转载,如需转载请通过简信或评论联系作者。
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,884评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,755评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,369评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,799评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,910评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,096评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,159评论 3 411
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,917评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,360评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,673评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,814评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,509评论 4 334
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,156评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,882评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,123评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,641评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,728评论 2 351