TCGA新版数据差异分析

上一篇文章里面简单学习了一下表达矩阵的提取,顺便探索了一下SummarizedExperiment对象。

今天学习下用TCGAbiolinks做差异分析。

加载R包和数据

rm(list = ls())

library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
library(TCGAbiolinks)

首先是查询、下载、整理,但是这一步我们在之前已经做好了,直接加载就可以了!

下载方法:请翻看之前的推文

# 查询
query <- GDCquery(project = "TCGA-COAD",
                    data.category = "Transcriptome Profiling",
                    data.type = "Gene Expression Quantification",
                    workflow.type = "STAR - Counts"
                    )
# 下载
GDCdownload(query, files.per.chunk = 100) #每次下载100个文件
  
# 整理
GDCprepare(query,save = T,save.filename = "TCGA-COAD_mRNA.Rdata")

我们已经下载好了,就直接加载即可:

load("TCGA-mRNA/TCGA-COAD_mRNA.Rdata")

通常我们会区分mRNA和lncRNA,所以我们这里只选择mRNA即可,方法在上一篇也说过了,非常简单!直接对SummarizedExperiment对象取子集即可!

se_mrna <- data[rowData(data)$gene_type == "protein_coding",]

se_mrna
## class: RangedSummarizedExperiment 
## dim: 19962 521 
## metadata(1): data_release
## assays(6): unstranded stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(19962): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

数据预处理

在进行差异分析前进行一些预处理。

dim(se_mrna)
## [1] 19962   521

首先根据spearman相关系数去除异常值:

coad_coroutliers <- TCGAanalyze_Preprocessing(se_mrna,cor.cut = 0.7)
## Number of outliers: 0

dim(coad_coroutliers)
## [1] 19962   521

还会生成一张相关图:Array Array Intensity correlation (AAIC):


image.png

接下来进行标准化:

# normalization of genes
coadNorm <- TCGAanalyze_Normalization(
    tabDF = coad_coroutliers, 
    geneInfo =  geneInfoHT
)
## I Need about  127 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]
## Step 1 of 4: newSeqExpressionSet ...
## Step 2 of 4: withinLaneNormalization ...
## Step 3 of 4: betweenLaneNormalization ...
## Step 4 of 4: exprs ...

会使用EDASeq包中的方法:

EDASeq::newSeqExpressionSet
EDASeq::withinLaneNormalization
EDASeq::betweenLaneNormalization
EDASeq::counts
dim(coadNorm)
## [1] 19469   521

然后过滤掉低表达的基因:

# quantile filter of genes
coadFilt <- TCGAanalyze_Filtering(
    tabDF = coadNorm,
    method = "quantile", 
    qnt.cut =  0.25
)

看看一通操作下来后还剩多少基因?

dim(coadFilt)
## [1] 14600   521

从最开始的19962变成了现在的14600,过滤掉了5000+基因......

最后是根据肿瘤组织和正常组织进行分组:
这里我们只选择了实体瘤和部分正常组织。如果你想选择更多,只要在typesample参数中添加更多类型即可。

可选类型见下图,也是根据TCGA-barcode进行判断的:


image.png
# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(
    barcode = colnames(coadFilt),
    typesample = c("NT")
)

# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(
    barcode = colnames(coadFilt), 
    typesample = c("TP")
)

所以分组这一步我们还是自己搞定!就是根据barcode的第14,15位数,结合上面那张图判断,毫无难度。


image.png
# 小于10就是tumor
samplesTumor <- as.numeric(substr(colnames(coadFilt),14,15))<10

差异分析

非常简单,支持edgeRlimma两种方法,当然也可以无缝连接DESeq2进行差异分析!

# Diff.expr.analysis (DEA)
coadDEGs <- TCGAanalyze_DEA(
    mat1 = coadFilt[,!samplesTumor], # normal矩阵
    mat2 = coadFilt[,samplesTumor], # tumor矩阵
    Cond1type = "Normal",
    Cond2type = "Tumor",
    fdr.cut = 0.01, 
    logFC.cut = 1,
    pipeline = "edgeR", # limma
    method = "glmLRT"
)
## Batch correction skipped since no factors provided
## ----------------------- DEA -------------------------------
## o 41 samples in Cond1type Normal
## o 480 samples in Cond2type Tumor
## o 14600 features as miRNA or genes
## This may take some minutes...
## ----------------------- END DEA -------------------------------

差异分析就做好了!结果非常完美,同时提供了gene_name和gene_type,也就是说我们一开始不取子集也是可以的~~,最后再取也行!

image.png

使用DESeq2进行差异分析

连接DESeq2那真是太简单了,无缝衔接!!

library(DESeq2)

直接把SummarizedExperiment对象传给DESeqDataSet()函数即可。

不过需要分组信息,这个需要我们手动制作一下。

制作分组信息

其实我们的对象中包含了sample_type这一列信息,就在coldata中,但是有点过于详细了。

table(colData(se_mrna)$sample_type)
## 
##          Metastatic       Primary Tumor     Recurrent Tumor Solid Tissue Normal 
##                   1                 478                   1                  41

我们给它修改一下~

new_type <- ifelse(colData(se_mrna)$sample_type == "Solid Tissue Normal",
                   "Normal",
                   "Tumor")
colData(se_mrna)$sample_type <- new_type

table(colData(se_mrna)$sample_type)
## 
## Normal  Tumor 
##     41    480

这样就把分组搞定了!skr!

差异分析

然后就是愉快的进行差异分析~

ddsSE <- DESeqDataSet(se_mrna, design = ~ sample_type)
## renaming the first element in assays to 'counts'
ddsSE
## class: DESeqDataSet 
## dim: 19962 521 
## metadata(2): data_release version
## assays(6): counts stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(19962): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

先过滤,这里我们简单点,

keep <- rowSums(counts(ddsSE)) >= 50
ddsSE <- ddsSE[keep,]

ddsSE
## class: DESeqDataSet 
## dim: 18820 521 
## metadata(2): data_release version
## assays(6): counts stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(18820): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

确定谁和谁比,我们设定Tumor-Normal

ddsSE$sample_type <- factor(ddsSE$sample_type, levels = c("Tumor","Normal"))

接下来就可以进行差异分析了:

dds <- DESeq(ddsSE)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 1544 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res <- results(dds, contrast = c("sample_type","Tumor","Normal"))
res
## log2 fold change (MLE): sample_type Tumor vs Normal 
## Wald test p-value: sample_type Tumor vs Normal 
## DataFrame with 18820 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE       stat      pvalue
##                    <numeric>      <numeric> <numeric>  <numeric>   <numeric>
## ENSG00000000003.15  5198.688       0.456284 0.1445636    3.15628 1.59793e-03
## ENSG00000000005.6     42.064       0.414580 0.3014120    1.37546 1.68989e-01
## ENSG00000000419.13  1743.704       0.849473 0.1134251    7.48929 6.92491e-14
## ENSG00000000457.14   463.909      -0.261646 0.0690276   -3.79046 1.50371e-04
## ENSG00000000460.17   328.546       1.272811 0.0940574   13.53229 1.00837e-41
## ...                      ...            ...       ...        ...         ...
## ENSG00000288658.1   5.317261    -1.45986802  0.274148 -5.3251160 1.00889e-07
## ENSG00000288660.1   4.648268     1.47152585  0.450554  3.2660389 1.09063e-03
## ENSG00000288669.1   0.143343    -0.13840184  1.339247 -0.1033431 9.17691e-01
## ENSG00000288674.1   3.201667     0.00548347  0.174731  0.0313824 9.74965e-01
## ENSG00000288675.1  15.048025     2.01434132  0.174631 11.5348224 8.80688e-31
##                           padj
##                      <numeric>
## ENSG00000000003.15 2.75622e-03
## ENSG00000000005.6  2.13290e-01
## ENSG00000000419.13 3.14116e-13
## ENSG00000000457.14 2.95220e-04
## ENSG00000000460.17 2.77449e-40
## ...                        ...
## ENSG00000288658.1  2.74461e-07
## ENSG00000288660.1  1.92231e-03
## ENSG00000288669.1  9.29645e-01
## ENSG00000288674.1  9.78866e-01
## ENSG00000288675.1  1.20917e-29

结果很棒,不过没有gene_symbol了,需要自己添加哦~

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,547评论 6 477
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,399评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,428评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,599评论 1 274
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,612评论 5 365
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,577评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,941评论 3 395
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,603评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,852评论 1 297
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,605评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,693评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,375评论 4 318
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,955评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,936评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,172评论 1 259
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 43,970评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,414评论 2 342

推荐阅读更多精彩内容