TCGA数据下载分析(1-1):RTCGA包gene获取表达值并可视化

写在前面:
  • 方法:下载处理TCGA数据的R包很多,数据来源也不一样,这一部分开始对几个包分别进行使用,写出心得。
  • 结果最终想得到的是用其中两个包
  • 这部分,RTCGA包
  • 参考:作者文档这个以及生信技能树
---------------------------------------------------------------------------

1 安装并加载包

# Load the bioconductor installer. 
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA")
# Install the clinical and mRNA gene expression data packages
biocLite("RTCGA.clinical") ## 14Mb
biocLite('RTCGA.rnaseq') ##  (612.6 MB)
biocLite("RTCGA.mRNA") ##  (85.0 MB)
biocLite('RTCGA.mutations')  ## (103.8 MB)
library(RTCGA.clinical) 
library(RTCGA.mRNA)
library(RTCGA.rnaseq)
library(RTCGA.mutations)
> library(RTCGA)
Welcome to the RTCGA (version: 1.8.0).
#查看BRCA的内容
> checkTCGA('DataSets', 'LIHC')
   Size                                                                                                                                     Name
1   61K                                                                                   LIHC.Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz
2  932K                                                                                        LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz
3  1.6G LIHC.Merge_methylation__humanmethylation450__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data.Level_3.2016012800.0.0.tar.gz
4  1.5M                  LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2016012800.0.0.tar.gz
5   22M               LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data.Level_3.2016012800.0.0.tar.gz
6  255K                LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz
7   78K   LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz.bak.20160128
8   83M                           LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__exon_expression__data.Level_3.2016012800.0.0.tar.gz
9  8.7M                           LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2016012800.0.0.tar.gz
10 8.0M                LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__splice_junction_expression__data.Level_3.2016012800.0.0.tar.gz
11 102M                            LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.Level_3.2016012800.0.0.tar.gz
12  32M                 LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz
13 281M                         LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms__data.Level_3.2016012800.0.0.tar.gz
14  80M              LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data.Level_3.2016012800.0.0.tar.gz
15 905M                   LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__exon_quantification__data.Level_3.2016012800.0.0.tar.gz
16  70M               LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__junction_quantification__data.Level_3.2016012800.0.0.tar.gz
17 5.8M                        LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg.Level_3.2016012800.0.0.tar.gz
18 5.8M                        LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
19 1.5M     LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg.Level_3.2016012800.0.0.tar.gz
20 1.5M     LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
21 210M                                                                                LIHC.Methylation_Preprocess.Level_3.2016012800.0.0.tar.gz
22 2.0M                                                                               LIHC.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz
23 416M                                                                            LIHC.Mutation_Packager_Coverage.Level_3.2016012800.0.0.tar.gz
24  25M                                                                     LIHC.Mutation_Packager_Oncotated_Calls.Level_3.2016012800.0.0.tar.gz
25  47M                                                                 LIHC.Mutation_Packager_Oncotated_Raw_Calls.Level_3.2016012800.0.0.tar.gz
26 3.8M                                                                           LIHC.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0.tar.gz
27 765M                                                                        LIHC.Mutation_Packager_Raw_Coverage.Level_3.2016012800.0.0.tar.gz
28 585K                                                                                 LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz
29 275K                                                                    LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz.bak.20160128
30 312M                                                                                    LIHC.mRNAseq_Preprocess.Level_3.2016012800.0.0.tar.gz
31 3.4M                                                                              LIHC.miRseq_Mature_Preprocess.Level_3.2016012800.0.0.tar.gz
32 2.8M                                                                                     LIHC.miRseq_Preprocess.Level_3.2016012800.0.0.tar.gz
关于这些数据
  • mRNA是芯片数据
  • ranseq是测序数据
    具体参考这里这里
  • miRSeq is micro-RNA seq. Micro RNAs are a class of non protein coding RNAs that have regulatory functions (they typically bind to the 3`UTR of coding mRNAs and regulate that way)
  • mRNA, in this situation, refers to a cDNA microarray, where pre-designed probes on the microarray surface will bind to known target mRNAs, which may be coding or non-coding
  • mRNASeq is what most will loosely refer to as 'RNA-seq'. Most protocols will capture all classes of RNA species that have poly-adenylated (poly(A)) tails. RNAs that don't have these tails include ribsomal RNAs and many enhancer RNAs, but there are always exceptions to these rules.

推荐(建议参考)

miRSeq

  • illuminahiseq_mirnaseq-miR_gene_expression - normalised micro-RNAseq counts over each micro-RNA
  • illuminahiseq_mirnaseq-miR_isoform_expression - nomalised micro-RNAseq counts over each splice isoform of each micro-RNA
    mRNA
  • agilent4502a_07_3-unc_lowess_normalization_gene_level - LOWESS-normalised cDNA expression values over each gene (data from University of North Carolina)
    mRNASeq
  • illuminahiseq_rnaseq-gene_expression - normalised RNAseq counts over each gene
  • illuminahiseq_rnaseq-exon_expression - normalised RNAseq counts over each exon of each gene
> checkTCGA('Dates')
 [1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21"
[11] "2012-04-12" "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25"
[21] "2012-09-13" "2012-10-04" "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16"
[31] "2013-02-03" "2013-02-22" "2013-03-09" "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23"
[41] "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10" "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16"
[51] "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17" "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01"
[61] "2015-08-21" "2015-11-01" "2016-01-28"

2 下载某肿瘤中Datasets中的某类数据

可以用前述的checkTCGA查看类型,然后针对性下载,用法如下:
downloadTCGA(cancerTypes, dataSet = "Merge_Clinical.Level_1", destDir, date = NULL, untarFile = TRUE, removeTar = TRUE, allDataSets = FALSE)

  • dataSet在checkTCGA中查看,可以看出数据类型名字很长,所以这里只要部分(连续)匹配即可
  • destDir在当前工作目录下创建一个新文件
setwd("E:/00TCGA/data/LIHC")
#downloading the miRNA DATA
dir.create('miRNA')
downloadTCGA(cancerTypes = 'LIHC',
             dataSet = 'miR_gene_expression',
             destDir = 'rnaseq',
             date = tail(checkTCGA('Dates'), 2)[1])

#downloading the ranseq data
dir.create('rnaseq')
downloadTCGA(cancerTypes = 'LIHC',
             dataSet = 'Level_3__gene_expression',
             destDir = 'rnaseq',
             date = tail(checkTCGA('Dates'), 2)[1])

3 获取某基因在任意癌症中的mRNA表达数据并可视化(ggplot2,ggpubrboxplotTCGA

3.0 获取mRNA表达数据

获取EZH2,PTEN,HGFAC,FYN,TIGD1等5个gene在BRCA,OV中的mRNA表达数据(LIHC没有mRNA)
expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
                        extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
> expr
# A tibble: 1,305 x 7
   bcr_patient_barcode          dataset     EZH2   PTEN HGFAC    FYN  TIGD1
   <chr>                        <chr>      <dbl>  <dbl> <dbl>  <dbl>  <dbl>
 1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA -1.79   1.36  -2.31 -0.798 -0.86 
 2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA -1.57   0.428 -2.49 -0.532 -1.06 
 3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA -2.49   1.31  -2.65  0.007 -1.06 
 4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA -2.14   0.810 -2.58 -0.466 -0.714
 5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA  0.529  0.251 -2.95  0.960 -0.664
 6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA -1.44   1.31  -3.26 -0.622 -0.490
 7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -0.426 -0.237 -2.71  0.208  0.759
 8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -0.579 -1.24  -2.45  0.563  0.022
 9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA -1.16   1.21  -2.04 -0.616 -0.287
10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.894  0.288 -2.31 -0.314  1.33 
# ... with 1,295 more rows

3.1 ggplot2绘制指定基因在不同肿瘤中的表达boxplot

expr<-expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
                        extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
expr1<-expr[,-1]
expr2<-gather(expr1, key ="mRNA", value="value", -dataset)

ggplot(expr2, aes(y=value,
             x=reorder(dataset, value, mean),
             fill= dataset))+
  geom_boxplot()+
  theme_RTCGA()+
  scale_fill_brewer(palette = "Set3")+
  facet_grid(mRNA~.)+
  theme(legend.position = "top")

boxplot如下


5genes in three kinds of cancers.jpeg

3.2 ggpubr绘制指定基因在不同肿瘤中的表达boxplot并进行统计学分析作图

这部分参考jimmy
查看样本量

table(expr$dataset)
> nb_samples
BRCA.mRNA LUSC.mRNA   OV.mRNA 
      590       154       561

bcr_patient_barcode这列改名,以便下一步可视化作图

expr$dataset <- gsub(pattern = ".mRNA", replacement = "",  expr$dataset)
expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154))
expr
> expr
# A tibble: 1,305 x 7
   bcr_patient_barcode dataset   EZH2   PTEN HGFAC    FYN  TIGD1
   <chr>               <chr>    <dbl>  <dbl> <dbl>  <dbl>  <dbl>
 1 BRCA1               BRCA    -1.79   1.36  -2.31 -0.798 -0.86 
 2 BRCA2               BRCA    -1.57   0.428 -2.49 -0.532 -1.06 
 3 BRCA3               BRCA    -2.49   1.31  -2.65  0.007 -1.06 
 4 BRCA4               BRCA    -2.14   0.810 -2.58 -0.466 -0.714
 5 BRCA5               BRCA     0.529  0.251 -2.95  0.960 -0.664
 6 BRCA6               BRCA    -1.44   1.31  -3.26 -0.622 -0.490
 7 BRCA7               BRCA    -0.426 -0.237 -2.71  0.208  0.759
 8 BRCA8               BRCA    -0.579 -1.24  -2.45  0.563  0.022
 9 BRCA9               BRCA    -1.16   1.21  -2.04 -0.616 -0.287
10 BRCA10              BRCA    -0.894  0.288 -2.31 -0.314  1.33

绘制EZH2表达值的boxplot图

library(ggpubr)
ggboxplot(expr, x = "dataset", y = "EZH2",
          title = "EZH2", ylab = "Expression",
          color = "dataset", palette = "jco")
ezh2.jpeg

4 获取某基因在任意癌症中的rnaseq表达数据并可视化(ggplot2,ggpubr和boxplotTCGA)

  • ggplot2和ggpubr的用法与第3部分几乎一致,而boxplotTCGA无法直接获取mRNA表达数据,只有rnaseq,另外还有mutation等数据。

  • 不同的是,不仅需要gene symbol还要entrez的ID,如MET|4233

4.0 获取rnaseq表达数据

  • 获取EZH2,PTEN,HGFAC,FYN等4个gene在BRCA,OV和LUSC中的rnaseq表达数据
expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
                      extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))
> expr
# A tibble: 2,071 x 6
   bcr_patient_barcode          dataset     `EZH2|2146` `PTEN|5728` `HGFAC|3083` `FYN|2534`
   <chr>                        <chr>             <dbl>       <dbl>        <dbl>      <dbl>
 1 TCGA-3C-AAAU-01A-11R-A41B-07 BRCA.rnaseq        487.       1724.        4.83       247. 
 2 TCGA-3C-AALI-01A-11R-A41B-07 BRCA.rnaseq        941.       1107.        9.79       523. 
 3 TCGA-3C-AALJ-01A-31R-A41B-07 BRCA.rnaseq        492.       1479.        7.25       814. 
 4 TCGA-3C-AALK-01A-11R-A41B-07 BRCA.rnaseq        334.       1877.        9.10       562. 
 5 TCGA-4H-AAAK-01A-12R-A41B-07 BRCA.rnaseq        297.       1740.        1.70       549. 
 6 TCGA-5L-AAT0-01A-12R-A41B-07 BRCA.rnaseq        238.       1597.        7.63       689. 
 7 TCGA-5L-AAT1-01A-12R-A41B-07 BRCA.rnaseq        258.       1374.        0.815      853. 
 8 TCGA-5T-A9QA-01A-11R-A41B-07 BRCA.rnaseq        253.       2181.        3.14        84.7
 9 TCGA-A1-A0SB-01A-11R-A144-07 BRCA.rnaseq        392.       2529.        0.901      740. 
10 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.rnaseq        191.       1876.        0          522. 
# ... with 2,061 more rows

4.1 ggplot2绘制指定基因在不同肿瘤中的表达boxplot(rnaseq)

#rnaseq ggplot2
expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
                      extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))

expr1<-expr[,-1]
expr2<-gather(expr1, key ="rnaseq", value="value", -dataset)

ggplot(expr2, aes(y=value,
                  x=reorder(dataset, value, mean),
                  fill= dataset))+
  geom_boxplot()+
  theme_RTCGA()+
  scale_fill_brewer(palette = "Set3")+
  facet_grid(rnaseq~.)+
  theme(legend.position = "top")
rnaseq ggplot2.jpeg

4.2 ggpubr

查看样本量

nb_samples <- table(expr$dataset)
nb_samples
> nb_samples
BRCA.rnaseq LUSC.rnaseq   OV.rnaseq 
       1212         552         307 
ggboxplot(expr, x = "dataset", y = "`PTEN|5728`",
          title = "ESR1|2099", ylab = "Expression",
          color = "dataset", palette = "jco")
PTEN_5728.jpeg

4.3 boxplotTCGA

具体参考RTCGA的文档

expressionsTCGA(LIHC.rnaseq, BLCA.rnaseq, BRCA.rnaseq, OV.rnaseq,extra
                extract.cols = "MET|4233") %>%
  rename(cohort = dataset,
  MET = `MET|4233`) %>%
  #cancer samples
  filter(substr(bcr_patient_barcode, 14, 15) == "01") -> 
  ACC_BLCA_BRCA_OV.rnaseq
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "MET")
MET RNASEQ.jpeg
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "log1p(MET)")
Rplot.jpeg

后记-----------------------------------------

这部分只是用RTCGA演示了如何下载数据;mRNA和rnaseq表达值的plot。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 219,589评论 6 508
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 93,615评论 3 396
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 165,933评论 0 356
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,976评论 1 295
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,999评论 6 393
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,775评论 1 307
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,474评论 3 420
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,359评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,854评论 1 317
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 38,007评论 3 338
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,146评论 1 351
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,826评论 5 346
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,484评论 3 331
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,029评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,153评论 1 272
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,420评论 3 373
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,107评论 2 356

推荐阅读更多精彩内容

  • 使用cBioPortal进行复杂的癌症基因组和临床profiles整合分析(Y大宽原创,转载需要说明) 主要来自于...
    Y大宽阅读 73,638评论 10 127
  • ——人们总是愿意答应自己认识和喜爱的人提出的要求。 从销售上来看: 【投其所好】——在销售中,有效模仿对方的说话方...
    庄七七阅读 190评论 0 0
  • View简介 View是Android里面所有控件的基类,ViewGroup也是继承自View,我们常用的控件都是...
    修行与蜕变阅读 479评论 0 2
  • 子曰: “不患人之不己知,患不知人也。 ” 孔子说:不怕别人不了解自己,只怕自己不了解别人。 其实我的想法与其不同...
    魔笛大公举阅读 1,042评论 0 0
  • 1.原文:末那识特征:第一,它总是追求愉悦。第二,它尝试逃离痛苦。第三,它忽视追求愉悦的危险。第四,他会忽视痛苦的...
    寅颖阅读 224评论 0 0