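There are currently three main approaches to studying the transcriptome:
(1) hybridization-based cDNA arrays and oligonucleotide arrays;
(2) Sanger-sequencing-based SAGE (serial analysis of gene expression), LongSAGE, and MPSS (massively parallel signature sequencing);
(3) transcriptome sequencing based on second-generation sequencing, also known as RNA-Seq.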
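Of these, Sanger-based data are rare; most datasets on GEO are either array data or second-generation sequencing data (referred to below as sequencing data).
The array data commonly found on GEO are generally in situ oligonucleotide and spotted oligonucleotide arrays (both oligonucleotide arrays), along with spotted DNA/cDNA arrays (cDNA arrays). Many Affymetrix arrays are in situ oligonucleotide arrays. There are many other array vendors as well, such as Agilent and Applied Biosystems (AB).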
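Array data yield signal intensity values (non-integers). These are not the same as counts, so the analysis workflow differs too. In addition, arrays use probes to tag genes, and each platform uses its own probe identifiers, so the matching GPL file is needed for annotation (for many platforms the annotation database has also been packaged and can be installed from Bioconductor).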
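The probe annotation step described above can be sketched in a few lines of Python. This is only an illustration: the probe IDs, gene symbols, and intensities are made up, and the mapping table stands in for what would be parsed from a real GPL file. Averaging the probes that hit the same gene is one common choice among several (taking the maximum or the most variable probe are others), so treat it as an assumption rather than a recommendation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical probe -> gene symbol table, standing in for a parsed GPL file.
# Probes without a gene assignment map to an empty string.
probe_to_gene = {
    "probe_1": "TP53",
    "probe_2": "TP53",
    "probe_3": "EGFR",
    "probe_4": "",
}

# Hypothetical signal intensities (non-integer) for a single sample.
intensity = {
    "probe_1": 8.2,
    "probe_2": 7.8,
    "probe_3": 10.5,
    "probe_4": 3.1,
}


def collapse_to_genes(intensity, probe_to_gene):
    """Average the intensities of all probes annotated to the same gene."""
    per_gene = defaultdict(list)
    for probe, value in intensity.items():
        gene = probe_to_gene.get(probe, "")
        if gene:  # unannotated probes are dropped
            per_gene[gene].append(value)
    return {gene: mean(values) for gene, values in per_gene.items()}


gene_expr = collapse_to_genes(intensity, probe_to_gene)
print(gene_expr)
```

Because the probe IDs differ between platforms, the same function would need a different `probe_to_gene` table for each GPL.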
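Second-generation sequencing data also come from different workflows, depending on the processing software used, such as those available on TCGA.
Among them, HTSeq is the most common. Another common workflow produces RSEM data (expression levels estimated with the RSEM algorithm).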
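RSEM is a remarkable tool; some explanations of it are quoted below.
RSEM (RNA-Seq by Expectation-Maximization) estimates expression levels with the EM algorithm. The main problem it addresses is that, because of alternative splicing and similar effects, some reads can map to more than one transcript, which makes the counts ambiguous. The classic ALEXA-Seq algorithm computes expression only from reads that align to a single reference location, whereas RSEM uses the EM algorithm to estimate the quantification (RSEM was published in 2010).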
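To make the idea of EM-based quantification concrete, here is a toy sketch in Python. It is not RSEM's actual model: it ignores transcript length, read quality, fragment-length distributions, and everything else RSEM accounts for, and it only assumes that each read is compatible with a known set of transcripts. The read data and the function name `em_abundance` are hypothetical.

```python
def em_abundance(read_compat, n_transcripts, n_iter=200):
    """Toy EM: estimate transcript proportions from possibly multi-mapping reads.

    read_compat: one list per read, holding the indices of the transcripts
                 that the read aligns to.
    Returns (theta, expected_counts).
    """
    n_reads = len(read_compat)
    theta = [1.0 / n_transcripts] * n_transcripts  # start from uniform proportions
    expected = [0.0] * n_transcripts
    for _ in range(n_iter):
        # E-step: split every read among its compatible transcripts,
        # in proportion to the current abundance estimates.
        expected = [0.0] * n_transcripts
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: the new proportions are the normalized expected counts.
        theta = [c / n_reads for c in expected]
    return theta, expected


# Hypothetical data: 5 reads unique to transcript 0, 2 unique to transcript 1,
# and 3 reads that align to both.
reads = [[0]] * 5 + [[1]] * 2 + [[0, 1]] * 3
theta, expected = em_abundance(reads, n_transcripts=2)
print(theta, expected)
```

In this toy case the ambiguous reads are shared out according to the evidence from the unique reads, so the expected counts are non-integer even though every input read is a whole read.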
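Data produced by the RSEM workflow are usually analyzed for differential expression with the EBSeq package (which RSEM itself bundles) rather than with DESeq or edgeR, although the RSEM paper quoted below notes that (possibly rounded) counts may also be passed to those tools.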
RSEM [1,2] is an RNA-Seq transcript quantification program developed in 2009. You need a server with Linux/Mac OS. To run RSEM, your server should have C++, Perl and R installed. In addition, you need at least one aligner to align RNA-Seq reads for you. RSEM can call Bowtie, Bowtie 2 or STAR for you if you have them installed. Last but not least, you need to install the latest version of RSEM.
In addition, the paper "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome" contains the following passage:
The primary output of RSEM consists of two files, one for isoform-level estimates, and the other for gene-level estimates. Abundance estimates are given in terms of two measures. The first is an estimate of the number of fragments that are derived from a given isoform or gene. We can only estimate this quantity because reads often do not map uniquely to a single transcript. This count is generally a non-integer value and is the expectation of the number of alignable and unfiltered fragments that are derived from a isoform or gene given the ML abundances. These (possibly rounded) counts may be used by a differential expression method such as edgeR [9] or DESeq [8]. The second measure of abundance is the estimated fraction of transcripts made up by a given isoform or gene. This measure can be used directly as a value between zero and one or can be multiplied by 10^6 to obtain a measure in terms of transcripts per million (TPM). The transcript fraction measure is preferred over the popular RPKM [18] and FPKM [6] measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species [7].
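The two abundance measures in the passage above, and how they relate to FPKM, can be illustrated with a small Python sketch. The expected counts and effective lengths are hypothetical, and the formulas are the commonly used definitions of TPM and FPKM, not code taken from RSEM itself.

```python
def tpm(expected_counts, eff_lengths):
    """Transcripts per million: transcript fraction multiplied by 10^6."""
    rates = [c / l for c, l in zip(expected_counts, eff_lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def fpkm(expected_counts, eff_lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    n_fragments = sum(expected_counts)
    return [
        c / (l / 1e3) / (n_fragments / 1e6)
        for c, l in zip(expected_counts, eff_lengths)
    ]


# Hypothetical expected counts (non-integer) and effective lengths in bases.
expected_counts = [7.14, 2.86, 30.0]
eff_lengths = [1000.0, 500.0, 2000.0]

tpm_values = tpm(expected_counts, eff_lengths)
fpkm_values = fpkm(expected_counts, eff_lengths)

# Count-based tools expect integers, so the expected counts may be rounded first.
rounded_counts = [round(c) for c in expected_counts]

print(tpm_values)
print(fpkm_values)
print(rounded_counts)
```

TPM values always sum to one million within a sample, while the sum of FPKM values depends on the length distribution of the expressed transcripts; this is the property the quoted passage refers to when it calls the transcript fraction more comparable across samples.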
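Recommended RSEM references
RSEM documentation
Differential expression analysis with RSEM
Alignment-based transcript quantification: RSEM
Transcriptome analysis study notes
Discussion of RSEM issues in TCGA
Transcriptome analysis workflow: STAR + RSEM + DESeq2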
Other recommended resources on RNA-Seq
RNA_seq_Biotrainee