(2) SAGE (serial analysis of gene expression), LongSAGE, and MPSS (massively parallel signature sequencing), which are based on Sanger sequencing;
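The array data commonly found on GEO generally fall into two groups: oligonucleotide arrays (in situ oligonucleotide and spotted oligonucleotide) and cDNA arrays (spotted DNA/cDNA). Many Affymetrix arrays are in situ oligonucleotide arrays. There are many other array vendors as well, such as Agilent and Applied Biosystems (AB).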
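Second-generation sequencing data also come in different workflows, which differ according to the processing software used; for example, TCGA has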
Of these, HTSeq is the most common. Another common workflow is RSEM data (expression values estimated with the RSEM algorithm).
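RSEM (RNA-Seq by Expectation-Maximization) estimates expression levels with the EM algorithm. The main problem it addresses is that, because of alternative splicing and similar causes, some reads may map to more than one transcript, which makes count-based quantification ambiguous. The classic ALEXA-Seq approach computes expression only from reads that align to a single reference location, whereas RSEM uses the EM algorithm to estimate abundance (RSEM was published in 2010).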
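To make the idea of EM-based assignment of multi-mapping reads concrete, here is a minimal sketch in Python. It is a deliberately simplified toy model, not RSEM's actual implementation: it assumes each read is summarized only by the set of transcripts it is compatible with, that reads are drawn uniformly along a transcript, and it ignores read quality, fragment length, and orientation. The function name `em_expected_counts` and the toy data are hypothetical.

```python
def em_expected_counts(read_hits, eff_len, n_iter=100):
    """Toy EM for reads that may map to several transcripts.

    read_hits: list of lists; each inner list holds the transcript IDs
               one read is compatible with (assumed input format).
    eff_len:   dict mapping transcript ID -> effective length.
    Returns (expected_counts, theta), where theta[t] is the estimated
    probability that a read originates from transcript t.
    """
    transcripts = sorted(eff_len)
    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}

    for _ in range(n_iter):
        # E-step: split each read fractionally among its candidate
        # transcripts, in proportion to theta[t] / eff_len[t].
        counts = {t: 0.0 for t in transcripts}
        for hits in read_hits:
            if not hits:
                continue  # unaligned read, ignored in this sketch
            weights = {t: theta[t] / eff_len[t] for t in hits}
            total = sum(weights.values())
            for t, w in weights.items():
                counts[t] += w / total

        # M-step: re-estimate theta from the fractional counts.
        n_assigned = sum(counts.values())
        theta = {t: counts[t] / n_assigned for t in transcripts}

    return counts, theta


# Toy data: 30 reads unique to A, 10 unique to B, 20 ambiguous.
read_hits = [["A"]] * 30 + [["B"]] * 10 + [["A", "B"]] * 20
eff_len = {"A": 1000.0, "B": 1000.0}

counts, theta = em_expected_counts(read_hits, eff_len)
# The ambiguous reads are split about 3:1, so the expected counts
# converge to roughly 45 for A and 15 for B (non-integer in general).
print(counts)
```

A method that keeps only uniquely mapped reads would use 30 and 10 here and discard the 20 ambiguous reads; the EM estimate redistributes them according to the current abundance estimates instead.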
RSEM [1,2] is an RNA-Seq transcript quantification program developed in 2009. You need a server with Linux/Mac OS. To run RSEM, your server should have C++, Perl and R installed. In addition, you need at least one aligner to align RNA-Seq reads for you. RSEM can call Bowtie, Bowtie 2 or STAR for you if you have them installed. Last but not least, you need to install the latest version of RSEM.
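The requirements listed above can be checked quickly before running anything. The following is a minimal sketch that only looks for executables on `PATH`; the helper name `check_rsem_environment` is hypothetical, and the sketch assumes the usual executable names (`rsem-prepare-reference`, `rsem-calculate-expression`, `perl`, `R`, `bowtie`, `bowtie2`, `STAR`). It does not check versions or the C++ toolchain.

```python
import shutil

# Executables assumed to be needed at run time.
REQUIRED = ["rsem-prepare-reference", "rsem-calculate-expression", "perl", "R"]
# At least one of these aligners should be available.
ALIGNERS = ["bowtie", "bowtie2", "STAR"]


def check_rsem_environment(which=shutil.which):
    """Return (found, missing_required, has_aligner).

    found:            dict of executable name -> path, or None if absent
    missing_required: list of required executables not on PATH
    has_aligner:      True if at least one aligner is on PATH
    The `which` argument exists only so the lookup can be swapped out.
    """
    found = {name: which(name) for name in REQUIRED + ALIGNERS}
    missing_required = [name for name in REQUIRED if found[name] is None]
    has_aligner = any(found[name] is not None for name in ALIGNERS)
    return found, missing_required, has_aligner


found, missing_required, has_aligner = check_rsem_environment()
print("missing required tools:", missing_required)
print("aligner available:", has_aligner)
```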
In addition, the paper (RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome) contains the following passage:
The primary output of RSEM consists of two files, one for isoform-level estimates, and the other for gene-level estimates. Abundance estimates are given in terms of two measures. The first is an estimate of the number of fragments that are derived from a given isoform or gene. We can only estimate this quantity because reads often do not map uniquely to a single transcript. This count is generally a non-integer value and is the expectation of the number of alignable and unfiltered fragments that are derived from a isoform or gene given the ML abundances. These (possibly rounded) counts may be used by a differential expression method such as edgeR [9] or DESeq [8]. The second measure of abundance is the estimated fraction of transcripts made up by a given isoform or gene. This measure can be used directly as a value between zero and one or can be multiplied by 10^6 to obtain a measure in terms of transcripts per million (TPM). The transcript fraction measure is preferred over the popular RPKM [18] and FPKM [6] measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species [7].
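The two abundance measures in this passage can be illustrated with a small Python sketch. It assumes the expected counts and effective lengths are already available as plain dictionaries (the values below are made up for illustration), and it uses the standard relation in which the transcript fraction is proportional to expected count divided by effective length. The function names are hypothetical.

```python
def round_expected_counts(expected_counts):
    """Round non-integer expected counts so they can be passed to
    count-based differential expression tools."""
    return {t: int(round(c)) for t, c in expected_counts.items()}


def transcript_fractions(expected_counts, eff_len):
    """Fraction of transcripts made up by each isoform:
    (count / effective length), normalized to sum to one."""
    rates = {t: expected_counts[t] / eff_len[t] for t in expected_counts}
    total = sum(rates.values())
    return {t: r / total for t, r in rates.items()}


def to_tpm(fractions):
    """Scale transcript fractions (0..1) to transcripts per million."""
    return {t: f * 1e6 for t, f in fractions.items()}


# Made-up example values.
expected_counts = {"A": 45.2, "B": 14.8}
eff_len = {"A": 1000.0, "B": 3000.0}

int_counts = round_expected_counts(expected_counts)
fractions = transcript_fractions(expected_counts, eff_len)
tpm = to_tpm(fractions)

print(int_counts)  # integer counts for count-based tools
print(tpm)         # values sum to one million by construction
```

Because the fractions always sum to one, TPM values sum to one million in every sample, which is the property that makes them easier to compare across samples than RPKM or FPKM.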