生物信息学流程：mRNA Analysis Pipeline

Introduction 介绍

GDC mRNA定量分析管道测量 HT-Seq 原始reads统计中的基因表达水平，Fragments per Kilobase of transcript per Million mapped reads（FPKM）和FPKM-UQ（上四分位标准化）。首先将reads与GRCh38 reference genome 参考基因组比对，然后通过量化映射的reads产生这些值。为了促进样品间归一化，所有RNA-Seq读数在分析过程中都被视为unstranded的状态.

Data Processing Steps 数据处理步骤

RNA-Seq 比对流程

以 Alignment Workflow 开始比对的流程, 该流程使用 STAR 中重复比对方法执行. STAR 分别比对每个 read group 然后将得到的比对文件合并为一个。按照国际癌症基因组协会 ICGC ( github) 使用的方法, the two-pass method 包含剪接点检测步骤，其用于产生最终比对。此工作流程输出基因组BAM文件，其中包含比对和未比对的reads。质量评估在比对前用 FASTQC 进行，并在比对后用 Picard Tools 进行。.

除了上面详述的基因组比对之外，在数据发布之后处理的文件具有相关的转录组和嵌合比对。这仅用于至少有1组paired-end reads的等份样品. 嵌合的BAM文件包含mapping到不同染色体或链的reads（融合比对）。基因组比对文件包含嵌合和未对齐的reads，以便于检索所有原始reads。转录组比对报告使用转录物坐标而不是基因组坐标比对reads。转录组比对队列也被不同地排序以促进下游分析。这种排序方法不允许在这些排列上进行BAM切片，故不支持BAM索引文件配对。这些对齐的拼接头文件也可用。

RNA Alignment Pipeline

I/O	Entity	Format
Input	Submitted Unaligned Reads or Submitted Aligned Reads	FASTQ or BAM
Output	Aligned Reads	BAM

RNA-Seq Alignment 命令行参数

请注意，由于正在进行管道开发和改进，从GDC门户下载的文件中的版本号可能会有所不同。

# STAR-2.4.2a

### For users with access to the ICGC pipeline:

python star_align.py \
--genomeDir <star_index_path> \
--FastqFileIn <input_fastq_path> \
--workDir <work_dir> \
--out <output_bam> \
--genomeFastaFiles <reference> \
--runThreadN 8 \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 10 \
--alignIntronMax 500000 \
--alignMatesGapMax 1000000 \
--sjdbScore 2 \
--limitBAMsortRAM 0 \
--alignSJDBoverhangMin 1 \
--genomeLoad NoSharedMemory \
--outFilterMatchNminOverLread 0.33 \
--outFilterScoreMinOverLread 0.33 \
--twopass1readsN -1 \
--sjdbOverhang 100 \
--outSAMstrandField intronMotif \
--outSAMunmapped Within

### For users without access to the ICGC pipeline:

### Step 1: Building the STAR index.*

STAR
--runMode genomeGenerate
--genomeDir <star_index_path>
--genomeFastaFiles <reference>
--sjdbOverhang 100
--sjdbGTFfile <gencode.v22.annotation.gtf>
--runThreadN 8

### Step 2: Alignment 1st Pass.

STAR
--genomeDir <star_index_path>
--readFilesIn <fastq_left_1>,<fastq_left2>,... <fastq_right_1>,<fastq_right_2>,...
--runThreadN <runThreadN>
--outFilterMultimapScoreRange 1
--outFilterMultimapNmax 20
--outFilterMismatchNmax 10
--alignIntronMax 500000
--alignMatesGapMax 1000000
--sjdbScore 2
--alignSJDBoverhangMin 1
--genomeLoad NoSharedMemory
--readFilesCommand <bzcat|cat|zcat>
--outFilterMatchNminOverLread 0.33
--outFilterScoreMinOverLread 0.33
--sjdbOverhang 100
--outSAMstrandField intronMotif
--outSAMtype None
--outSAMmode None

### Step 3: Intermediate Index Generation.

STAR
--runMode genomeGenerate
--genomeDir <output_path>
--genomeFastaFiles <reference>
--sjdbOverhang 100
--runThreadN <runThreadN>
--sjdbFileChrStartEnd <SJ.out.tab from previous step>

### Step 4: Alignment 2nd Pass.

STAR
--genomeDir <output_path from previous step>
--readFilesIn <fastq_left_1>,<fastq_left2>,... <fastq_right_1>,<fastq_right_2>,...
--runThreadN <runThreadN>
--outFilterMultimapScoreRange 1
--outFilterMultimapNmax 20
--outFilterMismatchNmax 10
--alignIntronMax 500000
--alignMatesGapMax 1000000
--sjdbScore 2
--alignSJDBoverhangMin 1
--genomeLoad NoSharedMemory
--limitBAMsortRAM 0
--readFilesCommand <bzcat|cat|zcat>
--outFilterMatchNminOverLread 0.33
--outFilterScoreMinOverLread 0.33
--sjdbOverhang 100
--outSAMstrandField intronMotif
--outSAMattributes NH HI NM MD AS XS
--outSAMunmapped Within
--outSAMtype BAM SortedByCoordinate
--outSAMheaderHD @HD VN:1.4
--outSAMattrRGline <formatted RG line provided by wrapper>

*这些索引可在 GDC Website 上下载，无需再次构建。

mRNA 表达量处理流程

比对后，通过 RNA Expression Workflow 处理BAM文件以确定RNA表达水平。

映射到每个基因的读数使用HT-Seq-Count计数。表达式值以制表符分隔的格式提供。 GENCODE v22 用于基因注释。

在Data Release 14之后处理的文件具有STAR在对齐步骤期间生成的额外读取计数集。

I/O	Entity	Format
Input	Aligned Reads	BAM
Output	Gene Expression	TXT

mRNA Quantification 命令行参数

HTSeq-0.6.1p1

htseq-count \
-m intersection-nonempty \
-i gene_id \
-r pos \
-s no \
- gencode.v22.annotation.gtf

RC_g: 映射到Gene的reads数
RC_pc: 映射到所有蛋白质编码基因的reads数
RC_g75: 本中基因的第75百分位reads计数值
L: Length of the gene in base pairs; 计算为基因中所有外显子的总和

Note: 在归一化时，reads计数乘以标量(10⁹) 以考虑千碱基和'百万映射读数'单位

Examples 样品

Sample 1: Gene A

Gene length: 3,000 bp
1,000 reads mapped to Gene A
1,000,000 reads mapped to all protein-coding regions
Read count in Sample 1 for 75th percentile gene: 2,000

FPKM for Gene A = (1,000)(10^9)/[(3,000)(1,000,000)] = 333.33

FPKM-UQ for Gene A = (1,000)(10^9)/[(3,000)(2,000)] = 166,666.67

File Access and Availability 文件访问和可用性

为了便于在用户创建的管道中使用协调数据，可以在GDC数据门户中的几个中间步骤中访问RNA-Seq基因表达。以下是可在GDC Data Portal中下载的每种文件类型的说明。

Type	Description	Format
RNA-Seq Alignment	已经与GRCh38构建一致的RNA-Seq reads。包括未比对上的reads以促进原始读取集的可用性	BAM
HT-Seq Read Counts	通过HT-Seq计算的与每个基因比对的reads数目	TXT
STAR Read Counts	STAR计算的比对到每个基因的reads数	TSV
FPKM	标准化的表达值，其考虑每个基因长度和映射到所有蛋白质编码基因的reads的数量	TXT
FPKM-UQ	FPKM公式的修改版本，其中第75百分位reads计数用作分母代替蛋白质编码的总reads数	TXT

生物信息学流程：mRNA Analysis Pipeline