按道理,featurecount应该是转录组分析中最基本的工具,但由于没有深挖其中的参数,前段时间在用它处理转录组数据时还是吃了大亏。现在将一些常用的参数记录下来:
首先在计数水平上:
分为feature和meta-feature,feature是值基因组上被定义为一个片段区域,meta-feature是指多个feature组成的区域,如exon和gene的关系,分享相同的feature identifier (GTF文件中有)的feature属于同一个meta-feature
featurecount的 "-t" 参数:设置feature-type,“-t” 指定的必须是gtf中有的feature,同时read只有落到这些feature上才会被统计到,默认是“exon”,即以exon为最小单位(feature)去回帖计数。
featurecount的 "-f" 参数:设置之后将会统计feature层面的数据,如exon-level;否则会统计meta-feature层面的数据,如gene-levels;
默认 "-f" 一般不设置,并与 “ -t exon" 结合使用,表明:在exon-level上计数,共享相同gene identifier(隶属同一meta-feature)的exon相加。否则就是在meta-feature层面上得到计数统计数据,即每一个得到的count矩阵以exon为行名。
参考:featurecounts的使用说明 - 简书 (jianshu.com)
A feature is an interval (range of positions) on one of the reference sequences. A meta-feature is a set of features that represents a biological construct of interest. For example, features often correspond to exons and meta-features to genes. Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. featureCounts can summarize reads at either the feature or meta-feature levels.
次就是链特异性的问题:
featurecounts使用参数 "-s <0|1|2> " 来定义链特异性,”0“ 表示普通文库,默认;”1“表示stranded链;”2“表示reversely stranded;
测序方案是链特异性文库。在使用Hisat2或 STAR等比对工具时,通过指明链特异性的参数,会在生成的bam文件中对每一条read添加 “ read 方向” 信息,featurecounts 通过 -s 参数来识别 read 方向,结合GTF文件判断该read 是否真的在该feature上,是否能够被计数。
通常而言,如果时链特异性文库,被当成普通文库,后续差别不大;而如果时普通文库,被当成链特异性文库,则差别就会很大。
一般转录组数据使用以下命令:
featureCounts -p -t exon -g gene_id -a annotation.gtf -o counts.txt mapping_results_SE.bam
下面用不同的参数对同一批转录组数据计数结果:###
##情况1:
featureCounts -T 5 -p -s 2 -t gene -g gene_id -a genome.gtf -o all_matrix ./*bam
#[all_matrix.summary]
Status mock.bam WT-H-11d.bam
Assigned 19515146 19585577
Unassigned_Unmapped 1678230 1482804
Unassigned_NoFeatures 18289861 18254838
Unassigned_Ambiguity 278186 434598
#Successfully assigned alignments:
mock.bam: 19515146 (43.6%)
WT-H-11d.bam: 19585577 (44.1%)
##情况2:
featureCounts -T 5 -p -t gene -g gene_id -a genome.gtf -o all_matrix2 ./*bam
#[all_matrix.summary]
Status mock.bam WT-H-11d.bam
Assigned 34510666 3451129
Unassigned_Unmapped 1678230 1482804
Unassigned_NoFeatures 789427 593567
Unassigned_Ambiguity 2783100 3170156
#Successfully assigned alignments:
mock.bam: 34510666 (77.0%)
WT-H-11d.bam: 34511290 (77.8%)
##情况3: 正常命令行
featureCounts -T 5 -p -t exon -g gene_id -a genome.gtf -o all_matrix3 ./*bam
#[all_matrix.summary]
Status mock.bam WT-H-11d.bam
Assigned 32210883 33904184
Unassigned_Unmapped 1678230 1482804
Unassigned_NoFeatures 4921182 3315212
Unassigned_Ambiguity 951128 1055617
#Successfully assigned alignments:
mock.bam: 32210883 (71.9%)
WT-H-11d.bam: 33904184 (76.4%)
从结果上可以看到:如果将普通文库当作链特异性文库(-s 2 ,情况1),会造成assigned的reads非常低,featurecount成功比对率低;
在比较feature 水平定义层面上的问题时,如果将feature level定义为 gene(-t gene,情况2),会造成Unassigned_Ambiguity非常多,且在实际的计数中会发现,all_matrix2基因数量和表达量都要低于all_matrix3(-t exon,情况3)。
根据定义:在使用 -t gene,即情况2下,以gene为最小的单位(feature)去回帖计数时,以下两种情况会被定义为Unassigned_Ambiguity;而使用exon为feature时,这种Unassigned_Ambiguity的情况则会大大减少
- Unassigned NoFeatures: alignments that do not overlap any feature.
- Unassigned Ambiguity: alignments that overlap two or more features (feature-level summarization) or meta-features (meta-feature-level summarization) ==无法确定分配到哪个基因