Reference: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown # a must-read!
Manual: http://ccb.jhu.edu/software/stringtie/gff.shtml
Reference article: https://www.jianshu.com/p/5b104830751b # done with cuffcompare, the similar Cufflinks utility
Reference article: https://www.jianshu.com/p/1f5d13cc47f8 # gffcompare was not used, which left a large number of unknown transcripts
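I. Introduction
Comparing transcript quantification across samples requires the transcript information to be stored in a common format; assemblers generally output GTF or GFF. Assembly produces a large number of new transcripts, and their only annotation, the position on the chromosome, obviously cannot reveal their biological meaning by eye. We therefore compare them with the known transcript annotation file (annotation.gtf) to link the newly assembled transcripts to annotated ones, which makes it easier to identify novel transcripts. This is the job gffcompare does. Because it was developed from cuffcompare, a utility shipped with Cufflinks, much of its logic and many of its output file formats resemble those of cuffcompare.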
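II. Usage and options
Usage: gffcompare [options] gtf.file(s)
Typical command: gffcompare -G -r annotation.gtf -o output.prefix input.gtf(s)
Common options:
-r the annotated reference GTF file
-G compare all transcripts in the input GTF, even if some of them may be redundant
-o prefix for the output files
-i when there are many GTF files, pass a text file listing them with -i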
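The typical command can also be assembled from a script, which helps when the list of input GTF files is generated programmatically. The following is a minimal sketch, not part of gffcompare itself: the helper names `write_gtf_list` and `build_gffcompare_cmd` are hypothetical, only the flags named in this section (-r, -G, -o, -i) are used, and the list file is assumed to hold one GTF path per line. Note that -G does not appear in the v0.11.2 help text reproduced in this section, so its availability may depend on the version.

```python
from pathlib import Path


def write_gtf_list(gtf_paths, list_path):
    """Write one GTF path per line (assumed format of the -i list file)."""
    Path(list_path).write_text("".join(f"{p}\n" for p in gtf_paths))
    return str(list_path)


def build_gffcompare_cmd(reference_gtf, out_prefix,
                         input_gtfs=None, gtf_list=None, keep_all=True):
    """Return the gffcompare argument list; nothing is executed here."""
    # Exactly one input source is required: explicit files or a list file.
    if bool(input_gtfs) == bool(gtf_list):
        raise ValueError("give exactly one of input_gtfs or gtf_list")
    cmd = ["gffcompare"]
    if keep_all:
        cmd.append("-G")  # as used in the typical command in this section
    cmd += ["-r", str(reference_gtf), "-o", str(out_prefix)]
    if gtf_list:
        cmd += ["-i", str(gtf_list)]
    else:
        cmd += [str(p) for p in input_gtfs]
    return cmd


cmd = build_gffcompare_cmd("annotation.gtf", "output.prefix",
                           input_gtfs=["input.gtf"])
print(" ".join(cmd))
# prints: gffcompare -G -r annotation.gtf -o output.prefix input.gtf
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` on a machine where gffcompare is on the PATH.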
All options
gffcompare v0.11.2
-----------------------------
Usage:
gffcompare [-r <reference_mrna.gtf> [-R]] [-T] [-V] [-s <seq_path>]
[-o <outprefix>] [-p <cprefix>]
{-i <input_gtf_list> | <input1.gtf> [<input2.gtf> .. <inputN.gtf>]}
GffCompare provides classification and reference annotation mapping and
matching statistics for RNA-Seq assemblies (transfrags) or other generic
GFF/GTF files.
GffCompare also clusters and tracks transcripts across multiple GFF/GTF
files (samples), writing matching transcripts (identical intron chains) into
<outprefix>.tracking, and a GTF file <outprefix>.combined.gtf which
contains a nonredundant set of transcripts across all input files (with
a single representative transfrag chosen for each clique of matching transfrags
across samples).
Options:
-v display gffcompare version (also --version)
-i provide a text file with a list of (query) GTF files to process instead
of expecting them as command line arguments (useful when a large number
of GTF files should be processed)
-r reference annotation file (GTF/GFF)
--strict-match : the match code '=' is only assigned when all exon boundaries
match; code '~' is assigned for intron chain match or single-exon
-R for -r option, consider only the reference transcripts that
overlap any of the input transfrags (Sn correction)
-Q for -r option, consider only the input transcripts that
overlap any of the reference transcripts (Precision correction);
(Warning: this will discard all "novel" loci!)
-M discard (ignore) single-exon transfrags and reference transcripts
-N discard (ignore) single-exon reference transcripts
-D discard "duplicate" query transfrags (i.e. those with the same
intron chain) within a single sample (disable "annotation" mode)
-S like -D, but stricter duplicate checking: only discard matching query
or reference transcripts (same intron chain) if their boundaries are fully
contained within other, larger or identical transfrags; if --strict-match
is also given, exact matching of all exon boundaries is required
--no-merge : disable close-exon merging (default: merge exons separated by
"introns" shorter than 5 bases
-s path to genome sequences (optional); this can be either a multi-FASTA
file or a directory containing single-fasta files (one for each contig);
repeats must be soft-masked (lower case) in order to be able to classify
transfrags as repeats
-T do not generate .tmap and .refmap files for each input file
-e max. distance (range) allowed from free ends of terminal exons of
reference transcripts when assessing exon accuracy (100)
-d max. distance (range) for grouping transcript start sites (100)
-V verbose processing mode (also shows GFF parser warnings)
--chr-stats: the .stats file will show summary and accuracy data
for each reference contig/chromosome separately
--debug : enables -V and generates additional files:
<outprefix>.Q_discarded.lst, <outprefix>.missed_introns.gff,
<outprefix>.R_missed.lst
Options for the combined GTF output file:
-p the name prefix to use for consensus transcripts in the
<outprefix>.combined.gtf file (default: 'TCONS')
-C discard matching and "contained" transfrags in the GTF output
(i.e. collapse intron-redundant transfrags across all query files)
-A like -C but does not discard intron-redundant transfrags if they start
with a different 5' exon (keep alternate TSS)
-X like -C but also discard contained transfrags if transfrag ends stick out
within the container's introns
-K for -C/-A/-X, do NOT discard any redundant transfrag matching a reference
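III. Output files
1. Class codes
Class codes describe the relationship between each transcript in the input and the transcripts in the annotation. The meaning of each code is shown in the figure below.
[Figure: table of class codes and their meanings]
2. There are six output files. For the first four you can choose where they are saved; the last two are written to the same location as the input GTF files. All of them start with the prefix given by -o.
gffcmp.annotated.gtf: contains the class code information; this file is generally passed on to StringTie in the later steps
gffcmp.stats: summary statistics for each feature type, including the numbers of novel exons and introns found. It reports two measures, sensitivity and precision, defined as follows: "Sensitivity is defined as the proportion of genes from the annotation that are correctly reconstructed, whereas precision (also known as positive predictive value) captures the proportion of the output that overlaps the annotation"
gffcompare.loci: see the manual
gffcompare.tracking: see the manual
gffcompare_result.refmap: this file has four columns. The first, ref_gene_id, is the gene symbol, or the Ensembl gene ID when there is no symbol; the second, ref_id, is the Ensembl transcript ID; the third, class_code, is either "=" or "c"; the fourth is cuff_id_list. The file lists the assembled transcripts that match the reference annotation almost exactly.
gffcompare_result.tmap: contains per-transcript quantification such as cov and FPKM, and can be used for quantification or for screening novel transcripts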
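As a quick check of the output, the class codes in a .tmap file can be tallied to see how many assembled transcripts fall into each category. This is a minimal sketch under stated assumptions: the .tmap file is taken to be tab-delimited with a header line that includes a `class_code` column, and the function name `count_class_codes` is hypothetical.

```python
import csv
from collections import Counter


def count_class_codes(tmap_path):
    """Count how many query transcripts carry each class code."""
    with open(tmap_path, newline="") as handle:
        # Assumption: tab-delimited file whose header names a class_code column.
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["class_code"] for row in reader)
```

For example, `count_class_codes("gffcompare_result.tmap").most_common()` lists the codes from most to least frequent; a large share of "u" entries points to many intergenic, unannotated transcripts.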
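IV. How to find novel transcripts
1. Upstream: hisat2 + stringtie + stringtie --merge
2. Midstream: gffcompare
3. Downstream: stringtie + the gffcompare result
4. Further downstream: Ballgown for quantification and differential analysis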
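The first three stages can be sketched as command lines built in Python. This is an illustration under assumptions, not a definitive pipeline: the helper names, the file naming scheme (`<sample>.sam`, `<sample>.bam`, `mergelist.txt`, `stringtie_merged.gtf`, the `ballgown/` directory) and the use of paired-end reads are all hypothetical choices, and the flags follow the commonly used HISAT2, samtools and StringTie invocations, which should be checked against the installed versions. Ballgown (stage 4) runs in R and is not covered.

```python
def per_sample_commands(sample, hisat2_index, reads_1, reads_2, annotation_gtf):
    """Stage 1, per sample: align, sort, assemble."""
    sam, bam, gtf = f"{sample}.sam", f"{sample}.bam", f"{sample}.gtf"
    return [
        # --dta reports alignments tailored for transcript assemblers
        ["hisat2", "--dta", "-x", hisat2_index,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        ["samtools", "sort", "-o", bam, sam],
        ["stringtie", bam, "-G", annotation_gtf, "-l", sample, "-o", gtf],
    ]


def merge_and_compare_commands(annotation_gtf, merge_list="mergelist.txt",
                               merged_gtf="stringtie_merged.gtf",
                               out_prefix="gffcmp"):
    """End of stage 1 (merge) and stage 2 (gffcompare)."""
    return [
        ["stringtie", "--merge", "-G", annotation_gtf,
         "-o", merged_gtf, merge_list],
        ["gffcompare", "-r", annotation_gtf, "-G",
         "-o", out_prefix, merged_gtf],
    ]


def quantify_command(sample, guide_gtf="gffcmp.annotated.gtf"):
    """Stage 3: re-estimate abundances against the gffcompare result.

    -e restricts estimation to the guide transcripts; -B writes the
    table files that Ballgown reads.
    """
    return ["stringtie", "-e", "-B", "-G", guide_gtf,
            "-o", f"ballgown/{sample}/{sample}.gtf", f"{sample}.bam"]
```

Each returned list can be passed to `subprocess.run(..., check=True)`; keeping the commands as data makes it easy to print and review them before anything is executed.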
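Features of a novel transcript (taken from other authors' articles):
1. The class code meets the criteria, e.g. it is one of "i", "j", "o", "u", "x"
2. The statistics pass the thresholds, e.g. FPKM >= 0.5, coverage > 1, length > 200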
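The two criteria above can be applied to a .tmap file with a short filter. This is a minimal sketch, assuming the .tmap file is tab-delimited with header columns named `class_code`, `FPKM`, `cov`, `len` and `qry_id`; the function name `select_novel` is hypothetical, and the default thresholds are simply the example values listed above, not recommended cutoffs.

```python
import csv

# Example class codes from the criteria above
NOVEL_CODES = frozenset({"i", "j", "o", "u", "x"})


def select_novel(tmap_path, codes=NOVEL_CODES,
                 min_fpkm=0.5, min_cov=1.0, min_len=200):
    """Return query transcript IDs that pass the class-code and threshold filters."""
    selected = []
    with open(tmap_path, newline="") as handle:
        # Assumption: tab-delimited with the header columns used below.
        for row in csv.DictReader(handle, delimiter="\t"):
            if (row["class_code"] in codes
                    and float(row["FPKM"]) >= min_fpkm   # FPKM >= 0.5
                    and float(row["cov"]) > min_cov      # coverage > 1
                    and int(row["len"]) > min_len):      # length > 200
                selected.append(row["qry_id"])
    return selected
```

The returned IDs can then be used to subset the annotated GTF or the Ballgown results. Adjust the code set and thresholds to the study at hand.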