intro

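GTF (gene transfer format, mainly used to annotate genes) and GFF (general feature format, mainly used to annotate genomes) carry detailed annotation on top of positional information. They typically serve as genome or gene annotation files and come up constantly in RNA-seq processing.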

Both are tab-delimited files with 9 columns and can be converted into each other. GTF is derived from GFF2.

GFF can hold chromosome, gene, and transcript information.
GTF is mainly used to describe genes and transcripts.

Both GFF and GTF use 1-based coordinates, and the start and end positions are both inclusive.
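Because many other tools (BED files, Python slicing) use 0-based, half-open intervals, coordinate conversion is a frequent source of off-by-one errors. The following is a minimal sketch of that conversion; the function names `gtf_to_bed_interval` and `feature_length` are made up for illustration, and the coordinates are those of the exon row in the GTF example of this post.

```python
from typing import Tuple


def gtf_to_bed_interval(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, closed GTF/GFF interval to a 0-based, half-open one."""
    if start < 1 or end < start:
        raise ValueError(f"invalid GTF interval: {start}-{end}")
    # Only the start shifts; the closed 1-based end equals the open 0-based end.
    return start - 1, end


def feature_length(start: int, end: int) -> int:
    """Length in bases of a 1-based, closed interval."""
    return end - start + 1


bed_start, bed_end = gtf_to_bed_interval(350267, 350389)
print(bed_start, bed_end)               # 350266 350389
print(feature_length(350267, 350389))   # 123
```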

Gene annotation

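hg38, GRCh38, and the GRCh38-based Ensembl releases are the internationally used versions of the human reference genome. They store essentially the same FASTA sequence (chromosome naming differs, for example chr1 versus 1) and correspond to the genome information published by three bioinformatics resources: UCSC (hg38), NCBI (GRCh38), and Ensembl. Note that Ensembl release 75 is the last release built on GRCh37 (hg19); GRCh38 annotation starts with release 76.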

GTF example

The header contains comment lines

#!genome-build GRCh38.p12
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.27
#!genebuild-last-updated 2018-01

The body looks like this (the header and the body are not taken from the same file)

chr1    ensembl gene    339070  350389  .   -   .   gene_id "ENSBTAG00000006648"; gene_version "6"; gene_source "ensembl"; gene_biotype "protein_coding";
chr1    ensembl transcript  339070  350389  .   -   .   gene_id "ENSBTAG00000006648"; gene_version "6"; transcript_id "ENSBTAT00000008737"; transcript_version "6"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";
chr1    ensembl exon    350267  350389  .   -   .   gene_id "ENSBTAG00000006648"; gene_version "6"; transcript_id "ENSBTAT00000008737"; transcript_version "6"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSBTAE00000512015"; exon_version "1";
chr1    ensembl CDS 350267  350389  .   -   0   gene_id "ENSBTAG00000006648"; gene_version "6"; transcript_id "ENSBTAT00000008737"; transcript_version "6"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSBTAP00000008737"; protein_version "6";
chr1    ensembl start_codon 350387  350389  .   -   0   gene_id "ENSBTAG00000006648"; gene_version "6"; transcript_id "ENSBTAT00000008737"; transcript_version "6"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";

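Column contents

1. seq_id: the sequence identifier, usually a chromosome or scaffold ID. Every chromosome has a unique ID.
2. source: where the annotation comes from. It can be a database such as RefSeq or a program such as the Genscan gene predictor. It can also be empty, in which case it is filled with a dot (.).
3. feature type: the kind of feature the interval represents. Common types in GTF are: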

Gene
cDNA
mRNA
5UTR
3UTR
exon
CDS
start_codon
stop_codon

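4. start: the start position of the feature on the reference sequence.
5. end: the end position of the feature on the reference sequence.
6. score: a statistic supplied by the software that indicates how reliable the annotation is, for example an alignment E-value or a prediction P-value. "." means empty.
7. strand: + for the forward strand, - for the reverse strand, ? for unknown. When strand has no meaning for the feature, fill with a dot (.).
8. phase: meaningful only for coding features (CDS, plus start_codon and stop_codon in GTF, as in the example above). It says where the first complete codon begins inside the feature. Valid values are 0, 1, and 2. Non-coding features use ".".

0: the feature starts with a complete codon at its 5' end;
1: one extra base (the third base of a codon) precedes the first complete codon;
2: two extra bases (the second and third bases of a codon) precede the first complete codon;

9. attributes: a list of properties written as "tag value" pairs, with tag and value separated by a space, unlike the key=value form used by GFF. Pairs are separated by semicolons, and some keys are predefined. The GTF specification requires gene_id and transcript_id; note that gene rows in Ensembl files, such as the example above, carry gene_id only.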
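The nine columns described above can be read with a few lines of Python. This is only a sketch: `GtfRecord` and `parse_gtf_line` are names invented here, the regular expression assumes every value is double-quoted (as in the Ensembl example), and a tag that occurs more than once keeps only its last value.

```python
import re
from typing import Dict, NamedTuple

# Matches one `tag "value"` pair in column 9 (assumes quoted values).
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


class GtfRecord(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one tab-delimited GTF line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes = dict(ATTR_RE.findall(attr_text))
    return GtfRecord(seqid, source, feature, int(start), int(end),
                     score, strand, frame, attributes)


# The gene row from the example, rebuilt with real tab characters.
line = "\t".join([
    "chr1", "ensembl", "gene", "339070", "350389", ".", "-", ".",
    'gene_id "ENSBTAG00000006648"; gene_version "6"; '
    'gene_source "ensembl"; gene_biotype "protein_coding";',
])
record = parse_gtf_line(line)
print(record.feature, record.start, record.end)   # gene 339070 350389
print(record.attributes["gene_biotype"])          # protein_coding
```

Comment lines starting with `#` (such as the `#!genome-build` header) are not handled and would need to be skipped before calling the function.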


For genes, the following attributes are provided

gene_id
gene_version
gene_name
gene_source
gene_biotype

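In Ensembl, human gene IDs carry the prefix ENSG; other species add a species code, such as ENSBTAG in the example above. gene_version is the version number of the gene ID and distinguishes successive versions. A complete ID is the ENSG number plus the version, joined by a dot, for example ENSG00000186092.6. gene_name is the gene symbol and matches the gene symbol in the NCBI database. gene_source is the origin of the annotation, for example ensembl_havana. gene_biotype is the gene type; protein_coding marks a protein-coding gene.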

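For a transcript (see the transcript row in the GTF example above), the following attributes are added on top of the gene attributes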

transcript_id
transcript_version
transcript_name
transcript_source
transcript_biotype

For exons, the following attributes are added

exon_number
exon_id
exon_version

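For a non-coding transcript, the transcript and exon records are enough to describe its structure exactly. A protein-coding transcript additionally needs 5UTR, CDS, start_codon, stop_codon, and 3UTR records to describe its structure exactly.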
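To illustrate how exon and CDS records together describe a transcript, the sketch below groups records by transcript_id and sums the length of each feature type. The tuple layout and the function name `summarize_transcripts` are assumptions for this example, the second exon is hypothetical, and the records of one transcript are assumed not to overlap.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (feature, start, end, transcript_id); coordinates are 1-based and inclusive.
Record = Tuple[str, int, int, str]

records = [
    ("exon", 350267, 350389, "ENSBTAT00000008737"),
    ("CDS", 350267, 350389, "ENSBTAT00000008737"),
    ("exon", 339070, 339200, "ENSBTAT00000008737"),  # hypothetical second exon
]


def summarize_transcripts(rows: Iterable[Record]) -> Dict[str, Dict[str, int]]:
    """Total bases per feature type for every transcript."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for feature, start, end, transcript_id in rows:
        totals[transcript_id][feature] += end - start + 1
    return {tid: dict(by_feature) for tid, by_feature in totals.items()}


summary = summarize_transcripts(records)
print(summary)
# {'ENSBTAT00000008737': {'exon': 254, 'CDS': 123}}
```

The summed exon length is the spliced transcript length, and the summed CDS length is the coding portion of it.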

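GFF vs. GTF

GTF feature types must use the names the consuming software expects. GFF feature types may use arbitrary names (GFF3 restricts them to Sequence Ontology terms).
The GTF score column is rarely used and is usually ".".
In the ninth column, GFF joins key and value with = and separates pairs with ";". GTF separates key and value with a space.
According to the GTF specification, the ninth column must begin with gene_id and transcript_id.
gffread, a tool shipped with cufflinks, converts between the two formats.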
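The difference in the ninth column is purely syntactic, as the following sketch shows by rewriting GTF-style `tag "value";` pairs as GFF-style `key=value` pairs. The function name `gtf_attrs_to_gff_style` is invented here, and this is not what gffread does internally: a real converter also has to create ID and Parent relationships and escape reserved characters, none of which is attempted below.

```python
import re

# Matches one `tag "value"` pair (assumes quoted values, as in Ensembl GTF).
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def gtf_attrs_to_gff_style(attr_text: str) -> str:
    """Rewrite `tag "value"; ...` as `tag=value;...` (syntax only)."""
    pairs = ATTR_RE.findall(attr_text)
    return ";".join(f"{key}={value}" for key, value in pairs)


converted = gtf_attrs_to_gff_style(
    'gene_id "ENSBTAG00000006648"; gene_version "6"; gene_biotype "protein_coding";'
)
print(converted)
# gene_id=ENSBTAG00000006648;gene_version=6;gene_biotype=protein_coding
```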


GFF3

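GFF3 files

The latest version of GFF (GFF3) predefines a number of attribute tags, and these tags carry special meaning.

Commonly used tags:
ID
Identifier of the feature. It must be unique within the file.
Name
Display name of the feature. The value of Name is what visualization tools show.
Alias
A secondary name for the feature.
Parent
The ID of the feature one level up. Used to group exons into transcripts and transcripts into genes.
Target
The target region of an alignment, generally used to report sequence alignment results. The format is "target_id start end [strand]", where strand is optional ("+" or "-").
Gap
Gap information of the alignment. Used together with Target to report sequence alignment results.
Note
Free-text description.
Is_circular
Whether the feature is circular. Used for circular genome sequences.

When one tag has several values, the values are separated by commas.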
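A small sketch of reading GFF3 attributes, including comma-separated multiple values, might look like this. The function name `parse_gff3_attributes` and the IDs in the sample string (exon00001, mRNA00001, mRNA00002) are hypothetical. Values are split on commas before percent-decoding, so an escaped comma (%2C) inside a value survives.

```python
from typing import Dict, List
from urllib.parse import unquote


def parse_gff3_attributes(attr_text: str) -> Dict[str, List[str]]:
    """Parse `key=value;key=v1,v2` into a dict of value lists."""
    attributes: Dict[str, List[str]] = {}
    for pair in attr_text.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, raw_value = pair.partition("=")
        # Split first, decode second: an encoded comma stays part of the value.
        attributes[unquote(key.strip())] = [unquote(v) for v in raw_value.split(",")]
    return attributes


attrs = parse_gff3_attributes(
    "ID=exon00001;Parent=mRNA00001,mRNA00002;Note=shared%2C first exon"
)
print(attrs["Parent"])   # ['mRNA00001', 'mRNA00002']
print(attrs["Note"])     # ['shared, first exon']
```

Here one exon lists two parents, which is how GFF3 expresses an exon shared by two transcripts.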
