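I have recently been analyzing a large plant genome: 36 Gb in total, with the longest chromosome at 4.3 Gb.
At this size, converting a GFF3 file to GTF with gffread triggers a serious bug: the output contains negative coordinates.
# gffread cannot handle chromosomes longer than 2048 Mb (2^31 bp, about 2.15 Gb)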
gffread Ldavi.gff3 -T -o Ldavi.gtf
The result is clearly wrong, because coordinates can never be negative.
(RNAseq) tl5024@iyun50:~/Ldavidii$ more Ldavi.gtf
LG01 IGA transcript -1534463781 -1534304990 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770"
LG01 IGA exon -1534463781 -1534463737 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534422824 -1534422708 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534419675 -1534419604 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534405501 -1534405423 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534405312 -1534405242 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534398737 -1534398636 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534398546 -1534398457 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534361241 -1534361167 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534361078 -1534361046 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534360930 -1534360850 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534345045 -1534344993 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534315508 -1534315402 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534305039 -1534304990 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
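The negative values look like what you get when a coordinate is stored in a signed 32-bit integer and wraps past 2^31 - 1 = 2,147,483,647. That is an assumption about the cause, not something confirmed from gffread's source. Under that assumption, adding 2^32 to a wrapped value recovers the original coordinate, which here lands inside a 4.3 Gb chromosome. The sketch below shows that arithmetic, plus a check for which sequences in an annotation exceed the limit. Both function names are made up for illustration.

```shell
INT32_MAX=2147483647   # 2^31 - 1

# Hypothetical helper: map a negative value back to the coordinate it would
# have had before a signed 32-bit wraparound (assumed cause, not confirmed).
unwrap_int32() {
    if [ "$1" -lt 0 ]; then
        echo $(( $1 + 4294967296 ))   # add 2^32
    else
        echo "$1"
    fi
}

# Hypothetical helper: list sequence IDs in a tab-separated GFF/GTF file whose
# largest end coordinate exceeds INT32_MAX, together with that coordinate.
seqs_over_int32() {
    awk -F'\t' -v lim="$INT32_MAX" '
        /^#/ || NF < 9 { next }                       # skip comments and malformed lines
        ($5 + 0) > max[$1] { max[$1] = $5 + 0; txt[$1] = $5 }
        END { for (s in max) if (max[s] > lim) print s "\t" txt[s] }
    ' "$1"
}

unwrap_int32 -1534463781   # prints 2760503515
```

Running `seqs_over_int32 Ldavi.gff3` before a conversion would list the sequences at risk. This is a diagnostic only; I would not use the unwrapped values to patch a broken GTF.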
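So the conversion has to be done with AGAT instead. It is a toolkit written in Perl, and its log output is nicely formatted, which I like.
In an earlier post I also covered using AGAT to extract transcripts and other sequences from very large genomes.
Install AGAT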
conda install -c bioconda agat
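After installing, it can be worth confirming that the scripts used in this post are actually on `PATH` in the active environment. A minimal sketch, where `missing_tools` is a hypothetical helper name and `command -v` is the standard shell lookup:

```shell
# Hypothetical helper: print every given command that is not found on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    _rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            _rc=1
        fi
    done
    return "$_rc"
}
```

For example, `missing_tools agat_convert_sp_gff2gtf.pl agat_convert_sp_gxf2gxf.pl agat_sp_statistics.pl` prints nothing when all three scripts are available.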
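Convert with one of AGAT's scripts
Along the way, AGAT also repairs errors in the original file (missing UTR annotations, coordinate errors, and so on)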
agat_convert_sp_gff2gtf.pl --gff Ldavi.gff3 -o Ldavi.gtf
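Since the whole point of switching tools was the broken coordinates, a quick sanity check on the converted file is cheap. The sketch below counts records whose start is below 1 or whose start is greater than its end. `count_bad_coords` is a hypothetical helper, and it assumes a standard tab-separated GTF/GFF with start and end in columns 4 and 5.

```shell
# Hypothetical helper: count records with start < 1 or start > end
# in a tab-separated GTF/GFF file (columns 4 and 5).
count_bad_coords() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }                          # skip comments and malformed lines
        ($4 + 0) < 1 || ($4 + 0) > ($5 + 0) { bad++ }
        END { print bad + 0 }
    ' "$1"
}
```

`count_bad_coords Ldavi.gtf` should print 0 for a clean conversion. Run against the gffread output shown earlier, every record would be counted.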
There is also a generic conversion script, which you can use to go from GTF to GFF3
# Generic GXF conversion; the output is GFF3 by default
agat_convert_sp_gxf2gxf.pl --gff lrv2.gff -o Lregale.agat.gff3
Once the conversion is done, compute summary statistics for the GTF/GFF file
agat_sp_statistics.pl --gff Ldavi.gtf -o Ldavi.gtf.stats.txt
# more Ldavi.gtf.stats.txt
Here is what the statistics report looks like
(base) tl5024@iyun50:~/Ldavidii$ cat Ldavi.gtf.stats.txt
--------------------------------------------------------------------------------
------------------------------------- mrna -------------------------------------
Number of gene 87501
Number of mrna 87501
Number of mrnas with utr both sides 16887
Number of mrnas with at least one utr 16966
Number of cds 87501
Number of exon 347059
Number of five_prime_utr 16900
Number of three_prime_utr 16953
Number of exon in cds 347059
Number of exon in five_prime_utr 16900
Number of exon in three_prime_utr 16953
Number of intron in cds 259558
Number of intron in exon 259558
Number gene overlapping 0
Number of single exon gene 16329
Number of single exon mrna 16329
mean mrnas per gene 1.0
mean cdss per mrna 1.0
mean exons per mrna 4.0
mean five_prime_utrs per mrna 0.2
mean three_prime_utrs per mrna 0.2
mean exons per cds 4.0
mean exons per five_prime_utr 1.0
mean exons per three_prime_utr 1.0
mean introns in cdss per mrna 3.0
mean introns in exons per mrna 3.0
Total gene length (bp) 5238003904
Total mrna length (bp) 5238003904
Total cds length (bp) 74172901
Total exon length (bp) 271493346
Total five_prime_utr length (bp) 99412128
Total three_prime_utr length (bp) 97908317
Total intron length per cds (bp) 4966510558
Total intron length per exon (bp) 4966510558
mean gene length (bp) 59862
mean mrna length (bp) 59862
mean cds length (bp) 848
mean exon length (bp) 782
mean five_prime_utr length (bp) 5882
mean three_prime_utr length (bp) 5775
mean cds piece length (bp) 214
mean five_prime_utr piece length (bp) 5882
mean three_prime_utr piece length (bp) 5775
mean intron in cds length (bp) 19134
mean intron in exon length (bp) 19134
median gene length (bp) 24383
median mrna length (bp) 24383
median cds length (bp) 137
median exon length (bp) 147
median five_prime_utr length (bp) 189
median three_prime_utr length (bp) 297
median cds piece length (bp) 137
median five_prime_utr piece length (bp) 189
median three_prime_utr piece length (bp) 297
median intron in cds length (bp) 3588
median intron in exon length (bp) 3588
90 percentile gene length (bp) 175713
90 percentile mrna length (bp) 175713
90 percentile cds length (bp) 436
90 percentile exon length (bp) 531
90 percentile five_prime_utr length (bp) 11595
90 percentile three_prime_utr length (bp) 8514
90 percentile cds piece length (bp) 436
90 percentile five_prime_utr piece length (bp) 11595
90 percentile three_prime_utr piece length (bp) 8514
90 percentile intron in cds length (bp) 36128
90 percentile intron in exon length (bp) 36128
Longest gene (bp) 2164927
Longest mrna (bp) 2164927
Longest cds (bp) 15192
Longest exon (bp) 766758
Longest five_prime_utr (bp) 585540
Longest three_prime_utr (bp) 766673
Longest cds piece (bp) 6681
Longest five_prime_utr piece (bp) 585540
Longest three_prime_utr piece (bp) 766673
Longest intron into cds part (bp) 1351418
Longest intron into exon part (bp) 1351418
Shortest gene (bp) 150
Shortest mrna (bp) 150
Shortest cds piece (bp) 1
Shortest five_prime_utr piece (bp) 1
Shortest three_prime_utr piece (bp) 1
Shortest intron into cds part (bp) 20
Shortest intron into exon part (bp) 20
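To reuse individual numbers from this report in a script, a label can be looked up by exact match. This is a sketch that assumes each line has the form shown above, a label followed by whitespace and a single value. `stat_value` is a hypothetical helper and not part of AGAT.

```shell
# Hypothetical helper: print the value for an exact label in an
# agat_sp_statistics.pl report. Usage: stat_value FILE "LABEL"
stat_value() {
    awk -v label="$2" '{
        val = $NF
        line = $0
        sub(/^[ \t]+/, "", line)                  # trim leading whitespace
        sub(/[ \t]+[^ \t]+[ \t]*$/, "", line)     # drop the last field (the value)
        if (line == label) { print val; exit }
    }' "$1"
}
```

For the report above, `stat_value Ldavi.gtf.stats.txt "Number of gene"` would print 87501. Matching the whole label keeps "Number of exon" from being confused with "Number of exon in cds".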