Processing genome GFF3/GTF files with AGAT

I have recently been analyzing a large plant genome: 36 Gb in total, with the longest chromosome at 4.3 Gb.

At this size, converting the GFF3 file to GTF with gffread hits a serious bug: the output contains negative coordinates.

# gffread fails on chromosomes longer than ~2048 Mb (coordinates beyond 2^31 - 1 bp appear to overflow a signed 32-bit integer)
gffread Ldavi.gff3 -T -o Ldavi.gtf

The result is obviously wrong, because coordinates can never be negative.

(RNAseq) tl5024@iyun50:~/Ldavidii$ more Ldavi.gtf
LG01 IGA transcript -1534463781 -1534304990 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770"
LG01 IGA exon -1534463781 -1534463737 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534422824 -1534422708 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534419675 -1534419604 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534405501 -1534405423 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534405312 -1534405242 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534398737 -1534398636 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534398546 -1534398457 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534361241 -1534361167 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534361078 -1534361046 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534360930 -1534360850 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534345045 -1534344993 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534315508 -1534315402 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
LG01 IGA exon -1534305039 -1534304990 . + . transcript_id "Lily01G64770.1"; gene_id "Lily01G64770";
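Negative values like these are what you would expect if a coordinate larger than 2^31 - 1 (2,147,483,647) were stored in a signed 32-bit integer and wrapped around. That is my assumption about the cause, not something confirmed from the gffread source. Under that assumption, adding 2^32 to a wrapped value gives back the presumable original coordinate. The sketch below uses two hypothetical helper functions (the names are mine): one counts records with negative coordinates in a tab-separated GTF/GFF file, the other undoes a single wraparound.

```shell
#!/bin/sh
# Hypothetical helpers. Assumes a tab-separated GTF/GFF file with the
# start coordinate in column 4 and the end coordinate in column 5.

# Count feature lines whose start or end coordinate is negative.
count_negative_coords() {
    awk -F'\t' '!/^#/ && NF >= 5 && ($4 < 0 || $5 < 0) { n++ } END { print n + 0 }' "$1"
}

# Undo a signed 32-bit wraparound.
# ASSUMPTION: the value wrapped exactly once, so adding 2^32 restores it.
unwrap_int32() {
    awk -v v="$1" 'BEGIN { if (v < 0) v += 4294967296; printf "%.0f\n", v }'
}
```

For example, `unwrap_int32 -1534463781` prints 2760503515, which falls inside a 4.3 Gb chromosome, and `count_negative_coords Ldavi.gtf` reports how many records are affected.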

So I had to use AGAT instead. It is written in Perl, and its log output is nicely formatted, which I really like.

In an earlier post I also wrote about using AGAT to extract transcripts and other sequences from very large genomes.

Installing AGAT

conda install -c bioconda agat

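Converting with an AGAT script
During conversion AGAT also repairs problems in the original file (missing UTR annotations, wrong coordinates, and so on).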

agat_convert_sp_gff2gtf.pl --gff Ldavi.gff3 -o Ldavi.gtf
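A quick way to check that large coordinates survived the conversion is to compare the largest end coordinate in the input and the output. The helper below is a minimal sketch with a hypothetical name; it assumes tab-separated files with the end coordinate in column 5, and prints with `%.0f` so that values above 2^31 are not truncated by older awk implementations.

```shell
#!/bin/sh
# Hypothetical helper: print the largest end coordinate (column 5)
# found in a tab-separated GFF3/GTF file, skipping comment lines.
max_end_coord() {
    awk -F'\t' '!/^#/ && NF >= 5 && $5 + 0 > max { max = $5 + 0 }
                END { printf "%.0f\n", max }' "$1"
}

# Example usage (file names taken from the commands above):
#   max_end_coord Ldavi.gff3
#   max_end_coord Ldavi.gtf
```

The two numbers should normally agree, and for this genome both should be well above 2,147,483,647. A mismatch does not necessarily mean a bug, since AGAT may add or adjust features, but it is worth a look.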

There is also a general-purpose conversion script, which you can use to go from GTF to GFF3.

# Generic GXF conversion/standardization; by default AGAT writes GFF3 output
agat_convert_sp_gxf2gxf.pl --gff lrv2.gff -o Lregale.agat.gff3

Once the conversion is done, compute summary statistics for the GTF/GFF file.

agat_sp_statistics.pl --gff Ldavi.gtf -o Ldavi.gtf.stats.txt
# more Ldavi.gtf.stats.txt

The statistics report looks like this:

(base) tl5024@iyun50:~/Ldavidii$ cat Ldavi.gtf.stats.txt
--------------------------------------------------------------------------------

------------------------------------- mrna -------------------------------------
Number of gene                                              87501
Number of mrna                                              87501
Number of mrnas with utr both sides                         16887
Number of mrnas with at least one utr                       16966
Number of cds                                               87501
Number of exon                                              347059
Number of five_prime_utr                                    16900
Number of three_prime_utr                                   16953
Number of exon in cds                                       347059
Number of exon in five_prime_utr                            16900
Number of exon in three_prime_utr                           16953
Number of intron in cds                                     259558
Number of intron in exon                                    259558
Number gene overlapping                                     0
Number of single exon gene                                  16329
Number of single exon mrna                                  16329
mean mrnas per gene                                         1.0
mean cdss per mrna                                          1.0
mean exons per mrna                                         4.0
mean five_prime_utrs per mrna                               0.2
mean three_prime_utrs per mrna                              0.2
mean exons per cds                                          4.0
mean exons per five_prime_utr                               1.0
mean exons per three_prime_utr                              1.0
mean introns in cdss per mrna                               3.0
mean introns in exons per mrna                              3.0
Total gene length (bp)                                      5238003904
Total mrna length (bp)                                      5238003904
Total cds length (bp)                                       74172901
Total exon length (bp)                                      271493346
Total five_prime_utr length (bp)                            99412128
Total three_prime_utr length (bp)                           97908317
Total intron length per cds (bp)                            4966510558
Total intron length per exon (bp)                           4966510558
mean gene length (bp)                                       59862
mean mrna length (bp)                                       59862
mean cds length (bp)                                        848
mean exon length (bp)                                       782
mean five_prime_utr length (bp)                             5882
mean three_prime_utr length (bp)                            5775
mean cds piece length (bp)                                  214
mean five_prime_utr piece length (bp)                       5882
mean three_prime_utr piece length (bp)                      5775
mean intron in cds length (bp)                              19134
mean intron in exon length (bp)                             19134
median gene length (bp)                                     24383
median mrna length (bp)                                     24383
median cds length (bp)                                      137
median exon length (bp)                                     147
median five_prime_utr length (bp)                           189
median three_prime_utr length (bp)                          297
median cds piece length (bp)                                137
median five_prime_utr piece length (bp)                     189
median three_prime_utr piece length (bp)                    297
median intron in cds length (bp)                            3588
median intron in exon length (bp)                           3588
90 percentile gene length (bp)                              175713
90 percentile mrna length (bp)                              175713
90 percentile cds length (bp)                               436
90 percentile exon length (bp)                              531
90 percentile five_prime_utr length (bp)                    11595
90 percentile three_prime_utr length (bp)                   8514
90 percentile cds piece length (bp)                         436
90 percentile five_prime_utr piece length (bp)              11595
90 percentile three_prime_utr piece length (bp)             8514
90 percentile intron in cds length (bp)                     36128
90 percentile intron in exon length (bp)                    36128
Longest gene (bp)                                           2164927
Longest mrna (bp)                                           2164927
Longest cds (bp)                                            15192
Longest exon (bp)                                           766758
Longest five_prime_utr (bp)                                 585540
Longest three_prime_utr (bp)                                766673
Longest cds piece (bp)                                      6681
Longest five_prime_utr piece (bp)                           585540
Longest three_prime_utr piece (bp)                          766673
Longest intron into cds part (bp)                           1351418
Longest intron into exon part (bp)                          1351418
Shortest gene (bp)                                          150
Shortest mrna (bp)                                          150
Shortest cds piece (bp)                                     1
Shortest five_prime_utr piece (bp)                          1
Shortest three_prime_utr piece (bp)                         1
Shortest intron into cds part (bp)                          20
Shortest intron into exon part (bp)                         20
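Because the report is plain text with one "label, spaces, value" row per statistic, single values are easy to pull out for downstream scripts. The function below is a minimal sketch with a hypothetical name; it assumes the row layout shown above and matches the label exactly.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one labelled row in an AGAT
# statistics report. Assumes each row is "<label><spaces><value>".
stat_value() {
    awk -v label="$2" '
        {
            value = $NF                            # last whitespace-separated field
            line = $0
            sub(/[ \t]+[^ \t]+[ \t]*$/, "", line)  # drop the value, keep the label
            if (line == label) { print value; exit }
        }' "$1"
}

# Example usage: recompute the mean gene length from the totals.
#   total=$(stat_value Ldavi.gtf.stats.txt "Total gene length (bp)")
#   genes=$(stat_value Ldavi.gtf.stats.txt "Number of gene")
#   echo $((total / genes))
```

With the numbers in this report, 5238003904 / 87501 comes to about 59862, which agrees with the reported mean gene length.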