基因注释：基于SNAP+Augustus+GeneMark的maker3 pipeline

我使用的maker版本为3.01.04

第一轮：将已知基因比对到基因组

包括两个部分：
🔸屏蔽重复序列
🔸将已知的转录组/蛋白序列与基因组进行比对

1.（可选）构建自定义重复序列数据库

安装RepeatModeler
RepeatModeler Download Page (repeatmasker.org)
RepeatModeler的安装（包含RepeatMasker安装）_nnnnnnny-的博客-CSDN博客_repeatmodeler安装

/path/RepeatModeler/BuildDatabase -name pyu pyu_contig.fasta
RepeatModeler -pa 4 -database pyu -LTRStruct >& repeatmodeler.log

运行结束后获得pyu-families.fa，将其提供给maker_opts.ctl文件的“rmlib= ”选项

2.创建maker控制文件

maker -CTL
#将创建三个控制文件：maker_boopts.ctl、maker_exe.ctl、maker_opts.ctl

3.修改控制文件maker_opts.ctl

vi maker_opts.ctl

genome=pyu_contig.fasta

est=unigene.fasta #从头组装的转录组序列
protein=protein.fasta #uniprot中下载的同源蛋白序列

rmlib=pyu-families.fa
softmask=1 #软屏蔽，将重复序列转为小写而不是N，因此基因内的短重复序列仍然可以作为基因的一部分进行注释

est2genome=1 #使用转录组证据
protein2genome=1 #使用同源蛋白证据

TMP=/workdir/tmp  #对于大型基因组来说很重要

🔹如果单独运行RepeatMasker，如https://gist.github.com/darencard/bb1001ac1532dd4225b030cf0cd61ce2
中所述，“rmlib”选项应不填，“rm_gff”选项应该填写重复序列的gff文件

maker -base pyu_rnd1 maker_opts.ctl maker_bopts.ctl maker_exe.ctl

第二轮——使用SNAP进行基因预测

1.训练SNAP基因模型

首先使用上一轮产生的比对结果进行训练

mkdir SNAP1
cd SNAP1
gff3_merge -d ../pyu_rnd1.maker.output/pyu_rnd1_master_datastore_index.log
maker2zff -l 50 -x 0.5 pyu_rnd1.all.gff

🔹maker2zff生成一个ZFF格式文件（genome.ann）和一个FASTA格式文件（genome.dna），过滤用于再次训练的高置信度基因，共有7个选项：
-c 由EST/mRNA-Seq比对确定的剪接位点的比例，默认0.5
-e 与EST/mRNA-Seq比对重叠的外显子的比例，默认0.5
-o 和任何证据（EST或者蛋白）重叠的外显子的比例，默认0.5
-a 从头预测证实的剪接位点的比例，默认0
-t 和从头预测结果重叠的外显子的比例，默认0
-l mRNA翻译的蛋白质序列的最短长度
-x 最大AED值，默认0.5
-n 不过滤
🔸AED值：maker2使用注释编辑距离（AED）来评估基因组注释的准确性，AED是一个介于 0 和 1 之间的数字，衡量注释与支持它的evidence的拟合优度，0 表示与可用证据完全一致，1 表示缺乏对注释基因模型的支持

fathom -categorize 1000 genome.ann genome.dna #过滤
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl pyu . > ../pyu1.hmm
mv pyu_rnd1.all.gff ../
cd ..

2.使用SNAP预测基因

为第一轮的maker_opts.ctl 保存副本

cp maker_opts.ctl maker_opts.ctl_backup_rnd1

编辑第二轮的maker_opts.ctl

vi maker_opts.ctl 

maker_gff= pyu_rnd1.all.gff
est_pass=1 # 使用第一轮的EST比对结果
protein_pass=1 #使用第一轮的protein比对结果
rm_pass=1 # 使用gff文件中的repeats
snaphmm=pyu1.hmm
est= # 删除est文件，这一步不需要再跑EST比对了
protein= # 同上
model_org= #同上
rmlib= # 同上
repeat_protein= #同上
est2genome=0 # 不需要再构建基于EST证据的基因模型
protein2genome=0 #同上
pred_stats=1 #report AED stats
alt_splice=0 # 0: keep one isoform per gene; 1: identify splicing variants of the same gene
keep_preds=1 # keep genes even without evidence support, set to 0 if no

运行maker

maker -base pyu_rnd2 maker_opts.ctl maker_bopts.ctl maker_exe.ctl

第三轮——重新训练SNAP模型并进行另一轮SNAP基因预测

SNAP一共需要运行2~3轮
1.首先训练一个新的SNAP模型

mkdir SNAP2
cd SNAP2
gff3_merge -d ../pyu_rnd2.maker.output/pyu_rnd2_master_datastore_index.log
maker2zff -l 50 -x 0.5 pyu_rnd2.all.gff

fathom -categorize 1000 genome.ann genome.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl pyu . > ../pyu2.hmm
mv pyu_rnd2.all.gff ..
cd ..

2.使用SNAP预测基因
为第二轮的maker_opts.ctl 保存副本

cp maker_opts.ctl maker_opts.ctl_backup_rnd2

编辑第三轮的maker_opts.ctl

vi maker_opts.ctl 

maker_gff=pyu_rnd2.all.gff
snaphmm=pyu2.hmm

运行maker

maker -base pyu_rnd3 maker_opts.ctl maker_bopts.ctl maker_exe.ctl

第四轮——训练AUGUSTUS模型

1.格式转换

mkdir augustus1
cd augustus1
gff3_merge -d ../pyu_rnd1.maker.output/pyu_rnd1_master_datastore_index.log

过滤gff文件，只保留maker注释

awk '{if ($2=="maker") print }' pyu_rnd1.all.gff > maker_rnd1.gff

将maker_rnd1.gff和pyu_contig.fasta转为Genbank格式的文件pyu.gb
保留每个基因上下游2000bp的序列用于训练模型

gff2gbSmallDNA.pl maker_rnd1.gff pyu_contig.fasta 2000 pyu.gb

检查训练集中的基因数目

grep -c LOCUS pyu.gb

2.开始训练

首先创建一个Augustus新物种

new_species.pl --species=pyu

初始训练

etraining --species=pyu pyu.gb

检查初始模型是否存在

ls -ort $AUGUSTUS_CONFIG_PATH/species/pyu

创建一个更小的测试集pyu.gb.evaluation，便于在优化前后进行评估

randomSplit.pl pyu.gb 200
mv pyu.gb.test pyu.gb.evaluation

预测测试集中的基因，并检查结果

augustus --species=pyu pyu.gb.evaluation >& first_evaluate.out
grep -A 22 Evaluation first_evaluate.out

示例：

*******      Evaluation of gene prediction     *******

---------------------------------------------\
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.873 |       0.626 |
---------------------------------------------/

----------------------------------------------------------------------------------------------------------\
           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
           | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
----------------------------------------------------------------------------------------------------------|
           |        |        |      |                253 |                101 |             |             |
exon level |    484 |    332 |  231 | ------------------ | ------------------ |       0.696 |       0.477 |
           |    484 |    332 |      |   35 |    0 |  218 |   36 |    0 |   65 |             |             |
----------------------------------------------------------------------------------------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level |   156 |   100 |   47 |  109 |   53 |        0.47 |       0.301 |
----------------------------------------------------------------------------/

🔸100个基因中有47个被准确预测
🔸69.6%的外显子被准确预测
🔸47.7%预测的外显子在测试集中确切存在

优化模型，该步骤极其耗时

randomSplit.pl pyu.gb 1000
optimize_augustus.pl --species=pyu --kfold=24 --cpus=24 --rounds=3 --onlytrain=pyu.gb.train pyu.gb.test

优化后再次训练

etraining --species=pyu pyu.gb

使用优化后的模型评估测试集，并检查结果

augustus --species=pyu pyu.gb.evaluation >& second_evaluate.out
grep -A 22 Evaluation second_evaluate.out

在这些步骤之后，物种模型位于目录 augustus_config/species/pyu 中

3.使用新的 augustus 模型运行 maker

vi maker_opts.ctl

maker_gff= pyu_rnd1.all.gff
est_pass=1 # use est alignment from round 1
protein_pass=1 #use protein alignment from round 1
rm_pass=1 # use repeats in the gff file
augustus_species=pyu # augustus species model you just built
est= # remove est file, do not run EST blast again
protein= # remove protein file, do not run blast again
model_org= #remove repeat mask model, so not running RM again
rmlib= # not running repeat masking again
repeat_protein= #not running repeat masking again
est2genome=0 # do not do EST evidence based gene model
protein2genome=0 # do not do protein based gene model.
pred_stats=1 #report AED stats
alt_splice=0 # 0: keep one isoform per gene; 1: identify splicing variants of the same gene
keep_preds=1 # keep genes even without evidence support, set to 0 if no

maker -base pyu_rnd4 maker_opts.ctl maker_bopts.ctl maker_exe.ctl

第五轮——训练GeneMark

GeneMark训练只需要基因组组装文件

gmes_petap.pl -ES -fungus -cores 10 -sequence pyu_contig.fasta

将生成的gmhmm.mod添加到maker_opts.ctl文件中

vi maker_opts.ctl

gmhmm=gmhmm.mod

最后一次运行maker

maker -base pyu_rnd5 maker_opts.ctl maker_bopts.ctl maker_exe.ctl

六、最后的整合

gff3_merge -n -d ../pyu_rnd5.maker.output/pyu_rnd5_master_datastore_index.log
fasta_merge -d pyu_rnd5.maker.output/pyu_rnd5_master_datastore_index.log

获得一个不包含基因组序列的gff3 文件：pyu_rnd5.all.gff，以及一系列蛋白质和转录组fasta 文件。

maker最终结果

tips：要使基因名称更短，可使用以下命令：
maker_map_ids --prefix pyu_ --justify 8 --iterate 1 pyu_rnd5.all.gff > id_map
map_gff_ids id_map pyu_rnd5.all.gff
map_fasta_ids id_map pyu_rnd5.all.maker.proteins.fasta
map_fasta_ids id_map pyu_rnd5.all.maker.transcripts.fasta
参考文章：https://biohpc.cornell.edu/doc/annotation_2019_exercises1_v2.html