Software Tutorial | BITACORA: a comprehensive tool for identifying and annotating gene families in genome assemblies

Dependencies:

Perl # https://learn.perl.org/installing/

BLAST

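Method 1: sudo apt install ncbi-blast+

Method 2: download the prebuilt binaries from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/, unpack them, and add the bin directory to PATH (the export line assumes the archive was unpacked in $HOME):

tar -zxvf ncbi-blast-2.16.0+-x64-linux.tar.gz

echo 'export PATH="$HOME/ncbi-blast-2.16.0+/bin:$PATH"' >> ~/.bashrc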

HMMER # http://hmmer.org/

Install: sudo apt install hmmer

GeMoMa

# The BITACORA archive already bundles GeMoMa v1.7.1, which can be used directly.
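Before moving on, it can help to confirm that the dependencies are actually visible on PATH. The following is a minimal sketch, not part of BITACORA itself: missing_tools is a hypothetical helper, and the command names checked (perl, blastp, tblastn, hmmsearch, java) are the ones this tutorial's tools normally install; java is included on the assumption that the GeMoMa jar is run with it.

```shell
# Print every command from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands this tutorial relies on; java is assumed to be needed for the GeMoMa jar.
missing=$(missing_tools perl blastp tblastn hmmsearch java)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All required tools were found."
fi
```

If BLAST was installed with Method 2, open a new shell or run source ~/.bashrc first so that the updated PATH takes effect.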

Installation and testing

1. Unpack: unzip bitacora-master.zip

Grant execute permission: chmod 755 runBITACORA*

2. Unpack GeMoMa

unzip GeMoMa-1.7.1.zip && rm -rf GeMoMa-1.7.1.zip __MACOSX/

Add it to the environment (replace /home/bitacora with the directory where you unpacked it): echo 'export GEMOMA_PATH=/home/bitacora/GeMoMa-1.7.1/GeMoMa-1.7.1.jar' >> ~/.bashrc
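A quick way to confirm that the exported variable points at a real jar file is sketched below. check_jar is a hypothetical helper written for this tutorial; it only tests that the path is a readable regular file and says nothing about whether the jar itself is intact.

```shell
# Report whether a path points to a readable regular file.
check_jar() {
    if [ -f "$1" ] && [ -r "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# GEMOMA_PATH is the variable exported above; an unset or empty value is reported as missing.
check_jar "${GEMOMA_PATH:-}" || echo "Reload the shell (source ~/.bashrc) or correct the path."
```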

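3. The Example folder contains the test data; unpack it and run it through runBITACORA.sh.

Edit the input-file and program paths in runBITACORA.sh so that they are full (absolute) paths on your own system.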
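To see which settings need editing, one option is to list the variable assignments at the top level of the script. This is only a sketch: list_assignments is a hypothetical helper, and it assumes the paths are set as plain NAME=value lines, so check the output against the script itself.

```shell
# List top-level NAME=value assignments in a shell script, with line numbers.
list_assignments() {
    grep -nE '^[A-Za-z_][A-Za-z0-9_]*=' "$1"
}

# Assumes runBITACORA.sh is in the current directory.
if [ -f runBITACORA.sh ]; then
    list_assignments runBITACORA.sh
fi
```

Each listed line that holds a path can then be edited in place to point at your own files.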

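4. Run: bash runBITACORA.sh

5. Interpreting the results:

Each of the two test gene families gets its own output folder; alongside these folders is a summary of the identification results for both families.

Reading the summary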

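Gene/Gene Family: the name of the gene family

Number of annotated genes identified

The number of genes already present in the GFF annotation that BITACORA identifies as members of the family

Number of putative not annotated genes

The number of genes detected in the genome sequence but absent from the GFF annotation, which BITACORA considers putative new family members (new predictions)

Total number of identified genes (Annotated + Genomic)

The sum of the annotated genes and the newly predicted putative genes

Total number of identified genes clustering identical sequences

The number of genes that remain after sequences that are 100% identical are clustered and counted once

Total number of identified genes clustering highly identical sequences and proteins shorter than 30 aa

The number of representative sequences kept after clustering, excluding very short proteins (<30 aa) and highly redundant sequences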
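The relationship between the first three counts is simple addition, which the sketch below spells out. The numbers are hypothetical and chosen only for illustration; they are not taken from the test data. The two clustered totals can only be equal to or smaller than this sum, since clustering merges or discards entries.

```shell
# Hypothetical counts for one family, used only to show how the columns relate.
annotated=10      # Number of annotated genes identified
putative_new=3    # Number of putative not annotated genes

# Total number of identified genes (Annotated + Genomic)
total=$((annotated + putative_new))
echo "Total identified (Annotated + Genomic): $total"
```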

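Reading the output for a single gene family:

CD36-SNMP_genomic_and_annotated_genes.gff3

GFF3 annotation of every identified CD36-SNMP member, covering both existing annotations and new predictions

CD36-SNMP_genomic_and_annotated_genes_nr.gff3

The same GFF3 after fully redundant entries are removed (nr = non-redundant)

CD36-SNMP_genomic_and_annotated_proteins_trimmed.fasta

All identified protein sequences (after trimming)

CD36-SNMP_genomic_and_annotated_proteins_trimmed_nr.fasta

The protein sequences with redundant entries removed (non-redundant)

CD36-SNMP_genomic_and_annotated_proteins_trimmed_idseqsclustered.fasta

Representative protein sequences kept after clustering by sequence identity (the recommended final set)

CD36-SNMP_genomic_and_annotated_proteins_trimmed_idseqsclustered.gff3

The recommended GFF3 annotation, with redundant and short sequences excluded; best suited to plotting or functional analysis

CD36-SNMP_genomic_and_annotated_proteins_trimmed_idseqsclustered.gff3_overlapping_genes.txt

Lists genes in the GFF whose coordinates overlap (e.g. isoforms); for reference only

CD36-SNMP_genomic_and_annotated_proteins_trimmed_idseqsclustered_table.txt

Detailed information on each identified member (gene position, length, protein ID, classification, etc.), for analysis in Excel or R/Python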
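One way to see how much each filtering stage removes is to count the sequences in the three protein FASTA files. The sketch below assumes it is run inside the CD36-SNMP output folder; count_seqs is a hypothetical helper that simply counts header lines starting with ">".

```shell
# Count sequences in a FASTA file by counting its header lines.
count_seqs() {
    grep -c '^>' "$1"
}

# Compare the trimmed, non-redundant, and clustered protein sets (skips files that are absent).
for f in CD36-SNMP_genomic_and_annotated_proteins_trimmed.fasta \
         CD36-SNMP_genomic_and_annotated_proteins_trimmed_nr.fasta \
         CD36-SNMP_genomic_and_annotated_proteins_trimmed_idseqsclustered.fasta; do
    if [ -f "$f" ]; then
        echo "$f: $(count_seqs "$f")"
    fi
done
```

The counts should not increase from one file to the next, and they can be compared against the totals in the summary.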

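Intermediate data

CD36-SNMPtblastn_parsed_list_genomic_positions.bed

All hits extracted from the tblastn search (including those inside GFF-annotated regions)

CD36-SNMPtblastn_parsed_list_genomic_positions_nogff_filtered.bed

The tblastn hits that remain after already-annotated genes are filtered out; these mark regions that may contain new genes

Intermediate_files/: temporary files produced by intermediate steps; they usually do not need to be used directly but are helpful for debugging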
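To get a feel for the candidate regions, the lengths of the intervals in the filtered BED file can be printed with awk. This sketch assumes the first three columns are sequence name, start, and end in the standard BED convention (0-based start, end-exclusive), so length is end minus start; if the coordinates turn out to be 1-based and inclusive, add 1 to each length. bed_lengths is a hypothetical helper.

```shell
# Print sequence name, start, end, and interval length for each BED record.
bed_lengths() {
    awk 'NF >= 3 { printf "%s\t%d\t%d\t%d\n", $1, $2, $3, $3 - $2 }' "$1"
}

# Assumes the command is run inside the CD36-SNMP output folder.
bed=CD36-SNMPtblastn_parsed_list_genomic_positions_nogff_filtered.bed
if [ -f "$bed" ]; then
    bed_lengths "$bed"
fi
```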