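I. OrthoFinder and Plotting Orthogroup Clustering Statistics
In comparative genomics, identifying orthologous genes (orthologs) is the foundation for inferring species relationships and the functional divergence of genes. Older tools such as InParanoid and OrthoMCL, however, tend to be slow on large datasets, less accurate, and cumbersome to run. OrthoFinder, the subject of this tutorial, aims to address these shortcomings.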
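II. Installing OrthoFinder Step by Step
1. Requirements
Operating system: Linux or macOS; on Windows, run it through WSL or Docker (Linux recommended)
Dependencies: Python 3.7+, Diamond/Blastp, FastME, MCL
2. Choose one of two installation methods
Option 1: quick install with Conda (recommended)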
conda create -n orthofinder python=3.9
conda activate orthofinder
conda install -c bioconda orthofinder
Option 2: install from source on GitHub
git clone https://github.com/davidemms/OrthoFinder.git
cd OrthoFinder
python3 orthofinder.py -h # verify the installation
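III. Hands-on Tutorial: Evolutionary Analysis of 15 Insect Genomes
1. Preparing the input files
Put the protein sequence files (.faa) of all species, one file per species, in a single directory, for example: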
input_proteins/
├── Drosophila_melanogaster.faa
├── Bombyx_mori.faa
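└── ... (15 files in total)
File naming: avoid spaces and special characters, e.g. replace spaces with underscores. OrthoFinder uses the file names as the species names in its output.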
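The naming advice above can be automated. The following is a minimal sketch, not part of OrthoFinder: the helper names sanitize_name and sanitize_dir are hypothetical, and the choice to keep only letters, digits, dots, underscores, and hyphens is an assumption about what counts as a "safe" name.

```shell
# Hypothetical helper: map an arbitrary file name to a "safe" one.
sanitize_name() {
    # Replace each run of characters outside [A-Za-z0-9._-] with one underscore.
    printf '%s' "$1" | sed -E 's/[^A-Za-z0-9._-]+/_/g'
}

# Hypothetical helper: rename every .faa file in a directory in place.
# The directory defaults to input_proteins.
sanitize_dir() {
    dir="${1:-input_proteins}"
    for path in "$dir"/*.faa; do
        [ -e "$path" ] || continue            # glob matched nothing
        base=$(basename "$path")
        safe=$(sanitize_name "$base")
        # mv -n refuses to overwrite an existing file with the same safe name.
        [ "$base" = "$safe" ] || mv -n -- "$path" "$dir/$safe"
    done
}
```

Calling sanitize_dir input_proteins before running OrthoFinder renames the files in that directory; it is worth checking the result by eye, since two different names can collapse to the same safe name.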
2. Running the command
orthofinder -f input_proteins/ -t 32 -a 50 -S diamond
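-t 32: run the sequence searches on 32 parallel threads
-a 50: run 50 parallel threads for OrthoFinder's own analysis steps; these steps are memory-hungry, so lower this value if RAM runs short
-S diamond: use Diamond rather than BLAST for the sequence search, which is much faster (Diamond is already the default in OrthoFinder 2.x)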

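IV. Interpreting the Results
When the run finishes, the results are written to OrthoFinder/Results_<date>/ inside the input directory (here input_proteins/). It contains:
1. Core result files
Orthogroups/Orthogroups.tsv
The list of all orthogroups. The file is tab-separated: a header row of species names, then one row per orthogroup, with one column per species and the genes within a cell separated by commas:
OG0000000    Gene1, Gene2    Gene3    ...
Species_Tree/SpeciesTree_rooted.txt
The rooted species tree inferred from the gene families (Newick format); it can be opened directly in FigTree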
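As a quick sanity check on Orthogroups.tsv, the per-species gene counts can be recomputed from the gene lists. This is a minimal sketch under stated assumptions: the file is tab-separated with a header row and one column per species, genes within a cell are comma-separated, and the function name count_genes_per_species is hypothetical.

```shell
# Hypothetical helper: print each orthogroup ID followed by the number of
# genes every species contributes to it.
count_genes_per_species() {
    awk -F'\t' '
        NR == 1 { print; next }              # keep the header row unchanged
        {
            line = $1
            for (i = 2; i <= NF; i++) {
                # split() returns the number of comma-separated items;
                # an empty cell means the species has no gene in this group.
                n = ($i == "") ? 0 : split($i, genes, ",")
                line = line "\t" n
            }
            print line
        }
    ' "$1"
}
```

For example, count_genes_per_species Orthogroups.tsv | head prints the first few rows of the count matrix, which can be compared against the gene count table that OrthoFinder writes itself.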
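2. Advanced analysis reports
Comparative_Genomics_Statistics/Statistics_Overall.tsv
Overall statistics for the run, including:
the number of genes assigned to orthogroups and the number left unassigned
the number of species-specific orthogroups and the genes they contain
Note that this file does not report gene family expansion, contraction, or losses on ancestral nodes; a tool such as CAFE5 is needed for that.
Orthogroups/Orthogroups.GeneCount.tsv
The orthogroup size matrix (gene count per species per orthogroup). After light reformatting it can serve as input to CAFE5 for gene family expansion analysis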
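The gene count matrix usually needs a small amount of reshaping before CAFE5 will read it. The sketch below rests on two assumptions that should be checked against your own files and the CAFE5 documentation: that Orthogroups.GeneCount.tsv starts with an Orthogroup column and ends with a Total column, and that CAFE5 expects a tab-separated table whose first two columns are named Desc and Family ID. The function name genecount_to_cafe is hypothetical.

```shell
# Hypothetical helper: reshape an OrthoFinder gene count table into the
# layout assumed for CAFE5 (Desc, Family ID, then one column per species).
genecount_to_cafe() {
    awk -F'\t' '
        NR == 1 {
            # Drop the trailing Total column if the header has one.
            last = ($NF == "Total") ? NF - 1 : NF
            printf "Desc\tFamily ID"
        }
        NR > 1 { printf "(null)\t%s", $1 }   # placeholder description + orthogroup ID
        {
            for (i = 2; i <= last; i++) printf "\t%s", $i
            printf "\n"
        }
    ' "$1"
}
```

A typical call would be genecount_to_cafe Orthogroups.GeneCount.tsv > cafe_input.tsv, where the output file name is only an example. Very large or very uneven families are often filtered out before running CAFE5; that step is not shown here.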
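V. FAQ and Pitfalls
Q1: Errors about the input file format?
Make sure every .faa file is valid FASTA: each record starts with a > header line followed by at least one line of sequence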
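A rough structural check can catch the most common FASTA problems before a long run. This is a minimal sketch, not a full validator: the function name check_fasta is hypothetical, and it only verifies that the file is non-empty, that every record has a header with an ID, and that every header is followed by some sequence. It does not check the residue alphabet or duplicate IDs.

```shell
# Hypothetical helper: exit status 0 if the file looks like FASTA, 1 otherwise.
# Usage sketch:
#   for f in input_proteins/*.faa; do
#       check_fasta "$f" || echo "Invalid FASTA: $f" >&2
#   done
check_fasta() {
    awk '
        /^[ \t\r]*$/ { next }                          # ignore blank lines
        /^>/ {
            if (have_header && !have_seq) bad = 1      # previous record had no sequence
            if (length($0) == 1) bad = 1               # header without an ID
            have_header = 1; have_seq = 0; next
        }
        {
            if (!have_header) bad = 1                  # sequence before any header
            have_seq = 1
        }
        END {
            if (!have_header || !have_seq) bad = 1     # empty file or last record empty
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```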
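Q2: The species tree looks wrong?
Try switching the -M option:
-M msa: infers gene trees from multiple sequence alignments; more accurate but slower, practical for smaller species sets (roughly <100)
-M dendroblast: the default distance-based method; faster and suited to very large datasets
Q3: Not enough computing resources?
Use the -t and -a options to balance parallelism against the available cores and memory
Large analyses can also be built up incrementally: -b reuses the search results of a previous run, and -f points to a directory containing only the newly added species:
orthofinder -f proteins/ -b previous_orthofinder_results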
VI. Plotting in R: Statistics of the OrthoFinder Clustering Results
# Load the required packages
library(tidyverse)
library(ggsci)
library(ggpubr)
# Build the per-species assignment table and reshape it to long format
stats <- tribble(
~Species, ~Total_genes, ~In_orthogroups, ~Unassigned,
"A", 29581, 24594, 4987,
"S",29758,28484,1274,
"D",15909,14463,1446,
"F",12984,12455,529,
"G",15358,13843,1515,
"H",29285,28826,459,
"J",33111,30777,2334,
"K",28414,21933,6481,
"L",37240,36608,632,
"P",30187,29772,415
) %>%
pivot_longer(cols = c(In_orthogroups, Unassigned),
names_to = "Category",
values_to = "Count")
# Stacked bar chart (each bar normalized to 100%)
p1 <- ggplot(stats, aes(x = Species, y = Count, fill = Category)) +
geom_col(position = "fill") +
scale_fill_npg() +
scale_y_continuous(labels = scales::percent) +
labs(x = "Species", y = "Percentage",
title = "Gene Assignment Distribution") +
theme_classic(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top")
# Save the figure
ggsave("Gene_Assignment.pdf", p1, width = 8, height = 6, dpi = 300)
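# Gene distribution data: rows are the number of genes a species has in an orthogroup, columns are the orthogroup counts per species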
gene_dist <- tribble(
~Genes, ~A, ~S, ~D, ~F, ~G, ~H, ~J, ~K, ~L, ~P,
"0",5174,6567,6173,7440,6530,6863,9434,6018,6269,6952,
"1",7247,4180,7906,6780,7710,3530,4750,6158,3873,3509,
"2",1959,2055,1292,1280,1187,1979,712,2120,2046,1774,
"3",692,1000,363,313,319,1177,251,800,1177,1102,
"4",324,738,139,130,137,763,176,360,728,816,
"5",166,438,84,57,77,546,89,202,525,549,
"6",99,312,48,40,40,389,88,118,394,407,
"7",76,196,31,20,40,246,59,70,252,288,
"8",50,138,23,19,21,160,56,60,167,171,
"9",44,103,15,7,16,120,49,40,124,137,
"10",38,85,15,9,15,95,36,21,106,86
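) %>%
pivot_longer(-Genes, names_to = "Species", values_to = "Count") %>%
# Order the x axis numerically; as plain strings, "10" would sort before "2"
mutate(Genes = factor(Genes, levels = as.character(0:10)))
# Faceted bar chart, one panel per species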
p2 <- ggplot(gene_dist, aes(x = Genes, y = Count, fill = Species)) +
geom_col() +
scale_fill_npg() +
facet_wrap(~Species, scales = "free_y", ncol = 5) +
labs(x = "Number of Genes per Orthogroup",
y = "Orthogroup Count",
title = "Orthogroup Size Distribution") +
theme_classic(base_size = 10) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none")
# Save the figure
ggsave("Orthogroup_Distribution.tiff", p2,
width = 12, height = 8, dpi = 300, compression = "lzw")
# Species-specific orthogroup data
specific_genes <- tribble(
~Species, ~Specific_OGs, ~Genes_in_Specific,
"A", 706, 3291,
"S",494,2429,
"D",65,220,
"F",42,136,
"G",54,338,
"H",159,762,
"J",663,10252,
"K",390,1361,
"L",372,2882,
"P",168,760
)
# Bubble chart
p3 <- ggplot(specific_genes, aes(x = Species, y = Genes_in_Specific,
size = Specific_OGs, color = Species)) +
geom_point(alpha = 0.8) +
scale_color_npg() +
scale_size(range = c(3, 10)) +
labs(x = "Species", y = "Genes in Specific OGs",
title = "Species-specific Orthogroups",
size = "Number of OGs") +
theme_classic(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Save the figure
ggsave("Species_Specific.pdf", p3, width = 8, height = 6, dpi = 300)
# Combine the three panels into one figure with patchwork
library(patchwork)
combined <- p1 / (p2 + p3) + plot_annotation(tag_levels = "A")
ggsave("Combined_Results.tiff", combined,
width = 16, height = 12, dpi = 300)
