“进化树的构建怎么操作?”
“那肯定是用MEGA啊!”
可是真的好麻烦啊,要先比对再建树,然后再进行各种美化,习惯了R
就用R
解决吧。
简单Google之后发现R
有现成的包可以完成分析,包括了从序列读取、进化树构建、进化树美化等 。相关的R
包主要是:ape
、phangorn
、seqinr
、ggtree
。
本文中的测试数据来自一个教程,公众号后PLANTOMIX
台回复“进化树
”获取下载链接。
关于进化树
常用的方法包括基于距离矩阵的UPGMA
、ME(Minimum Evolution,最小进化法)
和 NJ(Neighbor-Joining,邻接法)
与非距离矩阵的MP(Maximum parsimony,最大简约法)
、ML(Maximum likelihood,最大似然法)
以及贝叶斯(Bayesian)推断
等方法。现在UPGMA
很少使用,在大多数文章里面基本是NJ
或者ML
。就速度来说,NJ
是较快的,ML
是很耗时的。
本文就简单描述如何在R
中利用DNA
序列构建进化树,本文的方法适用于DNA序列
和氨基酸序列
,蛋白序列
的方法还在探索中。
rm(list = ls())
setwd('../../20200727【R】进化树构建与美化/')
if (!requireNamespace(c('ape','phangorn','seqinr'))) {
install.packages(c('ape','phangorn','seqinr'))
}
library(ape)
library(phangorn)
library(seqinr)
library(ggtree)
library(ggplot2)
library(patchwork)
test_dna = read.dna('data/test.fasta', format = 'fasta')
test_phyDat = phyDat(test_dna, type = 'DNA', levels = NULL)
# 模型评估
mt = modelTest(test_phyDat)
print(mt)
# 计算距离
dna_dist = dist.ml(test_phyDat,model = 'JC69')
# NJ树
tree_nj = NJ(dna_dist)
write.tree()
p_nj = ggtree(tree_nj, layout="radial") +
geom_tiplab(size = 3, color = 'red')+
labs(title = 'Neighbor Joining Tree')
p_nj
# UPGMA树
tree_upgma = upgma(dna_dist)
p_UPGMA = ggtree(tree_upgma) +
geom_tiplab(size = 3, color = 'green')+
labs(title = 'UPGMA Tree')
p_UPGMA
# 最大简约树
parsimony(tree_upgma, data = test_phyDat)
parsimony(tree_nj, data = test_phyDat)
test_optim = optim.parsimony(tree_nj, data = test_phyDat)
tree_peatchet = pratchet(test_phyDat)
p_Maximum_Parsimony = ggtree(tree_peatchet,layout="radial") +
geom_tiplab(size = 3, color = 'blue')+
labs(title = 'Maximum Parsimony Tree')
p_Maximum_Parsimony
# 最大似然法
tree_fit = pml(tree_nj, data = test_phyDat)
print(tree_fit)
fitJC = optim.pml(tree_fit, model = 'JC', rearrangement = 'stochastic')
logLik(fitJC)
tree_bs = bootstrap.pml(fitJC, bs = 100, optNni = T,
multicore = F,
control = pml.control(trace = 0))
tree_ml_bootstrap = plotBS(midpoint(fitJC$tree), tree_bs, p = 50, type = 'p')
p_Maximum_Likelihood = ggtree(tree_ml_bootstrap) +
geom_tiplab(size = 3, color = 'purple')+
labs(title = 'Maximum Likelihood Tree')
p_Maximum_Likelihood
p_all = p_nj + p_UPGMA + p_Maximum_Parsimony + p_Maximum_Likelihood +
plot_layout(ncol = 2)
ggsave(p_all, filename = 'figures/all.pdf', width = 8, height = 8)