Download and installation
conda create -n phylophlan3
conda activate phylophlan3
conda install -c bioconda phylophlan
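Before writing the configuration file, it can help to confirm that the external tools it will name are actually on PATH. The following is a minimal sketch: missing_tools is a hypothetical helper, and the executable names in the example call are assumptions about how the conda packages name their binaries, so adjust them to your environment.

```shell
# Print every tool from the argument list that is not found on PATH.
# missing_tools is a hypothetical helper, not part of PhyloPhlAn.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names; adjust them to match your environment.
missing_tools phylophlan diamond mafft trimal iqtree raxmlHPC
```

An empty output means every listed tool was found.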
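Configuration file for the run
Writing the configuration file does not require the genome fna or faa files.
Configuration file for genomes in fna (nucleotide) format (the next step is slow with this setup, so it is not recommended):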
# dna
# --db_aa diamond: use the PhyloPhlAn marker protein sequence database
# --map_dna diamond: map the gene (DNA) sequences against the proteins with diamond
phylophlan_write_config_file \
-d a \
-o phylophlan.cfg \
--db_aa diamond \
--map_dna diamond \
--msa mafft \
--trim trimal \
--tree1 iqtree \
--tree2 raxml \
--verbose > log.cfg
Configuration file for genomes already translated to faa (protein) format (faster, recommended):
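# protein
# --db_aa diamond: use the PhyloPhlAn marker protein sequence database
# --map_aa diamond: map proteins against proteins with diamond
# --msa mafft: mafft is the commonly used aligner
phylophlan_write_config_file \
-d a \
-o phylophlan.cfg \
--db_aa diamond \
--map_aa diamond \
--msa mafft \
--trim trimal \
--tree1 iqtree \
--tree2 raxml \
--verbose > log.cfg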
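The protein route needs one faa file per genome. The notes do not say which tool produced the translated files, so the following is only a sketch under the assumption that Prodigal is used for gene prediction. predict_proteins is a hypothetical helper, the directory names are placeholders, and the PRODIGAL variable exists only so the executable can be swapped out.

```shell
# Translate each genome (*.fna) in a directory into a proteome (*.faa).
# Assumption: Prodigal does the gene prediction; the source does not name a tool.
PRODIGAL=${PRODIGAL:-prodigal}

predict_proteins() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for fna in "$in_dir"/*.fna; do
        [ -e "$fna" ] || continue      # skip when no .fna file matches
        name=$(basename "$fna" .fna)
        # -i: input genome, -a: predicted proteins, -o: gene coordinates, -q: quiet
        "$PRODIGAL" -i "$fna" -a "$out_dir/$name.faa" -o "$out_dir/$name.gbk" -q
    done
}

# Example call (placeholder directories):
# predict_proteins genomes_fna genomes_faa
```

The resulting faa directory would then be the one passed to phylophlan with -i.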
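Parameters
-o OUTPUT, --output OUTPUT name of the output configuration file
-d {n,a}, --db_type {n,a} database type (n: nucleotide, a: amino acid)
--db_dna {makeblastdb} build the DNA database index
--map_dna {blastn,tblastn,diamond} map genomes (DNA)
--db_aa {usearch,diamond} build the protein database index
--map_aa {usearch,diamond} map proteomes
--msa {muscle,mafft,opal,upp} multiple sequence alignment
--trim {trimal} alignment trimming
--tree1 {fasttree,raxml,iqtree,astral,astrid} build the phylogeny
--tree2 {raxml} refine the phylogeny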
Programs that use maximum likelihood:
FastTree 2: approximately maximum-likelihood trees for large alignments. PLoS ONE
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol.
RAxML is a program that builds maximum-likelihood phylogenetic trees and supports multithreading and parallelization.
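Running and downloading the database
Run a small test dataset on the download node (a node with internet access) so that the database is fetched.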
source /hwfsxx1/ST_HN/P18Z10200N0423/huty/software/miniconda3_2/etc/profile.d/conda.sh
conda activate phylophlan3
# -d: name of the marker database; it is downloaded automatically if not present
phylophlan \
-i ../lach_test \
-d phylophlan \
-f phylophlan.cfg \
--diversity medium \
-o out_test \
--nproc 30 \
--fast \
--verbose
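Because the input folder may hold genomes or proteomes, it is worth checking what it contains before starting a long run. The sketch below counts files by extension. count_inputs is a hypothetical helper, and it assumes the inputs are uncompressed files ending in .fna (genomes) or .faa (proteomes).

```shell
# Count files in a directory whose names end with the given extension.
# count_inputs is a hypothetical helper; .fna and .faa are assumed extensions.
count_inputs() {
    dir=$1
    ext=$2
    find "$dir" -maxdepth 1 -type f -name "*$ext" | wc -l | tr -d ' '
}

# Example (same directory as the one passed to -i):
# echo "genomes:   $(count_inputs ../lach_test .fna)"
# echo "proteomes: $(count_inputs ../lach_test .faa)"
```

The proteome count should match the "Checking N inputs" line that phylophlan prints when the map_aa configuration is used.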
Parameters
-i INPUT, --input INPUT folder of input genomes or proteomes
-f CONFIG_FILE, --config_file CONFIG_FILE configuration file
--diversity {low,medium,high} expected diversity level of the phylogeny
"low": for genus-/species-/strain-level phylogenies
"medium": for class-/order-level phylogenies
"high": for phylum-/tree-of-life size phylogenies
-o OUTPUT, --output OUTPUT name of the output folder
--fast faster phylogeny reconstruction by reducing the number of phylogenetic positions used
-d DATABASE, --database DATABASE marker database; "phylophlan" is the set of 400 universal marker genes
phylophlan_databases/
├── phylophlan
│ ├── phylophlan.dmnd
│ ├── phylophlan.faa
│ └── phylophlan.faa.bz2
├── phylophlan_databases.txt
├── phylophlan.md5
└── phylophlan.tar
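Inspecting phylophlan.faa shows 400 marker proteins and about 340,000 (344,503) sequences in total. Because the database consists entirely of protein sequences, diamond is used above both for building the database and for mapping. Number of sequences per marker: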
1163 p0000
855 p0001
...
1165 p0397
533 p0398
1012 p0399
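The notes do not show the command that produced these per-marker counts. One way to get a similar tally is sketched below, assuming that every FASTA header contains a marker ID of the form pNNNN (for example p0000). count_per_marker is a hypothetical helper.

```shell
# Count sequences per marker in a protein FASTA file.
# Assumption: each header line contains a marker ID such as p0000.
count_per_marker() {
    awk '/^>/ && match($0, /p[0-9][0-9][0-9][0-9]/) {
             n[substr($0, RSTART, RLENGTH)]++
         }
         END { for (m in n) print n[m], m }' "$1" | sort -k2,2
}

# Example:
# count_per_marker phylophlan_databases/phylophlan/phylophlan.faa
```

Each output line is a count followed by the marker ID, sorted by marker.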
Run log
PhyloPhlAn version 3.0.67 (24 August 2022)
Setting "min_num_markers=100" since no value has been specified and the "database=phylophlan"
Loading configuration file "phylophlan.cfg"
Database folder "phylophlan_databases/phylophlan" present
"db_aa" database "phylophlan_databases/phylophlan/phylophlan.dmnd" present
Loading files from "/hwfsxx1/ST_HN/P18Z10200N0423/huty/analysis/Lach/lach_test"
Checking 6 inputs
Cleaning 6 inputs
Mapping "phylophlan" on 6 inputs (key: "map_aa")
Selecting 6 markers from "out_test/tmp/map_aa"
Extracting markers from 6 inputs
Aligning 333 markers (key: "msa")
Trimming gappy regions for 333 markers (key: "trim")
Trimming gappy columns from 333 markers
Trimming not variant from 333 markers
Subsampling 333 markers
...
Concatenating alignments
Alignments concatenated "out_test/lach_test_concatenated.aln" in 0s
Building phylogeny "out_test/lach_test_concatenated.aln"
Phylogeny "lach_test.tre" built in 131s
Resolving 1 polytomies
Resolving polytomies for "out_test/lach_test.tre.treefile"
"out_test/lach_test_resolved.tre" generated in 0s
Refining phylogeny "out_test/lach_test_resolved.tre"
Reducing number of RAxML threads to 20, as it appears to underperform with more threads
Phylogeny "lach_test_refined.tre" refined in 2s
The final refine step becomes very slow when there are many genomes.
Output files
lach_test_concatenated.aln
lach_test_resolved.tre
lach_test.tre.bionj
lach_test.tre.ckp.gz
lach_test.tre.iqtree
lach_test.tre.log
lach_test.tre.mldist
lach_test.tre.treefile
RAxML_bestTree.lach_test_refined.tre
RAxML_info.lach_test_refined.tre
RAxML_log.lach_test_refined.tre
RAxML_result.lach_test_refined.tre
tmp
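A quick sanity check on the result is that the refined tree has one leaf per input (6 in the test run above). The sketch below counts leaves in a Newick file as the number of commas plus one, which assumes that labels and comments contain no commas. count_leaves is a hypothetical helper.

```shell
# Count the leaves of a Newick tree: in a tree, leaves = commas + 1.
# Assumption: no commas appear inside labels or comments.
count_leaves() {
    commas=$(tr -cd ',' < "$1" | wc -c | tr -d ' ')
    echo $((commas + 1))
}

# Example:
# count_leaves out_test/RAxML_bestTree.lach_test_refined.tre
```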
References
PhyloPhlAn 3.0: phylogenetic analysis of microbiomes
https://github.com/biobakery/biobakery/wiki/PhyloPhlAn-3.0:-Example-02:-Tree-of-life