Kraken2 manual on GitHub:
https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown
Environment
# presumably the modules imported by the tax_gtdb_final.py helper script used below (hence the Biopython install):
import argparse
import gzip
import itertools
import os
import pathlib
import textwrap
import sys
from Bio import Phylo
source /hwfsxx1/ST_HN/P18Z10200N0423/huty/software/miniconda3_2/etc/profile.d/conda.sh
conda activate py38
conda install -c bioconda biopython
GTDB annotation
source /hwfssz1/**/hutongyuan/softwares/miniconda3/etc/profile.d/conda.sh
conda activate gtdbtk
# Configure the database path
echo "export GTDBTK_DATA_PATH=/hwfssz1/**/hutongyuan/databases/release207_v2/" > /hwfssz1/**/hutongyuan/softwares/miniconda3/envs/gtdbtk/etc/conda/activate.d/gtdbtk.sh
# Classify
gtdbtk classify_wf \
--cpus 32 \
-x fa \
--genome_dir bins_drep_comp70/dereplicated_genomes/ \
--out_dir anno_drep_comp70/
Processing the annotation files
Many potential novel species have no annotation at some ranks, so the MAG ID is used as a stand-in.
Some MAGs receive identical species annotations, so every species field is rewritten with the MAG ID to keep them distinct.
cp /hwfssz5/**/anno_drep_comp70/classify/gtdbtk.bac120.summary.tsv ./
cp /hwfssz5/**/anno_drep_comp70/classify/gtdbtk.ar53.summary.tsv ./
# Extract and merge the bacterial and archaeal summaries
touch anno.bac.ar.tsv
sed '1d' gtdbtk.bac120.summary.tsv | awk -F"\t" 'BEGIN{OFS="\t"}{print $1, $2}' >> anno.bac.ar.tsv
sed '1d' gtdbtk.ar53.summary.tsv | awk -F"\t" 'BEGIN{OFS="\t"}{print $1, $2}' >> anno.bac.ar.tsv
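A quick sanity check (assuming every dereplicated genome was classified by GTDB-Tk): the merged table should have one row per MAG, 120 in this run per the tax-assignment log further below.
wc -l anno.bac.ar.tsv   # expect one line per MAG (120 here)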
# Fill in the unannotated ranks
#!/usr/bin/env python3
import re

# Fill empty GTDB ranks (class through genus) with the MAG ID; the species
# field is always rewritten to "s__<MAG ID>" so every genome carries a
# unique species label.
with open("anno_edit.txt", "w") as o, open("anno.bac.ar.tsv", "r") as infile:
    for line in infile:
        f = re.split(r"[\t;]", line.strip())  # f[0] = MAG ID, f[1:8] = d..s ranks
        f[7] = "s__" + f[0]                   # species: always the MAG ID
        if f[6] == "g__":
            f[6] = "g__" + f[0]
        if f[5] == "f__":
            f[5] = "f__" + f[0]
        if f[4] == "o__":
            f[4] = "o__" + f[0]
        if f[3] == "c__":
            f[3] = "c__" + f[0]
        o.write("{}\t{}\n".format(f[0], ";".join(f[1:8])))
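For illustration, with a hypothetical MAG ID MAG001: an input row whose lineage ends in ...;f__;g__;s__ comes out as ...;f__MAG001;g__MAG001;s__MAG001, and even a fully classified genome has its species field rewritten to s__MAG001, which is what keeps every species label unique.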
Requirements for the GTDB annotation file:
1. Columns are separated by tabs.
2. No blank line at the end of the file.
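Both requirements can be checked before building; a minimal sketch against the anno_edit.txt produced above:
# every line must have exactly two tab-separated columns
awk -F'\t' 'NF != 2 {print "bad line " NR ": " $0}' anno_edit.txt
# the last line must not be blank
tail -n 1 anno_edit.txt | grep -q . && echo "OK" || echo "trailing blank line"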
Preparing the kraken2 database build
source /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/etc/profile.d/conda.sh
conda activate python3.7
bins="/hwfssz5/ST_HEALTH/P18Z10200N0423/hutongyuan/analysis/fish/02_bin/bins_drep_comp70/dereplicated_genomes"
python ./tax_gtdb_final.py \
--gtdb anno_edit.txt \
--assemblies $bins/ \
--nodes nodes.dmp \
--names names.dmp \
--kraken_dir kraken_library
bash ./sc_buildk2db.sh
Output:
Loading GTDB taxonomy:
336 taxa found
Assigning tax IDs:
domains: 2
phyla: 13
classes: 19
orders: 38
families: 56
genera: 88
species: 120
Writing taxonomy nodes to nodes.dmp
Writing taxonomy names to names.dmp
Searching for an assembly for each accession:
120 / 120 assemblies found
Making Kraken assembly directory:
120 / 120 assemblies
kraken_library now contains exactly the dereplicated MAGs.
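The helper script stores each assigned tax ID directly in the FASTA headers using kraken2's kraken:taxid convention (visible later in the kmer2read_distr log); a quick check:
# headers should look like >seqid|kraken:taxid|<taxid>
grep -m 1 "^>" kraken_library/*.fa | head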
Build the database
conda activate kraken2
for file in kraken_library/*.fa
do
kraken2-build --add-to-library $file --db kraken_database
done
mkdir kraken_database/taxonomy
mv names.dmp kraken_database/taxonomy
mv nodes.dmp kraken_database/taxonomy
## kraken2-build --build --db kraken_database   # direct build; here it is submitted as a cluster job instead:
vi build_kk2.sh
source /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/etc/profile.d/conda.sh
conda activate kraken2
kraken2-build --build --db kraken_database --threads 4
qsub -cwd -l vf=10G,p=4 -q st.q -P P18Z10200N0423 -binding linear:4 build_kk2.sh
Output:
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.762s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 1491144408 bytes
Capacity estimation complete. [3m54.774s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 11 bits reserved for taxid.
Completed processing of 282544 sequences, 1399196925 bp
Writing data to disk... complete.
Database files completed. [9m25.653s]
Database construction complete. [Total: 13m35.437s]
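Once the build completes, the database directory should hold the three standard kraken2 index files:
ls -lh kraken_database/*.k2d
# expected: hash.k2d (the ~1.5 GB hash table estimated in step 2), opts.k2d, taxo.k2d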
Generate the Bracken distribution
# Classify every sequence in the library against the database itself; Bracken uses the resulting database.kraken for profile correction
cd kraken_database
kraken2 --db=./ --threads=4 <( find -L library \( -name "*.fna" -o -name "*.fasta" -o -name "*.fa" \) -exec cat {} + ) > database.kraken
Output:
Loading database information... done.
41646 sequences (388.85 Mbp) processed in 37.970s (65.8 Kseq/m, 614.45 Mbp/m).
41646 sequences classified (100.00%)
0 sequences unclassified (0.00%)
readlen = 50, 100, 200, 300 (these distributions may be needed by Bracken later)
conda activate bracken
kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database50mers.kraken -l 50 -t 1
kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database100mers.kraken -l 100 -t 1
kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database200mers.kraken -l 200 -t 1
kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database300mers.kraken -l 300 -t 1
generate_kmer_distribution.py -i database50mers.kraken -o database50mers.kmer_distrib
generate_kmer_distribution.py -i database100mers.kraken -o database100mers.kmer_distrib
generate_kmer_distribution.py -i database200mers.kraken -o database200mers.kmer_distrib
generate_kmer_distribution.py -i database300mers.kraken -o database300mers.kmer_distrib
# rm -r ../kraken_library/
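These .kmer_distrib files are what Bracken reads at profiling time. A hedged usage sketch (sample.kreport is a hypothetical kraken2 --report output; -r must match one of the read lengths built above):
# re-estimate species-level abundances from a kraken2 report
# -r 100 makes bracken use database100mers.kmer_distrib
bracken -d kraken_database -i sample.kreport -o sample.bracken -r 100 -l S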
Output:
(bracken) [hutongyuan@cngb-xcompute-0-18 kraken_database]$ kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database50mers.kraken -l 50 -t 1
>>STEP 0: PARSING COMMAND LINE ARGUMENTS
Taxonomy nodes file: ./taxonomy/nodes.dmp
Seqid file: seqid2taxid.map
Num Threads: 1
Kmer Length: 31
Read Length: 50
>>STEP 1: READING SEQID2TAXID MAP
282544 total sequences read
>>STEP 2: READING NODES.DMP FILE
1056 total nodes read
>>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
50mers, with a database built using 31mers
282546 sequences converted (finished)
Time Elaped: 1 minutes, 23 seconds, 0.00000 microseconds
=============================
(bracken) [hutongyuan@cngb-xcompute-0-18 kraken_database]$ kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database100mers.kraken -l 100 -t 1
>>STEP 0: PARSING COMMAND LINE ARGUMENTS
Taxonomy nodes file: ./taxonomy/nodes.dmp
Seqid file: seqid2taxid.map
Num Threads: 1
Kmer Length: 31
Read Length: 100
>>STEP 1: READING SEQID2TAXID MAP
282544 total sequences read
>>STEP 2: READING NODES.DMP FILE
1056 total nodes read
>>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
100mers, with a database built using 31mers
282546 sequences converted (finished)
Time Elaped: 2 minutes, 29 seconds, 0.00000 microseconds
=============================
(bracken) [hutongyuan@cngb-xcompute-0-18 kraken_database]$ kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database200mers.kraken -l 200 -t 1
>>STEP 0: PARSING COMMAND LINE ARGUMENTS
Taxonomy nodes file: ./taxonomy/nodes.dmp
Seqid file: seqid2taxid.map
Num Threads: 1
Kmer Length: 31
Read Length: 200
>>STEP 1: READING SEQID2TAXID MAP
282544 total sequences read
>>STEP 2: READING NODES.DMP FILE
1056 total nodes read
>>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
200mers, with a database built using 31mers
282546 sequences converted (finished)
Time Elaped: 4 minutes, 12 seconds, 0.00000 microseconds
=============================
(bracken) [hutongyuan@cngb-xcompute-0-18 kraken_database]$ kmer2read_distr --seqid2taxid seqid2taxid.map --taxonomy ./taxonomy --kraken database.kraken --output database300mers.kraken -l 300 -t 1
>>STEP 0: PARSING COMMAND LINE ARGUMENTS
Taxonomy nodes file: ./taxonomy/nodes.dmp
Seqid file: seqid2taxid.map
Num Threads: 1
Kmer Length: 31
Read Length: 300
>>STEP 1: READING SEQID2TAXID MAP
282544 total sequences read
>>STEP 2: READING NODES.DMP FILE
1056 total nodes read
>>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
300mers, with a database built using 31mers
282546 sequences converted (finished)
Time Elaped: 5 minutes, 39 seconds, 0.00000 microseconds
=============================
(bracken) [hutongyuan@cngb-login-0-8 kraken_database]$ generate_kmer_distribution.py -i database50mers.kraken -o database50mers.kmer_distrib
PROGRAM START TIME: 12-04-2022 14:10:33
...473 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 12-04-2022 14:10:35
(bracken) [hutongyuan@cngb-login-0-8 kraken_database]$ generate_kmer_distribution.py -i database100mers.kraken -o database100mers.kmer_distrib
PROGRAM START TIME: 12-04-2022 14:10:35
...473 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 12-04-2022 14:10:37
(bracken) [hutongyuan@cngb-login-0-8 kraken_database]$ generate_kmer_distribution.py -i database200mers.kraken -o database200mers.kmer_distrib
PROGRAM START TIME: 12-04-2022 14:10:37
...473 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 12-04-2022 14:10:38
(bracken) [hutongyuan@cngb-login-0-8 kraken_database]$ generate_kmer_distribution.py -i database300mers.kraken -o database300mers.kmer_distrib
PROGRAM START TIME: 12-04-2022 14:10:39
...473 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 12-04-2022 14:10:40