kraken2 手册:https://github.com/DerrickWood/kraken2/wiki/Manual
1 kraken2 report是这样子的
想把各分类都注释上去呢
2 python3代码
思路:
(1) 根据第四列判断domain phylum class order family genus species,将分类信息赋值给变量,不考虑含数字的分类。
(2) 根据条件将分类信息加到第六列输出即可。
(U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies
#!/usr/bin/env python3
import os, sys, re
ms, infile_name, outfile_name = sys.argv
with open(infile_name, 'r') as infile:
with open(outfile_name, 'w') as o:
Domain = ""
for line in infile:
line = line.strip()
lines = re.split(r'\t', line)
lines[5] = re.split(r'^[ ]*', lines[5])[1]
if lines[3] == "D":
Domain = lines[5]
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "P":
Phylum = lines[5]
lines[5] = Domain+";"+Phylum
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "C":
Class = lines[5]
lines[5] = Domain+";"+Phylum+";"+Class
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "O":
Order = lines[5]
lines[5] = Domain+";"+Phylum+";"+Class+";"+Order
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "F":
Family = lines[5]
lines[5] = Domain+";"+Phylum+";"+Class+";"+Order+";"+Family
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "G":
Genus = lines[5]
lines[5] = Domain+";"+Phylum+";"+Class+";"+Order+";"+Family+";"+Genus
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "S":
Species = lines[5]
lines[5] = Domain+";"+Phylum+";"+Class+";"+Order+";"+Family+";"+Genus+";"+Species
o.write("\t".join(lines))
o.write("\n")
上面漏掉了kingdom的信息,尤其是真菌是Kingdom在结果中完全看不到真菌。奇怪的是Kraken注释,只有部分Domain有kingdom,修改代码。后续数据处理不变。
#!/usr/bin/env python3
import os, sys, re
ms, infile_name, outfile_name = sys.argv
with open(infile_name, 'r') as infile:
with open(outfile_name, 'w') as o:
Domain = ""
for line in infile:
line = line.strip()
lines = re.split(r'\t', line)
lines[5] = re.split(r'^[ ]*', lines[5])[1]
if lines[3] == "D":
Domain = lines[5]
o.write("\t".join(lines))
o.write("\n")
Kingdom = "NG" ## Kingdom非必须,Domain必须,在这里赋空值
elif lines[3] == "K":
Kingdom = lines[5]
lines[5] = Domain+";"+Kingdom
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "P":
Phylum = lines[5]
lines[5] = Domain+";"+Kingdom+";"+Phylum
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "C":
Class = lines[5]
lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "O":
Order = lines[5]
lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class+";"+Order
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "F":
Family = lines[5]
lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class+";"+Order+";"+Family
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "G":
Genus = lines[5]
lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class+";"+Order+";"+Family+";"+Genus
o.write("\t".join(lines))
o.write("\n")
elif lines[3] == "S":
Species = lines[5]
lines[5] = Domain+";"+Kingdom+";"+Phylum+";"+Class+";"+Order+";"+Family+";"+Genus+";"+Species
o.write("\t".join(lines))
o.write("\n")
3 运行脚本,查看结果
任意找一个样本的kraken2结果,包含所有物种物种注释结果,即使reads count是0。行数均为31030。
conda activate python3.7
python3 ../script/kraken_anno.py sample_report.tsv kraken_anno.txt
4 抽提种水平注释
cat kraken_anno.txt | awk -F"\t" '{if($4=="S") print $6}' > kraken_anno2.txt
行数为17091,该数据库中各物种总数。
为保证处理结果与kraken2的合并结果species完全一致,还需做:
' -> 空
[] -> 空
5 统计各门中基因组数量
四大域/界:古菌,细菌,真核,病毒
kraken2数据库的基因组统计,门水平
1 Archaea Candidatus Geothermarchaeota
1 Archaea Candidatus Korarchaeota
1 Archaea Candidatus Lokiarchaeota
1 Archaea Candidatus Micrarchaeota
57 Archaea Crenarchaeota
230 Archaea Euryarchaeota
24 Archaea Thaumarchaeota
17 Bacteria Acidobacteria
1164 Bacteria Actinobacteria
17 Bacteria Aquificae
3 Bacteria Armatimonadetes
484 Bacteria Bacteroidetes
1 Bacteria Balneolaeota
1 Bacteria Caldiserica
1 Bacteria Calditrichaeota
1 Bacteria Candidatus Bipolaricaulota
1 Bacteria Candidatus Cloacimonetes
1 Bacteria Candidatus Omnitrophica
4 Bacteria Candidatus Saccharibacteria
22 Bacteria Chlamydiae
14 Bacteria Chlorobi
19 Bacteria Chloroflexi
1 Bacteria Chrysiogenetes
8 Bacteria Coprothermobacterota
190 Bacteria Cyanobacteria
5 Bacteria Deferribacteres
36 Bacteria Deinococcus-Thermus
2 Bacteria Dictyoglomi
3 Bacteria Elusimicrobia
1 Bacteria Fibrobacteres
1001 Bacteria Firmicutes
25 Bacteria Fusobacteria
4 Bacteria Gemmatimonadetes
2 Bacteria Ignavibacteriae
1 Bacteria Kiritimatiellaeota
9 Bacteria Nitrospirae
35 Bacteria Planctomycetes
3102 Bacteria Proteobacteria
60 Bacteria Spirochaetes
8 Bacteria Synergistetes
152 Bacteria Tenericutes
7 Bacteria Thermodesulfobacteria
32 Bacteria Thermotogae
16 Bacteria Verrucomicrobia
24 Eukaryota Apicomplexa
52 Eukaryota Ascomycota
2 Eukaryota Bacillariophyta
5 Eukaryota Basidiomycota
1 Eukaryota Cercozoa
4 Eukaryota Chlorophyta
1 Eukaryota Chordata
1 Eukaryota Ciliophora
7 Eukaryota Euglenozoa
2 Eukaryota Evosea
4 Eukaryota Microsporidia
4 Eukaryota Rhodophyta
69 Eukaryota Streptophyta
729 Viruses Artverviricota
410 Viruses Cossaviricota
903 Viruses Cressdnaviricota
7 Viruses Dividoviricota
222 Viruses Duplornaviricota
46 Viruses Hofneiviricota
789 Viruses Kitrinoviricota
1153 Viruses Lenarviricota
645 Viruses Negarnaviricota
127 Viruses Nucleocytoviricota
111 Viruses Peploviricota
61 Viruses Phixviricota
894 Viruses Pisuviricota
94 Viruses Preplasmiviricota
417 Viruses Saleviricota
3542 Viruses Uroviricota