wtdbg2: Principles and Usage

wtdbg2's main strengths are that it is very fast and very easy to install and use.
It is used to assemble nanopore and pacbio data.

Principles

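First, note that wtdbg2 does not follow the de Bruijn graph (DBG) approach used by short-read assemblers such as megahit. The graph wtdbg2 builds is called a fuzzy-Bruijn graph (FBG). As the authors put it in the paper: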

A ‘base’ in FBG is a 256 bp bin (each small box) and a ‘K-mer’ or K-bin in FBG consists of K consecutive bins on reads.

In other words, what a base is in a DBG, a 256 bp bin is in wtdbg2; and what a K-mer is in a DBG, a run of at least four consecutive bins on a read is in wtdbg2.
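To make the bin arithmetic concrete, here is a minimal sketch that counts how many 256 bp bins fit in a read and how many windows of K consecutive bins (K-bins) that gives. The helper names count_bins and count_kbins are made up for illustration, and ignoring a trailing partial bin is an assumption, not something taken from the wtdbg2 source.

```shell
BIN_SIZE=256   # bin size in bp, as stated in the quoted passage

# Number of whole bins that fit in a read of length $1 (bp).
# Assumption: a trailing partial bin is ignored.
count_bins() {
  echo $(( $1 / BIN_SIZE ))
}

# Number of K-bins (windows of K consecutive bins) in a read.
# $1 = read length in bp, $2 = K
count_kbins() {
  bins=$(( $1 / BIN_SIZE ))
  if [ "$bins" -ge "$2" ]; then
    echo $(( bins - $2 + 1 ))
  else
    echo 0
  fi
}
```

For example, a 1,000 bp read holds only 3 whole bins, so with K = 4 it yields no K-bin at all, while a 2,560 bp read holds 10 bins and yields 7.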

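[Figure: schematic of reads split into bins (boxes) and the resulting fuzzy-Bruijn graph; image not available]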

How wtdbg2 works, step by step:

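Load all reads into memory and count the k-mers.
Split each read into bins of 256 bp (each box in the figure). A K-bin covers at least 4 x 256 bp, so if your nanopore or pacbio reads are mostly shorter than 4 x 256 bp (1,024 bp), you will have to use a different tool!
Quoting the paper: "different K-bins may be represented by a single vertex if they are aligned together based on all-versus-all read alignment" (the figure makes this easier to grasp). This step tolerates mismatches and gaps.
Build a hash table whose keys are the k-mers that occur twice or more in the reads (a k-mer seen only once cannot be confirmed as real and cannot be corrected) and whose values are the positions of the associated bins on the reads.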
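The hash-table step can be illustrated with a toy sketch. It is not wtdbg2's actual implementation: the function name kmer_index, the one-sequence-per-line input format and the output layout are all assumptions made for this example. It keeps only k-mers seen at least twice and records, for each occurrence, the read number and the index of the bin in which the k-mer starts.

```shell
# usage: kmer_index <k> [bin_size] < reads.txt   (one sequence per line)
# Prints one line per k-mer seen at least twice: the k-mer, then read:bin pairs.
kmer_index() {
  awk -v k="$1" -v b="${2:-256}" '
    {
      for (i = 1; i + k - 1 <= length($0); i++) {
        kmer = substr($0, i, k)
        count[kmer]++
        # value: read number (1-based) and 0-based index of the bin holding the k-mer start
        where[kmer] = where[kmer] " " NR ":" int((i - 1) / b)
      }
    }
    END {
      # k-mers seen only once are dropped, as in the step described above
      for (kmer in count)
        if (count[kmer] >= 2) print kmer where[kmer]
    }
  ' | sort
}
```

A quick way to try it is to pipe two short toy reads in with a small bin size, for instance printf 'ACGTA\nCGTAC\n' | kmer_index 4 2, which reports only the 4-mer shared by both reads.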

Installing and running wtdbg2:

git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make

# quick start with wtdbg2.pl
# -x specifies the sequencing technology: "rs" for PacBio RSII, "sq" for PacBio Sequel,
#    "ccs" for PacBio CCS reads and "ont" for Oxford Nanopore
# -e (default 3) specifies the minimum read coverage of an edge in the assembly graph
./wtdbg2.pl -t 16 -x rs -g 4.6m -o dbg reads.fa.gz
# Step by step commandlines
# assemble long reads
./wtdbg2 -x rs -g 4.6m -i reads.fa.gz -t 16 -fo dbg
# derive consensus
./wtpoa-cns -t 16 -i dbg.ctg.lay.gz -fo dbg.raw.fa
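When the same two steps are run on several datasets, it can help to wrap them in a small function. The sketch below is one possible way to do that; the names run and assemble and the DRY_RUN switch are hypothetical, and the only wtdbg2 options used are the ones already shown above.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: assemble <preset> <genome_size> <reads> <threads> <prefix>
assemble() {
  preset=$1; gsize=$2; reads=$3; threads=$4; prefix=$5
  # step 1: assemble long reads; step 2: derive the consensus
  run ./wtdbg2 -x "$preset" -g "$gsize" -i "$reads" -t "$threads" -fo "$prefix" &&
  run ./wtpoa-cns -t "$threads" -i "$prefix.ctg.lay.gz" -fo "$prefix.raw.fa"
}
```

Setting DRY_RUN=1 before calling assemble rs 4.6m reads.fa.gz 16 dbg prints the two commands without running them, which is a cheap way to check the arguments before starting a long job.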

For subsequent polishing, the recommendation is 2 iterations of racon, then medaka, then 2 iterations of pilon (pilon can take either short reads or long reads).
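As a rough sketch of the racon part only, the function below prints the commands for N rounds, feeding each round's output into the next. Several things here are assumptions not found in this post: minimap2 is used to produce the overlaps, the map-ont preset implies Nanopore reads (map-pb would be the PacBio counterpart), and the function name and the round/racon file names are invented for illustration. The medaka and pilon steps are not covered.

```shell
# usage: racon_rounds <draft.fa> <reads> <threads> <rounds>
# Prints the commands rather than running them, so they can be reviewed first.
racon_rounds() {
  draft=$1; reads=$2; threads=$3; rounds=$4
  i=1
  while [ "$i" -le "$rounds" ]; do
    # map the reads back to the current draft (assumed overlapper: minimap2)
    echo "minimap2 -x map-ont -t $threads $draft $reads > round$i.paf"
    # racon takes: <sequences> <overlaps> <target sequences>
    echo "racon -t $threads $reads round$i.paf $draft > racon$i.fa"
    draft=racon$i.fa   # the next round polishes this round's output
    i=$((i + 1))
  done
}
```

Calling racon_rounds dbg.raw.fa reads.fa.gz 16 2 prints four lines, two per round, with the second round polishing racon1.fa.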

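If you have questions, read the original paper first: Fast and accurate long-read assembly with wtdbg2, Nature Methods. It is very readable. If that does not solve your problem, search the issues page of the author's GitHub repository to see whether a similar question has already been answered; the author replies there often and very patiently.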

Some common questions

The choice of kmer-size or pmer-size depends mostly on the sequencing error rate.

Increasing -e or the read-length threshold (-L) improves contiguity, at the cost of assembled genome size.

wtdbg2 is designed to be able to assemble a huge genome within one day; SMARTdenovo might produce better assemblies for small genomes.

References
Ruan, J., & Li, H. (2019). Fast and accurate long-read assembly with wtdbg2. Nature Methods.
