The main advantages of wtdbg2 are that it is very fast and that it is simple to install and use.
It is used to assemble Nanopore and PacBio data.
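Principle
First, note that wtdbg2 does not work on the same principle as the de Bruijn graph (DBG) used by second-generation assemblers such as MEGAHIT. The graph wtdbg2 builds is called a fuzzy-Bruijn graph (FBG). The authors write in the paper:
A 'base' in FBG is a 256 bp bin (each small box) and a 'K-mer' or K-bin in FBG consists of K consecutive bins on reads.
In other words, what a DBG treats as a base is a 256 bp bin in wtdbg2, and what a DBG treats as a K-mer is, in wtdbg2, a run of at least four consecutive bins on a read.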
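To make the bin and K-bin idea concrete, here is a minimal Python sketch. It is only an illustration of the definitions quoted above, not wtdbg2's actual code: the function names are made up, K = 4 is taken from the "4 x 256 bp" figure in these notes, and dropping a trailing partial bin is my own assumption.

```python
BIN_SIZE = 256  # bp per bin, as in the quote from the paper
K = 4           # bins per K-bin; assumed from the "4 x 256 bp" figure in these notes


def split_into_bins(read, bin_size=BIN_SIZE):
    """Cut a read into consecutive full-length bins.

    Assumption for this sketch: a trailing partial bin is simply dropped.
    """
    n_bins = len(read) // bin_size
    return [read[i * bin_size:(i + 1) * bin_size] for i in range(n_bins)]


def k_bins(n_bins, k=K):
    """Return (first_bin, last_bin_exclusive) for each window of k consecutive bins."""
    return [(i, i + k) for i in range(n_bins - k + 1)]


read = "ACGT" * 400           # toy 1600 bp read
bins = split_into_bins(read)  # 1600 // 256 = 6 full bins
print(len(bins))
print(k_bins(len(bins)))      # sliding windows of 4 bins over 6 bins
```

A read with fewer than four full bins yields no K-bin at all in this sketch, which is the intuition behind the read-length caveat in these notes.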
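Steps of the wtdbg2 algorithm:
Load all reads into memory and count the k-mers.
Split the reads into bins of 256 bp each (each box in the figure); one K-mer spans at least 4 x 256 bp. In other words, for Nanopore or PacBio data whose reads are mostly shorter than 4 x 256 bp, we have to switch to another tool!
Different K-bins may be represented by a single vertex if they are aligned together based on all-versus-all read alignment (this is easiest to understand from the figure). This step tolerates mismatches and gaps.
Build a hash table whose keys are the k-mers that occur two or more times in the reads (a k-mer seen only once cannot be confirmed as real and cannot be corrected) and whose values are the positions of the associated bins on the reads.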
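The hash table in the last step can be sketched in a few lines of Python. This is a toy version under my own assumptions (plain exact k-mers, tiny k and bin size so the example fits on screen, a hypothetical function name); the real wtdbg2 index is more elaborate, so treat it as an illustration of the key/value layout only.

```python
from collections import defaultdict


def build_kmer_index(reads, k=5, bin_size=4):
    """Map each k-mer seen at least twice to the (read_id, bin_index) pairs it falls in.

    k and bin_size are toy values chosen for readability, not wtdbg2 defaults.
    """
    counts = defaultdict(int)  # k-mer -> number of occurrences across all reads
    hits = defaultdict(set)    # k-mer -> {(read_id, bin_index), ...}
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            counts[kmer] += 1
            hits[kmer].add((read_id, pos // bin_size))
    # k-mers seen only once are dropped: they cannot be confirmed as real
    return {kmer: sorted(h) for kmer, h in hits.items() if counts[kmer] >= 2}


reads = ["ACGTACGGTT", "TTACGGTTCA"]  # two toy reads sharing the stretch TACGGTT
index = build_kmer_index(reads)
for kmer, positions in sorted(index.items()):
    print(kmer, positions)
```

Bins from different reads that share entries in such a table are the candidates that the alignment step can then merge into one vertex.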
Installing and running wtdbg2:
git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make
# quick start with wtdbg2.pl
# -x specifies the sequencing technology: "rs" for PacBio RSII, "sq" for PacBio Sequel, "ccs" for PacBio CCS reads and "ont" for Oxford Nanopore
# -e (default 3) specifies the minimum read coverage of an edge in the assembly graph
./wtdbg2.pl -t 16 -x rs -g 4.6m -o dbg reads.fa.gz
# Step by step commandlines
# assemble long reads
./wtdbg2 -x rs -g 4.6m -i reads.fa.gz -t 16 -fo dbg
# derive consensus
./wtpoa-cns -t 16 -i dbg.ctg.lay.gz -fo dbg.raw.fa
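If you want to drive the two steps from a script, something like the following Python sketch may help. It only rebuilds the two command lines shown above from a few parameters; the function names and default values are my own, and nothing is executed unless you call `run_pipeline` yourself on a machine where the binaries were built in the current directory.

```python
import shlex
import subprocess


def wtdbg2_commands(reads, prefix="dbg", preset="rs", genome_size="4.6m", threads=16):
    """Build the argv lists for the assembly and consensus steps shown above."""
    assemble = ["./wtdbg2", "-x", preset, "-g", genome_size,
                "-i", reads, "-t", str(threads), "-fo", prefix]
    consensus = ["./wtpoa-cns", "-t", str(threads),
                 "-i", prefix + ".ctg.lay.gz", "-fo", prefix + ".raw.fa"]
    return [assemble, consensus]


def run_pipeline(reads, **kwargs):
    """Run both steps in order and stop at the first failure (requires the built binaries)."""
    for cmd in wtdbg2_commands(reads, **kwargs):
        subprocess.run(cmd, check=True)


# Print the commands only; call run_pipeline("reads.fa.gz") to actually execute them.
for cmd in wtdbg2_commands("reads.fa.gz"):
    print(shlex.join(cmd))
```

Passing argument lists rather than a single shell string avoids quoting problems when read file names contain spaces.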
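For the subsequent polishing, the suggested recipe is: 2 iterations of Racon, then Medaka, then 2 iterations of Pilon (Pilon can use either second-generation or third-generation reads).
If you have questions, read the original paper first: Fast and accurate long-read assembly with wtdbg2, Nature Methods. It is very easy to follow. If that does not solve your problem, search the Issues page of the author's GitHub repository to see whether a similar question has already been answered; the author replies there often and very patiently.
Some common questions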
- The choice of kmer-size or pmer-size depends mostly on the sequencing error rate.
- Increasing -e or the length threshold (-L) improves contiguity, at the cost of genome size.
- wtdbg2 is designed to be able to assemble a huge genome within one day; SMARTdenovo might give better assemblies for small genomes.
References
Ruan, J., & Li, H. (2019). Fast and accurate long-read assembly with wtdbg2. Nature Methods.