使用Unicycler进行细菌基因组组装

二代测序技术的迅速发展大幅度的降低了高通量测序价格且提高了测序质量,同时高通量测序相关的数据处理和分析的软件也是越来越好用。SPAdes在二代测序小基因组的组装上表现优异,使用方便,真是爱不释手。Unicycler在二代测序小基因组装上则是依赖于SPAdes,在此基础上进行了优化;此外在二代三代测序数据混合组装和纯三代测序数据组装上也表现较好;该项目在github上被标星210次。今天使用SPAdes和Unicycler对一个350bp小片段建库测序1G数据量的细菌菌株进行组装,熟悉一下流程并且看看两个软件的差别。


1. 数据质控

公司给的建库测序的数据分为raw_data和clean_data, 我先直接使用fastqc对clean_data的数据进行质控。双端测序数据是N46-1_add_BDMS190627350-1a_1.clean.fq.gz和N46-1_add_BDMS190627350-1a_2.clean.fq.gz。

(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ fastqc '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz' '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz'  -o '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/fastqc_out'

正向序列
反向序列

fastqc结果看,测序长度是150bp,序列数量约为8M,数据量约为1.2G,序列GC含量和duplication水平不是特别完美,总体来看双端序列质量很好,数据的质量和数量满足下游分析。

2. 使用spades对Illumina双端序列进行组装。
(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ spades.py -1 N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 N46-1_add_BDMS190627350-1a_2.clean.fq.gz --careful -o spades_out -t 12 -m 24
Command line: /home/czh/miniconda3/envs/denovo_genome/bin/spades.py -1  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz  -2  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz  --careful   -o  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out    -t  12  -m  24  

System information:
  SPAdes version: 3.13.1
  Python version: 3.7.3
  OS: Linux-5.0.0-23-generic-x86_64-with-debian-buster-sid

Output dir: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz']
      right reads: ['/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz']
      interlaced reads: not specified
      single reads: not specified
      merged reads: not specified
Read error correction parameters:
  Iterations: 1
  PHRED offset will be auto-detected
  Corrected reads will be compressed
Assembly parameters:
  k: automatic selection based on read length
  Repeat resolution is enabled
  Mismatch careful mode is turned ON
  MismatchCorrector will be used
  Coverage cutoff is turned OFF
Other parameters:
  Dir for temp files: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/tmp
  Threads: 12
  Memory limit (in Gb): 24


======= SPAdes pipeline started. Log can be found here: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/spades.log


===== Read error correction started. 


== Running read error correction tool: /home/czh/miniconda3/envs/denovo_genome/bin/spades-hammer /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/corrected/configs/config.info

  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  75)   Starting BayesHammer, built from N/A, git revision N/A
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  76)   Loading config from /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/corrected/configs/config.info
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  78)   Maximum # of threads to use (adjusted due to OMP capabilities): 12
  0:00:00.000     4M / 4M    INFO    General                 (memory_limit.cpp          :  49)   Memory limit set to 24 Gb
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  86)   Trying to determine PHRED offset
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  92)   Determined value is 33


耗时大约70分钟。

3. 使用unicycler对Illumina双端序列进行组装。
(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ unicycler -1 N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o unicycler_out -t 12
Starting Unicycler (2019-11-03 22:46:05)
    Welcome to Unicycler, an assembly pipeline for bacterial genomes. Since you
provided only short reads, Unicycler will essentially function as a SPAdes-
optimiser. It will try many k-mer sizes, choose the best based on contig length
and graph connectivity, and scaffold the graph using SPAdes repeat resolution.
    For more information, please see https://github.com/rrwick/Unicycler

Command: /home/czh/miniconda3/envs/denovo_genome/bin/unicycler -1 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out -t 12

Unicycler version: v0.4.8
Using 12 threads

The output directory already exists:
  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out

Dependencies:
  Program         Version           Status  
  spades.py       3.13.1            good    
  racon                             not used
  makeblastdb     2.9.0+            good    
  tblastn         2.9.0+            good    
  bowtie2-build   2.3.5             good    
  bowtie2         2.3.5             good    
  samtools        1.9               good    
  java            11.0.1-internal   good    
  pilon           1.23              good    
  bcftools                          not used


SPAdes read error correction (2019-11-03 22:47:16)
    Unicycler uses the SPAdes read error correction module to reduce the number
of errors in the short read before SPAdes assembly. This can make the assembly
faster and simplify the assembly graph structure.

Command: /home/czh/miniconda3/envs/denovo_genome/bin/spades.py -1 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/read_correction --threads 12 --only-error-correction

Corrected reads:
  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/corrected_1.fastq.gz
  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/corrected_2.fastq.gz


Choosing k-mer range for assembly (2019-11-03 23:47:00)
    Unicycler chooses a k-mer range for SPAdes based on the length of the input
reads. It uses a wide range of many k-mer sizes to maximise the chance of
finding an ideal assembly.

SPAdes maximum k-mer: 127
Median read length: 150
K-mer range: 27, 47, 63, 77, 89, 99, 107, 115, 121, 127


SPAdes assemblies (2019-11-03 23:48:07)
    Unicycler now uses SPAdes to assemble the short reads. It scores the
assembly graph for each k-mer using the number of contigs (fewer is better) and
the number of dead ends (fewer is better). The score function is 1/(c*(d+2)),
where c is the contig count and d is the dead end count.

K-mer   Contigs   Dead ends   Score   
   27                           failed
   47                           failed
   63                           failed
   77                           failed
   89                           failed
   99                           failed
  107                           failed
  115                           failed
  121                           failed
  127       198           0   2.53e-03 <-best

Read depth filter: removed 216 contigs totalling 72842 bp
Deleting /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/


Determining graph multiplicity (2019-11-04 00:07:15)
    Multiplicity is the number of times a sequence occurs in the underlying
sequence. Single-copy contigs (those with a multiplicity of one, occurring only
once in the underlying sequence) are particularly useful.

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/001_best_spades_graph.gfa


Cleaning graph (2019-11-04 00:07:15)
    Unicycler now performs various cleaning procedures on the graph to remove
overlaps and simplify the graph structure. The end result is a graph ready for
bridging.

Graph overlaps removed

Removed zero-length segments:
    111, 112, 116, 117, 122, 123, 125, 126, 134, 135, 140, 143, 144, 145, 146,
149, 151, 153, 158, 161, 169, 170, 171, 172, 179, 181, 183

Removed zero-length segments:
    110, 165

Removed zero-length segments:
    195

Merged small segments:
    184, 185, 186, 187, 192, 196, 197

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/002_overlaps_removed.gfa

    Unicycler now selects a set of anchor contigs from the single-copy contigs.
These are the contigs which will be connected via bridges to form the final
assembly.

43 anchor segments (5,368,430 bp) out of 162 total segments (5,423,141 bp)


Creating SPAdes contig bridges (2019-11-04 00:07:15)
    SPAdes uses paired-end information to perform repeat resolution (RR) and
produce contigs from the assembly graph. SPAdes saves the graph paths
corresponding to these contigs in the contigs.paths file. When one of these
paths contains two or more anchor contigs, Unicycler can create a bridge from
the path.

                                                                                                                                         Bridge
Start                                                            Path                                                           End     quality
  -35                                                            -58                                                            44         18.2
  -19       109 -> 114 -> 95 -> -136 -> 106 -> 112 -> 123 -> 80 -> 130 -> 65 -> 136 -> 110 -> -145 -> -103 -> -109 -> 131       43         12.0
    2                                                            -42                                                            37          9.3
    3                                                            102                                                            10         62.2
    5                                                     -96 -> 152 -> -96                                                     -12        37.8
    8                                                            -102                                                           -15        61.8
   10                                                      -98 -> 46 -> -98                                                     1           7.3
   16   -150 -> 113 -> 133 -> -103 -> -109 -> 131 -> 45 -> 132 -> 109 -> 114 -> 124 -> 129 -> 83 -> 121 -> -117 -> -88 -> 156   -28         7.5
   22                                                            -57                                                            -40        11.8
   37                                                            -42                                                            4           8.7


Creating loop unrolling bridges (2019-11-04 00:07:15)
    When a SPAdes contig path connects an anchor contig with the middle contig
of a simple loop, Unicycler concludes that the sequences are contiguous (i.e.
the loop is not a separate piece of DNA). It then uses the read depth of the
middle and repeat contigs to guess the number of times to traverse the loop and
makes a bridge.

                                  Loop count   Loop count    Loop    Bridge
Start   Repeat   Middle     End    by repeat    by middle   count   quality
    2      -42       37       4         1.07         0.97       1      40.5
    5      -96      152     -12         0.67         0.60       1      36.4
    9       97      147      12         0.82         0.71       1      40.1
   10      -98       46       1         0.12         0.88       1      37.5


Applying bridges (2019-11-04 00:07:15)
    Unicycler now applies to the graph in decreasing order of quality. This
ensures that when multiple, contradictory bridges exist, the most supported
option is used.

Bridge type   Start -> end   Path                                       Quality
SPAdes           3 -> 10     102                                         62.210
SPAdes           8 -> -15    -102                                        61.764
SPAdes           5 -> -12    -96, 152, -96                               37.778
SPAdes         -35 -> 44     -58                                         18.167
SPAdes         -19 -> 43     109, 114, 95, -136, 106, 112, 123, 80,      11.994
                             130, 65, 136, 110, -145, -103, -109, 131          
SPAdes          22 -> -40    -57                                         11.829
loop             2 -> 4      -42, 37, -42                                40.460
loop             9 -> 12     97, 147, 97                                 40.120
loop            10 -> 1      -98, 46, -98                                37.498

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/003_bridges_applied.gfa


Bridged assembly graph (2019-11-04 00:07:16)
    The assembly is now mostly finished and no more structural changes will be
made. Ideally the assembly graph should now have one contig per replicon and no
erroneous contigs (i.e a complete assembly). If there are more contigs, then
the assembly is not complete.

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/004_final_clean.gfa

Component   Segments   Links   Length      N50         Longest segment   Status    
        1        131     184   5,430,795   1,298,586         1,645,591   incomplete


Polishing assembly with Pilon (2019-11-04 00:07:16)
    Unicycler now conducts multiple rounds of Pilon in an attempt to repair any
remaining small-scale errors with the assembly.

Aligning reads to find appropriate insert size range...
Insert size 1st percentile:  190
Insert size 99th percentile: 513

Pilon polish round 1
Total number of changes: 29

Pilon polish round 2
Total number of changes: 4

Pilon polish round 3
No Pilon changes

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/005_polished.gfa


Assembly complete (2019-11-04 02:13:11)
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/assembly.gfa
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/assembly.fasta

总共耗时3小时左右,比较慢。

4. 使用QUAST评估两个软件组装结果。
(qc_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ quast -o quast_out(qc_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ quast -o quast_out/ -t 8 spades_assembly.fasta unicycler_assembly.fasta 
/home/czh/miniconda3/envs/qc_genome/bin/quast -o quast_out/ -t 8 spades_assembly.fasta unicycler_assembly.fasta

Version: 5.0.2

System information:
  OS: Linux-5.0.0-23-generic-x86_64-with-debian-buster-sid (linux_64)
  Python version: 3.6.7
  CPUs number: 16

Started: 2019-11-04 06:14:57

Logging to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/quast.log

CWD: /home/czh/Desktop/01_denovo_genome/N46_DENOVO
Main parameters: 
  MODE: default, threads: 8, minimum contig length: 500, minimum alignment length: 65, \
  ambiguity: one, threshold for extensive misassembly size: 1000

Contigs:
  Pre-processing...
  1  spades_assembly.fasta ==> spades_assembly
  2  unicycler_assembly.fasta ==> unicycler_assembly

2019-11-04 06:15:06
Running Basic statistics processor...
  Contig files: 
    1  spades_assembly
    2  unicycler_assembly
  Calculating N50 and L50...
    1  spades_assembly, N50 = 472283, L50 = 5, Total length = 5427777, GC % = 57.74, # N's per 100 kbp =  15.92
    2  unicycler_assembly, N50 = 1298586, L50 = 2, Total length = 5419702, GC % = 57.76, # N's per 100 kbp =  0.00
  Drawing Nx plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/Nx_plot.pdf
  Drawing cumulative plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/cumulative_plot.pdf
  Drawing GC content plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/GC_content_plot.pdf
  Drawing spades_assembly GC content plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/spades_assembly_GC_content_plot.pdf
  Drawing unicycler_assembly GC content plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/unicycler_assembly_GC_content_plot.pdf
  Drawing Coverage histogram (bin size: 24x)...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/coverage_histogram.pdf
  Drawing spades_assembly coverage histogram (bin size: 24x)...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/spades_assembly_coverage_histogram.pdf
Done.

NOTICE: Genes are not predicted by default. Use --gene-finding or --glimmer option to enable it.

2019-11-04 06:15:08
Creating large visual summaries...
This may take a while: press Ctrl-C to skip this step..
  1 of 2: Creating Icarus viewers...
  2 of 2: Creating PDF with all tables and plots...
Done

2019-11-04 06:15:08
RESULTS:
  Text versions of total report are saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.txt, report.tsv, and report.tex
  Text versions of transposed total report are saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/transposed_report.txt, transposed_report.tsv, and transposed_report.tex
  HTML version (interactive tables and plots) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.html
  PDF version (tables and plots) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.pdf
  Icarus (contig browser) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/icarus.html
  Log is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/quast.log

Finished: 2019-11-04 06:15:08
Elapsed time: 0:00:11.039108
NOTICEs: 1; WARNINGs: 0; non-fatal ERRORs: 0

Thank you for using QUAST!

quast评估结果可以看出unicycler组装结果优于spades,unicycler应该是矫正和剔除了长度太短和一些冗余的contig,减少了contig的数量,提高了N50的长度,但基因组总长度小于spades组装。


5. unicycler调用spades初步组装后对contigs的处理。

区分单拷贝的contigs和重复的contigs,选择合适的单拷贝contigs进行下一步处理。



剔除contig间的重叠区域,为了尽可能的环化contigs。

单拷贝contigs桥接图


最后使用Pilon将短序列mapping到组装好的基因组上进行矫正。


unicycler相较于spades耗时较长,组装结果较好,这只是一个菌基因组的组装比较结果,可能其他菌株使用unicycler组装并不会有较大提升,反而会降低N50,不过N50不是评估基因组的唯一标准,过于苛求更长的N50可能会导致错误组装,可能会误导后续基因注释和功能研究。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,686评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,668评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,160评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,736评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,847评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,043评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,129评论 3 410
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,872评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,318评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,645评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,777评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,470评论 4 333
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,126评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,861评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,095评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,589评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,687评论 2 351

推荐阅读更多精彩内容