Polishing a long-read assembly with NextPolish (v1.2.2)

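NextPolish is a polishing tool for long-read (third-generation) genome assemblies developed by Nextomics (Wuhan); another commonly used tool is Pilon. NextPolish can use short reads, long reads, or both to correct the base errors (SNVs/indels) that noisy long reads introduce into an assembly. Because it is designed specifically for polishing, it outperforms Pilon in both runtime and memory usage.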

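Installation

NextPolish now supports both Python 2 and Python 3; Python 2.7 is recommended. It depends on two Python modules, psutil and drmaa. Only psutil is mandatory; drmaa is needed only if you want to submit jobs to a cluster.

Check the Python version and whether the required modules are installed: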

python -V
# Python 2.7.15
python -c "import psutil"
python -c "import drmaa"
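Since only psutil is mandatory, it can help to report the two modules separately instead of reading raw tracebacks. The following is a minimal sketch; check_py_module is a hypothetical helper name, not part of NextPolish, and it assumes the interpreter you pass is the one NextPolish will run under.

```bash
# Hypothetical helper: report whether a Python module can be imported.
# $1 = Python interpreter to test, $2 = module name
check_py_module() {
    if "$1" -c "import $2" >/dev/null 2>&1; then
        echo "$2: OK"
    else
        echo "$2: missing"
        return 1
    fi
}
```

For example, check_py_module python psutil must succeed before going on, while a failure of check_py_module python drmaa only matters if you plan to submit jobs to a cluster.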

Then go to https://github.com/Nextomics/NextPolish/releases and install the latest release; at the time of writing, that was 1.2.2.

mkdir -p ~/opt/biosoft
cd ~/opt/biosoft
wget https://github.com/Nextomics/NextPolish/releases/download/v1.2.2/NextPolish.tgz
tar -zxvf NextPolish.tgz
# Compile the software
cd NextPolish && make -j 10
# Add the following line to .bashrc or .zshrc
export PATH=~/opt/biosoft/NextPolish:$PATH

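Usage

Note: if your genome was assembled with a tool such as miniasm that has no consensus step, you first need to polish it with long reads, for example with racon or with the script below. Otherwise the error rate of the assembly is too high for short reads to align correctly, which degrades the polishing result.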

threads=20
genome=input.genome.fa # assembled genome
lgsreads=input.lgs.reads.fq.gz # long reads
preset=map-ont # minimap2 preset: map-ont for Nanopore, map-pb for PacBio
# Map the long reads back to the assembly
minimap2 -a -t ${threads} -x ${preset} ${genome} ${lgsreads}| \
    samtools view -F 0x4 -b - | \
    samtools sort - -m 2g -@ ${threads} -o genome.lgs.bam
# Build the indexes
samtools index -@ ${threads} genome.lgs.bam
samtools faidx ${genome}
# Polish with nextPolish.py
python ~/opt/biosoft/NextPolish/lib/nextPolish.py \
    -g ${genome} -t 5 --bam_lgs genome.lgs.bam -p ${threads} > genome.lgspolish.fa
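The -x option of minimap2 takes exactly one preset, so the Nanopore/PacBio choice has to be made before the mapping command runs. A minimal sketch of making that choice explicit is shown here; preset_for and its read-type labels (ont, pb, clr) are hypothetical names chosen for illustration, while map-ont and map-pb are real minimap2 presets.

```bash
# Hypothetical helper: translate a read-type label into a minimap2 preset.
preset_for() {
    case "$1" in
        ont)    echo "map-ont" ;;   # Oxford Nanopore reads
        pb|clr) echo "map-pb" ;;    # PacBio CLR reads
        *)      echo "unknown read type: $1" >&2; return 1 ;;
    esac
}
```

The result can be substituted directly into the mapping step, for instance minimap2 -a -x "$(preset_for ont)" followed by the genome and the reads.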

NextPolish now also supports polishing with long reads through nextpolish2.py, which can be used in place of Racon. The script is as follows:

threads=20
genome=input.genome.fa # assembled genome
lgsreads=input.lgs.reads.fq.gz # long reads
minimap2 -ax map-pb -t ${threads} ${genome} ${lgsreads} | samtools sort - -m 2g --threads ${threads} -o genome.lgs.bam
samtools index genome.lgs.bam
ls `pwd`/genome.lgs.bam > pb.map.bam.fofn
python ~/opt/biosoft/NextPolish/lib/nextpolish2.py -g ${genome} -l pb.map.bam.fofn -r clr -p ${threads} -a -s -o genome.lgspolish.fa

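The resulting genome.lgspolish.fa can then be used in the subsequent short-read polishing steps.

For short-read polishing, NextPolish requires two files:

run.cfg: the configuration file, which sets the parameters
sgs.fofn: the locations of the short-read sequencing files

The walkthrough below uses the assembly produced in the article on assembling Nanopore data with NextDenovo. The analysis directory contains three files.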

Long-read assembly: nextgraph.assembly.contig.fasta
Short reads: ERR2173372_1.fastq, ERR2173372_2.fastq

Step 1: create a file that records the locations of the short-read files

realpath ERR2173372_1.fastq ERR2173372_2.fastq  > sgs.fofn
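A wrong path in sgs.fofn only surfaces once the pipeline is already running, so a quick check beforehand can save time. This is a minimal sketch under the assumption that the fofn holds one file path per line; check_fofn is a hypothetical helper, not a NextPolish command.

```bash
# Hypothetical helper: verify that every path listed in a fofn
# (one path per line) points to an existing, non-empty file.
check_fofn() (
    status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue          # skip blank lines
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            status=1
        fi
    done < "$1"
    exit "$status"                          # ends the subshell, i.e. the function
)
```

Running check_fofn sgs.fofn returns a non-zero status and names the offending paths if any read file is absent or empty.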

Step 2: configure the run.cfg file (saved here as run2.cfg)

# Copy the config file from the NextPolish directory
cp ~/opt/biosoft/NextPolish/doc/run.cfg run2.cfg

Edit the config file

[General]
job_type = local
job_prefix = nextPolish
task = 1212
rewrite = yes
rerun = 3
parallel_jobs = 2
multithread_jobs = 10
genome = ./nextgraph.assembly.contig.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 100

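The parameters that need to be modified are listed below; for the others, see the official parameter documentation:

job_type: how jobs are run. local runs everything on a single node. Because NextPolish submits jobs through DRMAA, SGE, PBS and SLURM are also supported.
task: which polishing tasks to run. Use 12, 1212, 121212 or 12121212 to set the number of polishing rounds; two rounds are usually enough.
parallel_jobs and multithread_jobs: the number of jobs run at the same time and the number of threads per job; here 2 x 10 = 20 threads in total.
genome: path to the assembled genome
workdir: directory for the output files
sgs_options: parameters for short-read polishing, including -use_duplicate_reads, -unpaired, -max_depth, -bwa and -minimap2 (used by default)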
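Because the total thread demand is the product of parallel_jobs and multithread_jobs, it is worth computing it from the config before launching. The sketch below assumes the config uses plain key = value lines as in the example above; cfg_get and cfg_total_threads are hypothetical helpers written for this illustration.

```bash
# Hypothetical helper: print the value of a "key = value" line in a cfg file.
# $1 = cfg file, $2 = key
cfg_get() {
    awk -v key="$2" '
        index($0, "=") {
            k = substr($0, 1, index($0, "=") - 1)   # text before the first "="
            v = substr($0, index($0, "=") + 1)      # text after the first "="
            gsub(/^[ \t]+|[ \t]+$/, "", k)          # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# Total threads requested = parallel_jobs * multithread_jobs
cfg_total_threads() {
    echo $(( $(cfg_get "$1" parallel_jobs) * $(cfg_get "$1" multithread_jobs) ))
}
```

With the config shown above, cfg_total_threads run2.cfg gives 20, which can be compared against the core count reported by nproc.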

Running

nextPolish run2.cfg &

The final lines of the log report where the polished output files are stored; merge those files into a single file.
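The merge itself is a plain concatenation. As a minimal sketch, assume you have copied the paths reported in the log into a text file, one per line; both that list file and the helper name merge_listed_fasta are hypothetical and not produced by NextPolish.

```bash
# Hypothetical helper: concatenate the FASTA files listed in $1
# (one path per line) into the single output file $2.
merge_listed_fasta() (
    : > "$2" || exit 1                      # create or truncate the output
    while IFS= read -r f || [ -n "$f" ]; do
        [ -z "$f" ] && continue             # skip blank lines
        cat "$f" >> "$2" || exit 1          # stop if a listed file is unreadable
    done < "$1"
)
```

Afterwards, counting the header lines with grep -c on the merged file is a cheap way to confirm that no sequences were lost.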

Besides running locally, NextPolish can also submit jobs through DRMAA; you only need to change job_type and cluster_options. For example, the configuration for SGE is:

# SGE
job_type=sge
cluster_options = -l vf={vf} -q all.q -pe smp {cpu} -S {bash} -w n

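Here {vf}, {cpu} and {bash} are replaced by NextPolish according to the actual job. Submission options differ between job schedulers, so adjust them to your environment.

If the earlier python -c "import drmaa" check fails with the following error

RuntimeError: Could not find drmaa library.  Please specify its full path using the environment variable DRMAA_LIBRARY_PATH

then set the DRMAA_LIBRARY_PATH environment variable to the location of the libdrmaa.so.1 file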

export DRMAA_LIBRARY_PATH=/path/to/libdrmaa.so.1
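If you do not know where the scheduler installed the library, a search over a few candidate directories usually finds it. This is a minimal sketch; find_libdrmaa is a hypothetical helper, and the directories you pass to it are assumptions that depend on how your scheduler was installed.

```bash
# Hypothetical helper: print the first libdrmaa.so* found under the
# directories given as arguments (prints nothing if there is none).
find_libdrmaa() {
    find "$@" -name 'libdrmaa.so*' 2>/dev/null | head -n 1
}
```

For example, export DRMAA_LIBRARY_PATH="$(find_libdrmaa /usr/lib /usr/lib64 /opt)" sets the variable in one step, provided the search actually printed a path.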

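On a single-node server, running the nextPolish pipeline is not the best choice, because it first splits the FASTQ files into chunks and then processes the chunks separately, and the splitting step alone costs some time. Instead, we can write our own script that calls bwa/minimap2 directly for the alignment and then calls the scripts that do the actual polishing.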

#!/usr/bin/env bash

genome=$1
lgsreads=$2
read1=$3
read2=$4
threads=100

NextPolish=/opt/biosoft/NextPolish-1.2.2
# Make sure samtools, bwa and minimap2 are available in the environment
#module load samtools/1.10
#module load bwa/0.7.17

minimap2 -ax map-pb -t ${threads} ${genome} ${lgsreads} | samtools sort - -m 2g --threads 20 -o genome.lgs.bam
samtools index genome.lgs.bam
ls `pwd`/genome.lgs.bam > pb.map.bam.fofn
python $NextPolish/lib/nextpolish2.py -g ${genome} -l pb.map.bam.fofn -r clr -p ${threads} -a -s -o genome.lgspolish.fa

#Set input and parameters
round=2
input=genome.lgspolish.fa
for ((i=1; i<=${round};i++)); do
#step 1:
        #index the genome file and do alignment
        bwa index ${input};
        bwa mem -t ${threads} ${input} ${read1} ${read2} | samtools view --threads 10 -F 0x4 -b - | samtools sort - -m 2g --threads 20 -o sgs.sort.bam;
        #index bam and genome files
        samtools index -@ 20 sgs.sort.bam;
        samtools faidx ${input};
        #polish genome file
        python $NextPolish/lib/nextpolish1.py -g ${input} -t 1 -p ${threads} -s sgs.sort.bam > genome.polishtemp.fa;
        input=genome.polishtemp.fa;
#step2:
        #index genome file and do alignment
        bwa index ${input};
        bwa mem -t ${threads} ${input} ${read1} ${read2} | samtools view --threads 10 -F 0x4 -b - |samtools sort - -m 2g --threads 20 -o sgs.sort.bam;
        #index bam and genome files
        samtools index -@ 20 sgs.sort.bam;
        samtools faidx ${input};
        #polish genome file
        python $NextPolish/lib/nextpolish1.py -g ${input} -t 2 -p ${threads} -s sgs.sort.bam > genome.nextpolish.fa;
        input=genome.nextpolish.fa;
done;
#Finally polished genome file: genome.nextpolish.fa
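The script above fails late, possibly after a long mapping step, if one of the external tools is not on PATH. A minimal pre-flight check could look like the following sketch; missing_tools is a hypothetical helper name, and the tool list to pass is whatever your own script calls.

```bash
# Hypothetical helper: print each command from the argument list
# that cannot be found on PATH (prints nothing if all are present).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Calling missing_tools minimap2 samtools bwa python at the top of the script and aborting when the output is non-empty keeps a missing module load from wasting a run.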