Merging paired-end NGS reads: FLASH

Reposted from: https://mp.weixin.qq.com/s/lgJDpwk0vYipARfTorfCkA
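First, not every analysis requires merging paired-end reads. Transcriptome analysis, for example, does not; merging is most common in amplicon sequencing.
Why merge at all? Second-generation sequencing breaks DNA or RNA into fragments of a chosen size, say 300-400 bp, but each read can only cover a limited length, say 150 nt. Beyond that length, base quality drops so sharply that the extra bases are of little use. That leaves the remaining 150-250 bp of the fragment unread, so the same fragment is sequenced a second time from the opposite end. When the fragment is shorter than the two reads combined, the reads overlap and can be merged into one longer sequence.
The figure below shows the forward and reverse reads from one fragment in a paired-end run, aligned with Clone Manager. Blue and red mark the region where the two reads match, about 111 bp long; the original DNA/RNA fragment is about 189 bp.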

[Figure 1 (Fig1.png): Clone Manager alignment of the forward and reverse reads from one fragment]
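The numbers in this example follow from simple arithmetic: two reads that overlap by k bases together cover len1 + len2 - k bases of the fragment. A minimal sketch, assuming both reads are 150 nt as in the example read length above (the function name is illustrative only):

```python
def merged_length(read1_len: int, read2_len: int, overlap: int) -> int:
    """Length of the fragment reconstructed from two overlapping reads."""
    if overlap < 0 or overlap > min(read1_len, read2_len):
        raise ValueError("overlap must be between 0 and the shorter read length")
    # Bases in the overlap are covered by both reads, so count them once.
    return read1_len + read2_len - overlap


print(merged_length(150, 150, 111))  # 189
```

With a 111 bp overlap, two 150 nt reads reconstruct the 189 bp fragment shown in the figure.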

1. Installing the software

Download and install FLASH from the command line on Linux.
To download and build it yourself:

wget http://ccb.jhu.edu/software/FLASH/FLASH-1.2.11.tar.gz
tar -zxvf FLASH-1.2.11.tar.gz   # extract the archive
cd FLASH-1.2.11/                # enter the source directory
make                            # compile; produces the executable 'flash' in this directory

Or install it with conda:

conda install -c bioconda flash
flash --help
Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ
       flash [OPTIONS] --interleaved-input (MATES.FASTQ | -)
       flash [OPTIONS] --tab-delimited-input (MATES.TAB | -)

----------------------------------------------------------------------------
                                 DESCRIPTION                                
----------------------------------------------------------------------------

FLASH (Fast Length Adjustment of SHort reads) is an accurate and fast tool
to merge paired-end reads that were generated from DNA fragments whose
lengths are shorter than twice the length of reads.  Merged read pairs result
in unpaired longer reads, which are generally more desired in genome
assembly and genome analysis processes.

Briefly, the FLASH algorithm considers all possible overlaps at or above a
minimum length between the reads in a pair and chooses the overlap that
results in the lowest mismatch density (proportion of mismatched bases in
the overlapped region).  Ties between multiple overlaps are broken by
considering quality scores at mismatch sites.  When building the merged
sequence, FLASH computes a consensus sequence in the overlapped region.
More details can be found in the original publication
(http://bioinformatics.oxfordjournals.org/content/27/21/2957.full).

Limitations of FLASH include:
   - FLASH cannot merge paired-end reads that do not overlap.
   - FLASH is not designed for data that has a significant amount of indel
     errors (such as Sanger sequencing data).  It is best suited for Illumina
     data.

----------------------------------------------------------------------------
                               MANDATORY INPUT
----------------------------------------------------------------------------

The most common input to FLASH is two FASTQ files containing read 1 and read 2
of each mate pair, respectively, in the same order.

Alternatively, you may provide one FASTQ file, which may be standard input,
containing paired-end reads in either interleaved FASTQ (see the
--interleaved-input option) or tab-delimited (see the --tab-delimited-input
option) format.  In all cases, gzip compressed input is autodetected.  Also,
in all cases, the PHRED offset is, by default, assumed to be 33; use the
--phred-offset option to change it.

----------------------------------------------------------------------------
                                   OUTPUT
----------------------------------------------------------------------------

The default output of FLASH consists of the following files:

   - out.extendedFrags.fastq      The merged reads.
   - out.notCombined_1.fastq      Read 1 of mate pairs that were not merged.
   - out.notCombined_2.fastq      Read 2 of mate pairs that were not merged.
   - out.hist                     Numeric histogram of merged read lengths.
   - out.histogram                Visual histogram of merged read lengths.

FLASH also logs informational messages to standard output.  These can also be
redirected to a file, as in the following example:

  $ flash reads_1.fq reads_2.fq 2>&1 | tee flash.log

In addition, FLASH supports several features affecting the output:

   - Writing the merged reads directly to standard output (--to-stdout)
   - Writing gzip compressed output files (-z) or using an external
     compression program (--compress-prog)
   - Writing the uncombined read pairs in interleaved FASTQ format
     (--interleaved-output)
   - Writing all output reads to a single file in tab-delimited format
     (--tab-delimited-output)

----------------------------------------------------------------------------
                                   OPTIONS
----------------------------------------------------------------------------

  -m, --min-overlap=NUM   The minimum required overlap length between two
                          reads to provide a confident overlap.  Default:
                          10bp.

  -M, --max-overlap=NUM   Maximum overlap length expected in approximately
                          90% of read pairs.  It is by default set to 65bp,
                          which works well for 100bp reads generated from a
                          180bp library, assuming a normal distribution of
                          fragment lengths.  Overlaps longer than the maximum
                          overlap parameter are still considered as good
                          overlaps, but the mismatch density (explained below)
                          is calculated over the first max_overlap bases in
                          the overlapped region rather than the entire
                          overlap.  Default: 65bp, or calculated from the
                          specified read length, fragment length, and fragment
                          length standard deviation.

  -x, --max-mismatch-density=NUM
                          Maximum allowed ratio between the number of
                          mismatched base pairs and the overlap length.
                          Two reads will not be combined with a given overlap
                          if that overlap results in a mismatched base density
                          higher than this value.  Note: Any occurence of an
                          'N' in either read is ignored and not counted
                          towards the mismatches or overlap length.  Our
                          experimental results suggest that higher values of
                          the maximum mismatch density yield larger
                          numbers of correctly merged read pairs but at
                          the expense of higher numbers of incorrectly
                          merged read pairs.  Default: 0.25.

  -O, --allow-outies      Also try combining read pairs in the "outie"
                          orientation, e.g.

                               Read 1: <-----------
                               Read 2:       ------------>

                          as opposed to only the "innie" orientation, e.g.

                               Read 1:       <------------
                               Read 2: ----------->

                          FLASH uses the same parameters when trying each
                          orientation.  If a read pair can be combined in
                          both "innie" and "outie" orientations, the
                          better-fitting one will be chosen using the same
                          scoring algorithm that FLASH normally uses.

                          This option also causes extra .innie and .outie
                          histogram files to be produced.

  -p, --phred-offset=OFFSET
                          The smallest ASCII value of the characters used to
                          represent quality values of bases in FASTQ files.
                          It should be set to either 33, which corresponds
                          to the later Illumina platforms and Sanger
                          platforms, or 64, which corresponds to the
                          earlier Illumina platforms.  Default: 33.

  -r, --read-len=LEN
  -f, --fragment-len=LEN
  -s, --fragment-len-stddev=LEN
                          Average read length, fragment length, and fragment
                          standard deviation.  These are convenience parameters
                          only, as they are only used for calculating the
                          maximum overlap (--max-overlap) parameter.
                          The maximum overlap is calculated as the overlap of
                          average-length reads from an average-size fragment
                          plus 2.5 times the fragment length standard
                          deviation.  The default values are -r 100, -f 180,
                          and -s 18, so this works out to a maximum overlap of
                          65 bp.  If --max-overlap is specified, then the
                          specified value overrides the calculated value.

                          If you do not know the standard deviation of the
                          fragment library, you can probably assume that the
                          standard deviation is 10% of the average fragment
                          length.

  --cap-mismatch-quals    Cap quality scores assigned at mismatch locations
                          to 2.  This was the default behavior in FLASH v1.2.7
                          and earlier.  Later versions will instead calculate
                          such scores as max(|q1 - q2|, 2); that is, the
                          absolute value of the difference in quality scores,
                          but at least 2.  Essentially, the new behavior
                          prevents a low quality base call that is likely a
                          sequencing error from significantly bringing down
                          the quality of a high quality, likely correct base
                          call.

  --interleaved-input     Instead of requiring files MATES_1.FASTQ and
                          MATES_2.FASTQ, allow a single file MATES.FASTQ that
                          has the paired-end reads interleaved.  Specify "-"
                          to read from standard input.

  --interleaved-output    Write the uncombined pairs in interleaved FASTQ
                          format.

  -I, --interleaved       Equivalent to specifying both --interleaved-input
                          and --interleaved-output.

  -Ti, --tab-delimited-input
                          Assume the input is in tab-delimited format
                          rather than FASTQ, in the format described below in
                          '--tab-delimited-output'.  In this mode you should
                          provide a single input file, each line of which must
                          contain either a read pair (5 fields) or a single
                          read (3 fields).  FLASH will try to combine the read
                          pairs.  Single reads will be written to the output
                          file as-is if also using --tab-delimited-output;
                          otherwise they will be ignored.  Note that you may
                          specify "-" as the input file to read the
                          tab-delimited data from standard input.

  -To, --tab-delimited-output
                          Write output in tab-delimited format (not FASTQ).
                          Each line will contain either a combined pair in the
                          format 'tag <tab> seq <tab> qual' or an uncombined
                          pair in the format 'tag <tab> seq_1 <tab> qual_1
                          <tab> seq_2 <tab> qual_2'.

  -o, --output-prefix=PREFIX
                          Prefix of output files.  Default: "out".

  -d, --output-directory=DIR
                          Path to directory for output files.  Default:
                          current working directory.

  -c, --to-stdout         Write the combined reads to standard output.  In
                          this mode, with FASTQ output (the default) the
                          uncombined reads are discarded.  With tab-delimited
                          output, uncombined reads are included in the
                          tab-delimited data written to standard output.
                          In both cases, histogram files are not written,
                          and informational messages are sent to standard
                          error rather than to standard output.

  -z, --compress          Compress the output files directly with zlib,
                          using the gzip container format.  Similar to
                          specifying --compress-prog=gzip and --suffix=gz,
                          but may be slightly faster.

  --compress-prog=PROG    Pipe the output through the compression program
                          PROG, which will be called as `PROG -c -',
                          plus any arguments specified by --compress-prog-args.
                          PROG must read uncompressed data from standard input
                          and write compressed data to standard output when
                          invoked as noted above.
                          Examples: gzip, bzip2, xz, pigz.

  --compress-prog-args=ARGS
                          A string of additional arguments that will be passed
                          to the compression program if one is specified with
                          --compress-prog=PROG.  (The arguments '-c -' are
                          still passed in addition to explicitly specified
                          arguments.)

  --suffix=SUFFIX, --output-suffix=SUFFIX
                          Use SUFFIX as the suffix of the output files
                          after ".fastq".  A dot before the suffix is assumed,
                          unless an empty suffix is provided.  Default:
                          nothing; or 'gz' if -z is specified; or PROG if
                          --compress-prog=PROG is specified.

  -t, --threads=NTHREADS  Set the number of worker threads.  This is in
                          addition to the I/O threads.  Default: number of
                          processors.  Note: if you need FLASH's output to
                          appear deterministically or in the same order as
                          the original reads, you must specify -t 1
                          (--threads=1).

  -q, --quiet             Do not print informational messages.

  -h, --help              Display this help and exit.

  -v, --version           Display version.

Run `flash --help | less' to prevent this text from scrolling by.
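The overlap search described in the help text can be sketched in a few lines. This is a simplified illustration, not FLASH's actual implementation: it only handles the "innie" orientation, ignores quality scores (so ties go to the shortest overlap rather than being broken by quality), does not apply the --max-overlap rule, and copies read 1's bases in the overlap instead of computing a consensus. All function names are made up for the example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def best_overlap(read1, read2, min_overlap=10, max_mismatch_density=0.25):
    """Return (overlap_length, mismatch_density) of the best overlap, or None."""
    mate = reverse_complement(read2)  # bring read 2 onto read 1's strand
    best = None
    for k in range(min_overlap, min(len(read1), len(mate)) + 1):
        # Compare the last k bases of read 1 with the first k bases of the mate.
        # Positions with an 'N' count toward neither mismatches nor length.
        pairs = [(a, b) for a, b in zip(read1[-k:], mate[:k]) if "N" not in (a, b)]
        if not pairs:
            continue
        density = sum(a != b for a, b in pairs) / len(pairs)
        if density <= max_mismatch_density and (best is None or density < best[1]):
            best = (k, density)
    return best


def merge_pair(read1, read2, **kwargs):
    """Merge a read pair, or return None if no acceptable overlap exists."""
    hit = best_overlap(read1, read2, **kwargs)
    if hit is None:
        return None
    k, _ = hit
    return read1 + reverse_complement(read2)[k:]


# Toy data: a 30 bp fragment read as 20 nt from each end (10 bp overlap).
fragment = "ACGTTGCAAGGCTTACCGATAGCTAGGATC"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])

print(best_overlap(read1, read2))            # (10, 0.0)
print(merge_pair(read1, read2) == fragment)  # True
```

Raising max_mismatch_density in this sketch accepts more candidate overlaps, which mirrors the trade-off the help text describes for -x: more pairs merged, but more of them merged incorrectly.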

2. Usage

flash read1.fq read2.fq -p 33 -r 250 -f 500 -s 100 -o output
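The -p 33 in this command tells FLASH how the quality characters in the FASTQ file map to scores: each score is the character's ASCII code minus the offset. A small sketch of that decoding follows; the offset-guessing helper is only a heuristic assumed for illustration (a character below ASCII 64 can only occur with offset 33) and is not part of FLASH.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of PHRED scores."""
    return [ord(char) - offset for char in quality_string]


def guess_offset(quality_string):
    """Heuristic: characters below ASCII 64 imply offset 33, otherwise assume 64."""
    return 33 if min(ord(char) for char in quality_string) < 64 else 64


print(phred_scores("I5!"))       # [40, 20, 0]
print(guess_offset("IIII5555"))  # 33
print(guess_offset("hhhhdddd"))  # 64
```

A short or uniformly high-quality string can fool the heuristic, so check a reasonable number of reads before relying on it.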

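Main options:

-m  minimum overlap length required to merge a pair; default 10 bp;
-M  maximum overlap length expected in about 90% of read pairs; default 65 bp. Longer overlaps are still accepted, but the mismatch density is then computed over only the first -M bases;
-x  maximum mismatch density allowed in the overlap (mismatched bases / overlap length); default 0.25;
-p  PHRED quality offset, 33 or 64; default 33;
-r  average read length;
-f  average fragment length, i.e. the library insert size;
-s  standard deviation of the fragment length;
-o  prefix for the output files; default "out";
-z  gzip-compress the output files;
-t  number of worker threads; defaults to the number of processors. Use -t 1 if the output must be in the same order as the input reads.

Note that -r, -f and -s are only used to calculate -M; an explicitly given -M overrides them.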
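The help text gives the formula FLASH uses to derive the maximum overlap from -r, -f and -s: the overlap of two average-length reads on an average-size fragment, plus 2.5 standard deviations. A worked sketch follows; truncating to an integer is an assumption made here, and the function name is illustrative.

```python
def derived_max_overlap(read_len=100, fragment_len=180, fragment_len_stddev=18):
    """Maximum overlap implied by -r, -f and -s, per the formula in the help text."""
    average_overlap = 2 * read_len - fragment_len
    return int(average_overlap + 2.5 * fragment_len_stddev)


print(derived_max_overlap())               # 65, the documented default
print(derived_max_overlap(250, 500, 100))  # 250, for -r 250 -f 500 -s 100
```

For the example command above, the average overlap is 0 bp and the derived maximum is 250 bp, so only pairs from fragments shorter than average can be merged.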

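By default FLASH writes 5 output files; a log file exists only if you save standard output yourself:
output.extendedFrags.fastq  the merged reads;
output.hist  numeric histogram of merged read lengths;
output.histogram  visual (text) histogram of merged read lengths;
output.notCombined_1.fastq  read 1 of the pairs that could not be merged;
output.notCombined_2.fastq  read 2 of the pairs that could not be merged;
output.flash.log  log of the parameters and merge statistics, produced only when standard output is saved, e.g. by appending 2>&1 | tee output.flash.log to the command.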
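The numeric histogram is convenient for a quick sanity check of the merged lengths. The sketch below assumes each line of the .hist file holds two whitespace-separated integers, a merged read length and the number of reads with that length; verify this against your own file. The sample data is invented for illustration.

```python
def parse_hist(text):
    """Parse histogram text into a {length: count} dictionary."""
    hist = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        length, count = line.split()
        hist[int(length)] = int(count)
    return hist


def summarize(hist):
    """Return (total merged reads, mean length, most common length)."""
    total = sum(hist.values())
    mean = sum(length * count for length, count in hist.items()) / total
    mode = max(hist, key=hist.get)
    return total, mean, mode


# Made-up example content; in practice read it with open("output.hist").read().
example = "187\t2\n188\t5\n189\t40\n190\t3\n"
total, mean, mode = summarize(parse_hist(example))
print(total, round(mean, 2), mode)  # 50 188.88 189
```

For amplicon data, a most common length far from the expected amplicon size is a hint to revisit the -m, -M and -x settings.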

Merging multiple samples

for fq1 in *_R1.fastq.gz; do
    sample=${fq1%_R1.fastq.gz}    # strip the suffix to get the sample name
    mkdir -p "$sample"
    flash -O -m 10 -M 100 -x 0.25 -z -o "$sample" -d "./$sample" \
        "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz"
done
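The same per-sample logic can be written in Python when the merge step belongs to a larger script. The sketch below only builds the argument list; it assumes files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, and the helper name flash_command is hypothetical. Only options listed in the help text above are used.

```python
from pathlib import Path

R1_SUFFIX = "_R1.fastq.gz"


def flash_command(r1_path, min_overlap=10, max_overlap=100,
                  max_mismatch_density=0.25, allow_outies=True):
    """Build the flash argument list for one sample from its R1 file name."""
    r1 = Path(r1_path)
    if not r1.name.endswith(R1_SUFFIX):
        raise ValueError(f"expected a file ending in {R1_SUFFIX}: {r1.name}")
    sample = r1.name[: -len(R1_SUFFIX)]
    r2 = r1.with_name(sample + "_R2.fastq.gz")
    command = ["flash"]
    if allow_outies:
        command.append("-O")
    command += [
        "-m", str(min_overlap),
        "-M", str(max_overlap),
        "-x", str(max_mismatch_density),
        "-z",                            # gzip-compress the output
        "-o", sample,                    # output prefix
        "-d", str(r1.parent / sample),   # one output directory per sample
        str(r1), str(r2),
    ]
    return command


print(" ".join(flash_command("sampleA_R1.fastq.gz")))
```

To actually run it, pass each list to subprocess.run(command, check=True) for every path returned by Path(".").glob("*_R1.fastq.gz"); this requires flash to be on the PATH.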