All of the following information comes from www.drive5.com. I use this only as a notebook for my own learning and declare no commercial interest in it. Anyone who reads this document should refer to www.drive5.com.
I ran into some problems when trying to merge my read data, so I collected the information shown below.
The fastq_mergepairs command merges (assembles) paired-end reads to create consensus sequences and, optionally, consensus quality scores. This command has many features and options so I recommend spending some time browsing the documentation to get familiar with the capabilities of fastq_mergepairs and issues that arise in read merging.
Basic usage
The simplest way to use fastq_mergepairs is to specify the forward and reverse FASTQ filenames and an output FASTQ filename.
usearch -fastq_mergepairs SampleA_R1.fastq -reverse SampleA_R2.fastq -fastqout merged.fq
Automatic R2 filename
If the -reverse option is omitted, the reverse FASTQ filename is constructed by replacing R1 with R2. The following command line is equivalent to the example above.
usearch -fastq_mergepairs SampleA_R1.fastq -fastqout merged.fq
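The filename rule above can be sketched in a few lines of Python, which is handy for checking what reverse filename to expect before running the command. This is only an illustration, not usearch's own code: the function name derive_r2_name is hypothetical, and replacing only the last "R1" in the name is an assumption, since the documentation just says R1 is replaced with R2.

```python
def derive_r2_name(r1_name: str) -> str:
    """Build the reverse-read filename by swapping "R1" for "R2".

    Assumption: only the last "R1" in the name is replaced, so a sample
    name that itself contains "R1" is left alone.
    """
    head, sep, tail = r1_name.rpartition("R1")
    if not sep:
        # No "R1" anywhere in the name, so there is nothing to replace.
        raise ValueError(f"no 'R1' in filename: {r1_name}")
    return head + "R2" + tail


print(derive_r2_name("SampleA_R1.fastq"))
```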
Merging multiple FASTQ file pairs in a single command
You can specify two or more FASTQ filenames following -fastq_mergepairs. In the following example, SampleA and SampleB are both merged. The R2 filenames are constructed automatically as explained above, or can be given explicitly using the -reverse option.
usearch -fastq_mergepairs SampleA_R1.fastq SampleB_R1.fastq -fastqout merged.fq
usearch -fastq_mergepairs *_R1*.fastq -fastqout merged.fq
(This is the command I used when I had 45 reads.)
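When a wildcard picks up many forward files, it can help to confirm beforehand that every R1 file has its R2 partner in the same folder. The sketch below is a hypothetical helper, not part of usearch. It assumes the R2 name is made by swapping the last "R1" for "R2", and it runs on made-up empty files in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_missing_mates(directory: Path, pattern: str = "*_R1*.fastq") -> list[str]:
    """Return R1 filenames whose expected R2 file is not in the directory."""
    missing = []
    for r1 in sorted(directory.glob(pattern)):
        # Assumption: the R2 name is the R1 name with its last "R1" swapped.
        head, _, tail = r1.name.rpartition("R1")
        if not (directory / (head + "R2" + tail)).exists():
            missing.append(r1.name)
    return missing


# Demo with made-up, empty files: SampleB has no reverse file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("SampleA_R1.fastq", "SampleA_R2.fastq", "SampleB_R1.fastq"):
        (folder / name).touch()
    missing_mates = find_missing_mates(folder)
    print(missing_mates)
```

An empty list means every forward file matched by the pattern has a reverse file with the expected name.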
Adding sample identifiers to read labels
If multiple samples are combined into a single file as shown in some of the above examples, then you lose track of which read came from which sample. This is addressed by adding a sample identifier to each read label. The simplest method is to use the -sample option, e.g.
usearch -fastq_mergepairs SampleA_R1.fastq -fastqout merged.fq -sample SampleA
The string sample=SampleA; will be added at the end of the read label.
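To make the effect of -sample concrete, here is a small Python sketch of the label change and of reading the sample back out of a label. The function names and the example read label are hypothetical. Whether usearch inserts a ";" before the annotation when the label does not already end in one is an assumption here.

```python
import re


def add_sample_annotation(label: str, sample: str) -> str:
    """Append a sample=xxx; annotation to the end of a read label."""
    # Assumption: annotations are ";"-separated, so add a separator
    # only when the label does not already end with one.
    separator = "" if label.endswith(";") else ";"
    return f"{label}{separator}sample={sample};"


def get_sample(label: str):
    """Return the sample identifier from a label, or None if absent."""
    match = re.search(r"sample=([^;]+);", label)
    return match.group(1) if match else None


new_label = add_sample_annotation("read_0001", "SampleA")
print(new_label)
print(get_sample(new_label))
```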
Getting the sample identifier from the FASTQ filename
FASTQ filenames are often based on the sample identifier, e.g. SampleA_R1.fastq. If you specify -relabel @ then fastq_mergepairs gets the sample identifier from the FASTQ filename by truncating at the first underscore (_) or period (.). A period and the read number are added after the sample identifier to make the new read label, which replaces the original label. This differs from the -sample option, which adds the sample= annotation at the end of the label. The usearch_global command understands both of these methods for putting sample identifiers into read labels.
usearch -fastq_mergepairs SampleA_R1.fastq -fastqout merged.fq -relabel @
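The -relabel @ rule described above (truncate the filename at the first underscore or period, then add a period and the read number) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes a bare filename without a directory part.

```python
import re


def sample_from_filename(filename: str) -> str:
    """Truncate the filename at the first underscore or period."""
    return re.split(r"[_.]", filename, maxsplit=1)[0]


def relabel(filename: str, read_number: int) -> str:
    """Build the new read label: sample identifier, a period, the read number."""
    return f"{sample_from_filename(filename)}.{read_number}"


print(sample_from_filename("SampleA_R1.fastq"))
print(relabel("SampleA_R1.fastq", 1))
```

One consequence of the truncation rule: a filename such as Sample_A_R1.fastq would give the identifier Sample, so underscores and periods inside sample names are best avoided.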
Merging multiple files with sample identifiers
By using wildcards and the -relabel @ option you can merge multiple files and add sample identifiers to the read labels, for example:
usearch -fastq_mergepairs *R1*.fastq -fastqout merged.fq -relabel @
fastq_mergepairs options
Input files
-fastq_mergepairs Forward FASTQ filename(s).
-reverse Reverse FASTQ filename(s). If not given, constructed by replacing R1 with R2.
-interleaved Forward and reverse reads are interleaved in the same file (sometimes produced by SRA fastq-dump).
Output files
-fastqout FASTQ filename for merged reads.
-fastaout FASTA filename for merged reads.
-fastqout_notmerged_fwd FASTQ filename for forward reads which were not merged.
-fastaout_notmerged_fwd FASTA filename for forward reads which were not merged.
-fastqout_notmerged_rev FASTQ filename for reverse reads which were not merged.
-fastaout_notmerged_rev FASTA filename for reverse reads which were not merged.
Reports
-report Filename for summary report. See Reviewing a fastq_mergepairs report to check for problems.
-tabbedout Tabbed text file containing detailed information about merging process for each pair including reason for discarding.
-alnout Human-readable alignments. Useful for trouble-shooting.
Merged read labels
-relabel Prefix string for output labels. The read number 1, 2, 3... is appended after the prefix.
-relabel @ Relabel using a prefix string constructed from the FASTQ filename; this prefix is understood as the sample identifier.
-sample xxx Append sample identifier to read label using sample=xxx; format. This is an alternative method for adding sample ids.
-fastq_eeout Add ee=xxx; annotation with the number of expected errors in the merged read.
-label_suffix Suffix to append to merged read label. Can be used e.g. to add sample=xxx; type of sample identifier annotations.
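The number reported by -fastq_eeout is the expected number of errors in the merged read, which is the sum of the per-base error probabilities implied by the Q scores (P = 10^(-Q/10)). A minimal Python sketch of that calculation follows. The function name is hypothetical, and the sketch assumes Phred+33 encoded quality strings.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum the error probabilities implied by a FASTQ quality string.

    Assumption: Phred+33 encoding, so Q = ord(char) - 33.
    """
    return sum(10 ** (-(ord(char) - offset) / 10) for char in quality)


# "I" encodes Q40 (P = 0.0001), "5" encodes Q20 (P = 0.01),
# "+" encodes Q10 (P = 0.1).
print(expected_errors("IIII"))
print(expected_errors("+5"))
```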
Filtering
-fastq_maxdiffs Maximum number of mismatches in the alignment. Default 5. Consider increasing if you have long overlaps.
-fastq_pctid Minimum %id of alignment. Default 90. Consider decreasing if you have long overlaps.
-fastq_nostagger Discard staggered pairs. Default is to trim overhangs (non-biological sequence).
-fastq_minmergelen Minimum length for the merged sequence. See Filtering artifacts by setting a merge length range.
-fastq_maxmergelen Maximum length for the merged sequence.
-fastq_minqual Discard merged read if any merged Q score is less than the given value. (No minimum by default).
-fastq_minovlen Discard pair if alignment is shorter than given value. Default 16.
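To see how the filtering defaults interact, here is a rough Python sketch of the checks on one aligned pair. It is not usearch's implementation. The function name, the order of the checks and the reason strings are hypothetical, and %id is assumed to be the fraction of matching positions in a gap-free overlap.

```python
def passes_merge_filters(
    ovlen: int,
    diffs: int,
    merged_len: int,
    maxdiffs: int = 5,
    pctid: float = 90.0,
    minovlen: int = 16,
    minmergelen=None,
    maxmergelen=None,
) -> tuple[bool, str]:
    """Return (kept, reason) for one aligned read pair."""
    if ovlen < minovlen:
        return False, "overlap too short"
    if diffs > maxdiffs:
        return False, "too many diffs"
    # Assumption: %id is computed over the overlap, mismatches only.
    if 100.0 * (ovlen - diffs) / ovlen < pctid:
        return False, "pctid too low"
    if minmergelen is not None and merged_len < minmergelen:
        return False, "merged read too short"
    if maxmergelen is not None and merged_len > maxmergelen:
        return False, "merged read too long"
    return True, "ok"


# A 250-base overlap with 8 mismatches is 96.8% identical,
# yet it fails the default limit of 5 mismatches.
print(passes_merge_filters(250, 8, 300))
print(passes_merge_filters(250, 8, 300, maxdiffs=10))
```

This illustrates why the documentation suggests raising -fastq_maxdiffs for long overlaps: a fixed mismatch count becomes a stricter identity requirement as the overlap grows.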
Pre-processing of reads before alignment
-fastq_trunctail Truncate reads at the first position with a Q score <= this value. Default 2.
-fastq_minlen Discard pair if either read is shorter than this, after truncating by -fastq_trunctail if applicable. Default 64.
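The two pre-processing options work in sequence: each read is truncated first, then the pair is dropped if either read has become too short. A Python sketch under stated assumptions follows. The function names are hypothetical, quality strings are assumed to be Phred+33, and the low-quality base itself is assumed to be removed along with everything after it.

```python
def truncate_tail(seq: str, qual: str, trunctail: int = 2, offset: int = 33):
    """Cut the read at the first position with Q <= trunctail."""
    for i, char in enumerate(qual):
        if ord(char) - offset <= trunctail:
            # Assumption: the low-quality base is dropped as well.
            return seq[:i], qual[:i]
    return seq, qual


def keep_pair(fwd, rev, trunctail: int = 2, minlen: int = 64) -> bool:
    """fwd and rev are (sequence, quality) tuples for one read pair."""
    fwd_seq, _ = truncate_tail(*fwd, trunctail=trunctail)
    rev_seq, _ = truncate_tail(*rev, trunctail=trunctail)
    return len(fwd_seq) >= minlen and len(rev_seq) >= minlen


# "#" encodes Q2 in Phred+33, so the read is cut at position 5.
print(truncate_tail("ACGTACGT", "IIII#III"))
# minlen is lowered here only so that the toy reads can be evaluated.
print(keep_pair(("ACGTACGT", "IIII#III"), ("ACGTACGT", "IIIIIIII"), minlen=6))
```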
Multi-threading
-threads Specifies the number of threads. Default 10, or the number of CPU cores, whichever is less.