USEARCH search_oligodb command: reposted documentation

Source: https://www.drive5.com/usearch/manual/cmd_search_oligodb.html

Search for matches of nucleotide sequences to a database containing short nucleotide sequences (oligonucleotides). The most common use for this command is searching for matches to primers or probes in genome sequences or gene databases.

Wildcard letters indicating degenerate positions in the primer are supported. See IUPAC codes for details.

Nucleotide

Symbol  Meaning
A       Adenine
C       Cytosine
G       Guanine
T       Thymine
U       Uracil
M       A or C
R       A or G
W       A or T
S       C or G
Y       C or T
K       G or T
V       A or C or G
H       A or C or T
D       A or G or T
B       C or G or T
X       G or A or T or C
N       G or A or T or C

Protein

Symbol  Meaning
X       Any amino acid
B       N or D
Z       Q or E

Reference

Cornish-Bowden (1985), IUPAC-IUB symbols for nucleotide nomenclature, Nucl. Acids Res. 13: 3021-3030.
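The nucleotide table above can be turned into a small lookup for checking whether a degenerate primer position is compatible with a concrete base. This is only an illustrative sketch, not USEARCH's implementation: the names `IUPAC` and `base_matches` are hypothetical, and treating U as equivalent to T is an assumption made here for convenience.

```python
# IUPAC nucleotide codes from the table above, mapped to the bases they allow.
# Assumption: U is treated as T so that RNA and DNA letters compare equal.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def base_matches(oligo_letter: str, target_letter: str) -> bool:
    """Return True if a concrete target base is allowed by the oligo's code."""
    target = target_letter.upper()
    if target == "U":
        target = "T"
    return target in IUPAC[oligo_letter.upper()]
```

For example, `base_matches("R", "G")` is true because R stands for A or G, while `base_matches("R", "C")` is false.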

The algorithm uses a fast and exact method; there are no heuristics, so all matches meeting the accept criteria are guaranteed to be found. Alignments are global; all letters of the database sequence must be aligned to a letter in the query sequence. Gaps are not permitted, except for terminal gaps in the query sequence.
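To make "exact, ungapped, no heuristics" concrete, here is a naive sketch that tries every placement of an oligo along the longer sequence and counts mismatches. It is not the method USEARCH uses internally (the manual only says it is fast and exact); the function name `search_oligo`, the 0-based half-open coordinates, and the `maxdiffs` parameter name borrowed from the -maxdiffs option are choices made for this illustration. Terminal gaps are not modelled.

```python
# Bases allowed by each IUPAC code (U treated as T, an assumption of this sketch).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def search_oligo(query: str, oligo: str, maxdiffs: int = 2):
    """Return (start, end, diffs) for every ungapped placement of the oligo
    on the query with at most maxdiffs mismatches.

    Coordinates are 0-based with an exclusive end, like Python slices.
    """
    query = query.upper()
    oligo = oligo.upper()
    hits = []
    for start in range(len(query) - len(oligo) + 1):
        window = query[start:start + len(oligo)]
        diffs = 0
        for oligo_letter, query_letter in zip(oligo, window):
            if query_letter not in IUPAC.get(oligo_letter, ""):
                diffs += 1
                if diffs > maxdiffs:
                    break  # too many mismatches, give up on this placement
        else:
            # Loop finished without exceeding maxdiffs: this is a hit.
            hits.append((start, start + len(oligo), diffs))
    return hits
```

Because every start position is tested, no placement that meets the mismatch limit can be missed, which is the property the paragraph above describes.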

Note that it is the longer sequence (genome, chromosome, gene etc.) that is the query; the database contains the oligos. The name of the command (search_oligodb) is intended to remind you of this, just in case you're used to doing it the other way around, as with some other local aligners like BLAST.

Termination options are supported. By default, termination is disabled, equivalent to -maxaccepts 0 -maxrejects 0. In other words, by default the entire database is searched.

The maxaccepts and maxrejects options

The termination options -maxaccepts and -maxrejects are supported by most search and clustering commands. These options cause the search for a given query sequence to stop if a given number of accepts (target sequences that meet the accept criteria) or rejects (target sequences that were processed but failed to meet those criteria) have occurred. Early search termination can give dramatic improvements in speed, often with minimal or no cost in sensitivity. See USEARCH algorithm for discussion of why "U-sorting" with termination is an effective speed optimization.

Other termination options

-termid terminate search when a target identity drops below the given value, specified as a fractional identity in range 0.0 to 1.0.

-termidd terminate when the difference (maxid - minid) exceeds the given value, where maxid (minid) is the maximum (minimum) identity found so far.

Comprehensive search

Roughly speaking, a search of the complete database is specified by disabling the maxaccepts and maxrejects termination options, which is done by setting -maxaccepts 0 -maxrejects 0. This is the default for the ublast command, but not for clustering and search based on the USEARCH algorithm (the source manual page has a table of default values for each command). Even with both options disabled, the search is not strictly complete: with commands based on the USEARCH and UBLAST algorithms, a database sequence will not be aligned if it has no words (or seeds) in common with the query sequence. For a truly comprehensive search, use search_global or search_local.

Discussion

Termination conditions are combined with OR, so the first one to be satisfied causes the search to stop (unlike accept criteria, which are combined with AND).
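The OR combination of termination conditions can be sketched as a loop over targets in the order they are tested. This is a simplified model, not USEARCH code: the function `search_with_termination` is hypothetical, a single identity threshold stands in for the accept criteria, the default values are illustrative, and checking `termid` before the accept test is an assumption. A value of 0 disables a counter, as described above.

```python
def search_with_termination(identities, min_id, maxaccepts=1, maxrejects=32,
                            termid=None):
    """Return indexes of accepted targets, stopping early when any
    termination condition is met.

    identities: per-target identities (0.0 to 1.0) in the order tested.
    """
    accepts = 0
    rejects = 0
    hits = []
    for index, identity in enumerate(identities):
        # termid: stop as soon as a target identity drops below the value.
        if termid is not None and identity < termid:
            break
        if identity >= min_id:
            accepts += 1
            hits.append(index)
        else:
            rejects += 1
        # Conditions are combined with OR; 0 means "disabled".
        if (maxaccepts and accepts >= maxaccepts) or \
           (maxrejects and rejects >= maxrejects):
            break
    return hits
```

With `maxaccepts=0, maxrejects=0` and no `termid`, the loop visits every target, which corresponds to the search_oligodb default of searching the entire database.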

By default, termination options are enabled only for clustering and search commands based on the USEARCH algorithm. This is because USEARCH tests database sequences (targets) in order of decreasing number of words in common between the query and target sequence. This order correlates well with sequence similarity, so the best hit(s) are likely to be found quickly.
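The ordering idea can be illustrated by counting words shared between the query and each target and sorting by that count. This is a rough sketch only: the helper names `words` and `u_sort`, the word length `k`, and the use of unique words (rather than counting repeats) are assumptions, and the real USEARCH index is certainly more elaborate.

```python
def words(seq: str, k: int) -> set:
    """Return the set of distinct length-k words (k-mers) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def u_sort(query: str, targets: dict, k: int = 8) -> list:
    """Return target labels ordered by decreasing number of words
    shared with the query."""
    query_words = words(query, k)
    shared = {label: len(query_words & words(seq, k))
              for label, seq in targets.items()}
    return sorted(shared, key=lambda label: shared[label], reverse=True)
```

Targets near the front of the returned list are the ones most likely to be similar to the query, which is why stopping after a few accepts or rejects loses little sensitivity.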

With ublast, search_local and search_global, targets are compared to the query in an order that does not correlate with sequence similarity or E-value. With these commands, the first accepted hit is not expected to be close to the best possible hit. However, termination options can still be useful; see weak hits for discussion and examples.

If maxaccepts is set to a value > 1, then more than one hit may be reported per query. In this case, it is usually recommended to increase maxrejects also, because it will often be necessary to search further into the list of candidate target sequences to find more than one hit.

The maxaccepts and maxrejects options can be used to tune speed against sensitivity. Smaller values of both parameters tend to improve speed by reducing the number of alignments that must be computed per query. For example, with cluster_fast, the default value of maxrejects is reduced from 32 to 8 in order to achieve higher speed. Increasing either value tends to result in slower execution because more alignments must be computed. Increasing maxrejects tends to improve sensitivity by reducing the number of false negatives, i.e. target sequences that would be accepted but are not tested because they are too far down the list in word-count order.

With translated searches, termination conditions apply to each ORF separately. This is because the nucleotide query sequence might span more than one gene.


Accept options are supported. By default, -maxdiffs 2 is assumed and other accept criteria are not used.

Accept criteria determine whether an alignment is a hit, also called an accept. See also weak hits. Hits are written to the output files. The -maxhits N and -top_hits_only options specify that only the best hits are to be reported. Note that two or more hits may be tied for the best score or identity. Accepted hits are written to an output file sorted by decreasing alignment score (local alignments) or by decreasing identity (global alignments).

In clustering commands based on UCLUST (cluster_fast and cluster_smallmem), accept options determine whether or not a sequence matches a cluster centroid and should be assigned to that cluster. A sequence can match only one centroid; this is usually the first accepted centroid, but this can be changed by increasing -maxaccepts, in which case it will be the centroid with the highest identity (see termination options).

Accept criteria do not have default values. If a given accept option is not specified, then the corresponding value is not computed or tested. So for example if -id is the only option given, then identity is the only value that is calculated from the alignment.

If more than one accept option is specified, they are combined with AND, so all of them must be satisfied.

Criteria that do not require an alignment, e.g. -idprefix and -minqt, are tested before an alignment is computed; these can give significant improvements in speed because a target can be rejected without the overhead of computing an alignment. Most of these are not supported by local search commands (ublast, usearch_local and search_local).

The -acceptall option specifies that all hits should be accepted, overriding any other accept options.
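The rules above (unspecified criteria are not tested, specified criteria are combined with AND, and -acceptall overrides everything) can be modelled in a few lines. The function `is_accepted` is hypothetical, only two criteria (-id and -maxdiffs) are modelled, and accepting every alignment when no criterion is given is an assumption of this sketch rather than documented behaviour.

```python
def is_accepted(identity, diffs, min_id=None, maxdiffs=None, acceptall=False):
    """Decide whether an alignment is a hit.

    A criterion left as None is not tested at all. All criteria that are
    given must hold (AND). acceptall overrides every other criterion.
    """
    if acceptall:
        return True
    checks = []
    if min_id is not None:
        checks.append(identity >= min_id)
    if maxdiffs is not None:
        checks.append(diffs <= maxdiffs)
    # all([]) is True: with no criteria, nothing is rejected (assumption).
    return all(checks)
```

For instance, an alignment with identity 0.95 and 3 differences passes `min_id=0.9` alone, but fails once `maxdiffs=2` is added.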


The query file may be in FASTA or FASTQ format.

The FASTA sequence file format is widely supported by bioinformatics tools. For a detailed description, see the Wikipedia entry on the FASTA format.

USEARCH allows lines of any length in a FASTA file (some programs limit lines to, e.g., 80 characters).

USEARCH does not support comments in FASTA files.

White space characters (blanks and tabs) are discarded if found in sequence data, but many other tools do not allow this and the practice is not recommended.

If the -trunclabels option is given, USEARCH will truncate sequence labels at the first white space (similar to BLAST); otherwise the full label is retained.
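The FASTA handling described above (lines of any length, white space inside sequence data discarded, optional label truncation) can be mimicked with a small reader. This is a sketch for illustration, not USEARCH's parser; `read_fasta` and its `trunclabels` argument are hypothetical names chosen to mirror the -trunclabels option.

```python
def read_fasta(lines, trunclabels=False):
    """Parse FASTA text lines into a list of (label, sequence) tuples.

    Blanks and tabs inside sequence lines are discarded. If trunclabels
    is True, labels are cut at the first white space.
    """
    records = []
    label = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            label = line[1:]
            if trunclabels:
                parts = label.split(None, 1)
                label = parts[0] if parts else ""
            chunks = []
        elif label is not None:
            # "".join(line.split()) removes all white space in the line.
            chunks.append("".join(line.split()))
    if label is not None:
        records.append((label, "".join(chunks)))
    return records
```

Passing an open file object works as well as a list of strings, since both yield lines when iterated.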

A database file must be specified using the -db option and must be in FASTA format.

The -strand option is required for nucleotide databases.
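Searching both strands amounts to also considering the reverse complement of a sequence. The sketch below shows a reverse complement that respects the IUPAC degenerate codes listed earlier; `reverse_complement` is a hypothetical helper, and mapping U to A (so the output is always DNA-style) is an assumption.

```python
# Complement of each IUPAC nucleotide code, e.g. R (A or G) -> Y (T or C).
COMPLEMENT = str.maketrans("ACGTUMRWSYKVHDBXN", "TGCAAKYWSRMBDHVXN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence,
    handling IUPAC wildcard letters."""
    return seq.upper().translate(COMPLEMENT)[::-1]
```

For example, the degenerate primer `ACGTRC` has the reverse complement `GYACGT`.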

Multithreading is supported.

Standard output files are supported.

Example

usearch -search_oligodb human_genome.fa -db probes.fa -strand both \
    -userout out.txt -userfields query+target+qstrand+diffs+tlo+thi+trowdots
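The -userout file written by the example is tab-separated text with one hit per line, with columns in the order given to -userfields. A minimal sketch for loading it is shown below; the function `parse_userout` is hypothetical, and converting diffs, tlo and thi to integers assumes those columns hold plain numbers.

```python
# Column order copied from the -userfields value in the example command.
FIELDS = "query+target+qstrand+diffs+tlo+thi+trowdots".split("+")


def parse_userout(lines):
    """Parse tab-separated -userout lines into a list of dicts keyed by field."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        row = dict(zip(FIELDS, line.split("\t")))
        for key in ("diffs", "tlo", "thi"):
            row[key] = int(row[key])
        rows.append(row)
    return rows
```

Typical use would be `with open("out.txt") as handle: hits = parse_userout(handle)`, after which hits can be filtered, for example by `row["diffs"] == 0` to keep perfect matches only.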
