USEARCH search_oligodb command: notes relayed from the manual

Source: https://www.drive5.com/usearch/manual/cmd_search_oligodb.html


Search for matches of nucleotide sequences to a database containing short nucleotide sequences (oligonucleotides). The most common use for this command is searching for matches to primers or probes in genome sequences or gene databases.

Wildcard letters indicating degenerate positions in the primer are supported. See IUPAC codes for details.


Nucleotide

Symbol  Meaning
A       Adenine
C       Cytosine
G       Guanine
T       Thymine
U       Uracil
M       A or C
R       A or G
W       A or T
S       C or G
Y       C or T
K       G or T
V       A or C or G
H       A or C or T
D       A or G or T
B       C or G or T
X       G or A or T or C
N       G or A or T or C

Protein

Symbol  Meaning
X       Any amino acid
B       N or D
Z       Q or E

Reference

Cornish-Bowden (1985), IUPAC-IUB symbols for nucleotide nomenclature, Nucl. Acids Res. 13: 3021-3030.
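To make the nucleotide table concrete, here is a minimal Python sketch of wildcard matching. It is an illustration only, not USEARCH's implementation: the names IUPAC and letters_match are hypothetical, U is treated as T, and two letters are assumed to match when the sets of bases they stand for overlap.

```python
# IUPAC nucleotide codes from the table above, each mapped to the
# concrete bases it can stand for. Treating U as T is an assumption.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def letters_match(oligo_letter: str, query_letter: str) -> bool:
    """Return True if the two codes share at least one concrete base.

    The overlap rule is an assumption made for this sketch.
    """
    a = set(IUPAC[oligo_letter.upper()])
    b = set(IUPAC[query_letter.upper()])
    return bool(a & b)
```

Under these assumptions, letters_match("R", "A") is true because R stands for A or G, while letters_match("R", "C") is false.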


The algorithm uses a fast and exact method; there are no heuristics, so all matches meeting the accept criteria are guaranteed to be found. Alignments are global; all letters of the database sequence must be aligned to a letter in the query sequence. Gaps are not permitted, except for terminal gaps in the query sequence.

Note that it is the longer sequence (genome, chromosome, gene etc.) that is the query; the database contains the oligos. The name of the command (search_oligodb) is intended to remind you of this, just in case you're used to doing it the other way around, as with some other local aligners like BLAST.
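The roles of query and database can be sketched as a gapless scan: each oligo is slid along the long query sequence and mismatches are counted per window. This is a simplified illustration, not the algorithm USEARCH uses. The function names are hypothetical, the max_diffs parameter stands in for the -maxdiffs accept option, the reverse strand is handled by reverse-complementing the oligo rather than the query, and terminal gaps in the query are not modelled (every window lies fully inside the query).

```python
# Concrete bases for each IUPAC code (U treated as T, an assumption).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}

# Complement of every code, e.g. M (A or C) -> K (T or G).
COMPLEMENT = str.maketrans("ACGTUMRWSYKVHDBXN", "TGCAAKYWSRMBDHVXN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a nucleotide sequence, wildcards included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_diffs(oligo: str, window: str) -> int:
    """Count positions where the two letters share no concrete base."""
    return sum(
        1 for o, q in zip(oligo, window)
        if not set(IUPAC[o]) & set(IUPAC[q])
    )


def find_oligo(query: str, oligo: str, max_diffs: int = 2):
    """Return (strand, start, diffs) for every window within max_diffs.

    start is a 0-based offset on the query as given; no gaps are allowed.
    """
    query = query.upper()
    hits = []
    for strand, probe in (("+", oligo.upper()),
                          ("-", reverse_complement(oligo))):
        for start in range(len(query) - len(probe) + 1):
            diffs = count_diffs(probe, query[start:start + len(probe)])
            if diffs <= max_diffs:
                hits.append((strand, start, diffs))
    return hits
```

For example, find_oligo("GGGACGTACGGG", "ACGTWCG", max_diffs=0) reports a single plus-strand hit at offset 3, because W matches the A at that position.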

Termination options are supported. By default, termination is disabled, equivalent to -maxaccepts 0 -maxrejects 0. In other words, by default the entire database is searched.


The maxaccepts and maxrejects options

The termination options -maxaccepts and -maxrejects are supported by most search and clustering commands. These options cause the search for a given query sequence to stop if a given number of accepts (target sequences that meet the accept criteria) or rejects (target sequences that were processed but failed to meet those criteria) have occurred. Early search termination can give dramatic improvements in speed, often with minimal or no cost in sensitivity. See USEARCH algorithm for discussion of why "U-sorting" with termination is an effective speed optimization.

Other termination options

-termid terminate search when a target identity drops below the given value, specified as a fractional identity in range 0.0 to 1.0.

-termidd terminate when the difference (maxid - minid) exceeds the given value, where maxid (minid) is the maximum (minimum) identity found so far.

Comprehensive search

Roughly speaking, a search of the complete database is specified by disabling the maxaccepts and maxrejects termination options. This is done by setting -maxaccepts 0 -maxrejects 0. This is the default for the ublast command, but not for clustering and search based on the USEARCH algorithm. See table below for default values for each command. However, this is not strictly true: with commands based on the USEARCH and UBLAST algorithms, a database sequence will not be aligned if it has no words (or seeds) in common with the query sequence. For a truly comprehensive search, use search_global or search_local.

Discussion

Termination conditions are combined with OR, so the first one to be satisfied causes the search to stop (unlike accept criteria, which are combined with AND).
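The OR combination can be sketched in a few lines of Python. This is only a model of the behaviour described above, with hypothetical names: targets is any iterable already in the order the search would visit them, is_accept is a caller-supplied predicate, and a value of 0 disables the corresponding limit, mirroring -maxaccepts 0 -maxrejects 0.

```python
def search_with_termination(targets, is_accept, maxaccepts=0, maxrejects=0):
    """Collect accepted targets, stopping at the first limit reached.

    A limit of 0 means "disabled", so with both set to 0 every
    target is examined.
    """
    accepts = []
    rejects = 0
    for target in targets:
        if is_accept(target):
            accepts.append(target)
        else:
            rejects += 1
        # Termination conditions are combined with OR.
        if maxaccepts and len(accepts) >= maxaccepts:
            break
        if maxrejects and rejects >= maxrejects:
            break
    return accepts
```

With both limits at 0 the loop runs over the whole list, which corresponds to the comprehensive search described earlier.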

By default, termination options are enabled only for clustering and search commands based on the USEARCH algorithm. This is because USEARCH tests database sequences (targets) in order of decreasing number of words in common between the query and target sequence. This order correlates well with sequence similarity, so the best hit(s) are likely to be found quickly.
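The ordering idea can be illustrated with a small sketch that ranks targets by the number of distinct words they share with the query. The word length k, the function names, and the use of distinct (rather than repeated) words are assumptions made for the example; USEARCH's actual indexing is more elaborate.

```python
def words(seq: str, k: int) -> set:
    """Return the set of distinct length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def u_sort(query: str, targets: dict, k: int = 8) -> list:
    """Rank targets by decreasing number of words shared with the query.

    targets maps a label to its sequence; the result is a list of
    (label, shared_word_count) pairs, best candidates first.
    """
    query_words = words(query, k)
    scored = [
        (label, len(query_words & words(seq, k)))
        for label, seq in targets.items()
    ]
    # sorted() is stable, so ties keep their original database order.
    return sorted(scored, key=lambda item: -item[1])
```

A search that walks this list from the top and stops after a few rejects will usually have seen the most similar targets already, which is why termination costs little sensitivity in this setting.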

With ublast, search_local and search_global, targets are compared to the query in an order that does not correlate with sequence similarity or E-value. With these commands, the first accepted hit is not expected to be close to the best possible hit. However, termination options can still be useful; see weak hits for discussion and examples.

If maxaccepts is set to a value > 1, then more than one hit may be reported per query. In this case, it is usually recommended to increase maxrejects also, because it will often be necessary to search further into the list of candidate target sequences to find more than one hit.

The maxaccepts and maxrejects options can be used to tune speed against sensitivity. Smaller values of both parameters tend to improve speed by reducing the number of alignments that must be computed per query. For example, with cluster_fast, the default value of maxrejects is reduced from 32 to 8 in order to achieve higher speed. Increasing either value tends to result in slower execution because more alignments must be computed. Increasing maxrejects tends to improve sensitivity by reducing the number of false negatives, i.e. target sequences that would be accepted but are not tested because they are too far down the list in word-count order.

With translated searches, termination conditions apply to each ORF separately. This is because the nucleotide query sequence might span more than one gene.


Accept options are supported. By default, -maxdiffs 2 is assumed and other accept criteria are not used.


Accept criteria determine whether an alignment is a hit, also called an accept. See also weak hits. Hits are written to the output files. The -maxhits N and -top_hits_only options specify that only the best hits are to be reported. Note that two or more hits may be tied for the best score or identity. Accepted hits are written to an output file sorted by decreasing alignment score (local alignments) or by decreasing identity (global alignments).

In clustering commands based on UCLUST (cluster_fast and cluster_smallmem), accept options determine whether or not a sequence matches a cluster centroid and should be assigned to that cluster. A sequence can match only one centroid; this is usually the first accepted centroid, but this can be changed by increasing the -maxaccepts, in which case it will be the centroid with highest identity (see termination options).

Accept criteria do not have default values. If a given accept option is not specified, then the corresponding value is not computed or tested. So for example if -id is the only option given, then identity is the only value that is calculated from the alignment.

If more than one accept option is specified, they are combined with AND, so all of them must be satisfied.
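A short sketch of the AND rule follows. Only two criteria are modelled, corresponding to the -id and -maxdiffs options mentioned in the text; the class and function names are hypothetical, and an option left as None is simply not tested.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcceptOptions:
    min_id: Optional[float] = None    # models -id (fractional identity)
    max_diffs: Optional[int] = None   # models -maxdiffs


def is_accept(identity: float, diffs: int, opts: AcceptOptions) -> bool:
    """Return True only if every specified criterion is satisfied."""
    checks = []
    if opts.min_id is not None:
        checks.append(identity >= opts.min_id)
    if opts.max_diffs is not None:
        checks.append(diffs <= opts.max_diffs)
    # all() over the specified criteria implements the AND combination;
    # unspecified criteria contribute nothing.
    return all(checks)
```

For instance, an alignment with identity 0.95 and 3 differences passes AcceptOptions(min_id=0.9) but fails AcceptOptions(min_id=0.9, max_diffs=2).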

Criteria that do not require an alignment, e.g. -idprefix and -minqt, are tested before an alignment is computed; these can give significant improvements in speed because a target can be rejected without the overhead of computing an alignment. Most of these are not supported by local search commands (ublast, usearch_local and search_local).

The -acceptall option specifies that all hits should be accepted, overriding any other accept options.





The query file may be in FASTA or FASTQ format.


The FASTA sequence file format is widely supported by bioinformatics tools. For a detailed description, see the Wikipedia entry on the FASTA format.

USEARCH allows lines of any length in a FASTA file. (Some programs limit lines to e.g. 80 characters).

USEARCH does not support comments in FASTA files.

White space characters (blanks and tabs) are discarded if found in sequence data, but many other tools do not allow this and the practice is not recommended.

If the -trunclabels option is given, USEARCH will truncate sequence labels at the first white space (similar to BLAST); otherwise the full label is retained.


The database file must be specified using the -db option and must be in FASTA format.
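The FASTA rules above can be mirrored in a small reader. This is a sketch of the described behaviour, not USEARCH code: read_fasta and its trunc_labels argument are hypothetical names, the latter standing in for the -trunclabels option.

```python
import io


def read_fasta(handle, trunc_labels=False):
    """Yield (label, sequence) pairs from an open FASTA file.

    Sequence lines may be any length; blanks and tabs inside sequence
    data are discarded. With trunc_labels=True the label is cut at
    the first white space.
    """
    label, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label = line[1:]
            if trunc_labels:
                parts = label.split(None, 1)
                label = parts[0] if parts else ""
            chunks = []
        elif label is not None:
            # Drop any white space embedded in the sequence data.
            chunks.append("".join(line.split()))
    if label is not None:
        yield label, "".join(chunks)


example = io.StringIO(">seq1 some description\nACGT AC\nGT\tTT\n>seq2\nGGCC\n")
records = list(read_fasta(example, trunc_labels=True))
```

Here records holds the two entries with labels seq1 and seq2; the description after the first white space is dropped because trunc_labels is set.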

The -strand option is required for nucleotide databases.

Multithreading is supported.

Standard output files are supported.

Example

usearch -search_oligodb human_genome.fa -db probes.fa -strand both \
    -userout out.txt -userfields query+target+qstrand+diffs+tlo+thi+trowdots
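The resulting out.txt can be post-processed with a few lines of Python. The sketch below assumes the file is tab-separated with one hit per line and columns in the order given to -userfields in the example; the function name read_userout is hypothetical, and converting diffs, tlo and thi to integers is an assumption about how those columns are formatted.

```python
# Column names taken from the -userfields string in the example command.
FIELDS = "query+target+qstrand+diffs+tlo+thi+trowdots".split("+")


def read_userout(handle):
    """Yield one dict per hit from a tab-separated userout file."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        record = dict(zip(FIELDS, line.split("\t")))
        # Numeric columns (assumed to hold plain integers).
        for name in ("diffs", "tlo", "thi"):
            record[name] = int(record[name])
        yield record
```

A typical use would be opening out.txt and keeping only records with diffs equal to 0 to list the exact primer matches.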

