生信软件:diamond

介绍:

diamond软件是NCBI推出的blast+升级版本(应该是的)。

功能:

寻找同源序列

使用:

1.建库

diamond makedb --in nr.faa -d nr

--in <file> 填写建库所需的序列文件,fasta格式(.faa一般指蛋白序列文件)
-d 索引的前缀名,生成后缀为.dmnd的文件

2.序列比对

diamond blastp -d nr -q gene.fasta -o matches.m8 -f 6 -p 50 --more-sensitive -e 1e-5

--db/-d <file> 输入比对数据库
--query/-q <file> 比对序列
--threads/-p 线程数
--out/-o <file> 输出文件
--outfmt/-f 输出文件格式,默认6(表格)
--evalue/-e 比对的最大evalue值(默认0.001)
--max-target-seqs/-k 比对到的最大序列数,默认值是25
--more-sensitive 比对准确度更高

其他参数:
--top 百分数的形式表示--max-target-seqs
--min-score 最小评分
--id 给出指定百分比的数据
--subject-cover 最小覆盖度
--unal (0,1) 是否输出未比对上的reads(0=no, 1=yes)
--sensitive 建议对齐较长的序列
--block-size/b,一次处理的十亿碱基的大小,主要控制内存使用,默认为2(预计使用此内存数量的大约六倍,即默认内存使用将到达12G),转录流程使用0.2
--salltitles 将全长标题包含在DAA格式中,默认DAA文件仅包含缩短序列ID(直到第一个空白字符)

3.比对结果

image.png 别人简书里的图片,默认输出格式

每一列是什么呢? BLASTn output format 6

Column headers: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1. qseqid query sequence id
2. sseqid subject (e.g., reference genome) sequence id
3. pident percentage of identical matches
4. length alignment length
5. mismatch number of mismatches
6. gapopen number of gap openings
7. qstart start of alignment in query
8. qend end of alignment in query
9. sstart start of alignment in subject
10. send end of alignment in subject
11. evalue expect value
12. bitscore bit score

自定义输出结果

--outfmt 6 qseqid sseqid pident qlen length mismatch gapope evalue bitscore

diamond输出格式:

0 BLAST pairwise format.  
5 BLAST XML format.  
6 表格模式 (默认输出格式).    
100 DIAMOND  
101 SAM format.  
102 Taxonomic classification.  
103 PAF format.

4.结果筛选(linux)

#选取identity>80的结果
awk -F"\t" '$3>80' matches.m8 > matches.m8.identity80

附件:
diamond 帮助文档

Licensed under the GNU AGPL <https://www.gnu.org/licenses/agpl.txt>
Check http://github.com/bbuchfink/diamond for updates.

Syntax: diamond COMMAND [OPTIONS]

Commands:
makedb  Build DIAMOND database from a FASTA file
blastp  Align amino acid query sequences against a protein reference database
blastx  Align DNA query sequences against a protein reference database
view    View DIAMOND alignment archive (DAA) formatted file
help    Produce help message
version Display version information
getseq  Retrieve sequences from a DIAMOND database file
dbinfo  Print information about a DIAMOND database file

General options:
--threads (-p)         number of CPU threads
--db (-d)              database file
--out (-o)             output file
--outfmt (-f)          output format
    0   = BLAST pairwise
    5   = BLAST XML
    6   = BLAST tabular
    100 = DIAMOND alignment archive (DAA)
    101 = SAM

    Value 6 may be followed by a space-separated list of these keywords:

    qseqid means Query Seq - id # Query序列id
    qlen means Query sequence length    # Query序列长度
    sseqid means Subject Seq - id   # Subject序列id
    sallseqid means All subject Seq - id(s), separated by a ';' # 所有Subject序列id,以“;”分隔
    slen means Subject sequence length  # Subject序列长度
    qstart means Start of alignment in query    # Query对齐序列起点
    qend means End of alignment in query    # Query对齐序列终点
    sstart means Start of alignment in subject  # Subject对齐序列起点
    send means End of alignment in subject  # Subject对齐序列终点
    qseq means Aligned part of query sequence   # Query序列对齐部分
    sseq means Aligned part of subject sequence # Subject序列对齐部分
    full_sseq means Full subject sequence   # 完整的Subject序列
    evalue means Expect value   # 期望值
    bitscore means Bit score    # 比特得分
    score means Raw score   # 原始分数
    length means Alignment length   # 比对长度
    pident means Percentage of identical matches    # 相同匹配的百分比
    nident means Number of identical matches    # 相同匹配的数目
    mismatch means Number of mismatches # 不匹配的数目
    positive means Number of positive - scoring matches # 相似匹配的数目
    gapopen means Number of gap openings    # 间隙开口的数目
    gaps means Total number of gaps # 间隙的总数
    ppos means Percentage of positive - scoring matches # 相似匹配的百分比
    qframe means Query frame    # Query的翻译类型
    btop means Blast traceback operations(BTOP) # blast回溯操作
    staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)    # Subject的物种id,以“;”分隔
    stitle means Subject Title  # Subject的题目(>右侧所有内容)
    salltitles means All Subject Title(s), separated by a '<>'  # 所有Subject题目(>右侧所有内容),以“<>”分隔
    qcovhsp means Query Coverage Per HSP    # Query每个HSP(匹配的一段序列)的覆盖度
    qtitle means Query title    # Query的题目(>右侧所有内容)


    Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore   # 默认输出的内容
--verbose (-v)         verbose console output
--log                  enable debug log
--quiet                disable console output

Makedb options:
--in                   input reference file in FASTA format

Aligner options:
--query (-q)           input query file
--strand               query strands to search (both/minus/plus)
--un                   file for unaligned queries
--unal                 report unaligned queries (0=no, 1=yes)
--max-target-seqs (-k) maximum number of target sequences to report alignments for
--top                  report alignments within this percentage range of top alignment score (overrides --max-target-seqs)
--range-culling        restrict hit culling to overlapping query ranges
--compress             compression for output files (0=none, 1=gzip)
--evalue (-e)          maximum e-value to report alignments (default=0.001)
--min-score            minimum bit score to report alignments (overrides e-value setting)
--id                   minimum identity% to report an alignment
--query-cover          minimum query cover% to report an alignment
--subject-cover        minimum subject cover% to report an alignment
--sensitive            enable sensitive mode (default: fast)
--more-sensitive       enable more sensitive mode (default: fast)
--block-size (-b)      sequence block size in billions of letters (default=2.0)
--index-chunks (-c)    number of chunks for index processing
--tmpdir (-t)          directory for temporary files
--gapopen              gap open penalty
--gapextend            gap extension penalty
--frameshift (-F)      frame shift penalty (default=disabled)
--matrix               score matrix for protein alignment (default=BLOSUM62)
--custom-matrix        file containing custom scoring matrix
--lambda               lambda parameter for custom matrix
--K                    K parameter for custom matrix
--comp-based-stats     enable composition based statistics (0/1=default)
--masking              enable masking of low complexity regions (0/1=default)
--query-gencode        genetic code to use to translate query (see user manual)
--salltitles           include full subject titles in DAA file
--sallseqid            include all subject ids in DAA file
--no-self-hits         suppress reporting of identical self hits
--taxonmap             protein accession to taxid mapping file
--taxonnodes           taxonomy nodes.dmp from NCBI
--taxonlist            restrict search to list of taxon ids (comma-separated)

Advanced options:
--algo                 Seed search algorithm (0=double-indexed/1=query-indexed)
--bin                  number of query bins for seed search
--min-orf (-l)         ignore translated sequences without an open reading frame of at least this length
--freq-sd              number of standard deviations for ignoring frequent seeds
--id2                  minimum number of identities for stage 1 hit
--window (-w)          window size for local hit search
--xdrop (-x)           xdrop for ungapped alignment
--ungapped-score       minimum alignment score to continue local extension
--hit-band             band for hit verification
--hit-score            minimum score to keep a tentative alignment
--gapped-xdrop (-X)    xdrop for gapped alignment in bits
--band                 band for dynamic programming computation
--shapes (-s)          number of seed shapes (0 = all available)
--shape-mask           seed shapes
--index-mode           index mode (0=4x12, 1=16x9)
--rank-ratio           include subjects within this ratio of last hit (stage 1)
--rank-ratio2          include subjects within this ratio of last hit (stage 2)
--max-hsps             maximum number of HSPs per subject sequence to save for each query
--range-cover          percentage of query range to be covered for hit culling (default=50)
--dbsize               effective database size (in letters)
--no-auto-append       disable auto appending of DAA and DMND file extensions
--xml-blord-format     Use gnl|BL_ORD_ID| style format in XML output

View options:
--daa (-a)             DIAMOND alignment archive (DAA) file
--forwardonly          only show alignments of forward strand

Getseq options:
--seq                  Sequence numbers to display.

参考文档:
https://www.jianshu.com/p/199824ca5234
https://www.jianshu.com/p/2c4c53b74594

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容