介绍:
diamond软件是NCBI推出的blast+升级版本(应该是的)。
功能:
寻找同源序列
使用:
1.建库
diamond makedb --in nr.faa -d nr
--in <file> 填写建库所需的序列文件,fasta格式(.faa一般指蛋白序列文件)
-d 索引的前缀名,生成后缀为.dmnd的文件
2.序列比对
diamond blastp -d nr -q gene.fasta -o matches.m8 -f 6 -p 50 --more-sensitive -e 1e-5
--db/-d <file> 输入比对数据库
--query/-q <file> 比对序列
--threads/-p 线程数
--out/-o <file> 输出文件
--outfmt/-f 输出文件格式,默认6(表格)
--evalue/-e 比对的最大evalue值(默认0.001)
--max-target-seqs/-k 比对到的最大序列数,默认值是25
--more-sensitive 比对准确度更高
其他参数:
--top 百分数的形式表示--max-target-seqs
--min-score 最小评分
--id 给出指定百分比的数据
--subject-cover 最小覆盖度
--unal (0,1) 是否输出未比对上的reads(0=no, 1=yes)
--sensitive 建议对齐较长的序列
--block-size/b,一次处理的十亿碱基的大小,主要控制内存使用,默认为2(预计使用此内存数量的大约六倍,即默认内存使用将到达12G),转录流程使用0.2
--salltitles 将全长标题包含在DAA格式中,默认DAA文件仅包含缩短序列ID(直到第一个空白字符)
3.比对结果
每一列是什么呢? BLASTn output format 6
Column headers: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
1. | qseqid | query sequence id |
---|---|---|
2. | sseqid | subject (e.g., reference genome) sequence id |
3. | pident | percentage of identical matches |
4. | length | alignment length |
5. | mismatch | number of mismatches |
6. | gapopen | number of gap openings |
7. | qstart | start of alignment in query |
8. | qend | end of alignment in query |
9. | sstart | start of alignment in subject |
10. | send | end of alignment in subject |
11. | evalue | expect value |
12. | bitscore | bit score |
自定义输出结果
--outfmt 6 qseqid sseqid pident qlen length mismatch gapope evalue bitscore
diamond输出格式:
0 BLAST pairwise format.
5 BLAST XML format.
6 表格模式 (默认输出格式).
100 DIAMOND
101 SAM format.
102 Taxonomic classification.
103 PAF format.
4.结果筛选(linux)
#选取identity>80的结果
awk -F"\t" '$3>80' matches.m8 > matches.m8.identity80
附件:
diamond 帮助文档
Licensed under the GNU AGPL <https://www.gnu.org/licenses/agpl.txt>
Check http://github.com/bbuchfink/diamond for updates.
Syntax: diamond COMMAND [OPTIONS]
Commands:
makedb Build DIAMOND database from a FASTA file
blastp Align amino acid query sequences against a protein reference database
blastx Align DNA query sequences against a protein reference database
view View DIAMOND alignment archive (DAA) formatted file
help Produce help message
version Display version information
getseq Retrieve sequences from a DIAMOND database file
dbinfo Print information about a DIAMOND database file
General options:
--threads (-p) number of CPU threads
--db (-d) database file
--out (-o) output file
--outfmt (-f) output format
0 = BLAST pairwise
5 = BLAST XML
6 = BLAST tabular
100 = DIAMOND alignment archive (DAA)
101 = SAM
Value 6 may be followed by a space-separated list of these keywords:
qseqid means Query Seq - id # Query序列id
qlen means Query sequence length # Query序列长度
sseqid means Subject Seq - id # Subject序列id
sallseqid means All subject Seq - id(s), separated by a ';' # 所有Subject序列id,以“;”分隔
slen means Subject sequence length # Subject序列长度
qstart means Start of alignment in query # Query对齐序列起点
qend means End of alignment in query # Query对齐序列终点
sstart means Start of alignment in subject # Subject对齐序列起点
send means End of alignment in subject # Subject对齐序列终点
qseq means Aligned part of query sequence # Query序列对齐部分
sseq means Aligned part of subject sequence # Subject序列对齐部分
full_sseq means Full subject sequence # 完整的Subject序列
evalue means Expect value # 期望值
bitscore means Bit score # 比特得分
score means Raw score # 原始分数
length means Alignment length # 比对长度
pident means Percentage of identical matches # 相同匹配的百分比
nident means Number of identical matches # 相同匹配的数目
mismatch means Number of mismatches # 不匹配的数目
positive means Number of positive - scoring matches # 相似匹配的数目
gapopen means Number of gap openings # 间隙开口的数目
gaps means Total number of gaps # 间隙的总数
ppos means Percentage of positive - scoring matches # 相似匹配的百分比
qframe means Query frame # Query的翻译类型
btop means Blast traceback operations(BTOP) # blast回溯操作
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order) # Subject的物种id,以“;”分隔
stitle means Subject Title # Subject的题目(>右侧所有内容)
salltitles means All Subject Title(s), separated by a '<>' # 所有Subject题目(>右侧所有内容),以“<>”分隔
qcovhsp means Query Coverage Per HSP # Query每个HSP(匹配的一段序列)的覆盖度
qtitle means Query title # Query的题目(>右侧所有内容)
Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore # 默认输出的内容
--verbose (-v) verbose console output
--log enable debug log
--quiet disable console output
Makedb options:
--in input reference file in FASTA format
Aligner options:
--query (-q) input query file
--strand query strands to search (both/minus/plus)
--un file for unaligned queries
--unal report unaligned queries (0=no, 1=yes)
--max-target-seqs (-k) maximum number of target sequences to report alignments for
--top report alignments within this percentage range of top alignment score (overrides --max-target-seqs)
--range-culling restrict hit culling to overlapping query ranges
--compress compression for output files (0=none, 1=gzip)
--evalue (-e) maximum e-value to report alignments (default=0.001)
--min-score minimum bit score to report alignments (overrides e-value setting)
--id minimum identity% to report an alignment
--query-cover minimum query cover% to report an alignment
--subject-cover minimum subject cover% to report an alignment
--sensitive enable sensitive mode (default: fast)
--more-sensitive enable more sensitive mode (default: fast)
--block-size (-b) sequence block size in billions of letters (default=2.0)
--index-chunks (-c) number of chunks for index processing
--tmpdir (-t) directory for temporary files
--gapopen gap open penalty
--gapextend gap extension penalty
--frameshift (-F) frame shift penalty (default=disabled)
--matrix score matrix for protein alignment (default=BLOSUM62)
--custom-matrix file containing custom scoring matrix
--lambda lambda parameter for custom matrix
--K K parameter for custom matrix
--comp-based-stats enable composition based statistics (0/1=default)
--masking enable masking of low complexity regions (0/1=default)
--query-gencode genetic code to use to translate query (see user manual)
--salltitles include full subject titles in DAA file
--sallseqid include all subject ids in DAA file
--no-self-hits suppress reporting of identical self hits
--taxonmap protein accession to taxid mapping file
--taxonnodes taxonomy nodes.dmp from NCBI
--taxonlist restrict search to list of taxon ids (comma-separated)
Advanced options:
--algo Seed search algorithm (0=double-indexed/1=query-indexed)
--bin number of query bins for seed search
--min-orf (-l) ignore translated sequences without an open reading frame of at least this length
--freq-sd number of standard deviations for ignoring frequent seeds
--id2 minimum number of identities for stage 1 hit
--window (-w) window size for local hit search
--xdrop (-x) xdrop for ungapped alignment
--ungapped-score minimum alignment score to continue local extension
--hit-band band for hit verification
--hit-score minimum score to keep a tentative alignment
--gapped-xdrop (-X) xdrop for gapped alignment in bits
--band band for dynamic programming computation
--shapes (-s) number of seed shapes (0 = all available)
--shape-mask seed shapes
--index-mode index mode (0=4x12, 1=16x9)
--rank-ratio include subjects within this ratio of last hit (stage 1)
--rank-ratio2 include subjects within this ratio of last hit (stage 2)
--max-hsps maximum number of HSPs per subject sequence to save for each query
--range-cover percentage of query range to be covered for hit culling (default=50)
--dbsize effective database size (in letters)
--no-auto-append disable auto appending of DAA and DMND file extensions
--xml-blord-format Use gnl|BL_ORD_ID| style format in XML output
View options:
--daa (-a) DIAMOND alignment archive (DAA) file
--forwardonly only show alignments of forward strand
Getseq options:
--seq Sequence numbers to display.
参考文档:
https://www.jianshu.com/p/199824ca5234
https://www.jianshu.com/p/2c4c53b74594