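Introduction
SeqKit is a tool for processing FASTA/FASTQ sequence files. It is written in Go, covers a fairly complete set of operations, and runs stably.
Installation
Method 1: Download the binary (latest stable/development release)
Download page: https://bioinf.shenwei.me/seqkit/download/
Download the compressed executable for your operating system and extract it with tar -zxvf *.tar.gz or another tool.
Method 2: Install with conda (latest stable release)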
conda install -c bioconda seqkit
Method 3: Install with Homebrew (latest stable release)
brew install seqkit
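Whichever method is used, it helps to confirm that the executable is reachable before going further. The following is a minimal sketch, assuming the binary ended up somewhere on PATH; the variable seqkit_found is only an illustrative name.

```shell
# Check that the seqkit binary is on PATH, then print its version.
if command -v seqkit >/dev/null 2>&1; then
    seqkit version
    seqkit_found=yes
else
    echo "seqkit not found on PATH; check the install location" >&2
    seqkit_found=no
fi
```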
Usage:
seqkit rmdup [flags]
Flags:
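-n, --by-name by full name instead of just id #compare the full sequence name (the whole header line) instead of only the ID; without -n or -s, records with the same ID are removed
-s, --by-seq by seq #deduplicate by sequence; records with identical sequences are removed
-D, --dup-num-file string file to save number and list of duplicated seqs #file that stores information about the removed sequences (their number and list)
-d, --dup-seqs-file string file to save duplicated seqs #file that stores the removed sequences themselves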
-h, --help help for rmdup
-i, --ignore-case ignore case
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others. can also set with environment variable SEQKIT_THREADS) (default 2)
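The flags above can be combined. The sketch below deduplicates by sequence while ignoring case, and keeps both the removed records and a summary of them. All file names are hypothetical, and the tiny input is created only so the example is self-contained.

```shell
# Build a small hypothetical input: seq_b repeats seq_a's bases in lower case.
cat > dup_demo.fasta <<'EOF'
>seq_a
ACGTTGCA
>seq_b
acgttgca
>seq_c
GGGGCCCC
EOF

# -s -i : compare sequences, ignoring case
# -d    : save the removed records
# -D    : save the number and list of duplicated records
# -o    : save the records that were kept
if command -v seqkit >/dev/null 2>&1; then
    seqkit rmdup -s -i \
        -d dup_demo.removed.fasta \
        -D dup_demo.removed.txt \
        -o dup_demo.rmdup.fasta \
        dup_demo.fasta
else
    echo "seqkit not found on PATH; skipping" >&2
fi
```

Keeping the -d and -D outputs makes it possible to review what was discarded before relying on the deduplicated file.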
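Examples
1. Deduplicate by FASTA name: with -n the full header line is compared, and records with the same name are removed (without -n or -s, only the IDs are compared):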
seqkit rmdup -n test.fasta -o test.rmdup.fasta
2. Deduplicate by sequence: records with identical sequences are removed:
seqkit rmdup -s test.fasta -o test.rmdup.fasta
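To see how the comparison modes differ, it can help to run them on the same small file. The following is a sketch with a made-up input (names_demo.fasta and the output names are hypothetical): the two seq1 records share an ID but have different full names, while the first seq1 record and seq2 share a sequence, so the three modes should not keep the same set of records.

```shell
# Hypothetical input with a repeated ID and a repeated sequence.
cat > names_demo.fasta <<'EOF'
>seq1 sample A
ACGTACGT
>seq1 sample B
TTTTGGGG
>seq2 sample C
ACGTACGT
EOF

if command -v seqkit >/dev/null 2>&1; then
    # Default: compare IDs only (the header text before the first space).
    seqkit rmdup names_demo.fasta -o by_id.fasta
    # -n: compare the complete header line.
    seqkit rmdup -n names_demo.fasta -o by_name.fasta
    # -s: compare the sequences themselves.
    seqkit rmdup -s names_demo.fasta -o by_seq.fasta
    # Count how many records each mode kept.
    grep -c '^>' by_id.fasta by_name.fasta by_seq.fasta
else
    echo "seqkit not found on PATH; skipping" >&2
fi
```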