根据ID从FASTA文件中批量提取序列是做序列分析常做的事情,有网友让我帮忙从11万条中挑选7万条,我自己写写了一个,太慢了;后来发现Biopython官方文档里面“Cookbook – Cool things to do with it”第一件事就是做这个事情的,后来我又学习了“冷月”小伙伴在知乎的帖子,稍微改写了一下,其实就是ctrl+c和ctrl+v,供大家参考,欢迎交流。
冷月的帖子:
https://zhuanlan.zhihu.com/p/57938795
Biopython官方文档:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec365
当然首先在要电脑上装python和Biopython。
然后就是我ctrl+c和ctrl+v的代码了:
#coding:utf-8
import click
from Bio import SeqIO
@click.command()
@click.option('-f', '--fastafile', help='Input a fasta file', required=True)
@click.option('-i', '--idfile', help='Input an idlist', required=True)
@click.option('-o', '--outfile', help='Input the name of result file', default='result_out.fa')
def main(fastafile="FA.fa",idfile="ID.txt",outfile= "result_out.fa"):
with open(idfile) as id_handle:
wanted = set(line.rstrip("\n").split(None,1)[0] for line in id_handle)
print("Found %i unique identifiers in %s" % (len(wanted), idfile))
records = (r for r in SeqIO.parse(fastafile, "fasta") if r.id in wanted)
count = SeqIO.write(records, outfile, "fasta")
print("Saved %i records from %s to %s" % (count, fastafile, outfile))
if count < len(wanted):
print("Warning %i IDs not found in %s" % (len(wanted) - count, fastafile))
if __name__ == '__main__':
main()
另存为get_seqs_by_id.py
使用方法:
python get_seqs_by_id.py -f **.fasta -i ID.txt -o out_file.fasta
加个时间模块测试了一下,11万条中挑选7万花了不到两秒,回去看看我自己写的啥JB玩意儿啊~。。。。
其实还有下面更快的方法:
seqtk方法也十分方便("Nickier"):
https://www.jianshu.com/p/2671198ae625
seqtk subseq **.fasta **.txt > out.fasta
还有seqkit的方法("苏牧传媒"):
https://www.jianshu.com/p/471283080bd6