biopython - 比较两个序列的相似性

比较序列相似性（sequence similarity）可以考虑用biopython或者emboss的几种比对方法。

1. Bio.pairwise2

主要用到SeqIO.parse读取，然后用Bio.pairwise2.align.globalxx比对并输出两个序列一样的比例。如果用局部比对，可以用Bio.pairwise2.align.localxx.

from __future__ import division
from Bio import pairwise2 as pw2
from Bio import SeqIO 

first_dict = SeqIO.to_dict(SeqIO.parse(open(first_fasta),'fasta')) # 直接转为字典格式
second_dict = SeqIO.to_dict(SeqIO.parse(open(second_fasta),'fasta'))

# 两个fasta文件中的序列两两比较：
for t in first_dict:
    t_len = len(first_dict[t].seq)
    for t2 in correspond[t]:
        global_align = pw2.align.globalxx(first_dict[t].seq, second_dict[t2].seq)
        matched = global_align[0][2]
        percent_match = (matched / t_len) * 100

            print(t + '\t' + t2 + '\t' + str(percent_match) + '\n')

2. Bio.Emboss.Applications

用了NeedleCommandline去比对，实测比上面的方法要快一点。不过都是python写的，又是基于DP，都不算很快。

from Bio.Emboss.Applications import NeedleCommandline

needle_cline = NeedleCommandline(asequence="test1.txt", bsequence="test2.txt",gapopen=10,gapextend=0.5,outfile='stdout')

out_data, err = needle_cline()
out_split = out_data.split("\n")
p = re.compile("\: (.*)\/")
print(int(p.search(out_split[24]).group(1).replace("%", "")))

3. needle

本质与上面的方法一样，不过这个是在shell中运行的。

# 安装
conda install -c bioconda emboss

# 运行
needle -asequence test1.txt -bsequence test2.txt -gapopen 10 -gapextend 0.5 -outfile aln.needle
grep 'Similarity' aln.needle | awk -F '[:| |/]' '{print $4}'