Exact match
Naive exact match
Introduction:精确匹配,耗时长
Input: parten 和 text即模式串和需要比对上的字符串
Output: parten出现所有位置的index
Algorithm:两个loop,外层对t进行移位,内层逐字符比对,关键是有一个mach标志为,如果比对上了,append。
def naive(p, t):
occurrences = []
for i in range(len(t) - len(p) + 1): # loop over alignments
match = True
for j in range(len(p)): # loop over characters
if t[i+j] != p[j]: # compare characters
match = False
break
if match:
occurrences.append(i) # all chars matched; record
return occurrences
Boyer-Moore
Introduction:j精确匹配,可以跳过一些不需要匹配的位置,因此速度比Naive快
Input: p, t 和p_bm,它相当于一个对p的preprocess
Output: p匹配到t的位置
Algorithm:核心思想:一个shift,每次选出一个最大gap, 如何匹配上了, append(i)
细节:首先while循环,因为你不清楚循环次数,所以不用for循环;然后for循环,对p中的每个字符遍历,从右往左,如果不等,通过bad_character和good_suffix两种规则,求出最大shift,并break;如果match,则append,并skip_gs,继续map,有可能有多个位置。
def boyer_moore(p, p_bm, t):
""" Do Boyer-Moore matching. p=pattern, t=text,
p_bm=BoyerMoore object for p """
i = 0
occurrences = []
while i < len(t) - len(p) + 1:
shift = 1
mismatched = False
for j in range(len(p)-1, -1, -1):
if p[j] != t[i+j]:
skip_bc = p_bm.bad_character_rule(j, t[i+j])
skip_gs = p_bm.good_suffix_rule(j)
shift = max(shift, skip_bc, skip_gs)
mismatched = True
break
if not mismatched:
occurrences.append(i)
skip_gs = p_bm.match_skip()
shift = max(shift, skip_gs)
i += shift
return occurrences
Hash Algorithm
Introduction:需要preprocess的算法,精确匹配,适用于test较大的比对,但是hash表较大
Input: p, t, k k是生成hash table也就是Index中的k-mer
Output:比对的位置
Algorithm:核心:先是对test进行index生成hashtable,query的时候对p只选了一个k-mer, 因为这里选的都是精确匹配。python中用的一个list,然后元素是(k-mer, position), 用了一个二叉树查找,这里的-1有点奇怪。然后对p进行选取k-mer,查找,最后需要对剩下的字符进行匹配。
class Index(object):
""" Holds a substring index for a text T """
def __init__(self, t, k):
""" Create index from all substrings of t of length k """
self.k = k # k-mer length (k)
self.index = []
for i in range(len(t) - k + 1): # for each k-mer
self.index.append((t[i:i+k], i)) # add (k-mer, offset) pair
self.index.sort() # alphabetize by k-mer
def query(self, p):
""" Return index hits for first k-mer of p """
kmer = p[:self.k] # query with first k-mer
i = bisect.bisect_left(self.index, (kmer, -1)) # binary search
hits = []
while i < len(self.index): # collect matching index entries
if self.index[i][0] != kmer:
break
hits.append(self.index[i][1])
i += 1
return hits
def query_index(p, t, index):
k = index.k
offsets = []
for i in index.query(p):
if p[k:] == t[i+k: i+len(p)]:
offsets.append(i)
return offsets
减少hashtable的大小
提高specificity, 减少collision的次数
Approximate match
Pigenohole principle:即P与T匹配,允许k个失配,那么把P分成k+1份,必然在T中能够找到k+1份中某一份的精确匹配
if |X| = |Y|, then EditDistance(X, Y) <= HammingDistance(X, Y)
if |X| not equal |Y|, then EditDistance(X, Y) >= ||X|-|Y||
** Hamming distance
Edit distance
Hash with pigenohole principle
Introduction:模糊匹配,原理主要采用鸽子洞原理,即把P分为k+1份,总有一份满足精确匹配,找到它的位置,然后计算其余部分失配的个数不能超过限定值。
Input: p, t, n n表示允许失配的个数
Output:匹配到的位置,以及字符
Algorithm:核心思想是把p分为n+1份,我们只要求长度即可,并且小于它就行,不需要我们产生n+1个随机数,还得之和为len(P).然后根据这个平均长度,生成hashtable.然后对n+1份字符进行遍历,并且查找位置,之后对查找出的位置进行遍历。如果查找的位置小于start 或者大于len(t),则continue,否则,逐字符比对剩下的字符,统计失配个数。
def approximate_match(p, t, n):
segment_length = int(round(len(p) / (n + 1)))
all_matches = set()
p_idx = Index(t, segment_length)
idx_hits = 0
for i in range(n + 1):
start = i * segment_length
end = min((i + 1) * segment_length, len(p))
matches = p_idx.query(p[start:end])
# Extend matching segments to see if whole p matches
for m in matches:
idx_hits += 1
if m < start or m - start + len(p) > len(t):
continue
mismatches = 0
for j in range(0, start):
if not p[j] == t[m - start + j]:
mismatches += 1
if mismatches > n:
break
for j in range(end, len(p)):
if not p[j] == t[m - start + j]:
mismatches += 1
if mismatches > n:
break
if mismatches <= n:
all_matches.add(m - start)
return list(all_matches), idx_hits
def editDistance(a, b):
if len(a) == 0 :
return len(b)
if len(b) == 0:
return len(a)
if a[-1] == b[-1]:
delta = 1
else :
delta = 0
return min{editDistance(a[ :-1], b[ :-1]) + delta, editDistance(a, b[ : -1]) + 1, editDistance(a[ : -1], b) + 1}
Problem :递归太耗时了,主要是会重复计算相同的值
Dynamic programing
For any pair of prefixes from X and Y ,edit distance is calculated once ,这也是它跟递归相比快的原因,但很明显,它消耗更多的内存资源。
def editDistance(x, y):
# create distance matrix
D = []
for i in range(len(x)+1):
D.append([0]*(len(y)+1))
# initialze first row and colum of matrix
for i in range(len(x)+1):
D[i][0] = i
for i in range(len(y)+1):
D[0][i] = i
#fill in the rest of the matrix
for i in range(1, len(x)+1):
for j in range(1, len(y)+1):
disHor = D[i][j-1] + 1
disVer = D[i-1][j] + 1
if x[i-1] == y[j-1]:
disDiag = D[i-1][j-1]
else:
disDiag = D[i-1][j-1] + 1
D[i][j] = min(disHor, disVer, disDiag)
return D[-1][-1]
Approximate match
Introduction:第一行初始化为0,因为你不清楚P在哪个地方与T匹配第一个字符,所以默认的editDistance都为0,然后从最后一排中editDistance最小的位置开始回溯
local alignment and gloal alignement
Assembly
Overlap graph
Introduction:
Input:一组reads
Output:overlap graph
Algorithm:核心:对每一个read移位生成k-mer,然后以k-mer为地址,value为read,进行建表,key-value也即dic数据结构
def overlap(a, b, min_length=3):
""" Return length of longest suffix of 'a' matching
a prefix of 'b' that is at least 'min_length'
characters long. If no such overlap exists,
return 0. """
start = 0 # start all the way at the left
while True:
start = a.find(b[:min_length], start) # look for b's prefix in a
if start == -1: # no more occurrences to right
return 0
# found occurrence; check for full suffix/prefix match
if b.startswith(a[start:]):
return len(a)-start
start += 1 # move just past previous match
def overlap_all_pairs(reads, k):
dic = {}
index = defaultdict(set)
for read in reads:
for i in range(len(read)-k+1):
index[read[i:i+k]].add(read)
# make graph
graph = defaultdict(set)
for r in reads:
for o in index[r[-k:]]:
if r != o:
if overlap(r, o, k):
graph[r].add(o)
edges = 0
for read in graph:
edges += len(graph[read])
return(edges, len(graph))
shortest common string
Introduction:需要把各种组合都考虑进去,求取最小的一个,太耗时间了
Input:
Output:
Algorithm
def scs(ss):
shortest_sup = None
for ssperm in itertools.permutations(ss):
sup = ssperm[0]
for i in range(len(ssperm)-1):
olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
sup += ssperm[i+1][olen:]
if shortest_sup is None or len(sup)<len(shortest_sup):
shortest_sup = sup
return shortest_sup
Greedy shortest common superstring
Introduction:不一定能得到最优解,但是速度快,并且与最优解的差距不大
Input:
Output:
Altogrithm
def pick_maximal_overlap(reads, k):
read_a = None
read_b = None
best_olen = 0
for a, b in itertools.permutations(reads, 2):
olen = overlap(a, b, min_length=k)
if olen > best_olen:
read_a, read_b = a, b
best_olen = olen
return read_a, read_b, best_olen
def greedy_scs(reads, k):
read_a , read_b, olen = pick_maximal_overlap(reads, k)
while(olen > 0):
reads.remove(read_a)
reads.remove(read_b)
reads.append(read_a + read_b[olen:])
read_a, read_b, olen = pick_maximal_overlap(reads, k)
return ''.join(reads)
De Bruijn graph
def de_bruijn_ize(st, k):
edges = []
nodes = set()
for i in range(len(st)-k+1):
edges.append((st[i:i+k-1], st[i+1:i+k]))
nodes.add(st[i:i+k-1])
nodes.add(st[i+1:i+k])
return nodes, edges