问题描述
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
思路分析
蠢蠢的我以为要枚举长度为10的子串并用KMP算法进行匹配,因此去复习了一波KMP,然后就超时啦(手动再见)。
正确的做法是,由于字母只有ATCG四个,可以根据字母对一个字符串进行编码,也可以理解为将ATCG对应为0123,即将字符串转换为4进制数,然后对数字进行匹配就好了。其中还需要注意,不用每次都重算字符串的encode,因为前一个字符串的值后一个可以利用,而因为10个字母正好对应20位二进制数,因此可以通过“&0xFFFFF”来去掉10个字母之前的值,最后是注意用dict提高速度。
AC代码
class Solution(object):
def findRepeatedDnaSequences(self, s):
"""
:type s: str
:rtype: List[str]
"""
n = len(s)
rst = set()
once = set()
code = 0
code_map = {'A':0, 'C':1, 'G':2, 'T':3}
for i in range(n):
code = code * 4 + code_map[s[i]] & 0xFFFFF
if i < 9:
continue
if code in once:
rst.add(s[i-9 : i+1])
else:
once.add(code)
return list(rst)
KMP复习
def getNext(p):
n = len(p)
next = [0 for i in range(n)]
k = 0
for i in range(1, n):
while k > 0 and p[k] != p[i]:
k = next[k-1]
if p[k] == p[i]:
k += 1
next[i] = k
return next
def kmp(s, p):
i = 0
j = 0
n = len(s)
m = len(p)
next = getNext(p)
while j < m and i < n:
if s[i] == p[j]:
i += 1
j += 1
else:
if j == 0:
i += 1
else:
j = next[j-1]
if j == m:
return True
else:
return False