All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
思路:
最暴力的方法是两次循环,每次比较当前位置起长度为10的子串,在后面是否出现过。
继而优化方法是用哈希表存已经找到的长度为10的子串,遍历到新的就看是否已遍历过,并且结果集中没有。
由于字符串中只含有ACGT四个字母,因此如果把它们看作是一种计数方式,是可以把长度为10的字符串转化为一个数字的,如果哈希表中只存数字,空间利用率将会进一步提升。
拓展思考,在字符串中只有8个不同字母情况下,用64位整型,仍然可以利用此种方法。
public List<String> findRepeatedDnaSequences(String s) {
List<String> res = new ArrayList<>();
if (s == null || s.length() < 11) {
return res;
}
Map<Integer, Integer> map = new HashMap<>();
map.put(0, 0);
map.put('C' - 'A', 1);
map.put('G' - 'A', 2);
map.put('T' - 'A', 3);
Map<Integer, Integer> record = new HashMap<>();
for (int i = 0; i < s.length() - 9; i++) {
int numFlag = 0;
for (int j = i; j < i+10; j++) {
numFlag |= map.get(s.charAt(j) - 'A');
numFlag <<= 2;
}
record.put(numFlag, record.getOrDefault(numFlag, 0) + 1);
if (record.get(numFlag) == 2) {
res.add(s.substring(i, i+10));
}
}
return res;
}