Description
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
Solution
Bit-manipulation & HashMap, time O(n), space O(n)
为了判断sequence的重复性,可以对String做hash。由于DNA str的长度为10,且只由四个字符组成,很自然可以想到将DNA转换成二进制表达,每个字符由2 bits表示,那么只需要20 bits即可表达一条DNA。
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
List<String> dup = new ArrayList<>();
if (s == null || s.length() < 11) {
return dup;
}
Map<Integer, Integer> dnaFreq = new HashMap<>();
Map<Character, Integer> map = new HashMap<>();
String t = "ACGT";
for (int i = 0; i < t.length(); ++i) {
map.put(t.charAt(i), i);
}
int val = 0;
for (int i = 0; i < 10; ++i) {
val = (val << 2) | map.get(s.charAt(i));
}
dnaFreq.put(val, 1);
final int MASK = ~(3 << 18);
for (int i = 10; i < s.length(); ++i) {
val &= MASK; // remove the val of s.charAt(i - 10)
val = (val << 2) | map.get(s.charAt(i));
dnaFreq.put(val, dnaFreq.getOrDefault(val, 0) + 1);
if (dnaFreq.get(val) == 2) { // duplicate
dup.add(s.substring(i - 9, i + 1));
}
}
return dup;
}
}