题目来源
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
看到这道题的最初想法是用哈希,记录每一个十个字母长度的数目,如果大于等于2则输出。然后直接AC了!
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> res;
int n = s.length();
if (n < 10)
return res;
unordered_map<string, int> map;
for (int i=0; i<=n-10; i++) {
map[s.substr(i, 10)]++;
if (map[s.substr(i, 10)] == 2)
res.push_back(s.substr(i, 10));
}
return res;
}
};
然后看看大神们的做法吧,结合上位操作。
The main idea is to store the substring as int in map to bypass the memory limits.
There are only four possible character A, C, G, and T, but I want to use 3 bits per letter instead of 2.
Why? It's easier to code.
A is 0x41, C is 0x43, G is 0x47, T is 0x54. Still don't see it? Let me write it in octal.
A is 0101, C is 0103, G is 0107, T is 0124. The last digit in octal are different for all four letters. That's all we need!
We can simply use s[i] & 7 to get the last digit which are just the last 3 bits, it's much easier than lookup table or switch or a bunch of if and else, right?
We don't really need to generate the substring from the int. While counting the number of occurrences, we can push the substring into result as soon as the count becomes 2, so there won't be any duplicates in the result.
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> res;
int n = s.length();
if (n < 10)
return res;
unordered_map<int, int> map;
int i = 0, t = 0;
while (i < 9) {
t = t << 3 | s[i++] & 7;
}
while (i < n){
if (map[t = t << 3 & 0x3FFFFFFF | s[i++] & 7]++ == 1)
res.push_back(s.substr(i-10, 10));
}
return res;
}
};