187. Repeated DNA Sequences

Description

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Solution

Bit-manipulation & HashMap, time O(n), space O(n)

为了判断sequence的重复性，可以对String做hash。由于DNA str的长度为10，且只由四个字符组成，很自然可以想到将DNA转换成二进制表达，每个字符由2 bits表示，那么只需要20 bits即可表达一条DNA。

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> dup = new ArrayList<>();
        if (s == null || s.length() < 11) {
            return dup;
        }
        
        Map<Integer, Integer> dnaFreq = new HashMap<>();
        Map<Character, Integer> map = new HashMap<>();
        String t = "ACGT";
        for (int i = 0; i < t.length(); ++i) {
            map.put(t.charAt(i), i);
        }
        
        int val = 0;
        for (int i = 0; i < 10; ++i) {
            val = (val << 2) | map.get(s.charAt(i));
        }
        dnaFreq.put(val, 1);
        
        final int MASK = ~(3 << 18);
        for (int i = 10; i < s.length(); ++i) {
            val &= MASK;    // remove the val of s.charAt(i - 10)
            val = (val << 2) | map.get(s.charAt(i));
            dnaFreq.put(val, dnaFreq.getOrDefault(val, 0) + 1);
            
            if (dnaFreq.get(val) == 2) {    // duplicate
                dup.add(s.substring(i - 9, i + 1));
            }
        }
        
        return dup;
    }
}

187. Repeated DNA Sequences

Description

Solution

Bit-manipulation & HashMap, time O(n), space O(n)

推荐阅读更多精彩内容