187. Repeated DNA Sequences

Description

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Solution

Bit-manipulation & HashMap, time O(n), space O(n)

为了判断sequence的重复性,可以对String做hash。由于DNA str的长度为10,且只由四个字符组成,很自然可以想到将DNA转换成二进制表达,每个字符由2 bits表示,那么只需要20 bits即可表达一条DNA。

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> dup = new ArrayList<>();
        if (s == null || s.length() < 11) {
            return dup;
        }
        
        Map<Integer, Integer> dnaFreq = new HashMap<>();
        Map<Character, Integer> map = new HashMap<>();
        String t = "ACGT";
        for (int i = 0; i < t.length(); ++i) {
            map.put(t.charAt(i), i);
        }
        
        int val = 0;
        for (int i = 0; i < 10; ++i) {
            val = (val << 2) | map.get(s.charAt(i));
        }
        dnaFreq.put(val, 1);
        
        final int MASK = ~(3 << 18);
        for (int i = 10; i < s.length(); ++i) {
            val &= MASK;    // remove the val of s.charAt(i - 10)
            val = (val << 2) | map.get(s.charAt(i));
            dnaFreq.put(val, dnaFreq.getOrDefault(val, 0) + 1);
            
            if (dnaFreq.get(val) == 2) {    // duplicate
                dup.add(s.substring(i - 9, i + 1));
            }
        }
        
        return dup;
    }
}
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容