LeetCode – Repeated DNA Sequences (Java)

Problem

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example, given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”, return: [“AAAAACCCCC”, “CCCCCAAAAA”].

Java Solution

The key to solving this problem is that each of the 4 nucleotides can be encoded in 2 bits, so a 10-letter sequence fits in a 20-bit integer. Because this mapping is one-to-one, two 10-letter windows are equal exactly when their integers are equal. The following is a Java solution. You can trace it by hand on a small example to see how it works.

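import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

public List<String> findRepeatedDnaSequences(String s) {
    List<String> result = new ArrayList<>();
    if (s == null || s.length() < 10) {
        return result;
    }

    // Each nucleotide gets a 2-bit code.
    HashMap<Character, Integer> dict = new HashMap<>();
    dict.put('A', 0);
    dict.put('C', 1);
    dict.put('G', 2);
    dict.put('T', 3);

    int hash = 0;
    // Keeps the low 20 bits, i.e. the codes of the last 10 letters.
    int mask = (1 << 20) - 1;

    // temp holds every window value seen so far;
    // added holds the values already put into result.
    HashSet<Integer> added = new HashSet<>();
    HashSet<Integer> temp = new HashSet<>();

    for (int i = 0; i < s.length(); i++) {
        // Shift in the 2-bit code of the next letter.
        hash = (hash << 2) + dict.get(s.charAt(i));

        // The first full 10-letter window ends at index 9.
        if (i >= 9) {
            // Drop the bits of letters older than the current window.
            hash &= mask;
            if (temp.contains(hash) && !added.contains(hash)) {
                result.add(s.substring(i - 9, i + 1));
                added.add(hash);
            }

            temp.add(hash);
        }
    }

    return result;
}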
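
To make the 2-bit encoding concrete, here is a minimal sketch that isolates the two steps the loop above performs: appending a letter and masking off everything older than ten letters. The class and method names (`DnaEncodingSketch`, `code`, `roll`, `encode`) are hypothetical and exist only for this illustration; the letter codes and the mask are the ones used in the solution.

```java
final class DnaEncodingSketch {
    // 20 low bits set: room for exactly ten 2-bit letter codes.
    static final int MASK = (1 << 20) - 1;

    // Same mapping as the dict in the solution: A=0, C=1, G=2, T=3.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("Not a nucleotide: " + c);
        }
    }

    // Appends one letter to the window value and drops the oldest letter's bits.
    static int roll(int hash, char next) {
        return ((hash << 2) | code(next)) & MASK;
    }

    // Encodes a string by rolling through it; only the last ten letters survive.
    static int encode(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = roll(hash, s.charAt(i));
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(encode("AAAAACCCCC")); // 341    (binary 0101010101)
        System.out.println(encode("CCCCCAAAAA")); // 349184 (341 shifted left by 10 bits)
        // Sliding the window by one letter gives the same value as encoding from scratch.
        System.out.println(roll(encode("AAAAACCCCC"), 'C') == encode("AAAACCCCCC")); // true
    }
}
```

Since every 10-letter window gets its own integer in the range 0 to 2^20 - 1, comparing integers is equivalent to comparing the windows themselves.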

5 thoughts on “LeetCode – Repeated DNA Sequences (Java)”

  1. Why do you have both a temp and an added HashSet? If you’re calculating the hash code, shouldn’t one set containing all the seen hash codes be enough to find duplicates?

  2. String’s hashCode will sometimes give you the same value for two different sequences. Example:

    ACCCCTGAGG
    CTGTTCGTTG

    Both return hashCode: 1406448045

    In Java, at least. So you can’t rely on Java’s built-in hashCode implementation here; you have to implement your own version. The code above implements its own 20-bit hash code.

  3. Why do you need to generate your own hash code? The String class has its own hashCode method that returns a unique hash for each unique string. So if you just take each 10-letter string, check whether it already exists in the Set, and add it to the list if so, wouldn’t that work?
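
The questions above can be explored with a minimal sketch that stores the 10-letter substrings themselves rather than integers. It is not the post's solution; the class and method names (`RepeatedDnaWithStrings`, `findRepeated`) are hypothetical. A `HashSet<String>` compares entries with `equals`, so colliding `hashCode` values do not produce wrong answers. The second set is still needed, but only so that a sequence occurring three or more times is reported once.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RepeatedDnaWithStrings {
    static List<String> findRepeated(String s) {
        List<String> result = new ArrayList<>();
        if (s == null) {
            return result;
        }
        Set<String> seen = new HashSet<>();     // every window encountered so far
        Set<String> reported = new HashSet<>(); // windows already added to result

        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // Set.add returns false when the element was already present.
            boolean seenBefore = !seen.add(window);
            if (seenBefore && reported.add(window)) {
                result.add(window);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
        // With a single set, the input below would report the same sequence three times.
        System.out.println(findRepeated("AAAAAAAAAAAAA"));
    }
}
```

The trade-off is memory: this version keeps a String object per distinct window, while the post's solution keeps one 20-bit integer per window.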
