3 longest substring without repeating characters

title: Longest Substring Without Repeating Characters
tags:
- longest-substring-without-repeating-characters
- No.3
- medium
- string
- divide-conquer
- dynamic-programming
- queue

Problem

Given a string, find the length of the longest substring without repeating characters.

Examples:

Given "abcabcbb", the answer is "abc", which the length is 3.

Given "bbbbb", the answer is "b", with the length of 1.

Given "pwwkew", the answer is "wke", with the length of 3. Note that the answer must be a substring, "pwke" is a subsequence and not a substring.

Corner Cases

""
"a"
"ab"
"xxxcabcxy"

Solution

Divide and Conquer

Divide

For string s[i : j], it can be divided into 2 independent substrings: s[i : k], s[k+1 : j]. If the longest substring without repeating characters for s[i : j] is LSWRC(s[i : j]), then we have:

LSWRC(s[i : j]) = max{ LSWRC(s[i : k]), LSWRC(s[k+1 : j]), s[x : y] }

where s[x : y] is the longest substring without repeating characters accrossing over k, k+1.

In this case, we can divide s[i : j] into 2 halves, which means k = (i + j) / 2

Conquer

Let say LSWRC(String s) gives the longest substring without repeating characters for input s. Then we directly get the results of the following for input s[i : j]:

LSWRC(s[i : k])
LSWRC(s[k+1 : j])

The main problem is s[x : y].

Starting from k and k+1, we move x and y one by one at one time untill:

x decreases to i and y increases to j
s[x - 1] or s[y + 1] is already in s[x : y]

Then we get the required s[x : y].

Finally, compare LSWRC(s[i : k]), LSWRC(s[k+1 : j]) and s[x : y].

An instance follows:

  LSWRC("abcabcbb") 
= max{ LSWRC("abca"), LSWRC("bcbb"), "cab"          }
= max{ max{ LSWRC("ab"), LSWRC("ca"), "abc" },
       max{ LSWRC("bc"), LSWRC("bb"), "cb"  },
       "cab"                                        }
= max{ max{ max{ LSWRC("a"), LSWRC("b"), "ab" },
            max{ LSWRC("c"), LSWRC("a"), "ca" },
            "abc"                                },
       max{ max{ LSWRC("b"), LSWRC("c"), "bc" },
            max{ LSWRC("b"), LSWRC("b"), "b"  },
            "cb"                                 },
       "cab"                                        }
= max{ max{ max{ "a", "b", "ab" },
            max{ "c", "a", "ca" },
            "abc"                  },
       max{ max{ "b", "c", "bc" },
            max{ "b", "b", "b"  },
            "cb"                   },
       "cab"                          }
= max{ max{ "ab", "ca", "abc" },
       max{ "bc", "b" , "cb"  },
       "cab"                     }
= max{ "abc", "cb", "cab" }
= "abc"

$T(n) = 2 \times T(\frac{n}{2}) + O(n)$ , thus running time is $O(n lg(n))$ .

However, there is a severe bug in this algorithm in the part of computing s[x : y]:

When x and y are expanded, they cannot be updated in the same time. For an example: LSWRC("xxxcabcxy") = max{ LSWRC("xxxca"), LSWRC("bcxy"), "abcxy" }.

However, in the process of getting "abcxy", if we add s[x] first:

0. "ab"
1. "cab"
2. "cabc" // stop increasing y
3. "xcab"
4. "xxcab"
5. return "xcab"

s[x : y] is locked by the right c. If we add s[y] first:

0. "ab"
1. "abc"
2. "cabc" // stop decreasing x
3. "abcx"
4. "abcxy"
5. return "abcxy"

It's rather difficult to fix this bug since index x and y are symmetrical in theory while non-symmetrical in code:

class Solution {
    private String[] s;

    public int lengthOfLongestSubstring(String s) {
        this.s = s.split("");
        return LSWRC(0, s.length()-1);
    }

    public int[] LSWRC(int i, int j) {
        // if a character
        if (i == j) {return new int[] {i, j};}

        int   k       = (i + j) / 2;
        int[] lswrc   = {0, 0};

        // divide
        int[] l_lswrc = LSWRC(i  , k);
        int[] r_lswrc = LSWRC(k+1, j);

        int   l1      = l_lswrc[1] - l_lswrc[0] + 1;
        int   l2      = r_lswrc[1] - r_lswrc[0] + 1;
        lswrc = (l1 > l2) ? l_lswrc : r_lswrc;

        // conquer
        int i = k;
        int j = k + 1;

        int x = k;
        int y = k + 1;

        if (this.s[i].equals(this.s[j])) {return lswrc;}

        HashSet<String> hs = new HashSet<String>();
        boolean      istop = false;
        boolean      jstop = false;

        for ( ; !(istop && jstop); ) {
            boolean p1 = hs.contains(this.s[i]);
            boolean p2 = (i == i);
            istop = p1 || p2;

            if (!p1) {
                hs.add(this.s[i]);
                x = i;
                i = p2 ? i : i - 1;
            }

            boolean q1 = hs.contains(this.s[j]);
            boolean q2 = (j == j);
            jstop = q1 || q2;
            if (!q1) {
                hs.add(this.s[j]);
                y = j;
                j = q2 ? j : j + 1;
            }
        }

        int   l3 = lswrc[1] - lswrc[0] + 1;
        int   l4 = y - x + 1;
        lswrc = (l3 > l4) ? lswrc : new int[] {x, y};

        return lswrc;
    }
}

Dynamic Programming

For all substring s[i : j], we have i <= j.

If s[i : j] has repeating characters, then all s[a : b] with a <= i && j <= b have repeating characters too. We can describe this relationship by table A according to Bottom-Up Dynamic Programming:

8|111100000
7|11110000
6|1111000
5|110000
4|11000
3|1100
2|110
1|10
0|0
 +---------
  012345678
  xxxcabcxy

Here, A[i][j] equals

0 : s[i : j] does not have repeating characters
1 : s[i : j] has repeating characters

Then the relationship can be visualized by a rectangle area:

8|1111
7|1111
6|1111
5|
4|
3|
2|
1|
0|
 +---------
  012345678
  xxxcabcxy

Here all substrings contains[3 : 6] = "cabc" have repeating characters.

The longest substring without repeating characters are bounded by a line with all 1 parallel to $j = i$ (single character line):

8|   1    0
7|  1    0
6| 1    0
5|1    0
4|    0
3|   0
2|  0
1| 0
0|0
 +---------
  012345678
  xxxcabcxy

How to check if s[i : j] has repeating characters or not? Suppose the set of characters of s[i : j] is B[i][j]. Since there exists:

s[i : j] = s[i : k] + s[k+1 : j]

we have:

B[i][j] = B[i][k] union B[k+1][j]

if:

B[i][k] intersect B[k+1][j] == Empty

Then s[i : j] does not have repeating characters.

What's more, if we compute the triangle from j axle:

8|11    000
7|11   000
6|11  000
5|11?000
4|11000
3|1100
2|110
1|10
0|0
 +---------
  012345678
  xxxcabcxy

s[i : j] = s[i : j-1] + s[j : j]. In this direction, B[i][j-1] of s[i : j-1] is computed. Then we only check if s[j] is in B[i][j - 1] or not.

For instance, s[2 : 5] = s[2 : 4] + s[5 : 5]. B[2][4] = {"x", "a", "c"}, s[5] = "b". Then A[2][5] = 0.

Complexity

The space complexity is huge. For A, $O(n^2)$ is needed. For B, since the elements of the table are all sets, the worst case is $O(n^3)$ . The algorithm visits all A[i][j], thus at least $O(n^2)$ time complexity. For HashSet,

B[i][j-1].contians(s[j])
B[i][j-1].add(s[j])

are all $O(1)$ . However, if we do not reduce the visiting times of 1 points like A[0][n-1], the total time complexity would be much more than $O(n^2)$ .

class Solution {
    public int lengthOfLongestSubstring(String s) {
        if (s.equals("")) { return 0; }

        String[] x = s.split("");
        int      n = s.length();

        int[]             A = new int[f(n-1, n-1) + 1];
        HashSet<String>[] B = new HashSet[f(n-1, n-1) + 1];

        for (int i=0; i<n; i++) {
            A[f(i, i)] = 0;
            B[f(i, i)] = new HashSet<String>();
            B[f(i, i)].add(x[i]);
        }

        int     c    = 1;
        boolean flag = true;
        for( ; c<n; c++) {
            flag = true;
            for (int i=0; i<n-c; i++) {
                int j = i + c;
                if (A[f(i, j)] != 1) {
                    if (B[f(i, j-1)].contains(x[j])) {
                        // to be optimized
                        for (int ii=0; ii<=i; ii++) {
                            for (int jj=n-1; j<=jj; jj--) {
                                A[f(ii, jj)] = 1;
                            }
                        }
                    }
                    else {
                        A[f(i, j)] = 0;
                        B[f(i, j)] = B[f(i, j-1)];
                        B[f(i, j)].add(x[j]);
                    }
                }
                flag = flag && (A[f(i, j)] == 1);
            }
            if (flag) { return c; }
        }
        return c;
    }

    private int f(int i, int j) { return i + (j * (j + 1)) / 2; }
}

Improvement

Like LCS problem in CLRS, we solve the problem in $O(n)$ space complexity. Different from LCS, we use a queue instead of an array.

Here we give a looser condition: if A[i][j]==1, then for all j<=k, A[i][k]=1. In other words, we just check one column of the table instead of the rectangle area, because of the sequence the improved algorithm visiting table A.

For table A, we make it a queue. It is updated as following for input xxxcabcxy:

44,43,41,38,34,29,23,16,08
42,40,37,33,28,22,15,07
39,36,32,27,21,14,06
35,31,26,20,13,05
30,25,19,12,04
24,18,11,03
17,10,02
09,01
00

A         i+(j*(j+1))/2
-------------------------------------
0         [00]
00        [00,01]
000       [00,01,02]
0000      [00,01,02,03]
00000     [00,01,02,03,04]
000000    [00,01,02,03,04,05]
0000000   [00,01,02,03,04,05,06]
00000000  [00,01,02,03,04,05,06,07]
000000000 [00,01,02,03,04,05,06,07,08]
000000001 [01,02,03,04,05,06,07,08,09]
000000011 [02,03,04,05,06,07,08,09,10]
000000110 [03,04,05,06,07,08,09,10,11]
000001100 [04,05,06,07,08,09,10,11,12]
000011000 [05,06,07,08,09,10,11,12,13]
000110000 [06,07,08,09,10,11,12,13,14]
001100000 [07,08,09,10,11,12,13,14,15]
011000000 [08,09,10,11,12,13,14,15,16]
11000000  [09,10,11,12,13,14,15,16]
10000001  [10,11,12,13,14,15,16,17]
00000011  [11,12,13,14,15,16,17,18]
00000110  [12,13,14,15,16,17,18,19]
00001100  [13,14,15,16,17,18,19,20]
00011000  [14,15,16,17,18,19,20,21]
00110000  [15,16,17,18,19,20,21,22]
01100000  [16,17,18,19,20,21,22,23]
1100000   [17,18,19,20,21,22,23]
1000001   [18,19,20,21,22,23,24]
0000011   [19,20,21,22,23,24,25]
0000110   [20,21,22,23,24,25,26]
0001101   [21,22,23,24,25,26,27]
0011010   [22,23,24,25,26,27,28]
0110100   [23,24,25,26,27,28,29]
110100    [24,25,26,27,28,29]
101001    [25,26,27,28,29,30]
010011    [26,27,28,29,30,31]
100111    [27,28,29,30,31,32]
001111    [28,29,30,31,32,33]
011110    [29,30,31,32,33,34]
11110     [30,31,32,33,34]
11101     [31,32,33,34,35]
11011     [32,33,34,35,36]
10111     [33,34,35,36,37]
01111     [34,35,36,37,38]
1111      [35,36,37,38]
RETURN 5

B updates synchronously with A. However, the running time is still $O(n^2)$ .

import java.util.LinkedList;
import java.util.Queue;
import java.util.HashSet;

class Solution {    
    public int lengthOfLongestSubstring(String s) {
        if (s.equals("")) { return 0; }

        String[] x = s.split("");
        int      n = s.length();

        Queue<Integer>         A = new LinkedList<Integer>();
        Queue<HashSet<String>> B = new LinkedList<HashSet<String>>();

        // O(n)
        for (int i=0; i<n; i++) {
            HashSet<String> b = new HashSet<String>();
            b.add(x[i]);
            A.offer(0);
            B.offer(bi);
        }

        int     c    = 1;
        boolean flag = true;
        for( ; c<n; c++) {
            flag = true;
            for (int i=0; i<n-c; i++) {
                int             j = i + c;
                int             a = A.poll();
                HashSet<String> b = B.poll();
                // [i][j-1] has been removed
                if (a != 1) {
                    if (b.contains(x[j])) {
                        A.offer(1);
                        B.offer(b);
                        flag = flag && true;
                    }
                    else {
                        b.add(x[j]);
                        A.offer(0);
                        B.offer(b);
                        flag = flag && false;
                    }
                }
                else {
                    A.offer(1);
                    B.offer(b);
                    flag = flag && true;
                }
            }
            A.poll();
            B.poll();
            if (flag) {return c;}
        }
        return c;
    }
}