title: Longest Substring Without Repeating Characters
tags:
- longest-substring-without-repeating-characters
- No.3
- medium
- string
- divide-conquer
- dynamic-programming
- queue
Problem
Given a string, find the length of the longest substring without repeating characters.
Examples:
Given "abcabcbb"
, the answer is "abc"
, which the length is 3.
Given "bbbbb"
, the answer is "b"
, with the length of 1.
Given "pwwkew"
, the answer is "wke"
, with the length of 3. Note that the answer must be a substring, "pwke"
is a subsequence and not a substring.
Corner Cases
""
"a"
"ab"
"xxxcabcxy"
Solution
Divide and Conquer
Divide
For string s[i : j]
, it can be divided into 2 independent substrings: s[i : k]
, s[k+1 : j]
. If the longest substring without repeating characters for s[i : j]
is LSWRC(s[i : j])
, then we have:
LSWRC(s[i : j]) = max{ LSWRC(s[i : k]), LSWRC(s[k+1 : j]), s[x : y] }
where s[x : y]
is the longest substring without repeating characters accrossing over k, k+1
.
In this case, we can divide s[i : j]
into 2 halves, which means k = (i + j) / 2
Conquer
Let say LSWRC(String s)
gives the longest substring without repeating characters for input s
. Then we directly get the results of the following for input s[i : j]
:
LSWRC(s[i : k])
LSWRC(s[k+1 : j])
The main problem is s[x : y]
.
Starting from k
and k+1
, we move x
and y
one by one at one time untill:
-
x
decreases toi
andy
increases toj
-
s[x - 1]
ors[y + 1]
is already ins[x : y]
Then we get the required s[x : y]
.
Finally, compare LSWRC(s[i : k])
, LSWRC(s[k+1 : j])
and s[x : y]
.
An instance follows:
LSWRC("abcabcbb")
= max{ LSWRC("abca"), LSWRC("bcbb"), "cab" }
= max{ max{ LSWRC("ab"), LSWRC("ca"), "abc" },
max{ LSWRC("bc"), LSWRC("bb"), "cb" },
"cab" }
= max{ max{ max{ LSWRC("a"), LSWRC("b"), "ab" },
max{ LSWRC("c"), LSWRC("a"), "ca" },
"abc" },
max{ max{ LSWRC("b"), LSWRC("c"), "bc" },
max{ LSWRC("b"), LSWRC("b"), "b" },
"cb" },
"cab" }
= max{ max{ max{ "a", "b", "ab" },
max{ "c", "a", "ca" },
"abc" },
max{ max{ "b", "c", "bc" },
max{ "b", "b", "b" },
"cb" },
"cab" }
= max{ max{ "ab", "ca", "abc" },
max{ "bc", "b" , "cb" },
"cab" }
= max{ "abc", "cb", "cab" }
= "abc"
T(n) = 2 \times T(\frac{n}{2}) + O(n), thus running time is O(n lg(n)).
However, there is a severe bug in this algorithm in the part of computing s[x : y]
:
When x
and y
are expanded, they cannot be updated in the same time. For an example: LSWRC("xxxcabcxy") = max{ LSWRC("xxxca"), LSWRC("bcxy"), "abcxy" }
.
However, in the process of getting "abcxy"
, if we add s[x]
first:
0. "ab"
1. "cab"
2. "cabc" // stop increasing y
3. "xcab"
4. "xxcab"
5. return "xcab"
s[x : y]
is locked by the right c
. If we add s[y]
first:
0. "ab"
1. "abc"
2. "cabc" // stop decreasing x
3. "abcx"
4. "abcxy"
5. return "abcxy"
It's rather difficult to fix this bug since index x
and y
are symmetrical in theory while non-symmetrical in code:
class Solution {
private String[] s;
public int lengthOfLongestSubstring(String s) {
this.s = s.split("");
return LSWRC(0, s.length()-1);
}
public int[] LSWRC(int i, int j) {
// if a character
if (i == j) {return new int[] {i, j};}
int k = (i + j) / 2;
int[] lswrc = {0, 0};
// divide
int[] l_lswrc = LSWRC(i , k);
int[] r_lswrc = LSWRC(k+1, j);
int l1 = l_lswrc[1] - l_lswrc[0] + 1;
int l2 = r_lswrc[1] - r_lswrc[0] + 1;
lswrc = (l1 > l2) ? l_lswrc : r_lswrc;
// conquer
int i = k;
int j = k + 1;
int x = k;
int y = k + 1;
if (this.s[i].equals(this.s[j])) {return lswrc;}
HashSet<String> hs = new HashSet<String>();
boolean istop = false;
boolean jstop = false;
for ( ; !(istop && jstop); ) {
boolean p1 = hs.contains(this.s[i]);
boolean p2 = (i == i);
istop = p1 || p2;
if (!p1) {
hs.add(this.s[i]);
x = i;
i = p2 ? i : i - 1;
}
boolean q1 = hs.contains(this.s[j]);
boolean q2 = (j == j);
jstop = q1 || q2;
if (!q1) {
hs.add(this.s[j]);
y = j;
j = q2 ? j : j + 1;
}
}
int l3 = lswrc[1] - lswrc[0] + 1;
int l4 = y - x + 1;
lswrc = (l3 > l4) ? lswrc : new int[] {x, y};
return lswrc;
}
}
Dynamic Programming
For all substring s[i : j]
, we have i <= j
.
If s[i : j]
has repeating characters, then all s[a : b]
with a <= i && j <= b
have repeating characters too. We can describe this relationship by table A
according to Bottom-Up Dynamic Programming:
8|111100000
7|11110000
6|1111000
5|110000
4|11000
3|1100
2|110
1|10
0|0
+---------
012345678
xxxcabcxy
Here, A[i][j]
equals
-
0
:s[i : j]
does not have repeating characters -
1
:s[i : j]
has repeating characters
Then the relationship can be visualized by a rectangle area:
8|1111
7|1111
6|1111
5|
4|
3|
2|
1|
0|
+---------
012345678
xxxcabcxy
Here all substrings contains[3 : 6] = "cabc"
have repeating characters.
The longest substring without repeating characters are bounded by a line with all 1
parallel to j = i (single character line):
8| 1 0
7| 1 0
6| 1 0
5|1 0
4| 0
3| 0
2| 0
1| 0
0|0
+---------
012345678
xxxcabcxy
How to check if s[i : j]
has repeating characters or not? Suppose the set of characters of s[i : j]
is B[i][j]
. Since there exists:
s[i : j] = s[i : k] + s[k+1 : j]
we have:
B[i][j] = B[i][k] union B[k+1][j]
if:
B[i][k] intersect B[k+1][j] == Empty
Then s[i : j]
does not have repeating characters.
What's more, if we compute the triangle from j
axle:
8|11 000
7|11 000
6|11 000
5|11?000
4|11000
3|1100
2|110
1|10
0|0
+---------
012345678
xxxcabcxy
s[i : j] = s[i : j-1] + s[j : j]
. In this direction, B[i][j-1]
of s[i : j-1]
is computed. Then we only check if s[j]
is in B[i][j - 1]
or not.
For instance, s[2 : 5] = s[2 : 4] + s[5 : 5]
. B[2][4] = {"x", "a", "c"}
, s[5] = "b"
. Then A[2][5] = 0
.
Complexity
The space complexity is huge. For A
, O(n^2) is needed. For B
, since the elements of the table are all sets, the worst case is O(n^3). The algorithm visits all A[i][j]
, thus at least O(n^2) time complexity. For HashSet
,
B[i][j-1].contians(s[j])
B[i][j-1].add(s[j])
are all O(1). However, if we do not reduce the visiting times of 1
points like A[0][n-1]
, the total time complexity would be much more than O(n^2).
class Solution {
public int lengthOfLongestSubstring(String s) {
if (s.equals("")) { return 0; }
String[] x = s.split("");
int n = s.length();
int[] A = new int[f(n-1, n-1) + 1];
HashSet<String>[] B = new HashSet[f(n-1, n-1) + 1];
for (int i=0; i<n; i++) {
A[f(i, i)] = 0;
B[f(i, i)] = new HashSet<String>();
B[f(i, i)].add(x[i]);
}
int c = 1;
boolean flag = true;
for( ; c<n; c++) {
flag = true;
for (int i=0; i<n-c; i++) {
int j = i + c;
if (A[f(i, j)] != 1) {
if (B[f(i, j-1)].contains(x[j])) {
// to be optimized
for (int ii=0; ii<=i; ii++) {
for (int jj=n-1; j<=jj; jj--) {
A[f(ii, jj)] = 1;
}
}
}
else {
A[f(i, j)] = 0;
B[f(i, j)] = B[f(i, j-1)];
B[f(i, j)].add(x[j]);
}
}
flag = flag && (A[f(i, j)] == 1);
}
if (flag) { return c; }
}
return c;
}
private int f(int i, int j) { return i + (j * (j + 1)) / 2; }
}
Improvement
Like LCS problem in CLRS, we solve the problem in O(n) space complexity. Different from LCS, we use a queue instead of an array.
Here we give a looser condition: if A[i][j]==1
, then for all j<=k
, A[i][k]=1
. In other words, we just check one column of the table instead of the rectangle area, because of the sequence the improved algorithm visiting table A
.
For table A
, we make it a queue. It is updated as following for input xxxcabcxy
:
44,43,41,38,34,29,23,16,08
42,40,37,33,28,22,15,07
39,36,32,27,21,14,06
35,31,26,20,13,05
30,25,19,12,04
24,18,11,03
17,10,02
09,01
00
A i+(j*(j+1))/2
-------------------------------------
0 [00]
00 [00,01]
000 [00,01,02]
0000 [00,01,02,03]
00000 [00,01,02,03,04]
000000 [00,01,02,03,04,05]
0000000 [00,01,02,03,04,05,06]
00000000 [00,01,02,03,04,05,06,07]
000000000 [00,01,02,03,04,05,06,07,08]
000000001 [01,02,03,04,05,06,07,08,09]
000000011 [02,03,04,05,06,07,08,09,10]
000000110 [03,04,05,06,07,08,09,10,11]
000001100 [04,05,06,07,08,09,10,11,12]
000011000 [05,06,07,08,09,10,11,12,13]
000110000 [06,07,08,09,10,11,12,13,14]
001100000 [07,08,09,10,11,12,13,14,15]
011000000 [08,09,10,11,12,13,14,15,16]
11000000 [09,10,11,12,13,14,15,16]
10000001 [10,11,12,13,14,15,16,17]
00000011 [11,12,13,14,15,16,17,18]
00000110 [12,13,14,15,16,17,18,19]
00001100 [13,14,15,16,17,18,19,20]
00011000 [14,15,16,17,18,19,20,21]
00110000 [15,16,17,18,19,20,21,22]
01100000 [16,17,18,19,20,21,22,23]
1100000 [17,18,19,20,21,22,23]
1000001 [18,19,20,21,22,23,24]
0000011 [19,20,21,22,23,24,25]
0000110 [20,21,22,23,24,25,26]
0001101 [21,22,23,24,25,26,27]
0011010 [22,23,24,25,26,27,28]
0110100 [23,24,25,26,27,28,29]
110100 [24,25,26,27,28,29]
101001 [25,26,27,28,29,30]
010011 [26,27,28,29,30,31]
100111 [27,28,29,30,31,32]
001111 [28,29,30,31,32,33]
011110 [29,30,31,32,33,34]
11110 [30,31,32,33,34]
11101 [31,32,33,34,35]
11011 [32,33,34,35,36]
10111 [33,34,35,36,37]
01111 [34,35,36,37,38]
1111 [35,36,37,38]
RETURN 5
B
updates synchronously with A
. However, the running time is still O(n^2).
import java.util.LinkedList;
import java.util.Queue;
import java.util.HashSet;
class Solution {
public int lengthOfLongestSubstring(String s) {
if (s.equals("")) { return 0; }
String[] x = s.split("");
int n = s.length();
Queue<Integer> A = new LinkedList<Integer>();
Queue<HashSet<String>> B = new LinkedList<HashSet<String>>();
// O(n)
for (int i=0; i<n; i++) {
HashSet<String> b = new HashSet<String>();
b.add(x[i]);
A.offer(0);
B.offer(bi);
}
int c = 1;
boolean flag = true;
for( ; c<n; c++) {
flag = true;
for (int i=0; i<n-c; i++) {
int j = i + c;
int a = A.poll();
HashSet<String> b = B.poll();
// [i][j-1] has been removed
if (a != 1) {
if (b.contains(x[j])) {
A.offer(1);
B.offer(b);
flag = flag && true;
}
else {
b.add(x[j]);
A.offer(0);
B.offer(b);
flag = flag && false;
}
}
else {
A.offer(1);
B.offer(b);
flag = flag && true;
}
}
A.poll();
B.poll();
if (flag) {return c;}
}
return c;
}
}