28 Implement `strstr()`


title: Implement strstr()
tags:
- implement-strstr
- No.28
- simple
- string
- rabin-karp
- finite-automata
- kmp


Problem

Implement strStr().

Return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Example 1:

Input: haystack = "hello", needle = "ll"
Output: 2

Example 2:

Input: haystack = "aaaaa", needle = "bba"
Output: -1

Clarification:

What should we return when needle is an empty string? This is a great question to ask during an interview.

For the purpose of this problem, we will return 0 when needle is an empty string. This is consistent to C's strstr() and Java's indexOf().

Corner Cases

  • empty haystack:
haystack: ""
needle  : ""
  • short haystack:
haystack: "a"
needle  : "abcdefg"
  • empty needle
haystack: "abaskdjflsdf"
needle  : ""

Solutions

Rabin-Karp

Use Rabin-Karp algorithm to match string pattern. If function RabinKarp(String s) returns a hash value for any string with a certain length l, then compare the hash value between needle and substrings in length l in haystack.

If hash value hits the needle, then compare the string character by character. Else, skip. In another word,

HIT \subsetneq MATCH

The design of hash function RabinKarp() goes as following:

Take a large prime number q. Any substring in length l are hashed to this q pool. Thus the larger q is, the fewer frequently the spurious hits are.

Take a radix r for character set \Sigma, which usually is |\Sigma|. For ASCII, r=256.

Compute the hash value according to val = (r * val + s[i]) % q.

The expected running time is O(n):

class Solution {
    private int q = 2671; // Prime number
    private int r = 256;  // size of ASCII

    public int strStr(String haystack, String needle) {
        char[]  h_arr = haystack.toCharArray();
        char[]  n_arr = needle.toCharArray();
        int     lh    = haystack.length();
        int     ln    = needle.length();
        int     pn    = RabinKarp(n_arr);
        int     ph;
        boolean match;
        
        if (ln == 0)            { return 0;  }
        if (lh == 0 || lh < ln) { return -1; }

        for (int i=0; i<lh-ln+1; i++) {
            ph = RabinKarp(Arrays.copyOfRange(h_arr, i, i+ln));
            if (pn == ph) {
                match = true;
                for (int j=i; j<i+ln; j++) {
                    match = match && (h_arr[j] == n_arr[j-i]);
                }
                if (match) { return i; }
            }
        }
        return -1;
    }

    private int RabinKarp(char[] s) {
        int p = 1;
        for (int i=0; i<s.length; i++) {
            p = (p * this.r + s[i]) % this.q;
        }
        return p;
    }
}

Knuth-Morris-Pratt & Finite Automata

Finite Automata is widely used in regular matching in the part of lexical analysis in compiler. And the ingenious constructing method of FA belongs to KMP.

For a pattern string p, suppose the length of it is m. Then we have m+1 kinds of states for FA, including an initialization state. And we have 256 characters for char type. Only when the state is transfered to the final one by input, we say that FA accept the input string(or char array).

Take input ababac and character set {a, b, c} as an instance:

state a b c accepted
0 1 0 0
1 1 2 0 a
2 3 0 0 b
3 1 4 0 a
4 5 2 0 b
5 1 4 6 a
6 c

When compute the transition table, use a state b to record the reset state for mis-matching input. Then b can be updated as b = dfa[x][p[i]]. This b means rolling backwards. It indicates the prefix p[0 : b] should overlap the suffix s[i-b : b] for mis-matching:

x = 3:
s[..., i-4, i-3, i-2, i-1, i]
          p[0,   1,   2,   3,   4, ...]

Running time is O(m \times |\Sigma|) = O(256m) for computing char transition table, O(n) for matching.

class Solution {
    private int[][] dfa;

    public int strStr(String haystack, String needle) {
        char[]  h_arr = haystack.toCharArray();
        char[]  n_arr = needle.toCharArray();
        int     lh    = haystack.length();
        int     ln    = needle.length();        
        int     q     = 0;
        
        if (ln == 0)            { return 0;  }
        if (lh == 0 || lh < ln) { return -1; }

        ASCIItransition(n_arr);

        for (int i=0; i<lh; i++) {
            q = dfa[q][h_arr[i]];
            if (q == ln) {
                return i - ln + 1;
            }
        }

        return -1;
    }

    private void ASCIItransition(char[] p) {
        int m        = p.length;
        int r        = 256;
        int b        = 0;
        dfa          = new int[m][r];
        
        dfa[0][p[0]] = 1;
        for (int i=1; i<m; i++) {
            for (int j=0; j<r; j++) {
                dfa[i][j] = dfa[b][j];
            }
            dfa[i][p[i]] = i + 1;
            b            = dfa[b][p[i]]; 
        }
    }
}
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容

  • The Inner Game of Tennis W Timothy Gallwey Jonathan Cape ...
    网事_79a3阅读 12,428评论 3 20
  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 7,490评论 0 10
  • ‌故老相传,此地有飞鹰守护大地,飞于春,遨于夏,沉于秋,眠于冬。而此地亦有一大蛇,醒于秋,游于冬,而慢慢藏于春,寂...
    郭曦阅读 559评论 0 3
  • 很多时候,对于他人的选择和决定应当予以理解和支持就足够,不需要理性的分析和自以为聪明的判断。因为对于陷在迷茫的人,...
    一_8095阅读 258评论 0 0