后缀数组

数据处理中最基础的问题之一是从文本T中找到一段模式P所在的位置。而后缀数组与后缀树就是解决这类问题的数据结构（两者基本等价，就是用空间换时间）。

后缀数组的定义

关于一段文本T的后缀数组其实就是T的所有 后缀字串 有序排列所组成的一个数组。例如，假设文本字符串T为banana，则banana的后缀数组如下图所示。

图1："banana"的后缀数组

直接存储所有后缀的文本，需要平方级的空间（）。所以实际的实现方法通常是只存后缀字串在文本T中的起始下标（即substring的参数），而不是整个子串。

图2：仅存起始下标的后缀数组

所以得到一个字符串的后缀数组的主要工作就是排序这些后缀字串。后面会介绍一种O(N)时间复杂度的算法构建后缀数组。
另外，得到后缀数组后，对数组中每一对连续排放的子串计算最长公共前缀（Longest Common Prefix, LCP）的长度，得到的数组称为LCP数组。计算的方法就是针对每个后缀数组中的子串，计算自身与前一个子串的LCP，并将LCP的长度存储即可。

图3：LCP数组

后缀数组及其LCP的用处

判断某个模式P是否出现在文本T中。如果某个模式P出现在文本中，则它必定是某个后缀的前缀。对后缀数组做折半查找可以快速确定模式P是否在文本中。比如判断ann是否在banana文本中，第一次折半查找(0+5)/2=2，anana小于ann；第二次折半查找(2+5)/2=3，banana大于ann；此时(2,3)之间已经没有其他元素，而anana和banana都不是以ann为前缀，所以折半查找结束，结果为不存在。查询时间为 $O(|P|log|T|)$ ，其中 $log|T|$ 是折半查找造成的， $|P|$ 是每步比较的开销。
找到某个模式P在文本T中出现的次数。如果某个模式P在文本T中出现多次，那么这些后缀子串会被顺序存储在后缀数组中。单纯的利用后缀数组进行折半查找也可以实现，但是利用LCP会加速这个过程： $O(|P|+log|T|)$ 。比如查找ana在banana中出现的次数，第一次折半查找为下标为2的字符串anana，发现此字符串满足ana前缀，此时只需要在LCP数组中，以下标2为起点，前后依序遍历，查找LCP值大于等于"ana".length，即3，的元素，一旦发现小于3的值即停止遍历。结果是：只有下标为2的LCP值大于等于3，那么ana在T中出现的次数就为1+1次。第一个1是折半查找到匹配的元素，第二个1是LCP数组中查找到的值的个数。

后缀数组的计算

下面给出的是后缀数组和LCP数组的简单算法。

后缀数组计算的运行时间主要花在排序步骤上（Arrays.sort），用了 $O(NlogN)$ 次比较， $N=|T|$ ，每次比较两个字符串，依赖于 $LCP(s1, s2)$ 的计算复杂度，最坏情形为 $O(N)$ ，所以排序步骤的总时间复杂度最坏为 $O(N^{2}logN)$ 。

后面会给出一个以线性时间计算后缀数组的算法，利用分治的思想和基数排序实现。

 /*
     * Returns the LCP for any two strings
     */
    public static int computeLCP( String s1, String s2 )
    {
        int i = 0;
        
        while( i < s1.length( ) && i < s2.length( ) && s1.charAt( i ) == s2.charAt( i ) )
            i++;
        
        return i;
    }

    /*
     * Fill in the suffix array and LCP information for String str
     * @param str the input String
     * @param SA existing array to place the suffix array
     * @param LCP existing array to place the LCP information
     * Note: Starting in Java 7, this will use quadratic space.
     */
    public static void createSuffixArraySlow( String str, int [ ] SA, int [ ] LCP )
    {
        if( SA.length != str.length( ) || LCP.length != str.length( ) )
            throw new IllegalArgumentException( );
        
        int N = str.length( );
        
        String [ ] suffixes = new String[ N ];
        for( int i = 0; i < N; i++ )
            suffixes[ i ] = str.substring( i );
        
        Arrays.sort( suffixes );
        
        for( int i = 0; i < N; i++ )
            SA[ i ] = N - suffixes[ i ].length( );
        
        LCP[ 0 ] = 0;
        for( int i = 1; i < N; i++ )
            LCP[ i ] = computeLCP( suffixes[ i - 1 ], suffixes[ i ] );
    }

线性时间构建后缀数组

构建后缀数组实际就是解决所有后缀子串的排序问题。算法用到了分治，基本思路如下：
1.选择所有后缀子串的一个样本集A。
2.用递归将样本集A排序。
3.利用当前有序的后缀样本集A将剩下的后缀集合B排序。
4.归并A和B。

为什么步骤3会管用的说明：设后缀样本集A是所有从奇数下标开始的后缀，则后缀样本集B就是从偶数下标开始的后缀。所以，设我们已经计算了后缀样本集A的有序集合，现在要利用A计算后缀样本集B的有序集合，即对所有从偶数下标开始的后缀进行排序。这些后缀每个都有一个首字母在偶数位置上，后面跟着一个字符串，其开始于第二个字符（奇数位置），所以这个字符串正是A中的一个字符串。于是要对B中所有后缀排序，我们可以利用类似于基数排序的做法。首先将B中的字符串按照从第二个字符开始的字符串进行排序，花费线性时间可以完成，因为A的有序顺序是已知的。然后对B中字符串的按照首字母进行排序，也是线性时间。

要想使上述提出的算法能够在线性时间内完成，需要使得第2步的递归调用和第3步的归并步骤都能够在线性时间内完成。下面对算法的描述中会使用到的符号先做说明：
S[i]：表示字符串S的第i个字符
S[i]S[i+1]S[i+2]：由S中从i开始的连续3个字符组成的一个字符串，但是我们将这个字符串看成一个字符，称之为三字符（tri-character）
S[i=>]：表示S的从下标i起始的后缀子串，是个字符串
<>：表示数组

步骤1：对字符串中出现过的所有字符进行排序（基数排序），并分配映射数字（从1开始），然后将字符串用这些数字表示。

步骤2：利用原始的字符数组构建如下3个新的数组（三字符数组）：
S0：<S[3i]S[3i+1]S[3i+2]>，其中i=0,1,2...
S1：<S[3i+1]S[3i+2]S[3i+3]>，其中i=0,1,2...
S2：<S[3i+2]S[3i+3]S[3i+4]>，其中i=0,1,2...
说明：S0，S1，S2每个由约N/3个符号组成，但是符号不再是原始的单字符，而是三字符。关键的性质是：S0，S1，S2的后缀组合起来形成了S的后缀。

于是一个思路就是递归地计算S0，S1，S2地后缀，然后在线性时间内归并结果。其运行时间方程为： $T(N)=3T(N/3)+O(N^{1})$ 。

注：方程 $T(N)=aT(N/b)+O(N^{k})$ 的通解为：

从而上述的运行时间方程属于第二种情况，是一个O(NlogN)的算法。因此我们进行调整，通过递归计算其中两个后缀分组，并利用该信息计算第三个后缀分组（线性时间），从而避免了一次递归调用，此思路的运行时间方程为： $T(N)=2T(N/3)+O(N^{1})$ ，其解属于通解的第三种情况，即O(N)。

步骤3：将S1和S2接起来，递归计算后缀数组。为了计算这个后缀数组，我们要将新的三字符的字母表排序。用三趟基数排序可以在线性时间内完成这一步，因为单字符在步骤1已经排序好了。

步骤4：对S0计算后缀数组
$S_{0}[i =>] = S[3i =>] = S[3i]S[3i+1 =>] = S[3i]S_{1}[i =>] = S_{0}[i]S_{1}[i =>]$
递归调用已经将所有 $S_{1}[i =>]$ 排序了，因此我们可以用简单的两趟基数排序来做步骤4：第一趟在 $S_{1}[i =>]$ 上做，第二趟在 $S_{0}[i]$ 上做。

步骤5：用归并两个有序表的标准算法将两个后缀数组归并。仅有的问题是我们必须能够在常数时间内比较每一对后缀。有两种情况：
情况1：比较S0的元素和S1的元素。比较首字母，如果他们不一样，那么比较结束；否则，将S0剩下的部分（即S1的后缀）与S1剩下的部分（即S2的后缀）相比，这两部分已经在步骤3排序过了，所以结束。
情况2：比较S0的元素和S2的元素。比较S0的前两个元素和S2的前两个元素，如果不一样，那么比较结束；否则，将S0剩下的部分（S2的后缀）和S2剩下的部分（S1的后缀）相比，这两部分已经在步骤3排序过了，所以结束。
总结情况1和2，得到步骤5可在线性时间归并完毕。

具体的例子，详见书本。代码如下。

/*
     * Fill in the suffix array information for String str
     * @param str the input String
     * @param sa existing array to place the suffix array
     */
    public static void createSuffixArray( String str, int [ ] sa, int [ ] LCP )
    {        
        int N = str.length( );
        // 避免边界情况出问题, 数组大小加3
        int [ ] s = new int[ N + 3 ];
        int [ ] SA = new int[ N + 3 ];
        
        for( int i = 0; i < N; i++ )
            s[ i ] = str.charAt( i );
        
        makeSuffixArray( s, SA, N, 256 );
        
        for( int i = 0; i < N; i++ )
            sa[ i ] = SA[ i ];
        
        makeLCPArray( s, sa, LCP );
    }

// find the suffix array SA of s[0..n-1] in {1..K}^n
    // require s[n]=s[n+1]=s[n+2]=0, n>=2
    public static void makeSuffixArray( int [ ] s, int [ ] SA, int n, int K )
    {
        int n0 = ( n + 2 ) / 3;
        int n1 = ( n + 1 ) / 3;
        int n2 = n / 3;
        int t = n0 - n1;  // 1 if n%3 == 1
        int n12 = n1 + n2 + t;

        int [ ] s12  = new int[ n12 + 3 ];
        int [ ] SA12 = new int[ n12 + 3 ];
        int [ ] s0   = new int[ n0 ];
        int [ ] SA0  = new int[ n0 ];
        
        // generate positions in s for items in s12
        // the "+t" adds a dummy mod 1 suffix if n%3 == 1
        // at that point, the size of s12 is n12
        for( int i = 0, j = 0; i < n + t; i++ )
            if( i % 3 != 0 )
                s12[ j++ ] = i;
        
        int K12 = assignNames( s, s12, SA12, n0, n12, K );
  
        computeS12( s12, SA12, n12, K12 );
        computeS0( s, s0, SA0, SA12, n0, n12, K );
        merge( s, s12, SA, SA0, SA12, n, n0, n12, t );
    }
    
    // Assigns the new supercharacter names.
    // At end of routine, SA will have indices into s, in sorted order
    // and s12 will have new character names
    // Returns the number of names assigned; note that if
    // this value is the same as n12, then SA is a suffix array for s12.
    private static int assignNames( int [ ] s, int [ ] s12, int [ ] SA12,
                                   int n0, int n12, int K )
    {
           // radix sort the new character trios
        radixPass( s12 , SA12, s, 2, n12, K );
        radixPass( SA12, s12 , s, 1, n12, K );  
        radixPass( s12 , SA12, s, 0, n12, K );

          // find lexicographic names of triples
        int name = 0;
        int c0 = -1, c1 = -1, c2 = -1;
      
        for( int i = 0; i < n12; i++ )
        {
            if( s[ SA12[ i ] ] != c0 || s[ SA12[ i ] + 1 ] != c1
                                     || s[ SA12[ i ] + 2 ] != c2 )
            { 
                name++;
                c0 = s[ SA12[ i ] ];
                c1 = s[ SA12[ i ] + 1 ];
                c2 = s[ SA12[ i ] + 2 ];
            }
          
            if( SA12[ i ] % 3 == 1 )
                s12[ SA12[ i ] / 3 ]      = name;
            else
                s12[ SA12[ i ] / 3 + n0 ] = name; 
       }
      
       return name;
    }
    
    
    // stably sort in[0..n-1] with indices into s that has keys in 0..K
    // into out[0..n-1]; sort is relative to offset into s
    // uses counting radix sort
    private static void radixPass( int [ ] in, int [ ] out, int [ ] s, int offset,
                                  int n, int K ) 
    { 
        int [ ] count = new int[ K + 2 ];            // counter array
        
        for( int i = 0; i < n; i++ )
            count[ s[ in[ i ] + offset ] + 1 ]++;    // count occurences
        
        for( int i = 1; i <= K + 1; i++ )            // compute exclusive sums
            count[ i ] += count[ i - 1 ];

        for( int i = 0; i < n; i++ )
            out[ count[ s[ in[ i ] + offset ] ]++ ] = in[ i ];      // sort
    } 
    
    // stably sort in[0..n-1] with indices into s that has keys in 0..K
    // into out[0..n-1]
    // uses counting radix sort
    private static void radixPass( int [ ] in, int [ ] out, int [ ] s, int n, int K ) 
    { 
        radixPass( in, out, s, 0, n, K );
    }
   

    // Compute the suffix array for s12, placing result into SA12
    private static void computeS12( int [ ] s12, int [ ] SA12, int n12, int K12 )
    {
        if( K12 == n12 ) // if unique names, don't need recursion
            for( int i = 0; i < n12; i++ )
                SA12[ s12[i] - 1 ] = i; 
        else
        {
            makeSuffixArray( s12, SA12, n12, K12 );
            // store unique names in s12 using the suffix array 
            for( int i = 0; i < n12; i++ )
                s12[ SA12[ i ] ] = i + 1;
        }
    }
    
    private static void computeS0( int [ ] s, int [ ] s0, int [ ] SA0, int [ ] SA12,
                               int n0, int n12, int K )
    {
        for( int i = 0, j = 0; i < n12; i++ )
            if( SA12[ i ] < n0 )
                s0[ j++ ] = 3 * SA12[ i ];
        
        radixPass( s0, SA0, s, n0, K );
    }
    
    
    // merge sorted SA0 suffixes and sorted SA12 suffixes
    private static void merge( int [ ] s, int [ ] s12,
                              int [ ] SA, int [ ] SA0, int [ ] SA12,
                              int n, int n0, int n12, int t )
    {      
        int p = 0, k = 0;
        
        while( t != n12 && p != n0 )
        {
            int i = getIndexIntoS( SA12, t, n0 ); // S12
            int j = SA0[ p ];                     // S0
            
            if( suffix12IsSmaller( s, s12, SA12, n0, i, j, t ) )
            { 
                SA[ k++ ] = i;
                t++;
            }
            else
            { 
                SA[ k++ ] = j;
                p++;
            }  
        } 
        
        while( p < n0 )
            SA[ k++ ] = SA0[ p++ ];
        while( t < n12 )
            SA[ k++ ] = getIndexIntoS( SA12, t++, n0 ); 
    }
    
    private static int getIndexIntoS( int [ ] SA12, int t, int n0 )
    {
        if( SA12[ t ] < n0 )
            return SA12[ t ] * 3 + 1;
        else
            return ( SA12[ t ] - n0 ) * 3 + 2;
    }
    
    private static boolean leq( int a1, int a2, int b1, int b2 )
      { return a1 < b1 || a1 == b1 && a2 <= b2; }
    
    private static boolean leq( int a1, int a2, int a3, int b1, int b2, int b3 )
      { return a1 < b1 || a1 == b1 && leq( a2, a3,b2, b3 ); }
    
    private static boolean suffix12IsSmaller( int [ ] s, int [ ] s12, int [ ] SA12,
                                             int n0, int i, int j, int t )
    {
        if( SA12[ t ] < n0 )  // s1 vs s0; can break tie after 1 character
            return leq( s[ i ], s12[ SA12[ t ] + n0 ],
                        s[ j ], s12[ j / 3 ] );
        else                  // s2 vs s0; can break tie after 2 characters
            return leq( s[ i ], s[ i + 1 ], s12[ SA12[ t ] - n0 + 1 ],
                        s[ j ], s[ j + 1 ], s12[ j / 3 + n0 ] );
    }

/*
     * Create the LCP array from the suffix array
     * @param s the input array populated from 0..N-1, with available pos N
     * @param sa the already-computed suffix array 0..N-1
     * @param LCP the resulting LCP array 0..N-1
     */
    public static void makeLCPArray( int [ ] s, int [ ] sa, int [ ] LCP )
    {
        int N = sa.length;
        int [ ] rank = new int[ N ];
        
        s[ N ] = -1;
        for( int i = 0; i < N; i++ )
            rank[ sa[ i ] ] = i;
        
        int h = 0;
        for( int i = 0; i < N; i++ )
            if( rank[ i ] > 0 )
            {
                int j = sa[ rank[ i ] - 1 ];
                
                while( s[ i + h ] == s[ j + h ] )
                    h++;
                
                LCP[ rank[ i ] ] = h;
                if( h > 0 )
                    h--;
            }
    }

后缀数组

后缀数组的定义

后缀数组及其LCP的用处

后缀数组的计算

线性时间构建后缀数组

相关阅读更多精彩内容

友情链接更多精彩内容