5.1 String Sorts: Notes and Understanding

Before We Start

Reminder: compare-based sorting algorithms need ~N log N compares as a lower bound.
Can we do better?
- Yes, as long as the algorithm does not rely on key compares.

Key-indexed Counting

Assume we want to

  • sort an array a[] of length N
  • each key in a[] takes one of R possible values
  • the keys are integers between 0 and R-1
    e.g. a = {4, 3, 1, 1, 0, 2, 3}, N = 7, R = 5 (0,1,2,3,4)
  • if the keys are characters from an alphabet, map each character to its index in the alphabet,
    so that e.g. a.key() == 0, b.key() == 1 ... f.key() == 5
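
The character-to-integer mapping can be sketched as a subtraction from the first character of the alphabet. This is only an illustration: the class name AlphabetKeys, the helper toKey, and the choice of the six-letter alphabet a..f are assumptions, not part of the original notes.

```java
// Hypothetical helper: maps a character of a small alphabet to an integer key.
final class AlphabetKeys {
    // Assumption: the alphabet is the contiguous range 'a'..'f', so R = 6.
    static final char FIRST = 'a';
    static final int R = 6;

    // Returns the index of c in the alphabet, i.e. a key between 0 and R-1.
    static int toKey(char c) {
        int key = c - FIRST;
        if (key < 0 || key >= R) {
            throw new IllegalArgumentException("character outside alphabet: " + c);
        }
        return key;
    }
}
```

With this mapping, toKey('a') is 0 and toKey('f') is 5, which matches the a.key() == 0 ... f.key() == 5 convention above.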

Intuition: use key as an array index.

1. Compute frequency counts:

  • create array count[] of length R + 1 (see next step why R + 1)
  • loop through a[] and count the frequency of each key
// Compute frequency counts
for (int i = 0; i < N; i++) {
    count[a[i].key() + 1]++; // offset by 1, so count[0] stays 0 (see next step)
}

Result:

  • count[0] is always 0
  • key v appears count[v+1] times in a[]

2. Transform counts to indices:

  • calculate the cumulative counts
// loop until the second last index of count[]
for (int i = 0; i < R; i++) {  
    count[i+1] += count[i];
}

Result:

  • count[v] equals the number of keys in a[] that are smaller than v, so count[v] is the sorted position of the first v in a[]
    e.g. v=3, a = {0,1,1,2,3,3...}: there are 4 keys before the first 3, so count[3] == 4
  • count[0] is always 0 since no key is smaller than 0 (or, for characters, smaller than the first character of the alphabet, i.e. a)
  • count[R] is always N since all keys are smaller than R (our keys are between 0 and R-1, i.e. between 0 and 4 in the example)
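
To make steps 1 and 2 concrete, here is a small sketch that computes count[] for plain integer keys. The class and method names (CountTrace, frequencies, toIndices) are made up for this illustration.

```java
// Sketch: traces count[] through steps 1 and 2 for plain integer keys.
final class CountTrace {
    // Step 1: count[v + 1] holds the frequency of key v (keys are in 0..R-1).
    static int[] frequencies(int[] keys, int R) {
        int[] count = new int[R + 1];
        for (int key : keys) {
            count[key + 1]++;
        }
        return count;
    }

    // Step 2: running sums turn frequencies into starting indices.
    // Works on a copy so the frequency array stays available for inspection.
    static int[] toIndices(int[] count) {
        int[] index = count.clone();
        for (int r = 0; r + 1 < index.length; r++) {
            index[r + 1] += index[r];
        }
        return index;
    }
}
```

For a = {4, 3, 1, 1, 0, 2, 3} and R = 5, frequencies gives {0, 1, 2, 1, 2, 1} and toIndices gives {0, 1, 3, 4, 6, 7}: the first 3 belongs at index 4, and the last entry equals N = 7.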

3. Distribute the data

  • create an auxiliary array aux[] of length N
for (int i = 0; i < a.length; i++) {
    aux[count[a[i].key()]++] = a[i];
}
  • as mentioned in step 2, count[v] holds the correct position for the first item with key v in a[]
  • so, loop through a[], and for each item a[i] use count[a[i].key()] as its position, i.e. put it at aux[count[a[i].key()]]
  • increment count[a[i].key()] every time we distribute an item, since the second item with that key belongs one position later, the third two positions later, and so forth...

Result:

  • aux[] now contains the items of a[] in sorted order, so the last step is to copy them back

4. Copy back

// Copy back.
for (int i = 0; i < N; i++) {
     a[i] = aux[i];
}
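
Put together, the four steps fit in one short method. The following is a minimal sketch for plain integer keys (so a[i] plays the role of a[i].key()); the class name KeyIndexedCounting is chosen for this example only.

```java
// Sketch: the four steps assembled into one method for plain integer keys.
final class KeyIndexedCounting {
    // Sorts a[] in place; every key is assumed to lie in 0..R-1.
    static void sort(int[] a, int R) {
        int N = a.length;
        int[] count = new int[R + 1];
        int[] aux = new int[N];

        for (int i = 0; i < N; i++) {      // 1. compute frequency counts
            count[a[i] + 1]++;
        }
        for (int r = 0; r < R; r++) {      // 2. transform counts to indices
            count[r + 1] += count[r];
        }
        for (int i = 0; i < N; i++) {      // 3. distribute
            aux[count[a[i]]++] = a[i];
        }
        for (int i = 0; i < N; i++) {      // 4. copy back
            a[i] = aux[i];
        }
    }
}
```

Calling KeyIndexedCounting.sort(a, 5) on a = {4, 3, 1, 1, 0, 2, 3} leaves a as {0, 1, 1, 2, 3, 3, 4}. No two keys are ever compared with each other.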

Analysis

  • Initializing the arrays count[] and aux[] takes N+R+1 array accesses
  • The first loop computes the frequencies by incrementing a counter for each of the N items (2N array accesses)
  • The second loop computes the cumulative counts with R additions (2R array accesses)
  • The third loop does N counter increments and N data moves (3N array accesses)
  • The fourth loop does N data moves (2N array accesses)

Therefore, the total is (N+R+1) + 2N + 2R + 3N + 2N = 8N + 3R + 1 array accesses.

Stability

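The distribute step scans a[] from left to right and places items with the same key at increasing positions, so items with equal keys keep their relative order: the algorithm is stable.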
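
Stability only becomes visible when items carry data besides the key. The sketch below uses a hypothetical Item type (a name plus a small integer key, e.g. a section number) to show it; none of these names come from the original notes.

```java
// Sketch: key-indexed counting on items that carry data besides the key.
final class StableCountingDemo {
    // Hypothetical item type: a name plus a small integer key.
    static final class Item {
        final String name;
        final int key;
        Item(String name, int key) { this.name = name; this.key = key; }
    }

    // Sorts a[] by key; every key is assumed to lie in 0..R-1.
    static void sort(Item[] a, int R) {
        int N = a.length;
        int[] count = new int[R + 1];
        Item[] aux = new Item[N];
        for (int i = 0; i < N; i++) count[a[i].key + 1]++;
        for (int r = 0; r < R; r++) count[r + 1] += count[r];
        // Scanning left to right places equal keys in their original order.
        for (int i = 0; i < N; i++) aux[count[a[i].key]++] = a[i];
        for (int i = 0; i < N; i++) a[i] = aux[i];
    }

    // Returns the names in array order, separated by spaces, for inspection.
    static String names(Item[] a) {
        StringBuilder sb = new StringBuilder();
        for (Item it : a) sb.append(it.name).append(' ');
        return sb.toString().trim();
    }
}
```

If the input is ordered by name as Anderson(2), Brown(3), Davis(3), Garcia(1), Harris(1), Jackson(2), then sorting with R = 4 gives Garcia, Harris, Anderson, Jackson, Brown, Davis: within each key the names are still in alphabetical order.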

Least-significant-digit first (LSD) string sort

  • do Key-indexed Counting from right-most character to left
  • need strings of equal length
  • stability of key-indexed counting ensures the correctness of LSD


    (Figure: Typical candidate for LSD string sort)
public class LSD
  {
     public static void sort(String[] a, int W)
     {  // Sort a[] on leading W characters.
        int N = a.length;
        int R = 256;
        String[] aux = new String[N];
        for (int d = W-1; d >= 0; d--)
        { // Sort by key-indexed counting on dth char.
           int[] count = new int[R+1];     // Compute frequency counts.
           for (int i = 0; i < N; i++)
               count[a[i].charAt(d) + 1]++;
           for (int r = 0; r < R; r++)     // Transform counts to indices.
              count[r+1] += count[r];
           for (int i = 0; i < N; i++)     // Distribute.
              aux[count[a[i].charAt(d)]++] = a[i];
           for (int i = 0; i < N; i++)     // Copy back.
              a[i] = aux[i];
       } 
     }
   }
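
To see why stability matters for LSD, it helps to look at the array after each pass. The sketch below separates a single pass from the full sort; the class name LsdTrace and the method pass are assumptions made for this illustration, and unlike the code above it returns a new array instead of sorting in place.

```java
// Sketch: one LSD pass, i.e. key-indexed counting on the d-th character only.
final class LsdTrace {
    static final int R = 256; // assumption: extended-ASCII characters

    // Returns a new array stably sorted by the character at position d.
    static String[] pass(String[] a, int d) {
        int N = a.length;
        int[] count = new int[R + 1];
        String[] aux = new String[N];
        for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++;
        for (int r = 0; r < R; r++) count[r + 1] += count[r];
        for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i];
        return aux;
    }

    // Full LSD sort: apply the pass from the last character to the first.
    // All strings are assumed to have length W.
    static String[] sort(String[] a, int W) {
        String[] result = a.clone();
        for (int d = W - 1; d >= 0; d--) {
            result = pass(result, d);
        }
        return result;
    }
}
```

For {dab, cab, ace, bad, add} with W = 3, the pass on d = 2 gives {dab, cab, bad, add, ace}, the pass on d = 1 gives {dab, cab, bad, ace, add}, and the pass on d = 0 gives {ace, add, bad, cab, dab}. In the last pass, ace ends up before add only because the earlier pass ordered them and the counting step kept that order.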

Most-significant-digit-first (MSD) string sort

Basic Idea

  • string lengths can be different
  • apply key-indexed counting to the left-most (first) character
  • this partitions the list so that strings with the same first character are in the same group
  • recursively sort each group on the remaining characters, starting at the next character position


    (Figure: Overview of MSD string sort)

Implementation

public class MSD {
    // radix (alphabet size, extended ASCII)
    private static final int R = 256;
    // cutoff for switching to insertion sort
    private static final int M = 15;
    // auxiliary array for distribution
    private static String[] aux;

    // client sort
    public static void sort(String[] a) {
        int N = a.length;
        aux = new String[N];
        sort(a, 0, N-1, 0);
    }

    // sort a[lo..hi], whose strings already agree on their first d characters
    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo + M) {
            // small subarray: insertion sort that compares from the dth character
            Insertion.sort(a, lo, hi, d);
            return;
        }
        int[] count = new int[R + 2];
        // compute frequencies (+2: the usual offset plus one slot for the -1 end-of-string key)
        for (int i = lo; i <= hi; i++) {
            count[charAt(a[i], d) + 2]++;
        }
        // convert counts to indices
        for (int r = 0; r < R + 1; r++) {
            count[r+1] += count[r];
        }
        // distribute
        for (int i = lo; i <= hi; i++) {
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        }
        // copy back (aux is filled from index 0, a from index lo)
        for (int i = lo; i <= hi; i++) {
            a[i] = aux[i - lo];
        }
        // recursively sort the subarray of each character value;
        // strings that ended at position d are already in place at the front
        for (int r = 0; r < R; r++) {
            sort(a, lo + count[r], lo + count[r+1] - 1, d+1);
        }
    }

    // character at position d, or -1 if d is past the end of s
    private static int charAt(String s, int d) {
        if (d < s.length()) return s.charAt(d); else return -1;
    }
}

Detailed explanations of the code follow.

Problems & How to address them

1. Strings with different lengths
  • End-of-string convention: treat the end of each string as the key -1, smaller than every character, so that a string whose characters have all been examined comes first in the sorted order of its subarray
  • To accomplish that, count[] needs length R+2 and every index is shifted by one more position: after the cumulative step, count[1] is the number of strings whose key is smaller than 0 (the strings that have already ended) ... and count[R+1] is the number of strings whose key is smaller than R, which is the size of the subarray being sorted (hi - lo + 1), since our keys are between -1 and R-1
  • also, the charAt() method must return -1 when d is past the end of a string
private static int charAt(String s, int d) {
  if (d < s.length()) return s.charAt(d);
  else return -1;
}
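
To watch the end-of-string convention work on its own, here is a sketch of MSD without the insertion sort cutoff, so every subarray goes through key-indexed counting. The class name MsdNoCutoff and the decision to pass aux as a parameter are choices made for this illustration; it is not meant as an efficient implementation.

```java
// Sketch: MSD string sort without the cutoff, to isolate the
// end-of-string convention. Names are illustrative only.
final class MsdNoCutoff {
    private static final int R = 256; // assumption: extended-ASCII characters

    static void sort(String[] a) {
        sort(a, new String[a.length], 0, a.length - 1, 0);
    }

    // Character at position d, or -1 once the string has ended.
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    private static void sort(String[] a, String[] aux, int lo, int hi, int d) {
        if (hi <= lo) return; // zero or one string: nothing to do
        int[] count = new int[R + 2];
        for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++) count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];
        // Strings that ended at d now sit in a[lo .. lo + count[0] - 1]
        // and need no further sorting; recurse on each real character.
        for (int r = 0; r < R; r++) {
            sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1);
        }
    }
}
```

Sorting {shells, she, sh} this way gives {sh, she, shells}: at d = 2 the string sh has ended and gets key -1, so it lands before she and shells, and at d = 3 the same happens to she.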
2. Small subarrays
  • Suppose a subarray produced by the previous key-indexed counting pass is small, say {a, b}
  • sorting it recursively with key-indexed counting still creates and scans a count[] array of length R + 2, e.g. R = 256 for extended ASCII strings or even R = 65536 for Unicode
  • so when the number of small subarrays is large, the algorithm becomes inefficient; therefore,

the switch to insertion sort for small subarrays is a must for MSD string sort

  • as in Quicksort / Mergesort, once a subarray is small enough (hi <= lo + M for a cutoff M), we switch to insertion sort
// Sort from a[lo] to a[hi], starting at the dth character.
if (hi <= lo + M) {
    Insertion.sort(a, lo, hi, d);  // d is explained in the step below
    return;
}
    (Figure: Effect of cutoff for small subarrays in MSD string sort)
3. Avoid reexamining equal characters
  • the insertion sort should not compare characters that are already known to be equal
  • so we pass it the position d, meaning all strings in the subarray agree on their first d characters, and it compares from the dth character onward
  • Improved insertion sort:
  public static void sort(String[] a, int lo, int hi, int d)
  {  // Sort from a[lo] to a[hi], starting at the dth character.
     for (int i = lo; i <= hi; i++)
        for (int j = i; j > lo && less(a[j], a[j-1], d); j--)
           exch(a, j, j-1);
  }
  private static boolean less(String v, String w, int d)
  {  return v.substring(d).compareTo(w.substring(d)) < 0;  }
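  • this was efficient in the Java versions where String.substring() took constant time; since Java 7 update 6, substring() copies the characters and takes time proportional to the length of the substring, so comparing the characters directly from position d avoids the copies.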
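
A version of less() that does not depend on the cost of substring() can compare characters one at a time starting at position d. The following is a sketch under that assumption; the class name InsertionFromD is made up, and the swap is written inline instead of calling exch().

```java
// Sketch: insertion sort that compares strings from position d without substring().
final class InsertionFromD {
    // Sorts a[lo..hi], assuming all strings agree on their first d characters.
    static void sort(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++) {
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j];
                a[j] = a[j - 1];
                a[j - 1] = t;
            }
        }
    }

    // Is v less than w, given that their first d characters are equal?
    static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int i = d; i < n; i++) {
            if (v.charAt(i) < w.charAt(i)) return true;
            if (v.charAt(i) > w.charAt(i)) return false;
        }
        // One string is a prefix of the other: the shorter one is smaller.
        return v.length() < w.length();
    }
}
```

For example, sorting {shore, she, shells} with lo = 0, hi = 2, d = 2 gives {she, shells, shore} without ever looking at the shared prefix sh.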
4. Equal keys
  • if two strings are equal, MSD needs to examine every character of both strings
  • the worst case for MSD is when all strings are equal


    (Figure: Characters examined by MSD string sort)

Running time:

MSD string sort uses between 8N + 3R and ~7wN + 3WR array accesses to sort N strings taken from an R-character alphabet, where w is the average string length.


To be completed: Three-way string quicksort
