5.1 String Sorts: Notes and Understanding

Before Start

Reminder: any compare-based sorting algorithm needs ~ $N\log{N}$ compares in the worst case (this is a lower bound).
Can we do better?
Yes, as long as the algorithm does not rely on key compares.

Key-indexed Counting

Assume we want to

sort a list a[] of length N
a[] contains at most R distinct keys
The keys are Integers between 0 and R-1
e.g. a = {4, 3, 1, 1, 0, 2, 3}, N = 7, R = 5 (0,1,2,3,4)
If the keys come from an alphabet, first map each character to its index in the alphabet:

(figure: alphabet to int)

so that e.g. a.key() == 0, b.key() == 1 ... f.key() == 5
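As a minimal sketch of that mapping, assuming a lowercase alphabet a to z: the class name AlphabetKeys and the helpers toKey and toChar are made up for this example and are not part of the original notes.

```java
class AlphabetKeys {
    // Hypothetical helper: maps 'a'..'z' to the integer keys 0..25.
    static int toKey(char c) {
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("not in alphabet: " + c);
        }
        return c - 'a';
    }

    // Inverse mapping, from a key back to its character.
    static char toChar(int key) {
        return (char) ('a' + key);
    }

    public static void main(String[] args) {
        System.out.println(toKey('a')); // prints 0
        System.out.println(toKey('f')); // prints 5
    }
}
```

With this mapping the radix is R = 26, so count[] would have length 27.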

Intuition: use key as an array index.

1. Compute frequency counts:

create array count[] of length R + 1 (see next step why R + 1)
loop through a[] compute the frequencies of the keys

// Compute frequency counts
for (int i = 0; i < N; i++) {
    count[a[i].key() + 1]++; // offset by 1, so count[0] stays 0 (see next step)
}

Result:

count[0] is always 0
key v appears count[v+1] times in a[]

2. Transform counts to indices:

calculate the cumulative counts

// loop until the second last index of count[]
for (int i = 0; i < R; i++) {  
    count[i+1] += count[i];
}

Result:

count[v] equals the number of keys in a[] that are smaller than $v$, so count[v] is the correct sorted position for the first item with key $v$
e.g. $v=3$ , sorted a = {0,1,1,2,3,3...}: there are 4 keys before the first $v$ , so count[v] == 4
count[0] is always 0 since no key is smaller than 0 or the first key in the alphabet (i.e. a)
count[R] is always N since all keys are smaller than R (our keys are between 0 and R-1)

keys are between 0 and 4
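To make steps 1 and 2 concrete, here is a small sketch that runs only those two loops on the example array from above. It uses plain int keys instead of a[i].key(), and the class name CountTrace and the method startIndices are illustrative names, not from the original.

```java
import java.util.Arrays;

class CountTrace {
    // Steps 1 and 2 only: returns count[] after the cumulative pass.
    static int[] startIndices(int[] keys, int R) {
        int[] count = new int[R + 1];
        for (int key : keys) {
            count[key + 1]++;          // frequency of key is stored at key + 1
        }
        for (int r = 0; r < R; r++) {
            count[r + 1] += count[r];  // running total: keys smaller than r + 1
        }
        return count;
    }

    public static void main(String[] args) {
        int[] a = {4, 3, 1, 1, 0, 2, 3};
        System.out.println(Arrays.toString(startIndices(a, 5)));
        // prints [0, 1, 3, 4, 6, 7]
    }
}
```

Reading the result: count[3] == 4 says the first 3 belongs at index 4, and count[5] == 7 == N.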

3. Distribute the data

create an aux array of length N

for (int i = 0; i < a.length; i++) {
    aux[count[a[i].key()]++] = a[i];
}

As mentioned in step 2, count[v] holds the correct position for the first item with key $v$ in a[].
So loop through a[]; for each a[i], let v = a[i].key() and put a[i] into aux[count[v]].
Then increment count[v], since the correct position for the second item with key $v$ is one slot later, the third is two slots later, and so forth.
Result:
aux[] now contains the items of a[] in sorted order, so the last step is to copy them back.

4. Copy back

// Copy back.
for (int i = 0; i < N; i++) {
     a[i] = aux[i];
}

Analysis

Initializing the arrays count[] and aux[] takes $N+R+1$ array accesses
Computing the frequencies loops through N items, which takes 2N array accesses
Computing the cumulative counts takes 2R array accesses
The third loop does N counter increments and N data moves (3N array accesses)
The fourth loop does N data moves (2N array accesses)

Therefore, array accesses ~ $(N+R+1) + 2N + 2R + 3N + 2N = 8N + 3R + 1$

Stability

The distribute loop scans a[] from left to right and places items with equal keys into successive slots, so their relative order is preserved. The algorithm is therefore stable.
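The four steps and the stability property can be checked together with a short sketch. The Item type (an integer key plus a label that records the original order) and the class name KeyIndexedCounting are assumptions made for this example.

```java
import java.util.Arrays;

class KeyIndexedCounting {
    // Hypothetical item type: a key in 0..R-1 plus a label to track input order.
    static final class Item {
        final int key;
        final String label;
        Item(int key, String label) { this.key = key; this.label = label; }
        @Override public String toString() { return key + label; }
    }

    static void sort(Item[] a, int R) {
        int N = a.length;
        Item[] aux = new Item[N];
        int[] count = new int[R + 1];
        for (int i = 0; i < N; i++) count[a[i].key + 1]++;          // 1. frequencies
        for (int r = 0; r < R; r++) count[r + 1] += count[r];       // 2. start indices
        for (int i = 0; i < N; i++) aux[count[a[i].key]++] = a[i];  // 3. distribute
        for (int i = 0; i < N; i++) a[i] = aux[i];                  // 4. copy back
    }

    public static void main(String[] args) {
        Item[] a = {
            new Item(3, "a"), new Item(1, "b"), new Item(3, "c"),
            new Item(0, "d"), new Item(1, "e")
        };
        sort(a, 4);
        System.out.println(Arrays.toString(a));
        // prints [0d, 1b, 1e, 3a, 3c]
    }
}
```

Items with equal keys keep their input order: 1b stays ahead of 1e, and 3a stays ahead of 3c.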

Least-significant-digit first (LSD) string sort

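run key-indexed counting once per character position, from the rightmost character to the leftmost
all strings must have the same length
the stability of key-indexed counting is what makes LSD correct: each pass preserves the order established by the passes on the characters to its right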

Typical candidate for LSD string sort

public class LSD
{
    public static void sort(String[] a, int W)
    {  // Sort a[] on leading W characters.
        int N = a.length;
        int R = 256;
        String[] aux = new String[N];
        for (int d = W-1; d >= 0; d--)
        {  // Sort by key-indexed counting on dth char.
            int[] count = new int[R+1];     // Compute frequency counts.
            for (int i = 0; i < N; i++)
                count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++)     // Transform counts to indices.
                count[r+1] += count[r];
            for (int i = 0; i < N; i++)     // Distribute.
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++)     // Copy back.
                a[i] = aux[i];
        }
    }
}
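To see how each pass builds on the previous one, the following sketch repeats the same four loops but records a snapshot of a[] after every pass. The class name LsdTrace, the method sortWithTrace and the sample strings are made up for illustration. Like the code above, it assumes every character is below R = 256.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LsdTrace {
    // Same passes as LSD.sort, plus a snapshot of a[] after each pass.
    static List<String> sortWithTrace(String[] a, int W) {
        int N = a.length, R = 256;
        String[] aux = new String[N];
        List<String> trace = new ArrayList<>();
        for (int d = W - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++) count[r + 1] += count[r];
            for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++) a[i] = aux[i];
            trace.add("d=" + d + " " + Arrays.toString(a));
        }
        return trace;
    }

    public static void main(String[] args) {
        String[] a = {"dab", "add", "cab", "fad", "bee", "bad"};
        for (String line : sortWithTrace(a, 3)) System.out.println(line);
        // prints:
        // d=2 [dab, cab, add, fad, bad, bee]
        // d=1 [dab, cab, fad, bad, add, bee]
        // d=0 [add, bad, bee, cab, dab, fad]
    }
}
```

In the last pass, bad lands ahead of bee only because the d=1 pass already put it first and the d=0 pass is stable.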

Most-significant-digit-first (MSD) string sort

Basic Idea

string lengths can be different
run key-indexed counting on the leftmost (first) character
this partitions the list so that strings with the same first character are in the same group
recursively sort each group on the next character

Overview of MSD string sort

Implementation

public class MSD {
    // radix
    private static int R = 256;
    // cutoff for switching to insertion sort
    private static final int M = 15;
    // auxiliary array for distribution
    private static String[] aux;

    // client sort
    public static void sort(String[] a) {
        int N = a.length;
        aux = new String[N];
        sort(a, 0, N-1, 0);
    }

    // sort a[lo..hi], whose strings all share the same first d characters
    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo + M) {
            // Insertion.sort is the d-aware insertion sort from point 3 below
            Insertion.sort(a, lo, hi, d);
            return;
        }
        int[] count = new int[R + 2];
        // compute frequencies (+2 leaves room for the end-of-string key -1)
        for (int i = lo; i <= hi; i++) {
            count[charAt(a[i], d) + 2]++;
        }
        // convert counts to indices
        for (int r = 0; r < R + 1; r++) {
            count[r+1] += count[r];
        }
        // distribute
        for (int i = lo; i <= hi; i++) {
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        }
        // copy back (aux is filled from index 0, a[] from index lo)
        for (int i = lo; i <= hi; i++) {
            a[i] = aux[i - lo];
        }
        // Recursively sort for each character value.
        for (int r = 0; r < R; r++) {
            sort(a, lo + count[r], lo + count[r+1] - 1, d+1);
        }
    }

    // character at position d, or -1 if s has fewer than d+1 characters
    private static int charAt(String s, int d) {
        if (d < s.length()) return s.charAt(d); else return -1;
    }
}

detailed explanations of the code below:

Problems & How to address them

1. String with different lengths

End-of-string convention: treat the end of a string as the key -1, smaller than every character, so that a string whose characters have all been examined comes first in its subarray
To accomplish that, count[] needs length R+2. After the cumulative pass, count[1] is the number of keys smaller than key 0 (the strings that have already ended) ... and count[R+1] is the number of keys smaller than R, which is every string in the subarray (hi - lo + 1), since our keys are between -1 and R-1
Also, we need a charAt() method that returns -1 when d is past the end of the string

private static int charAt(String s, int d) {
  if (d < s.length()) return s.charAt(d);
  else return -1;
}
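A small sketch of how the convention plays out. The class name EndOfString and the helper countIndex are hypothetical and exist only to show which slot of count[] a key lands in during the frequency pass.

```java
class EndOfString {
    // Same convention as above: -1 once d is past the end of s.
    static int charAt(String s, int d) {
        if (d < s.length()) return s.charAt(d);
        else return -1;
    }

    // Hypothetical helper: slot of count[] incremented in the frequency pass.
    // Keys -1..R-1 map to slots 1..R+1, which is why count[] has length R+2.
    static int countIndex(String s, int d) {
        return charAt(s, d) + 2;
    }

    public static void main(String[] args) {
        System.out.println(charAt("she", 3));     // prints -1
        System.out.println(charAt("shells", 3));  // prints 108, the code of 'l'
        System.out.println(countIndex("she", 3)); // prints 1
    }
}
```

Because -1 is smaller than any character code, "she" is placed ahead of "shells" when the sort reaches d = 3.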

2. Small subarrays

Suppose a subarray produced by the previous key-indexed counting pass is small, say {a, b}
Sorting it recursively with key-indexed counting still creates and scans a count[] array of length R + 2, where $R = 256$ for extended ASCII strings or even $R = 65536$ for Unicode
When the number of small subarrays is large, this overhead dominates and the algorithm becomes inefficient. Therefore,

switching to insertion sort for small subarrays is a must for MSD string sort

As we did in Quicksort / Mergesort, when the length of a subarray is smaller than a cutoff, we switch to insertion sort

// Sort from a[lo] to a[hi], starting at the dth character.
if (hi <= lo + M) {
    Insertion.sort(a, lo, hi, d);  // d is explained in the step below
    return;
}

Effect of cutoff for small subarrays in MSD string sort

3. Won't reexamine equal keys

We need to adjust the insertion sort so that it does not compare characters we already know to be equal
So we pass along an index $d$, meaning that all strings in the subarray agree on their first $d$ characters
Improved insertion sort:

public static void sort(String[] a, int lo, int hi, int d)
{  // Sort from a[lo] to a[hi], starting at the dth character.
    for (int i = lo; i <= hi; i++)
        for (int j = i; j > lo && less(a[j], a[j-1], d); j--)
            exch(a, j, j-1);
}

private static boolean less(String v, String w, int d)
{  return v.substring(d).compareTo(w.substring(d)) < 0;  }

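This is efficient only if String.substring() takes constant time, which was true in older Java versions. Since Java 7 update 6, substring() copies the characters and takes time proportional to the length of the substring, so less() is cheaper if it compares characters directly starting at index d.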
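The sketch below shows one way to write less() without substring(), comparing characters directly from index d onward. The class name InsertionFromD and the sample strings are made up for this example. It assumes, as the MSD recursion guarantees, that all strings in a[lo..hi] share their first d characters.

```java
import java.util.Arrays;

class InsertionFromD {
    // Sort a[lo..hi], assuming the strings agree on their first d characters.
    static void sort(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--)
                exch(a, j, j - 1);
    }

    // Is v less than w, looking only at positions d and beyond?
    // No substring objects are created.
    static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int i = d; i < n; i++) {
            if (v.charAt(i) < w.charAt(i)) return true;
            if (v.charAt(i) > w.charAt(i)) return false;
        }
        return v.length() < w.length();  // a proper prefix sorts first
    }

    static void exch(String[] a, int i, int j) {
        String t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        String[] a = {"shore", "she", "shells", "sheep"};
        sort(a, 0, a.length - 1, 2);  // all four strings start with "sh"
        System.out.println(Arrays.toString(a));
        // prints [she, sheep, shells, shore]
    }
}
```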

4. Equal keys

If two keys are equal, MSD needs to examine every character in those strings
The worst case for MSD is when all keys are equal

Characters examined by MSD string sort

Running time:

MSD string sort uses between $8N+3R$ and ~ $7wN+3WR$ array accesses to sort N strings taken from an R-character alphabet, where w is the average string length.

To be completed: Three-way string quicksort