Before Start
Reminder: Compare-based sorting algorithms need ~N lg N compares as a lower bound.
Can we do better?
Yes, as long as the algorithm does not rely on key compares.
Key-indexed Counting
Assume we want to sort a list a[] of length N, where:
- a[] contains R distinct keys
- the keys are integers between 0 and R-1, e.g. a = {4, 3, 1, 1, 0, 2, 3}, N = 7, R = 5 (0, 1, 2, 3, 4)
- if the keys come from an alphabet, convert each letter to its index in that alphabet, so that e.g. a.key() == 0, b.key() == 1 ... f.key() == 5
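As a minimal sketch of the letter-to-integer conversion, assuming the alphabet is the lowercase letters a to f (the class name AlphabetKey and the method key are hypothetical, chosen only for illustration):

```java
// Hypothetical helper: maps a letter of a small alphabet to an integer key.
class AlphabetKey {
    // Assumption: the alphabet is the lowercase letters 'a'..'f', so R = 6.
    static final int R = 6;

    // Returns 0 for 'a', 1 for 'b', ..., 5 for 'f'.
    static int key(char c) {
        if (c < 'a' || c >= 'a' + R) {
            throw new IllegalArgumentException("not in alphabet: " + c);
        }
        return c - 'a';
    }

    public static void main(String[] args) {
        System.out.println(key('a')); // 0
        System.out.println(key('b')); // 1
        System.out.println(key('f')); // 5
    }
}
```

Any mapping works as long as it is order-preserving and produces integers between 0 and R-1.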
Intuition: use key as an array index.
1. Compute frequency counts:
- create array count[] of length R + 1 (see next step why R + 1)
- loop through a[] and compute the frequency of each key
// Compute frequency counts
for (int i = 0; i < N; i++) {
    count[a[i].key() + 1]++; // count[0] always stays 0 (see next step)
}
Result:
- count[0] is always 0
- key v appears count[v+1] times in a[]
2. Transform counts to indices:
- calculate the cumulative counts
// loop until the second last index of count[]
for (int i = 0; i < R; i++) {
count[i+1] += count[i];
}
Result:
- count[v] equals the number of keys in a[] that are smaller than v, so count[v] is the correct sorted position for the first v in a[]
  e.g. for v = 3 and the sorted order {0,1,1,2,3,3...}, there are 4 keys before the first 3, so count[3] == 4
- count[0] is always 0 since no key is smaller than 0 or the first key in the alphabet (i.e. a)
- count[R] is always N since all keys are smaller than R (our keys are between 0 and R-1)
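To make steps 1 and 2 concrete, here is a small trace on the example a = {4, 3, 1, 1, 0, 2, 3} with R = 5. It is a sketch that assumes plain int keys instead of objects with a key() method; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical trace of steps 1 and 2 on plain int keys.
class CountTrace {
    // Step 1: count[v + 1] = number of occurrences of key v.
    static int[] frequencies(int[] a, int R) {
        int[] count = new int[R + 1];
        for (int key : a) count[key + 1]++;
        return count;
    }

    // Step 2: turn frequencies into starting indices, in place.
    static int[] toIndices(int[] count) {
        for (int r = 0; r + 1 < count.length; r++) count[r + 1] += count[r];
        return count;
    }

    public static void main(String[] args) {
        int[] a = {4, 3, 1, 1, 0, 2, 3};
        int[] count = frequencies(a, 5);
        System.out.println(Arrays.toString(count));            // [0, 1, 2, 1, 2, 1]
        System.out.println(Arrays.toString(toIndices(count))); // [0, 1, 3, 4, 6, 7]
    }
}
```

After step 2, count[3] == 4 (four keys are smaller than 3) and count[5] == 7 == N.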
3. Distribute the data
- create an aux array of length N
for (int i = 0; i < a.length; i++) {
aux[count[a[i].key()]++] = a[i];
}
- as mentioned in 2, count[v] contains the right position for the first v in a[]
- so, loop through a[], use count[a[i].key()] to find the position for a[i], and put it in aux[count[a[i].key()]]
- and increment count[a[i].key()] every time we distribute a[i], since the right position for the second element with the same key is one slot after the first, the third is two slots after, and so forth...
**Result:** aux[] now contains the elements of a[] in sorted order, so the last step is to copy them back.
4. Copy back
// Copy back.
for (int i = 0; i < N; i++) {
a[i] = aux[i];
}
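Putting the four steps together, a self-contained sketch could look like the following. It assumes plain int keys in the range 0 to R-1 rather than objects with a key() method, and the class name is hypothetical.

```java
import java.util.Arrays;

// Hypothetical end-to-end version of the four steps, on plain int keys.
class KeyIndexedCounting {
    // Sorts a[] in place; assumes every key is an integer in [0, R).
    static void sort(int[] a, int R) {
        int N = a.length;
        int[] count = new int[R + 1];
        int[] aux = new int[N];
        for (int i = 0; i < N; i++) count[a[i] + 1]++;         // 1. frequency counts
        for (int r = 0; r < R; r++) count[r + 1] += count[r];  // 2. counts to indices
        for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; // 3. distribute
        for (int i = 0; i < N; i++) a[i] = aux[i];             // 4. copy back
    }

    public static void main(String[] args) {
        int[] a = {4, 3, 1, 1, 0, 2, 3};
        sort(a, 5);
        System.out.println(Arrays.toString(a)); // [0, 1, 1, 2, 3, 3, 4]
    }
}
```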
Analysis
- Initializing the arrays count[] and aux[] takes N + R + 1 array accesses
- Computing the frequencies loops through N items, which takes 2N array accesses
- Computing the cumulative counts takes 2R array accesses
- the third loop does N counter increments and N data moves (3N array accesses)
- the fourth loop does N data moves (2N array accesses)
Therefore, array accesses ~ 8N + 3R + 1
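As a worked check of the total, assuming initialization costs N + R + 1 accesses (R + 1 entries for count[] plus N entries for aux[]), the per-step costs add up as follows, evaluated for the running example with N = 7 and R = 5:

```latex
\begin{align*}
\underbrace{(N + R + 1)}_{\text{initialize}}
+ \underbrace{2N}_{\text{frequencies}}
+ \underbrace{2R}_{\text{cumulate}}
+ \underbrace{3N}_{\text{distribute}}
+ \underbrace{2N}_{\text{copy back}}
&= 8N + 3R + 1 \\
N = 7,\; R = 5: \quad 8 \cdot 7 + 3 \cdot 5 + 1 &= 72
\end{align*}
```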
Stability
Since the algorithm preserves the order of the keys that are equal, it is stable.
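A small demonstration of stability, as a sketch: each item carries a key and a name, and items with equal keys should come out in their input order. The Item class and the sample names are made up for illustration.

```java
// Hypothetical demo: key-indexed counting on (key, name) items.
class StableDemo {
    static final class Item {
        final int key;
        final String name;
        Item(int key, String name) { this.key = key; this.name = name; }
    }

    // Returns a new array sorted by key; assumes keys are in [0, R).
    static Item[] sort(Item[] a, int R) {
        int[] count = new int[R + 1];
        Item[] aux = new Item[a.length];
        for (Item it : a) count[it.key + 1]++;
        for (int r = 0; r < R; r++) count[r + 1] += count[r];
        // Scanning left to right places equal keys in their input order.
        for (Item it : a) aux[count[it.key]++] = it;
        return aux;
    }

    // Joins the names with single spaces, for printing.
    static String names(Item[] a) {
        StringBuilder sb = new StringBuilder();
        for (Item it : a) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(it.name);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Item[] a = {
            new Item(1, "Anna"), new Item(0, "Bob"),
            new Item(1, "Carl"), new Item(0, "Dora")
        };
        System.out.println(names(sort(a, 2))); // Bob Dora Anna Carl
    }
}
```

Bob stays ahead of Dora and Anna stays ahead of Carl, exactly as in the input.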
Least-significant-digit first (LSD) string sort
- do key-indexed counting on each character position, from the right-most character to the left-most
- needs strings of equal length
- the stability of key-indexed counting ensures the correctness of LSD
public class LSD
{
    public static void sort(String[] a, int W)
    {  // Sort a[] on leading W characters.
        int N = a.length;
        int R = 256;
        String[] aux = new String[N];
        for (int d = W-1; d >= 0; d--)
        {  // Sort by key-indexed counting on dth char.
            int[] count = new int[R+1];      // Compute frequency counts.
            for (int i = 0; i < N; i++)
                count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++)      // Transform counts to indices.
                count[r+1] += count[r];
            for (int i = 0; i < N; i++)      // Distribute.
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++)      // Copy back.
                a[i] = aux[i];
        }
    }
}
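A usage sketch for LSD sort on a few three-letter words. The class name LsdDemo and the sample words are made up for illustration; the sort method repeats the logic above so that the example runs on its own, and it assumes every string has at least W characters, all below 256.

```java
import java.util.Arrays;

// Hypothetical stand-alone demo of LSD string sort.
class LsdDemo {
    // Sorts a[] on its leading W characters (extended ASCII assumed).
    static void sort(String[] a, int W) {
        int N = a.length;
        int R = 256;
        String[] aux = new String[N];
        for (int d = W - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++) count[r + 1] += count[r];
            for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++) a[i] = aux[i];
        }
    }

    public static void main(String[] args) {
        String[] a = {"dab", "cab", "fad", "bad", "ace", "bee", "add"};
        sort(a, 3);
        System.out.println(Arrays.toString(a)); // [ace, add, bad, bee, cab, dab, fad]
    }
}
```

Each pass is a stable sort on one character position, so the order established by the passes on later positions survives among strings that tie on the current one.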
Most-significant-digit-first (MSD) string sort
Basic Idea
- string lengths can be different
- use the left-most character (the first character) as the key and do key-indexed counting
- partition the list so that strings having the same first character are in the same group
- recursively sort each group using the remaining substrings
Implementation
public class MSD {
    // radix
    private static int R = 256;
    // cutoff for switching to insertion sort
    private static final int M = 15;
    // auxiliary array for distribution
    private static String[] aux;

    // client sort
    public static void sort(String[] a) {
        int N = a.length;
        aux = new String[N];
        sort(a, 0, N-1, 0);
    }

    // sort from a[lo] to a[hi], starting at the dth character
    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo + M) {
            Insertion.sort(a, lo, hi, d);
            return;
        }
        int[] count = new int[R + 2];
        // compute frequencies
        for (int i = lo; i <= hi; i++) {
            count[charAt(a[i], d) + 2]++;
        }
        // convert counts to indices
        for (int r = 0; r < R+1; r++) {
            count[r+1] += count[r];
        }
        // distribute
        for (int i = lo; i <= hi; i++) {
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        }
        // copy back
        for (int i = lo; i <= hi; i++) {
            a[i] = aux[i - lo];
        }
        // recursively sort for each character value
        for (int r = 0; r < R; r++) {
            sort(a, lo + count[r], lo + count[r+1] - 1, d+1);
        }
    }

    // character at position d, or -1 past the end of the string
    private static int charAt(String s, int d) {
        if (d < s.length()) return s.charAt(d); else return -1;
    }
}
Detailed explanations of the code follow.
Problems & How to address them
1. Strings with different lengths
- End-of-string convention: treat the end of each string as key -1, the smallest key, so that a string whose characters have all been examined is moved to the front of the sorted order
- To accomplish that, we need count[] to be of length R+2, so that count[1] is the number of characters smaller than key 0 ... count[R+1] is the number of characters smaller than key R, so count[R+1] == N (here N is the size of the subarray being sorted), since our keys are between -1 and R-1
- also, we need to adjust the charAt() method to return -1 when we reach the end of a string
private static int charAt(String s, int d) {
if (d < s.length()) return s.charAt(d);
else return -1;
}
2. Small subarrays
- The subarray produced by the previous key-indexed counting pass can be small, say {a, b}
- to recursively sort a small subarray like {a, b} using key-indexed counting, we still need to create a count[] array with the length of the alphabet + 2, e.g. for ASCII strings or even Unicode
- therefore, when the number of small subarrays is large, the algorithm becomes inefficient, so switching to insertion sort for small subarrays is a must for MSD string sort
- like we did in Quicksort / Mergesort, when the length of a subarray is below a cutoff, we switch to insertion sort
// Sort from a[lo] to a[hi], starting at the dth character.
if (hi <= lo + M) {
Insertion.sort(a, lo, hi, d); // d is explained in the step below
return;
}
3. Won't reexamine equal keys
- we need to adjust insertion sort so that it won't compare characters that we already know are equal
- so we pass d, the number of leading characters already known to be equal within the subarray, and compare only from position d onward
- Improved insertion sort:
public static void sort(String[] a, int lo, int hi, int d)
{ // Sort from a[lo] to a[hi], starting at the dth character.
for (int i = lo; i <= hi; i++)
for (int j = i; j > lo && less(a[j], a[j-1], d); j--)
exch(a, j, j-1);
}
private static boolean less(String v, String w, int d)
{ return v.substring(d).compareTo(w.substring(d)) < 0; }
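- this was efficient in older Java versions, where String.substring() took constant time; since Java 7 update 6, substring() copies the characters and takes time proportional to the substring length, so a less() that compares characters directly starting at position d avoids that cost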
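Because substring() copies characters in current Java versions (Java 7 update 6 and later), one possible alternative is to compare the two strings in place from position d. This is a sketch, not the version from the notes above; it assumes v and w already agree on their first d characters, and the class name is hypothetical.

```java
// Hypothetical replacement for less() that avoids substring().
class LessFromD {
    // Is v less than w, looking only at characters at positions >= d?
    // Assumes v and w agree on their first d characters.
    static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int i = d; i < n; i++) {
            if (v.charAt(i) != w.charAt(i)) return v.charAt(i) < w.charAt(i);
        }
        // One string is a prefix of the other: the shorter one is smaller.
        return v.length() < w.length();
    }

    public static void main(String[] args) {
        System.out.println(less("she", "shells", 2)); // true
        System.out.println(less("shore", "shells", 2)); // false
    }
}
```

Under the stated assumption it gives the same answer as v.substring(d).compareTo(w.substring(d)) < 0, without allocating new strings.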
4. Equal keys
- if two keys are equal, MSD needs to examine every character in those strings
- the worst case for MSD is when all keys are equal
Running time:
MSD string sort uses between 8N + 3R and ~7wN + 3WR array accesses to sort N strings taken from an R-character alphabet, where w is the average string length and W is the length of the longest string.
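For experimenting with the pieces discussed above, here is a self-contained sketch of MSD string sort. It is not the exact code from these notes: the class name is hypothetical, the insertion sort is inlined with an in-place character comparison, the alphabet is assumed to be extended ASCII (R = 256), and the cutoff M is deliberately set to 3 so that the small demo input reaches the key-indexed counting code.

```java
import java.util.Arrays;

// Hypothetical stand-alone MSD string sort for experimentation.
class MsdSketch {
    private static final int R = 256; // assumption: all characters are < 256
    private static final int M = 3;   // small demo cutoff; the notes use 15
    private static String[] aux;

    static void sort(String[] a) {
        aux = new String[a.length];
        sort(a, 0, a.length - 1, 0);
    }

    // Character at position d, or -1 past the end of the string.
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    // Sorts a[lo..hi], whose strings all share the same first d characters.
    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo + M) { insertion(a, lo, hi, d); return; }
        int[] count = new int[R + 2];
        for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++) count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];
        // Strings that ended at d sit in front and need no further sorting.
        for (int r = 0; r < R; r++)
            sort(a, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }

    private static void insertion(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
            }
    }

    // Compares from position d onward without creating substrings.
    private static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int i = d; i < n; i++)
            if (v.charAt(i) != w.charAt(i)) return v.charAt(i) < w.charAt(i);
        return v.length() < w.length();
    }

    public static void main(String[] args) {
        String[] a = ("she sells seashells by the sea shore "
                + "the shells she sells are surely seashells").split(" ");
        sort(a);
        System.out.println(Arrays.toString(a));
    }
}
```

The input mixes strings of different lengths and several duplicates, so it exercises both the end-of-string convention and the equal-keys case.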
To be completed: Three-way string quicksort