5.1 String Sorts: Notes and Understanding

Before We Start

Reminder: compare-based sorting algorithms need ~N\log{N} compares in the worst case; that is a lower bound.
Can we do better?
- Yes, as long as the algorithm does not rely on key compares.

Key-indexed Counting

Assume we want to

  • sort an array a[] of length N
  • whose keys are drawn from R possible values
  • where the keys are integers between 0 and R-1
    e.g. a = {4, 3, 1, 1, 0, 2, 3}, N = 7, R = 5 (keys 0, 1, 2, 3, 4)
  • if the keys are letters from an alphabet, map each letter to its index in the alphabet
  • so that e.g. a.key() == 0, b.key() == 1 ... f.key() == 5
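
A minimal sketch of the letter-to-index conversion, assuming a lowercase alphabet running from a to f. The names `AlphabetKey` and `toKey` are made up for illustration and do not come from the notes.

```java
class AlphabetKey {
    // Hypothetical helper: maps 'a'..'f' to 0..5 by taking the offset from 'a'.
    static int toKey(char c) {
        return c - 'a';
    }

    public static void main(String[] args) {
        for (char c = 'a'; c <= 'f'; c++) {
            System.out.println(c + " -> " + toKey(c));
        }
    }
}
```

Any alphabet works the same way as long as each letter gets a distinct index between 0 and R-1.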

Intuition: use the key as an array index.

1. Compute frequency counts:

  • create an array count[] of length R + 1 (the next step explains why R + 1)
  • loop through a[] and count how often each key occurs
// Compute frequency counts.
for (int i = 0; i < N; i++) {
    count[a[i].key() + 1]++; // count[0] is never incremented (see next step)
}

Result:

  • count[0] is always 0
  • key v appears count[v+1] times in a[]
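
To make the offset by one concrete, here is a small sketch that runs this step on the example array from above. It assumes plain int keys instead of objects with a key() method, and the class name `FrequencyCounts` is made up for illustration.

```java
import java.util.Arrays;

class FrequencyCounts {
    // Counts keys in 0..R-1, storing the number of occurrences of key v at index v + 1.
    static int[] frequencies(int[] keys, int R) {
        int[] count = new int[R + 1];
        for (int key : keys) {
            count[key + 1]++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] a = {4, 3, 1, 1, 0, 2, 3};
        System.out.println(Arrays.toString(frequencies(a, 5)));
        // prints [0, 1, 2, 1, 2, 1]
    }
}
```

Key 1 appears twice, so count[2] is 2, and count[0] stays 0.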

2. Transform counts to indices:

  • compute the cumulative counts
// i stops at the second-to-last index of count[], so i+1 stays in bounds
for (int i = 0; i < R; i++) {  
    count[i+1] += count[i];
}

Result:

  • count[v] equals the number of keys in a[] that are smaller than v, so count[v] is the sorted position of the first v in a[]
    e.g. v = 3, sorted a = {0, 1, 1, 2, 3, 3, 4}: there are 4 keys before the first 3, so count[3] == 4
  • count[0] is always 0, since no key is smaller than 0 (or than the first letter of the alphabet, i.e. a)
  • count[R] is always N, since all keys are smaller than R (our keys are between 0 and R-1); in the example the keys are between 0 and 4, so count[5] == 7
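
A short sketch of this step on the frequency counts of the example array a = {4, 3, 1, 1, 0, 2, 3}. The class and method names (`CumulativeCounts`, `toIndices`) are illustrative assumptions.

```java
import java.util.Arrays;

class CumulativeCounts {
    // Turns frequency counts (count of key v stored at index v + 1)
    // into starting positions, in place.
    static void toIndices(int[] count) {
        for (int r = 0; r + 1 < count.length; r++) {
            count[r + 1] += count[r];
        }
    }

    public static void main(String[] args) {
        int[] count = {0, 1, 2, 1, 2, 1}; // frequencies for a = {4, 3, 1, 1, 0, 2, 3}
        toIndices(count);
        System.out.println(Arrays.toString(count));
        // prints [0, 1, 3, 4, 6, 7]
    }
}
```

Reading the result: the first 3 belongs at position count[3] == 4, and count[5] == 7 == N.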

3. Distribute the data

  • create an aux array of length N
for (int i = 0; i < a.length; i++) {
    aux[count[a[i].key()]++] = a[i];
}
  • as mentioned in step 2, count[v] holds the sorted position of the first v in a[]
  • so, loop through a[], use count[a[i].key()] to find the position for a[i], and store a[i] in aux[] at that position
  • increment count[a[i].key()] every time we distribute an item, since the second item with the same key belongs one position later, the third two positions later, and so forth

Result:

  • aux[] now holds the items of a[] in sorted order

4. Copy back

// Copy back.
for (int i = 0; i < N; i++) {
     a[i] = aux[i];
}
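
The four steps can be put together as follows. This is a minimal sketch that assumes the keys are the int values themselves; the class name `KeyIndexedCounting` is chosen for illustration.

```java
import java.util.Arrays;

class KeyIndexedCounting {
    // Sorts a[], whose values must lie between 0 and R-1.
    static void sort(int[] a, int R) {
        int N = a.length;
        int[] count = new int[R + 1];
        int[] aux = new int[N];
        for (int i = 0; i < N; i++) count[a[i] + 1]++;         // 1. frequency counts
        for (int r = 0; r < R; r++) count[r + 1] += count[r];  // 2. counts to indices
        for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; // 3. distribute
        for (int i = 0; i < N; i++) a[i] = aux[i];             // 4. copy back
    }

    public static void main(String[] args) {
        int[] a = {4, 3, 1, 1, 0, 2, 3};
        sort(a, 5);
        System.out.println(Arrays.toString(a));
        // prints [0, 1, 1, 2, 3, 3, 4]
    }
}
```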

Analysis

  • Initializing the arrays count[] and aux[] takes N + R + 1 array accesses
  • The first loop (frequency counts) increments a counter for each of the N items (2N array accesses)
  • The second loop (cumulative counts) does R additions (2R array accesses)
  • The third loop does N counter increments and N data moves (3N array accesses)
  • The fourth loop does N data moves (2N array accesses)

Therefore, key-indexed counting uses 8N + 3R + 1 array accesses in total.
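
The total can be checked by adding up the five contributions listed above. The labels under the braces are only annotations, and the second line plugs in the example values N = 7 and R = 5.

```latex
\underbrace{(N + R + 1)}_{\text{initialize}}
+ \underbrace{2N}_{\text{count}}
+ \underbrace{2R}_{\text{cumulate}}
+ \underbrace{3N}_{\text{distribute}}
+ \underbrace{2N}_{\text{copy back}}
= 8N + 3R + 1

8 \cdot 7 + 3 \cdot 5 + 1 = 72
```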

Stability

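The distribute loop scans a[] from left to right and places items with the same key into consecutive positions, so items with equal keys keep their relative order. The algorithm is therefore stable.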
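
Stability only becomes visible when items carry more than their key. The sketch below assumes a hypothetical `Student` record (a name plus a section number used as the key) and sorts by section; the names are invented for illustration.

```java
import java.util.Arrays;

class StabilityDemo {
    // Hypothetical item type: a name plus an integer key between 0 and R-1.
    static final class Student {
        final String name;
        final int section;
        Student(String name, int section) { this.name = name; this.section = section; }
        int key() { return section; }
        @Override public String toString() { return name + ":" + section; }
    }

    // Key-indexed counting on Student.key().
    static void sort(Student[] a, int R) {
        int N = a.length;
        int[] count = new int[R + 1];
        Student[] aux = new Student[N];
        for (int i = 0; i < N; i++) count[a[i].key() + 1]++;
        for (int r = 0; r < R; r++) count[r + 1] += count[r];
        for (int i = 0; i < N; i++) aux[count[a[i].key()]++] = a[i];
        for (int i = 0; i < N; i++) a[i] = aux[i];
    }

    public static void main(String[] args) {
        Student[] a = {
            new Student("Anderson", 2), new Student("Brown", 3), new Student("Davis", 3),
            new Student("Garcia", 1), new Student("Harris", 1), new Student("Jackson", 3)
        };
        sort(a, 4);
        System.out.println(Arrays.toString(a));
        // prints [Garcia:1, Harris:1, Anderson:2, Brown:3, Davis:3, Jackson:3]
    }
}
```

Within each section the names stay in their original (here alphabetical) order, which is exactly what stability promises.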

Least-significant-digit first (LSD) string sort

  • run key-indexed counting on each character position, from the rightmost character to the leftmost
  • requires strings of equal length
  • the stability of key-indexed counting guarantees the correctness of LSD


    Figure: Typical candidate for LSD string sort
public class LSD {
    public static void sort(String[] a, int W) {
        // Sort a[] on the leading W characters.
        int N = a.length;
        int R = 256;
        String[] aux = new String[N];
        for (int d = W - 1; d >= 0; d--) {
            // Sort by key-indexed counting on the dth character.
            int[] count = new int[R + 1];
            for (int i = 0; i < N; i++)     // Compute frequency counts.
                count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++)     // Transform counts to indices.
                count[r + 1] += count[r];
            for (int i = 0; i < N; i++)     // Distribute.
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++)     // Copy back.
                a[i] = aux[i];
        }
    }
}
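
To see why stability matters, it helps to watch the array after every pass. The sketch below repeats the same algorithm with an added trace flag; the class name `LsdTrace`, the flag, and the sample words are assumptions made for this illustration.

```java
import java.util.Arrays;

class LsdTrace {
    // LSD string sort on the leading W characters (extended ASCII assumed).
    // Prints a[] after each pass when trace is true.
    static void sort(String[] a, int W, boolean trace) {
        int N = a.length;
        int R = 256;
        String[] aux = new String[N];
        for (int d = W - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++) count[r + 1] += count[r];
            for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++) a[i] = aux[i];
            if (trace) System.out.println("d=" + d + ": " + Arrays.toString(a));
        }
    }

    public static void main(String[] args) {
        String[] a = {"dab", "cab", "fad", "bad", "ace", "add"};
        sort(a, 3, true);
        // d=2: [dab, cab, fad, bad, add, ace]
        // d=1: [dab, cab, fad, bad, ace, add]
        // d=0: [ace, add, bad, cab, dab, fad]
    }
}
```

After the pass on d=1, the strings with middle letter a are still ordered by their last letter, because each pass preserves the order established by the previous ones.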

Most-significant-digit-first (MSD) string sort

Basic Idea

  • string lengths can differ
  • apply key-indexed counting to the leftmost (first) character
  • this partitions the array so that strings with the same first character end up in the same group
  • recursively sort each group on the remaining characters


    Figure: Overview of MSD string sort

Implementation

public class MSD {
    // radix
    private static int R = 256;
    // cutoff for switching to insertion sort
    private static final int M = 15;
    // auxiliary array for distribution
    private static String[] aux;

    // client sort
    public static void sort(String[] a) {
        int N = a.length;
        aux = new String[N];
        sort(a, 0, N-1, 0);
    }

    private static void sort(String[] a, int lo, int hi, int d) {
        // Insertion.sort is the d-aware insertion sort described below
        if (hi <= lo + M) {
            Insertion.sort(a, lo, hi, d);
            return;
        }
        int[] count = new int[R + 2];
        // compute frequencies
        for (int i = lo; i <= hi; i++) {
            count[charAt(a[i], d) + 2]++;
        }
        // convert counts to indices
        for (int r = 0; r < R+1; r++) {
            count[r+1] += count[r];
        }
        // distribute
        for (int i = lo; i <= hi; i++) {
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        }
        // copy back
        for (int i = lo; i <= hi; i++) {
            a[i] = aux[i - lo];
        }
        // recursively sort for each character value
        for (int r = 0; r < R; r++) {
            sort(a, lo + count[r], lo + count[r+1] - 1, d+1);
        }
    }

    // returns the dth character of s, or -1 past the end of s
    private static int charAt(String s, int d) {
        if (d < s.length()) return s.charAt(d); else return -1;
    }
}

Detailed explanations of the code follow:

Problems & How to address them

1. Strings with different lengths
  • End-of-string convention: treat the end of a string as key -1, the smallest key, so that a string whose characters have all been examined comes first in its group
  • To make room for the extra key, count[] needs length R+2. After the cumulative-count step, count[c+1] is the number of keys smaller than c: count[1] counts the strings that have already ended (key -1), and count[R+1] counts every string in the subarray (N for the top-level call), since all keys lie between -1 and R-1
  • charAt() must also return -1 when d reaches the end of a string
private static int charAt(String s, int d) {
  if (d < s.length()) return s.charAt(d);
  else return -1;
}
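
A small sketch of the convention in action, using the two strings she and shells as sample input. The class name `EndOfStringKey` is made up for illustration.

```java
class EndOfStringKey {
    // Returns the dth character of s as an int, or -1 once d is past the end.
    static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    public static void main(String[] args) {
        System.out.println(charAt("she", 2));    // prints 101 (the code of 'e')
        System.out.println(charAt("she", 3));    // prints -1 (end of string)
        System.out.println(charAt("shells", 3)); // prints 108 (the code of 'l')
    }
}
```

At d = 3 the string she has key -1 and shells has key 108, so she is placed first, which matches dictionary order.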
2. Small subarrays
  • Key-indexed counting often leaves small subarrays behind, say {a, b}
  • sorting even such a small subarray recursively with key-indexed counting allocates and scans a count[] array of length R + 2, where R = 256 for extended ASCII or R = 65536 for Unicode
  • so when there are many small subarrays, the algorithm becomes inefficient

Switching to insertion sort for small subarrays is therefore a must for MSD string sort.

  • as in quicksort and mergesort, when a subarray is small enough (hi <= lo + M), we switch to insertion sort
// Sort from a[lo] to a[hi], starting at the dth character.
if (hi <= lo + M) {
    Insertion.sort(a, lo, hi, d);  // d is explained in the step below
    return;
}
Figure: Effect of cutoff for small subarrays in MSD string sort
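
One way to get a feel for the cost is to count how many count[] arrays a sort allocates with and without the cutoff. The sketch below is a self-contained variant of the MSD code with the cutoff passed as a parameter and a counter added; the class name `MsdCutoffDemo`, the `countArrays` field, and the sample words are assumptions for this illustration, and extended ASCII input is assumed.

```java
import java.util.Arrays;

class MsdCutoffDemo {
    private static final int R = 256;   // radix (extended ASCII assumed)
    private static String[] aux;
    static int countArrays;             // instrumentation: count[] arrays allocated

    static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    // Sorts a[] with the given cutoff and returns the number of count[] arrays allocated.
    static int sort(String[] a, int cutoff) {
        aux = new String[a.length];
        countArrays = 0;
        sort(a, 0, a.length - 1, 0, cutoff);
        return countArrays;
    }

    private static void sort(String[] a, int lo, int hi, int d, int cutoff) {
        if (hi <= lo + cutoff) {
            insertion(a, lo, hi, d);
            return;
        }
        int[] count = new int[R + 2];
        countArrays++;
        for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++) count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];
        for (int r = 0; r < R; r++)
            sort(a, lo + count[r], lo + count[r + 1] - 1, d + 1, cutoff);
    }

    // Insertion sort on a[lo..hi], comparing strings from character d onward.
    private static void insertion(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && a[j].substring(d).compareTo(a[j - 1].substring(d)) < 0; j--) {
                String t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
            }
    }

    public static void main(String[] args) {
        String[] words = ("she sells seashells by the sea shore "
                + "the shells she sells are surely seashells").split(" ");
        String[] a = words.clone();
        String[] b = words.clone();
        System.out.println("count[] arrays, cutoff 0:  " + sort(a, 0));
        System.out.println("count[] arrays, cutoff 15: " + sort(b, 15));
        System.out.println(Arrays.toString(b));
    }
}
```

With only 14 words and a cutoff of 15, the whole array goes straight to insertion sort and no count[] array is allocated at all, while a cutoff of 0 allocates one array of R + 2 ints for every subarray of two or more strings.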
3. Avoid reexamining equal characters
  • insertion sort must be adapted so that it does not compare characters that are already known to be equal
  • so we pass the position d: the first d characters of every string in the subarray are known to be equal
  • Improved insertion sort:
  public static void sort(String[] a, int lo, int hi, int d) {
      // Sort from a[lo] to a[hi], starting at the dth character.
      for (int i = lo; i <= hi; i++)
          for (int j = i; j > lo && less(a[j], a[j-1], d); j--)
              exch(a, j, j-1);  // exch(a, i, j) swaps a[i] and a[j]
  }
  private static boolean less(String v, String w, int d) {
      return v.substring(d).compareTo(w.substring(d)) < 0;
  }
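  • this was efficient in older Java versions, where String.substring() took constant time; since Java 7 update 6, substring() copies the characters and takes time proportional to the length of the substring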
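
A minimal sketch of a less() that compares characters directly from position d instead of building substrings, so its cost does not depend on how substring() is implemented. The class name `InsertionFromD` and the sample words are illustrative; the swap is inlined in place of exch().

```java
import java.util.Arrays;

class InsertionFromD {
    // Sorts a[lo..hi], assuming the first d characters of all those strings are equal.
    static void sort(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
            }
    }

    // Is v less than w, looking only at characters d, d+1, ...?
    static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int i = d; i < n; i++) {
            if (v.charAt(i) != w.charAt(i)) return v.charAt(i) < w.charAt(i);
        }
        return v.length() < w.length(); // the shorter string comes first
    }

    public static void main(String[] args) {
        String[] a = {"shells", "she", "shore", "shell"};
        sort(a, 0, a.length - 1, 2); // all four strings share the prefix "sh"
        System.out.println(Arrays.toString(a));
        // prints [she, shell, shells, shore]
    }
}
```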
4. Equal keys
  • if two keys are equal, MSD needs to examine every character of both strings
  • the worst case for MSD is when all keys are equal


    Figure: Characters examined by MSD string sort

Running time:

MSD string sort uses between 8N + 3R and ~7wN + 3WR array accesses to sort N strings taken from an R-character alphabet, where w is the average string length.


To be completed: Three-way string quicksort
