Compression reduces the size of a file, which saves space when storing it and time when transmitting it. Compression is possible because most files have lots of redundancy.
lossless compression and expansion
given binary data B that we want to compress, compression generates a "compressed" representation C(B); expansion then reconstructs the original bitstream B.
the compression ratio is |C(B)| / |B|,
so we want C(B) to use as few bits as possible.
genomic code
reading and writing binary data
run-length encoding
variable-length codes
use a different number of bits to encode different chars.
for example, Morse code.
but the issue is ambiguity:
... --- ... could be decoded as SOS, V7, IAMIE, or EEWNI.
how do we avoid ambiguity?
solution 1. fixed-length code
solution 2. append a special stop char to each codeword
solution 3. use a prefix-free code.
how to write trie
write a preorder traversal of the trie, marking leaf and internal nodes with a bit.
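a minimal sketch of that preorder serialization, assuming a hand-rolled Node class and bits collected in a StringBuilder (instead of the course's BinaryStdOut) so it stays self-contained; a leaf is written as 1 followed by its 8-bit char, an internal node as 0:

```java
class TrieDemo {
    static final class Node {
        final char ch; final Node left, right;
        Node(char ch, Node left, Node right) { this.ch = ch; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Preorder: leaf -> '1' + 8-bit char; internal node -> '0' then both subtries.
    static void writeTrie(Node x, StringBuilder out) {
        if (x.isLeaf()) {
            out.append('1');
            for (int i = 7; i >= 0; i--) out.append((x.ch >> i) & 1);
            return;
        }
        out.append('0');
        writeTrie(x.left, out);
        writeTrie(x.right, out);
    }

    // Reading mirrors writing: the recursion order alone recovers the trie shape.
    static Node readTrie(StringBuilder in, int[] pos) {
        if (in.charAt(pos[0]++) == '1') {
            char c = 0;
            for (int i = 0; i < 8; i++) c = (char) ((c << 1) | (in.charAt(pos[0]++) - '0'));
            return new Node(c, null, null);
        }
        Node left = readTrie(in, pos);
        Node right = readTrie(in, pos);
        return new Node('\0', left, right);
    }
}
```

note that no pointers are written at all: preorder plus the leaf/internal bit is enough to rebuild the trie exactly.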
huffman algorithm
Count the frequency freq[i] of each char i in the input.
・Start with one node per char i, with weight freq[i].
・Repeat until a single trie is formed:
select the two tries with minimum weights freq[i] and freq[j];
merge them into a single trie with weight freq[i] + freq[j].
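the merge loop above can be sketched with java.util.PriorityQueue; the Node class and field names here are my own, not the course code, and weights come in as a freq[] table indexed by char:

```java
import java.util.PriorityQueue;

class HuffmanBuild {
    static final class Node implements Comparable<Node> {
        final char ch; final int freq; final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        public int compareTo(Node that) { return this.freq - that.freq; }
    }

    static Node buildTrie(int[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (char c = 0; c < freq.length; c++)         // one leaf per char that occurs
            if (freq[c] > 0) pq.offer(new Node(c, freq[c], null, null));
        while (pq.size() > 1) {                        // repeat until a single trie remains
            Node x = pq.poll();                        // two tries of minimum weight
            Node y = pq.poll();
            pq.offer(new Node('\0', x.freq + y.freq, x, y));
        }
        return pq.poll();
    }
}
```

each merge removes two tries and adds one, so n leaves take exactly n - 1 merges, and the root weight equals the total char count.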
LZW
there are many different kinds of models.
a static model uses the same model for all texts. its advantage is speed, but it is not optimal: different texts have different statistical properties.
eg: Morse code
a dynamic model generates the model based on the text.
a preliminary pass is needed to generate the model; Huffman coding is an example.
an adaptive model progressively learns and updates the model as you read the text. more accurate modeling produces better compression, but decoding must start from the beginning.
an implementation of this is in the extra credit.
QUIZ
Q1 Ternary Huffman codes
Combine the smallest 3 probabilities at each step (instead of the smallest 2), which means we poll 3 elements from the PQ and generate one node.
another difference: with a binary alphabet, one bit tells us whether to go to the left or right child; now we need a ternary digit (two bits) to distinguish left, mid, and right.
to handle the case when the number of symbols is not of the form 3 + 2k for some integer k,
we add some zero-frequency dummy nodes:
while (pq.size() < 3 || pq.size() % 2 == 0) {
pq.offer(new Node('\0', 0, null, null, null));
}
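putting the padding and the 3-way merge together, a sketch with a minimal 3-child Node of my own (not the course code):

```java
import java.util.PriorityQueue;

class TernaryHuffman {
    static final class Node implements Comparable<Node> {
        final char ch; final int freq; final Node left, mid, right;
        Node(char ch, int freq, Node left, Node mid, Node right) {
            this.ch = ch; this.freq = freq;
            this.left = left; this.mid = mid; this.right = right;
        }
        public int compareTo(Node that) { return this.freq - that.freq; }
    }

    static Node buildTrie(int[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (char c = 0; c < freq.length; c++)
            if (freq[c] > 0) pq.offer(new Node(c, freq[c], null, null, null));
        // Pad with zero-frequency dummies until the count is odd and at least 3,
        // i.e., of the form 3 + 2k.
        while (pq.size() < 3 || pq.size() % 2 == 0)
            pq.offer(new Node('\0', 0, null, null, null));
        while (pq.size() > 1) {                  // each merge removes 3 tries, adds 1
            Node x = pq.poll(), y = pq.poll(), z = pq.poll();
            pq.offer(new Node('\0', x.freq + y.freq + z.freq, x, y, z));
        }
        return pq.poll();
    }
}
```

since every merge shrinks the queue by 2, starting from an odd size ≥ 3 guarantees the loop ends with exactly one trie; the dummies have weight 0, so they are merged first and deepest.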
Q2
if you know the answer, please leave a comment here.
Q3 Move-to-front coding
this problem can be solved with a red-black tree that supports the rank and select functions.
rank: given a key, return the rank of that key.
select: given a rank, return the key at that rank.
to move a key to the front, remove it from the red-black tree and re-insert it under a new key equal to (current minimum - 1).
therefore we maintain a map whose key is the char ('a'-'z') and whose value is that char's current key in the red-black tree.
for details, see my github implementation.
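a minimal sketch of the idea, assuming java.util.TreeMap stands in for the hand-rolled red-black tree: headMap supplies the rank (at O(n) per query here, rather than the O(log n) a rank-augmented tree achieves), and moving to front re-inserts under (minimum key - 1):

```java
import java.util.HashMap;
import java.util.TreeMap;

class MtfTree {
    private final TreeMap<Integer, Character> order = new TreeMap<>(); // position key -> char
    private final HashMap<Character, Integer> keyOf = new HashMap<>(); // char -> position key

    MtfTree(int radix) {
        for (int c = 0; c < radix; c++) {        // initial ordering: char c at position c
            order.put(c, (char) c);
            keyOf.put((char) c, c);
        }
    }

    // Return the current rank of c, then move c to the front by re-inserting
    // it under a key smaller than the current minimum.
    int encode(char c) {
        int key = keyOf.get(c);
        int rank = order.headMap(key).size();    // number of keys strictly smaller
        order.remove(key);
        int front = order.firstKey() - 1;        // strictly below everything remaining
        order.put(front, c);
        keyOf.put(c, front);
        return rank;
    }
}
```

because keys only ever shrink, (min - 1) is always a fresh front position and no other entry needs to move.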
Project Burrows Wheeler
The Burrows–Wheeler data compression algorithm consists of three algorithmic components, which are applied in succession:
Burrows–Wheeler transform. Given a typical English text file, transform it into a text file in which sequences of the same character occur near each other many times.
Move-to-front encoding. Given a text file in which sequences of the same character occur near each other many times, convert it into a text file in which certain characters appear much more frequently than others.
Huffman compression. Given a text file in which certain characters appear much more frequently than others, compress it by encoding frequently occurring characters with short codewords and infrequently occurring characters with long codewords.
Step 3 is the only one that compresses the message: it is particularly effective because Steps 1 and 2 produce a text file in which certain characters appear much more frequently than others. To expand a message, apply the inverse operations in reverse order: first apply the Huffman expansion, then the move-to-front decoding, and finally the inverse Burrows–Wheeler transform.
MoveToFront
scan indices 0 to R-1 to find the matching char, then use System.arraycopy to shift the prefix and set index 0 to move the char to the front
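that scan-and-shift loop can be sketched as follows; as an assumption, this version returns an int[] of ranks instead of writing with the course's BinaryStdOut:

```java
class MoveToFrontDemo {
    private static final int R = 256;              // extended ASCII

    static int[] encode(String s) {
        char[] seq = new char[R];
        for (char i = 0; i < R; i++) seq[i] = i;   // initial ordering 0..R-1
        int[] out = new int[s.length()];
        for (int k = 0; k < s.length(); k++) {
            char c = s.charAt(k);
            int i = 0;
            while (seq[i] != c) i++;               // scan for the matching char
            out[k] = i;                            // emit its current position
            System.arraycopy(seq, 0, seq, 1, i);   // shift seq[0..i-1] right by one
            seq[0] = c;                            // move matched char to the front
        }
        return out;
    }
}
```

repeated chars therefore encode as 0, which is exactly what makes the output Huffman-friendly after the Burrows–Wheeler transform.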
CircularSuffixArray
- the cleanest way is to design an inner class CircularSuffix that implements Comparable<CircularSuffix>
- the % operation is costly; we can double the input string to avoid it:
str = (s + s).toCharArray();
- a faster sort is MSD radix sort with an insertion-sort cutoff
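a sketch of these ideas, assuming a plain comparator over suffix start indices sorted with Arrays.sort (the faster MSD-radix variant is not shown); the doubled string lets the comparator index str[a + k] with no % at all:

```java
import java.util.Arrays;

class CircularSuffixArrayDemo {
    private final int n;
    private final int[] index;

    CircularSuffixArrayDemo(String s) {
        n = s.length();
        char[] str = (s + s).toCharArray();        // doubled string avoids % in the loop
        Integer[] suffix = new Integer[n];
        for (int i = 0; i < n; i++) suffix[i] = i; // suffix i = rotation starting at i
        Arrays.sort(suffix, (a, b) -> {
            for (int k = 0; k < n; k++) {          // compare at most n chars
                if (str[a + k] != str[b + k]) return str[a + k] - str[b + k];
            }
            return 0;                              // identical rotations
        });
        index = new int[n];
        for (int i = 0; i < n; i++) index[i] = suffix[i];
    }

    int index(int i) { return index[i]; }          // start of the i-th sorted suffix
    int length() { return n; }
}
```

only the n start indices are sorted; no rotation is ever materialized, so the memory stays linear.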
BurrowsWheeler
- encode just uses CircularSuffixArray, then finds the position of the original string (index 0) in the sorted order
- to write the last character of each sorted suffix, add (n - 1) and take mod n, because the suffixes are circular
- to build the next[] array, we can use key-indexed counting:
for (int i = 0; i < t.length; i++)
next[cnt[t[i]]++] = i;
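the full inverse transform built around that next[] array can be sketched like this; as an assumption, first and the last column t are passed in directly rather than read from BinaryStdIn:

```java
class InverseBwtDemo {
    private static final int R = 256;              // extended ASCII

    static String decode(int first, char[] t) {
        int n = t.length;
        int[] cnt = new int[R + 1];
        for (int i = 0; i < n; i++) cnt[t[i] + 1]++;   // key-indexed counting:
        for (int r = 0; r < R; r++) cnt[r + 1] += cnt[r]; // cnt[c] = start of char c
        int[] next = new int[n];
        for (int i = 0; i < n; i++)
            next[cnt[t[i]]++] = i;                 // i-th c in last col = i-th c in first col
        StringBuilder sb = new StringBuilder();
        for (int i = next[first], k = 0; k < n; i = next[i], k++)
            sb.append(t[i]);                       // follow next[] chain from row `first`
        return sb.toString();
    }
}
```

the stability of key-indexed counting is what makes this correct: the i-th occurrence of a char in the last column corresponds to its i-th occurrence in the (sorted) first column.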
course quiz and project : https://github.com/yixuaz/algorithm4-princeton-cos226