Compression reduces the size of a file, which saves space when storing it and time when transmitting it. Compression is possible because most files have lots of redundancy.
lossless compression and expansion
given binary data B that we want to compress, compression generates a "compressed" representation C(B); expansion then reconstructs the original bitstream B.
the compression ratio is |C(B)| / |B|,
so we want C(B) to use as few bits as possible.
genomic code
reading and writing binary data
run-length encoding
variable-length codes
use a different number of bits to encode different chars.
for example, Morse code.
but the issue is ambiguity:
... --- ... could be decoded as SOS, V7, IAMIE, or EEWNI.
how do we avoid ambiguity?
solution 1. fixed-length code
solution 2. append a special stop char to each codeword
solution 3. use a prefix-free code.
how to write trie
write a preorder traversal of the trie, marking leaf and internal nodes with a bit.
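a minimal sketch of that preorder serialization, assuming a hand-rolled Node class and bits collected in a StringBuilder (instead of the course's BinaryStdOut) so it stays self-contained; a leaf is written as 1 followed by its 8-bit char, an internal node as 0:

```java
class TrieDemo {
    static final class Node {
        final char ch; final Node left, right;
        Node(char ch, Node left, Node right) { this.ch = ch; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Preorder: leaf -> '1' + 8-bit char; internal node -> '0' then both subtries.
    static void writeTrie(Node x, StringBuilder out) {
        if (x.isLeaf()) {
            out.append('1');
            for (int i = 7; i >= 0; i--) out.append((x.ch >> i) & 1);
            return;
        }
        out.append('0');
        writeTrie(x.left, out);
        writeTrie(x.right, out);
    }

    // Reading mirrors writing: the recursion order alone recovers the trie shape.
    static Node readTrie(StringBuilder in, int[] pos) {
        if (in.charAt(pos[0]++) == '1') {
            char c = 0;
            for (int i = 0; i < 8; i++) c = (char) ((c << 1) | (in.charAt(pos[0]++) - '0'));
            return new Node(c, null, null);
        }
        Node left = readTrie(in, pos);
        Node right = readTrie(in, pos);
        return new Node('\0', left, right);
    }
}
```

note that no pointers are written at all: preorder plus the leaf/internal bit is enough to rebuild the trie exactly.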
huffman algorithm
Count the frequency freq[i] of each char i in the input.
・Start with one node per char i, with weight freq[i].
・Repeat until a single trie is formed:
select the two tries with minimum weights freq[i] and freq[j];
merge them into a single trie with weight freq[i] + freq[j].
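the merge loop above can be sketched with java.util.PriorityQueue; the Node class and field names here are my own, not the course code, and weights come in as a freq[] table indexed by char:

```java
import java.util.PriorityQueue;

class HuffmanBuild {
    static final class Node implements Comparable<Node> {
        final char ch; final int freq; final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        public int compareTo(Node that) { return this.freq - that.freq; }
    }

    static Node buildTrie(int[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (char c = 0; c < freq.length; c++)         // one leaf per char that occurs
            if (freq[c] > 0) pq.offer(new Node(c, freq[c], null, null));
        while (pq.size() > 1) {                        // repeat until a single trie remains
            Node x = pq.poll();                        // two tries of minimum weight
            Node y = pq.poll();
            pq.offer(new Node('\0', x.freq + y.freq, x, y));
        }
        return pq.poll();
    }
}
```

each merge removes two tries and adds one, so n leaves take exactly n - 1 merges, and the root weight equals the total char count.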
LZW
there are many different kinds of models.
a static model uses the same model for all texts. its advantage is speed, but it is not optimal: different texts have different statistical properties.
eg: Morse code
a dynamic model generates the model based on the text.
a preliminary pass is needed to generate the model; Huffman coding is an example.
an adaptive model progressively learns and updates the model as you read the text. more accurate modeling produces better compression, but decoding must start from the beginning.
an implementation of this is in the extra credit.
QUIZ
Q1 Ternary Huffman codes
Combine the smallest 3 probabilities at each step (instead of the smallest 2), which means we poll 3 elements from the PQ and generate one node.
another difference: with a binary alphabet, one bit tells us whether to go to the left or right child; now we need a ternary digit (two bits) to distinguish left, mid, and right.
to handle the case when the number of symbols is not of the form 3 + 2k for some integer k,
we add some zero-frequency dummy nodes:
while (pq.size() < 3 || pq.size() % 2 == 0) {
pq.offer(new Node('\0', 0, null, null, null));
}
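putting the padding and the 3-way merge together, a sketch with a minimal 3-child Node of my own (not the course code):

```java
import java.util.PriorityQueue;

class TernaryHuffman {
    static final class Node implements Comparable<Node> {
        final char ch; final int freq; final Node left, mid, right;
        Node(char ch, int freq, Node left, Node mid, Node right) {
            this.ch = ch; this.freq = freq;
            this.left = left; this.mid = mid; this.right = right;
        }
        public int compareTo(Node that) { return this.freq - that.freq; }
    }

    static Node buildTrie(int[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (char c = 0; c < freq.length; c++)
            if (freq[c] > 0) pq.offer(new Node(c, freq[c], null, null, null));
        // Pad with zero-frequency dummies until the count is odd and at least 3,
        // i.e., of the form 3 + 2k.
        while (pq.size() < 3 || pq.size() % 2 == 0)
            pq.offer(new Node('\0', 0, null, null, null));
        while (pq.size() > 1) {                  // each merge removes 3 tries, adds 1
            Node x = pq.poll(), y = pq.poll(), z = pq.poll();
            pq.offer(new Node('\0', x.freq + y.freq + z.freq, x, y, z));
        }
        return pq.poll();
    }
}
```

since every merge shrinks the queue by 2, starting from an odd size ≥ 3 guarantees the loop ends with exactly one trie; the dummies have weight 0, so they are merged first and deepest.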
Q2
if you know the answer, please leave a comment here.
Q3 Move-to-front coding
this problem can be solved with a red-black tree that supports the rank and select functions.
rank: given a key, return the rank of that key.
select: given a rank, return the key at that rank.
to move a key to the front, remove it from the red-black tree and re-insert it under a new key equal to (current minimum - 1).
therefore we maintain a map whose key is the char ('a'-'z') and whose value is that char's current key in the red-black tree.
for details, see my github implementation.
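a minimal sketch of the idea, assuming java.util.TreeMap stands in for the hand-rolled red-black tree: headMap supplies the rank (at O(n) per query here, rather than the O(log n) a rank-augmented tree achieves), and moving to front re-inserts under (minimum key - 1):

```java
import java.util.HashMap;
import java.util.TreeMap;

class MtfTree {
    private final TreeMap<Integer, Character> order = new TreeMap<>(); // position key -> char
    private final HashMap<Character, Integer> keyOf = new HashMap<>(); // char -> position key

    MtfTree(int radix) {
        for (int c = 0; c < radix; c++) {        // initial ordering: char c at position c
            order.put(c, (char) c);
            keyOf.put((char) c, c);
        }
    }

    // Return the current rank of c, then move c to the front by re-inserting
    // it under a key smaller than the current minimum.
    int encode(char c) {
        int key = keyOf.get(c);
        int rank = order.headMap(key).size();    // number of keys strictly smaller
        order.remove(key);
        int front = order.firstKey() - 1;        // strictly below everything remaining
        order.put(front, c);
        keyOf.put(c, front);
        return rank;
    }
}
```

because keys only ever shrink, (min - 1) is always a fresh front position and no other entry needs to move.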
Project Burrows Wheeler
The Burrows–Wheeler data compression algorithm consists of three algorithmic components, which are applied in succession:
Burrows–Wheeler transform. Given a typical English text file, transform it into a text file in which sequences of the same character occur near each other many times.
Move-to-front encoding. Given a text file in which sequences of the same character occur near each other many times, convert it into a text file in which certain characters appear much more frequently than others.
Huffman compression. Given a text file in which certain characters appear much more frequently than others, compress it by encoding frequently occurring characters with short codewords and infrequently occurring characters with long codewords.
Step 3 is the only one that compresses the message: it is particularly effective because Steps 1 and 2 produce a text file in which certain characters appear much more frequently than others. To expand a message, apply the inverse operations in reverse order: first apply the Huffman expansion, then the move-to-front decoding, and finally the inverse Burrows–Wheeler transform.
MoveToFront
scan indices 0 to R-1 to find the matching char, then use System.arraycopy to shift the prefix and set index 0 to move the char to the front
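that scan-and-shift loop can be sketched as follows; as an assumption, this version returns an int[] of ranks instead of writing with the course's BinaryStdOut:

```java
class MoveToFrontDemo {
    private static final int R = 256;              // extended ASCII

    static int[] encode(String s) {
        char[] seq = new char[R];
        for (char i = 0; i < R; i++) seq[i] = i;   // initial ordering 0..R-1
        int[] out = new int[s.length()];
        for (int k = 0; k < s.length(); k++) {
            char c = s.charAt(k);
            int i = 0;
            while (seq[i] != c) i++;               // scan for the matching char
            out[k] = i;                            // emit its current position
            System.arraycopy(seq, 0, seq, 1, i);   // shift seq[0..i-1] right by one
            seq[0] = c;                            // move matched char to the front
        }
        return out;
    }
}
```

repeated chars therefore encode as 0, which is exactly what makes the output Huffman-friendly after the Burrows–Wheeler transform.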
CircularSuffixArray
- the cleanest way is to design an inner class CircularSuffix that implements Comparable<CircularSuffix>
- the % operation is costly; we can double the input string to avoid it:
str = (s + s).toCharArray();
- a faster sort is MSD radix sort with an insertion-sort cutoff
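a sketch of these ideas, assuming a plain comparator over suffix start indices sorted with Arrays.sort (the faster MSD-radix variant is not shown); the doubled string lets the comparator index str[a + k] with no % at all:

```java
import java.util.Arrays;

class CircularSuffixArrayDemo {
    private final int n;
    private final int[] index;

    CircularSuffixArrayDemo(String s) {
        n = s.length();
        char[] str = (s + s).toCharArray();        // doubled string avoids % in the loop
        Integer[] suffix = new Integer[n];
        for (int i = 0; i < n; i++) suffix[i] = i; // suffix i = rotation starting at i
        Arrays.sort(suffix, (a, b) -> {
            for (int k = 0; k < n; k++) {          // compare at most n chars
                if (str[a + k] != str[b + k]) return str[a + k] - str[b + k];
            }
            return 0;                              // identical rotations
        });
        index = new int[n];
        for (int i = 0; i < n; i++) index[i] = suffix[i];
    }

    int index(int i) { return index[i]; }          // start of the i-th sorted suffix
    int length() { return n; }
}
```

only the n start indices are sorted; no rotation is ever materialized, so the memory stays linear.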
BurrowsWheeler
- encode just uses CircularSuffixArray, then finds the position of the original string (index 0) in the sorted order
- to write the last character of each sorted suffix, add (n - 1) and take mod n, because the suffixes are circular
- to build the next[] array, we can use key-indexed counting:
for (int i = 0; i < t.length; i++)
next[cnt[t[i]]++] = i;
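the full inverse transform built around that next[] array can be sketched like this; as an assumption, first and the last column t are passed in directly rather than read from BinaryStdIn:

```java
class InverseBwtDemo {
    private static final int R = 256;              // extended ASCII

    static String decode(int first, char[] t) {
        int n = t.length;
        int[] cnt = new int[R + 1];
        for (int i = 0; i < n; i++) cnt[t[i] + 1]++;   // key-indexed counting:
        for (int r = 0; r < R; r++) cnt[r + 1] += cnt[r]; // cnt[c] = start of char c
        int[] next = new int[n];
        for (int i = 0; i < n; i++)
            next[cnt[t[i]]++] = i;                 // i-th c in last col = i-th c in first col
        StringBuilder sb = new StringBuilder();
        for (int i = next[first], k = 0; k < n; i = next[i], k++)
            sb.append(t[i]);                       // follow next[] chain from row `first`
        return sb.toString();
    }
}
```

the stability of key-indexed counting is what makes this correct: the i-th occurrence of a char in the last column corresponds to its i-th occurrence in the (sorted) first column.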
course quiz and project : https://github.com/yixuaz/algorithm4-princeton-cos226