Original article:
https://github.com/gtoubassi/femtozip/wiki/How-femtozip-works
How FemtoZip Works
FemtoZip represents two natural extensions of gzip and gzip-like compression schemes. To understand it, it is easiest to first understand how gzip/DEFLATE works.
How GZip Works
A great overview is provided here: http://www.gzip.org/deflate.html. But the short story is gzip does two things:
- Look for repeated substrings in the input stream. If a substring of at least 3 characters has already occurred within the previous 32k of input, it is replaced by a relative reference to the previous occurrence: <distance, length>. For example
mississippi
becomes miss<-3,4>ppi
(note the <> notation in the string is just to show the substitution; it is not expressed as ascii). This example is a bit exotic because it shows that the characters referenced by a <distance, length> pair may not all exist yet. But by the time they are needed, they will! This allows gzip to get the effect of run-length encoding. For example 'aaaaaaaaaa' becomes a<-1,9>.
- The second thing gzip does is take the resulting stream of literals (say miss in the mississippi example), as well as the stream of distances and lengths, and Huffman encode them, so that common symbols (like the letter 'e' in English) use fewer than 8 bits. The details of the Huffman coding scheme are fairly involved; in fact two Huffman trees are used, one which encodes literals and lengths, and one which encodes distances.
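The substring-matching step can be sketched with a naive, quadratic-time tokenizer (illustrative only; real DEFLATE implementations use hash chains and emit bit-packed output, but the token stream is the same idea):

```python
def lz77_tokens(data, window=32 * 1024, min_len=3):
    """Naive greedy LZ77-style tokenizer: returns a mix of literal
    characters and (distance, length) back-references."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # The copy may overlap the current position: each copied
            # character exists by the time it is needed, which is what
            # gives the run-length-encoding effect.
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            out.append((-best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out

print(lz77_tokens("mississippi"))  # ['m', 'i', 's', 's', (-3, 4), 'p', 'p', 'i']
print(lz77_tokens("aaaaaaaaaa"))   # ['a', (-1, 9)]
```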
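The Huffman step can likewise be sketched in a few lines (illustrative only; DEFLATE's actual scheme uses canonical codes and the two-tree structure described above):

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built
    from the symbol frequencies of `text`."""
    freq = Counter(text)
    # Heap entries: (weight, tiebreaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths("no nonsense encoding")
# 'n' occurs 6 times and gets a short code; 'd' occurs once and gets a long one.
assert lengths["n"] < lengths["d"]
```

Frequent symbols end up near the top of the tree and cost fewer bits, which is why a tree tuned to the actual data beats a generic one.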
With the above background on gzip you can see why it doesn't work on small documents. First, a repeated substring has to be seen at least twice to get any compression, so for documents that are similar to each other but have no internal similarity, step 1 does nothing. Step 2 is a problem because a Huffman tree derived from the actual data will compress the data well, but its definition has to be written into the stream so the receiver knows how to decode. For small documents gzip will either just stick with a built-in Huffman tree (which is unlikely to be optimal for the data), or take a significant cost in storing the derived tree in the stream.
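This is easy to observe with zlib itself (exact sizes depend on the zlib version, so the sketch below only compares orders of magnitude):

```python
import zlib

# A small document with little internal repetition: step 1 finds almost
# nothing to match, and the header/Huffman overhead dominates.
doc = b'{"user": "alice", "active": true}'
small = zlib.compress(doc, 9)
print(len(doc), len(small))  # the "compressed" output is no smaller

# A document with lots of internal repetition compresses very well,
# because step 1 turns the repeats into cheap back-references.
big = doc * 100
print(len(big), len(zlib.compress(big, 9)))
```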
How FemtoZip Improves on GZip
FemtoZip addresses both shortcomings by doing the following:
- Analyze a sample set of documents and derive a set of popular substrings. These substrings are packed into a dictionary and used to bootstrap the substring repetition detector; in effect the compressor starts off with some history to compare against. zlib actually supports starting compression/decompression off with an initial dictionary via deflateSetDictionary() and inflateSetDictionary(), but those APIs have some overhead (for example in the deflate case, the dictionary has to be hashed before compressing each document, which is a significant performance hit for small documents). Specifying a dictionary to zlib also causes a checksum of the dictionary to be written to the stream, which can be a significant cost for small documents. And zlib has no facility for computing a dictionary from sample data.
- FemtoZip uses a 64k dictionary, vs the historical 32k limit in gzip.
- An optimal Huffman tree is computed based on the sample documents, and is stored once in the model rather than once per document.
- FemtoZip writes no header bytes or checksums to the stream.
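The effect of the shared dictionary (the first item above) can be approximated with zlib's preset-dictionary API, which Python exposes as the `zdict` argument. The dictionary below is hypothetical and hand-written; FemtoZip's point is that it derives such a dictionary automatically from sample documents:

```python
import zlib

# Hypothetical hand-built dictionary; FemtoZip would instead mine
# popular substrings like these from a sample set of documents.
shared_dict = b'{"user": "", "active": , "role": "admin"}'

doc = b'{"user": "alice", "active": true, "role": "admin"}'

# Without a dictionary, this small document barely compresses.
plain = zlib.compress(doc, 9)

# With a preset dictionary (zlib's deflateSetDictionary), long matches
# against the dictionary replace most of the document.
comp = zlib.compressobj(9, zdict=shared_dict)
with_dict = comp.compress(doc) + comp.flush()

decomp = zlib.decompressobj(zdict=shared_dict)
assert decomp.decompress(with_dict) == doc
print(len(doc), len(plain), len(with_dict))
```

Note that zlib still writes the dictionary's checksum into the header of `with_dict`, per-document overhead that FemtoZip's header-free, checksum-free stream avoids.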