- the advantage
decrease the disk space
increase the use of cache, thus decrease response time
decrease the transfer time from disk to memory - In this chapter, we define a posting as a docID in a postings list.
statistical properties of terms in information retrieval
- lossy compression, when the "lost" information is unlikely to be used by the search system
- The compression techniques we discussed in the remainder of this chapter is are lossless, that is, all information is preserved
estimating the number of terms:
- OED: more than 600,000 words, but ignoring most names of person, locations, products etc.
- heap's law
M = k * power(T,b)
for T>100,000, b=0.49, k=44, the fit is excellent.
It suggests...
modeling the distribution of terms
- Zipf's law
the collection frequency of the ith commonest item = c * power(i,k)
log cf = logc + k * log i
it suggests...
dictionary compression
- why is it necessary?
To decrease the number of disk seeks to shorten the response time.
dictionary as a string
- the simplest approach: the dictionary as a fixed-width array.
Too wasteful!!!
Can't store term containing more than 20 characters - storing the dictionary terms as one long string.
blocked storage
- grouping terms in the string into blocks of size k, and keeping one term pointer for the first term in the string, adding the length byte for each term in the string.
- trade-off between the compression and the speed of term lookup.
- front coding:
- hash function:
unifiable for dynamic environment, since every new term will create collision.
posting file compression
- variable ecoding
docID (rare terms) vs gap(frequent terms) -
two methods
bytewise, bitwise (encoding the gaps )
Variable byte encoding
- larger : less effective compression, less bit manipulation.
- smaller: more effective compression, more bit manipulation.
- trade-off between compression ratio and depression time.
γ Codes
- universal
- prefix free, parameter free
- how much compression does it achieve?