hardware basics
- cache
the slow speed of disk
the technique of keeping frequently accessed disk data in memory - seek time
the time for the disk head moving to the part of the disk where the data are located
0.5ms for typical disk, while 0.2 transfer time for 10M, thus storing in contiguous chunks will improve the data transfer rate. - buffer
the part of memory where a block being read or written is stored - The processor is available to process data during disk I/O
blocked sort-based indexing
- termID
represent the term as termID
do a single pass in the chapter
Reuters-RCV1 collection
- termID-docID 0.8GB, for larger collection, it's hard to sort them in memory, thus turning to external search algorithm
- block sort-based indexing algorithm/BSBI:
segments the collection into parts of equal size.
sort the termID-docID pairs of each part in memory
stores intermedia results on disk
merges all intermedia results into the final result.
- parse next block: parse the documents into termID-docID pairs and accumulates the pairs in memory until the block size is full
- invert:
sort the termID-docID pairs
collect all termID-docID pairs with the same termID into a postings list - merge blocks
open all files simultaneously, and maintain read buffers for all blocks, and a small write buffer for the final result.In each iteration, we choose the lowest termID that has not been processed yet.All postings for the same termID are read and merged, and the merged list is read back to disk.
- time complexity is �(T log T) because the step with the highest time complexity is sorting and T is an upper bound for the number of items we must sort (i.e., the number of termID–docID pairs).
ex4.6
single pass in-memory indexing
- for lager collection, the data structure for mapping term to termID dose not fit into memory
- use terms instead of termID, write each block's dictionary to disk, and then starts a new dictionary for the next block.
-
dynamic and efficient, memory-saved.
- The time complexity of SPIMI is �(T) because no sorting of tokens is required and all operations are at most linear in the size of the collection.
- Some memory is wasted.
distributed indexing
- Web search engine
- term-partitioned index
-
MapReduce - a general architecture for distributed computing to solve large computing problem on cheap commodity machines
- master node: control the splits, assigning and reassigning
- key-value pairs: termID - docID
- map phase - parsers - share a consistent mapping, index- partition, write to corresponding partition segment file into the local disk
- reduce phase- inverters- read from r segment files(only requires one-sequential read to minimize the amount of network traffic)
- parses and inverts are not separate set of machines.
- Google 2004 indexing system
dynamic indexing
- The dictionary and the postings need to be modified,created or deleted.
- Reconstruction: space cost and time delay.
- maintain two indexes: the large main index and a small auxiliary index(kept in memory)
- one-file-per-posting-list scheme is not feasible
- store the index in one file T*T/n
- logarithmatic merging T*log(T/n)
- But query processing requires the merge of log(T) indexes now.
- for position indexed, just more sorting.