- motivation: to rank-order the documents matching a query by giving a score to each (query,document) pair
parametric and zone indexes
- index and retrieve documents by metadata.
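- parametric index vs zone index: a parametric index covers a field with a fixed vocabulary (e.g. dates); a zone index covers whatever vocabulary appears in the free text of that zone (e.g. title, abstract).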
- weighted zone scoring
- learning weights
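- the optimal weight g: the value that minimizes the total squared error on the training examples
machine learning algorithm: the weights are learned from training examples that carry relevance judgments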
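
A minimal sketch of weighted zone scoring and of the closed-form least-squares weight for the two-zone case, assuming Boolean zone matches and a score of the form g * s_title + (1 - g) * s_body. The function names, the zone names, the weights and the training triples are all hypothetical and chosen only for illustration.

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Add up the weights of every zone that contains all query terms."""
    score = 0.0
    for zone, weight in weights.items():
        zone_tokens = set(doc_zones.get(zone, "").lower().split())
        # Boolean match: the zone scores 1 if it contains every query term, else 0.
        if all(term in zone_tokens for term in query_terms):
            score += weight
    return score


def optimal_g(examples):
    """Least-squares weight g for score = g * s_title + (1 - g) * s_body.

    examples: (s_title, s_body, relevant) triples with 0/1 entries.
    Only examples where exactly one zone matches influence g.
    """
    n10r = n10n = n01r = n01n = 0
    for s_title, s_body, relevant in examples:
        if (s_title, s_body) == (1, 0):
            if relevant:
                n10r += 1
            else:
                n10n += 1
        elif (s_title, s_body) == (0, 1):
            if relevant:
                n01r += 1
            else:
                n01n += 1
    total = n10r + n10n + n01r + n01n
    # g is undefined when no example separates the two zones.
    return (n10r + n01n) / total if total else None


# Hypothetical zone weights; they sum to 1 so that scores stay in [0, 1].
zone_weights = {"title": 0.5, "author": 0.2, "body": 0.3}
doc = {
    "title": "information retrieval basics",
    "author": "jane doe",
    "body": "an introduction to retrieval and ranking of documents",
}
zone_score = weighted_zone_score(["retrieval"], doc, zone_weights)
print(zone_score)  # title and body match, author does not

# Hypothetical training data: (title match, body match, judged relevant).
training = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
learned_g = optimal_g(training)
print(learned_g)
```

Examples where both zones match or neither matches add the same error for every g, which is why the sketch skips them.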
term frequency and weighting
- intuition: scores relate to term frequency, but are all words equally important?
- free text query: each document is viewed as a set of term weights (bag of words model, word order is ignored)
score = the sum, over all query terms, of each term's weight in the document
- inverse document frequency (idf)
- tf-idf weighting
terms with lower document frequency get a higher weight
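
A small sketch of tf-idf weighting, assuming raw term counts for tf and idf = log10(N / df). The toy documents and the helper names are made up for illustration; other tf and idf variants would work the same way.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]


def overlap_score(query_terms, doc_weights):
    """Score a document as the sum of the weights of the query terms it contains."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


# Hypothetical toy collection.
docs = [
    "the car is fast".split(),
    "the car insurance is cheap".split(),
    "the best insurance".split(),
]
tfidf = tf_idf_weights(docs)
print(tfidf[0])
print([overlap_score(["car", "insurance"], w) for w in tfidf])
```

In this toy collection "the" occurs in every document, so its idf and therefore its weight is zero, while the rarer "fast" outweighs the more common "car".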
the vector space for scoring
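- dot products: similarity between two documents is the dot product of their length-normalized vectors (cosine similarity)
why not the magnitude of the vector difference? It is sensitive to document length: two documents with similar content can be far apart simply because one is much longer. Cosine similarity compensates for this.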
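
To illustrate the document length effect, here is a rough sketch comparing the magnitude of the vector difference with cosine similarity. The three vectors are invented: d2 is d1 with every term count multiplied by ten, and d3 has different content but a similar length to d1. The sketch assumes non-zero vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(u, v) / (norm(u) * norm(v))


def euclidean(u, v):
    """Magnitude of the vector difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d1 = [1.0, 2.0, 0.0]
d2 = [10.0, 20.0, 0.0]  # same content as d1, ten times longer
d3 = [1.0, 0.0, 2.0]    # different content, same length as d1

print(euclidean(d1, d2), euclidean(d1, d3))
print(cosine(d1, d2), cosine(d1, d3))
```

By vector difference d3 looks closer to d1 than d2 does, even though d2 has the same term distribution as d1; cosine similarity ranks them the other way round.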
query as vectors
computation is expensive
computing vector scores
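
One common way to keep vector scoring affordable is to walk only the postings of the query terms and accumulate partial scores per document, instead of comparing the query against every document vector. The sketch below assumes precomputed term weights in the postings and precomputed document vector lengths; the index contents, the weights and the function name are hypothetical.

```python
import heapq
from collections import defaultdict


def cosine_scores(query_weights, postings, doc_lengths, k=10):
    """Term-at-a-time scoring.

    query_weights: {term: weight of the term in the query}
    postings:      {term: [(doc_id, weight of the term in that document), ...]}
    doc_lengths:   {doc_id: Euclidean length of the document vector}
    Returns the k highest-scoring (doc_id, score) pairs.
    """
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td  # accumulate the dot product
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]  # length-normalize
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical toy index.
postings = {
    "car": [(1, 1.0), (2, 2.0)],
    "insurance": [(2, 1.0), (3, 3.0)],
}
doc_lengths = {1: 2.0, 2: 2.5, 3: 3.5}
ranking = cosine_scores({"car": 1.0, "insurance": 1.0}, postings, doc_lengths, k=3)
print(ranking)
```

The query vector is not normalized here because dividing every score by the same query length would not change the ranking.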
variant tf-idf functions
- pivoted normalized document length
the relationship between document length and relevance
linear model
machine learning techniques
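
A minimal sketch of pivoted length normalization as a linear function of document length, assuming the usual form slope * length + (1 - slope) * pivot. The pivot and slope values below are invented; in practice they would be tuned or learned from relevance data.

```python
def pivoted_norm(length, pivot, slope):
    """Normalization factor that equals the plain length at the pivot and
    grows more slowly than the length elsewhere (for 0 < slope < 1)."""
    return slope * length + (1.0 - slope) * pivot


# Hypothetical settings.
PIVOT = 100.0
SLOPE = 0.25

for length in (50.0, 100.0, 400.0):
    print(length, pivoted_norm(length, PIVOT, SLOPE))
```

Documents shorter than the pivot are divided by a larger factor than their own length, and longer documents by a smaller one, which shifts scores in favor of longer documents relative to plain cosine normalization.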