IR-chapter6: scoring, term weighting and the vector space model

  • motivation: to rank-order the documents matching a query by giving a score to each (query,document) pair

parametric and zone indexes

  • index and retrieve documents by metadata.
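  • parametric index vs zone index: the dictionary of a parametric index comes from a fixed vocabulary (e.g. a set of dates or languages), while a zone index must handle whatever vocabulary occurs in the free text of that zone.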
parametric search
zone index
  • weighted zone scoring
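  • learning weights: estimate the zone weights from training examples, i.e. (query, document) pairs judged relevant or non-relevant
  • the optimal weight g: the value that minimizes the total squared error between the weighted zone score and the relevance judgments
    a simple machine learning problem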
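
The weighted zone scoring idea above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the zone names and the whitespace tokenizer are all assumptions made for the example. It assumes each zone contributes a Boolean match (1 if the zone contains every query term, else 0) and that the zone weights sum to 1. The second function covers the two-zone case with score = g * s_title + (1 - g) * s_body, where the squared-error-minimizing g follows from counting the training examples on which the two zones disagree.

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Weighted sum of per-zone Boolean matches (hypothetical helper)."""
    # Assumption: a zone matches when it contains every query term (Boolean AND),
    # and the weights sum to 1 so the score lies in [0, 1].
    score = 0.0
    for zone, weight in weights.items():
        zone_terms = set(doc_zones.get(zone, "").lower().split())
        if all(term in zone_terms for term in query_terms):
            score += weight
    return score


def optimal_g(n10r, n10n, n01r, n01n):
    """Two-zone weight g minimizing total squared error on training examples.

    n10r / n10n: examples where only zone 1 matches, judged relevant / non-relevant.
    n01r / n01n: examples where only zone 2 matches, judged relevant / non-relevant.
    Examples where both zones agree do not depend on g and are ignored.
    """
    total = n10r + n10n + n01r + n01n
    return (n10r + n01n) / total


# Toy document with two zones (illustrative data only).
doc = {
    "title": "information retrieval basics",
    "body": "ranking documents with tf idf weights for retrieval",
}
zone_weights = {"title": 0.3, "body": 0.7}

score_both = weighted_zone_score(["retrieval"], doc, zone_weights)  # matches both zones
score_body = weighted_zone_score(["tf"], doc, zone_weights)         # matches body only
g = optimal_g(3, 1, 1, 3)
```

Intuitively, g grows when matches in the first zone alone tend to be relevant and matches in the second zone alone tend to be non-relevant.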

term frequency and weighting

  • intuition: scores relate to term frequency, but are all words equally important?
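  • free text query: view each document as a set of term weights (bag of words model: word order is ignored, only occurrence counts matter)
    score = the sum, over all query terms, of each term's weight in the document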
  • inverse document frequency
  • tf-idf weighting
    terms with lower document frequency weigh higher
tf-idf
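
As a rough sketch of the weighting above, the snippet below assumes the common definitions idf_t = log10(N / df_t) and tf-idf_{t,d} = tf_{t,d} * idf_t, and scores a document by summing the tf-idf weights of the query terms. The helper names and the toy collection are assumptions made for illustration.

```python
import math
from collections import Counter


def build_idf(docs):
    """Map each term to log10(N / df), where df counts documents containing it."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return {term: math.log10(n_docs / count) for term, count in df.items()}


def tf_idf_score(query_terms, doc_tokens, idf):
    """Sum of tf * idf over the query terms (overlap score measure)."""
    tf = Counter(doc_tokens)
    # Terms unseen in the collection get idf 0.0 here (an assumption of this sketch).
    return sum(tf[term] * idf.get(term, 0.0) for term in query_terms)


# Toy collection of four tokenized documents (illustrative data only).
docs = [
    ["the", "car", "car"],
    ["the", "auto"],
    ["the", "insurance", "auto"],
    ["the", "best"],
]
idf = build_idf(docs)
score = tf_idf_score(["car", "the"], docs[0], idf)
```

Here "the" occurs in every document, so its idf is 0 and it contributes nothing to the score, while the rarer "car" dominates.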

the vector space for scoring

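  • dot products: similarity between two documents
    why not use the magnitude of the vector difference? it is sensitive to document length: two documents with similar content can be far apart simply because one is much longer.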
cosine similarity
length-normalized cosine similarity
  • queries as vectors
    computing the similarity between the query and every document is expensive
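
To make the document-length effect concrete, here is a small sketch that represents vectors as sparse dicts from term to weight (a representation chosen for this example, not prescribed by the notes). It compares a document with a copy of itself repeated twice: the vector difference is large, while the cosine similarity is still 1.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean lengths."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors in this sketch
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Magnitude of the vector difference u - v."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


# d2 is d1 appended to itself, so every term weight doubles (toy data).
d1 = {"car": 3.0, "insurance": 4.0}
d2 = {"car": 6.0, "insurance": 8.0}
query = {"car": 1.0}

sim_docs = cosine_similarity(d1, d2)
dist_docs = euclidean_distance(d1, d2)
sim_query = cosine_similarity(query, d1)
```

The same function scores a query against a document, since a query is treated as just another (very short) vector.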

  • computing vector scores

basic algorithm
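
A minimal term-at-a-time sketch of such a basic scoring algorithm, under stated assumptions: `postings` maps each term to a list of (doc_id, weight) pairs, `doc_lengths` holds precomputed Euclidean lengths of the document vectors, and the top K results are picked with a heap. All names and data are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def cosine_score(query_weights, postings, doc_lengths, k):
    """Return the k highest-scoring (doc_id, score) pairs for the query."""
    scores = defaultdict(float)  # one accumulator per document touched
    for term, w_tq in query_weights.items():
        # Walk the postings list of each query term and accumulate partial scores.
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_td * w_tq
    # Length-normalize each accumulated score by the document's vector length.
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    # Select the top k without sorting every score.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy index: doc 1 = {car: 2}, doc 2 = {car: 1, auto: 3} (illustrative weights).
postings = {
    "car": [(1, 2.0), (2, 1.0)],
    "auto": [(2, 3.0)],
}
doc_lengths = {1: 2.0, 2: math.sqrt(10.0)}

top = cosine_score({"car": 1.0, "auto": 1.0}, postings, doc_lengths, k=2)
```

Only documents that appear in some query term's postings list get an accumulator, which is why this is cheaper than comparing the query with every document vector. The query's own length is left out because it scales all scores equally and does not change the ranking.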

Variant tf–idf functions

SMART notation for tf–idf variants.
  • Pivoted normalized document length
    the relationship between document length and relevance
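
One common way to express pivoted normalization is as a linear function of the original length that is "rotated" around a pivot point. The sketch below assumes the form slope * length + (1 - slope) * pivot with slope < 1; the function name and the parameter values are illustrative and would normally be tuned on relevance data.

```python
def pivoted_norm(length, pivot, slope):
    """Pivoted normalization factor: a line through (pivot, pivot) with slope < 1."""
    return slope * length + (1.0 - slope) * pivot


# Illustrative values: pivot at length 100, slope 0.75.
at_pivot = pivoted_norm(100.0, pivot=100.0, slope=0.75)   # unchanged at the pivot
long_doc = pivoted_norm(200.0, pivot=100.0, slope=0.75)   # smaller than 200
short_doc = pivoted_norm(50.0, pivot=100.0, slope=0.75)   # larger than 50
```

Dividing scores by this factor instead of the raw length penalizes long documents less and short documents more, which is the correction pivoting is meant to provide when plain cosine normalization over-penalizes long documents.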

linear model
machine learning techniques
