IR-chapter6: scoring, term weighting and the vector space model

  • motivation: to rank-order the documents matching a query by giving a score to each (query,document) pair

parametric and zone indexes

  • index and retrieve documents by metadata.
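  • parametric index vs zone index: a parametric index covers a field whose values come from a fixed vocabulary (e.g. dates); a zone index covers arbitrary free text, so its vocabulary is whatever appears in the text of that zone.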
parametric search
zone index
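  • weighted zone scoring: each zone i gets a weight g_i in [0,1], the weights sum to 1, and the score is the sum of g_i over the zones that match the query
  • learning weights: the g_i are learned from training examples judged relevant or non-relevant (a machine learning problem)
  • the optimal weight g: the value that minimizes the total squared error over the training examples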
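
To make weighted zone scoring concrete, here is a minimal sketch. The function name, the whitespace tokenizer, the sample document and the weights (0.2 / 0.3 / 0.5) are all assumptions chosen for illustration; each zone is matched with a simple Boolean AND over the query terms.

```python
def weighted_zone_score(query_terms, zones, weights):
    """Sum the weight g_i of every zone that contains all query terms.

    zones:   dict mapping zone name -> zone text (hypothetical layout)
    weights: dict mapping zone name -> g_i, expected to sum to 1
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("zone weights should sum to 1")
    score = 0.0
    for zone, g in weights.items():
        tokens = set(zones.get(zone, "").lower().split())
        # Boolean AND match: s_i is 1 only if every query term occurs in the zone.
        s = 1 if all(term in tokens for term in query_terms) else 0
        score += g * s
    return score


# Illustrative document and weights (assumed values, not from the notes).
doc = {
    "author": "william shakespeare",
    "title": "the merchant of venice",
    "body": "shylock the merchant lends money in venice",
}
g = {"author": 0.2, "title": 0.3, "body": 0.5}
score = weighted_zone_score(["merchant", "venice"], doc, g)
```

Here the title and body zones match and the author zone does not, so the score is the sum of the title and body weights.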

term frequency and weighting

  • intuition: scores relate to term frequency, but are all words equally important?
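  • free text query: each document is viewed as a set of term weights; in the bag of words model the order of the terms is ignored
    score = the sum, over the query terms, of each term's weight in the document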
  • inverse document frequency
  • tf-idf weighting
    terms with lower document frequency weigh higher
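tf-idf(t,d) = tf(t,d) × idf(t), where idf(t) = log(N / df(t)) and N is the number of documents in the collection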
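
A small sketch of tf-idf weighting and the overlap score built on it. The function names, the toy collection and the choice of base-10 logarithm are assumptions for illustration; raw term counts are used as tf.

```python
import math
from collections import Counter


def idf(docs):
    """idf(t) = log10(N / df(t)) for every term in a tokenized collection."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log10(n / df_t) for t, df_t in df.items()}


def tf_idf(tokens, idf_weights):
    """tf-idf weight of each term in one document (raw count times idf)."""
    tf = Counter(tokens)
    return {t: count * idf_weights.get(t, 0.0) for t, count in tf.items()}


def overlap_score(query_terms, doc_weights):
    """Score = sum of the document's tf-idf weights over the query terms."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)


# Toy collection (assumed data).
docs = [
    ["car", "insurance", "auto"],
    ["car", "car", "best"],
    ["auto", "insurance", "insurance", "car"],
]
idf_w = idf(docs)
weights = tf_idf(docs[2], idf_w)
score = overlap_score(["car", "insurance"], weights)
```

In this toy collection "car" occurs in every document, so its idf is 0 and it contributes nothing to the score, while the rarer "best" gets the highest idf.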

the vector space for scoring

  • dot products: similarity between two documents
    why not the magnitude of the vector difference? It is sensitive to document length: two documents with similar content can differ greatly just because one is longer.
cosine similarity
length-normalized cosine similarity
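  • queries as vectors: score each document by its cosine similarity to the query vector
    computing this for every document in the collection is expensive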
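
The effect of length normalization can be checked with a short sketch. Vectors are represented as sparse dicts of term weights; the function name and the sample weights are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# A document and the same document appended to itself (all weights doubled).
d = {"car": 1.0, "insurance": 2.0}
d_doubled = {t: 2 * w for t, w in d.items()}

# The Euclidean distance between the two is non-zero...
distance = math.sqrt(sum((d_doubled[t] - d[t]) ** 2 for t in d))
# ...but the cosine similarity ignores the difference in length.
similarity = cosine_similarity(d, d_doubled)
```

The doubled document points in the same direction as the original, so its cosine similarity is 1 (up to floating-point rounding) even though the vector difference is large.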

  • computing vector scores

basic algorithm
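
One way the basic algorithm could look is a term-at-a-time loop that accumulates scores per document, normalizes by document length and keeps the top k. This is a sketch under assumptions: the postings layout (term mapped to a list of `(doc_id, weight)` pairs), the `doc_lengths` table and the sample values are hypothetical.

```python
import heapq
from collections import defaultdict


def cosine_score(query_weights, postings, doc_lengths, k=10):
    """Return the k highest-scoring (doc_id, score) pairs for a query.

    query_weights: term -> weight of the term in the query
    postings:      term -> list of (doc_id, weight of the term in that doc)
    doc_lengths:   doc_id -> Euclidean length of the document vector
    """
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        # Walk the postings list of each query term and accumulate.
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_td * w_tq
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]  # length normalization
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical index with two documents.
postings = {
    "car": [(1, 1.0), (2, 2.0)],
    "insurance": [(1, 3.0)],
}
doc_lengths = {1: 10 ** 0.5, 2: 2.0}
top = cosine_score({"car": 1.0, "insurance": 1.0}, postings, doc_lengths, k=2)
```

Only documents that appear in the postings of at least one query term receive an accumulator, which is what keeps this cheaper than comparing the query against every document vector.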

Variant tf–idf functions

SMART notation for tf–idf variants.
  • Pivoted normalized document length
    the relationship between document length and relevance
Pivoted normalized document length
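
A sketch of pivoted normalization in its commonly cited linear form, where the cosine normalization factor is rotated around a pivot point. The notes do not give the formula, so the function name, the form used here and the values of `pivot` and `slope` are assumptions for illustration.

```python
def pivoted_norm(cosine_norm, pivot, slope):
    """Linear pivoted normalization factor (assumed form).

    cosine_norm: the usual Euclidean length of the document vector
    pivot:       the length at which the factor is left unchanged
    slope:       a value below 1 that flattens the normalization line
    """
    return (1.0 - slope) * pivot + slope * cosine_norm


pivot, slope = 10.0, 0.75  # illustrative values, normally tuned on data

at_pivot = pivoted_norm(10.0, pivot, slope)   # unchanged at the pivot
long_doc = pivoted_norm(20.0, pivot, slope)   # smaller divisor than 20
short_doc = pivoted_norm(4.0, pivot, slope)   # larger divisor than 4
```

With a slope below 1, documents longer than the pivot are divided by a smaller factor than under plain cosine normalization and shorter ones by a larger factor, which shifts scores toward longer documents.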

linear model
machine learning techniques
