Notes on an Information Retrieval Talk

Preface

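I recently attended a talk on IR by Maarten, a well-known figure in the field. If I remember correctly, it was the same talk I heard at ESSIR last year, but I pick up something new every time, so I have written up my notes below.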

Query Improvement (online)

  1. Main goals: give users shortcuts and handle errors in queries
  2. Main method: log analysis (AOL dataset)
  3. Main approaches:
    • Query Auto-Completion (QAC): predict the intent users have in mind but have not yet clearly expressed
    • Query Suggestion: recommendation, ranking & diversity
    • Query Expansion
    • Query Correction
  4. The key is to combine query signals (clicks, time, news, personal, general, location, etc.) with the query logs
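
As a rough illustration of log-based QAC, the sketch below ranks completions for a prefix by how often each query appears in a log. The function names and the toy log are hypothetical and are not from the talk. A real system would also fold in the signals listed above (time, location, personal history).

```python
from collections import Counter

def build_completions(query_log):
    """Count how often each (normalized) query appears in the log."""
    return Counter(q.strip().lower() for q in query_log if q.strip())

def autocomplete(prefix, counts, k=3):
    """Return up to k logged queries starting with prefix, most frequent first."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically so ties are stable.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]

# Hypothetical toy log; a real one would come from recorded search sessions.
toy_log = [
    "information retrieval",
    "information retrieval",
    "information theory",
    "inverted index",
    "information retrieval course",
]
counts = build_completions(toy_log)
suggestions = autocomplete("info", counts)
```

Frequency alone is only a baseline. It ignores freshness and personalization, which is why the combination with other signals matters.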

Getting Content (offline)

  1. Common issues in crawling:
    • Scale
    • Content selection
    • URL filtering
    • Remove duplicate URLs: exact & near duplicates (compare sequences of words, e.g., word n-grams)
    • Spam detection: meaningful expressions, sentiment analysis & supervised learning
    • Aggregation: considering anchor text on the web & information among entities.
    • Inverted index construction: collect -> tokenize -> stopwords -> stem/lemma -> index
    • Temporal IR: info can be images, songs, books, news, webs, videos and apps
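
Two of the items above, near-duplicate detection with word n-grams and the inverted index pipeline (collect -> tokenize -> stopwords -> stem/lemma -> index), can be sketched as follows. This is a minimal illustration under my own assumptions: the stopword list is a tiny made-up sample, `naive_stem` is a crude placeholder for a real stemmer or lemmatizer, and Jaccard similarity is just one common choice for comparing n-gram sets.

```python
from collections import defaultdict

PUNCT = ".,;:!?()\"'"
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in"}  # illustrative sample only

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(PUNCT) for t in text.lower().split())
    return [t for t in tokens if t]

def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; values near 1.0 suggest near-duplicates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def naive_stem(token):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    """Map each stemmed, non-stopword term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token in STOPWORDS:
                continue
            index[naive_stem(token)].add(doc_id)
    return index

# Hypothetical two-document collection.
docs = {1: "The crawler fetches pages.", 2: "Indexing the pages"}
index = build_index(docs)
similarity = jaccard(shingles("a b c d", n=2), shingles("a b c e", n=2))
```

At web scale, exact set comparison is usually replaced by hashing-based approximations, but the idea of comparing word sequences is the same.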

Query Understanding (online)

  1. The results of query understanding can be presented on the search engine results page (SERP). Some context should be considered:
    • Search goals? search tasks?
    • Semantic topics?
    • Time-sensitive? location-sensitive?
  2. Classifying queries by pre-defined intent is difficult because queries are short & ambiguous; click-through data & session data are used as evidence.
  3. Intent Discovery (Non-predefined)
    • Shifting intents: intents change over time (Radinsky, 2013)
    • Learning to detect intent shifts (Lefortier, 2014)
      • Queries whose intent shifts from non-fresh to fresh
      • More clicks to some links?
  4. Diversity
    • Extrinsic: query with uncertainty
    • Intrinsic: diversity is part of info needs
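
One simple way to make the "more clicks to some links?" question concrete is to compare a query's click distribution over URLs in two time windows. The sketch below does this with total variation distance and a fixed threshold. Both are my own assumptions for illustration and are not the method of the cited papers, and the function names are hypothetical.

```python
from collections import Counter

def click_distribution(clicked_urls):
    """Turn a list of clicked URLs into a probability distribution."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()} if total else {}

def total_variation(p, q):
    """Total variation distance between two distributions (0 = same, 1 = disjoint)."""
    urls = set(p) | set(q)
    return 0.5 * sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in urls)

def intent_shifted(old_clicks, new_clicks, threshold=0.5):
    """Flag a possible intent shift when the click distributions differ a lot."""
    distance = total_variation(
        click_distribution(old_clicks), click_distribution(new_clicks)
    )
    return distance > threshold

# Hypothetical click logs for one query in two time windows.
old_window = ["a", "a", "a", "b"]
new_window = ["c", "c", "c", "a"]
shifted = intent_shifted(old_window, new_window)
```

The threshold of 0.5 is arbitrary. In practice it would need tuning, and a learned detector could use many more features than clicks alone.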

Ranker (learning to rank)

  1. content-based
  2. structure-based (title, content, tags, time)
  3. based on interaction behaviors (click through, scanning)
  4. documents are represented as feature vectors
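
To show how the feature-vector view fits together, here is a minimal sketch of a linear ranker: each document is a vector of content, structure and interaction features, and documents are sorted by a weighted sum. The feature names, values and weights are all hypothetical. In actual learning to rank the weights would be fit from labeled or click data rather than set by hand.

```python
def score(features, weights):
    """Linear score: dot product of feature vector and weight vector."""
    return sum(w * x for w, x in zip(weights, features))

def rank(docs, weights):
    """Return doc ids sorted by descending score."""
    return sorted(docs, key=lambda doc_id: score(docs[doc_id], weights), reverse=True)

# Hypothetical features per document: [content_match, title_match, click_rate]
docs = {
    "d1": [0.2, 1.0, 0.1],
    "d2": [0.8, 0.0, 0.3],
    "d3": [0.5, 0.5, 0.0],
}
weights = [1.0, 0.5, 2.0]  # hand-picked for illustration, not learned

ranking = rank(docs, weights)
```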

Responsible IR

Privacy, Fairness, Accuracy, Transparency (let the system explain why)
