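Preface
I recently attended a talk on IR by Maarten, a leading figure in the field. If I remember correctly, it was the same talk I heard at ESSIR last year, but I learn something new every time I hear it, so I am writing up my notes below.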
Query Improvement (online)
- Main goals: offer users shortcuts and handle errors in queries
- Main approach: log analysis (AOL dataset)
- Main techniques:
- Query Auto-Completion (QAC): surface the intent the user has in mind but has not yet clearly expressed
- Query Suggestion: recommendation, ranking & diversity
- Query Expansion
- Query Correction
- The key is to combine query signals (clicks, time, news, personal, general, location, etc.) with query logs
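
To make the log-analysis idea concrete, here is a minimal sketch of frequency-based query auto-completion. It only uses how often each query appears in a log; the other signals listed above (time, location, personalization) are left out. The names `build_completer` and `toy_log` are made up for illustration, and the log entries are invented, not taken from the AOL dataset.

```python
from collections import Counter


def build_completer(query_log):
    """Return a completion function backed by query frequencies from a log."""
    # Normalize and count every non-empty query in the log.
    counts = Counter(q.strip().lower() for q in query_log if q.strip())

    def complete(prefix, k=3):
        prefix = prefix.strip().lower()
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        # Most frequent first; ties are broken alphabetically for determinism.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:k]]

    return complete


# Hypothetical toy log (assumption: one past query per entry).
toy_log = [
    "weather today", "weather tomorrow", "weather today", "web search",
    "weather radar", "weather today", "weather tomorrow",
]
complete = build_completer(toy_log)
print(complete("wea"))  # ['weather today', 'weather tomorrow', 'weather radar']
```

A real system would likely replace the linear scan with a trie or sorted prefix index and blend frequency with fresher signals, but the ranking-by-past-popularity core is the same.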
Getting Content (offline)
- Common problems in crawling:
- Scale
- Content selection
- URL filtering
- Remove duplicate URLs: exact & near duplicates (compare word sequences, e.g. word n-grams)
- Spam detection: meaningful expressions, sentiment analysis & supervised learning
- Aggregation: considering anchor text on the web & information among entities.
- Inverted index construction: collect -> tokenize -> stopwords -> stem/lemma -> index
- Temporal IR: info can be images, songs, books, news, web pages, videos and apps
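
Two of the offline steps above are easy to sketch in a few lines: near-duplicate detection by comparing word n-grams, and the collect -> tokenize -> stopwords -> stem -> index pipeline. This is a rough illustration under my own assumptions, not what the talk showed: Jaccard similarity over word shingles is one common choice for the comparison, the stopword list is a tiny made-up one, and `naive_stem` is a crude stand-in for a real stemmer or lemmatizer.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, n=3):
    """Set of word n-grams (shingles) in the text."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def naive_stem(token):
    """Crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stemmed, non-stopword term to the sorted doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            if tok not in STOPWORDS:
                index[naive_stem(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "The crawler collects pages",
    2: "Crawlers collect the page",
    3: "Indexing pages of the web",
}
index = build_index(docs)
print(index["page"])  # [1, 2, 3]

sim = jaccard(shingles("the quick brown fox jumps", 2),
              shingles("the quick brown fox leaps", 2))
print(sim)  # 3 shared bigrams out of 5 distinct ones
```

At crawl scale the pairwise comparison is usually replaced by a sketching technique such as MinHash, and the index stores positions and term frequencies as well, but the data flow is the same.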
Query Understanding (online)
- The result of query understanding can be presented on the search engine results page (SERP). Some context should be considered:
- Search goals? search tasks?
- Semantic topics?
- Time-sensitive? location-sensitive?
- Classifying queries into pre-defined intents is difficult (queries are short & ambiguous): click-through data & session data.
- Intent Discovery (Non-predefined)
- Shifting intents: intents change with time (Radinsky, 2013)
- Learning to detect intent shifting (Lefortier, 2014)
- Queries whose intent shifts from non-fresh to fresh
- More clicks to some links?
- Diversity
- Extrinsic: query with uncertainty
- Intrinsic: diversity is part of info needs
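
The "more clicks to some links?" question suggests one simple way to look for an intent shift: compare where the clicks for a query went in two time windows. The sketch below is my own illustration and not the method from either cited paper. It measures the total variation distance between the two click distributions and flags a shift above a threshold; the function names and the `threshold=0.5` default are hypothetical.

```python
from collections import Counter


def click_distribution(clicked_urls):
    """Turn a list of clicked URLs into a probability distribution."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()} if total else {}


def total_variation(p, q):
    """Total variation distance between two distributions (0 = same, 1 = disjoint)."""
    urls = set(p) | set(q)
    return 0.5 * sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in urls)


def intent_shifted(old_clicks, new_clicks, threshold=0.5):
    """Flag a shift when the click distributions of two windows differ a lot."""
    distance = total_variation(click_distribution(old_clicks),
                               click_distribution(new_clicks))
    return distance > threshold


# Hypothetical clicks for one query in an older and a newer time window.
old_window = ["wiki"] * 8 + ["news"] * 2
new_window = ["wiki"] * 2 + ["news"] * 8
print(intent_shifted(old_window, new_window))  # True
print(intent_shifted(old_window, old_window))  # False
```

A learned detector would use many more features than click shares, but this shows the kind of signal such a model could start from.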
Ranker (learning to rank)
- content-based
- structure-based (title, content, tags, time)
- interaction-based (click-through, scanning)
- docs represented by feature vector
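
As a rough illustration of "docs represented by feature vector", the sketch below scores each document with a linear model over three features, one per family above (content match, title match, click rate), and learns the weights from preference pairs with a simple pairwise perceptron update. The feature values, the pairs, and the choice of a perceptron are all my assumptions for the example; the talk did not prescribe a specific learner.

```python
def score(weights, features):
    """Linear score: dot product of weights and a document's feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, n_features, epochs=10, lr=0.1):
    """Pairwise perceptron: nudge weights whenever a preferred doc is not scored higher."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) <= score(weights, worse):
                weights = [w + lr * (b - c)
                           for w, b, c in zip(weights, better, worse)]
    return weights


def rank(docs, weights):
    """Return doc ids sorted by descending score."""
    ordered = sorted(docs.items(),
                     key=lambda item: score(weights, item[1]),
                     reverse=True)
    return [doc_id for doc_id, _ in ordered]


# Hypothetical feature vectors: [content_match, title_match, click_rate].
docs = {
    "d1": [0.9, 1.0, 0.5],
    "d2": [0.2, 0.0, 0.1],
    "d3": [0.6, 0.0, 0.7],
}
# Hypothetical preference pairs (preferred doc first), e.g. derived from clicks.
pairs = [(docs["d1"], docs["d2"]), (docs["d3"], docs["d2"])]

weights = train_pairwise(pairs, n_features=3)
print(rank(docs, weights))
```

Pairwise training fits the click-through setting because a click usually says "this result was better than the ones skipped" rather than giving an absolute relevance grade.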
Responsible IR
Privacy, Fairness, Accuracy, Transparency (let the system explain why)