chapter 1 boolean search

布尔检索是数据库检索最基本的方法，是用逻辑“或”（＋、OR）、逻辑"与"(×、AND)、逻辑"非"(－、NOT)等算符在数据库中对相关文献的定性选择的方法。

单词-文档矩阵，
索引结构是倒排索引 unit-posting,基于关键词的查找

查找
合并merge：按照Doc Frequency排序，从最小的数组开始合并

chapter 2 vocabulary list and posting

更多是关于web search，不是精确检索。在search前需要mapping。

1.文档处理document delineation and char seguence decoding

index granularity索引粒度：确定document unit
（a message / a file /a message and its contained attachment / a book or a chapter or a sentence）
PS: 系统需要提供可选的索引granularity

2. vocabulary

tokenization( segment，eliminate punctuation etc.)：hyphen、space
match：返回内容充分满足expectations，windows->Windows、WINDOWS 各种变换的形式
common words(stop words,pressure for time and space)：
big list → small list → statistics
Web search engines generally do not use stop word lists.More precise with stop words
Normalization (helps in queries, not in IR performance)
standard way：create equivalence classes
1). mapping【badcase U.S.A match USA，but C.A.T. not match cat】
2). maintain relations
a. index unnormalized tokens + a query expansion list
b. index construction, index 含有token同义词的document

3). the forms of normalization
accents /diacritics
capitalization/case-folding：reduce letters to lower case
idiosyncratic issue in English：color vs colour
Stemming and lemmatization：reduce inflectional forms
stemming remove derivational affixes
lemma 还原词根

3. Faster postings list intersection

skip list pointer，而不是原来的逐一遍历
【Where do we place skips?】 There is a tradeoff.
heuristic : ` √P evenly-spaced skip pointers.

4. Positional postings and phrase queries

▼ Biword indexes→phrase index
single-word term index also needed
【cons】
large vocabulary→partial phrase index
▲ Positional postings 位置信息索引
【cons】
expands posting size，depending on token
▲+▼ combined scheme【biword index + position index】
index what biword?
- query behavior：index frequently queried phrase
- individual word high frequency, phrase comparatively low

chapter 3 Dictionaries and tolerant retrieval(容错检索)

Search structures ：Dictionary
Solution ：

hashing
search tree【demand prescribed ordering】
Binary tree
B-tree

binary tree

B-Tree

Wildcard queries 通配符检索

找索引单元/term/单词，字符串查找

query：mon*
solution：regular B-Tree
process: walk down the tree following the symbol m, o, n.

query：*mon
solution：reversed B-Tree
process: walk down the tree following the symbol m, o, n.
e.g. lemon root->n->o->m->e->l

query：se*mon
solution2：permuterm indexes
solution1：regular B-Tree + reversed B-Tree
process:
search regular B-Tree, prefix se-, set A
search reversed B-Tree, suffix -mon, set B
intersect A ∩ B

General wildcard queries

permuterm index
也就是索引单元包含了首、尾的信息
cons：the dictionary becomes quite large, lead to blowup

image.png

e.g. m*n → n$m → man、moron
e.g. fi*mo*er → 首 er 尾fi → candidate terms →filter by exhaustive enumeration，检查中间是否包含"mo"
找完单词以后，进行document retrive
k-gram indexes for wildcard queries
index k characters, with ￥ denoting beginning and end.
相比于permuterm index，数量相对少一些
e.g. k=3,castle →￥ca, cas, ast, stl, tle, le$
search
e.g. re*ve → ￥re and ve￥
filtering demanded string-matching operation
e.g. red* → ed￥→ retired【invalid，因为query左侧没有其他符号】

Spelling correction

1. implementation of spelling correction

2. 2 steps to solving the problem

edit distance
k-gram overlap

implementation

选择正确答案的原则
选择最近的，proximity；
选择最常见的（出现最多；使用最多）
forms
isolated-term correction
context-sensitive correction.

编辑距离的计算【动态规划】
可以用来计算相似度
编辑距离其实是编辑操作的最小数目
edit operation：insert, replace, delete => 带权重的edit operation

3.k-gram索引 for spelling correction
【isolated spelling correction】
【Jaccard coefficient】
set A query里面的k-gram集合
set B term里面的k-gram集合
Jaccard coefficient = |A ∩ B| / |A ∪ B|
流程：
for t in term-posting list
Jaccard( t ) > 阈值
选择t作为候选

t 的 k-gram如何获得？
Jaccard = num(A&B) / num(A+B)-num(A+B)
num(A+B)是 posting hits
方法：
step1、经验方法，直接选出candidate
step2、计算编辑距离，选出具有最小edit distance的t

【context sensitive spelling correction】
query = w1 + w2 +.....wn
列举出每一个词的candidate
trim，bi-words,previous queries by users

IR Chapter1-3

IR Chapter1-3

chapter 1 boolean search

chapter 2 vocabulary list and posting

chapter 3 Dictionaries and tolerant retrieval(容错检索)

Search structures ：Dictionary

Solution ：

Wildcard queries 通配符检索

找索引单元/term/单词，字符串查找

General wildcard queries

Spelling correction

1. implementation of spelling correction

2. 2 steps to solving the problem

相关阅读更多精彩内容

友情链接更多精彩内容