Index & B+ tree

Index

An index is a data structure that speeds up selections on the search key field(s)

Search key:any subset of the fields of a relation
the search key is not the same as the key(minimal set of fields that uniquely identify a record in a relation).

Entries in an index: (k, r)
k: search key
r: records or record ids

Index分类:

  • Clustered index
    • records sorted & stored in the order of search key
    • row data is stored with the key, so that's why the entry is (key, record) in index
    • For MySQL database, it automatically creates a clustered index for
      1. primary key if exists;
      2. if no pk, they create a clustered index for 1st unique key;
      3. if no pk and unique key, they use row ID (this is a hidden attribute for MySQL)
  • unclustered index:
    • records are not sorted in key order

search key 分类:

  • unique search key
  • composite search key: a list of fields

for different queries, the index can be useful or useless

index performance.png

B+ tree

B+ tree is a data structure that helps search data, it is often used in DBMS or file system

  • Search Tree
  • in the disk, often makes 1 node = 1 block
  • there is a parameter called d in the B+ tree

结构:

root, internal node, leaf node

  • root has [1,2d] keys
  • each internal node and leaf node has [d,2d] keys
  • each leaf node connects to its next sibling by the end pointer, so leaves are like a linked list. which supports range query efficiently
待补充.png
待补充.png

B+树结构在硬盘上的存储:

assumption:
key size = 4 bytes
pointer size = 8 size
Block size = 4 KB

according to the rules, each node has the number of keys between d and 2d
then if one node stores maximum keys:
42d + 8(2d+1) <= 4096
d = 170

In practice, d is often 100
Typically, the fill-factor(average proportion storage usage in a node) is 2/3, so the average number of keys in node is 200*2/3 = 133

Keep in mind that an index is composed of a key and its corresponding records/ record ids.
When we are ready for the keys' storage, let's look back to the records:

  • for a height 1 tree(tree only with a single root), it can store 133 records ;
  • for a height 2 tree: it can store133^2= 17689 records (134*133 to be exact)
  • for a height 3 tree: it can store133^3= 2352637 records (1342*133)
  • for a height 4 tree: it can store 133^4= 312900721 records (1343* 133 that's the reason why the B+ tree is not so deep)

When storing these records in a buffer pool, we can see:

  • Level 1, it only needs 1 page = 4KB cause there is only one node
  • Level 2, it needs 133 pages = 532KB
  • Level 3 it need 17,689 pages = 70, 756KB ~ 70MB

B+ tree上的检索

  • Equality search
    1.start as the root
    2.proceed down to the leaf

  • Range query [-,a], [a,b], [b, -]
    [-,a]
    1.find the left-most leaf
    2.sequential traverse until ends

    [a,b]
    1.find the first leaf in the range
    2.sequential traverse until ends

    [b,-]
    1.find the leaf with b
    2.sequential traverse until ends

B+ tree上index的插入

In theory, only inserting an index enough
risk: overflow => disobey the [d,2d] rule

3 cases:
a.leaf node has free space to keep key
b.leaf node overflows when inserting the key but its parent has free space
c. leaf node overflows when inserting the key and its parent is also full of storage

case1:
find leaf where k belongs to, then insert

case2:
find leaf where k belongs to, then insert (logically)
now this leaf node is overflow, we should split the node and insert the middle into its parent node

case 3:
just like case 2, but do it recursively

Difference between splitting a leaf node and an internal node:


Before splitting for a leaf node.png

After split, 5 keys and 7 pointers


After splitting for a leaf node.png

Before splitting an internal node, there are 5 keys and 6 pointers;
After splitting, there are still 5 keys and 6 pointers


spliiting an internal node.png

Insertion cascaded:
splitting a leaf node may lead to splitting of its parent and ancestors.

B+ tree 上index的删除

Deletion is the opposite operation of insertion
risk: underflow
strategy: merging

-If a node is below the min capacity after deletion, which is producing underflow, try the following in the given order

  1. move a key from an immediate left sibling;
  2. move a key from an immediate right sibling;
  3. merge with an immediate left sibling;
  4. merge with an immediate right sibling
©著作权归作者所有,转载或内容合作请联系作者
【社区内容提示】社区部分内容疑似由AI辅助生成,浏览时请结合常识与多方信息审慎甄别。
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

相关阅读更多精彩内容

友情链接更多精彩内容