恍然总结（20201202补充）

检索某个记录时，查找类型大概分为线性和B树，快慢的话，数据量大时B树比线性快，如果查找类型相同，查找一遍的比多遍快，比如innodb聚集索引比非聚集索引快的原因就是，虽然都采用B树查找，但是非聚集的只需查询一遍，而非聚集的查多遍。

innodb聚簇索引的索引以及记录数据（注意是两样），存储在同一个名为表.opt的文件（逻辑上），而这个文件则是多个数据页（既有存储索引的页，也有存储记录的页）组成，这些数据页物理存储顺序是以一整棵B+树来维护的，所以当通过该聚簇索引（一般是id列）检索某个记录时，只需要查一遍这个B+树就可以直接触达目标索引对应的记录了。

一个文件（逻辑）的所有数据页由一个B+树维护，而非聚集索引，索引数据和记录数据不在同一个文件上，即非聚集索引数据和记录数据的文件，以数据页为节点分别维护在两颗不同B+树上，所以通过非聚集索引检索某记录时，要先在非聚集索引B+树上找到目标的记录值（即聚集索引，一般为id），然后再去记录数据B+树（也就是聚集索引树）上找到该id对应的的记录（在这里，所有列的都可拿到）。

SQL

There are two types of Indexes in SQL Server:

Clustered Index
Non-Clustered Index

Clustered Index

A clustered index defines the order in which data is physically stored in a table. Table data can be sorted in only way, therefore, there can be only one clustered index per table. In SQL Server, the primary key constraint automatically creates a clustered index on that particular column.

The process of creating clustered index is similar to a normal index with one exception. With clustered index, you have to use the keyword “CLUSTERED” before “INDEX”.

The above script creates a clustered index named “IX_tblStudent_Gender_Score” on the student table. This index is created on the “gender” and “total_score” columns. An index that is created on more than one column is called “composite index”.

The above index first sorts all the records in the ascending order of the gender. If gender is same for two or more records, the records are sorted in the descending order of the values in their “total_score” column. You can create a clustered index on a single column as well. Now if you select all the records from the student table, they will be retrieved in the following order:

image.png

简而概之就是聚集索引维护的是每行数据的物理存储顺序，说到物理存储顺序就是以什么顺序存储在磁盘中的，那么存储方式肯定也就只有一种了，所以一个表只能有一个聚集索引。
至于以哪列或者哪几列的record来排序，则是看如何定义这个聚集索引了，创建每张表时，默认情况下自动会维护一个id列作索引，当插入record时，以该列的值升序来插入每个record。
当然也可以自定义聚簇索引，即指定某列或某几列的作为索引，当插入record时,通过被指定的列进行排序存储在物理磁盘中，当被指定列的值重合度较高时，或频繁修改时，这样会不会随着插入或者删除经常会把record掏出来放回去，改来改去，效率应该不高。

Non-Clustered Indexes

A non-clustered index doesn’t sort the physical data inside the table. In fact, a non-clustered index is stored at one place and table data is stored in another place. This is similar to a textbook where the book content is located in one place and the index is located in another. This allows for more than one non-clustered index per table.

It is important to mention here that inside the table the data will be sorted by a clustered index. However, inside the non-clustered index data is stored in the specified order. The index contains column values on which the index is created and the address of the record that the column value belongs to.

When a query is issued against a column on which the index is created, the database will first go to the index and look for the address of the corresponding row in the table. It will then go to that row address and fetch other column values. It is due to this additional step that non-clustered indexes are slower than clustered indexes.

image.png

Conclusion

From the discussion we find following differences between clustered and non-clustered indexes.

There can be only one clustered index per table. However, you can create multiple non-clustered indexes on a single table.
Clustered indexes only sort tables. Therefore, they do not consume extra storage. Non-clustered indexes are stored in a separate place from the actual table claiming more storage space.
Clustered indexes are faster than non-clustered indexes since they don’t involve any extra lookup step.

image.png

参考：
https://www.sqlshack.com/what-is-the-difference-between-clustered-and-non-clustered-indexes-in-sql-server/

Mysql

Clustered Index

A clustered index is collocated with the data in the same table space or same disk file. You can consider that a clustered index is a B-Tree index whose leaf nodes are the actual data blocks on disk, since the index & data reside together. This kind of index physically organizes the data on disk as per the logical order of the index key.

Structure of Primary key (clustered) Index:
An index is usually maintained as a B+ Tree on disk & in-memory, and any index is stored in blocks on disk. These blocks are called index blocks. The entries in the index block are always sorted on the index/search key. The leaf index block of the index contains a row locator. For the primary index, the row locator refers to virtual address of the corresponding physical location of the data blocks on disk where rows reside being sorted as per the index key.

In the following diagram, the left side rectangles represent leaf level index blocks, and the right side rectangles represent the data blocks. Logically the data blocks look to be aligned in a sorted order, but as already described earlier, the actual physical locations may be scattered here & there.

image.png

Secondary Index

Any index other than a clustered index is called a secondary index. Secondary indices does not impact physical storage locations unlike primary indices.

Structure of Secondary Index:
In the diagram below, the red coloured rectangles represent secondary index blocks. Secondary index is also maintained in the B+ Tree and it’s sorted as per the key on which the index was created. The leaf nodes contain a copy of the key of the corresponding data in the primary index.

Why do we use composite indices?

Why not define multiple secondary indices on the columns we are interested in?
MySQL uses only one index per table per query except for UNION. (In a UNION, each logical query is run separately, and the results are merged.) So defining multiple indices on multiple columns does not guarantee those indices will be used even if they are part of the query.

MySQL maintains something called index statistics which helps MySQL infer what the data looks like in the system. Index statistics is a generilization though, but based on this meta data, MySQL decides which index is appropriate for the current query.

How does composite index work?

The columns used in composite indices are concatenated together, and those concatenated keys are stored in sorted order using a B+ Tree. When you perform a search, concatenation of your search keys is matched against those of the composite index. Then if there is any mismatch between the ordering of your search keys & ordering of the composite index columns, the index can’t be used.

Covering Index:

A covering index is a special kind of composite index where all the columns specified in the query somewhere exist in the index. So the query optimizer does not need to hit the database to get the data — rather it gets the result from the index itself. Example: we have already defined a composite index on (pan_no, name, age) , so now consider the following query:

覆盖索引就是比如某表有 a、b、c列，且创建了a,b,c三列的复合索引com_index，
当通过该复合索引select 某个目标列的record时，这个目标列恰好在是b、c中的某列，
比如select c from table where a=1 and b="a"; 那么刚好c的record就在这个复合索引树中有备份，而无需再去聚集索引树中找c的值了，这种情况就叫做覆盖索引，大概是联合索引覆盖了目标列的情况把，总结这应该算是一种使用联合索引的方法。

SELECT age FROM index_demo WHERE pan_no = 'HJKXS9086W' AND name = 'kousik'

The columns mentioned in the SELECT & WHERE clauses are part of the composite index. So in this case, we can actually get the value of the age column from the composite index itself. Let’s see what the EXPLAIN command shows for this query:

EXPLAIN FORMAT=JSON SELECT age FROM index_demo WHERE pan_no = 'HJKXS9086W' AND name = '111kousik1';

image.png

In the above response, note that there is a key — using_index which is set to true which signifies that the covering index has been used to answer the query.

I don’t know how much covering indices are appreciated in production environments, but apparently it seems to be a good optimization in case the query fits the bill.

Partial Index:

当想把某列设为索引时，这个列的长度确实不固定或者固定但是很长，我们可以考虑取这列record值的前几个字节作为索引，这样索引树的key就不会，由于太长，导致每个节点上能存的key很少，最后把树高弄得很高，查询时增加磁盘访问次数了。
We already know that Indices speed up our queries at the cost of space. The more indices you have, the more the storage requirement. We have already created an index called secondary_idx_1 on the column name. The column name can contain large values of any length. Also in the index, the row locators’ or row pointers’ metadata have their own size. So overall, an index can have a high storage & memory load.

In MySQL, it’s possible to create an index on the first few bytes of data as well. Example: the following command creates an index on the first 4 bytes of name. Though this method reduces memory overhead by a certain amount, the index can’t eliminate many rows, since in this example the first 4 bytes may be common across many names. Usually this kind of prefix indexing is supported on CHAR ,VARCHAR, BINARY, VARBINARY type of columns.

CREATE INDEX secondary_index_1 ON index_demo (name(4));

参考：
https://www.freecodecamp.org/news/database-indexing-at-a-glance-bb50809d48bd/

Q&A

疑问：两个参考中（前者是Sql 后者是Mysql）都提到了聚簇索引叶子结点会存储record，第一个参考说record值直接存贮在叶子结点，但是第二个参考却说也存record值，不过是以record对应的逻辑地址（指向物理磁盘地址）形式存在的。

那么对于非聚簇索引上存储的是record的指针（和逻辑地址有区别吗）。
那为何非聚簇索引会更慢呢。
初步推测第二个参考中说的有问题。后来发现第二个说的没有问题至少可以自圆其说
其圆的方式是辅助索引key的值存的是聚簇索引的key，然后拿着这个key 再去找对应record 所以确实查一次辅助索引需要和磁盘打交道两回
那么问题来了为何辅助索引key对应值为何不直接写record 的逻辑地址，这样岂不是可以赶上聚簇索引？还是有何限制逻辑值放到辅助索引中，是因为落盘的那一瞬间只能确定主键id，还没法确定逻辑地址吗。

那为何SQL中却说其非聚簇索引 key对应值可以存储record的指针，他是怎么做到的。

解决这些疑问可能需要补充如下知识：

数据如何在磁盘中存储的（直接一个page上放一堆（就是物理上的一堆了），还是一个page只是逻辑上的，真实的数据虽然属于一个page，但是却分布在不同的物理地址？那么磁盘转一次所对应的一份数据属于一page 还是一个磁盘片区？需要详细了解）
逻辑地址和物理地址关系
SQL 中非聚簇 record的指针的实际含义是？

2020-11-18 索引

2020-11-18 索引

恍然总结（20201202补充）

SQL

Clustered Index

Non-Clustered Indexes

Conclusion

Mysql

Clustered Index

Secondary Index

Why do we use composite indices?

How does composite index work?

Covering Index:

Partial Index:

推荐阅读更多精彩内容