Paper "Optimizing Queries Using Materialized Views: A Practical, Scalable Solution"
Background
物化视图匹配(view-matching)算法在以下几个方面有所不同:
- 考虑的查询类型(例如本篇的 SPJG)
- 支持的物化视图类型(例如本篇的 SPJG)
- 考虑的替换表达式类型
- 考虑 bag semantics 还是 set semantics
- 是否支持多视图改写
Calcite 中的优化器 RBO 和 CBO 都归属于 transformation-based optimizers。物化视图匹配算法是 SQL 优化问题的一个研究方向,甚至用户的 SQL 查询执行结果都可以缓存下来,用作临时的物化视图。
Prerequisites
在 MS SQL Server 2000 中,materialized views 又叫做 indexed views,因为:
- 物化视图的创建是通过在已有的 view 上创建一个 clustered index
- 这个创建方法就要求物化视图必须包含 unique key,利用这个特性可以保证物化视图能增量更新
- 可以在物化视图上创建 secondary indexes
MS SQL Server 2000 创建物化视图的的要求:
- Single-level SQL statement containing selections, inner joins, and an optional final group-by.
- The FROM clause cannot contain derived tables.
- 如果物化视图的最后一个算子是 Aggregation,那么它需要满足:
- The aggregation view must include all grouping columns as output columns (they are the key) and a count column
- Aggregation functions are limited to sum and count
- Output columns defined by arithmetic or other expressions must be assigned names (using the AS clause)
这也是这篇论文的 SPJG 要求,例子:
# 创建视图
CREATE VIEW v1 WITH schemabinding AS
SELECT p_partkey,
p_name,
p_retailprice,
count_big(*) AS cnt,
SUM(l_extendedprice * l_quantity) AS gross_revenue
FROM dbo.lineitem,
dbo.part
WHERE p_partkey < 1000
AND p_name LIKE ‘%steel%’
AND p_partkey = l_partkey
GROUP BY p_partkey,
p_name,
p_retailprice
# 创建 cluster index,也即物化视图
CREATE UNIQUE clustered INDEX v1_cidx ON v1(p_partkey)
# 再建一个二级索引
CREATE INDEX v1_sidx ON v1( gross_revenue, p_name)
Problem Domain
View Matching with Single-View Substitutes
Given a relational expression in SPJG form, find all materialized views (SPJG) from which the expression can be computed and, for each view found, construct a substitute expression equivalent to the given expression.
Computing a query expression from a view
For a SPJ query to be computable from a view, the view must satisfy the following requirements:
- The view contains all rows needed by the query expression.
- All required rows can be selected from the view.
- Required rows can be extracted.
- All output expressions can be computed from the output of the view.
- All output rows occur with the correct duplication factor.
Column Equivalence Classes
假设 是一个 SPJ 查询的选择谓词(Selection Predicate)的合取范式,把列相等谓词(形如 ,其中 和 是表名, 和 是列引用)放前面,我们能得到:
[图片上传中...(image.png-dab5fd-1708741900703-0)]
基于 可以算出来 column equivalence classes:
- 一个 column equivalence class 指的是相等列的集合
Classes with single column are called trivial classes.
Do all required rows exist in the view?
SPJ views and queries reference the same tables
假设 query 和 view 都使用了 个表,分别为 ;query 的谓词为 ,view 的谓词为 ,问题 "Do all required rows exist in the view?" 就相当于问:
SELECT *
FROM T1, T2, ..., Tm where Wq
是不是
SELECT *
FROM T1, T2, ..., Tm where Wv
的子集,用关系代数来表示:
其中 表示逻辑蕴含,上面这个式子可再次转换为:
表示 query 的 column equality predicates, 表示 query 的 range predicates, 是剩余的 predicates。Range predicates 是形如 的形式,其中 是常量, 是 之一。
公式 3 还可以再次转化为:
根据公式:
公式 4 可以有一种更强的表达形式(但是会有遗漏):
公式 6、7 和 8 必须同时满足。
Equijoin subsumption test
公式 6 叫 equijoin subsumption test 是因为绝大部分的列相等谓词都来自于等值 JOIN(equijoin)。 里面还会包含其他类型的列相等谓词,他们甚至可以来自同一个表。Equijoin subsumption test 可以通过 column equivalence classes 解决:view 中相等的列在 query 中也必须相等(but not vice versa);在满足上述要求的条件下可以算出 compensating column equality constraints,如果 view 的 equivalence classes 在 query 中归属于同一个 class,那么需要构建 compensating predicates: for 。
Range subsumption test
公式 7 原论文假设在不包含 OR 的情况下。
给每个 column equivalence class 一个上界和下界,并且把 和 都转化为 和 :
-
会被转化成
- 是比 小的最大值
- 会被转化成
对于每一个 column equivalence class,只要 view 的 range 范围比 query 的大(宽松),那么 view 中的结果就会包含 query 的。同样在这个阶段,也可以算出 compensating range predicates:如果 query 的 predicate 范围恰好匹配对应的 view 范围,那么不需要补偿。如果 query 的下界和 view 的不同,那么需要构建 compensating predicates ,其中 是 query 中对应查询 range 的下界;如果 query 的上界和 view 不同,构建 compensating predicates 的方法类似。
Residual subsumption test
原论文用了一个比较浅显的匹配算法(shallow matching algorithm):
把一个 SQL expression 转换成字符串表示和列引用数组,其中字符串表示会把列引用忽略,列引用数组包含 expression 中的每个列引用,按照它们在表达式的文本版本中出现的顺序排列。
- 首先比较字符串是否完全相同
- 列引用是否完全相同(考虑 column equivalence classes)
Example
# 物化视图
CREATE VIEW V2 WITH schemabinding AS
SELECT l_orderkey, o_custkey, l_partkey,
l_shipdate, o_orderdate,
l_quantity * l_extendedprice as gross_revenue
FROM dbo.lineitem, dbo.orders, dbo.part
WHERE l_orderkey = o_orderkey
AND l_partkey = p_partkey
AND p_partkey >= 150
AND o_custkey >= 50 AND o_custkey <= 500
AND p_name LIKE '%abc%'
# 查询
SELECT l_orderkey, o_custkey, l_partkey,
l_quantity * l_extendedprice
FROM lineitem, orders, part
WHERE l_orderkey = o_orderkey
AND l_partkey = p_partkey
AND l_partkey >= 150 AND l_partkey <= 160
AND o_custkey = 123
AND o_orderdate = l_shipdate
AND p_name LIKE '%abc%'
AND l_quantity*l_extendedprice > 100
Step1: Compute equivalence classes for the query and the view.
View equivalence classes: {l_orderkey, o_orderkey}, {l_partkey, p_partkey}, {o_orderdate}, {l_shipdate}, etc.
Query equivalence classes: {l_orderkey, o_orderkey}, {l_partkey, p_partkey}, {o_orderdate, l_shipdate}, etc.
Step2: Check that every view equivalence class is a subset of a query equivalence class. If not, reject the view.
前两个 non-trivial view equivalence classes 和前两个 non-trivial query equivalence classes 一样,o_orderdate 和 l_shipdate 在 query 中属于同一个 equivalence class,因此构建 compensating predicate:(o_orderdate=l_shipdate)
Step3: Compute range intervals for the query and the view.
View ranges: {l_partkey, p_partkey} , {o_custkey}
Query ranges: {l_partkey, p_partkey} , {o_custkey}
Step4: Check that every view range contains the corresponding query range. If not, reject the view.
View 中 {l_partkey, p_partkey} 和 {o_custkey} 的范围都大于 query 中的,并且可以构建出 compensating predicates:({l_partkey, p_partkey} <= 160), ({o_custkey} >=123) 和 ({o_custkey} <= 123)
Step5: Check that every conjunct in the residual predicate of the view matches a conjunct in the residual predicate of the query. If not, reject the view.
View residual predicate: p_name like ‘%abc%’
Query residual predicate: p_name like ‘%abc%’, l_quantity*l_extendedprice > 100
Query 的 residual predicate 比 view 多一个,构建 compensating predicate:l_quantityl_extendedprice > 100*
Can the required rows be selected?
Compensating predicate 使用的列必须在 view 的输出列中。
Can output expressions be computed?
Query 使用的列必须在 view 的输出列中。
Do rows occur with correct duplication factor?
当 view 和 query 使用了相同的表,那么 duplication factor 是满足的;如果 view 使用了更多的表,见下。
Views with extra tables
Single Extra Table
假设 SPJ 查询使用了 个表 ,view 多使用了一个表,view 使用的 个表为 .
关键点是 cardinality-preserving join,也叫做 table extension join
- 表 和 间的 join 是 cardinality preserving 的,如果 中的每一行恰好只和 中的一行 join 上
对于 inner join,满足以下条件就满足了 cardinality-preseving 要求:
- equijoin on all columns in a non-null foreign key
Left join 也是可以的,类似的证明方法。
Multiple Extra Tables
假设 SPJ 查询使用了 个表 ,view 多使用了 个表:.
构建一个 foreign-key join graph,这个图中的节点就是上面 个表, 和 之间添加一条边,如果 和 之间的 join 满足上面说的要求。
递归的删除构建图的节点,这个节点没有出边,并且只有一条入边。如果 都被消除掉了,那么 query 可以基于这个 view 计算。
Example
# 物化视图
CREATE VIEW v3 WITH schemabinding AS
SELECT c_custkey, c_name, l_orderkey,
l_partkey, l_quantity
FROM dbo.lineitem, dbo.orders, dbo.customer
WHERE l_orderkey = o_orderkey
AND o_custkey = c_custkey
AND o_orderkey >= 500
# 查询
SELECT l_orderkey, l_partkey, l_quantity
FROM lineitem
WHERE l_orderkey BETWEEN 1000 AND 1500
AND l_shipdate = l_commitdate
View 的 foreign-key join graph 是:
先删 customer,再删 orders。
SPJG view/queries
- View 的 SPJ 部分包含 Query 的 SPJ 部分的所有 rows,并且有相同的 duplication factor
- Compensating predicates 使用的所有列都在 view 的 output 中
- View 没有 aggregation 操作或者比 query 聚合度低
- Query 的所有 grouping 列都在 view output 中
- Query output 用的所有列都在 view output 中
例子:
# 物化视图
CREATE VIEW v4 WITH schemabinding AS
select o_custkey,
count_big(*) AS cnt,
SUM(l_quantity * l_extendedprice) AS revenue
FROM dbo.lineitem, dbo.orders
WHERE l_orderkey = o_orderkey
GROUP BY o_custkey
# 查询
SELECT c_nationkey,
SUM(l_quantity * l_extendedprice)
FROM lineitem, orders, customer
WHERE l_orderkey = o_orderkey
And o_custkey = c_custkey
GROUP BY c_nationkey
# pre-aggregation 改写
SELECT c_nationkey,
SUM(rev)
FROM customer,
(SELECT o_custkey,
SUM(l_quantity * l_extendedprice) AS rev
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
GROUP BY o_custkey) AS iq
WHERE c_custkey = o_custkey
GROUP BY c_nationkey
# 物化改写
SELECT c_nationkey,
SUM(revenue)
FROM customer, v4
WHERE c_custkey = o_custkey
GROUP BY c_nationkey
Fast Filtering
In-memory index called filter tree: a multiway search tree where all the leaves are on the same level. The node in the tree contains a collections of (key, pointer) pairs. A key consists of a set of values, not just a single value. A filter tree subdivides the set of views into smaller and smaller partitions at each level of the tree.
Lattice Index
问题:For a set of sets : , given a set , how to quickly find the sets in that are superset/subset of .
解决方案:集合之间子集关系声明了一种偏序关系,可以通过类似网格(lattice)的形状展示。例如,集合 :
Partitioning conditions
Source table conditions: A query cannot be computed from a view unless the view’s set of source tables is a superset of the query’s set of source tables.
Hub condition: A query cannot be computed from a view unless the hub of the view is a subset of the query’s set of source tables.
Output column condition: A view cannot provide all required output columns unless, for each equivalence class in the query’s output list, at least one of its columns is available in the view’s extended output list.
Grouping column condition: An aggregation query cannot be computed from an aggregation view unless, for each equivalence class in the query’s grouping list, at least one of its columns is present in the view’s extended grouping list.
Range constraint condition: A query cannot be computed from a view unless, for each equivalence class in the view’s range constraint list, at least one of its columns is present in the query’s extended range constraint list.
Weak range constraint condition: A query cannot be computed from a view unless the view’s reduced range constraint list is a subset of the query’s extended range constraint list.
Residual predicate condition: A query cannot be computed from a view unless the view’s residual predicate list is a subset of the query’s residual predicate list.
Output expression condition: A query cannot be computed from a view unless its (textual) output expression list is a subset of the view’s (textual) output expression list.
Grouping expression condition: An aggregation query cannot be computed from an aggregation view unless its (textual) grouping expression list is a subset of the view’s (textual) grouping expression list.