由于数据量本身通常过于浩大,我们需要一部分数据来加速数据分析过程。我们就需要运用抽样技术来发现整个数据集的模式和趋势。HQL中有三种抽样技术:随机取样(random sampling)、表取样(bucket table sampling)和块取样(block sampling)。
- 随机取样(random sampling)
随机取样使用rand()函数和LIMIT关键字来获取样例数据,请看以下示例
> SELECT name FROM employee_hr
> DISTRIBUTE BY rand() SORT BY rand() LIMIT 2;
+--------+
| name |
+--------+
| Will |
| Steven |
+--------+
2 rows selected (52.399 seconds)
以上语句中DISTRIBUTE和SORT关键字用于确保数据有效地在mapper和reducer中随机分布,也可以使用ORDER BY rand()来代替,但性能很差。
- 表抽样(bucket table sampling)
这种抽样方式专门针对bucket table进行了优化。如下例所示,
-- Sampling based on the whole row
> SELECT name FROM employee_trans
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;
+--------+
| name |
+--------+
| Steven |
+--------+
1 row selected (0.129 seconds)
-- Sampling based on the bucket column, which is efficient
> SELECT name FROM employee_trans
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON emp_id) a;
+---------+
| name |
+---------+
| Lucy |
| Steven |
| Michael |
+---------+
3 rows selected (0.136 seconds)
如果抽样列同时是CLUSTERED BY列,抽样效率更高。
- 块抽样(Block sampling)
块抽样随机提取n行数据记录,百分之n大小的数据记录或者n个字节的数据记录。这个抽样的粒度为HDFS块级别。请看以下示例,
-- Sample by number of rows
> SELECT name
> FROM employee TABLESAMPLE(1 ROWS) a;
+----------+
| name |
+----------+
| Michael |
+----------+
1 rows selected (0.075 seconds)
-- Sample by percentage of data size
> SELECT name
> FROM employee TABLESAMPLE(50 PERCENT) a;
+----------+
| name |
+----------+
| Michael |
| Will |
+----------+
2 rows selected (0.041 seconds)
-- Sample by data size
-- Support b/B, k/K, m/M, g/G
> SELECT name FROM employee TABLESAMPLE(1B) a;
+----------+
| name |
+----------+
| Michael |
+----------+
1 rows selected (0.075 seconds)