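Test setup: 8 nodes; 128 GB memory; 48 cores; Hadoop 3 cluster
Flow data (with domain); Snappy-compressed; one day of data replicated to cover 30 days; 82 GB x 30 total size on HDFS; 45 billion rows
Concurrency setting: 96 (the recommended configuration); 128MB per split; min-drivers-per-task on leaf stages = 48 * 8 = 384
Presto cache: file header and footer caching enabled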
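
As a rough sanity check on these numbers, the sketch below estimates how many 128MB splits the data set produces and how that compares with the 384 drivers. It assumes 1 GB = 1024 MB and that every split is exactly 128MB; real split counts depend on file and stripe boundaries, so treat the result as an order-of-magnitude figure only.

```python
# Rough split-count estimate from the figures in the test setup.
# Assumption: fixed 128 MB splits; real splits follow file/stripe boundaries.

def estimate_splits(total_gb: float, split_mb: int = 128) -> int:
    """Number of fixed-size splits needed to cover total_gb, rounded up."""
    total_mb = total_gb * 1024
    return int(-(-total_mb // split_mb))  # ceiling division


total_gb = 82 * 30        # 82 GB per day, 30 days
splits = estimate_splits(total_gb)
drivers = 48 * 8          # min-drivers-per-task from the setup above

print("total GB:", total_gb)
print("estimated splits:", splits)
print("splits per driver:", splits / drivers)
```

Under these assumptions each driver processes a few dozen splits over the course of a full scan.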

Pitfalls encountered

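Workers failed with "Failed on local exception: java.io.IOException: Too many open files". The fix was to raise the limits in /etc/security/limits.conf (entries below) and push the file to every node: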

clush -a --copy /etc/security/limits.conf --dest /etc/security/

* soft nofile 65536
* hard nofile 65536
* soft nproc 4096
* hard nproc 4096
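
To confirm that the new limits are actually in effect, a small check like the one below can be run on each node. It is only a sketch: the helper names are made up for illustration, and it reports the limits of the process that runs it. Since limits.conf is applied when a new login session starts, the Presto worker probably needs to be restarted from a fresh session before it sees the new values.

```python
import resource


def limit_satisfies(limit: int, required: int) -> bool:
    """True if an rlimit value is unlimited or at least `required`."""
    return limit == resource.RLIM_INFINITY or limit >= required


def nofile_ok(required: int = 65536) -> bool:
    """Check the soft and hard open-file limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return limit_satisfies(soft, required) and limit_satisfies(hard, required)


print("nofile limits:", resource.getrlimit(resource.RLIMIT_NOFILE))
print("meets 65536:", nofile_ok())
```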

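Some parameters can be adjusted dynamically with the SET SESSION command, but the key differs from the one persisted in the config file, for example: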

   hive.pushdown_filter_enabled //session
   hive.pushdown-filter-enabled=true //connector

   set session hive.pushdown_filter_enabled=true; //OK; note: do not add quotes

   The keys differ apparently because the SQL grammar rejects characters such as '-' in session property names
   (Query 20210810_100751_00013_vzz67 failed: line 1:21: mismatched input '-'. Expecting: '.', '=')
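
For the pair shown above, the session name is simply the config key with '-' replaced by '_'. The sketch below encodes that observation. It is an assumption drawn from this one example, not a documented rule, so a generated name should be checked against the output of SHOW SESSION before relying on it. It also assumes the catalog is named hive, so that the prefix is the same in both forms. The function names are hypothetical.

```python
def to_session_name(config_key: str) -> str:
    """Guess the session property name for a connector config key.

    Assumption: the names differ only by '-' versus '_', as in the
    hive.pushdown-filter-enabled example. Not guaranteed in general.
    """
    catalog, _, name = config_key.partition(".")
    return f"{catalog}.{name.replace('-', '_')}"


def set_session_sql(config_key: str, value: str) -> str:
    """Build a SET SESSION statement; the value is passed through unquoted."""
    return f"SET SESSION {to_session_name(config_key)} = {value}"


print(set_session_sql("hive.pushdown-filter-enabled", "true"))
```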

To enable remote debugging of the process:
-server -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8888
These options should be split into 3 lines (one option per line) in jvm.config ...
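
A minimal sketch of that split, assuming the options are separated by whitespace and none of them contains a space:

```python
# Turn the single-line flag string into one JVM option per line.
DEBUG_FLAGS = (
    "-server -Xdebug "
    "-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8888"
)


def one_option_per_line(flags: str) -> str:
    """Split whitespace-separated JVM options into one option per line."""
    return "\n".join(flags.split())


print(one_option_per_line(DEBUG_FLAGS))
```

The printed three lines can be appended to the existing JVM options file.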

Detailed configuration

cat config.properties

coordinator=true
discovery-server.enabled=true
node-scheduler.include-coordinator=true
http-server.http.port=8090

node-scheduler.max-splits-per-node=1024
node-scheduler.max-pending-splits-per-task=1024
task.max-worker-threads=96
task.min-drivers=192
task.min-drivers-per-task=384

query.max-memory=300GB
query.max-memory-per-node=30GB
query.max-total-memory-per-node=36GB
discovery.uri=http://192.168.255.102:8090
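
One thing worth checking in the memory settings above is how the per-node limit relates to the cluster-wide one. The sketch below parses the three memory keys and multiplies the per-node limit by the 8 nodes of this test cluster (the coordinator also schedules work here). If my reading of the settings is right, 8 x 30GB = 240GB is below query.max-memory=300GB, so the per-node limit would be hit first. The helper names are hypothetical and the size parser handles only the simple forms used in this file.

```python
import re

CONFIG = """\
query.max-memory=300GB
query.max-memory-per-node=30GB
query.max-total-memory-per-node=36GB
"""

UNITS = {"B": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_size(value: str) -> int:
    """Convert a size such as '30GB' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(B|kB|MB|GB|TB)", value)
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(float(match.group(1)) * UNITS[match.group(2)])


def reachable_cluster_memory(props: dict, nodes: int) -> int:
    """Upper bound on per-query memory if every node hits its own limit."""
    return parse_size(props["query.max-memory-per-node"]) * nodes


props = parse_properties(CONFIG)
reachable = reachable_cluster_memory(props, nodes=8)
print("reachable GB:", reachable / UNITS["GB"])
print("below query.max-memory:", reachable < parse_size(props["query.max-memory"]))
```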

cat hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://192.168.255.109:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml,/etc/hive/conf/hive-site.xml
#hive.node-selection-strategy=SOFT_AFFINITY
hive.node-selection-strategy=HARD_AFFINITY
hive.max-split-size=128MB
hive.max-initial-split-size=128MB
hive.max-initial-splits=1


hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/udap109@HADOOP.COM
hive.metastore.client.principal=hdfs@HADOOP.COM
hive.metastore.client.keytab=/home/hdfs.keytab
hive.metastore-impersonation-enabled=true

hive.hdfs.authentication.type=KERBEROS
hive.hdfs.impersonation.enabled=true
hive.hdfs.presto.principal=hdfs@HADOOP.COM
hive.hdfs.presto.keytab=/home/hdfs.keytab

hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*

hive.orc.file-tail-cache-enabled=true
hive.orc.file-tail-cache-size=1GB
hive.orc.file-tail-cache-ttl-since-last-access=6h

hive.orc.stripe-metadata-cache-enabled=true
hive.orc.stripe-footer-cache-size=1GB
hive.orc.stripe-footer-cache-ttl-since-last-access=6h
hive.orc.stripe-stream-cache-size=1GB
hive.orc.stripe-stream-cache-ttl-since-last-access=6h

Test plan

Simple count
Simple filter
Complex filter
Simple aggregation

Test results

image.png

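Some thoughts

This is only a preliminary test: the queries are simple, the scenarios are narrow, and no in-depth tuning was done. Even so, some problems are already visible.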

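Hive3 is clearly the slowest; its execution latency is still fairly high
Spark3 performs best, without any tuning at all (split size, shuffle parallelism, GC, etc.)
Presto is roughly 2x faster than Hive, and Spark is 2-3x faster than Presto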
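
Chaining the two ratios gives the implied gap between Spark and Hive. This is simple arithmetic on the rough figures above, not an additional measurement:

```python
def compose(first: tuple, second: tuple) -> tuple:
    """Multiply two (low, high) speedup ranges."""
    return (first[0] * second[0], first[1] * second[1])


presto_vs_hive = (2.0, 2.0)    # "roughly 2x"
spark_vs_presto = (2.0, 3.0)   # "2-3x"

spark_vs_hive = compose(presto_vs_hive, spark_vs_presto)
print("Spark vs Hive speedup range:", spark_vs_hive)
```

So in these tests Spark comes out somewhere around 4-6x faster than Hive.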

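Honestly, I am a little disappointed that Presto is this slow. I have read many posts and docs on the official site without finding a satisfactory tuning strategy. For simple filter-and-output queries like these, the core of tuning is nothing more than data-local scheduling, less network transfer, and an appropriate split size.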

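In practice, though, Presto's scheduling and execution framework turned out to be far more heavyweight than Spark's. Spark has a simple two-level stage/task structure; within a task everything interacts directly through Iterators, and a task's execution is complete and predictable. Presto's task execution is complex: time slices, priorities, all kinds of asynchrony and locking. This design is needed for fast response, pipelining, and multi-tenant workloads, but in my opinion it costs performance, and its data throughput clearly cannot match batch-style scheduling like Spark's.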

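In addition, Presto's intermediate data is relatively large: Page/Block appears to be much bigger than binary data (compared with Spark's OffHeap Binary ColumnRow). Data also moves through one Operator after another, with far too many function calls (Spark's codegen folds as much of a task's data-processing logic as possible into a single method, and the gain in execution efficiency is quite noticeable).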
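
The difference between an operator chain and a fused method can be illustrated with a toy example. This is neither Presto nor Spark code. It works row by row for brevity, whereas Presto operators actually exchange Pages (batches of rows). All names are made up. Both functions compute the same result, but the first pays for several generator hops and lambda calls per row while the second does everything in one loop body.

```python
from typing import Callable, Iterable, Iterator, List


def scan(rows: Iterable[int]) -> Iterator[int]:
    """Source operator: emit rows one at a time."""
    for row in rows:
        yield row


def filter_op(pred: Callable[[int], bool], rows: Iterable[int]) -> Iterator[int]:
    """Filter operator: pass through rows that satisfy pred."""
    for row in rows:
        if pred(row):
            yield row


def project_op(fn: Callable[[int], int], rows: Iterable[int]) -> Iterator[int]:
    """Projection operator: apply fn to every row."""
    for row in rows:
        yield fn(row)


def pipeline_operators(rows: Iterable[int]) -> List[int]:
    """Operator-at-a-time: each row crosses three generator boundaries."""
    return list(project_op(lambda r: r * 2, filter_op(lambda r: r % 3 == 0, scan(rows))))


def pipeline_fused(rows: Iterable[int]) -> List[int]:
    """Fused form: the same logic folded into a single loop."""
    out = []
    for row in rows:
        if row % 3 == 0:
            out.append(row * 2)
    return out


print(pipeline_operators(range(10)) == pipeline_fused(range(10)))
```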

Summary

Reasons for Presto's poor performance

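Split scheduling model: thread switching is extremely frequent while effective utilization is low
The design embeds a large amount of real-time statistics collection, and this overhead is too high
Predicate filtering performs poorly; for example, predicates in the TupleDomain framework may not be optimized as specifically as Spark's for fuzzy-match, IN, and similar conditions
In-memory data structures: buffers are limited and execution is pipelined, so there is a backpressure problem similar to Flink's; when buffers cannot keep up, upstream stages block and throughput drops (especially with many stages)
The Operator mechanism in a Task's Pipeline is too heavyweight, with a very long function call chain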
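
The backpressure point can be made concrete with a tiny tick-based simulation of a producer and a consumer connected by a bounded buffer. It is a toy model with made-up parameters, not a model of Presto's actual exchange buffers. It only shows that once the buffer fills, the upstream side stalls and end-to-end throughput drops to the consumer's rate.

```python
def simulate(ticks: int, produce_per_tick: int, consume_per_tick: int, capacity: int):
    """Return (items delivered, ticks in which the producer was throttled)."""
    buffered = 0
    delivered = 0
    stalled_ticks = 0
    for _ in range(ticks):
        # The producer can only add what fits in the bounded buffer.
        accepted = min(produce_per_tick, capacity - buffered)
        if accepted < produce_per_tick:
            stalled_ticks += 1
        buffered += accepted
        # The consumer drains at its own rate.
        consumed = min(consume_per_tick, buffered)
        buffered -= consumed
        delivered += consumed
    return delivered, stalled_ticks


# Fast producer, slow consumer: the buffer fills and upstream stalls.
print(simulate(ticks=10, produce_per_tick=4, consume_per_tick=2, capacity=8))
# Slow producer, fast consumer: no stalls.
print(simulate(ticks=10, produce_per_tick=2, consume_per_tick=4, capacity=8))
```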

PS: the figure below shows dstat output during execution; the system int/csw (interrupts and context switches) figures are far too high
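
For reference, on Linux the cumulative interrupt and context-switch counters are exposed in /proc/stat (the "intr" and "ctxt" lines). The sketch below computes per-second rates from two snapshots of that file. The helper names are hypothetical and the sample text is shortened and invented for illustration.

```python
def parse_proc_stat(text: str) -> dict:
    """Extract cumulative interrupt and context-switch counts from /proc/stat text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "intr":
            counters["int"] = int(fields[1])   # first number is the total
        elif fields[0] == "ctxt":
            counters["csw"] = int(fields[1])
    return counters


def rates(before: dict, after: dict, seconds: float) -> dict:
    """Per-second rates between two snapshots taken `seconds` apart."""
    return {key: (after[key] - before[key]) / seconds for key in before}


# Invented sample snapshots, one second apart.
SAMPLE_1 = "cpu  100 0 50 1000\nintr 5000 1 2 3\nctxt 20000\n"
SAMPLE_2 = "cpu  110 0 55 1090\nintr 9000 1 2 3\nctxt 95000\n"

print(rates(parse_proc_stat(SAMPLE_1), parse_proc_stat(SAMPLE_2), seconds=1.0))
```

On a real node the two snapshots would come from reading /proc/stat twice with a sleep in between.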