Caused by: com.facebook.presto.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://10.88.189.5:8080/v1/task/async/20221008_214612_00463_s57pb.0.0.0/results/0/0 - 30 failures, failure duration 302.87s, total failed request time 312.87s)
at com.facebook.presto.operator.PageBufferClient$1.onFailure(PageBufferClient.java:369)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1052)
... 3 more
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
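Every so often Presto becomes unavailable and reports activeWorkers=0.
Root cause analysis: most likely several large queries from offline jobs in the adhoc_etl queue were running in parallel. That put the workers under heavy load with frequent GC, including full GC, so communication between the coordinator and the workers timed out and the connections were dropped.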
Temporary workaround: limit the number of concurrently running queries with hardConcurrencyLimit.
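A minimal sketch of what such a limit could look like, assuming the file-based resource group manager is in use. Only the group name adhoc_etl and the hardConcurrencyLimit property come from these notes; the memory limit, queue size, and selector pattern are placeholders, and the limit of 4 is just the upper end of the notes' rough estimate.

```python
import json

# Hypothetical resource group definition. Only "adhoc_etl" and
# hardConcurrencyLimit are taken from the notes; every other value
# is an assumed placeholder to be tuned for the real cluster.
resource_groups = {
    "rootGroups": [
        {
            "name": "adhoc_etl",
            "softMemoryLimit": "80%",    # assumed value
            "hardConcurrencyLimit": 4,   # max queries running at once in this group
            "maxQueued": 100,            # assumed value: queries allowed to wait
        }
    ],
    "selectors": [
        # Assumed routing rule: send queries whose source matches to adhoc_etl.
        {"source": ".*etl.*", "group": "adhoc_etl"}
    ],
}


def render_resource_groups(config: dict) -> str:
    """Serialize the definition as JSON text for a resource groups config file."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render_resource_groups(resource_groups))
```

With a hard limit in place, additional queries wait in the queue instead of running, which trades latency for lower worker memory pressure.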
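query.max-total-memory=64 (defaults to twice query.max-memory)
With 7 workers at about 20 GB of JVM memory each, concurrent queries = 20*7/34 ≈ 4
So the number of concurrent queries is best kept at around 3 to 4 (a rough estimate).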
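The estimate above can be reproduced with a small helper. This is only a sketch of the notes' back-of-the-envelope arithmetic; the function name is hypothetical, and the divisor 34 is taken as written in the notes. It treats the whole heap as available to queries, which is optimistic, so the lower end of the 3 to 4 range is the more conservative choice.

```python
import math


def estimate_concurrency(workers: int, heap_gb_per_worker: float, per_query_gb: float) -> int:
    """Rough upper bound: total worker heap divided by the per-query memory budget."""
    if per_query_gb <= 0:
        raise ValueError("per_query_gb must be positive")
    total_heap_gb = workers * heap_gb_per_worker
    return math.floor(total_heap_gb / per_query_gb)


if __name__ == "__main__":
    # 7 workers * 20 GB = 140 GB; 140 / 34 is about 4.1, rounded down to 4
    print(estimate_concurrency(7, 20, 34))
```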
As for why communication never recovers after the timeout, finding the cause will require a deeper analysis of the source code.