The node may have crashed or be under too much load. This is probably a transient issue, so pleas...

Caused by: com.facebook.presto.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://10.88.189.5:8080/v1/task/async/20221008_214612_00463_s57pb.0.0.0/results/0/0 - 30 failures, failure duration 302.87s, total failed request time 312.87s)
        at com.facebook.presto.operator.PageBufferClient$1.onFailure(PageBufferClient.java:369)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1052)
        ... 3 more
    Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed

Presto periodically becomes unavailable and reports activeWorkers=0.

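Root cause analysis: most likely, several large queries from offline jobs in the adhoc_etl queue run in parallel. This puts the workers under heavy load and causes frequent GC, even full GC, so communication between the coordinator and the workers times out and the connections are dropped.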

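Temporary workaround: limit the number of concurrently running queries with hardConcurrencyLimit.
query.max-total-memory=64 (defaults to twice query.max-memory)
With 7 workers, each with a 20 GB JVM heap, concurrent queries = 20*7/34 ≈ 4
So the number of concurrent queries is best kept to about 3~4 (an estimate).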
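
The estimate above can be reproduced with a short Python sketch. It assumes the divisor 34 is the per-query memory budget in GB, which the text does not state. The function name `estimate_concurrency` is hypothetical, and so are the `softMemoryLimit` and `maxQueued` values in the resource group entry. Only `hardConcurrencyLimit` and the `adhoc_etl` group name come from the notes above.

```python
import json
import math


def estimate_concurrency(workers: int, heap_gb_per_worker: float, per_query_gb: float) -> int:
    """Rough bound: total worker heap divided by per-query memory, rounded down (at least 1)."""
    total_heap_gb = workers * heap_gb_per_worker
    return max(1, math.floor(total_heap_gb / per_query_gb))


# Figures from the text: 7 workers, 20 GB heap each, divisor 34
# (assumed here to be the per-query memory in GB).
limit = estimate_concurrency(workers=7, heap_gb_per_worker=20, per_query_gb=34)

# Hypothetical resource group entry for the adhoc_etl queue.
resource_group = {
    "name": "adhoc_etl",
    "softMemoryLimit": "80%",  # assumption, not from the source
    "maxQueued": 100,          # assumption, not from the source
    "hardConcurrencyLimit": limit,
}

print(json.dumps(resource_group, indent=2))
```

This is only a back-of-the-envelope bound. JVM heap is not entirely available to queries, so a value at the lower end of the 3~4 range is the safer choice.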

Why communication never recovers after the timeout is still unknown and needs a deeper analysis of the source code.
