The node may have crashed or be under too much load. This is probably a transient issue, so pleas...

Caused by: com.facebook.presto.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://10.88.189.5:8080/v1/task/async/20221008_214612_00463_s57pb.0.0.0/results/0/0 - 30 failures, failure duration 302.87s, total failed request time 312.87s)
        at com.facebook.presto.operator.PageBufferClient$1.onFailure(PageBufferClient.java:369)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1052)
        ... 3 more
    Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed

Every so often Presto becomes unavailable and reports activeWorkers=0:

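Root cause analysis: most likely, several large queries from offline jobs in the adhoc_etl queue run in parallel and put the workers under heavy load. The workers then GC frequently, sometimes with full GCs, so communication between the coordinator and the workers times out and the connection is dropped.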

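Temporary workaround: cap the number of concurrently running queries with hardConcurrencyLimit.
query.max-total-memory=64 (defaults to twice query.max-memory)
With 7 workers and a 20 GB JVM heap (Xmx) on each, concurrent queries = 20 * 7 / 34 ≈ 4
So the number of concurrent queries is best kept at around 3 to 4 (an estimate).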
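
The estimate above can be reproduced with a few lines of Python. This is only a sketch. The divisor 34 is taken from the calculation as written, and treating it as a per-query memory budget in GB is an assumption. The resource group layout follows the file-based resource group JSON format. The softMemoryLimit and maxQueued values and the catch-all selector are illustrative placeholders, not settings from this cluster.

```python
import json


def estimate_concurrency(workers: int, heap_gb_per_worker: int, per_query_gb: int) -> int:
    """Floor of the total worker heap divided by the per-query memory budget."""
    return (workers * heap_gb_per_worker) // per_query_gb


def resource_group(name: str, hard_concurrency_limit: int) -> dict:
    """Build a minimal resource group definition with a concurrency cap."""
    return {
        "rootGroups": [
            {
                "name": name,
                "softMemoryLimit": "80%",  # placeholder, not from the article
                "hardConcurrencyLimit": hard_concurrency_limit,
                "maxQueued": 100,  # placeholder, not from the article
            }
        ],
        # Catch-all selector (assumption): route every query to this group.
        "selectors": [{"group": name}],
    }


# Numbers from the text: 7 workers, 20 GB heap each, divisor 34.
limit = estimate_concurrency(workers=7, heap_gb_per_worker=20, per_query_gb=34)
config = resource_group("adhoc_etl", limit)

print(limit)  # 4
print(json.dumps(config, indent=2))
```

Rounding down is deliberate. The figure is a rough upper bound, so starting at 3 and raising it only if the workers stay stable is the more cautious choice.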

As for why communication never recovers after the timeout, that will require a deeper look at the source code.
