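How to restart Presto deployed on k8s

After restarting Presto, SQL queries failed even though the number of active workers looked normal. The start time of each pod: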

# presto-0
Started:      Mon, 24 Oct 2022 16:44:13 +0800
 
# presto-1 
Started:      Mon, 24 Oct 2022 16:43:52 +0800 
      
# presto-2 
Started:      Mon, 24 Oct 2022 16:43:23 +0800

# presto-3 
Started:      Mon, 24 Oct 2022 16:42:49 +0800

# presto-4  
Started:      Mon, 24 Oct 2022 16:42:36 +0800
      
# presto-5  
Started:      Mon, 24 Oct 2022 16:41:56 +0800
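
The start times above can be collected in one pass. The following is a minimal sketch, assuming the pods are named presto-0 through presto-5 as in the listing, live in the current namespace, and each run a single container (with several containers, only the first "Started:" line is kept). The function name print_start_times is made up for illustration.

```shell
# Print "<pod> <start time>" for pods presto-<first> .. presto-<last>.
# Assumption: pod names follow the presto-N pattern shown in the listing.
print_start_times() {
  first=$1
  last=$2
  for i in $(seq "$first" "$last"); do
    # `kubectl describe pod` prints a "Started:" line under the container state;
    # keep the first one and strip the label and padding.
    started=$(kubectl describe pod "presto-$i" | grep -m1 'Started:' | sed 's/^ *Started: *//')
    printf 'presto-%s %s\n' "$i" "$started"
  done
}
```

Calling print_start_times 0 5 would print one line per pod, which makes the restart order easy to read off.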

Pod logs

2022-10-25T12:08:26.493+0800    WARN    UpdateResponseHandler-20221025_040440_00343_e3293.4.0.2-14627   com.facebook.presto.server.RequestErrorTracker  Error updating task 20221025_040440_00343_e3293.4.0.2: java.net.SocketTimeoutException: Connect Timeout: http://10.88.122.92:8080/v1/task/20221025_040440_00343_e3293.4.0.2
2022-10-25T12:08:26.751+0800    WARN    ContinuousTaskStatusFetcher-20221025_040440_00343_e3293.3.0.1-14655     com.facebook.presto.server.RequestErrorTracker  Error getting task status 20221025_040440_00343_e3293.3.0.1: java.net.SocketTimeoutException: Connect Timeout: http://10.88.186.109:8080/v1/task/20221025_040440_00343_e3293.3.0.1
    
# Logs of presto-4 and presto-5: both keep trying to connect to the IP of a pod that no longer exists.
2022-10-25T11:50:42.697+0800    WARN    http-client-node-manager-timeout        com.facebook.presto.metadata.HttpRemoteNodeState        Error fetching node state from http://10.88.186.108:8080/v1/info/state: java.net.SocketTimeoutException: Connect Timeout
2022-10-25T11:50:52.700+0800    WARN    http-client-node-manager-timeout        com.facebook.presto.metadata.HttpRemoteNodeState        Error fetching node state from http://10.88.186.108:8080/v1/info/state: java.net.SocketTimeoutException: Connect Timeout

2022-10-25T12:43:36.266+0800    WARN    http-client-node-manager-timeout        com.facebook.presto.metadata.HttpRemoteNodeState        Error fetching node state from http://10.88.122.77:8080/v1/info/state: java.net.SocketTimeoutException: Connect Timeout
2022-10-25T12:43:46.270+0800    WARN    http-client-node-manager-timeout        com.facebook.presto.metadata.HttpRemoteNodeState        Error fetching node state from http://10.88.122.77:8080/v1/info/state: java.net.SocketTimeoutException: Connect Timeout
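
One way to confirm that the addresses in these warnings are stale is to compare them against the IPs of the pods that currently exist. The sketch below assumes the log format shown above and that the Presto pods are in the current namespace; the function names timeout_ips and stale_ips are made up for illustration.

```shell
# Read Presto log lines on stdin and print the distinct IPs that appear
# in "Connect Timeout" warnings.
timeout_ips() {
  grep 'Connect Timeout' \
    | grep -oE 'http://[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' \
    | sed 's|http://||' \
    | sort -u
}

# Read Presto log lines on stdin and print only the timed-out IPs that
# do not belong to any pod in the current namespace.
stale_ips() {
  # One pod IP per line, taken from the live pod list.
  pod_ips=$(kubectl get pods -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}')
  timeout_ips | while read -r ip; do
    printf '%s\n' "$pod_ips" | grep -qxF "$ip" || echo "$ip"
  done
}
```

For example, kubectl logs presto-4 | stale_ips would list the addresses that presto-4 still tries to reach but that no pod owns any more.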

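Root cause analysis
1. The logs of presto-4 and presto-5 show that both keep trying to connect to the IP of a pod that no longer exists.
2. Judging by the start times, the workers were restarted one by one starting from presto-5, and the master (presto-0) was restarted last.
3. The likely explanation: right after presto-4 and presto-5 had restarted, while the other Presto nodes had not yet been restarted, a query was submitted to Presto. The master (presto-0) immediately assigned tasks to every node. While those tasks were still running, the remaining workers and then the master were restarted in turn, so the tasks on presto-4 and presto-5 could never complete.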

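Solution
1. When restarting the Presto cluster, restart the master (presto-0) first and then restart the workers one by one. This avoids the case where a query submitted during the restart leaves the cluster unusable.
2. In practice, restart with scale as shown below rather than with rollout. With the default pod management policy, a StatefulSet rolling update replaces pods from the highest ordinal down, so presto-0 comes last, while scaling up from zero creates pods starting at presto-0:
kubectl scale sts presto --replicas=0 # kill all presto pods
kubectl scale sts presto --replicas=3 # start 3 presto pods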
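
The two commands can be wrapped so that the scale-up only begins once every old pod is gone. This is a minimal sketch, not a definitive procedure: it assumes the StatefulSet is named presto as above, and the label app=presto and the function name restart_presto are hypothetical, so adjust them to match your manifests.

```shell
# Restart the Presto StatefulSet by scaling it to zero and back up.
# Usage: restart_presto <replicas>
# Assumption: pods carry the (hypothetical) label app=presto.
restart_presto() {
  replicas=$1
  kubectl scale sts presto --replicas=0
  # Poll until no Presto pod is left, so no old task outlives the restart.
  while [ -n "$(kubectl get pods -l app=presto --no-headers 2>/dev/null)" ]; do
    sleep 2
  done
  kubectl scale sts presto --replicas="$replicas"
  # Block until the StatefulSet reports all replicas ready.
  kubectl rollout status sts/presto
}
```

Calling restart_presto 3 reproduces the two commands above. Pass the replica count your cluster actually runs.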
