Failure Analysis of an Overloaded k8s Cluster

Incident Overview

We received reports that kubelet's port 10250 was unreachable. Investigation showed that some Nodes were NotReady and many Pods were stuck in Terminating. At that point the whole K8S cluster was effectively out of control: workloads could not run normally, old Pods kept being deleted, new Pods could hardly be scheduled, and some Pods could not be deleted at all. Part of the services under load testing stopped working.

Incident Analysis

Why did so many Nodes become NotReady?
Checking the Nodes showed that several were already NotReady:

kubectl get nodes
NAME                    STATUS     ROLES    AGE   VERSION
op-k8s1-pm         Ready      <none>   66d   v1.17.3
op-k8s10-pm        Ready      <none>   46h   v1.17.3
op-k8s11-pm        Ready      <none>   46h   v1.17.3
op-k8s2-pm         NotReady   <none>   66d   v1.17.3
op-k8s3-pm         NotReady   <none>   66d   v1.17.3
op-k8s4-pm         NotReady   <none>   66d   v1.17.3
op-k8s5-pm         NotReady   <none>   66d   v1.17.3
op-k8s6-pm         NotReady   <none>   66d   v1.17.3
...
op-k8smaster3-pm   Ready      master   69d   v1.17.3

Below is the resource usage on op-k8s2-pm. Memory is almost exhausted, the load average is above 700, both kswapd threads are pinned near 99% CPU, and iowait is at 81.5%; the node is thrashing:

free -g
              total        used        free      shared  buff/cache   available
Mem:            250         242           1           0           7           3
Swap:             0           0           0
uptime
18:10:11 up 70 days, 8 min, 2 users, load average: 733.31, 616.92, 625.68
ps aux|grep java |grep -v tini |wc -l
91
top
top - 18:11:24 up 70 days, 10 min, 2 users, load average: 579.80, 607.49, 622.64
Tasks: 1069 total, 3 running, 688 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.4 us, 7.0 sy, 0.0 ni, 0.2 id, 81.5 wa, 0.4 hi, 1.5 si, 0.0 st
KiB Mem : 26275092+total, 984572 free, 25881926+used, 2947100 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 370160 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
973481 nfsnobo+ 20 0 126808 28788 0 S 403.5 0.0 709:41.85 node_exporter
957773 1004 20 0 16.6g 3.0g 0 S 100.0 1.2 3:02.27 java
277 root 20 0 0 0 0 R 99.1 0.0 52:00.89 kswapd0
278 root 20 0 0 0 0 R 99.1 0.0 100:21.86 kswapd1
895608 root 20 0 1706728 3600 2336 D 7.1 0.0 3:04.87 journalctl
874 root 20 0 4570848 127852 0 S 4.4 0.0 8:16.75 kubelet
11115 maintain 20 0 165172 1760 0 R 2.7 0.0 0:00.21 top
965470 1004 20 0 17.3g 2.8g 0 S 2.7 1.1 1:59.22 java
9838 root 20 0 0 0 0 I 1.8 0.0 0:04.95 kworker/u98:0-f
952613 1004 20 0 19.7g 2.8g 0 S 1.8 1.1 1:51.01 java
954967 1004 20 0 13.6g 3.0g 0 S 1.8 1.2 3:00.73 java

By this point op-k8s2-pm already had a large number of Pods stuck in Terminating:

kubectl get pod -owide --all-namespaces |grep op-k8s2-pm
test            tutor-episode--venv-stress-fudao-6c68ff7f89-w49tv                 1/1     Terminating         0          23h     10.1.4.56    op-k8s2-pm         <none>           <none>
test            tutor-es-lesson--venv-stress-fudao-69f67c4dc4-r56m4               1/1     Terminating         0          23h     10.1.4.93    op-k8s2-pm         <none>           <none>
test            tutor-faculty--venv-stress-fudao-7f44fbdcd5-dzcxq                 1/1     Terminating         0          23h     10.1.4.45    op-k8s2-pm         <none>           <none>
...
test            tutor-oauth--venv-stress-fudao-5989489c9d-jtzgg                   1/1     Terminating         0          23h     10.1.4.78    op-k8s2-pm         <none>           <none>

The kubelet log was full of "use of closed network connection" errors. K8S clients use long-lived HTTP/2 connections by default, so this error means the kubelet's connection to the Apiserver was broken; netstat on the Apiserver likewise showed no trace of the 45688 connection:

Nov 05 21:58:35 op-k8s2-pm kubelet[105611]: E1105 21:58:35.276562 105611 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://10.2.2.2:443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: write tcp 10.2.2.7:45688->10.2.2.2:443: use of closed network connection
Nov 05 21:58:35 op-k8s2-pm kubelet[105611]: I1105 21:58:35.346898 105611 config.go:100] Looking for [api file], have seen map[]
Nov 05 21:58:35 op-k8s2-pm kubelet[105611]: I1105 21:58:35.446871 105611 config.go:100] Looking for [api file], have seen map[]

At this point we could conclude that resource exhaustion on op-k8s2-pm left the kubelet unable to reach the Apiserver, which is what drove the Node to NotReady. The other affected Nodes showed essentially the same picture.

Why did such massive resource contention appear all at once?

Listing the Deployments showed that many had been scaled up to 32 replicas, most of them unable to get all replicas running:

tutor-lesson-activity--venv-stress-fudao            16/16   16           16          12h28m
tutor-lesson-renew--venv-stress-fudao               9/10    9            32          12h45m
tutor-live-data-check                               9/32    9            32          12h10m
tutor-oauth--venv-stress-fudao                      16/32   16           32          12h19m
tutor-pepl--venv-stress-fudao                       10/32   10           32          12h
tutor-profile--venv-stress-fudao                    7/32    7            32          12h
tutor-recommend--venv-stress-fudao                  32/32   32           32          12h
...
tutor-student-lesson--venv-stress-fudao             24/24   24           24          12h
Taking tutor-recommend as an example, its requests are far below its limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    yfd_deploy_version: 875f2832-1e81-11eb-ba04-00163e0a5041
  labels:
    project: tutor-recommend
  name: tutor-recommend--venv-stress-fudao
  namespace: test
spec:
  replicas: 32
  selector:
    matchLabels:
      project: tutor-recommend
  template:
    metadata:
      labels:
        project: tutor-recommend
    spec:
      containers:
      - resources:
          limits:
            cpu: "4"
            memory: 8G
          requests:
            cpu: 500m
            memory: 512M

After asking around, we learned that the load-testing team wanted to use the k8s test environment to simulate production for a stress test, but the test cluster has far fewer nodes than production.

Conclusion: many Deployments had their replica counts raised dramatically, and resources were heavily oversold (requests far below limits). The cluster eventually ran out of resources, contention set in, and the kubelets were crushed one after another into NotReady.
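The degree of overselling can be read straight off the Deployment spec above: each replica requests 500m CPU / 512M memory but may burst to 4 CPU / 8G. A quick sketch (the 48-core node size is a hypothetical figure for illustration; the real core count was not recorded):

```go
package main

import "fmt"

func main() {
	// Per-replica values from the tutor-recommend Deployment above
	// (K8s decimal units: 512M = 512e6 bytes, 8G = 8e9 bytes).
	reqCPU, limCPU := 0.5, 4.0   // cores
	reqMem, limMem := 512e6, 8e9 // bytes

	fmt.Printf("CPU oversell: %.1fx\n", limCPU/reqCPU)    // 8.0x
	fmt.Printf("Memory oversell: %.3fx\n", limMem/reqMem) // 15.625x

	// The scheduler packs Pods by *requests*, so on a hypothetical 48-core
	// node, CPU requests alone admit 96 such Pods, which at full limit
	// could demand 96 * 4 = 384 cores.
	podsPerNode := int(48 / reqCPU)
	fmt.Printf("Pods admitted by CPU request: %d\n", podsPerNode)
}
```

With an 8x CPU and roughly 15x memory oversell, it only takes a fraction of the Pods hitting their limits at once (exactly what a stress test does) to exhaust the node.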

Why did so many Pods end up Terminating?

Kubelet eviction was never triggered (an evicted Pod's status is Evicted), yet these Pods were Terminating; during the investigation we also kept seeing new Pods being created while old ones went Terminating.

So why wasn't eviction triggered? There was no disk pressure, and with the kubelet configuration below, available memory has to fall under 100Mi before memory eviction fires, which never happened:

evictionHard:
  imagefs.available: 1%
  memory.available: 100Mi
  nodefs.inodesFree: 1%
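With these thresholds, eviction only fires once less than 100Mi of memory remains, by which point the node is already thrashing (see kswapd at 99% above). For comparison, a configuration that evicts earlier might look like the sketch below (illustrative values only, not a recommendation; see the notes below for why we avoid relying on eviction at all):

```
evictionHard:
  memory.available: "1Gi"     # fire well before the node starts thrashing
  nodefs.available: "10%"
  imagefs.available: "15%"
  nodefs.inodesFree: "5%"
```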

Note 1: eviction is hard to control. Once disk or memory pressure shows up, it is better handled manually; automatic eviction rarely fixes the underlying problem and easily flaps: evicting one Pod relieves the node briefly, pressure returns, and eviction fires again and again.
Note 2: we plan to disable kubelet eviction entirely going forward.

The kube-controller-manager logs showed:

endpoints_controller.go:590] Pod is out of service: test/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz
taint_manager.go:105] NoExecuteTaintManager is deleting Pod: test/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz
request.go:565] Throttling request took 1.148923424s, request: DELETE:https://10.2.2.2:443/api/v1/namespaces/test/pods/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz
disruption.go:457] No PodDisruptionBudgets found for pod tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz, PodDisruptionBudget controller will avoid syncing
endpoints_controller.go:420] Pod is being deleted test/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz
controller_utils.go:911] Ignoring inactive pod test/tutor-episode--venv-stress-fudao-6c68ff7f89-h8wwz in state Running, deletion time 2020-11-05 10:06:27 +0000 UTC

So the Pods were being deleted by the NoExecuteTaintManager controller. Tracing the code in
kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager.go

handleNodeUpdate

func (tc *NoExecuteTaintManager) handleNodeUpdate(nodeUpdate nodeUpdateItem) {
    node, err := tc.getNode(nodeUpdate.nodeName)
    if err != nil {
        if apierrors.IsNotFound(err) {
            // Delete
            klog.V(4).Infof("Noticed node deletion: %#v", nodeUpdate.nodeName)
            tc.taintedNodesLock.Lock()
            defer tc.taintedNodesLock.Unlock()
            delete(tc.taintedNodes, nodeUpdate.nodeName)
            return
        }
        utilruntime.HandleError(fmt.Errorf("cannot get node %s: %v", nodeUpdate.nodeName, err))
        return
    }
 
    // Create or Update
    klog.V(4).Infof("Noticed node update: %#v", nodeUpdate)
    taints := getNoExecuteTaints(node.Spec.Taints)
    func() {
        tc.taintedNodesLock.Lock()
        defer tc.taintedNodesLock.Unlock()
        klog.V(4).Infof("Updating known taints on node %v: %v", node.Name, taints)
        if len(taints) == 0 {
            delete(tc.taintedNodes, node.Name)
        } else {
            tc.taintedNodes[node.Name] = taints
        }
    }()
 
    // This is critical that we update tc.taintedNodes before we call getPodsAssignedToNode:
    // getPodsAssignedToNode can be delayed as long as all future updates to pods will call
    // tc.PodUpdated which will use tc.taintedNodes to potentially delete delayed pods.
    pods, err := tc.getPodsAssignedToNode(node.Name)
    if err != nil {
        klog.Errorf(err.Error())
        return
    }
    if len(pods) == 0 {
        return
    }
    // Short circuit, to make this controller a bit faster.
    if len(taints) == 0 {
        klog.V(4).Infof("All taints were removed from the Node %v. Cancelling all evictions...", node.Name)
        for i := range pods {
            tc.cancelWorkWithEvent(types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name})
        }
        return
    }
 
    now := time.Now()
    for _, pod := range pods {
        podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
        tc.processPodOnNode(podNamespacedName, node.Name, pod.Spec.Tolerations, taints, now)
    }
}

processPodOnNode

func (tc *NoExecuteTaintManager) processPodOnNode(
    podNamespacedName types.NamespacedName,
    nodeName string,
    tolerations []v1.Toleration,
    taints []v1.Taint,
    now time.Time,
) {
    if len(taints) == 0 {
        tc.cancelWorkWithEvent(podNamespacedName)
    }
    allTolerated, usedTolerations := v1helper.GetMatchingTolerations(taints, tolerations)
    if !allTolerated {
        klog.V(2).Infof("Not all taints are tolerated after update for Pod %v on %v", podNamespacedName.String(), nodeName)
        // We're canceling scheduled work (if any), as we're going to delete the Pod right away.
        tc.cancelWorkWithEvent(podNamespacedName)
        tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), time.Now(), time.Now())
        return
    }
    minTolerationTime := getMinTolerationTime(usedTolerations)
    // getMinTolerationTime returns negative value to denote infinite toleration.
    if minTolerationTime < 0 {
        klog.V(4).Infof("New tolerations for %v tolerate forever. Scheduled deletion won't be cancelled if already scheduled.", podNamespacedName.String())
        return
    }
 
    startTime := now
    triggerTime := startTime.Add(minTolerationTime)
    scheduledEviction := tc.taintEvictionQueue.GetWorkerUnsafe(podNamespacedName.String())
    if scheduledEviction != nil {
        startTime = scheduledEviction.CreatedAt
        if startTime.Add(minTolerationTime).Before(triggerTime) {
            return
        }
        tc.cancelWorkWithEvent(podNamespacedName)
    }
    tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), startTime, triggerTime)
}

The NodeLifecycleController watches all Nodes; once a Node carries a NoExecute taint, the taint manager checks every Pod on it for a matching toleration. A Pod with no matching toleration is deleted immediately; a Pod whose toleration carries tolerationSeconds is scheduled for deletion after that time; a toleration without tolerationSeconds (negative minTolerationTime in the code) tolerates the taint forever and is never deleted.
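The decision in processPodOnNode can be condensed into a small sketch (a simplified re-implementation for a single taint key, not the actual v1helper code):

```go
package main

import "fmt"

type Toleration struct {
	Key               string
	Effect            string // only "NoExecute" matters here
	TolerationSeconds *int64 // nil => tolerate forever
}

// deletionDelay mimics the taint manager's decision for one NoExecute taint:
// (-1, false) => delete immediately; (-1, true) => never delete;
// (n, true)  => delete after n seconds.
func deletionDelay(taintKey string, tols []Toleration) (int64, bool) {
	for _, t := range tols {
		if t.Key == taintKey && t.Effect == "NoExecute" {
			if t.TolerationSeconds == nil {
				return -1, true // infinite toleration
			}
			return *t.TolerationSeconds, true
		}
	}
	return -1, false // not tolerated: immediate deletion
}

func main() {
	secs := int64(300)
	pod := []Toleration{{
		Key: "node.kubernetes.io/not-ready", Effect: "NoExecute", TolerationSeconds: &secs,
	}}

	if d, ok := deletionDelay("node.kubernetes.io/not-ready", pod); ok {
		fmt.Printf("tolerated, deletion scheduled in %ds\n", d) // 300s, as on our Pods
	}
	if _, ok := deletionDelay("node.kubernetes.io/unreachable", nil); !ok {
		fmt.Println("no toleration: deleted immediately")
	}
}
```

This matches what we observed: Pods stayed Running for 300 seconds after their Node went NotReady, then flipped to Terminating in bulk.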

The Kubernetes documentation explains it as follows (see https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/):

NoExecute

Normally, if a taint with effect NoExecute is added to a node, then any pods that do not tolerate the taint will be evicted immediately, and pods that do tolerate the taint will never be evicted. However, a toleration with NoExecute effect can specify an optional tolerationSeconds field that dictates how long the pod will stay bound to the node after the taint is added.

kubectl get pods -n test tutor-recommend--venv-stress-fudao-5d89fc7dd5-vcp8h -ojson |jq '.spec.tolerations'
[
  {
    "effect": "NoExecute",
    "key": "node.kubernetes.io/not-ready",
    "operator": "Exists",
    "tolerationSeconds": 300
  },
  {
    "effect": "NoExecute",
    "key": "node.kubernetes.io/unreachable",
    "operator": "Exists",
    "tolerationSeconds": 300
  }
]

So the Pod did carry not-ready and unreachable tolerations with tolerationSeconds set. But who set them, and when?

The code shows they are injected by the DefaultTolerationSeconds admission controller when the Pod is created. From v1.13 on this mechanism is enabled by default and, together with taint-based eviction, replaces the --pod-eviction-timeout flag.
kubernetes/plugin/pkg/admission/defaulttolerationseconds/admission.go

Admit

// Admit makes an admission decision based on the request attributes
func (p *Plugin) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
    if attributes.GetResource().GroupResource() != api.Resource("pods") {
        return nil
    }
 
    if len(attributes.GetSubresource()) > 0 {
        // only run the checks below on pods proper and not subresources
        return nil
    }
 
    pod, ok := attributes.GetObject().(*api.Pod)
    if !ok {
        return errors.NewBadRequest(fmt.Sprintf("expected *api.Pod but got %T", attributes.GetObject()))
    }
 
    tolerations := pod.Spec.Tolerations
 
    toleratesNodeNotReady := false
    toleratesNodeUnreachable := false
    for _, toleration := range tolerations {
        if (toleration.Key == v1.TaintNodeNotReady || len(toleration.Key) == 0) &&
            (toleration.Effect == api.TaintEffectNoExecute || len(toleration.Effect) == 0) {
            toleratesNodeNotReady = true
        }
 
        if (toleration.Key == v1.TaintNodeUnreachable || len(toleration.Key) == 0) &&
            (toleration.Effect == api.TaintEffectNoExecute || len(toleration.Effect) == 0) {
            toleratesNodeUnreachable = true
        }
    }
 
    if !toleratesNodeNotReady {
        pod.Spec.Tolerations = append(pod.Spec.Tolerations, notReadyToleration)
    }
 
    if !toleratesNodeUnreachable {
        pod.Spec.Tolerations = append(pod.Spec.Tolerations, unreachableToleration)
    }
 
    return nil
}
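The 300-second defaults appended above are themselves configurable on the apiserver side. To our knowledge kube-apiserver exposes flags for both values (verify against the flag reference for your version):

```
kube-apiserver \
  --enable-admission-plugins=...,DefaultTolerationSeconds \
  --default-not-ready-toleration-seconds=300 \
  --default-unreachable-toleration-seconds=300
```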

Conclusion

The flood of Terminating Pods was caused by NoExecute (taint-based) eviction kicking in after Nodes went NotReady. For Deployments this behavior is acceptable; for StatefulSets we recommend disabling it.
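One way to disable it for a StatefulSet (a sketch based on the processPodOnNode logic above, where a toleration without tolerationSeconds means "tolerate forever") is to add explicit infinite tolerations to the Pod template; since the Pod then already tolerates both taints, the DefaultTolerationSeconds plugin leaves them untouched:

```
# Sketch: Pod-template tolerations that suppress NoExecute eviction entirely.
# Omitting tolerationSeconds makes the toleration infinite.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
```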
