安装节点健康监测
在kubernetes集群上,通常我们只是管制集群本身以及容器的稳定运行。但是这些稳定性都是强依赖节点node的稳定的。通过节点健康监测,将节点的信息通知到apiServer,避免pod调度到异常节点。node problem detector就是专门来做这件事情。
一般节点常见的问题主要有
1、硬件错误
- CPU坏了
- Memory坏了
- 磁盘坏了
2、kernel问题
- kernel deadlock (内核死锁)
- corrupted file systems (文件系统崩溃)
- unresponsive runtime daemons (系统运行后台进程无响应)
3、docker问题
- unresponsive runtime daemons (docker后台进程无响应)
- docker image error (docker文件系统错误)
K8S集群管理对node的健康状态是无法感知的,pod依旧会调度到有问题的node上,通过DaemonSet部署node-problem-detector,向apiserver上报node的状态信息,使node的健康状态对上游管理可见,pod不会再调度到有异常的node上。
这里刚开始也是踩了坑,k8s官方给的demo比较简单,版本还是v0.1,node-problem-detector这个项目的镜像版本是v0.8.1,但是这俩都没有明确的给出权限这一块的配置,导致安装后访问资源拒绝,可以通过查看log日志发现。
后来在k8s的addon目录下,找到了npd.yaml文件,完美的运行了。
文件地址:https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/node-problem-detector
yaml文件如下:
apiVersion: v1
kind: ServiceAccount
metadata:
name: node-problem-detector
namespace: kube-system
labels:
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: npd-binding
labels:
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:node-problem-detector
subjects:
- kind: ServiceAccount
name: node-problem-detector
namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: npd-v0.8.1
namespace: kube-system
labels:
k8s-app: node-problem-detector
version: v0.8.0
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
spec:
selector:
matchLabels:
k8s-app: node-problem-detector
version: v0.8.1
template:
metadata:
labels:
k8s-app: node-problem-detector
version: v0.8.1
kubernetes.io/cluster-service: "true"
spec:
containers:
- name: node-problem-detector
image: registry.cn-hangzhou.aliyuncs.com/speed_containers/node-problem-detector:v0.8.1
command:
- "/bin/sh"
- "-c"
- "exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/systemd-monitor.json --config.custom-plugin-monitor=/config/kernel-monitor-counter.json,/config/systemd-monitor-counter.json --config.system-stats-monitor=/config/system-stats-monitor.json >>/var/log/node-problem-detector.log 2>&1"
securityContext:
privileged: true
resources:
limits:
cpu: "200m"
memory: "100Mi"
requests:
cpu: "20m"
memory: "20Mi"
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: log
mountPath: /var/log
- name: localtime
mountPath: /etc/localtime
readOnly: true
volumes:
- name: log
hostPath:
path: /var/log/
- name: localtime
hostPath:
path: /etc/localtime
type: "FileOrCreate"
serviceAccountName: node-problem-detector
tolerations:
- operator: "Exists"
effect: "NoExecute"
- key: "CriticalAddonsOnly"
operator: "Exists"
这里默认的镜像是谷歌官方仓库gcr.io的库,因为外网问题,这里我上传了一份到阿里云的仓库,公开可直接使用的。
可能有点小伙伴会疑惑,我安装完成之后如何查看效果呢,参考链接:https://stackoverflow.com/questions/48134835/how-to-use-k8s-node-problem-detector
里面有相关的介绍,其实 node-problem-detector 是以Event事件的形式,将信息传递给了集群,我肯可以通过 kubectl describe nodes <node-name> -n kube-system
来查看,详细的过程这里简单的参考上面的stackverflow
安装之前
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
安装之后
Bash# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.values.yaml
Bash# kubectl rollout status daemonset npd-node-problem-detector #(wait for up)
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
DockerDaemon False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 DockerDaemonHealthy Docker daemon is healthy
EBSHealth False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 NoVolumeErrors Volumes are attaching successfully
KernelDeadlock False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 FilesystemIsNotReadOnly Filesystem is not read-only
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
可以很明显的看出来,多了好多检测信息。