Symptom: the nodes look healthy, but the component status (cs) check reports the scheduler and controller-manager components as Unhealthy.
$ kubectl get nodes,cs
NAME                     STATUS   ROLES    AGE   VERSION
node/test-k8s-master00   Ready    master   25h   v1.18.6
node/test-k8s-master01   Ready    master   24h   v1.18.6
node/test-k8s-master02   Ready    master   24h   v1.18.6
node/test-k8s-node00     Ready    <none>   24h   v1.18.6
node/test-k8s-node01     Ready    <none>   24h   v1.18.6

NAME                                 STATUS      MESSAGE   ERROR
componentstatus/controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
componentstatus/scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
componentstatus/etcd-0               Healthy     {"health":"true"}
Troubleshooting:
1. A manual connection attempt confirms that the connection really is refused
$ curl -k http://127.0.0.1:10251/healthz
curl: (7) Failed to connect to 127.0.0.1 port 10251: Connection refused
2. Check the pod and container status
$ kubectl get pod -n kube-system | grep sche
kube-scheduler-apron-k8s-master00   1/1   Running   8   25h
kube-scheduler-apron-k8s-master01   1/1   Running   0   14m
kube-scheduler-apron-k8s-master02   1/1   Running   3   24h
$ docker ps | grep sched
15fbf835497b 0e0972b2b5d1 "kube-scheduler --au…" 23 minutes ago Up 23 minutes k8s_kube-scheduler_kube-scheduler-apron-k8s-master02_kube-system_0643afa2262d08f779c8829c02532d96_3
811c213657d0 k8s.gcr.io/pause:3.2 "/pause" 23 minutes ago Up 23 minutes k8s_POD_kube-scheduler-apron-k8s-master02_kube-system_0643afa2262d08f779c8829c02532d96_2
The output above shows that the pods are running and ready.
3. Check the scheduler's YAML manifest
$ cat /etc/kubernetes/manifests/kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
.......(omitted)
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15 ......
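The manifest shows that the liveness probe checks health on port 10259 over HTTPS, so the port can be checked directly inside the container. The container image does not include network tools such as netstat or ss, so the only option is to read /proc/net/tcp, which lists the IPv4 TCP sockets and their states.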
$ docker exec -it 15fbf835497b /bin/sh
# grep -E ":$(printf "%04X" 10259)\\b" /proc/net/tcp
8: 0100007F:2813 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 40368 1 0000000000000000 100 0 0 10 0
Note: printf "%04X" 10259 converts 10259 to its hexadecimal form (2813) on the fly, because /proc/net/tcp lists ports in hexadecimal.
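The matched /proc/net/tcp line is easier to read once its hexadecimal fields are decoded. The following is a minimal sketch, not part of the original procedure: decode_proc_addr and tcp_state_name are hypothetical helper names, and the byte reversal assumes a little-endian host such as x86_64.

```shell
#!/bin/sh
# decode_proc_addr HEXADDR:HEXPORT
# Hypothetical helper: turns an address field from /proc/net/tcp into
# dotted IPv4:port. Assumes a little-endian host, where the kernel prints
# the four address bytes in reverse order.
decode_proc_addr() {
    addr="${1%%:*}"   # e.g. 0100007F
    port="${1##*:}"   # e.g. 2813
    b1="$(printf '%s' "$addr" | cut -c7-8)"
    b2="$(printf '%s' "$addr" | cut -c5-6)"
    b3="$(printf '%s' "$addr" | cut -c3-4)"
    b4="$(printf '%s' "$addr" | cut -c1-2)"
    printf '%d.%d.%d.%d:%d\n' "0x$b1" "0x$b2" "0x$b3" "0x$b4" "0x$port"
}

# tcp_state_name HEXSTATE
# Hypothetical helper: names the two states most relevant here and
# prints any other state code unchanged.
tcp_state_name() {
    case "$1" in
        01) echo ESTABLISHED ;;
        0A) echo LISTEN ;;
        *)  echo "$1" ;;
    esac
}

decode_proc_addr 0100007F:2813
tcp_state_name 0A
```

Applied to the line returned by grep, the local address field 0100007F:2813 decodes to 127.0.0.1:10259 and the state field 0A to LISTEN.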
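The grep match shows that the port is open inside the container: 0100007F:2813 is 127.0.0.1:10259, and state 0A means LISTEN.
The health check issued from outside still cannot connect, so the next step is to check the related security settings in the YAML manifest.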
$ cat /etc/kubernetes/manifests/kube-scheduler.yaml
......(omitted)
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --feature-gates=TTLAfterFinished=true
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    image: k8s.gcr.io/kube-scheduler:v1.18.6
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
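What this configuration shows: the liveness probe accesses port 10259 over HTTPS, while --port=0 disables the HTTP (insecure) serving port. The plain-HTTP request seen in the kubectl get cs output, Get http://127.0.0.1:10251/healthz, therefore has nothing listening to answer it.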
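The difference between the two endpoints can be checked by hand on a control-plane node. This is a rough sketch under stated assumptions: healthz_url and check_healthz are hypothetical helper names, curl is assumed to be installed on the host, and /healthz is assumed to still be in the component's always-allowed paths (the default for --authorization-always-allow-paths), so no client credentials are sent.

```shell
#!/bin/sh
# healthz_url SCHEME PORT
# Hypothetical helper: builds the local health endpoint URL.
healthz_url() {
    printf '%s://127.0.0.1:%s/healthz\n' "$1" "$2"
}

# check_healthz SCHEME PORT
# Hypothetical helper: queries the endpoint and prints the response body,
# or "unreachable" if curl cannot connect within 3 seconds.
check_healthz() {
    url="$(healthz_url "$1" "$2")"
    # -k skips verification of the component's serving certificate.
    if body="$(curl -sk --max-time 3 "$url")"; then
        printf '%s -> %s\n' "$url" "$body"
    else
        printf '%s -> unreachable\n' "$url"
    fi
}

# Example, to be run on a control-plane node:
#   check_healthz http 10251    # insecure scheduler port, disabled by --port=0
#   check_healthz https 10259   # secure scheduler port used by the liveness probe
```

With --port=0 in place, the first call would be expected to report unreachable while the second should reach the scheduler, which matches what the liveness probe sees.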
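Solution: comment out the "- --port=0" line, or set the insecure port explicitly to its default with - --port=10251 (not 10259, which is already taken by the secure port), and the scheduler check returns to normal. The controller-manager is handled the same way, where the insecure port is 10252. Note, however, that re-enabling the insecure port is not a very safe practice in production.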
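Commenting out the flag can also be scripted. The sketch below is one possible way to do it, not the author's procedure: comment_out_port_zero is a hypothetical helper name, and it assumes the flag appears on its own line exactly as "- --port=0".

```shell
#!/bin/sh
# comment_out_port_zero FILE
# Hypothetical helper: comments out the "- --port=0" argument in a
# manifest while keeping the line's indentation.
comment_out_port_zero() {
    manifest="$1"
    tmp="$(mktemp)" || return 1
    # Write the edited copy to a temp file first, then overwrite the
    # original in place so its ownership and permissions are preserved.
    sed 's/^\([[:space:]]*\)- --port=0[[:space:]]*$/\1# - --port=0/' \
        "$manifest" > "$tmp" && cat "$tmp" > "$manifest"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example, to be run as root on each master node:
#   comment_out_port_zero /etc/kubernetes/manifests/kube-scheduler.yaml
```

Because the scheduler runs as a static pod, the kubelet should re-create it on its own once the manifest changes. If a backup copy is wanted, it is safer to keep it outside /etc/kubernetes/manifests so the kubelet does not treat it as another manifest.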