I recently ran into a problem on Kubernetes where a pod could not be deleted. The kubelet log looked like this:
10月 17 13:29:09 k8s-node kubelet[16145]: I1017 13:29:09.034868 16145 reconciler.go:186] operationExecutor.UnmountVolume started for volume "default-token-pzyxh" (UniqueName: "kubernetes.io/secret/81791176-a505-11e7-accf-5254fe5a9007-default-token-pzyxh") pod "81791176-a505-11e7-accf-5254fe5a9007" (UID: "81791176-a505-11e7-accf-5254fe5a9007")
10月 17 13:29:09 k8s-node kubelet[16145]: E1017 13:29:09.035292 16145 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/81791176-a505-11e7-accf-5254fe5a9007-default-token-pzyxh\" (\"81791176-a505-11e7-accf-5254fe5a9007\")" failed. No retries permitted until 2017-10-17 13:31:11.035205617 +0800 CST (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume "default-token-pzyxh" (UniqueName: "kubernetes.io/secret/81791176-a505-11e7-accf-5254fe5a9007-default-token-pzyxh") pod "81791176-a505-11e7-accf-5254fe5a9007" (UID: "81791176-a505-11e7-accf-5254fe5a9007") : remove /var/lib/kubelet/pods/81791176-a505-11e7-accf-5254fe5a9007/volumes/kubernetes.io~secret/default-token-pzyxh: device or resource busy
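The path that matters in such an error is the one after "remove", since that is the mount point that is still busy. As a minimal sketch, the helper below pulls that path out of log lines on stdin. The function name busy_paths is hypothetical, and the journalctl call in the comment assumes kubelet logs to the journal under the unit name "kubelet", which may differ on your system.

```shell
#!/bin/bash
# Print the path from every "remove <path>: device or resource busy" error on stdin.
# busy_paths is a hypothetical helper name, not part of kubelet or Kubernetes.
busy_paths() {
    sed -n 's/.*: remove \(.*\): device or resource busy$/\1/p' | sort -u
}

# Example (assumes the kubelet unit is called "kubelet"):
#   journalctl -u kubelet --since "1 hour ago" | busy_paths
```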
Solution:
Use a script (saved here as leak.sh) to find out which process still has the directory mounted:
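#!/bin/bash
# Usage: bash leak.sh <mount-path>
# Lists every process whose mount table still contains <mount-path>, then prints
# the mount namespaces among them that a docker-containerd-shim process also uses.
declare -A map
# grep prints "/proc/<pid>/mounts:<device> <mountpoint> ..." for each match
for i in $(grep -F -- "$1" /proc/[0-9]*/mounts 2>/dev/null | awk '{print $1"#"$2}')
do
    pid=$(echo "$i" | awk -F "/" '{print $3}')
    point=$(echo "$i" | awk -F "#" '{print $2}')
    mnt=$(readlink "/proc/$pid/ns/mnt" 2>/dev/null)   # e.g. mnt:[4026532536]
    [[ -n "$mnt" ]] || continue                       # process already exited
    map["$mnt"]="exist"
    cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")   # cmdline is NUL-separated
    echo -e "$pid\t$mnt\t$cmd\t$point"
done
for i in $(pgrep -f docker-containerd-shim)
do
    mnt=$(readlink "/proc/$i/ns/mnt" 2>/dev/null)
    if [[ -n "$mnt" && "${map[$mnt]}" == "exist" ]]; then
        echo "$mnt"
    fi
done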
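Run it with bash (the script uses bash-only features such as associative arrays), passing the path from the error message:
bash leak.sh /var/lib/kubelet/pods/81791176-a505-11e7-accf-5254fe5a9007/volumes/kubernetes.io~secret/default-token-pzyxh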
The output shows that the path is still mounted in the mount namespace of process 8392:
8392 mnt:[4026532536] /bin/bash/start.sh--logtostderr -v=2 /var/lib/kubelet/pods/81791176-a505-11e7-accf-5254fe5a9007/volumes/kubernetes.io~secret/default-token-pzyxh
Find the parent of process 8392:
[root@k8s-node ~]# ps -ef | grep 8392
root 8392 8345 0 9月30 ? 00:00:00 /bin/bash /start.sh --logtostderr -v=2
root 8420 8392 0 9月30 ? 00:17:19 /usr/bin/python /usr/bin/supervisord -c supervisord.conf
root 13757 7126 0 13:35 pts/2 00:00:00 grep --color=auto 8392
[root@k8s-node ~]#
The parent is 8345, so look up that process in the same way:
root 8345 783 0 9月30 ? 00:00:01 docker-containerd-shim 42fe1338a0ebd849630d92f46b4c343376ec7c7ffe9a3666f9725715d4f4d064 /var/run/docker/libcontainerd/42fe1338a0ebd849630d92f46b4c343376ec7c7ffe9a3666f9725715d4f4d064 docker-runc
root 8392 8345 0 9月30 ? 00:00:00 /bin/bash /start.sh --logtostderr -v=2
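The two lookups above walk up the process tree by hand. A minimal sketch of the same walk, reading the parent PID from /proc/<pid>/status instead of grepping ps output, might look like this. The function names parent_of and ancestors are hypothetical, and the code assumes a Linux /proc filesystem.

```shell
#!/bin/bash
# Print the parent PID of a process, read from /proc/<pid>/status (Linux only).
parent_of() {
    awk '/^PPid:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# Print "<pid><TAB><command line>" for a process and each of its ancestors,
# stopping once the parent PID is 0 or can no longer be read.
ancestors() {
    local pid="$1"
    while [ -n "$pid" ] && [ "$pid" -gt 0 ]; do
        # cmdline is NUL-separated, so turn the separators into spaces
        printf '%s\t%s\n' "$pid" "$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")"
        pid=$(parent_of "$pid")
    done
}

# Example: ancestors 8392
```

For the case in this post, `ancestors 8392` would be expected to list 8392, then the docker-containerd-shim process 8345, then its parents.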
Process 8345 is a docker-containerd-shim, so the mount is being held by the Docker container with ID 42fe1338a0ebd849630d92f46b4c343376ec7c7ffe9a3666f9725715d4f4d064.
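The container ID is the first argument on the docker-containerd-shim command line. As a sketch, it can be extracted and resolved to a container name as follows. The function name shim_container_id is hypothetical; `docker inspect --format` is the standard Docker CLI. With the Docker runtime, Kubernetes container names normally start with "k8s_" and include the pod name, which helps identify the offending pod.

```shell
#!/bin/bash
# Print the argument that follows "docker-containerd-shim" in a command line,
# which is the container ID. shim_container_id is a hypothetical helper name.
shim_container_id() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /docker-containerd-shim$/) { print $(i + 1); exit }
    }'
}

# Example, using the shim PID found above:
#   cid=$(shim_container_id "$(tr '\0' ' ' < /proc/8345/cmdline)")
#   docker inspect --format '{{.Name}}' "$cid"
```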
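After exec-ing into that container, df -h shows that /var/lib/kubelet is also mounted inside it. Once the mount is unmounted (umount) inside the container, the pod can be deleted.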
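A sketch of that last step as commands is given below. The function name umount_in_container and the DRY_RUN switch are hypothetical. Which mount point to pass is an assumption: use the path that df -h (or /proc/mounts) actually shows inside the container, for example the secret volume path from the error message. The umount only succeeds if the container is allowed to unmount, which typically means it runs privileged.

```shell
#!/bin/bash
# Show the mounts inside a container, then unmount one path there.
# With DRY_RUN set to a non-empty value the commands are only printed.
umount_in_container() {
    local cid="$1"
    local path="$2"
    if [ -n "$DRY_RUN" ]; then
        echo "docker exec $cid df -h"
        echo "docker exec $cid umount $path"
        return 0
    fi
    docker exec "$cid" df -h            # confirm the mount is visible in the container
    docker exec "$cid" umount "$path"   # release it so kubelet can remove the directory
}

# Example:
#   umount_in_container 42fe1338a0ebd849630d92f46b4c343376ec7c7ffe9a3666f9725715d4f4d064 \
#       /var/lib/kubelet/pods/81791176-a505-11e7-accf-5254fe5a9007/volumes/kubernetes.io~secret/default-token-pzyxh
```

If exec-ing into the container is not possible, entering the mount namespace of the process from the host with `nsenter -t 8392 -m umount <path>` is a possible alternative, although it was not tried in this case.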