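1. Problem description
A Harbor instance that had been running for more than two years suddenly stopped working one day. The symptom was odd: docker push sometimes succeeded and sometimes failed.
Harbor deployment: Harbor is deployed in a k8s cluster via helm, with Alibaba Cloud OSS as the persistent storage.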
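Because the failure is intermittent, it can help to measure how often a push actually fails before digging further. The following is a minimal sketch, not something from the original investigation: count_successes is a hypothetical helper that runs any command N times and reports how many attempts succeeded.

```shell
# Run a command N times and report how many attempts succeeded.
# Hypothetical helper for illustration; not part of the original write-up.
count_successes() {
  n="$1"; shift
  ok=0
  i=0
  while [ "$i" -lt "$n" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    fi
    i=$((i + 1))
  done
  echo "$ok/$n succeeded"
}

# Possible usage (image name is a placeholder; timeout stops a hung push):
#   count_successes 10 timeout 300 docker push <registry>/<project>/<image>:<tag>
```

Wrapping the push in timeout matters in a case like this one, since a hung push would otherwise block the loop forever.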
2. How OSS is used
PV information
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: 2020-09-27T07:17:08Z
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    alicloud-pvname: prod-harbor-image-02
  name: prod-harbor-image-02
  resourceVersion: "8397165"
  selfLink: /api/v1/persistentvolumes/prod-harbor-image-02
  uid: 73ef98aa-0091-11eb-9587-00163e010494
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 500Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: prod-harbor-image-pvc-02
    namespace: harbor
    resourceVersion: "8397162"
    uid: 846e489c-0091-11eb-9587-00163e010494
  flexVolume:
    driver: alicloud/oss
    options:
      akId: XXXX
      akSecret: XXXXXX
      bucket: prod-harbor-image-02
      otherOpts: -o max_stat_cache_size=0 -o allow_other
      url: XXXX-a.zbops.ciasyun.local
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oss
status:
  phase: Bound
PVC information
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: 2020-09-27T07:17:35Z
  finalizers:
  - kubernetes.io/pvc-protection
  name: prod-harbor-image-pvc-02
  namespace: harbor
  resourceVersion: "8397167"
  selfLink: /api/v1/namespaces/harbor/persistentvolumeclaims/prod-harbor-image-pvc-02
  uid: 846e489c-0091-11eb-9587-00163e010494
spec:
  accessModes:
  - ReadWriteMany
  dataSource: null
  resources:
    requests:
      storage: 500Gi
  selector:
    matchLabels:
      alicloud-pvname: prod-harbor-image-02
  storageClassName: oss
  volumeName: prod-harbor-image-02
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 500Gi
  phase: Bound
Service mount information
volumeMounts:
- mountPath: /storage
  name: registry-data
- mountPath: /etc/registry/root.crt
  name: registry-root-certificate
  subPath: tls.crt
- mountPath: /etc/registry/passwd
  name: registry-htpasswd
  subPath: passwd
- mountPath: /etc/registry/config.yml
  name: registry-config
  subPath: config.yml
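3. Troubleshooting process
3.1 Push from a node
The image push hung and made no progress; screenshot below:
3.2 Check the Harbor logs:
3.3 Check the OSS monitoring
3.4 Check the access records on the OSS backend
The behaviour was strange: every request recorded on the backend was a HEAD request, with no HTTP GET or PUT requests at all.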
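The "only HEAD requests" observation can be checked mechanically if the access records are exported to a file. This is a sketch under an assumption the original does not state: that each log line contains a quoted request string such as "HEAD /path HTTP/1.1". count_methods is a hypothetical helper name.

```shell
# Count requests per HTTP method in an access log file.
# Assumption: each line contains a quoted request like "HEAD /path HTTP/1.1".
count_methods() {
  grep -oE '"(GET|PUT|POST|HEAD|DELETE) ' "$1" \
    | tr -d '" ' \
    | sort | uniq -c | sort -rn \
    | awk '{print $2, $1}'   # print as "METHOD count", most frequent first
}

# Possible usage:
#   count_methods oss-access.log
```

If the output lists only HEAD, the log agrees with what was seen on the backend: the client is probing objects but never reading or writing them.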
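3.5 Alibaba Cloud support staff investigated for a long time, but their final conclusion was somewhat laughable: because Harbor is not their product, they are not responsible for it, and they would only say that the incoming requests were problematic.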
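4. Conclusion
After spending time investigating on my own, and being led astray for quite a while, I happened to look at the PVC and PV information and had a sudden idea: both the PVC and the PV carry a storage limit of 500G. Could the pushes be failing because the storage was full? Inside the Harbor container the OSS mount reports 260T of space, but what the container shows is not necessarily accurate, nor necessarily what is actually usable. So the first step was to expand the PV and PVC. Because this was written up after the fact, only the operations are recorded and there are no proper screenshots.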
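One way to test the "storage is full" hypothesis before resizing is to compare the capacity declared on the PV with the amount of data actually stored in the bucket. The sketch below is illustrative only: quantity_to_bytes and is_full are hypothetical helpers, only the Gi and Ti suffixes are handled, and the used-bytes figure is assumed to come from somewhere trustworthy such as the OSS console, since the number seen inside the container may not be reliable.

```shell
# Convert a Kubernetes binary quantity (Gi or Ti only) to bytes.
quantity_to_bytes() {
  case "$1" in
    *Ti) echo $(( ${1%Ti} * 1024 * 1024 * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the used bytes meet or exceed the declared capacity.
is_full() {
  used_bytes="$1"
  capacity="$2"
  [ "$used_bytes" -ge "$(quantity_to_bytes "$capacity")" ]
}

# Possible usage against the cluster (not run here):
#   capacity=$(kubectl get pv prod-harbor-image-02 -o jsonpath='{.spec.capacity.storage}')
#   used=<bytes currently stored in the bucket>
#   is_full "$used" "$capacity" && echo "bucket usage has reached the PV capacity"
```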
Stop the Harbor service
kubectl scale --replicas=0 deployment/harbor-registry
Delete the PVC (back it up to a file first)
kubectl get pvc -n harbor prod-harbor-image-pvc-02 -o yaml >harbor-reg.yaml
kubectl get pvc -n harbor
kubectl delete pvc -n harbor prod-harbor-image-pvc-02
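Edit the harbor-reg.yaml file, and remove from the PV the information that ties it to the PVC, namely the content under claimRef, as shown in the figure
Also increase storage from the previous 500G to 1500G.
Recreate the PV, after which Harbor resumed running. Follow-up testing was simple: docker push worked normally.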
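The claimRef cleanup described above can also be done without editing YAML by hand, using a merge patch that sets claimRef to null. This is an alternative sketch, not the procedure the author actually ran; release_pv is a hypothetical helper, and the KUBECTL variable exists only so the command can be previewed (for example with KUBECTL=echo) before touching the cluster.

```shell
# Allow the kubectl binary to be overridden, e.g. KUBECTL=echo for a dry preview.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: clear the stale claimRef on a Retain-policy PV
# so that a newly created PVC can bind to it again.
release_pv() {
  pv="$1"
  # In a JSON merge patch, a null value removes the field.
  $KUBECTL patch pv "$pv" --type merge -p '{"spec":{"claimRef":null}}'
}

# Possible usage:
#   release_pv prod-harbor-image-02
#   kubectl get pv prod-harbor-image-02   # STATUS should change from Released to Available
```

Once the PV is available again, the PVC can be recreated from the edited harbor-reg.yaml and the registry scaled back up.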