Doc
- RKE Kubernetes Installation
- Example Cluster.ymls
- Replacing system images: https://rancher.com/docs/rke/latest/en/config-options/system-images/
- Cloud-native PaaS product release & deployment plan
- Kubernetes components (cluster architecture)
FAQ
1. Node management: adding/removing nodes
Adding and Removing Nodes
Node management | Rancher Docs
Edit cluster.yml to add extra nodes or remove node entries, then run rke up --update-only; this only adds or removes worker nodes and ignores everything in cluster.yml other than the worker nodes.
Note: if the host has already gone offline, run kubectl delete node first, then edit cluster.yml, and finally run rke up --update-only.
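A minimal sketch of dropping a worker that is already offline (the node name is an example):
kubectl delete node worker-3
# remove the matching nodes[] entry from cluster.yml, then reconcile workers only
rke up --update-only --config cluster.yml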
2. rke remove
- Removes the Kubernetes components from every node listed in cluster.yml
- Removes the cluster's etcd snapshots
- Cleans each host of the directories left behind by the services:
  - /etc/kubernetes/ssl
  - /var/lib/etcd
  - /etc/cni
  - /opt/cni
  - /var/run/calico
3. Certificate management
By default a Kubernetes cluster requires certificates. RKE automatically generates certificates for all cluster components, valid for 10 years. After deploying the cluster, manage these auto-generated certificates as described in "Managing auto-generated certificates"; custom certificates can also be used.
| Task | Command |
| --- | --- |
| Rotate all certificates | rke cert rotate [--config cluster.yml] |
| Rotate the CA certificate and all service certificates | rke cert rotate --rotate-ca |
| Rotate a single service certificate | rke cert rotate --service etcd |
| Check certificate validity dates | openssl x509 -in /etc/kubernetes/ssl/kube-apiserver.pem -noout -dates |
After rotating all certificates the kubeconfig changes; replace the old one: cp kube_config_cluster.yml $HOME/.kube/config
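To check expiry across everything RKE generated, a small loop over the default cert directory works (stderr suppressed because some .pem files there are keys, not certificates):
for cert in /etc/kubernetes/ssl/*.pem; do
  echo "$cert: $(openssl x509 -in "$cert" -noout -enddate 2>/dev/null)"
done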
4. Configuration file management
Copy these files and store them in a safe location:
- cluster.yml: the RKE cluster configuration file
- kube_config_cluster.yml: the kubeconfig for the cluster; contains credentials with full access to the cluster
- cluster.rkestate: the Kubernetes cluster state file; contains credentials with full access to the cluster
5. Ports that must be open for RKE to install a k8s cluster
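The full port matrix is in the Rancher port-requirements doc; below is a quick TCP reachability spot-check between nodes (target IP and port list are examples; VXLAN 8472 is UDP and would need nc -u):
for port in 22 2379 2380 6443 10250; do
  nc -z -w2 192.168.0.2 "$port" && echo "tcp/$port open" || echo "tcp/$port blocked"
done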
6. Kubernetes tuning | Rancher Docs
1. Node OS tuning
2. Docker tuning
3. [etcd tuning](https://docs.rancher.cn/docs/rancher2/best-practices/optimize/etcd/_index)
4. Kubernetes tuning
7. Restoring a cluster | Rancher Docs
# Take a named snapshot; it is written on each etcd node under /opt/rke/etcd-snapshots/
rke etcd snapshot-save --config cluster.yml --name mysnapshotb
ls -alh /opt/rke/etcd-snapshots/
ansible k8s_cluster -m shell -a 'ls -alh /opt/rke/etcd-snapshots/'
sudo ls -alh /opt/rke/etcd-snapshots/
# Restore from a named snapshot (-d turns on debug output)
rke -d etcd snapshot-restore --config cluster.yml --name 2022-07-10T17:39:05+08:00_etcd
# Embed the cluster version in the snapshot name
cluster_version=$(kubectl version --short | tail -1 | awk -F: '{print $2}' | awk '$1=$1')
rke etcd snapshot-save --name backup-${cluster_version} --config cluster.yml
ls -alh /opt/rke/etcd-snapshots/backup-${cluster_version}.zip
rke etcd snapshot-restore --name backup-${cluster_version} --config cluster.yml
8. Changing the network mode to ipip
canal_flannel_backend_type: ipip
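In cluster.yml this option sits under the network stanza; a minimal fragment (assuming the canal plugin, whose default backend is vxlan):
network:
  plugin: canal
  options:
    canal_flannel_backend_type: ipip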
9. Recovering a deleted cluster.rkestate
- rancher rke up errors on etcd host health checks remote error: tls: bad certificate
- Recovering the rkestate file | Rancher Docs
# Regenerate the state file / kubeconfig from the running cluster
rke util get-state-file
rke util get-kubeconfig
Q&A
1. failed to allocate for range 0: no IP addresses available in range set: 10.42.0.1-10.42.0.254
https://github.com/kubernetes/kubernetes/issues/57280
failed to allocate for range 0: no IP addresses available in range set
Fix: delete the leftover IP allocation files:
cd /var/lib/cni/networks/k8s-pod-network/
sudo rm 10.*
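Before deleting wholesale, a sketch that lists only the stale allocations (host-local keeps one file per IP whose first line is the owning container ID; the k8s-pod-network path matches the CNI config shown in item 14 below):
cd /var/lib/cni/networks/k8s-pod-network/
for ip in 10.42.*; do
  cid=$(head -1 "$ip")
  docker inspect "$cid" >/dev/null 2>&1 || echo "stale: $ip ($cid)"
done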
This isn't a problem with Kubernetes, rather one with Flannel and the host-local IPAM plugin.
docker ps | grep flannel
docker exec -it ID bash
# /opt/bin/flanneld --version
v0.15.1
failed to allocate for range 0: no IP addresses available in range set · Issue #383 · cloudnativelab
This is a CNI bug
Pods stuck in ContainerCreating - Failed to run CNI IPAM ADD: failed to allocate for range 0 - Red H
kubenet IP leak
Scenario 2: no IP leak; the node had the lowest resource usage, so nearly 300 Pods were scheduled onto it and exhausted the IP range.
Short-term fix: kubectl drain $Node to evict the Pods (drain leaves the node cordoned so nothing new lands on it; uncordon it once the pressure is resolved).
Long-term fix: add resource requests to workload manifests so capacity is claimed up front.
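A hedged example of adding requests (deployment name and values are placeholders):
kubectl set resources deployment/myapp --requests=cpu=100m,memory=128Mi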
2. Failed to create Docker container [etcd-fix-perm] on host
rke aborts with this error while deploying the k8s cluster.
Fix: clean up the half-created container, then rerun rke up; the deployment then completes normally.
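A sketch of the cleanup (the container name comes from the error message):
docker rm -f etcd-fix-perm
rke up --config cluster.yml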
Root cause: rke calls Docker's ContainerCreate API and the container does get created, but the call returns a non-nil err ("Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"). All three retries then fail ("Error response from daemon: Conflict. The container name "etcd-fix-perm" is already in use by container ...") and rke aborts. The first error ("Cannot connect to the Docker daemon") is most likely transient (e.g. a network blip); otherwise the retries could never have received the "container already in use" response from Docker.
Issue: The container name "/etcd-fix-perm" is already in use by container · Issue #2632 · rancher/rke
pull request: https://github.com/rancher/rke/pull/2633/files
Code: https://github.com/rancher/rke/blob/665e0fd8065d7f332dd2aa424b95ce15cdc711c1/docker/docker.go#L442
3. modprobe: FATAL: Module nf_conntrack_ipv4 not found
iptables flavor: iptables-legacy
The 4.19 stable kernel renamed nf_conntrack_ipv4 to nf_conntrack; can the current kube-proxy still enable ipvs on a 4.19 kernel?
https://github.com/coreos/bugs/issues/2518
https://github.com/kubernetes-sigs/kubespray/issues/6934
$ sudo modprobe -- nf_conntrack_ipv4
modprobe: FATAL: Module nf_conntrack_ipv4 not found in directory /lib/modules/4.18.0-240.el8.x86_64
$ uname -a
Linux centos8-1 4.18.0-240.el8.x86_64 #1 SMP Fri Sep 25 19:48:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ sudo modprobe -- nf_conntrack
https://github.com/kubernetes-sigs/kubespray/pull/7014/files
https://github.com/kubernetes-sigs/kubespray/blob/master/roles/kubernetes/node/tasks/main.yml
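For ipvs mode the usual prerequisite is loading the ip_vs modules together with the renamed conntrack module; a sketch for 4.18+/4.19+ kernels:
for m in ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack; do
  sudo modprobe -- "$m"
done
lsmod | grep -e ip_vs -e nf_conntrack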
4. FATA[0000] Failed to validate cluster: v1.13.12-rancher1-2 is an unsupported Kubernetes version
Running rke up --config cluster.yml fails with:
INFO[0000] Initiating Kubernetes cluster
FATA[0000] Failed to validate cluster: v1.13.12-rancher1-2 is an unsupported Kubernetes version and system images are not populated: etcd image is not populated
Fix: change the version string to v1.13.12-rancher1-1.
The version listed in the v0.2.11 release notes is wrong; the Kubernetes versions in the GitHub source tree are authoritative (v1.13.12-rancher1).
Rancher-System Images
kontainer-driver-metadata/k8s_rke_system_images.go at master · rancher/kontainer-driver-metadata
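To see which Kubernetes versions a given rke binary supports:
rke config --list-version --all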
5. Recovering a mistakenly deleted node
# On the affected node, remove the rancher containers and the kubernetes state,
# then rerun rke up to re-provision it
docker ps | grep rancher | awk '{print $1}' | xargs docker rm -f
sudo rm -rf /etc/kubernetes/
rke up --config cluster.yml
RKE cluster recovery: https://www.bookstack.cn/read/rancher-v2.x/652c4de8ad3dde67.md
6. Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s)
WARN[0000] Failed to set up SSH tunneling for host [10.1.2.3]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access the Docker socket (/var/run/docker.sock). Please check if the configured user can execute `docker ps` on the node, and if the SSH server version is at least version 6.7 or higher. If you are using RedHat/CentOS, you can't use the user `root`. Please refer to the documentation for more instructions. Error: ssh: rejected: administratively prohibited (open failed)
https://github.com/rancher/rke/issues/1417
https://github.com/rancher/rke/issues/93
Cause: AllowTcpForwarding is disabled (set to no) in /etc/ssh/sshd_config.
$ sudo vim /etc/ssh/sshd_config
AllowTcpForwarding yes
$ sudo systemctl restart sshd
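Verify the effective value after restarting:
sudo sshd -T | grep -i allowtcpforwarding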
SSH Connectivity Errors
Requirements
WARN[0000] Failed to set up SSH tunneling for host [10.224.202.61]: Can't retrieve Docker Info: error during connect. Please check if you are able to SSH to the node using the specified SSH Private Key and if you have configured the correct SSH username. Error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain.
Passwordless SSH login as the account rke deploys with fails (find the account via cat cluster.yml | grep 'user: ').
Cause: /etc/ssh/sshd_config has been customized to enable an allow-list for logins.
Solution: append docker to the AllowGroups directive and restart sshd.
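A sketch of that fix (the docker group matches the deploy user above):
sudo sed -i 's/^AllowGroups .*/& docker/' /etc/ssh/sshd_config
sudo sshd -t && sudo systemctl restart sshd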
ref: Blocking users or IPs from logging in over SSH - mvpbang - cnblogs
7. ssh: handshake failed: ssh: no common algorithm for client to server cipher; client offered[aes128-gcm@openssh.com chacha20-poly1305@openssh.com aes128-ctr aes192-ctr aes256-ctr], server offered: [3des-cbc]
sudo cat /etc/ssh/sshd_config | grep Ciphers
sudo sshd -T | grep ciphers
Cause: sshd explicitly sets Ciphers 3des-cbc, restricting the ciphers the server will offer.
Fix 1: comment out the Ciphers directive, or add ciphers the client supports.
Fix 2: patch the rke source to add the server's cipher to the client list and rebuild rke.
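Fix 1 as commands (the cipher list is an example combining the server's 3des-cbc with entries from the client's offer above):
sudo sed -i 's/^Ciphers .*/Ciphers 3des-cbc,aes128-ctr,aes256-ctr/' /etc/ssh/sshd_config
sudo sshd -t && sudo systemctl restart sshd
sudo sshd -T | grep ciphers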
8. ssh: unable to authenticate, attempted methods [none publickey]
Run ssh docker@IP to check passwordless login for the deploy account.
Cause 1: the SSH account on the target node is missing the ~/.ssh/authorized_keys file and the public key.
Fix: create the .ssh directory (mode 700) and the authorized_keys file (mode 600), then append the public key to .ssh/authorized_keys.
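A sketch, run as the deploy user on the target node (the key filename is an example):
mkdir -p ~/.ssh && chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
cat id_rsa.pub >> ~/.ssh/authorized_keys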
Cause 2: the SSH account has no password set, the system enforces password complexity, or the password has expired.
Fix: reset the account's password.
Unable to access node with address [192.168.1.2:22] using SSH. Please check if you are able to SSH to the node using the specified SSH Private Key and if you have configured the correct SSH username. Error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain.
Symptom: password login over ssh from a terminal works, but rke's connection via Go's native ssh package fails.
Cause: OpenSSH 8.8 disables the ssh-rsa algorithm by default.
Fix: adjust the sshd configuration to re-enable ssh-rsa, or switch to an ecdsa key.
ssh-keygen -t ecdsa
sshd -T | egrep "pubkeyauthentication|pubkeyacceptedkeytypes"
egrep -i 'pubkey*' /etc/ssh/sshd_config
ssh -Q key
# Make sure sshd accepts rsa-type keys:
# add the following line to /etc/ssh/sshd_config on every node
PubkeyAcceptedAlgorithms=+ssh-rsa
# then restart sshd
systemctl restart sshd
ref:
- Issue with rke up private keys
- https://unix.stackexchange.com/questions/674582/how-to-enable-ssh-rsa-in-sshd-of-openssh-8-8
- OpenSSH 8.8 was released on 2021-09-26. It is available from the mirrors listed at https://www.opens
Client-side alternative (~/.ssh/config):
Host old-host
    HostkeyAlgorithms +ssh-rsa
    PubkeyAcceptedAlgorithms +ssh-rsa
9. Uninstalling k8s
./rke remove
# rke remove leaves some state behind; clear the rest of /etc/kubernetes too
ansible nodes -m shell -a "sudo rm -rf /etc/kubernetes"
10. FATA[0002] Unsupported Docker version found [20.10.12] on host
FATA[0002] Unsupported Docker version found [20.10.12] on host [192.168.1.91], supported versions are [1.13.x 17.03.x 17.06.x 17.09.x 18.06.x 18.09.x 19.03.x]
rke constrains the Docker version on the hosts when deploying k8s.
yum list docker-ce --showduplicates
Fix: install a supported Docker version, or set the cluster.yml option ignore_docker_version.
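Either route as a sketch (the pinned version is an example taken from the supported list):
sudo yum install -y docker-ce-19.03.15
# or, in cluster.yml, skip the check entirely:
# ignore_docker_version: true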
11. rke failed to check etcd health: failed to get /health for host remote error: tls: bad certificate
WARN[0142] [etcd] host [192.168.1.171] failed to check etcd health: failed to get /health for host [192.168.1.171]: Get "https://192.168.1.171:2379/health": remote error: tls: bad certificate
Cause: cluster.rkestate is missing or mismatched, so the cluster.rkestate regenerated by rke up no longer matches the running cluster.
Fix: restore the correct cluster.rkestate file, or run rke cert rotate --config cluster.yml to rotate the certificates before retrying.
Ref:
How to solve Kubernetes upgrade in Rancher 2 failing with remote error: tls: bad certificate
etcd health check fails with remote error: tls: bad certificate while rke removes a node - Rancher forum
https://github.com/rancher/rke/issues/1485
https://github.com/rancher/rke/issues/1244
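To confirm the mismatch by hand, present the local etcd client certificate to the etcd port (paths follow RKE's usual naming and the IP comes from the warning above; adjust both to the node):
openssl s_client -connect 192.168.1.171:2379 \
  -cert /etc/kubernetes/ssl/kube-etcd-192-168-1-171.pem \
  -key /etc/kubernetes/ssl/kube-etcd-192-168-1-171-key.pem \
  -CAfile /etc/kubernetes/ssl/kube-ca.pem </dev/null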
12. is not able to connect to the following ports. Please check network policies and firewall rules
FATA[0042] [[network] Host [192.168.0.1] is not able to connect to the following ports: [192.168.0.2:2379, 192.168.0.2:2380, 192.168.0.3:2379, 192.168.0.3:2380]. Please check network policies and firewall rules]
Cause: the firewall on the hosts was not disabled.
# centos
systemctl status firewalld
# disable the firewall
sudo systemctl stop firewalld
sudo systemctl disable firewalld
After disabling the firewall, rke up fails with:
FATA[0001] [Failed to start [rke-etcd-port-listener] container on host [192.168.0.2]: Error response from daemon: driver failed programming external connectivity on endpoint rke-etcd-port-listener (b9137168bb048f21110cd4536a5e15f91c9660b040fbd5cf28f7058aa19d567f): (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 2380 -j DNAT --to-destination 172.17.0.5:1337 ! -i docker0: iptables: No chain/target/match by that name
Flushing the firewall rules:
sudo iptables -F
sudo iptables -t filter -F
The same error persists.
Solution: restart docker so it rebuilds its iptables chains and networks.
ansible docker -m shell -a "systemctl restart docker" -b
ref: Docker container "iptables failed: iptables --wait -t nat -A DOCKER ..."
13. kubelet not ready
F0629 21:54:48.477074 1771061 kubelet.go:1316] Failed to start ContainerManager invalid Node Allocatable configuration. Resource "ephemeral-storage" has an allocatable of {{57464474049 0} {<nil>} BinarySI}, capacity of {{-30428064193 0} {<nil>} BinarySI}
Symptom: kubelet detects insufficient disk space at startup, even though the root disk (ext3 filesystem) actually has plenty of free space.
Fix 1: relocate /var/lib/kubelet to a data disk with enough room, e.g. /data00/kubelet, via a symlink.
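A sketch of fix 1 (stop kubelet first; /data00/kubelet is the example target from above):
sudo mv /var/lib/kubelet /data00/kubelet
sudo ln -s /data00/kubelet /var/lib/kubelet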
Ref: Reserve Compute Resources for System Daemons
Fix 2: if the root disk genuinely is short on space, shrink the kubelet reservation via system-reserved: cpu=1,memory=1Gi,ephemeral-storage=1Gi.
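In an RKE cluster.yml that reservation would sit under the kubelet service, roughly as follows (values taken from the text above):
services:
  kubelet:
    extra_args:
      system-reserved: "cpu=1,memory=1Gi,ephemeral-storage=1Gi"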
14. Unable to update cni config: no networks found in /etc/cni/net.d
The CNI plugin failed to initialize: the plugin binaries under /opt/cni/bin and the config files under /etc/cni/net.d were never generated. Fix: manually copy the binaries and config from a healthy environment into place.
$ ls /opt/cni/bin/
bandwidth calico calico-ipam flannel host-local loopback portmap tuning
$ ls /etc/cni/net.d/
10-canal.conflist calico-kubeconfig
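A sketch of the manual copy (healthy-node is a placeholder for a working host):
scp -r healthy-node:/opt/cni/bin /opt/cni/
scp -r healthy-node:/etc/cni/net.d /etc/cni/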
$ kubectl -n kube-system logs canal-8d8m6 -c install-cni
ls: cannot access '/calico-secrets': No such file or directory
Wrote Calico CNI binaries to /host/opt/cni/bin
CNI plugin version: v3.13.4
/host/secondary-bin-dir is non-writeable, skipping
Using CNI config template from CNI_NETWORK_CONFIG environment variable.
CNI config: {
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "WARNING",
      "datastore_type": "kubernetes",
      "nodename": "172.31.8.210",
      "mtu": 1450,
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/kubernetes/ssl/kubecfg-kube-node.yaml"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
Created CNI config 10-canal.conflist
Done configuring CNI. Sleep=false