Case 1
[Problem]
During TiDB cluster installation, the following error is reported:
the maximum number of open file descriptors is too small, got 65536, expect greater or equal to 82920
or
fatal: [10.8.xxx.205]: FAILED! => {"changed": false, "msg": "The default maximum number of open file descriptors is too low 4096, should be 1000000"}
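[Root cause]
The operating system's limit on the maximum number of open files is too low and needs to be raised.
[Solution]
1. Check the current soft and hard limits: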
ulimit -Sn
ulimit -Hn
Change the limits:
ulimit -HSn 1000000
2. If the test environment runs on Docker, try changing the following parameters in the Docker unit file
/usr/lib/systemd/system/docker.service:
LimitNOFILE=infinity
LimitNPROC=infinity
to:
LimitNOFILE=1000000
LimitNPROC=1000000
Restart Docker, then deploy TiDB.
3. If the new values do not take effect in the current session, open a new session after the change and try again.
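The check in step 1 can be scripted. The following is a minimal sketch, not part of tidb-ansible: nofile_ok is a hypothetical helper name, and the threshold of 1000000 is taken from the second error message above.

```shell
# Hypothetical helper: succeed only if a limit meets the required minimum.
# A limit reported as "unlimited" is treated as sufficient.
nofile_ok() {
    current=$1
    required=$2
    [ "$current" = "unlimited" ] && return 0
    [ "$current" -ge "$required" ]
}

required=1000000          # assumed threshold, taken from the error message
soft=$(ulimit -Sn)        # soft limit of the current shell
hard=$(ulimit -Hn)        # hard limit of the current shell

if nofile_ok "$soft" "$required" && nofile_ok "$hard" "$required"; then
    echo "open file limit OK (soft=$soft, hard=$hard)"
else
    echo "open file limit too low (soft=$soft, hard=$hard, need $required)"
fi
```

Note that ulimit -HSn only changes the limits of the current shell and the processes started from it, which is why step 3 suggests retrying in a new session when the values look stale.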
[Reference cases]
Docker-compose reports "the maximum number of open file descriptors is too small": what is the cause?
https://asktug.com/t/docker-compose-the-maximum-number-of-open-file-descriptors-is-too-small/183
Deploying with tidb-operator: pd deploys successfully, but tikv stays in the CrashLoopBackOff state
https://asktug.com/t/tidb-operator-pd-tikv-crashloopbackoff/644
https://asktug.com/t/topic/2498
[Further reading]
CentOS 6 and CentOS 7 set these limits in different places. See the following article:
Setting ulimit resource limits for systemd services on CentOS/RHEL 7
http://smilejay.com/2016/06/centos-7-systemd-conf-limits/
Case 2
[Problem]
During TiDB cluster installation, the following error is reported:
Ansible FAILED! => playbook: start.yml; TASK: set_fact; message: {“msg”: “The conditional check ‘((existing_api_keys[‘json’] | selectattr(“name”, “equalto”, “grafana_apikey”)) | list) | length == 1’ failed. The error was: no test named ‘equalto’ The error appears to be in ‘/home/tidb/tidb-ansible/common_tasks/create_grafana_api_keys.yml’: line 24, column 3, but may be elsewhere in the file depending on the exact syntax problem. The offending line appears to be: - set_fact: ^ here ”}
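[Root cause]
The installed Jinja version is too old and does not provide the equalto test named in the error.
[Solution]
Upgrade the installed packages to the versions required by requirements.txt in the tidb-ansible directory: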
sudo pip install --upgrade -r requirements.txt
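To confirm the diagnosis before or after the upgrade, it can help to ask the Python interpreter directly whether its Jinja2 knows the equalto test. The sketch below is an illustration only: check_jinja_equalto is a hypothetical helper name, and the interpreter name python is an assumption that should be replaced with the interpreter Ansible actually runs under.

```shell
# Hypothetical helper: print the installed Jinja2 version and return 0 only
# if the "equalto" test is registered in a default Jinja2 environment.
# $1 is the Python interpreter to query (assumed default: python).
check_jinja_equalto() {
    py=${1:-python}
    "$py" -c '
import sys
import jinja2
print("Jinja2 version: %s" % jinja2.__version__)
sys.exit(0 if "equalto" in jinja2.Environment().tests else 1)
'
}

if check_jinja_equalto python; then
    echo "equalto test is available"
else
    echo "equalto test is missing, or Jinja2/python could not be loaded"
fi
```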
[Reference cases]
Question: running ansible-playbook start.yml reports an error, looking for help
https://asktug.com/t/ansible-playbook-start-yml/170
[Further reading]
About Jinja
https://jinja.palletsprojects.com/en/2.11.x/
Case 3
[Problem]
During TiDB cluster installation, the following error is reported:
Ansible FAILED! => playbook: bootstrap.yml; TASK: check_system_static : Preflight check - Check if the operating system supports EPOLLEXCLUSIVE; message: {“changed”: true, “cmd”: “/home/tidb/deploy/epollexclusive”, “delta”: “0:00:00.007020”, “end”: “2019-10-16 10:29:15.858494”, “msg”: “non-zero return code”, “rc”: 1, “start”: “2019-10-16 10:29:15.851474”, “stderr”: “”, “stderr_lines”: [], “stdout”: “epoll_ctl with EPOLLEXCLUSIVE | EPOLLONESHOT succeeded. This is evidence of no EPOLLEXCLUSIVE support. Not using epollex polling engine.False: epollexclusive is not available”, “stdout_lines”: [“epoll_ctl with EPOLLEXCLUSIVE | EPOLLONESHOT succeeded. This is evidence of no EPOLLEXCLUSIVE support. Not using epollex polling engine.False: epollexclusive is not available”]}
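[Root cause]
The operating system version is too old and does not support EPOLLEXCLUSIVE.
[Solution]
Upgrade the operating system. See the recommended installation configuration: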
https://pingcap.com/docs-cn/stable/how-to/deploy/hardware-recommendations/
[Reference cases]
https://asktug.com/t/tidb-epollexclusive/1921
[Further reading]
A performance problem caused by the epoll thundering herd
https://www.ichenfu.com/2017/05/03/proxy-epoll-thundering-herd/
epoll: add EPOLLEXCLUSIVE flag
https://github.com/torvalds/linux/commit/df0108c5da561c66c333bb46bfe3c1fc65905898
Case 4
[Problem]
During TiDB cluster installation, the following error is reported:
fatal: [10.xxx.14.2]: FAILED! => {“changed”: false, “msg”: “Make sure NTP service is running and ntpstat is synchronised to NTP server
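[Root cause]
Time synchronization on the host is managed by chronyd.service instead of ntpd, so ntpd.service is not running and the ntpstat check fails.
[Solution]
Start the ntpd.service service.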
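After starting ntpd, the synchronization state can be checked with ntpstat, whose exit code distinguishes the cases. The sketch below is a minimal example: describe_ntpstat_rc is a hypothetical helper name, and the systemctl commands in the comments assume a systemd-based host.

```shell
# Assumed commands on a systemd host (shown as comments, not executed here):
#   sudo systemctl stop chronyd.service
#   sudo systemctl start ntpd.service

# Hypothetical helper: translate an ntpstat exit code into a short message.
describe_ntpstat_rc() {
    case "$1" in
        0) echo "synchronised" ;;
        1) echo "not synchronised" ;;
        2) echo "state unknown (ntpd may not be reachable)" ;;
        *) echo "unexpected exit code $1" ;;
    esac
}

if command -v ntpstat >/dev/null 2>&1; then
    rc=0
    ntpstat >/dev/null 2>&1 || rc=$?
    describe_ntpstat_rc "$rc"
else
    echo "ntpstat is not installed"
fi
```

Synchronization can take several minutes after ntpd starts, so a "not synchronised" result right after a restart is not necessarily a failure.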
[Reference cases]
https://asktug.com/t/topic/1747
[Further reading]
How to check whether the NTP service is working properly
Case 5
[Problem]
During TiDB cluster installation, the following error is reported:
ntpdate[5507]: no server suitable for synchronization found
[Root cause]
1. UDP port 123 is blocked.
2. Check the ntp version. With ntp 4.2 or later, using notrust in a restrict definition causes this error.
[Solution]
1. Open UDP port 123.
2. Remove notrust from the restrict definitions.
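The second cause can be located by listing the active restrict lines that carry the notrust flag. This is a minimal sketch: find_notrust is a hypothetical helper name, and /etc/ntp.conf is assumed to be the configuration file in use.

```shell
# Hypothetical helper: print (with line numbers) every uncommented
# "restrict" line that contains the notrust flag.
# $1 is the ntp configuration file (assumed default: /etc/ntp.conf).
find_notrust() {
    conf=${1:-/etc/ntp.conf}
    grep -nE '^[[:space:]]*restrict[[:space:]].*[[:space:]]notrust([[:space:]]|$)' "$conf"
}

if [ -r /etc/ntp.conf ]; then
    find_notrust /etc/ntp.conf || echo "no restrict line uses notrust"
fi
```

After removing notrust from the reported lines, ntpd has to be restarted for the change to take effect.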
[Reference cases]
no server suitable for synchronization found
https://asktug.com/t/topic/2165
[Further reading]
Resolving the ntp error "no server suitable for synchronization found"
https://www.jb51.net/article/108792.htm
Case 6
[Problem]
During TiDB cluster installation, the following error is reported:
TASK [check_system_dynamic : Preflight check - Does every node in cluster have different hostname] ****************************
fatal: [192.xxx.0.201]: FAILED! => changed=false
msg: |-
hostnames of all nodes in cluster: [w, w, w]
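[Root cause]
Several nodes in the cluster share the same hostname.
[Solution]
Rename the nodes so that every hostname in the cluster is unique.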
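When the cluster has many nodes, the duplicated names can be picked out of a collected list of hostnames. The sketch below is a minimal example: duplicate_hostnames is a hypothetical helper name, and the input reuses the three hostnames from the error output above.

```shell
# Hypothetical helper: read one hostname per line on stdin and print each
# name that occurs more than once (one line per duplicated name).
duplicate_hostnames() {
    sort | uniq -d
}

# Hostnames as reported in the error message above.
printf '%s\n' w w w | duplicate_hostnames

# On a systemd host, a node can then be renamed with, for example:
#   sudo hostnamectl set-hostname <new-name>
# where <new-name> is a placeholder for a name unique within the cluster.
```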
[Reference cases]
https://asktug.com/t/topic/2241
[Further reading]
Three ways to change the hostname on Linux https://blog.csdn.net/u013991521/article/details/80522269
Case 7
[Problem]
During TiDB cluster installation, the following message is reported:
fatal: [10.xxx.0.16]: FAILED! => {"changed": false, "msg": "Could not find the requested service irqbalance: host"}
...ignoring
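[Root cause]
The irqbalance service does not exist on the operating system.
[Solution]
The task has already been skipped (note the "...ignoring" in the output), so this message can be ignored.
[Reference cases]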
https://asktug.com/t/topic/2385
[Further reading]
An in-depth analysis: is irqbalance actually useful?