1 Enable ZooKeeper's built-in metrics service
1.1 Enable the metrics service port
The New Metrics System has been available since ZooKeeper 3.6.0. It provides rich metrics to help users monitor ZooKeeper topics: znodes, network, disk, quorum, leader election, clients, security, failures, watch/session, requestProcessor, and so forth.
Prerequisites:
Enable the Prometheus MetricsProvider in zoo.cfg: metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
The port can also be configured by setting metricsProvider.httpPort (default: 7000).
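For reference, a minimal sketch of the relevant zoo.cfg lines, followed by a quick check of the endpoint (the port value is illustrative, and the server must be restarted first):

# zoo.cfg
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000

# verify the endpoint after restarting ZooKeeper
curl -s http://localhost:7000/metrics | head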
1.2 Configure Prometheus
Configure the Prometheus scraper to target the ZooKeeper cluster endpoints:
- job_name: test-zk
  static_configs:
    - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000']
Reload Prometheus:
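Two common reload paths, sketched here assuming either the lifecycle API is enabled or you have access to the process:

# requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload
# alternatively, send SIGHUP to the process
kill -HUP $(pidof prometheus)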
1.3 Alerting rules
A sample set of alerts is provided below, covering the metrics that deserve particular attention. Note: it is for reference only and should be tuned to your actual workload and resource environment.
groups:
  - name: zk-alert-example
    rules:
      - alert: ZooKeeper server is down
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} ZooKeeper server is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} ZooKeeper server is down: [{{ $value }}]."
      - alert: create too many znodes
        expr: znode_count > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} create too many znodes"
          description: "{{ $labels.instance }} of job {{ $labels.job }} create too many znodes: [{{ $value }}]."
      - alert: create too many connections
        expr: num_alive_connections > 50  # suppose we use the default maxClientCnxns: 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} create too many connections"
          description: "{{ $labels.instance }} of job {{ $labels.job }} create too many connections: [{{ $value }}]."
      - alert: znode total occupied memory is too big
        expr: approximate_data_size / 1024 / 1024 > 1 * 1024  # more than 1024 MB (1 GB)
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} znode total occupied memory is too big"
          description: "{{ $labels.instance }} of job {{ $labels.job }} znode total occupied memory is too big: [{{ $value }}] MB."
      - alert: set too many watch
        expr: watch_count > 10000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} set too many watch"
          description: "{{ $labels.instance }} of job {{ $labels.job }} set too many watch: [{{ $value }}]."
      - alert: a leader election happens
        expr: increase(election_time_count[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} a leader election happens"
          description: "{{ $labels.instance }} of job {{ $labels.job }} a leader election happens: [{{ $value }}]."
      - alert: open too many files
        expr: open_file_descriptor_count > 300
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} open too many files"
          description: "{{ $labels.instance }} of job {{ $labels.job }} open too many files: [{{ $value }}]."
      - alert: fsync time is too long
        expr: rate(fsynctime_sum[1m]) > 100
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} fsync time is too long"
          description: "{{ $labels.instance }} of job {{ $labels.job }} fsync time is too long: [{{ $value }}]."
      - alert: take snapshot time is too long
        expr: rate(snapshottime_sum[5m]) > 100
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} take snapshot time is too long"
          description: "{{ $labels.instance }} of job {{ $labels.job }} take snapshot time is too long: [{{ $value }}]."
      - alert: avg latency is too high
        expr: avg_latency > 100
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} avg latency is too high"
          description: "{{ $labels.instance }} of job {{ $labels.job }} avg latency is too high: [{{ $value }}]."
      - alert: JvmMemoryFillingUp
        expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM memory filling up (instance {{ $labels.instance }})"
          description: "JVM memory is filling up (> 80%)\n  labels: {{ $labels }}  value = {{ $value }}\n"
1.4 Discovery via a Kubernetes ServiceMonitor
With Prometheus Operator, the ZooKeeper metrics endpoints can also be discovered through a Kubernetes ServiceMonitor.
1.4.1 Create a Service for the ZooKeeper metrics port
# Service definition (for the exporter or the native metrics endpoint)
apiVersion: v1
kind: Service
metadata:
  name: zookeeper-monitor
  labels:
    app: zookeeper-monitor
spec:
  ports:
    - name: metrics
      port: 7000  # the metrics port defaults to 7000
  selector:
    app: zookeeper  # or whatever label the ZooKeeper Pods carry
1.4.2 Create the ZooKeeper ServiceMonitor
# ServiceMonitor definition (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zookeeper-monitor
  namespace: monitoring
spec:
  endpoints:
    - port: metrics
      interval: 30s
      scheme: http
  selector:
    matchLabels:
      app: zookeeper-monitor
  namespaceSelector:
    matchNames:
      - default
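Assuming the two manifests above are saved to files (filenames illustrative), apply and verify them:

# the Service must land in the default namespace to match the namespaceSelector
kubectl -n default apply -f zookeeper-monitor-service.yaml
kubectl apply -f zookeeper-monitor-servicemonitor.yaml
kubectl -n monitoring get servicemonitor zookeeper-monitor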
2 Monitoring with zookeeper-exporter
2.1 Download the chart
Reference: https://github.com/feiyu563/prometheus-exporter/tree/master
Note that several parameters need to be modified:
1. Image pull policy:
   - image: reg.hrlyit.com/common/zookeeper_exporter:latest
     name: zookeeper-exporter
     imagePullPolicy: IfNotPresent
2. Change the container label value: app: zookeeper-exporter
3. Update the apiVersion:
   apiVersion: apps/v1
4. For a ZooKeeper cluster, it is recommended to suffix the container names with an ordinal (0, 1, 2); this can also be changed after startup.
Start:
helm upgrade --install zookeeper-exporter-0 --namespace monitoring --set env.url='zookeeper-exporter-0' --set env.zookeeper_addr='zookeeper-0.zookeeper-headless.default:2181' ./zookeeper-exporter
helm upgrade --install zookeeper-exporter-1 --namespace monitoring --set env.url='zookeeper-exporter-1' --set env.zookeeper_addr='zookeeper-1.zookeeper-headless.default:2181' ./zookeeper-exporter
helm upgrade --install zookeeper-exporter-2 --namespace monitoring --set env.url='zookeeper-exporter-2' --set env.zookeeper_addr='zookeeper-2.zookeeper-headless.default:2181' ./zookeeper-exporter
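A quick check that the three exporter Pods are running, assuming they carry the app: zookeeper-exporter label set in step 2 above:

kubectl -n monitoring get pods -l app=zookeeper-exporter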
2.2 Create the Service
Because three zookeeper-exporter releases are created, three Services are auto-created as well. Delete those three Services first, then recreate a single one from the following YAML:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: zookeeper-exporter
  name: zookeeper-exporter
  namespace: monitoring
spec:
  ports:
    - name: http
      port: 9141
      protocol: TCP
      targetPort: http
  selector:
    app: zookeeper-exporter
  sessionAffinity: None
  type: ClusterIP
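A sketch of the replacement steps, assuming the auto-created Services carry the release names used above:

kubectl -n monitoring delete svc zookeeper-exporter-0 zookeeper-exporter-1 zookeeper-exporter-2
kubectl apply -f zookeeper-exporter-service.yaml  # the YAML above, filename illustrative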
2.3 Create the ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zookeeper-exporter
  namespace: monitoring
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: http
  namespaceSelector:
    matchNames:
      - monitoring
  selector:
    matchLabels:
      app: zookeeper-exporter
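To confirm that Prometheus has picked up the new targets, port-forward to the Prometheus Service and query the targets API (the Service name prometheus-k8s is an assumption; adjust to your installation):

# in one terminal
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
# in another terminal
curl -s http://localhost:9090/api/v1/targets | grep zookeeper-exporter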
2.4 Deploy zookeeper_exporter from a binary
Download the binary release package.
Download URL: https://github.com/carlpett/zookeeper_exporter/releases
Place the zookeeper_exporter binary under /opt/soft/zookeeper_exporter:
mkdir -p /opt/soft/zookeeper_exporter
chmod +x /opt/soft/zookeeper_exporter/zookeeper_exporter
## Start script: /opt/soft/zookeeper_exporter/zk_exporter.sh
#!/bin/bash
cd /opt/soft/zookeeper_exporter && ./zookeeper_exporter -zookeeper localhost:2181 -bind-addr ":9141"
## Start
nohup bash /opt/soft/zookeeper_exporter/zk_exporter.sh &
## Configure start on boot
echo 'bash /opt/soft/zookeeper_exporter/zk_exporter.sh' >> /etc/rc.local
## Verify
curl 127.0.0.1:9141/metrics
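Note: on systemd-based distributions, /etc/rc.local is often not executed by default. A minimal systemd unit (unit name illustrative; paths and flags as above) is a more robust way to start the exporter on boot:

# /etc/systemd/system/zookeeper_exporter.service
[Unit]
Description=zookeeper_exporter
After=network.target

[Service]
ExecStart=/opt/soft/zookeeper_exporter/zookeeper_exporter -zookeeper localhost:2181 -bind-addr :9141
Restart=on-failure

[Install]
WantedBy=multi-user.target

# then enable and start it
systemctl daemon-reload && systemctl enable --now zookeeper_exporter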
Configure Prometheus and restart it:
- job_name: "zookeeper-exporter"
static_configs:
- targets: ["10.51.10.4:9141","10.51.10.5:9141","10.51.10.6:9141"]
2.5 Core metrics
Status:
zk_up: whether the node is up
zk_server_state: node role (Leader/Follower)
zk_znode_count: number of znodes
zk_packets_received: packets received
zk_packets_sent: packets sent
zk_outstanding_requests: number of pending requests
Performance:
zk_avg_latency: average request latency
zk_max_latency: maximum request latency
zk_min_latency: minimum request latency
zk_outstanding_requests: backlog of queued requests
Client connections:
zk_num_alive_connections: number of live client connections
zookeeper_approximate_data_size: approximate size of the ZooKeeper data set
Transaction log and file handles:
zk_outstanding_requests: client requests currently waiting to be processed
zookeeper_open_file_descriptor_count: number of open file descriptors
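As a usage sketch, a few PromQL queries built only from the metrics listed above (thresholds illustrative):

# any ZooKeeper node reported down by its exporter
zk_up == 0
# average request latency above 100 ms on any node
zk_avg_latency > 100
# request backlog building up
zk_outstanding_requests > 10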