If RocketMQ itself runs inside k8s, see https://www.jianshu.com/p/aac14799626c for deploying the exporter. The steps below demonstrate monitoring a RocketMQ instance running in Docker.
1. Deploy rocketmq-exporter
Use the official RocketMQ Prometheus Exporter (apache/rocketmq-exporter).
The exporter converts RocketMQ metrics into the Prometheus exposition format and usually runs as a standalone container.
In the command below, 5557 is the exporter's HTTP port and 10.1.2.128:9876 is the address of the RocketMQ NameServer.
docker run -d --restart=always -p 5557:5557 apache/rocketmq-exporter --rocketmq.config.namesrvAddr=10.1.2.128:9876
Test: make sure the RocketMQ Exporter endpoint can be scraped over HTTP
curl http://10.1.2.128:5557/metrics
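A healthy exporter answers the curl above with plain-text Prometheus exposition format. The snippet below is a sketch using a hypothetical, abbreviated sample of that output (the metric values are made up; the real payload comes from the curl command), showing how to count the RocketMQ metric samples in a response:

```shell
# Hypothetical, abbreviated /metrics payload; in practice capture it with
# sample=$(curl -s http://10.1.2.128:5557/metrics)
sample='# HELP rocketmq_producer_offset producer offset
# TYPE rocketmq_producer_offset gauge
rocketmq_producer_offset{broker="broker-a",topic="TopicTest"} 1024
rocketmq_consumer_offset{broker="broker-a",group="example_group"} 1000'

# Count metric sample lines (non-comment lines starting with rocketmq_)
echo "$sample" | grep -c '^rocketmq_'   # → 2
```

If the count is 0, the exporter is up but not yet seeing the NameServer; re-check the --rocketmq.config.namesrvAddr value.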
1.1. Create an Endpoints and a Service to expose the external exporter inside the k8s cluster
vim rocketmq-exporter.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: external-rocketmq
  namespace: monitoring
  labels:
    app: rocketmq-exporter
    app.kubernetes.io/name: rocketmq-exporter
subsets:
  - addresses:
      # list of external addresses
      - ip: 10.1.2.128
    ports:
      - name: metrics
        port: 5557
---
apiVersion: v1
kind: Service
metadata:
  name: external-rocketmq
  namespace: monitoring
  labels:
    app: rocketmq-exporter
    app.kubernetes.io/name: rocketmq-exporter
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 5557
      protocol: TCP
      targetPort: 5557
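After applying the manifest, it is worth sanity-checking that the headless Service actually points at the external address. A sketch, assuming a working kubectl context against this cluster:

```shell
kubectl apply -f rocketmq-exporter.yaml

# The Endpoints object should list the external address 10.1.2.128:5557
kubectl -n monitoring get endpoints external-rocketmq

# From inside the cluster the exporter is now reachable via the Service DNS name:
#   external-rocketmq.monitoring.svc:5557/metrics
```

Because the Service has no selector, Kubernetes will not manage the Endpoints automatically; the address list must be updated by hand if the exporter's IP changes.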
1.2. Configure a ServiceMonitor and a PrometheusRule
Prometheus Operator uses ServiceMonitor and PrometheusRule resources to define scrape targets and alerting rules.
vim rocketmq-ServiceMonitor-prometheusrules.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rocketmq-exporter
  namespace: monitoring
  labels:
    app: rocketmq-exporter
    release: prometheus
spec:
  selector:
    matchLabels:
      app: rocketmq-exporter
  namespaceSelector:
    matchNames:
      - monitoring
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      scheme: http
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rocketmq-rules
  namespace: monitoring
spec:
  groups:
    - name: rocketmq.rules
      rules:
        # Message backlog on the broker
        - alert: RocketMQProducerOffsetLag
          expr: rocketmq_producer_offset{job="external-rocketmq", broker="broker-a"} > 5000
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: RocketMQ producer offset is too high
            description: "The RocketMQ producer offset has exceeded 5000 over the last 5 minutes, which may indicate a message backlog."
        # RocketMQ service availability
        - alert: RocketMQServiceDown
          expr: up{job="external-rocketmq"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: RocketMQ service is down
            description: "The RocketMQ service has been unreachable for the last 1 minute."
        # Consumer pull (get) latency
        - alert: RocketMQConsumerGetLatencyHigh
          expr: |
            rocketmq_group_get_latency_by_storetime / 1000 > 10 and rate(rocketmq_group_get_latency_by_storetime[5m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: RocketMQ consumer get latency is too high
            description: "Consumer get latency exceeds 10 seconds and has been increasing over the last 5 minutes. Latency: {{ $value }} seconds."
        # Consumer failure rate
        - alert: RocketMQConsumerFailureRateHigh
          expr: |
            sum(rate(rocketmq_consumer_failure_count[5m])) by (consumer, topic) > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: RocketMQ consumer {{ $labels.consumer }} failure rate is too high
            description: "Consumer {{ $labels.consumer }} is failing on topic {{ $labels.topic }}; current failure count: {{ $value }}."
        # Broker disk usage
        - alert: RocketMQBrokerStorageUsageHigh
          expr: |
            (rocketmq_broker_disk_usage / rocketmq_broker_disk_capacity) * 100 > 80
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: RocketMQ broker {{ $labels.broker }} disk usage is too high
            description: "Disk usage on broker {{ $labels.broker }} is above 80%; current value: {{ $value }}%."
        # Cluster health
        - alert: RocketMQClusterUnhealthy
          expr: |
            sum(rocketmq_cluster_broker_count) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: RocketMQ cluster is unhealthy
            description: "The broker count in the RocketMQ cluster is below the expected value; current count: {{ $value }}."
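As a worked example of the disk-usage rule, the expression (rocketmq_broker_disk_usage / rocketmq_broker_disk_capacity) * 100 > 80 can be evaluated by hand. The values below are hypothetical sample readings, not real metrics from this setup:

```shell
# Hypothetical sample: 850 GiB used out of 1000 GiB capacity
usage=850
capacity=1000

# Same arithmetic as the PromQL expression: usage / capacity * 100
pct=$(awk -v u="$usage" -v c="$capacity" 'BEGIN { printf "%d", u / c * 100 }')
echo "disk usage: ${pct}%"   # → disk usage: 85%

# 85 > 80, so RocketMQBrokerStorageUsageHigh would fire after the 10m "for" window
if [ "$pct" -gt 80 ]; then
  echo "RocketMQBrokerStorageUsageHigh would fire"
fi
```

Note that the alert only fires once the threshold has held for the full 10-minute `for` duration, so a brief spike above 80% does not page anyone.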