prometheus.yaml configuration
# Alertmanager connection settings
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.3.63:9093 # Alertmanager IP and port
# Rule file paths
rule_files:
  - /etc/prometheus/rules/*.yml # path to the rule files
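The alerting rules in these notes rely on the `up` metric, which Prometheus records for every scrape target, and one of them filters on `job="kafka"`. A minimal sketch of a matching scrape job in the same prometheus.yaml could look like the fragment below. The target address is a placeholder and not part of the original setup.

```yaml
# Sketch only: the target below is a placeholder, adjust it to the real exporter.
scrape_configs:
  - job_name: kafka              # becomes the job label, matched by up{job="kafka"}
    static_configs:
      - targets:
          - 192.0.2.10:9308      # hypothetical Kafka exporter address
```

Each target listed here becomes the `instance` label on the resulting `up` series.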
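Prometheus rule configuration
Under the rules path you can create any number of yml files, each with a distinct file name, to define alerting rules.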
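groups:
  - name: instancedemo # group name; must be unique
    rules:
      - alert: INSTANCEDOWN # alert name
        expr: up == 0 # PromQL expression
        for: 5m # the expression must stay true this long before the alert fires
        labels:
          severity: "100" # alert severity
          team: instance # team label; Alertmanager routes alerts by this value
        annotations:
          summary: "Alert IP: {{$labels.instance}}, job name: {{$labels.job}} is down"
  - name: springdemo # group name; must be unique. This group could also live in a separate rule file
    rules:
      - alert: KAFKADOWN # alert name
        expr: up{job="kafka"} == 0 # PromQL expression
        for: 5m # the expression must stay true this long before the alert fires
        labels:
          severity: "100" # alert severity
          team: kafka # team label; Alertmanager routes alerts by this value
        annotations:
          summary: "Alert IP: {{$labels.instance}}, job name: {{$labels.job}} is down"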
Variable reference table
Variable | Meaning | Example |
---|---|---|
$node | Client address | 172.0.0.1:8080 |
$labels.instance | Address of the alerting target | 172.0.0.1:8080 |
$labels.job | Job name | spring |
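As an illustration of how these variables are used, the fragment below is a sketch of an `annotations` block that sits under a single rule. The `description` key and the use of `$value` (the value of the alert expression at evaluation time) are additions for illustration and do not come from the original rules.

```yaml
# Fragment of one alerting rule; place it under the rule next to expr and labels.
annotations:
  summary: "Instance {{ $labels.instance }} of job {{ $labels.job }} is down"
  description: "Expression value when the alert fired: {{ $value }}"  # illustrative extra annotation
```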
alertmanager.yaml configuration
global: # global settings
  resolve_timeout: 5m # time after which an alert with no further updates is treated as resolved; default 5m
inhibit_rules:
  - source_match: # alerts that trigger the inhibition
      severity: "90"
    target_match: # alerts that get inhibited
      severity: "80"
    equal: [kafka, instance] # label names whose values must be identical on both alerts, otherwise nothing is inhibited
route:
  receiver: webhook1
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [demo, kafka, instance] # group_by takes label names; the team label from the Prometheus rule files would be listed here as team
  routes:
    - receiver: webhook2 # defined under receivers below
      group_by: [nodeExt2]
      matchers:
        - team = kafka
      group_interval: 10s
      group_wait: 30s
      repeat_interval: 60m
    - receiver: webhook3
      group_by: [nodeExt3]
      matchers:
        - team = instance
      group_interval: 10s
      group_wait: 30s
      repeat_interval: 60m
receivers:
  - name: webhook1
    webhook_configs: # webhook notification settings
      - url: http://172.16.1.165:29098/maintenanceApi/order/alarm
  - name: webhook2
    webhook_configs: # webhook notification settings
      - url: http://172.16.1.165:29098/maintenanceApi/order/alarm2
  - name: webhook3
    webhook_configs: # webhook notification settings
      - url: http://172.16.1.165:29098/maintenanceApi/order/alarm3
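Both `group_by` and `equal` expect label names rather than label values. The fragment below is a sketch of how the same intent might be written with the rule labels used in these notes (`team`, `severity`, `instance`). It also uses the `source_matchers` / `target_matchers` form, which newer Alertmanager releases accept in place of `source_match` / `target_match`. The choice of `instance` for `equal` is an assumption, not something the original configuration states.

```yaml
# Sketch only: shows label names in group_by and equal; values are illustrative.
inhibit_rules:
  - source_matchers:
      - severity = "90"        # alerts that trigger the inhibition
    target_matchers:
      - severity = "80"        # alerts that get inhibited
    equal: [instance]          # assumed: inhibit only when both alerts share the same instance
route:
  receiver: webhook1
  group_by: [team]             # group alerts by the team label set in the rule files
```

With `group_by: [team]`, alerts labeled `team: kafka` and `team: instance` end up in separate notification groups.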
Restart Prometheus and Alertmanager for the changes to take effect.