First, configure alertmanager.yml:
global:  # global settings
  resolve_timeout: 5m
route:  # grouping and routing
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 4h
  receiver: 'web.hook'
receivers:  # receivers; alerts are forwarded to DingTalk through a webhook
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/webhook1/send'
inhibit_rules:  # alert inhibition
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
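Note that the inhibit rule above matches the severity values 'critical' and 'warning', while the rule file later in this document labels its alerts with severity P0 and P1, so the rule as written would not apply to those alerts. A minimal sketch of a variant for the P0/P1 labels follows. The choice to let P0 suppress P1, and to compare only the instance label, is an assumption made for illustration and is not part of the original setup.

```yaml
# Sketch only: lets a firing P0 alert mute P1 alerts from the same instance.
# The P0 -> P1 direction and the 'instance'-only comparison are assumptions.
inhibit_rules:
  - source_match:
      severity: 'P0'
    target_match:
      severity: 'P1'
    equal: ['instance']
```

The file can be checked before restarting Alertmanager with amtool check-config alertmanager.yml.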
Next, connect Prometheus to Alertmanager by adding the following to prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093
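Then, still in prometheus.yml, add the path from which rule files are loaded. The alerts apply to the job_name entries configured earlier, which are not covered again here.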
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "conf/rules/*.yml"
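The comment above refers to the global evaluation_interval, which is set in the global block of prometheus.yml. A minimal sketch of that block is shown here; the 15s values are assumptions for illustration, not values taken from the original configuration.

```yaml
# Sketch only: the interval values are assumed, adjust them to your setup.
global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often the rules in rule_files are evaluated
```

An alert with for: 5m has to stay true across every evaluation in that 5 minute window before it moves from pending to firing.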
Configure the rules in rules.yml. Rule templates like these can be downloaded online, for example:
groups:
  - name: YarnStatsAlert
    rules:
      - alert: hadoop_yarn_resourcemanager_apps_pending too high
        expr: hadoop_yarn_resourcemanager_apps_pending > 100
        for: 5m
        labels:
          severity: P0
          alertroute: dingding|phone
          partyid: "2"
        annotations:
          summary: "Instance {{ $labels.instance }}"
          description: "Too many pending applications: hadoop_yarn_resourcemanager_apps_pending (current value: {{ $value }})"
      - alert: nodemanager node unhealthy
        # The label name was spelled "serivce" in the original; check it against the labels your exporter actually exposes.
        expr: hadoop_yarn_resourcemanager_nodemanager_total{service="YARN",status="Unhealthy"} > 0
        for: 1m
        labels:
          severity: P1
          alertroute: dingding|phone
          partyid: "2"
        annotations:
          summary: "Instance {{ $labels.instance }}"
          description: "A nodemanager node is unhealthy, please check it promptly (current value: {{ $value }})"
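The rules above attach an alertroute label, but the route section of alertmanager.yml shown earlier does not reference it, so every alert goes to the default receiver. If the intent is to route on that label, a sub-route could look roughly like the sketch below. The regular expression and the reuse of the 'web.hook' receiver are assumptions; a phone channel would need its own receiver, which this document does not define.

```yaml
# Sketch only: routes alerts whose alertroute label contains "dingding".
# Reusing 'web.hook' here is an assumption for illustration.
route:
  group_by: ['alertname']
  receiver: 'web.hook'          # default receiver for everything else
  routes:
    - match_re:
        alertroute: '.*dingding.*'
      receiver: 'web.hook'
```

Rule files can be validated with promtool check rules before Prometheus is reloaded.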
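Finally, download prometheus-webhook-dingtalk and install it. Once it is running, it provides the service at http://localhost:8060/dingtalk/webhook1/send that was set under webhook_configs earlier.
Set the actual DingTalk robot webhook URL in config.yml.
Define the wording of the alert message in template.tmpl.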
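As a rough sketch of config.yml for prometheus-webhook-dingtalk, assuming a recent release that uses the targets layout: the target name webhook1 has to match the path segment in the URL configured in alertmanager.yml. The access token, the secret, and the template path below are placeholders, not real values.

```yaml
# Sketch only: token, secret, and template path are placeholders.
templates:
  - template.tmpl               # assumed location of the custom template file
targets:
  webhook1:                     # must match /dingtalk/webhook1/send
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    secret: SEC000000000000000000000   # only needed if the robot uses signing
```

The key names may differ between versions, so compare them against the example config shipped with the release you installed.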
To silence alerts, open the Alertmanager web UI on the host where it is deployed, for example http://localhost:9093/#/alerts
Click New Silence to configure a silence.