Install Alertmanager
Download: https://prometheus.io/download/
After the download finishes, upload the package to the machine that runs the Prometheus service.
Extract the Alertmanager package:
tar -zxvf alertmanager-0.21.0.linux-amd64.tar.gz -C /data
mv /data/alertmanager-0.21.0.linux-amd64 /data/alertmanager
Go into the extracted alertmanager directory and edit alertmanager.yml to configure alert notifications. The contents of alertmanager.yml are as follows:
cat alertmanager.yml
global:
  resolve_timeout: 5m  # an alert is treated as resolved if it is not received again within 5 minutes
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'wechat'
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'XXX'
    to_party: 'XX'
    agent_id: '1000011'
    api_secret: 'secret'
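The steps above stop at writing the configuration. One possible way to validate it and start Alertmanager is sketched below. It assumes the /data/alertmanager path from the extraction step and uses the amtool binary shipped in the same tarball; the AM_DIR variable and the log file name are illustrative choices, not part of the original setup.

```shell
#!/bin/sh
# Sketch: validate alertmanager.yml, then start Alertmanager in the background.
# AM_DIR matches the extraction path used above; alertmanager.log is an assumed name.
AM_DIR="${AM_DIR:-/data/alertmanager}"

if [ -x "$AM_DIR/amtool" ] && [ -x "$AM_DIR/alertmanager" ]; then
    # amtool check-config reports syntax errors before the service is started
    if "$AM_DIR/amtool" check-config "$AM_DIR/alertmanager.yml"; then
        (
            cd "$AM_DIR" || exit 1
            nohup ./alertmanager --config.file=alertmanager.yml > alertmanager.log 2>&1 &
            echo "Alertmanager started with PID $!"
        )
    else
        echo "alertmanager.yml failed validation; not starting" >&2
    fi
else
    echo "Alertmanager binaries not found in $AM_DIR; skipping" >&2
fi
```

By default Alertmanager listens on port 9093, which is the port referenced in the Prometheus configuration later on.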
Install rabbitmq_exporter
Download: https://github.com/kbudde/rabbitmq_exporter/releases
Extract and run:
tar -zxvf rabbitmq_exporter-version
cd rabbitmq_exporter-version
RABBIT_USER=USER RABBIT_PASSWORD=PASSWORD OUTPUT_FORMAT=json PUBLISH_PORT=9099 RABBIT_URL=http://XXX:15672 nohup ./rabbitmq_exporter &
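RABBIT_USER: user name for the RabbitMQ management plugin
RABBIT_PASSWORD: password for that user
OUTPUT_FORMAT: log output format of the exporter (JSON here)
PUBLISH_PORT: port the exporter listens on
RABBIT_URL: address of the RabbitMQ management plugin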
Once rabbitmq_exporter is running, add a RabbitMQ scrape job to prometheus.yml:
  - job_name: 'RabbitMQ'
    static_configs:
    - targets: ['47.241.2.144:9099']
      labels:
        instance: RabbitMQ-47.241.2.144
    - targets: ['47.101.150.234:9099']
      labels:
        instance: RabbitMQ-47.101.150.234
I am monitoring two nodes here.
Configure alerting rules
Reference the rule file in prometheus.yml:
rule_files:
  - "rule.yml"
cat /data/prometheus/rule.yml
groups:
- name: Rabbitmq
  rules:
  - alert: Rabbitmq-down
    expr: rabbitmq_up{job='RabbitMQ'} != 1
    labels:
      status: High
      team: Rabbitmq_monitor
    annotations:
      description: "Instance: {{ $labels.instance }} is Down ! ! !"
      value: '{{ $value }}'
      summary: "The host node is down"
- name: Rabbitmq disk free limit
  rules:
  - alert: Rabbitmq disk free limit status
    expr: rabbitmq_node_disk_free{job='RabbitMQ'} / 1024 / 1024 <= rabbitmq_node_disk_free_limit{job='RabbitMQ'} / 1024 / 1024 + 200
    labels:
      status: High
      team: Rabbitmq_monitor
    annotations:
      description: "Instance: {{ $labels.instance }} the rmq free disk is too low ! ! !"
      value: '{{ $value }} MB'
      summary: "The rmq free disk too low"
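To make the threshold in the disk rule concrete: it fires when free disk space, converted to MB, is at or below the node's free-disk limit in MB plus 200. The sketch below reproduces that comparison with illustrative byte values. The function name disk_alert_state and the sample numbers are assumptions for the example, not readings from a real node.

```shell
#!/bin/sh
# Mirror the rule expression:
#   disk_free / 1024 / 1024 <= disk_free_limit / 1024 / 1024 + 200
# $1 = free bytes, $2 = limit bytes; prints "firing" or "ok".
disk_alert_state() {
    awk -v free="$1" -v limit="$2" 'BEGIN {
        free_mb = free / 1024 / 1024
        threshold_mb = limit / 1024 / 1024 + 200
        if (free_mb <= threshold_mb) print "firing"; else print "ok"
    }'
}

# Illustrative values: a 50 MB limit gives a 250 MB threshold.
disk_alert_state 209715200 52428800    # 200 MB free  -> firing
disk_alert_state 1073741824 52428800   # 1024 MB free -> ok
```

In other words, the rule leaves a 200 MB safety margin above the point at which RabbitMQ itself would raise its disk alarm.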
Add further rules for whatever else you need to monitor.
Complete prometheus.yml configuration
cat /data/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.1.178:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rule.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'RabbitMQ'
    static_configs:
    - targets: ['xx.xx.xx.xx:9099']
      labels:
        instance: RabbitMQ-47.241.2.144
    - targets: ['xx.xx.xx.xx:9099']
      labels:
        instance: RabbitMQ-47.101.150.234

  - job_name: 'Linux'
    static_configs:
    - targets: ['xx.xx.xx.xx:9100']
      labels:
        instance: Linux

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['xx.xx.xx.xx:9093']
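After editing prometheus.yml and rule.yml, it is worth checking both files before Prometheus loads them. A minimal sketch follows, assuming Prometheus is installed under /data/prometheus (the directory the rule file lives in above) and that promtool from the same release sits next to it; the PROM_DIR variable is an illustrative name. Sending SIGHUP asks a running Prometheus to reload its configuration.

```shell
#!/bin/sh
# Sketch: validate the configuration (including referenced rule files), then reload.
# PROM_DIR is an assumed install path based on /data/prometheus/rule.yml above.
PROM_DIR="${PROM_DIR:-/data/prometheus}"

if [ -x "$PROM_DIR/promtool" ]; then
    # "check config" also checks the rule files listed under rule_files
    if "$PROM_DIR/promtool" check config "$PROM_DIR/prometheus.yml"; then
        pid=$(pgrep -x prometheus | head -n 1)
        if [ -n "$pid" ]; then
            kill -HUP "$pid"
            echo "Sent SIGHUP to Prometheus (PID $pid)"
        else
            echo "Prometheus is not running; start it to pick up the configuration"
        fi
    else
        echo "Configuration check failed; Prometheus was not reloaded" >&2
    fi
else
    echo "promtool not found in $PROM_DIR; skipping check" >&2
fi
```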
Finally, test the setup and check the result.
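One way to test the notification path without taking a RabbitMQ node down is to post a hand-made alert straight to Alertmanager's v2 API. The sketch below assumes Alertmanager is reachable at 192.168.1.178:9093, the address used in the alerting section of prometheus.yml. The helper build_test_alert, the AM_URL and SEND_TEST_ALERT variables, and the alert name are all hypothetical, and nothing is sent unless SEND_TEST_ALERT=1 is set.

```shell
#!/bin/sh
# Build a JSON payload holding one test alert.
# $1 = alert name, $2 = instance label
build_test_alert() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","status":"High"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

# Assumed address, taken from the alerting target in prometheus.yml.
AM_URL="${AM_URL:-http://192.168.1.178:9093}"

# Only send when explicitly requested, so running the script has no side effects by default.
if [ "${SEND_TEST_ALERT:-0}" = "1" ] && command -v curl >/dev/null 2>&1; then
    curl -s --max-time 5 -X POST \
        -H 'Content-Type: application/json' \
        -d "$(build_test_alert ManualTest RabbitMQ-test)" \
        "$AM_URL/api/v2/alerts"
fi
```

If the WeChat receiver is configured correctly, a message for the test alert should arrive after roughly the group_wait interval (10s in the configuration above).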