Installing Alertmanager
Download: https://prometheus.io/download/
After the download finishes, upload the package to the machine that runs the Prometheus service.
Extract the alertmanager package
tar -zxvf alertmanager-0.21.0.linux-amd64.tar.gz -C /data
mv /data/alertmanager-0.21.0.linux-amd64 /data/alertmanager
Change into the extracted alertmanager directory and edit alertmanager.yml to configure the alert notifications. The contents of alertmanager.yml are as follows:
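global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '***@163.com'           # mailbox that sends the alerts
  smtp_auth_username: '***@163.com'  # mailbox that sends the alerts
  smtp_auth_password: '***'          # mailbox authorization code
  smtp_require_tls: false
route:
  group_by: ['alertname']  # labels used to group alerts
  group_wait: 10s          # after the first alert of a new group fires, wait 10s so alerts in the same group are sent together
  group_interval: 10s      # how long to wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 1m      # how long to wait before re-sending a notification for alerts that are still firing, which limits how often the same email is sent; set to 1 minute here for testing
  receiver: 'mail'         # default receiver
  # routes:                # child routes that decide which groups are sent to which receivers (none defined here)
receivers:
  - name: 'mail'
    email_configs:
      - to: '***'
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname', 'dev', 'instance']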
Check that the alertmanager.yml configuration is valid
./amtool check-config alertmanager.yml
Start Alertmanager
nohup ./alertmanager &
tail -f nohup.out
level=error ts=2021-04-23T06:06:05.336Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=error ts=2021-04-23T06:07:05.368Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=error ts=2021-04-23T06:08:05.401Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=info ts=2021-04-23T06:08:15.693Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-04-23T06:08:15.693Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-04-23T06:08:15.697Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=192.168.56.128 port=9094
level=info ts=2021-04-23T06:08:15.700Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2021-04-23T06:08:15.737Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=alertmanager.yml
level=info ts=2021-04-23T06:08:15.738Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=alertmanager.yml
level=info ts=2021-04-23T06:08:15.788Z caller=main.go:485 msg=Listening address=:9093
level=info ts=2021-04-23T06:08:17.702Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.001649742s
level=info ts=2021-04-23T06:08:25.711Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.010215916s
Alertmanager listens on port 9093 by default, so the web UI is available at http://IP:9093
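Besides opening the web UI, the running instance can be checked from the command line. The following is a minimal sketch, assuming Alertmanager is reachable at http://localhost:9093 and that curl is installed. The function names, the AM_URL variable, and the ManualTest alert are made up for illustration; /-/healthy and /api/v2/alerts are the Alertmanager health and alert-submission endpoints.

```shell
# Base URL of Alertmanager (assumption: running locally on the default port).
AM_URL="${AM_URL:-http://localhost:9093}"

# A hypothetical hand-made alert; the label and annotation values are illustrative only.
TEST_ALERT_PAYLOAD='[{"labels":{"alertname":"ManualTest","severity":"warning"},"annotations":{"summary":"manual test alert"}}]'

# Succeeds (exit code 0) when Alertmanager answers its health endpoint.
check_health() {
  curl -fsS "$AM_URL/-/healthy"
}

# Posts the test alert so that the configured receiver is exercised
# without waiting for Prometheus to fire a real alert.
send_test_alert() {
  curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    -d "$TEST_ALERT_PAYLOAD" \
    "$AM_URL/api/v2/alerts"
}
```

Running `check_health && send_test_alert` should make the alert show up in the web UI; with the configuration above, an email would follow after group_wait if the SMTP settings are correct.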
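Configuring Prometheus
Edit the prometheus.yml configuration file (replace the path with your own Prometheus directory):
vim /path/to/prometheus/prometheus.yml
The modified configuration file looks like this: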
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
        # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rule.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'nginx'
    static_configs:
      - targets: ['192.168.56.129:9913']
  - job_name: 'tomcat'
    file_sd_configs:
      - files: ['/opt/prometheus/sd_config/tomcat.yml']
        refresh_interval: 180s
The parts that matter for alerting are the Alertmanager address:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
and the rule files:
rule_files: # alert rule files to load
  - "rule.yml"
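The tomcat job in prometheus.yml reads its targets from a file_sd file whose contents are not shown here. As a rough sketch of what such a file might look like, the snippet below writes one into a local directory. The SD_DIR variable, the target address 192.168.56.130:8080, and the app label are hypothetical and not taken from the original setup, which uses /opt/prometheus/sd_config/tomcat.yml.

```shell
# Hypothetical output directory; the article's config points at /opt/prometheus/sd_config.
SD_DIR="${SD_DIR:-./sd_config}"
mkdir -p "$SD_DIR"

# Each list item is one target group. Prometheus re-reads the file when it
# changes and at every refresh_interval.
cat > "$SD_DIR/tomcat.yml" <<'EOF'
- targets:
    - '192.168.56.130:8080'   # hypothetical Tomcat metrics endpoint
  labels:
    app: 'tomcat'             # hypothetical extra label attached to the targets
EOF
```

Keeping targets in a separate file means they can be added or removed without editing prometheus.yml or restarting Prometheus.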
Write the rule.yml file
cat prometheus-2.26.0.linux-amd64/rule.yml
groups:
  - name: mem-rule
    rules:
      - alert: "Memory alert"
        expr: up == 0 # PromQL expression
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Service: {{ $labels.alertname }} memory alert"
          description: "{{ $labels.alertname }} memory utilization is above 5%"
          value: "{{ $value }}"
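To make the effect easy to demonstrate, the rule uses the expression up == 0, which fires when a target is down, rather than a real memory check. The monitored services are Tomcat, Nginx, and Prometheus itself.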
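Just as amtool validates alertmanager.yml, the promtool binary shipped in the Prometheus tarball can validate the rule file and the main configuration before a restart. A minimal sketch, assuming Prometheus was extracted to ./prometheus-2.26.0.linux-amd64 and that rule.yml and prometheus.yml sit in that directory; the function name and the PROM_DIR variable are illustrative.

```shell
# Directory that contains promtool, prometheus.yml and rule.yml (assumption).
PROM_DIR="${PROM_DIR:-./prometheus-2.26.0.linux-amd64}"

# Validates the rule file first, then the main config.
# Returns non-zero as soon as one of the checks fails.
validate_prometheus_files() {
  dir="${1:-$PROM_DIR}"
  "$dir/promtool" check rules "$dir/rule.yml" &&
    "$dir/promtool" check config "$dir/prometheus.yml"
}
```

Calling `validate_prometheus_files` with no argument checks the default directory; pass a different directory as the first argument if Prometheus lives elsewhere.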
Restart Prometheus and Alertmanager
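As an alternative to a full restart, both Prometheus and Alertmanager re-read their configuration when they receive SIGHUP. The sketch below assumes the processes are named prometheus and alertmanager and that pgrep is available; the function name is made up for illustration.

```shell
# Sends SIGHUP to every process with exactly the given name so that it
# reloads its configuration. Returns 1 if no such process is running.
reload_by_sighup() {
  name="$1"
  pids="$(pgrep -x "$name")" || {
    echo "$name is not running" >&2
    return 1
  }
  # $pids is intentionally unquoted: pgrep may print several PIDs, one per line.
  kill -HUP $pids
}
```

For example, `reload_by_sighup prometheus && reload_by_sighup alertmanager` reloads both services in place. If a reload fails because of a configuration error, the process logs the error and keeps running with the previous configuration.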