Configuring Email Alerts with Prometheus + Alertmanager

Installing Alertmanager

Download page: https://prometheus.io/download/
After the download finishes, upload the package to the machine where the Prometheus service runs.
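
If the server has outbound network access, the archive can also be fetched directly instead of being uploaded. The following is a minimal sketch, assuming the release is published under the project's GitHub releases with the same file name used in this walkthrough; the `DOWNLOAD` switch and the `/tmp` target are illustrative choices, not part of the original steps.

```shell
#!/bin/sh
# Assumed values: adjust VERSION to the release you actually want.
VERSION="0.21.0"
ARCHIVE="alertmanager-${VERSION}.linux-amd64.tar.gz"
URL="https://github.com/prometheus/alertmanager/releases/download/v${VERSION}/${ARCHIVE}"

echo "Archive: ${ARCHIVE}"
echo "URL:     ${URL}"

# Set DOWNLOAD=1 to actually fetch the file (needs network access and wget).
# DOWNLOAD is a hypothetical switch used only in this sketch.
if [ "${DOWNLOAD:-0}" = "1" ]; then
    wget -O "/tmp/${ARCHIVE}" "${URL}"
fi
```

Run it as `DOWNLOAD=1 sh fetch.sh` (the script name is arbitrary) and then extract `/tmp/alertmanager-0.21.0.linux-amd64.tar.gz` as described in the next step.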


Extract the alertmanager package

tar -zxvf alertmanager-0.21.0.linux-amd64.tar.gz -C /data
mv /data/alertmanager-0.21.0.linux-amd64 /data/alertmanager
Change into the extracted alertmanager directory and edit alertmanager.yml to configure alerting. The contents of alertmanager.yml are as follows:
global:
  resolve_timeout: 5m 
  smtp_smarthost: 'smtp.163.com:25'
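  smtp_from: '***@163.com' # mailbox that sends the alerts
  smtp_auth_username: '***@163.com'  # mailbox that sends the alerts
  smtp_auth_password: '***' # mailbox authorization code (not the login password)
  smtp_require_tls: false
route:
  group_by: ['alertname'] # labels used to group alerts
  group_wait: 10s # how long to wait after an alert fires, so that alerts in the same group are sent together
  group_interval: 10s # how long to wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 1m  # how long to wait before re-sending a still-firing alert, to reduce duplicate emails; set to 1 minute here for testing only
  receiver: 'mail'  # default receiver
  # routes:  # optional child routes that decide which alert groups go to which receiver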
receivers:
- name: 'mail'
  email_configs:
  - to: '***'
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname', 'dev', 'instance']
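
The commented-out `routes:` key above is where per-group routing would go. As a rough sketch of how that could look, the script below writes a small standalone configuration with one child route and validates it when `amtool` is on the PATH. The `mail-critical` receiver, the `example.com` addresses, and the temporary file are all hypothetical placeholders; `match` is the matcher syntax that fits the 0.21.0 release used here.

```shell
#!/bin/sh
# Write a sample config with one child route to a temporary file.
CONF="$(mktemp)"
cat > "$CONF" <<'EOF'
global:
  smtp_smarthost: 'smtp.example.com:25'   # placeholder SMTP server
  smtp_from: 'alerts@example.com'         # placeholder sender
route:
  receiver: 'mail'            # default receiver for everything else
  group_by: ['alertname']
  routes:
  - match:
      severity: 'critical'    # alerts labelled severity=critical ...
    receiver: 'mail-critical' # ... go to this (hypothetical) receiver
receivers:
- name: 'mail'
  email_configs:
  - to: 'team@example.com'
- name: 'mail-critical'
  email_configs:
  - to: 'oncall@example.com'
EOF

if command -v amtool >/dev/null 2>&1; then
    # Validate the file, then ask which receiver a critical alert would reach.
    amtool check-config "$CONF"
    amtool config routes test --config.file="$CONF" severity=critical
else
    echo "amtool not found; sample config written to $CONF"
fi
```

With this layout, anything that does not match a child route falls back to the top-level `receiver`, so the single-receiver setup above keeps working unchanged.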

Check that alertmanager.yml is valid

./amtool check-config alertmanager.yml

Start Alertmanager

nohup ./alertmanager &
tail -f nohup.out 
level=error ts=2021-04-23T06:06:05.336Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=error ts=2021-04-23T06:07:05.368Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=error ts=2021-04-23T06:08:05.401Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=info ts=2021-04-23T06:08:15.693Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-04-23T06:08:15.693Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-04-23T06:08:15.697Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=192.168.56.128 port=9094
level=info ts=2021-04-23T06:08:15.700Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2021-04-23T06:08:15.737Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=alertmanager.yml
level=info ts=2021-04-23T06:08:15.738Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=alertmanager.yml
level=info ts=2021-04-23T06:08:15.788Z caller=main.go:485 msg=Listening address=:9093
level=info ts=2021-04-23T06:08:17.702Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.001649742s
level=info ts=2021-04-23T06:08:25.711Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.010215916s

Alertmanager listens on port 9093 by default; the web UI is available at http://IP:9093
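
Besides opening the web UI, the process can be probed from the command line. This is a small sketch that queries Alertmanager's `/-/healthy` endpoint; the `AM_URL` variable and its `localhost` default are assumptions for a setup where the check runs on the same machine.

```shell
#!/bin/sh
# Base URL of Alertmanager; override with AM_URL=http://IP:9093 if remote.
AM_URL="${AM_URL:-http://localhost:9093}"

# -f makes curl fail on HTTP errors, --max-time keeps the probe short.
if command -v curl >/dev/null 2>&1 \
   && curl -fsS --max-time 2 "${AM_URL}/-/healthy" >/dev/null 2>&1; then
    STATUS="healthy"
else
    STATUS="unreachable"
fi

echo "Alertmanager at ${AM_URL}: ${STATUS}"
```

An `unreachable` result usually means the process is not running, the port is blocked by a firewall, or `AM_URL` points at the wrong host.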



Configuring Prometheus

vim /path/to/prometheus/prometheus.yml

Edit the prometheus.yml configuration file

This is the configuration file after the changes:
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:  
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rule.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'nginx'

    static_configs:
    - targets: ['192.168.56.129:9913']
  - job_name: 'tomcat'
    file_sd_configs:
    - files: ['/opt/prometheus/sd_config/tomcat.yml']
      refresh_interval: 180s

The parts that matter for alerting are

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

and

rule_files: # alert rule files to load
- "rule.yml"

Writing the rule.yml configuration file

cat prometheus-2.26.0.linux-amd64/rule.yml 
groups:
- name: mem-rule
  rules:
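  - alert: "MemoryAlert"
    expr: up == 0  # PromQL expression
    for: 30s
    labels:
      severity: warning
    annotations:
      # $labels holds the labels of the series returned by expr (job, instance for `up`);
      # alertname is not one of them, so job/instance are used here instead.
      summary: "Service: {{ $labels.job }} memory alert"
      description: "{{ $labels.instance }} memory utilization is above 5%"
      value: "{{ $value }}"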
To make the effect easy to demonstrate, the alert rule is up == 0 rather than a real memory alert. The monitored services are Tomcat, Nginx, and Prometheus itself.
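
Since the rule really detects targets that are down, a version with names that match its behavior may be easier to maintain. The sketch below writes such a rule to a temporary file and validates it with `promtool check rules` when `promtool` is available. The group name `availability-rule` and the alert name `InstanceDown` are illustrative choices, not taken from the original setup.

```shell
#!/bin/sh
# Write an availability rule to a temporary file. The quoted 'EOF' keeps the
# shell from touching the {{ ... }} templates, which Prometheus expands later.
RULES="$(mktemp)"
cat > "$RULES" <<'EOF'
groups:
- name: availability-rule
  rules:
  - alert: InstanceDown
    expr: up == 0          # a target failed its last scrape
    for: 30s               # must stay down for 30s before the alert fires
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "Job {{ $labels.job }}: target {{ $labels.instance }} has been unreachable for more than 30s"
EOF

if command -v promtool >/dev/null 2>&1; then
    # promtool ships in the Prometheus tarball next to the prometheus binary.
    promtool check rules "$RULES"
else
    echo "promtool not found; sample rule written to $RULES"
fi
```

Once the file passes the check, its contents can replace rule.yml, which prometheus.yml already references under `rule_files`.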

Restart Prometheus and Alertmanager
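
After both services are back up, the email path can be exercised without waiting for a real target to go down. The following sketch posts a hand-made alert to Alertmanager's v2 API; the alert name `TestAlert`, its labels, and the `AM_URL` default are assumptions made for this example.

```shell
#!/bin/sh
# Base URL of Alertmanager; override with AM_URL=http://IP:9093 if remote.
AM_URL="${AM_URL:-http://localhost:9093}"

# One synthetic alert; labels and annotations are free-form key/value pairs.
PAYLOAD='[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"manual-test"},"annotations":{"summary":"Manual test alert"}}]'

if command -v curl >/dev/null 2>&1; then
    if curl -fsS --max-time 5 -X POST \
            -H 'Content-Type: application/json' \
            -d "$PAYLOAD" "${AM_URL}/api/v2/alerts"; then
        RESULT="sent"
    else
        RESULT="failed"
    fi
else
    RESULT="skipped"   # curl is not installed
fi

echo "Test alert: ${RESULT}"
```

If the request succeeds, the alert should appear in the web UI, and an email should follow once `group_wait` has elapsed. A synthetic alert that is not re-sent is resolved by Alertmanager after `resolve_timeout`.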

