Engineering reference:
1. node-exporter
node-exporter is a metrics collector that gathers metrics from Linux host nodes.
1.1 The compose file is as follows:
version: '3.2'
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    hostname: node-exporter
    restart: always
    volumes:
      - /:/rootfs:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    ports:
      - "19100:9100"
# Grafana dashboard ID: 8919
After starting the container, open http://192.168.10.128:19100/metrics. If the page returns metric data, node-exporter is running correctly.
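The same check can be scripted instead of done in a browser. The following is a minimal sketch, not part of the original setup: `fetch_metrics` and `sample_names` are hypothetical helper names, and the parsing only handles the simple `name{labels} value` line shape of the Prometheus text format.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Download the raw text exposition payload from an exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def sample_names(payload):
    """Return the set of metric names that have at least one sample line."""
    names = set()
    for line in payload.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names
```

Calling `sample_names(fetch_metrics("http://192.168.10.128:19100/metrics"))` should return a non-empty set containing names that start with `node_` when the exporter is healthy.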
1.2 Add the following to the prometheus.yml configuration file:
scrape_configs:
  ...
  - job_name: 'node-exporter'
    file_sd_configs:    # file-based service discovery
      - files: ['./node_exporter/node.json']
  - job_name: 'node-exporter1'
    static_configs:     # static configuration
      - targets: ['192.168.10.129:19100']
        labels:         # custom labels, written as key: value
          app: 'node01'
          env: 'test'
  ...
The ./node_exporter/node.json file looks like this:
[
  {
    "targets": [
      "192.168.10.128:19100"
    ],
    "labels": {
      "host": "prometheus.center",
      "env": "dev"
    }
  },
  {
    "targets": [
      "192.168.10.129:19100",
      "192.168.10.130:19100"
    ],
    "labels": {
      "host": "prometheus.nodes",
      "env": "test"
    }
  }
]
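node.json is a JSON array, so a single file can hold several target groups, each with its own labels. Every target must be an address Prometheus can actually reach; with the compose file above that is the published host port 19100, not the container port 9100.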
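When the host list changes often, the file can be generated rather than edited by hand. The sketch below is one possible approach under stated assumptions: `build_file_sd` and `write_file_sd` are hypothetical helper names, and the validation only checks that each target has a `host:port` shape.

```python
import json


def build_file_sd(groups):
    """Build a file_sd target list from (targets, labels) pairs."""
    entries = []
    for targets, labels in groups:
        if not targets:
            raise ValueError("each group needs at least one target")
        for target in targets:
            host, sep, port = target.rpartition(":")
            if not sep or not host or not port.isdigit():
                raise ValueError(f"expected host:port, got {target!r}")
        # Label values are written as strings, so coerce them here.
        entries.append({
            "targets": list(targets),
            "labels": {key: str(value) for key, value in labels.items()},
        })
    return entries


def write_file_sd(path, groups):
    """Write the target list to disk as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_file_sd(groups), fh, indent=2, ensure_ascii=False)
        fh.write("\n")
```

Passing two groups, one for `prometheus.center` and one for `prometheus.nodes`, to `write_file_sd("./node_exporter/node.json", groups)` would produce a file with the same structure as the one shown above.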
1.3 Configure the alerting rules in node_rules.yml
groups:
  - name: hostStatsAlert
    rules:
      - alert: hostCpuUsageAlert
        # node_cpu was renamed to node_cpu_seconds_total in node_exporter 0.16
        expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage high"
          description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value | humanize }})"
      - alert: hostMemUsageAlert
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes)/node_memory_MemTotal_bytes > 0.85
        # expr: (1 - ((avg_over_time(node_memory_MemFree_bytes[1h]) + avg_over_time(node_memory_Cached_bytes[1h]) + avg_over_time(node_memory_Buffers_bytes[1h])) / avg_over_time(node_memory_MemTotal_bytes[1h]))) > 0.55
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} MEM usage high"
          description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value | humanize }})"
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      # Alert for any instance that has a median request latency >1s.
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value | humanize }}s)"
Upload node_rules.yml to a path that Prometheus scans for alerting rules, that is, one matched by rule_files in prometheus.yml.
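The CPU rule nests two aggregations, which can be hard to read at a glance. The sketch below replays the same arithmetic in plain Python on hand-made numbers: it first averages the per-core rates for each mode, then sums the non-idle modes per instance. The function name and the input shape are assumptions made for illustration; this is not how Prometheus evaluates the query internally.

```python
from collections import defaultdict


def cpu_usage_by_instance(rates):
    """Mimic sum(avg without (cpu)(...{mode!='idle'})) by (instance).

    rates maps (instance, cpu, mode) to a per-second rate of CPU seconds.
    """
    # Step 1: avg without (cpu) -> mean over cores per (instance, mode).
    per_mode = defaultdict(list)
    for (instance, cpu, mode), rate in rates.items():
        if mode != "idle":  # same effect as the mode!='idle' selector
            per_mode[(instance, mode)].append(rate)
    # Step 2: sum by (instance) -> add the per-mode averages together.
    usage = defaultdict(float)
    for (instance, mode), values in per_mode.items():
        usage[instance] += sum(values) / len(values)
    return dict(usage)
```

The result is a fraction between 0 and 1 per instance, which is why the rule compares it against 0.85 to express 85% utilization.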
1.4 Reload the Prometheus configuration
curl -XPOST http://localhost:9090/-/reload
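The reload call can also be issued from a deployment script. Below is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes Prometheus was started with `--web.enable-lifecycle`, since the reload endpoint is disabled otherwise.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the Prometheus /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    # An empty body with method="POST" mirrors `curl -XPOST`.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=10.0):
    """Trigger a reload; return True on HTTP 200.

    urllib raises HTTPError for 4xx/5xx responses, for example when the
    new configuration is rejected or the endpoint is not enabled.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status == 200
```

Splitting request construction from sending keeps the URL logic checkable without a running server.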
2. Docker container monitoring
Docker container metrics can be collected with cAdvisor.
2.1 Compose file
version: '3.2'
services:
  cadvisor:
    # Newer cAdvisor releases are published as gcr.io/cadvisor/cadvisor
    image: google/cadvisor:latest
    container_name: cadvisor
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime
    ports:
      - "18080:8080"
# Grafana dashboard IDs: 893, 11558, 8321
After starting the container, open http://192.168.10.128:18080/metrics. If the page returns metric data, cAdvisor is running correctly.
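Beyond checking that the page is non-empty, it can help to confirm which containers cAdvisor actually sees. The sketch below scans a metrics payload for the `name` label on `container_last_seen` samples. It assumes that your cAdvisor version attaches a `name` label to Docker containers and exposes that metric; `container_names` is a hypothetical helper, so verify both assumptions against a real payload.

```python
import re

# Matches name="..." only when it is a whole label key, i.e. preceded
# by "{" or ",", so labels that merely end in "name" are ignored.
_NAME_LABEL = re.compile(r'[{,]name="([^"]*)"')


def container_names(payload, metric="container_last_seen"):
    """Return the non-empty container names found on the given metric."""
    names = set()
    for line in payload.splitlines():
        if not line.startswith(metric + "{"):
            continue
        match = _NAME_LABEL.search(line)
        # Entries such as the root cgroup carry an empty name; skip them.
        if match and match.group(1):
            names.add(match.group(1))
    return names
```

The payload can be fetched from http://192.168.10.128:18080/metrics with any HTTP client and passed in as a string.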
2.2 Configure prometheus.yml
scrape_configs:
  ...
  - job_name: 'container-exporter'
    file_sd_configs:
      - files: ['./container_exporter/container.json']
  ...
The ./container_exporter/container.json file looks like this:
```json
[
  {
    "targets": [
      "192.168.10.128:18080",
      "192.168.10.129:18080"
    ],
    "labels": {
      "service": "docker-monitor",
      "env": "dev"
    }
  }
]
```
2.3 Reload the Prometheus configuration
curl -XPOST http://localhost:9090/-/reload