Background
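Before reading this article, you should already have installed Prometheus-Operator with Helm and have some familiarity with Prometheus Rule and AlertManager Config. This article focuses on configuring Prometheus Operator in Kubernetes and provides some commonly used rules that readers can use directly.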
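If you edit the Deployment or ConfigMap directly, the changes are reverted. This is because a core feature of newer Prometheus Operator versions is to watch the Kubernetes API server for changes to specific objects and ensure that the current Prometheus deployments match those objects (see the Prometheus documentation). So instead of editing those objects directly, we have to change the custom resources. The Prometheus custom resource definitions (CRDs) can be listed with: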
kubectl get crds|grep coreos
Practice
Prometheus Rules
kubectl get PrometheusRule -n {namespace}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.39.1
    # See the note below
    prometheus: k8s
    release: prometheus
    role: alert-rules
  name: prometheus-k8s-pod-rules
spec:
  groups:
    - name: pod.rules
      rules:
        - alert: PodStatusUnhealth
          expr: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0
          labels:
            severity: Critical
          annotations:
            summary: "Pod status abnormal: {{ $labels.pod }}"
            description: "Pod status abnormal. Namespace: {{ $labels.namespace }}, Pod: {{ $labels.pod }}. Details: https://dashboard.xxx.com/clusters/name/projects/{{ $labels.namespace }}/pods/{{ $labels.pod }}/resource-status "
        - alert: PodCPUCritical
          expr: (sum(irate(container_cpu_usage_seconds_total{container!="",container!="otc-container"}[5m])) by (pod,namespace) / sum(kube_pod_container_resource_limits{container!="",container!="otc-container",resource="cpu"}) by (pod,namespace) * 100 <= 100 or on() vector(0)) > 80
          labels:
            severity: Critical
          annotations:
            summary: "Pod CPU usage too high: {{ $labels.pod }}"
            description: "Namespace: {{ $labels.namespace }}, Pod: {{ $labels.pod }}, CPU usage too high, current value: {{ $value }}. Details: https://dashboard.xxx.com/clusters/name/projects/{{ $labels.namespace }}/pods/{{ $labels.pod }}/monitors "
        - alert: PodMemoryCritical
          expr: (sum(container_memory_working_set_bytes{container!="",container!="otc-container"}) by (pod,namespace)/sum(kube_pod_container_resource_limits{container!="",container!="otc-container",resource="memory"}) by (pod,namespace) * 100 <= 100 or on() vector(0)) > 90
          labels:
            severity: Critical
          annotations:
            summary: "Pod memory usage too high: {{ $labels.pod }}"
            description: "Namespace: {{ $labels.namespace }}, Pod: {{ $labels.pod }}, memory usage too high, current value: {{ $value }}. Details: https://dashboard.xxx/clusters/name/projects/{{ $labels.namespace }}/pods/{{ $labels.pod }}/monitors "
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.39.1
    prometheus: k8s
    release: prometheus
    role: alert-rules
  name: prometheus-k8s-node-rules
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeCPUUsageHigh
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on node {{ $labels.instance }}"
            description: "Node {{ $labels.instance }} CPU usage is above 90% (current value: {{ $value }}%)."
        - alert: NodeCPUUsageCritical
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Critical CPU usage on node {{ $labels.instance }}"
            description: "Node {{ $labels.instance }} CPU usage is above 95% (current value: {{ $value }}%)."
        - alert: NodeMemoryUsageHigh
          expr: (sum by (instance) (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / sum by (instance) (node_memory_MemTotal_bytes) * 100) > 90
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on node {{ $labels.instance }}"
            description: "Node {{ $labels.instance }} memory usage is above 90% (current value: {{ $value }}%)."
        - alert: NodeMemoryUsageCritical
          expr: (sum by (instance) (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / sum by (instance) (node_memory_MemTotal_bytes) * 100) > 95
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Critical memory usage on node {{ $labels.instance }}"
            description: "Node {{ $labels.instance }} memory usage is above 95% (current value: {{ $value }}%)."
kubectl apply -f prometheus-rule.yaml
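Check whether your Prometheus resource restricts PrometheusRule objects by label, that is, whether only rules carrying a specific label take effect. Inspect the ruleSelector field of the Prometheus resource; here the required label is release: prometheus.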
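As a rough sketch of what to look for, the relevant part of a Prometheus resource might look like the fragment below. The resource name `k8s` and the namespace `monitoring` are assumptions made for illustration; only the `release: prometheus` label is taken from the rule manifests above.

```yaml
# Sketch only: metadata values are hypothetical, adjust to your cluster.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s              # assumed name
  namespace: monitoring  # assumed namespace
spec:
  # Only PrometheusRule objects whose labels match this selector are loaded.
  ruleSelector:
    matchLabels:
      release: prometheus
```

If a PrometheusRule is applied but never shows up in Prometheus, comparing its metadata.labels against this selector is usually the first thing to check. An empty selector (`ruleSelector: {}`) matches all objects.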
AlertManager Config
Before this step you need a Dingtalk webhook to deliver alerts to Dingtalk. Pick a Dingtalk group and configure a robot for it; that step is not covered here.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  labels:
    # See the note below for this label
    alertmanagerConfig: example
  name: config-example
spec:
  receivers:
    - name: dingtalk
      webhookConfigs:
        - url: http://10.0.24.8:8060/dingtalk/webhook/send
  route:
    groupBy:
      - job
      - alertname
    groupInterval: 5m
    groupWait: 30s
    receiver: dingtalk
    repeatInterval: 12h
kubectl -n {namespace} apply -f AlertmanagerConfig.yml
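In newer versions, an AlertmanagerConfig only takes effect once it has been explicitly selected. The Alertmanager resource has two fields for this, alertmanagerConfigNamespaceSelector and alertmanagerConfigSelector, which select the namespaces to load configs from and the labels a config must carry, respectively. Both default to {}.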
alertmanagerConfigNamespaceSelector:
  matchLabels:
    # Quoted so that YAML keeps the value a string instead of parsing it as a boolean
    alertmanager: "yes"
alertmanagerConfigSelector:
  matchLabels:
    alertmanagerConfig: example
Referring to my AlertmanagerConfig above: its labels include alertmanagerConfig: example, and the namespace that holds it carries the label alertmanager: yes. A namespace like the one below therefore meets the loading conditions:
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: prometheus
    # Quoted so that YAML keeps the value a string instead of parsing it as a boolean
    alertmanager: "yes"
  name: prometheus
spec:
  finalizers:
    - kubernetes
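Since the stack was installed with Helm, the same selectors could also be set through chart values rather than by editing the Alertmanager resource by hand. The following is a minimal sketch that assumes the kube-prometheus-stack chart; other charts may use different keys, so check your chart's values file before relying on it. The label names and values are the ones used in the manifests above.

```yaml
# values.yaml (sketch; assumes the kube-prometheus-stack chart)
alertmanager:
  alertmanagerSpec:
    # Load AlertmanagerConfig objects only from namespaces labelled alertmanager=yes
    alertmanagerConfigNamespaceSelector:
      matchLabels:
        alertmanager: "yes"
    # Load only AlertmanagerConfig objects labelled alertmanagerConfig=example
    alertmanagerConfigSelector:
      matchLabels:
        alertmanagerConfig: example
```

Under that assumption, the values would be applied with a regular `helm upgrade ... -f values.yaml` against your existing release. Keeping the selectors in values means a later chart upgrade does not overwrite them.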