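Kubernetes monitoring options
cadvisor+heapster+influxdb+grafana
Drawback: it only monitors container resources, cannot cover business (application-level) metrics, and is hard to extend.
cadvisor/exporter+prometheus+grafana
Overall flow: data collection --> aggregation --> processing --> storage --> visualization
- Container monitoring: Prometheus scrapes container metrics from cAdvisor, which is built into the kubelet; Prometheus stores the data and Grafana displays it.
- Node monitoring: node_exporter collects the resource metrics of each host; Prometheus stores the data and Grafana displays it.
- Master monitoring: the kube-state-metrics add-on retrieves apiserver-related data from Kubernetes; Prometheus stores the data and Grafana displays it.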
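Kubernetes monitoring metrics
Monitoring Kubernetes itself
- Node resource utilization: CPU, memory, disk, and network connections on each node
- Number of nodes: the ratio of node count to resource utilization and business load, used to assess cost and capacity expansion
- Number of pods: how many nodes and pods are running at a given load, to estimate which load stage the cluster has reached, roughly how many servers are needed, and how much each pod consumes
- Resource object status: while running, Kubernetes creates many pods, controllers, and jobs, all maintained as resource objects; monitor these objects to track their state
Pod monitoring
- Number of pods per project: how many are healthy and how many are failing
- Container resource utilization: resource usage of each pod and of the containers inside it (CPU, network, memory)
- Application: the application's own metrics, such as concurrency, request latency, number of users, and number of orders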
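Implementation approach
| Metric | Implementation | Example |
| --- | --- | --- |
| Pod performance | cadvisor | Container CPU and memory utilization |
| Node performance | node-exporter | Node CPU and memory utilization |
| Kubernetes resource objects | kube-state-metrics | pod/deployment/service |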
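Service discovery
Prometheus discovers scrape targets from the Kubernetes API and stays in sync with the cluster state:
targets are obtained dynamically, and the API is queried in real time to check whether each one still exists.
Official documentation
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
Supported discovery roles:
- node: discovers the nodes in the cluster
- pod: discovers running containers and their ports
- service: discovers Service IPs and ports
- endpoints: discovers the pods (containers) behind each Service's endpoints
- ingress: discovers ingress entry points and their rules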
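As a minimal sketch of how these roles appear in a Prometheus configuration, the snippet below writes a small scrape configuration with one job per role. The file name `sd-example.yml` and the job names are hypothetical; `kubernetes_sd_configs` and `role` are the fields described in the documentation linked above. TLS and authentication settings for the actual scrape are omitted, so treat this as an illustration rather than a ready-to-use configuration.

```shell
# Write a hypothetical scrape configuration fragment that uses Kubernetes
# service discovery. File and job names are illustrative only.
cat > sd-example.yml << 'EOF'
scrape_configs:
  # One target per cluster node.
  - job_name: example-nodes
    kubernetes_sd_configs:
      - role: node
  # One target per address/port pair behind each Service.
  - job_name: example-endpoints
    kubernetes_sd_configs:
      - role: endpoints
EOF

# List the discovery roles the fragment uses.
grep 'role:' sd-example.yml
```

Each additional role (pod, service, ingress) would follow the same pattern with its own job.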
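Monitoring Kubernetes with Prometheus
Deploying Prometheus in Kubernetes
Official deployment docs: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus
Creating the Prometheus PV/PVC
# Install the dependencies
yum -y install nfs-utils rpcbind
# Enable the services at boot
systemctl enable rpcbind.service
systemctl enable nfs-server.service
systemctl start rpcbind.service # listens on port 111
systemctl start nfs-server.service # listens on port 2049
# Create the shared directory /data/pvdata and export it (the leading # on the commands below is the root prompt;
# the exported subnet must include the Kubernetes nodes that will mount the share)
# mkdir -p /data/pvdata
# chown nfsnobody:nfsnobody /data/pvdata
# cat /etc/exports
/data/pvdata 172.22.22.0/24(rw,async,all_squash)
# exportfs -rv
exporting 172.22.22.0/24:/data/pvdata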
Downloading the Prometheus YAML manifests
mkdir -p /data/k8s/yaml/kube-system/prometheus
cd /data/k8s/yaml/kube-system/prometheus/
# Download the manifests from the kubernetes GitHub repository
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/prometheus-rbac.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/prometheus-configmap.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/prometheus-service.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/prometheus-statefulset.yaml
Modifying prometheus-statefulset.yaml
# Delete the last 10 lines
  volumeClaimTemplates:
  - metadata:
      name: prometheus-data
    spec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: "16Gi"
# Add the following 3 lines under volumes: in the pod template, so the pod uses the PVC created below
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus-data
Adding the PV/PVC manifest
mkdir -p /data/pvdata/prometheus
chown nfsnobody:nfsnobody /data/pvdata/prometheus
cat > prometheus-pvc-data.yaml << EFO
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-data
spec:
  storageClassName: prometheus-data
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /data/pvdata/prometheus
    server: 192.168.1.155
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: prometheus-data
EFO
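Adding prometheus-ingress.yaml
This mainly makes Prometheus reachable for an external Grafana. The manifest uses extensions/v1beta1, which matches the cluster used here; that API version was removed in Kubernetes 1.22, where Ingress is served from networking.k8s.io/v1.
cat > prometheus-ingress.yaml << EFO
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: kube-system
spec:
  rules:
  - host: prometheus.baiyongjie.com
    http:
      paths:
      - backend:
          serviceName: prometheus
          servicePort: 9090
EFO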
Applying the YAML files
# Deployment order
1. prometheus-rbac.yaml: authorizes Prometheus to access kube-apiserver
2. prometheus-configmap.yaml: holds the main Prometheus configuration file
3. prometheus-service.yaml: exposes Prometheus so it can be reached
4. prometheus-pvc-data.yaml: provides data storage for the pod
5. prometheus-statefulset.yaml: deploys Prometheus as a stateful workload
6. prometheus-ingress.yaml: serves Prometheus externally
# Apply the YAML files
kubectl apply -f prometheus-rbac.yaml
kubectl apply -f prometheus-configmap.yaml
kubectl apply -f prometheus-service.yaml
kubectl apply -f prometheus-pvc-data.yaml
kubectl apply -f prometheus-statefulset.yaml
kubectl apply -f prometheus-ingress.yaml
# Check the deployment
[root@master prometheus]# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
prometheus-data 10Gi RWO Recycle Bound kube-system/prometheus-data prometheus-data 32m
[root@master prometheus]# kubectl get pvc -n kube-system
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
prometheus-data Bound prometheus-data 10Gi RWO prometheus-data 33m
[root@master prometheus]# kubectl get service -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 12d
prometheus NodePort 10.107.69.131 <none> 9090/TCP 57m
[root@master prometheus]# kubectl get statefulsets.apps -n kube-system
NAME READY AGE
prometheus 1/1 15m
[root@master prometheus]# kubectl get ingresses.extensions -n kube-system
NAME HOSTS ADDRESS PORTS AGE
prometheus-ingress prometheus.baiyongjie.com 80 7m3s
[root@master prometheus]# kubectl get pods -n kube-system -o wide |grep prometheus
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-0 2/2 Running 0 42s 10.244.1.6 node01 <none> <none>
Accessing the ingress
# Edit the hosts file and add an entry that resolves the ingress domain
192.168.1.156 prometheus.baiyongjie.com
Then open http://prometheus.baiyongjie.com/graph
Deploying node-exporter
Downloading the YAML files
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/node-exporter-ds.yml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/node-exporter-service.yaml
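Because we want host-level metrics while node-exporter itself runs in a container, the Pod needs some host-access settings. Here we add three: hostPID: true, hostIPC: true, and hostNetwork: true, so the Pod uses the host's PID namespace, IPC namespace, and network. These namespaces are the kernel mechanism behind container isolation; note that they are entirely different from the namespaces of a Kubernetes cluster.
We also mount the host's /dev, /proc, and /sys directories into the container, because much of the node data is read from files under them. For example, the CPU usage shown by top comes from /proc/stat, and the memory usage shown by free comes from /proc/meminfo.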
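To make the /proc data sources concrete, here is a minimal sketch of two shell helpers that derive usage figures from those files. The function names are hypothetical, and the formulas are one common simplification (memory in use = total minus free, buffers, and cache; CPU busy time = everything except idle and iowait, measured since boot rather than over an interval as top does). node-exporter exposes the raw counters and leaves this arithmetic to the query.

```shell
# Percentage of memory in use, computed from a meminfo-style file.
# Defaults to /proc/meminfo; pass another path to read a saved copy.
mem_used_percent() {
  awk '
    /^MemTotal:/ { total = $2 }
    /^MemFree:/  { free = $2 }
    /^Buffers:/  { buffers = $2 }
    /^Cached:/   { cached = $2 }
    END {
      if (total > 0)
        printf "%.1f\n", (total - free - buffers - cached) / total * 100
    }
  ' "${1:-/proc/meminfo}"
}

# Percentage of CPU time spent non-idle since boot, taken from the
# aggregate "cpu" line of a stat-style file. Fields 2..9 are
# user nice system idle iowait irq softirq steal; the guest fields
# that follow are already counted in user/nice, so they are skipped.
cpu_busy_percent() {
  awk '
    $1 == "cpu" {
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i
      idle = $5 + $6   # idle + iowait
      if (total > 0)
        printf "%.1f\n", (total - idle) / total * 100
    }
  ' "${1:-/proc/stat}"
}
```

Called without arguments on a Linux host, `mem_used_percent` and `cpu_busy_percent` read the live files; inside a container they only reflect the host if the host's /proc is visible there, which is why the DaemonSet mounts it.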
In addition, because this cluster was built with kubeadm, the master node carries a taint; to monitor the master as well, add a matching toleration.
// Edit node-exporter-ds.yml
Add the following to the pod template spec:
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
Applying the YAML files
kubectl apply -f node-exporter-service.yaml
kubectl apply -f node-exporter-ds.yml
# Check the deployment
[root@master prometheus]# kubectl get pods -n kube-system |grep node-export
node-exporter-lb7gb 1/1 Running 0 4m59s
node-exporter-q22zn 1/1 Running 0 4m59s
[root@master prometheus]# kubectl get service -n kube-system |grep node-export
node-exporter ClusterIP None <none> 9100/TCP 5m49s
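Check whether Prometheus is receiving data: the node-exporter targets should be listed as UP on the Prometheus Targets page (Status > Targets).
Deploying kube-state-metrics
Downloading the YAML files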
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/kube-state-metrics-service.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/kube-state-metrics-rbac.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/kube-state-metrics-deployment.yaml
Applying the YAML files
kubectl apply -f kube-state-metrics-service.yaml
kubectl apply -f kube-state-metrics-rbac.yaml
kubectl apply -f kube-state-metrics-deployment.yaml
Deploying Grafana
Creating the YAML files
grafana-pvc.yaml
mkdir -p /data/pvdata/prometheus-grafana
chown nfsnobody:nfsnobody /data/pvdata/prometheus-grafana
cat > grafana-pvc.yaml << EFO
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-grafana
spec:
  storageClassName: prometheus-grafana
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /data/pvdata/prometheus-grafana
    server: 192.168.1.155
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-grafana
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: prometheus-grafana
EFO
grafana-ingress.yaml
cat > grafana-ingress.yaml << EFO
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: kube-system
spec:
  rules:
  - host: grafana.baiyongjie.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
EFO
grafana-deployment.yaml
cat > grafana-deployment.yaml << EFO
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
spec:
  revisionHistoryLimit: 10
  template:
    metadata:
      labels:
        app: grafana
        component: prometheus
    spec:
      containers:
      - name: grafana
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: admin
        image: grafana/grafana:5.3.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: grafana
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          subPath: grafana
          name: grafana-volumes
      volumes:
      - name: grafana-volumes
        persistentVolumeClaim:
          claimName: prometheus-grafana
EFO
Applying the YAML files
kubectl apply -f grafana-pvc.yaml
kubectl apply -f grafana-ingress.yaml
kubectl apply -f grafana-deployment.yaml
# Check the deployment
[root@master prometheus]# kubectl get service -n kube-system |grep grafana
grafana ClusterIP 10.105.159.132 <none> 3000/TCP 150m
[root@master prometheus]# kubectl get ingresses.extensions -n kube-system |grep grafana
grafana grafana.baiyongjie.com 80 150m
[root@master prometheus]# kubectl get pods -n kube-system |grep grafana
grafana-6f6d77d98d-wwmbd 1/1 Running 0 53m
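The output above lists a `grafana` Service on port 3000, and the ingress routes to it, but the steps do not show a manifest for it. A minimal sketch that would match that output, assuming the Service selects the `app: grafana` label set in the Deployment; the file name `grafana-service.yaml` and the port name are hypothetical, and the values are inferred rather than taken from the original files:

```shell
# Hypothetical manifest for the grafana Service. Values are inferred from
# the Deployment labels and the ingress backend shown in this document.
cat > grafana-service.yaml << 'EOF'
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
spec:
  type: ClusterIP
  selector:
    app: grafana
  ports:
  - name: grafana
    port: 3000
    targetPort: 3000
EOF
```

It would be applied like the other Grafana manifests, with `kubectl apply -f grafana-service.yaml`.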
Configuring Grafana
Add an entry for the ingress domain to your local hosts file, then open http://grafana.baiyongjie.com
- Import dashboards; recommended:
  - 3131 Kubernetes All Nodes
  - 3146 Kubernetes Pods
  - 8685 K8s Cluster Summary
  - 10000 Cluster Monitoring for Kubernetes
Deploying alertmanager
Downloading the YAML files
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/alertmanager-pvc.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/alertmanager-service.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/alertmanager-deployment.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/alertmanager-configmap.yaml
Modifying the YAML files
alertmanager-pvc.yaml
mkdir -p /data/pvdata/prometheus-alertmanager
chown nfsnobody:nfsnobody /data/pvdata/prometheus-alertmanager
cat > alertmanager-pvc.yaml << EFO
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-alertmanager
spec:
  storageClassName: prometheus-alertmanager
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /data/pvdata/prometheus-alertmanager
    server: 192.168.1.155
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-alertmanager
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: prometheus-alertmanager
EFO
alertmanager-deployment.yaml
# Change the claimName on the last line
        - name: storage-volume
          persistentVolumeClaim:
            claimName: prometheus-alertmanager
Applying the YAML files
kubectl apply -f alertmanager-pvc.yaml
kubectl apply -f alertmanager-configmap.yaml
kubectl apply -f alertmanager-service.yaml
kubectl apply -f alertmanager-deployment.yaml
# Check the deployment
[root@master prometheus-ink8s]# kubectl get all -n kube-system |grep alertmanager
pod/alertmanager-c564cb9fc-bfrvb 2/2 Running 0 71s
service/alertmanager ClusterIP 10.102.208.66 <none> 80/TCP 5m44s
deployment.apps/alertmanager 1/1 1 1 71s
replicaset.apps/alertmanager-c564cb9fc 1 1 1 71s
Creating alert rules
// Edit the local prometheus-configmap.yaml, which is applied in the reload step below. The command that follows edits the live ConfigMap instead; if you use it, skip the later kubectl apply so the local file does not overwrite your change.
kubectl edit configmaps prometheus-config -n kube-system
// Add the following under prometheus.yml: |
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:80
    rule_files:
    - "/etc/config/rules.yml"
// Create the alert rules by adding the following at the bottom
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      - alert: NodeMemoryUsage
        # Evaluated per instance (no sum) so that the instance label used in the annotations is populated.
        # node_exporter 0.16 and later name these metrics with a _bytes suffix, e.g. node_memory_MemTotal_bytes.
        expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 20
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 20% (current value is: {{ $value }})"
// Reload the configuration
# kubectl apply -f prometheus-configmap.yaml
# kubectl get service -n kube-system |grep prometheus
prometheus ClusterIP 10.111.97.89 <none> 9090/TCP 4h42m
# curl -X POST http://10.111.97.89:9090/-/reload
Creating email alerts
# Rewrite alertmanager-configmap.yaml
cat > alertmanager-configmap.yaml << EFO
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 3m # an alert is treated as resolved if it has not been reported again within this time
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'USERNAME@163.com'
      smtp_auth_username: 'USERNAME@163.com'
      smtp_auth_password: 'PASSWORD'
      smtp_require_tls: false
    route:
      # group_by expects label names; 'example' is the rule group name rather than a label,
      # so all alerts end up in a single group. Use e.g. ['alertname'] to group per alert.
      group_by: ['example']
      group_wait: 60s
      group_interval: 60s
      repeat_interval: 12h
      receiver: 'mail'
    receivers:
    - name: 'mail'
      email_configs:
      - to: 'misterbyj@163.com'
        send_resolved: true
EFO
kubectl delete configmaps -n kube-system alertmanager-config
kubectl apply -f alertmanager-configmap.yaml
Viewing alerts
**Open Prometheus and check that the alert rules appear on the Alerts page**