Deployment architecture
Deployment method: Kubernetes
Node monitoring and GPU monitoring
- node-exporter + gpu-metrics-exporter
- prometheus + grafana
GPU monitoring
Project used:
pod-gpu-metrics-exporter
Prerequisites:
- NVIDIA Tesla drivers = R384+ (download from NVIDIA Driver Downloads page)
- nvidia-docker version > 2.0 (see how to install and its prerequisites)
- Set the default runtime to nvidia
- Kubernetes version = 1.13
- Set KubeletPodResources in /etc/default/kubelet: KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true
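The last prerequisite is easy to miss, so a quick check before deploying the exporter can help. The following is a minimal sketch, not part of the upstream project: the function name has_pod_resources_gate is hypothetical, and it only inspects the file text, it does not confirm that the running kubelet picked the flag up.

```shell
#!/bin/bash
# Return 0 if the given kubelet defaults file enables the KubeletPodResources
# feature gate through KUBELET_EXTRA_ARGS, non-zero otherwise.
# The default path is the one named in the prerequisites; pass another path to override it.
has_pod_resources_gate() {
  local file="${1:-/etc/default/kubelet}"
  grep -Eq '^[[:space:]]*KUBELET_EXTRA_ARGS=.*--feature-gates=[^[:space:]]*KubeletPodResources=true' "$file" 2>/dev/null
}
```

Usage on a node: `has_pod_resources_gate && echo "feature gate set" || echo "feature gate missing"`.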
Environment setup:
Install script (Ubuntu):
install-nvidia-docker.sh
#!/bin/bash
# Usage: bash install-nvidia-docker.sh <sudo-password>
password=$1
if [[ -z ${password} ]]
then
    echo "please run [bash $0 <sudo-password>]"
    exit 1
fi
# Install docker (sudo -S reads the password from stdin)
echo "${password}" | sudo -S apt-get update
echo "${password}" | sudo -S apt-get install -y curl && \
    curl -fsSL https://get.docker.com -o get-docker.sh && \
    echo "${password}" | sudo -S sh get-docker.sh
# Replace digisky with the user that should be allowed to run docker
echo "${password}" | sudo -S usermod -aG docker digisky
echo "${password}" | sudo -S systemctl enable docker
# nvidia-docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# stdin carries the key and the list here, so these two sudo calls rely on the credentials cached above
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
echo "${password}" | sudo -S apt-get update && \
    echo "${password}" | sudo -S apt-get install -y nvidia-container-toolkit nvidia-container-runtime
# nvidia-container-runtime: daemon.json must be in the current directory
echo "${password}" | sudo -S cp -f daemon.json /etc/docker/daemon.json
echo "${password}" | sudo -S systemctl restart docker
# gpu-monitoring-tools-master
daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirrors": ["https://vs2fctcq.mirror.aliyuncs.com"]
}
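Because the install script copies daemon.json over /etc/docker/daemon.json and restarts Docker, it can be worth checking the file first. The sketch below is an illustration only: both function names are hypothetical, and the text matching assumes the key and value sit on one line, as in the file above (it is not a full JSON parser).

```shell
#!/bin/bash
# Print the value of "default-runtime" from a Docker daemon.json file (empty if unset).
default_runtime_in() {
  sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Return 0 only if the file makes nvidia the default runtime and also
# defines a runtime whose path ends in nvidia-container-runtime.
nvidia_runtime_configured() {
  [ "$(default_runtime_in "$1")" = "nvidia" ] &&
    grep -q '"path"[[:space:]]*:[[:space:]]*"[^"]*nvidia-container-runtime"' "$1"
}
```

Usage: `nvidia_runtime_configured daemon.json && echo "daemon.json looks right"`. After the restart, `docker info --format '{{.DefaultRuntime}}'` should report nvidia if the setting took effect.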
pod-gpu-metrics-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: gpu-metrics-exporter
    app.kubernetes.io/version: latest
  name: gpu-metrics-exporter
  namespace: monitor
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: pod-gpu-metrics-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: pod-gpu-metrics-exporter
        app.kubernetes.io/part-of: gpu-metrics-exporter
        app.kubernetes.io/version: latest
      name: pod-gpu-metrics-exporter
    spec:
      containers:
      - image: xxx/pod-gpu-metrics-exporter:latest
        imagePullPolicy: Always
        name: pod-nvidia-gpu-metrics-exporter
        ports:
        - containerPort: 9400
          hostPort: 59101
          name: gpu-port
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
        - mountPath: /run/prometheus
          name: device-metrics
          readOnly: true
      - image: xxx/dcgm-exporter:latest
        imagePullPolicy: Always
        name: nvidia-dcgm-exporter
        volumeMounts:
        - mountPath: /run/prometheus
          name: device-metrics
      dnsPolicy: ClusterFirst
      # imagePullSecrets:
      # - name: hub-out
      restartPolicy: Always
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - emptyDir:
          medium: Memory
        name: device-metrics
Collected metrics
Metric | Description |
---|---|
dcgm_fan_speed_percent | GPU fan speed (%) |
dcgm_sm_clock | GPU SM clock (MHz) |
dcgm_memory_clock | GPU memory clock (MHz) |
dcgm_gpu_temp | GPU temperature (℃) |
dcgm_power_usage | GPU power draw (W) |
dcgm_pcie_tx_throughput | Total bytes transmitted over PCIe TX (KB) |
dcgm_pcie_rx_throughput | Total bytes received over PCIe RX (KB) |
dcgm_pcie_replay_counter | Total number of PCIe retries |
dcgm_gpu_utilization | GPU utilization (%) |
dcgm_mem_copy_utilization | GPU memory utilization (%) |
dcgm_enc_utilization | GPU encoder utilization (%) |
dcgm_dec_utilization | GPU decoder utilization (%) |
dcgm_xid_errors | Value of the last XID error on the GPU |
dcgm_power_violation | Throttling duration due to power constraints (us) |
dcgm_thermal_violation | Throttling duration due to thermal constraints (us) |
dcgm_sync_boost_violation | Throttling duration due to sync-boost constraints (us) |
dcgm_fb_free | Free GPU framebuffer memory (MiB) |
dcgm_fb_used | Used GPU framebuffer memory (MiB) |
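As an example of combining two of these metrics, the sketch below derives framebuffer usage per GPU from dcgm_fb_used and dcgm_fb_free. It is only an illustration: the function name fb_usage_percent is hypothetical, and it assumes every sample carries a gpu="<index>" label and no trailing timestamp.

```shell
#!/bin/bash
# Read Prometheus text-format metrics on stdin and print framebuffer usage per GPU.
# Assumption: each dcgm_fb_* sample has a gpu="<index>" label and the value is the last field.
fb_usage_percent() {
  awk '
    # Extract the value of the gpu label from a sample line.
    function gpu_id(line) {
      if (match(line, /gpu="[^"]*"/)) {
        return substr(line, RSTART + 5, RLENGTH - 6)
      }
      return "unknown"
    }
    /^dcgm_fb_used[{ ]/ { used[gpu_id($0)] = $NF }
    /^dcgm_fb_free[{ ]/ { free[gpu_id($0)] = $NF }
    END {
      for (g in used) {
        if (!(g in free)) continue
        total = used[g] + free[g]
        if (total > 0) {
          printf "gpu %s: %.1f%% of %d MiB used\n", g, 100 * used[g] / total, total
        }
      }
    }
  ' | sort
}
```

Usage, assuming the exporter is reachable on the hostPort from the DaemonSet above and serves metrics under /gpu/metrics (the path is an assumption to verify against the deployed image): `curl -s http://<node-ip>:59101/gpu/metrics | fb_usage_percent`.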
Node monitoring
YAML after modification, tested successfully:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
  name: node-exporter
  namespace: monitor
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=0.0.0.0:59100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(var.*|run.*|boot.*|snap.*|dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        image: xxx/node-exporter:latest
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        # When hostNetwork is true, containerPort and hostPort must be identical
        - containerPort: 59100
          hostPort: 59100
          name: node-port
          protocol: TCP
        resources:
          limits:
            cpu: 250m
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      # The following settings let the exporter collect the node's real data
      hostIPC: true
      hostNetwork: true
      hostPID: true
      # Secret for the image registry
      # imagePullSecrets:
      # - name: hub-out
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Always
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys
      - hostPath:
          path: /
          type: ""
        name: root
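The --collector.filesystem.ignored-mount-points regex in the manifest above decides which mount points are left out of the filesystem metrics. A rough way to see its effect is to run the same pattern through grep -E. This is a sketch under assumptions: the function names are hypothetical, mount points are given as seen on the host, and grep's extended regex is assumed to behave like node-exporter's matcher for this particular pattern.

```shell
#!/bin/bash
# Regex copied from the --collector.filesystem.ignored-mount-points argument.
IGNORED_MOUNT_POINTS='^/(var.*|run.*|boot.*|snap.*|dev|proc|sys|var/lib/docker/.+)($|/)'

# Return 0 if the pattern matches the given mount point (it would be skipped), 1 otherwise.
is_ignored_mount() {
  printf '%s\n' "$1" | grep -Eq -- "$IGNORED_MOUNT_POINTS"
}

# Print one verdict line per mount point passed as an argument.
report_mounts() {
  local mp
  for mp in "$@"; do
    if is_ignored_mount "$mp"; then
      echo "ignored  $mp"
    else
      echo "reported $mp"
    fi
  done
}
```

Usage: `report_mounts / /home /var/lib/docker/overlay2/abc`. Note that var.* already covers everything under /var, so the separate var/lib/docker/.+ alternative is redundant but harmless.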
Prometheus
Targets are discovered through file_sd_configs (file-based service discovery).
prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus-dev'
    file_sd_configs:
      - files:
        - prometheus-etc.json
alerting:
  alertmanagers:
    - static_configs:
      - targets: ['192.168.20.75:9093']
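The notes do not show prometheus-etc.json itself. A file_sd file is a JSON array of target groups, each with a "targets" list and optional "labels". The sketch below generates one group for node-exporter (port 59100) and one for the GPU exporter (port 59101), the hostPorts used in the DaemonSets above. The function names, the exporter label, and the node IPs in the usage line are all hypothetical.

```shell
#!/bin/bash
# Print one file_sd target group: every node IP combined with the given port,
# tagged with a custom "exporter" label (a hypothetical label name).
file_sd_group() {
  local port="$1" exporter="$2" targets="" ip
  shift 2
  for ip in "$@"; do
    targets="${targets:+$targets, }\"${ip}:${port}\""
  done
  printf '  {"targets": [%s], "labels": {"exporter": "%s"}}' "$targets" "$exporter"
}

# Print the whole file: one group per exporter, for the node IPs given as arguments.
make_prometheus_etc() {
  printf '[\n'
  file_sd_group 59100 node "$@"
  printf ',\n'
  file_sd_group 59101 gpu "$@"
  printf '\n]\n'
}
```

Usage: `make_prometheus_etc 10.0.0.11 10.0.0.12 > prometheus-etc.json`, written to the location that prometheus.yaml references. If an exporter serves metrics on a path other than the default /metrics, it would need its own job with metrics_path set.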
Grafana
Reference sites: