1. Install DCGM:
# rpm --install datacenter-gpu-manager-1.5.6-1.x86_64.rpm
# dcgmi --version
# nvvs --version
Start the host engine listener:
# nv-hostengine
List the GPU devices:
# dcgmi discovery -l
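The three steps above (check the install, start the listener, list the GPUs) can be combined into one helper. This is a minimal sketch, assuming the rpm put nv-hostengine and dcgmi on the PATH; the function name ensure_hostengine is made up for illustration and is not part of DCGM.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_hostengine is a hypothetical helper name.
# It starts nv-hostengine if no such process is running, then lists the GPUs.
ensure_hostengine() {
    # Both binaries come from the datacenter-gpu-manager package.
    command -v nv-hostengine >/dev/null 2>&1 || { echo "nv-hostengine not found" >&2; return 1; }
    command -v dcgmi >/dev/null 2>&1 || { echo "dcgmi not found" >&2; return 1; }

    # pgrep -x matches the exact process name, so a second copy is not started.
    if ! pgrep -x nv-hostengine >/dev/null 2>&1; then
        nv-hostengine || return 1
    fi

    # Same discovery command as in the step above.
    dcgmi discovery -l
}
```

After sourcing the file, run ensure_hostengine as root; a non-zero return code means a binary is missing or the host engine failed to start.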
2. Install gpu-mon:
# go get -u github.com/open-falcon/gpu-mon
# pwd
/root/go/src/github.com/open-falcon/gpu-mon
# make
gofmt -s -w ./args.go ./fetch/metrics.go ./fetch/dcgm.go ./fetch/fetch.go ./common/config.go ./common/log.go ./common/log_test.go ./common/utils.go ./common/config_test.go ./common/common.go ./send/send_test.go ./send/send.go ./send/utils.go ./send/utils_test.go ./main.go
building gpu-mon ...
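On newer Go toolchains, go get may no longer place the sources under GOPATH. One possible fallback is to clone the repository into the same GOPATH-style directory and run make there. The sketch below assumes git and make are installed; the helper names gpu_mon_src_dir and build_gpu_mon are hypothetical, and the build has not been verified against any particular Go version.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical.

# Print the GOPATH-style checkout directory (GOPATH defaults to $HOME/go).
gpu_mon_src_dir() {
    printf '%s\n' "${GOPATH:-$HOME/go}/src/github.com/open-falcon/gpu-mon"
}

# Clone the repository if the directory is missing, then build with make.
build_gpu_mon() {
    local src
    src="$(gpu_mon_src_dir)" || return 1
    if [ ! -d "$src" ]; then
        git clone https://github.com/open-falcon/gpu-mon "$src" || return 1
    fi
    make -C "$src"
}
```

For the root user with GOPATH unset, gpu_mon_src_dir prints the same path as the pwd output above.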
3. Use the plugin
The Open-Falcon agent's plugin feature must be enabled:
edit agent/config/cfg.json
and set "enabled" to true in the plugin section.
# cp gpu-mon cfg.example.json 60_gpuMonitor.sh /root/open-falcon/agent/plugin/
# pwd
/root/open-falcon/agent/plugin
# mv cfg.example.json cfg.json
/root/open-falcon/plugin is the plugin path
# pwd
/root/open-falcon/plugin
# ls
60_gpuMonitor.sh cfg.json gpu-mon logs
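Before restarting the agent, it can help to confirm that the plugin directory looks like the listing above. The sketch below checks that gpu-mon and 60_gpuMonitor.sh are executable and that cfg.json exists; check_plugin_dir is a hypothetical helper name. By Open-Falcon's plugin naming convention, the numeric prefix of a plugin script is its run interval in seconds, so 60_gpuMonitor.sh should be invoked every 60 seconds.

```shell
#!/usr/bin/env bash
# Sketch only: check_plugin_dir is a hypothetical helper name.
# Usage: check_plugin_dir /root/open-falcon/plugin
check_plugin_dir() {
    local dir="${1:-.}"
    local status=0
    local f

    # The binary and the wrapper script must both be executable.
    for f in gpu-mon 60_gpuMonitor.sh; do
        if [ ! -x "$dir/$f" ]; then
            echo "missing or not executable: $dir/$f" >&2
            status=1
        fi
    done

    # cfg.json is the renamed cfg.example.json.
    if [ ! -f "$dir/cfg.json" ]; then
        echo "missing: $dir/cfg.json" >&2
        status=1
    fi

    return "$status"
}
```

The function returns 0 only when all three files are in place; the logs directory is not checked.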
4. Configuration file
Use cfg.json as the reference for the configuration; the relevant settings are described below:
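{
    "falcon": {
        // Agent: address of the falcon agent that metrics are reported to
        "Agent": "http://127.0.0.1:1988/v1/agent"
    },
    "metric": {
        // ignoreMetrics: GPU metrics that should not be reported
        "ignoreMetrics": [
        ],
        // endpoint: endpoint value; defaults to the machine's hostname
        "endpoint": ""
    },
    "log": {
        // level: log level; supports Info, Warn, Error, and Debug; defaults to Warn
        "level": "Warn",
        // dir: directory where logs are stored
        "dir": "./logs"
    }
}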
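The // comments above are explanatory, and strict JSON parsers reject them. Whether gpu-mon's own config loader tolerates comments has not been checked here, so the sketch below writes a comment-free file with the same keys and values. write_gpu_mon_cfg is a hypothetical helper name; the default agent address is copied from the example above.

```shell
#!/usr/bin/env bash
# Sketch only: write_gpu_mon_cfg is a hypothetical helper name.
# Usage: write_gpu_mon_cfg [output-file] [agent-url]
write_gpu_mon_cfg() {
    local out="${1:-cfg.json}"
    # Default taken from the example above; adjust to your agent's address.
    local agent="${2:-http://127.0.0.1:1988/v1/agent}"

    # Same keys as the annotated example, with the comments removed.
    cat > "$out" <<EOF
{
    "falcon": {
        "Agent": "$agent"
    },
    "metric": {
        "ignoreMetrics": [],
        "endpoint": ""
    },
    "log": {
        "level": "Warn",
        "dir": "./logs"
    }
}
EOF
}
```

Calling write_gpu_mon_cfg /root/open-falcon/plugin/cfg.json overwrites that file, so back up any existing configuration first.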
Reference:
https://github.com/open-falcon/gpu-mon