In IT, performance monitoring and log collection are key to keeping systems stable and finding optimization opportunities. There are plenty of mature solutions on the market, from commercial enterprise products such as Datadog and New Relic to free open-source options such as Zabbix and the ELK stack. Each has its own strengths and weaknesses, so choose based on your requirements and budget.
This article walks through a stack built from Grafana + Prometheus + Loki plus a handful of supporting components,
combined into a complete performance-monitoring and log-collection system. Below is a record of my recent setup process and the pitfalls I ran into; it is a fairly long read ...
1. Tool overview
1.1 Grafana
Grafana is an open-source metrics analytics and visualization tool from Grafana Labs. It queries the collected data, visualizes it, and can send timely alert notifications. Key points:
- Open source
- Rich data-source support: Prometheus, Graphite, Elasticsearch, and more
- An intuitive UI with dashboards you can build yourself or import from ready-made templates
1.2 Prometheus
Prometheus is a monitoring and alerting system plus time-series database open-sourced by SoundCloud. Key points:
- Open source
- PromQL, a flexible query language over time series, well suited to monitoring dynamic, large-scale environments
- A large ecosystem of components that together provide a complete monitoring solution
1.3 Loki
Loki is an open-source log aggregation system from Grafana Labs that efficiently collects and processes logs in different formats. Key points:
- Horizontally scalable with no central node; log data is shipped to collectors
- Logs can be filtered and queried by label, and log streams can be redirected to backend storage
- Easy to scale out to large, distributed clusters
1.4 Other components
- node-exporter: collects host information to monitor the server hardware and OS
- mysqld-exporter: collects MySQL performance data to monitor the database
- redis_exporter: collects Redis performance data to monitor the cache
- promtail: collects logs and ships them to Loki
- the Go Prometheus client library: collects metrics from the Go application to monitor the program itself
2. Setup process
2.1 Versions
All of the tools used in this install are listed below with their versions; everything is deployed via Docker:
- Server OS: CentOS Linux release 7.9.2009 (Core)
- Docker: 26.1.4
- docker-compose: v2.29.0
- Grafana: 11.1.4
- Prometheus: 2.53.2
- Loki: 3.1.1
- node-exporter: 1.8.2
- mysqld-exporter: 0.15.1
- redis_exporter: v1.62.0
- promtail: 3.1.1
2.2 Installing grafana, prometheus and loki
2.2.1 Deployment
docker-compose.yaml
Here docker-compose is used to manage grafana, prometheus and loki together.
These three run on a single server dedicated to monitoring; the other components run on their respective target servers.
---
networks:
  my_network:
    driver: bridge

services:
  grafana:
    image: grafana/grafana:11.1.4
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - ./grafana_data:/var/lib/grafana
      - ./grafana_logs:/var/log/grafana
    environment:
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - loki
      - prometheus
    networks:
      - my_network

  loki:
    image: grafana/loki:3.1.1
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - ./loki_data:/var/lib/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - my_network

  prometheus:
    image: prom/prometheus:v2.53.2
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - my_network
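One thing worth noting in the file above: unlike grafana and loki, the prometheus service has no data volume, so metric history is lost whenever the container is recreated. Below is a minimal sketch of how the service could be extended to persist data; the ./prometheus_data path and the 15d retention are just example values, and because command overrides the image default, --config.file has to be passed explicitly.
  prometheus:
    image: prom/prometheus:v2.53.2
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus_data:/prometheus       # /prometheus is the data directory used by the official image
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d   # example retention, adjust as needed
    networks:
      - my_network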
loki-config.yaml
This is only a first pass: the ingested log data is stored on the local host for now, and will later be moved to object storage such as OSS or S3.
---
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false
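For reference, when the log data is eventually moved off the local filesystem as mentioned above, the storage-related parts of this config are roughly what changes. Below is a minimal sketch assuming an S3-compatible bucket; the endpoint, bucket name and credentials are placeholders, not values from this setup.
common:
  storage:
    s3:
      endpoint: <s3-or-oss-endpoint>     # placeholder
      region: <region>                   # placeholder
      bucketnames: <bucket-name>         # placeholder
      access_key_id: <access-key>        # placeholder
      secret_access_key: <secret-key>    # placeholder
      s3forcepathstyle: true
schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: s3                   # switch from filesystem to s3
      schema: v13
      index:
        prefix: index_
        period: 24h
If there is already data on disk, the usual approach is to append a new schema_config entry with a future from date rather than editing the existing one.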
prometheus.yml
Replace each xxx with your own server address.
---
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['xxx:9100', 'xxx:9100']

  - job_name: 'my_app'
    static_configs:
      - targets: ['xxx']

  - job_name: 'mysql_exporter'
    static_configs:
      - targets: ['xxx:9104']

  - job_name: 'redis_exporter'
    static_configs:
      - targets: ['xxx:9121']
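Optionally, when several servers are scraped under the same job, static_configs also accepts a labels block per target, which makes it easier to tell hosts apart in dashboards. A small sketch with made-up label names and values:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['xxx:9100']
        labels:
          host: app-server-1   # example label
      - targets: ['xxx:9100']
        labels:
          host: db-server-1    # example label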
2.2.2 Verification
Once the containers are up, open http://xxx:3000 and you should see the Grafana UI.
Prometheus also provides a simple UI of its own: open http://xxx:9090 and check whether the monitored targets are being scraped correctly (the Status -> Targets page is handy for this).
2.3 Installing components on the target servers
docker-compose.yml
---
services:
  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    ports:
      - "9100:9100"
    restart: unless-stopped

  mysql_exporter:
    image: prom/mysqld-exporter:v0.15.1
    container_name: mysql_exporter
    volumes:
      - ./my.cnf:/.my.cnf
    ports:
      - "9104:9104"
    restart: unless-stopped
    networks:
      - my_network

  redis_exporter:
    image: oliver006/redis_exporter:v1.62.0
    container_name: redis_exporter
    environment:
      - REDIS_ADDR=redis://redis:6379
    ports:
      - "9121:9121"
    restart: unless-stopped
    networks:
      - my_network

  promtail:
    image: grafana/promtail:3.1.1
    container_name: promtail
    volumes:
      - ./logs:/var/log
      - ./promtail-config.yaml:/etc/promtail/promtail-config.yaml
    command: -config.file=/etc/promtail/promtail-config.yaml

networks:
  my_network:
    driver: bridge
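One caveat with node_exporter in a container as configured above: without access to the host's filesystem it largely reports the container's own view of the machine. A commonly recommended approach is to mount the host root filesystem read-only and point the exporter at it; a sketch of what that could look like here (adjust to your environment):
  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    pid: host                      # see the host's processes
    volumes:
      - /:/host:ro,rslave          # read-only view of the host filesystem
    command:
      - --path.rootfs=/host
    ports:
      - "9100:9100"
    restart: unless-stopped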
promtail-config.yaml
---
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/log/positions.yaml   # positions file that stores the current read offset for each log file

clients:
  - url: http://xxx:3100/loki/api/v1/push   # Loki push endpoint

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs              # job label, pick any name you like
          __path__: /var/log/*.log  # path of the log files to read
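Since promtail runs on every target server and pushes to the same Loki instance, it is worth adding a label that identifies the host, so the log streams can be told apart in Grafana. A small sketch reusing the scrape config above; the host label name and value are arbitrary:
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          host: app-server-1        # any identifier for this machine
          __path__: /var/log/*.log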
my.cnf
With this version you must use a my.cnf config file to connect to MySQL. Some articles pass DATA_SOURCE_NAME="exporter:password@(mysql_host:3306)/" as an environment variable instead, but in my testing that does not work: docker logs xx reports that the my.cnf config file cannot be found, presumably due to the version in use.
[client]
user=exporter
password=xxxx
host=mysql
port=3306
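If the exporter still cannot locate the file, one option to try is pointing it at the config explicitly with the --config.my-cnf flag; whether this is necessary depends on the image version and its default lookup path. A sketch of the service with the flag added:
  mysql_exporter:
    image: prom/mysqld-exporter:v0.15.1
    container_name: mysql_exporter
    volumes:
      - ./my.cnf:/.my.cnf
    command:
      - --config.my-cnf=/.my.cnf     # point the exporter at the mounted config
    ports:
      - "9104:9104"
    restart: unless-stopped
    networks:
      - my_network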
2.4 Adding a Prometheus client to the application to expose metrics
2.4.1 Installing the Prometheus client
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp
2.4.2 A simple wrapper to register and expose metrics
Directory structure: the metrics package is split across the four files below.
interface.go defines all the interfaces
package metrics

import "net/http"

type MetInterface interface {
    IncRequestsCounter(method, route string, code int)
    ObserveRequestDuration(route string, duration float64)
    IncErrorsCounter(method, route, code string)
    ExposeHandler() http.Handler
}
metrics.go implements the interface, defining and registering the metrics
package metrics

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

type ServerMetrics struct {
    // total number of requests
    RequestsTotal *prometheus.CounterVec
    // RT: request duration
    RequestDuration *prometheus.HistogramVec
    // number of errors
    ErrorsTotal *prometheus.CounterVec
}

func NewServerMetrics() MetInterface {
    return &ServerMetrics{
        RequestsTotal:   createCounterVec("http_requests_total", "Total number of HTTP requests.", []string{"method", "route", "code"}),
        RequestDuration: createHistogramVec("http_request_duration_seconds", "HTTP request latencies in seconds.", []string{"route"}),
        ErrorsTotal:     createCounterVec("http_errors_total", "Total number of HTTP errors.", []string{"method", "code", "route"}),
    }
}

// IncRequestsCounter increments the request counter
func (m *ServerMetrics) IncRequestsCounter(method, route string, code int) {
    m.RequestsTotal.WithLabelValues(method, route, http.StatusText(code)).Inc()
}

// ObserveRequestDuration records the request duration
func (m *ServerMetrics) ObserveRequestDuration(route string, duration float64) {
    m.RequestDuration.WithLabelValues(route).Observe(duration)
}

// IncErrorsCounter increments the error counter
func (m *ServerMetrics) IncErrorsCounter(method, route, code string) {
    m.ErrorsTotal.WithLabelValues(method, code, route).Inc()
}

// ExposeHandler exposes all registered metrics
func (m *ServerMetrics) ExposeHandler() http.Handler {
    return promhttp.Handler()
}
counters.go creates counter-type metrics
package metrics

import "github.com/prometheus/client_golang/prometheus"

func createCounterVec(name, help string, labels []string) *prometheus.CounterVec {
    cv := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: name,
            Help: help,
        },
        labels,
    )
    // register with the default registry
    prometheus.MustRegister(cv)
    return cv
}
histograms.go creates histogram-type metrics
package metrics

import "github.com/prometheus/client_golang/prometheus"

func createHistogramVec(name, help string, labels []string) *prometheus.HistogramVec {
    hv := prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    name,
            Help:    help,
            Buckets: prometheus.DefBuckets,
        },
        labels,
    )
    // register with the default registry
    prometheus.MustRegister(hv)
    return hv
}
2.4.3 Creating middleware to record the metrics
router.go
r := gin.Default()

// initialize the metrics wrapper
metricsCtrl := metrics.NewServerMetrics()

// Important! Expose /metrics before attaching the middleware below; otherwise
// requests to /metrics would also go through the middleware and trigger errors
r.GET("/metrics", gin.WrapH(metricsCtrl.ExposeHandler()))

r.Use(
    middleware.LoggerHandler(metricsCtrl),
)
logger.go besides recording metrics, this middleware also writes the access log; since the two overlap quite a bit, they are combined into a single middleware
package middleware

import (
    "bytes"
    "encoding/json"
    "strconv"
    "time"

    "github.com/gin-gonic/gin"
    "link/internal/constant"
    "link/internal/helper"
    "link/internal/logger"
    "link/metrics"
    "link/pkg"
)

type responseBodyWriter struct {
    gin.ResponseWriter
    body *bytes.Buffer
}

func (w responseBodyWriter) Write(b []byte) (int, error) {
    w.body.Write(b)
    return w.ResponseWriter.Write(b)
}

type RespBody struct {
    Code    int         `json:"code"`
    Message string      `json:"message"`
    Data    interface{} `json:"data,omitempty"`
    Cause   string      `json:"cause,omitempty"`
}

func LoggerHandler(metrics metrics.MetInterface) gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()

        // read the request body
        reqBody, err := helper.ProcessRequestBody(c)
        if err != nil {
            pkg.ErrorByStatusOK(c, constant.UnknownCode, err)
            c.Abort()
            return
        }

        // collect the request headers
        headers := make(map[string]string)
        for k, v := range c.Request.Header {
            headers[k] = v[0]
        }

        // capture the response body
        bodyWriter := &responseBodyWriter{
            body:           bytes.NewBufferString(""),
            ResponseWriter: c.Writer,
        }
        c.Writer = bodyWriter

        c.Next()

        statusCode := c.Writer.Status()
        duration := time.Since(start).Seconds()
        method := c.Request.Method
        route := c.FullPath()

        respBodyBytes := bodyWriter.body.Bytes()
        var respJson RespBody
        if err := json.Unmarshal(respBodyBytes, &respJson); err != nil {
            pkg.ErrorByStatusOK(c, constant.UnknownCode, err)
            c.Abort()
            return
        }

        // record the metrics
        metrics.IncRequestsCounter(method, route, statusCode)
        metrics.ObserveRequestDuration(route, duration)

        // write the log entry
        fields := []interface{}{
            "duration", duration * 1000,
            "method", method,
            "path", route,
            "request_headers", headers,
            "request_body", reqBody,
            "ip", c.ClientIP(),
            "user_agent", c.Request.UserAgent(),
            "status", statusCode,
            "response", respJson,
        }
        if respJson.Code != constant.Success {
            metrics.IncErrorsCounter(method, route, strconv.Itoa(respJson.Code))
            logger.With(fields...).Error("HTTP request failed")
        } else {
            logger.With(fields...).Info("HTTP request success")
        }
    }
}
2.5 Testing the monitoring components
Once installed, every component exposes its metrics at http://ip:port/metrics, so you can verify them directly in a browser, for example:
- mysql_exporter: http://xxx:9104/metrics
- node_exporter: http://xxx:9100/metrics
- redis_exporter: http://xxx:9121/metrics
- the application's Go client: http://xxx/metrics
2.6 Configuring Grafana
With all the tools installed, it is time to configure Grafana.
Open http://xxx:3000 and log in; the default username and password are both admin (as defined in the container's environment variables).
2.6.1 Optionally switch the UI language to Chinese
Click the avatar -> Profile, or the gear icon on the left -> Default preferences.
Note: this does not localize everything; menus are translated, but some detail pages remain in English.
2.6.2 Adding data sources
Add Prometheus and Loki as data sources by filling in their addresses; click the Save & test button and Grafana will confirm whether the connection works.
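As an alternative to clicking through the UI, the compose file above already sets GF_PATHS_PROVISIONING, so the data sources can also be provisioned from a file placed under /etc/grafana/provisioning/datasources/ (this would require mounting that directory into the grafana container, e.g. ./provisioning:/etc/grafana/provisioning). A minimal sketch; the prometheus and loki hostnames resolve because all three services share my_network:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100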
2.6.3 Adding dashboards
With the data sources in place you can add dashboards, either built from scratch or imported from the many templates Grafana provides.
For a custom dashboard the workflow is roughly: pick a data source, enter the PromQL for the metric you want to chart, then adjust options such as the visualization type and the title. If you get stuck, asking an AI assistant works well.
To use a template, search https://grafana.com/grafana/dashboards/ for one you need, copy its ID, and import it in Grafana.
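To make the custom-dashboard route a bit more concrete, with the application metrics defined in section 2.4 a couple of typical panel queries could look like the following (assuming the metric names are unchanged):
- requests per second by route: sum(rate(http_requests_total[5m])) by (route)
- 95th-percentile latency by route: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))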
2.7 Results
2.8 Configuring alerts
Note: Grafana alerting is driven by queries over time-series data, so time-series panels (utilization, error rate, and so on) are the panel type most commonly used for alert rules; single-value panels such as request count or error count are usually not a good fit.
Find the dashboard panel you want to alert on, click the three dots in its top-right corner, and create a new alert rule.
The threshold is set mainly in this step.
Further on you configure how long the condition must hold before the alert fires, as well as the notification channel (email by default, which requires SMTP to be configured; DingTalk and others are also supported).
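As a concrete example of the kind of rule involved: using the metrics from section 2.4, an error-rate alert could be built on an expression along the lines of sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) with a threshold of 0.05, firing when more than 5% of requests fail over a five-minute window; the exact query and threshold are of course up to you.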
That is about it for now; I will add more once I have dug deeper.
3. Summary
All of the tools above are deployed with docker-compose. If you want to run them directly on the host instead, or need to install docker and docker-compose themselves, a quick web search will cover it.
The strengths of this stack are that it is fully open source, flexible to customize, easy to deploy, and relatively lightweight compared with other solutions, which makes it a good fit for small and medium-sized projects and teams; the downside is that it takes some time and effort to learn.
It is best to run it in a test environment for a while first, get familiar with the workflow and the results, and only move it to production once you are confident everything works.