In IT, performance monitoring and log collection are key to keeping systems stable and finding optimization opportunities. There are plenty of mature solutions on the market, from commercial enterprise products such as Datadog and New Relic to free open-source options such as Zabbix and the ELK stack. Each has its own strengths and weaknesses, so choose based on your requirements and budget.
This article walks through a stack built from Grafana + Prometheus + Loki plus a handful of supporting components,
combined into a complete performance-monitoring and log-collection system. Below is a record of my recent setup process and the pitfalls I ran into; it is a fairly long read ...
1. Tool overview
1.1 Grafana
Grafana is an open-source metrics analytics and visualization tool from Grafana Labs. It queries the collected data, visualizes it, and can send timely alert notifications. Key points:
- Open source
- Rich data-source support: Prometheus, Graphite, Elasticsearch, and more
- An intuitive UI with dashboards you can build yourself or import from ready-made templates
1.2 Prometheus
Prometheus is a monitoring and alerting system plus time-series database open-sourced by SoundCloud. Key points:
- Open source
- PromQL, a flexible query language over time series, well suited to monitoring dynamic, large-scale environments
- A large ecosystem of components that together provide a complete monitoring solution
1.3 Loki
Loki is an open-source log aggregation system from Grafana Labs that efficiently collects and processes logs in different formats. Key points:
- Horizontally scalable with no central node; log data is shipped to collectors
- Logs can be filtered and queried by label, and log streams can be redirected to backend storage
- Easy to scale out to large, distributed clusters
1.4 Other components
- node-exporter: collects host information to monitor the server hardware and OS
- mysqld-exporter: collects MySQL performance data to monitor the database
- redis_exporter: collects Redis performance data to monitor the cache
- promtail: collects logs and ships them to Loki
- the Go Prometheus client library: collects metrics from the Go application to monitor the program itself
2. Setup process
2.1 Versions
All of the tools used in this install are listed below with their versions; everything is deployed via Docker:
- Server OS: CentOS Linux release 7.9.2009 (Core)
- Docker: 26.1.4
- docker-compose: v2.29.0
- Grafana: 11.1.4
- Prometheus: 2.53.2
- Loki: 3.1.1
- node-exporter: 1.8.2
- mysqld-exporter: 0.15.1
- redis_exporter: v1.62.0
- promtail: 3.1.1
2.2 Installing grafana, prometheus and loki
2.2.1 Deployment
docker-compose.yaml
Here docker-compose is used to manage grafana, prometheus and loki together.
These three run on a single server dedicated to monitoring; the other components run on their respective target servers.
---
networks:
  my_network:
    driver: bridge

services:
  grafana:
    image: grafana/grafana:11.1.4
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - ./grafana_data:/var/lib/grafana
      - ./grafana_logs:/var/log/grafana
    environment:
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - loki
      - prometheus
    networks:
      - my_network

  loki:
    image: grafana/loki:3.1.1
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - ./loki_data:/var/lib/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - my_network

  prometheus:
    image: prom/prometheus:v2.53.2
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - my_network
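One thing worth noting in the file above: unlike grafana and loki, the prometheus service has no data volume, so metric history is lost whenever the container is recreated. Below is a minimal sketch of how the service could be extended to persist data; the ./prometheus_data path and the 15d retention are just example values, and because command overrides the image default, --config.file has to be passed explicitly.
  prometheus:
    image: prom/prometheus:v2.53.2
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus_data:/prometheus       # /prometheus is the data directory used by the official image
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d   # example retention, adjust as needed
    networks:
      - my_network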
loki-config.yaml
This is only a first pass: the ingested log data is stored on the local host for now, and will later be moved to object storage such as OSS or S3.
---
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false
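For reference, when the log data is eventually moved off the local filesystem as mentioned above, the storage-related parts of this config are roughly what changes. Below is a minimal sketch assuming an S3-compatible bucket; the endpoint, bucket name and credentials are placeholders, not values from this setup.
common:
  storage:
    s3:
      endpoint: <s3-or-oss-endpoint>     # placeholder
      region: <region>                   # placeholder
      bucketnames: <bucket-name>         # placeholder
      access_key_id: <access-key>        # placeholder
      secret_access_key: <secret-key>    # placeholder
      s3forcepathstyle: true
schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: s3                   # switch from filesystem to s3
      schema: v13
      index:
        prefix: index_
        period: 24h
If there is already data on disk, the usual approach is to append a new schema_config entry with a future from date rather than editing the existing one.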
prometheus.yml
Replace each xxx with your own server address.
---
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['xxx:9100', 'xxx:9100']

  - job_name: 'my_app'
    static_configs:
      - targets: ['xxx']

  - job_name: 'mysql_exporter'
    static_configs:
      - targets: ['xxx:9104']

  - job_name: 'redis_exporter'
    static_configs:
      - targets: ['xxx:9121']
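Optionally, when several servers are scraped under the same job, static_configs also accepts a labels block per target, which makes it easier to tell hosts apart in dashboards. A small sketch with made-up label names and values:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['xxx:9100']
        labels:
          host: app-server-1   # example label
      - targets: ['xxx:9100']
        labels:
          host: db-server-1    # example label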
2.2.2 Verification
Once the containers are up, open http://xxx:3000 and you should see the Grafana UI.
Prometheus also provides a simple UI of its own: open http://xxx:9090 and check whether the monitored targets are being scraped correctly (the Status -> Targets page is handy for this).
2.3 Installing components on the target servers
docker-compose.yml
---
services:
  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    ports:
      - "9100:9100"
    restart: unless-stopped

  mysql_exporter:
    image: prom/mysqld-exporter:v0.15.1
    container_name: mysql_exporter
    volumes:
      - ./my.cnf:/.my.cnf
    ports:
      - "9104:9104"
    restart: unless-stopped
    networks:
      - my_network

  redis_exporter:
    image: oliver006/redis_exporter:v1.62.0
    container_name: redis_exporter
    environment:
      - REDIS_ADDR=redis://redis:6379
    ports:
      - "9121:9121"
    restart: unless-stopped
    networks:
      - my_network

  promtail:
    image: grafana/promtail:3.1.1
    container_name: promtail
    volumes:
      - ./logs:/var/log
      - ./promtail-config.yaml:/etc/promtail/promtail-config.yaml
    command: -config.file=/etc/promtail/promtail-config.yaml

networks:
  my_network:
    driver: bridge
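One caveat with node_exporter in a container as configured above: without access to the host's filesystem it largely reports the container's own view of the machine. A commonly recommended approach is to mount the host root filesystem read-only and point the exporter at it; a sketch of what that could look like here (adjust to your environment):
  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    pid: host                      # see the host's processes
    volumes:
      - /:/host:ro,rslave          # read-only view of the host filesystem
    command:
      - --path.rootfs=/host
    ports:
      - "9100:9100"
    restart: unless-stopped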
promtail-config.yaml
---
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/log/positions.yaml   # positions file that stores the current read offset for each log file

clients:
  - url: http://xxx:3100/loki/api/v1/push   # Loki push endpoint

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs              # job label, pick any name you like
          __path__: /var/log/*.log  # path of the log files to read
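Since promtail runs on every target server and pushes to the same Loki instance, it is worth adding a label that identifies the host, so the log streams can be told apart in Grafana. A small sketch reusing the scrape config above; the host label name and value are arbitrary:
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          host: app-server-1        # any identifier for this machine
          __path__: /var/log/*.log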
my.cnf
With this version you must use a my.cnf config file to connect to MySQL. Some articles pass DATA_SOURCE_NAME="exporter:password@(mysql_host:3306)/" as an environment variable instead, but in my testing that does not work: docker logs xx reports that the my.cnf config file cannot be found, presumably due to the version in use.
[client]
user=exporter
password=xxxx
host=mysql
port=3306
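If the exporter still cannot locate the file, one option to try is pointing it at the config explicitly with the --config.my-cnf flag; whether this is necessary depends on the image version and its default lookup path. A sketch of the service with the flag added:
  mysql_exporter:
    image: prom/mysqld-exporter:v0.15.1
    container_name: mysql_exporter
    volumes:
      - ./my.cnf:/.my.cnf
    command:
      - --config.my-cnf=/.my.cnf     # point the exporter at the mounted config
    ports:
      - "9104:9104"
    restart: unless-stopped
    networks:
      - my_network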
2.4 Adding a Prometheus client to the application to expose metrics
2.4.1 Installing the Prometheus client
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp
2.4.2 A simple wrapper to register and expose metrics
Directory structure: the metrics package is split across the four files below.
interface.go defines all the interfaces
package metrics

import "net/http"

type MetInterface interface {
    IncRequestsCounter(method, route string, code int)
    ObserveRequestDuration(route string, duration float64)
    IncErrorsCounter(method, route, code string)
    ExposeHandler() http.Handler
}
metrics.go implements the interface, defining and registering the metrics
package metrics

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

type ServerMetrics struct {
    // total number of requests
    RequestsTotal *prometheus.CounterVec
    // RT: request duration
    RequestDuration *prometheus.HistogramVec
    // number of errors
    ErrorsTotal *prometheus.CounterVec
}

func NewServerMetrics() MetInterface {
    return &ServerMetrics{
        RequestsTotal:   createCounterVec("http_requests_total", "Total number of HTTP requests.", []string{"method", "route", "code"}),
        RequestDuration: createHistogramVec("http_request_duration_seconds", "HTTP request latencies in seconds.", []string{"route"}),
        ErrorsTotal:     createCounterVec("http_errors_total", "Total number of HTTP errors.", []string{"method", "code", "route"}),
    }
}

// IncRequestsCounter increments the request counter
func (m *ServerMetrics) IncRequestsCounter(method, route string, code int) {
    m.RequestsTotal.WithLabelValues(method, route, http.StatusText(code)).Inc()
}

// ObserveRequestDuration records the request duration
func (m *ServerMetrics) ObserveRequestDuration(route string, duration float64) {
    m.RequestDuration.WithLabelValues(route).Observe(duration)
}

// IncErrorsCounter increments the error counter
func (m *ServerMetrics) IncErrorsCounter(method, route, code string) {
    m.ErrorsTotal.WithLabelValues(method, code, route).Inc()
}

// ExposeHandler exposes all registered metrics
func (m *ServerMetrics) ExposeHandler() http.Handler {
    return promhttp.Handler()
}
counters.go creates counter-type metrics
package metrics

import "github.com/prometheus/client_golang/prometheus"

func createCounterVec(name, help string, labels []string) *prometheus.CounterVec {
    cv := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: name,
            Help: help,
        },
        labels,
    )
    // register with the default registry
    prometheus.MustRegister(cv)
    return cv
}
histograms.go creates histogram-type metrics
package metrics

import "github.com/prometheus/client_golang/prometheus"

func createHistogramVec(name, help string, labels []string) *prometheus.HistogramVec {
    hv := prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    name,
            Help:    help,
            Buckets: prometheus.DefBuckets,
        },
        labels,
    )
    // register with the default registry
    prometheus.MustRegister(hv)
    return hv
}
2.4.3 Creating middleware to record the metrics
router.go
r := gin.Default()

// initialize the metrics wrapper
metricsCtrl := metrics.NewServerMetrics()

// Important! Expose /metrics before attaching the middleware below; otherwise
// requests to /metrics would also go through the middleware and trigger errors
r.GET("/metrics", gin.WrapH(metricsCtrl.ExposeHandler()))

r.Use(
    middleware.LoggerHandler(metricsCtrl),
)
logger.go besides recording metrics, this middleware also writes the access log; since the two overlap quite a bit, they are combined into a single middleware
package middleware

import (
    "bytes"
    "encoding/json"
    "strconv"
    "time"

    "github.com/gin-gonic/gin"
    "link/internal/constant"
    "link/internal/helper"
    "link/internal/logger"
    "link/metrics"
    "link/pkg"
)

type responseBodyWriter struct {
    gin.ResponseWriter
    body *bytes.Buffer
}

func (w responseBodyWriter) Write(b []byte) (int, error) {
    w.body.Write(b)
    return w.ResponseWriter.Write(b)
}

type RespBody struct {
    Code    int         `json:"code"`
    Message string      `json:"message"`
    Data    interface{} `json:"data,omitempty"`
    Cause   string      `json:"cause,omitempty"`
}

func LoggerHandler(metrics metrics.MetInterface) gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()

        // read the request body
        reqBody, err := helper.ProcessRequestBody(c)
        if err != nil {
            pkg.ErrorByStatusOK(c, constant.UnknownCode, err)
            c.Abort()
            return
        }

        // collect the request headers
        headers := make(map[string]string)
        for k, v := range c.Request.Header {
            headers[k] = v[0]
        }

        // capture the response body
        bodyWriter := &responseBodyWriter{
            body:           bytes.NewBufferString(""),
            ResponseWriter: c.Writer,
        }
        c.Writer = bodyWriter

        c.Next()

        statusCode := c.Writer.Status()
        duration := time.Since(start).Seconds()
        method := c.Request.Method
        route := c.FullPath()

        respBodyBytes := bodyWriter.body.Bytes()
        var respJson RespBody
        if err := json.Unmarshal(respBodyBytes, &respJson); err != nil {
            pkg.ErrorByStatusOK(c, constant.UnknownCode, err)
            c.Abort()
            return
        }

        // record the metrics
        metrics.IncRequestsCounter(method, route, statusCode)
        metrics.ObserveRequestDuration(route, duration)

        // write the log entry
        fields := []interface{}{
            "duration", duration * 1000,
            "method", method,
            "path", route,
            "request_headers", headers,
            "request_body", reqBody,
            "ip", c.ClientIP(),
            "user_agent", c.Request.UserAgent(),
            "status", statusCode,
            "response", respJson,
        }
        if respJson.Code != constant.Success {
            metrics.IncErrorsCounter(method, route, strconv.Itoa(respJson.Code))
            logger.With(fields...).Error("HTTP request failed")
        } else {
            logger.With(fields...).Info("HTTP request success")
        }
    }
}
2.5 Testing the monitoring components
Once installed, every component exposes its metrics at http://ip:port/metrics, so you can verify them directly in a browser, for example:
- mysql_exporter: http://xxx:9104/metrics
- node_exporter: http://xxx:9100/metrics
- redis_exporter: http://xxx:9121/metrics
- the application's Go client: http://xxx/metrics
2.6 Configuring Grafana
With all the tools installed, it is time to configure Grafana.
Open http://xxx:3000 and log in; the default username and password are both admin (as defined in the container's environment variables).
2.6.1 Optionally switch the UI language to Chinese
Click the avatar -> Profile, or the gear icon on the left -> Default preferences.
Note: this does not localize everything; menus are translated, but some detail pages remain in English.
2.6.2 Adding data sources
Add Prometheus and Loki as data sources by filling in their addresses; click the Save & test button and Grafana will confirm whether the connection works.
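As an alternative to clicking through the UI, the compose file above already sets GF_PATHS_PROVISIONING, so the data sources can also be provisioned from a file placed under /etc/grafana/provisioning/datasources/ (this would require mounting that directory into the grafana container, e.g. ./provisioning:/etc/grafana/provisioning). A minimal sketch; the prometheus and loki hostnames resolve because all three services share my_network:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100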
2.6.3 Adding dashboards
With the data sources in place you can add dashboards, either built from scratch or imported from the many templates Grafana provides.
For a custom dashboard the workflow is roughly: pick a data source, enter the PromQL for the metric you want to chart, then adjust options such as the visualization type and the title. If you get stuck, asking an AI assistant works well.
To use a template, search https://grafana.com/grafana/dashboards/ for one you need, copy its ID, and import it in Grafana.
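To make the custom-dashboard route a bit more concrete, with the application metrics defined in section 2.4 a couple of typical panel queries could look like the following (assuming the metric names are unchanged):
- requests per second by route: sum(rate(http_requests_total[5m])) by (route)
- 95th-percentile latency by route: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))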
2.7 Results
2.8 Configuring alerts
Note: Grafana alerting is driven by queries over time-series data, so time-series panels (utilization, error rate, and so on) are the panel type most commonly used for alert rules; single-value panels such as request count or error count are usually not a good fit.
Find the dashboard panel you want to alert on, click the three dots in its top-right corner, and create a new alert rule.
The threshold is set mainly in this step.
Further on you configure how long the condition must hold before the alert fires, as well as the notification channel (email by default, which requires SMTP to be configured; DingTalk and others are also supported).
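As a concrete example of the kind of rule involved: using the metrics from section 2.4, an error-rate alert could be built on an expression along the lines of sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) with a threshold of 0.05, firing when more than 5% of requests fail over a five-minute window; the exact query and threshold are of course up to you.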
That is about it for now; I will add more once I have dug deeper.
3. Summary
All of the tools above are deployed with docker-compose. If you want to run them directly on the host instead, or need to install docker and docker-compose themselves, a quick web search will cover it.
The strengths of this stack are that it is fully open source, flexible to customize, easy to deploy, and relatively lightweight compared with other solutions, which makes it a good fit for small and medium-sized projects and teams; the downside is that it takes some time and effort to learn.
It is best to run it in a test environment for a while first, get familiar with the workflow and the results, and only move it to production once you are confident everything works.