SkyWalking alarm dynamic configuration

Dynamic configuration support

This feature is disabled by default. SkyWalking currently supports two kinds of dynamic configuration: Single and Group.

  • Single: {configKey}:{configValue}
  • Group
{configKey}: |{subItemkey1}:{subItemValue1}
             |{subItemkey2}:{subItemValue2}
             |{subItemkey3}:{subItemValue3}
             ...      

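The configurations supported in Single mode include alarm-settings, so alarm rules use Single mode: the key alarm.default.alarm-settings holds a value that overrides the contents of the alarm-settings.yml file. For the format of that value, see alarm-settings.yml.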

With Nacos, the provider in use is NacosConfigurationProvider and the core class is NacosConfigWatcherRegister. The corresponding dataId is generated by the following rule:

// org.apache.skywalking.oap.server.configuration.api.ConfigWatcherRegister
@Getter
protected class WatcherHolder {
    private ConfigChangeWatcher watcher;
    private final String key;

    public WatcherHolder(ConfigChangeWatcher watcher) {
        this.watcher = watcher;
        // The key (dataId) is generated here: module.provider.itemName
        this.key = String.join(
            ".", watcher.getModule(), watcher.getProvider().name(),
            watcher.getItemName()
        );
    }
}

For AlarmRulesWatcher, the registered key is alarm.default.alarm-settings:

// org.apache.skywalking.oap.server.core.alarm.provider.AlarmRulesWatcher
// moduleName: alarm,  provider: default, itemName: alarm-settings
super(AlarmModule.NAME, provider, "alarm-settings");
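
As a quick check of the naming rule, here is a minimal sketch that reproduces only the String.join call from WatcherHolder. The class WatcherKeyDemo and the method watcherKey are hypothetical names used for illustration, not part of SkyWalking; the three input strings are the ones listed in the comment above.

```java
// Hypothetical helper, not SkyWalking code: it only mirrors the String.join call.
class WatcherKeyDemo {
    // Joins module, provider and item name with dots, as WatcherHolder does.
    static String watcherKey(String module, String provider, String itemName) {
        return String.join(".", module, provider, itemName);
    }

    public static void main(String[] args) {
        // moduleName: alarm, provider: default, itemName: alarm-settings
        System.out.println(watcherKey("alarm", "default", "alarm-settings"));
        // prints alarm.default.alarm-settings
    }
}
```

The printed value is the dataId to create in Nacos for the alarm rules.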

To see the full list of keys registered by the other dynamic configuration watchers, set a breakpoint in NacosConfigWatcherRegister#readConfig.


Configuration

  1. Note the difference between the cluster module and the configuration module: both have a nacos section, but they serve completely different purposes.

Cluster management configuration

  nacos:
    serviceName: ${SW_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
    hostPort: ${SW_CLUSTER_NACOS_HOST_PORT:nacos.soa.dev.test.com:8848}
    # Nacos Configuration namespace; enter the namespace ID here, not its name
    namespace: ${SW_CLUSTER_NACOS_NAMESPACE:"xxxxxx"}
    # Nacos auth username
    username: ${SW_CLUSTER_NACOS_USERNAME:"nacos"}
    password: ${SW_CLUSTER_NACOS_PASSWORD:"nacos"}
    # Nacos auth accessKey
    accessKey: ${SW_CLUSTER_NACOS_ACCESSKEY:""}
    secretKey: ${SW_CLUSTER_NACOS_SECRETKEY:""}

Once this is configured, the registration can be seen in the Nacos service list.


Configuration center configuration

  nacos:
    # Nacos Server Host
    serverAddr: ${SW_CONFIG_NACOS_SERVER_ADDR:nacos.soa.dev.abc.com}
    # Nacos Server Port
    port: ${SW_CONFIG_NACOS_SERVER_PORT:8848}
    # Nacos Configuration Group
    group: ${SW_CONFIG_NACOS_SERVER_GROUP:DEFAULT_GROUP}
    # Nacos Configuration namespace; enter the namespace ID here, not its name
    namespace: ${SW_CONFIG_NACOS_SERVER_NAMESPACE:xxxx}
    # Unit seconds, sync period. Default fetch every 60 seconds.
    period: ${SW_CONFIG_NACOS_PERIOD:60}
    # Nacos auth username
    username: ${SW_CONFIG_NACOS_USERNAME:"nacos"}
    password: ${SW_CONFIG_NACOS_PASSWORD:"nacos"}
    # Nacos auth accessKey
    accessKey: ${SW_CONFIG_NACOS_ACCESSKEY:""}
    secretKey: ${SW_CONFIG_NACOS_SECRETKEY:""}

The registered listeners can be queried in Nacos, which shows that the configuration has taken effect.


After the configuration is modified in Nacos, the OAP server log shows the change being picked up.

Check logic

Core classes: AlarmCore and RunningRule

RunningRule.Window: the metrics window. It keeps the most recent period buckets (one per minute), and the rule is evaluated against them.

  1. Alarm checking and sending logic
public void start(List<AlarmCallback> allCallbacks) {
        LocalDateTime now = LocalDateTime.now();
        lastExecuteTime = now;
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                final List<AlarmMessage> alarmMessageList = new ArrayList<>(30);
                LocalDateTime checkTime = LocalDateTime.now();
                // Whole minutes elapsed between the last execution time and now
                int minutes = Minutes.minutesBetween(lastExecuteTime, checkTime).getMinutes();
                boolean[] hasExecute = new boolean[]{false};
                alarmRulesWatcher.getRunningContext().values().forEach(ruleList -> ruleList.forEach(runningRule -> {
                    // The timer fires every 10s, but the body only runs once at least one minute has passed since the last execution
                    if (minutes > 0) {
                        // Slide the time window forward: drop the oldest bucket(s) and append new buckets set to null
                        runningRule.moveTo(checkTime);
                        /*
                         * Don't run in the first quarter per min, avoid to trigger false alarm.
                         */
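                        // Skip the first 15 seconds of each minute (per the upstream comment, to avoid false alarms).
                        // Otherwise check whether the stored Metrics satisfy the rule and add any matches to the message list.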
                        if (checkTime.getSecondOfMinute() > 15) {
                            hasExecute[0] = true;
                            alarmMessageList.addAll(runningRule.check());
                        }
                    }
                }));
                // Set the last execute time, and make sure the second is `00`, such as: 18:30:00
                // Save the last execution time, truncated to the minute
                if (hasExecute[0]) {
                    lastExecuteTime = checkTime.minusSeconds(checkTime.getSecondOfMinute());
                }

                if (alarmMessageList.size() > 0) {
                    if (alarmRulesWatcher.getCompositeRules().size() > 0) {
                        List<AlarmMessage> messages = alarmRulesWatcher.getCompositeRuleEvaluator().evaluate(alarmRulesWatcher.getCompositeRules(), alarmMessageList);
                        alarmMessageList.addAll(messages);
                    }
                    List<AlarmMessage> filteredMessages = alarmMessageList.stream().filter(msg -> !msg.isOnlyAsCondition()).collect(Collectors.toList());
                    if (filteredMessages.size() > 0) {
                        // Actually send the messages
                        allCallbacks.forEach(callback -> callback.doAlarm(filteredMessages));
                    }
                }
            } catch (Exception e) {
                LOGGER.error(e.getMessage(), e);
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
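
To make the timing concrete, the sketch below replays only the gating conditions of the loop above (minutes > 0 and second of minute > 15) over a series of 10-second ticks. It is an illustration, not SkyWalking code: the class CheckGateDemo and the method checkTimes are hypothetical, java.time stands in for the Joda-Time calls used in the original, and the window movement and rule evaluation are left out.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the scheduler's gating conditions only.
class CheckGateDemo {
    // Returns the tick times at which runningRule.check() would be reached.
    static List<LocalDateTime> checkTimes(LocalDateTime start, int ticks) {
        List<LocalDateTime> fired = new ArrayList<>();
        LocalDateTime lastExecuteTime = start;
        for (int i = 1; i <= ticks; i++) {
            // One tick every 10 seconds, like scheduleAtFixedRate(..., 10, 10, SECONDS)
            LocalDateTime checkTime = start.plusSeconds(10L * i);
            long minutes = ChronoUnit.MINUTES.between(lastExecuteTime, checkTime);
            // At least one whole minute elapsed, and not in the first 15 seconds
            if (minutes > 0 && checkTime.getSecond() > 15) {
                fired.add(checkTime);
                // Same effect as minusSeconds(getSecondOfMinute()) in the original
                lastExecuteTime = checkTime.withSecond(0);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.of(2024, 1, 1, 18, 30, 5);
        // 18 ticks cover three minutes
        System.out.println(checkTimes(start, 18));
        // prints [2024-01-01T18:31:25, 2024-01-01T18:32:25]
    }
}
```

With these assumptions the check is reached once per minute, on the first tick that falls after second 15.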
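  2. Metrics collection: RunningRule#in. The values list in RunningRule.Window holds the most recent metrics. Adding a metric works as follows:
    1. First move the window to the newest position. An incoming metric is usually newer than the end time advanced by the timer, or falls in the same bucket.
    2. If the window's end time is ahead of the metric's time bucket by the window size or more (minutes >= values.size()), the data is too old, possibly because the client clock is wrong, so return without storing it.
    3. Store the metric in the bucket that corresponds to its time.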
public class Window {
    private LocalDateTime endTime;
    private int period;
    private int silenceCountdown;

    private LinkedList<Metrics> values;
    
    // Initialization: pre-fill the window with period empty (null) buckets
    public Window(int period) {
        this.period = period;
        // -1 means silence countdown is not running.
        silenceCountdown = -1;
        values = new LinkedList<>();
        for (int i = 0; i < period; i++) {
            values.add(null);
        }
    }
    
    public void add(Metrics metrics) {
            long bucket = metrics.getTimeBucket();

            LocalDateTime timeBucket = TIME_BUCKET_FORMATTER.parseLocalDateTime(bucket + "");

            this.lock.lock();
            try {
                if (this.endTime == null) {
                    init();
                    this.endTime = timeBucket;
                }
                int minutes = Minutes.minutesBetween(timeBucket, this.endTime).getMinutes();
                if (minutes < 0) {
                    this.moveTo(timeBucket);
                    minutes = 0;
                }

                if (minutes >= values.size()) {
                    // too old data
                    // also should happen, but maybe if agent/probe mechanism time is not right.
                    if (log.isTraceEnabled()) {
                        log.trace(
                            "Timebucket is {}, endTime is {} and value size is {}", timeBucket, this.endTime,
                            values.size()
                        );
                    }
                    return;
                }

                this.values.set(values.size() - minutes - 1, metrics);
            } finally {
                this.lock.unlock();
            }
            if (log.isTraceEnabled()) {
                log.trace("Add metric {} to window {}", metrics, transformValues(this.values));
            }
        }
}        
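
The indexing in add can be easier to follow on a stripped-down model. The sketch below is an assumption-laden simplification, not the SkyWalking class: MiniWindow is a hypothetical name, minutes are plain long counters instead of LocalDateTime time buckets, values are Integers instead of Metrics, and locking is omitted.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical, single-threaded model of the sliding window.
class MiniWindow {
    private static final long UNSET = Long.MIN_VALUE;
    private final LinkedList<Integer> values = new LinkedList<>();
    private long endMinute = UNSET; // minute of the newest bucket

    MiniWindow(int period) {
        for (int i = 0; i < period; i++) {
            values.add(null);
        }
    }

    // Slide forward so the last bucket represents targetMinute.
    void moveTo(long targetMinute) {
        if (endMinute == UNSET) {
            endMinute = targetMinute;
            return;
        }
        long shift = Math.min(targetMinute - endMinute, values.size());
        for (long i = 0; i < shift; i++) {
            values.removeFirst(); // drop the oldest bucket
            values.addLast(null); // append an empty bucket
        }
        if (shift > 0) {
            endMinute = targetMinute;
        }
    }

    // Returns false when the value is too old to fit in the window.
    boolean add(long minute, int value) {
        if (endMinute == UNSET) {
            endMinute = minute;
        }
        long minutes = endMinute - minute; // how far behind the newest bucket
        if (minutes < 0) {
            moveTo(minute);
            minutes = 0;
        }
        if (minutes >= values.size()) {
            return false;
        }
        values.set((int) (values.size() - minutes - 1), value);
        return true;
    }

    List<Integer> snapshot() {
        return new ArrayList<>(values);
    }

    public static void main(String[] args) {
        MiniWindow w = new MiniWindow(3);
        w.add(100, 1);
        w.add(101, 2); // newer than the end: window slides by one
        w.add(100, 9); // same window, one minute behind the end
        System.out.println(w.snapshot());  // prints [null, 9, 2]
        System.out.println(w.add(97, 5));  // prints false (too old)
        w.add(103, 7); // slides by two, minute 102 stays empty
        System.out.println(w.snapshot());  // prints [2, null, 7]
    }
}
```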
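  3. Silence handling: on each per-minute check, silenceCountdown is inspected. If it is 1 or more, the silence period has not elapsed, so it is decremented and no message is returned. If it is below 1, the silence period has elapsed, so a message is returned and the countdown is reset to silencePeriod.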
public Optional<AlarmMessage> checkAlarm() {
    if (isMatch()) {
        /*
         * When
         * 1. Alarm trigger conditions are satisfied.
         * 2. Isn't in silence stage, judged by SilenceCountdown(!=0).
         */
        if (silenceCountdown < 1) {
            silenceCountdown = silencePeriod;
            return Optional.of(new AlarmMessage());
        } else {
            silenceCountdown--;
        }
    } else {
        silenceCountdown--;
    }
    return Optional.empty();
}
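
To see which checks actually produce a message, the following sketch replays the countdown logic of checkAlarm over a sequence of per-minute results. SilenceDemo and firingMinutes are hypothetical names, and isMatch() is replaced by a boolean array supplied by the caller; everything else follows the branch structure shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replay of the silence countdown, not SkyWalking code.
class SilenceDemo {
    private final int silencePeriod;
    private int silenceCountdown = -1; // -1: countdown is not running

    SilenceDemo(int silencePeriod) {
        this.silencePeriod = silencePeriod;
    }

    // Returns true when an alarm message would be emitted for this check.
    boolean check(boolean matched) {
        if (matched) {
            if (silenceCountdown < 1) {
                silenceCountdown = silencePeriod; // restart the silence period
                return true;
            }
            silenceCountdown--;
        } else {
            silenceCountdown--;
        }
        return false;
    }

    // Indexes (minutes) of the checks that emit a message.
    static List<Integer> firingMinutes(int silencePeriod, boolean[] matchedPerMinute) {
        SilenceDemo rule = new SilenceDemo(silencePeriod);
        List<Integer> fired = new ArrayList<>();
        for (int minute = 0; minute < matchedPerMinute.length; minute++) {
            if (rule.check(matchedPerMinute[minute])) {
                fired.add(minute);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        boolean[] alwaysMatched = {true, true, true, true, true, true, true};
        System.out.println(firingMinutes(2, alwaysMatched));
        // prints [0, 3, 6]
    }
}
```

Under these assumptions, a rule that matches continuously with silencePeriod = 2 emits a message, stays silent for two checks, then emits again.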