Intelligent Log Anomaly Analysis with DeepSeek

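I came across this idea in an article on a WeChat official account and had wanted to try it hands-on for a long time, but never had the chance. Today, while at the company's Beijing office, I connected to the company's DeepSeek deployment and tried it out. Given the constraints, I skipped Elasticsearch (ES) and Feishu and worked purely with log files.

Reference URL: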
https://mp.weixin.qq.com/s/nzgbFqorPE2bNvGQ0lcSkA

Configuration file config.ini:

[deepseek]
deepseek_api_url= http://1.2.3.4:5/v1/chat/completions
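
If the endpoint requires authentication, the key could live in the same config.ini. The following is a minimal sketch, assuming a hypothetical `api_key` option in the `[deepseek]` section and an OpenAI-compatible endpoint that accepts a Bearer token. Both are assumptions; the setup in this post used no key at all. `build_headers` is an illustrative helper name.

```python
import configparser


def build_headers(config: configparser.ConfigParser) -> dict:
    """Return request headers, adding Authorization only when a key is configured."""
    headers = {"Content-Type": "application/json"}
    # `api_key` is a hypothetical option name; it is not in the original config.ini.
    api_key = config.get("deepseek", "api_key", fallback="").strip()
    if api_key:
        # Assumption: the endpoint follows the common "Bearer <token>" scheme.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


if __name__ == "__main__":
    config = configparser.ConfigParser()
    config.read("config.ini")
    print(build_headers(config))
```

With no key configured, the function returns only the Content-Type header, which matches what the script in this post sends.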

Code, main.py:

import logging
import configparser
import requests
import json
import time

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

config = configparser.ConfigParser()
config.read('config.ini')
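deepseek_api_url = config.get('deepseek', 'deepseek_api_url', fallback="http://1.2.3.4:5/v1/chat/completions")
# The option name must match the key in config.ini (deepseek_api_url). The fallback repeats the URL, which is a bit clumsy.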

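# Simulated historical fault matching; the entries are for illustration only.
# Keywords are matched against raw log lines, so the Chinese keywords stay in the language of the logs.
historical_fault_db = {
    "Redis connection pool exhausted": ["redis connection pool exhausted", "redis pool full"],
    "Database connection timeout": ["database connection timeout", "db connect timed out"],
    "2024-10-28 network connection anomaly, historical fault match verification": [
        "网络连接异常",  # "network connection anomaly"
        "故障根因：专线丢包，交换机版本bug",  # "root cause: packet loss on the dedicated line, switch version bug"
    ],
}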

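def analyze_logs_with_deepseek(logs):
    # Add an Authorization header here if the endpoint requires one; omit it when no api_key is needed.
    headers = {
        "Content-Type": "application/json"
    }
    combined_logs = '\n'.join(logs)
    data = {
        "model": "DeepSeek-R1",
        'messages': [
            {
                'role': 'user',
                'content': f"As a senior operations expert with 30 years of experience, classify the logs by fault type ('database anomaly', 'application anomaly', 'network anomaly', 'redis anomaly', 'mq anomaly', etc.), assess the impact scope (P0-P3), and give recommendations:\n{combined_logs}"
            }
        ]
    }
    max_retries = 3
    retries = 0
    while retries < max_retries:
        # Debug output: show exactly what is being sent.
        print(deepseek_api_url, 'deepseek_api_url')
        print(headers, 'headers')
        print(data, 'data')

        try:
            response = requests.post(deepseek_api_url,
                                     headers=headers,
                                     json=data)
            response.raise_for_status()
            result = response.json()
            return result
        except requests.HTTPError as e:
            # HTTPError is a subclass of RequestException, so it has to be caught first;
            # otherwise this branch is never reached.
            if response.status_code == 401:
                logging.error('Authentication failed; please check the API key.')
                break
            elif response.status_code == 400:
                logging.error('Invalid request parameters; please check the request data.')
                break
            else:
                logging.error(f'Request failed with status code: {response.status_code}')
                retries += 1
                time.sleep(2)
        except json.JSONDecodeError as e:
            # Checked before RequestException because newer versions of requests raise
            # a JSONDecodeError that is also a RequestException. Retrying will not help here.
            logging.error(f'JSON parsing error: {e}')
            break
        except requests.RequestException as e:
            # Connection errors, timeouts, and similar: wait briefly and retry.
            logging.error(f'Request error: {e}')
            retries += 1
            time.sleep(2)
    return None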

def match_with_historical_db(logs):
    '''
    Match logs against the historical fault database.
    :param logs: list of log lines
    :return: list of (log line, fault name) matches
    '''
    matches = []
    for log in logs:
        for fault, keywords in historical_fault_db.items():
            for keyword in keywords:
                if keyword.lower() in log.lower():
                    matches.append((log, fault))
    return matches

def main():
    with open('demo.log', 'r', encoding='utf-8') as f:
        logs = f.readlines()
    # print(logs)
    if logs:
        result = analyze_logs_with_deepseek(logs)
        if result:
            try:
                message = result['choices'][0]['message']['content']
                # logging.info(f'Combined log analysis result: {message}')
                historical_matches = match_with_historical_db(logs)
                historical_match_message = ''
                if historical_matches:
                    historical_match_message = '**Historical fault database matches:**'
                    for log, fault in historical_matches:
                        historical_match_message += f"Log line:{log}\n Matched fault: **{fault}**\n"
                else:
                    historical_match_message = 'No logs matched the historical fault database.'
                logging.info(historical_match_message)
                full_message = f'Combined log classification:\n{message}\n\n{historical_match_message}'
                print(full_message)
            except (KeyError, IndexError) as e:
                # IndexError covers an empty 'choices' list.
                logging.error(f'The response is missing a required key: {e}')
        else:
            logging.error('Log analysis failed.')
    else:
        logging.error('No logs found.')

if __name__ == '__main__':
    main()
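
`f.readlines()` keeps each line's trailing newline and any blank lines in demo.log, which is why the prompt printed in the run contains `\n\n` between entries and why the match output has stray blank lines. Below is a minimal sketch of a cleanup step that could run before the logs are sent. `clean_log_lines` is a hypothetical helper, not part of the original script.

```python
from typing import Iterable, List


def clean_log_lines(lines: Iterable[str]) -> List[str]:
    """Strip trailing line breaks and drop lines that contain only whitespace."""
    return [line.rstrip("\r\n") for line in lines if line.strip()]


if __name__ == "__main__":
    sample = [
        "2025-03-11 10:37:48,139 网络连接异常开始\n",
        "\n",
        "2025-03-11 10:37:48,139 网络连接异常结束",
    ]
    for line in clean_log_lines(sample):
        print(line)
```

In `main()`, this would amount to passing `clean_log_lines(f)` instead of `f.readlines()`; the prompt gets shorter and the fault matching is unaffected, since keywords are searched inside each line.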

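LLM output
Code like this is best kept to a minimum; it is a pure POC. For anything more serious, it should at least be written with LangChain or turned into a Dify Agent, which would be much slicker.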

D:\Python\Python310\python.exe D:\work\南航稳定性提升专项\华佑材料0427\运行事件\main.py 
http://36.112.33.138:60002/v1/chat/completions deepseek_api_url
{'Content-Type': 'application/json'} headers
{'model': 'DeepSeek-R1', 'messages': [{'role': 'user', 'content': "As a senior operations expert with 30 years of experience, classify the logs by fault type ('database anomaly', 'application anomaly', 'network anomaly', 'redis anomaly', 'mq anomaly', etc.), assess the impact scope (P0-P3), and give recommendations:\n2025-03-11 10:35:05,100 [lettuce-epollEventLoop-4-3] INFO io.lettuce.core.protocol.ReconnectionHandler:177 - Reconnected to 192.168.6.15:6379\n\n2025-03-11 10:36:03,197 [lettuce-eventExecutorLoop-1-2] INFO io.lettuce.core.protocol.ConnectionWatchdog:173 - Reconnecting, last destination was 192.168.6.13/192.168.6.13:6379\n\n2025-03-11 10:36:03,200 [lettuce-epollEventLoop-4-4] INFO io.lettuce.core.protocol.ReconnectionHandler:177 - Reconnected to 192.168.6.11:6379\n\n2025-03-11 10:36:43,096 [lettuce-eventExecutorLoop-1-3] INFO io.lettuce.core.protocol.ConnectionWatchdog:173 - Reconnecting, last destination was 192.168.6.13/192.168.6.13:6379\n\n2025-03-11 10:36:43,099 [lettuce-epollEventLoop-4-1] INFO io.lettuce.core.protocol.ReconnectionHandler:177 - Reconnected to 192.168.6.13:6379\n\n2025-03-11 10:37:39,272 [SimpleAsyncTaskExecutor-5] ERROR cn.com.easypay.aspect.WebLogAspect:39 - mq请求异常开始\n\n2025-03-11 10:37:39,272 [SimpleAsyncTaskExecutor-5] ERROR cn.com.easypay.aspect.WebLogAspect:39 - mq请求异常结束\n\n2025-03-11 10:37:48,139 网络连接异常开始\n\n2025-03-11 10:37:48,139 网络连接异常结束"}]} data
2025-05-13 21:48:44,026 - INFO - **Historical fault database matches:**Log line:2025-03-11 10:37:48,139 网络连接异常开始

 Matched fault: **2024-10-28 network connection anomaly, historical fault match verification**
Log line:2025-03-11 10:37:48,139 网络连接异常结束
 Matched fault: **2024-10-28 network connection anomaly, historical fault match verification**

Combined log classification:
<think>
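OK, I now need to carefully analyze the log data the user provided, classify it by fault type, assess the impact scope of each fault, and give corresponding recommendations. First, I need to go through the logs one by one, identify the different anomaly types, decide which category each belongs to, and then set a priority based on the impact scope.

First, there are several messages about "Reconnecting" and "Reconnected" from the Lettuce library, which are usually related to Redis connections. For example, the log at timestamp 2025-03-11 10:35:05 shows Reconnected to 192.168.6.15:6379, and the following entries are similar reconnection messages. This indicates that the Redis client is trying to reconnect, which may mean that the Redis server had a brief connection problem or that the network is unstable. This should be classified as a "redis anomaly".

Next, there are two error logs at 10:37:39 involving "mq请求异常开始" (MQ request anomaly start) and "mq请求异常结束" (MQ request anomaly end). The keyword here is "mq", which usually refers to a message queue such as RabbitMQ or Kafka. So this belongs to "mq anomaly".

Then, at 10:37:48 there are records of a network connection anomaly that explicitly mention "网络连接异常" (network connection anomaly), which clearly belongs to "network anomaly".

Next, the impact scope of each anomaly needs to be assessed. For the redis anomaly, if Redis is used as a cache and the system depends heavily on it, brief connection problems may degrade performance, but if reconnection succeeds the impact may be low. Frequent reconnections, however, may point to an underlying problem, so the impact scope may be P2 (affects some functionality, needs monitoring).

MQ anomalies usually involve the message queue. If the message queue is unavailable, asynchronous tasks may be delayed or fail, affecting communication between systems. Especially if this is the message queue of a core payment system, the impact could be significant, so it should be rated P1 (affects core functionality, needs immediate handling).

Network anomalies directly affect communication between systems. If the network is unstable, multiple services may become unavailable. This is an infrastructure problem with a potentially wide impact, so it is P1 and needs urgent handling.

In addition, I need to check for other anomalies, such as application anomalies or database anomalies, but the provided logs show no obvious database errors or application errors, so these types are probably absent. I need to confirm that all anomalies in the logs have been covered.

Finally, the recommendations: for the redis anomaly, check the Redis cluster status and network conditions and strengthen monitoring; for the MQ anomaly, check the MQ server status and make sure the queues are working properly; for the network anomaly, check network devices and links, possibly including firewalls or physical connections, and fix it as soon as possible. P1 issues should be handled first to keep the impact from spreading.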
</think>

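Based on the log content, I have classified the faults and assessed their impact scope as follows:

### 1. Redis anomaly
**Log evidence**:

2025-03-11 10:35:05,100 [lettuce-epollEventLoop-4-3] INFO ... Reconnected to 192.168.6.15:6379
2025-03-11 10:36:03,197 [lettuce-eventExecutorLoop-1-2] INFO ... Reconnecting, last destination was 192.168.6.13:6379
2025-03-11 10:36:03,200 [lettuce-epollEventLoop-4-4] INFO ... Reconnected to 192.168.6.11:6379
2025-03-11 10:36:43,096 [lettuce-eventExecutorLoop-1-3] INFO ... Reconnecting, last destination was 192.168.6.13:6379
2025-03-11 10:36:43,099 [lettuce-epollEventLoop-4-1] INFO ... Reconnected to 192.168.6.13:6379

**Analysis**:
- The multi-node Redis cluster is reconnecting frequently, possibly because of network instability or insufficient resources on the Redis server (such as memory or connection count).
- Although the client eventually reconnects successfully, frequent reconnections may cause delayed responses or brief unavailability.

**Impact scope**: P2 (affects some functionality; needs attention but is not urgent)
**Recommendations**:
1. Check the health of the Redis cluster (memory, connection count, persistence).
2. Monitor the Redis server logs for anomalies or resource bottlenecks.
3. Check network stability between the client and Redis (latency, packet loss).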

---

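### 2. MQ anomaly
**Log evidence**:

2025-03-11 10:37:39,272 [SimpleAsyncTaskExecutor-5] ERROR ... mq请求异常开始
2025-03-11 10:37:39,272 [SimpleAsyncTaskExecutor-5] ERROR ... mq请求异常结束

**Analysis**:
- The message queue (e.g. RabbitMQ/Kafka) is reporting anomalies, which may cause asynchronous tasks to fail or messages to pile up.
- The impact depends on the business scenario; if this is on the core payment path, the impact is significant.

**Impact scope**: P1 (affects core functionality; needs immediate handling)
**Recommendations**:
1. Check the MQ server status (e.g. connection count, queue backlog).
2. Check that consumers are processing messages normally, to avoid backlog or deadlock.
3. Ensure high availability of the MQ cluster (e.g. mirrored queues, cluster node status).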

---

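### 3. Network anomaly
**Log evidence**:

2025-03-11 10:37:48,139 网络连接异常开始
2025-03-11 10:37:48,139 网络连接异常结束

**Analysis**:
- The network connection anomaly may be the root cause of the Redis and MQ anomalies.
- An unstable network affects every service that relies on network communication (e.g. microservice calls, database access).

**Impact scope**: P1 (affects core functionality; needs immediate handling)
**Recommendations**:
1. Check the status of network devices such as switches, routers, and firewalls.
2. Use a network monitoring tool (e.g. Pingmesh) to locate the faulty node.
3. Ensure network redundancy (e.g. dual links) and optimize routing policies.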

---

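### Summary of handling priorities:
1. **Network anomaly** (P1) and **MQ anomaly** (P1) need immediate handling because they directly affect core functionality.
2. **Redis anomaly** (P2) is important but can be deferred, provided the reconnection mechanism keeps the service available.

**General recommendations**:
- Strengthen the log monitoring system and alert on anomalous logs in real time.
- Rehearse fault recovery plans regularly to verify that the high-availability architecture works.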

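**Historical fault database matches:**Log line:2025-03-11 10:37:48,139 网络连接异常开始

 Matched fault: **2024-10-28 network connection anomaly, historical fault match verification**
Log line:2025-03-11 10:37:48,139 网络连接异常结束
 Matched fault: **2024-10-28 network connection anomaly, historical fault match verification**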


Process finished with exit code 0
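
The DeepSeek-R1 reply above carries its reasoning in a `<think>...</think>` block ahead of the final answer. If only the answer should be logged or forwarded, the two parts can be separated. Here is a minimal sketch, assuming the reasoning arrives as a single `<think>` block inside the `content` string, as it did in this run. `split_reasoning` is a hypothetical helper, not part of the original script.

```python
import re
from typing import Tuple

# DOTALL lets "." span the newlines inside the reasoning block.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no <think> block exists."""
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    reply = "<think>\nclassify the log lines first\n</think>\n\n### 1. Redis anomaly"
    reasoning, answer = split_reasoning(reply)
    print(answer)
```

In `main()`, the `message` string could be passed through this helper so that `full_message` contains only the classified result.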
