From "Killed by Health Checks" to Stable Rollouts: Optimizing Kubernetes Health Checks for Java Services in Practice

Applies to: Spring Boot / Spring Cloud Gateway on Kubernetes (production)

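Background and Symptoms

Pods in production restart repeatedly from time to time, and kubectl describe pod shows Liveness probe failed or Readiness probe failed.
The failures are more likely during cold start and at traffic peaks; probe timeouts and 404 responses are common in the logs.
In the worst case this turns into a "restart storm" in which the service never manages to come up.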

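Root Causes

Port/path mismatch: the container listens on 8080 but the probe targets 80; or the Actuator probe endpoints are not enabled, so /actuator/health/liveness and /actuator/health/readiness return 404.
Missing startupProbe: the livenessProbe starts running before the application has finished starting, fails several times in a row, and triggers a restart.
Slow JVM cold start: a large heap combined with -XX:+AlwaysPreTouch and a high InitialRAMPercentage noticeably lengthens startup.
Thread/connection contention: the health endpoint competes with business traffic for threads and connections, so probes tend to time out at peak load.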

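Minimum Viable Fix (MVP)

Align port and path: use 8080 (or your actual port) consistently, and confirm that the path exists.
Enable the Actuator probe endpoints:
- MANAGEMENT_ENDPOINT_HEALTH_PROBES_ENABLED=true
- MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE=health,info,metrics,prometheus
Add a startupProbe: it gives cold start a buffer, because Kubernetes holds back the liveness and readiness probes until the startup probe has succeeded, so a container that is still starting is not killed by mistake.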
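
If you prefer to keep these switches in the application configuration rather than in container environment variables, the same two settings can be sketched in application.yaml. This assumes a Spring Boot version that ships the Actuator probe health groups (2.3 or later); the property names are the standard relaxed-binding equivalents of the environment variables above, and the file location is whatever your project already uses.

```yaml
# application.yaml (sketch): equivalent of the two environment variables above
management:
  endpoint:
    health:
      probes:
        enabled: true        # exposes /actuator/health/liveness and /actuator/health/readiness
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
```

Set the values in one place only; if both the environment variable and the file are present, the environment variable normally takes precedence.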

Example (HTTP probes, single port 8080):

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30   # 30 x 10 s = about 5 minutes of cold-start buffer

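More Robust Setup (Recommended)

Separate the management port from the business port and use TCP probes, so that probing puts less load on the application layer. Keep in mind that a TCP probe only checks that the port accepts connections; it does not read the Actuator liveness or readiness state.

Container environment variables: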

env:
  - name: MANAGEMENT_SERVER_PORT
    value: "18010"
  - name: MANAGEMENT_SERVER_ADDRESS
    value: "0.0.0.0"
  - name: MANAGEMENT_ENDPOINT_HEALTH_PROBES_ENABLED
    value: "true"
  - name: MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE
    value: "health,info,metrics,prometheus"

Probes (TCP to 18010):

startupProbe:
  tcpSocket:
    port: 18010
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 30
readinessProbe:
  tcpSocket:
    port: 18010
  initialDelaySeconds: 60
  periodSeconds: 15
  timeoutSeconds: 10
  failureThreshold: 4
livenessProbe:
  tcpSocket:
    port: 18010
  initialDelaySeconds: 90
  periodSeconds: 20
  timeoutSeconds: 10
  failureThreshold: 6
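
A possible refinement of the probes above, shown only as a sketch: declare both ports by name on the container and let the probes refer to the name, so the port number lives in one place. The port names http and management are assumptions for illustration. As a variation on the article's all-TCP setup, the readiness probe here uses httpGet against the management port so that it reflects Actuator's readiness state, while startup and liveness stay on TCP.

```yaml
# Fragment of a container spec; port names are hypothetical
ports:
  - name: http
    containerPort: 8080
  - name: management
    containerPort: 18010
startupProbe:
  tcpSocket:
    port: management         # resolves to 18010
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 30
readinessProbe:
  httpGet:                   # variation: reads the Actuator readiness group
    path: /actuator/health/readiness
    port: management
  periodSeconds: 15
  timeoutSeconds: 10
  failureThreshold: 4
livenessProbe:
  tcpSocket:
    port: management
  periodSeconds: 20
  timeoutSeconds: 10
  failureThreshold: 6
```

Whether readiness should go through HTTP or stay on TCP depends on how much you want traffic routing to follow the application's own readiness state; both are reasonable under the article's approach.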

JVM Flags (Container-Friendly, Faster Cold Start)

Suited to roughly 2 CPU / 4 Gi through 4 CPU / 8 Gi; adjust to your CPU and memory.

-Duser.timezone=Asia/Shanghai
-Dfile.encoding=UTF-8
-Djava.security.egd=file:/dev/./urandom
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=30
-XX:+UseContainerSupport
-XX:+PreferContainerQuotaForCPUCount
-XX:ActiveProcessorCount=<match the CPU limit, e.g. 2 or 4>
-XX:MaxRAMPercentage=70.0
-XX:InitialRAMPercentage=40.0     # lowers cold-start cost
-XX:MaxDirectMemorySize=512m
-XX:+ParallelRefProcEnabled
-XX:+UseStringDeduplication
-XX:+UseCompressedOops
-XX:+UseCompressedClassPointers
-XX:MaxMetaspaceSize=512m
-XX:MetaspaceSize=256m
-Xss512k
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/opt/csair/logs/dumps
-Xlog:gc*:file=/opt/csair/logs/gc.log:time,uptime:filecount=10,filesize=100m
-Djava.net.preferIPv4Stack=true
-Djdk.nio.maxCachedBufferSize=262144
-Djava.util.concurrent.ForkJoinPool.common.parallelism=<match the CPU count, e.g. 2 or 4>
-Dreactor.netty.ioWorkerCount=<CPU x 2, kept moderate, e.g. 4 or 8>

Not recommended in containers: -XX:+AlwaysPreTouch, -XX:+UseNUMA, -XX:+PerfDisableSharedMem, -XX:+UnlockExperimentalVMOptions, -XX:+UseCGroupMemoryLimitForHeap (newer JDK versions fold the last one into UseContainerSupport).
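
One way to deliver these flags without rebuilding the image is the JAVA_TOOL_OPTIONS environment variable, which the HotSpot JVM reads at startup. The fragment below is a sketch only: it carries a small subset of the flags listed above, and the resource values assume the 2 CPU / 4 Gi profile.

```yaml
# Fragment of a container spec; flag subset and resource sizes are illustrative
env:
  - name: JAVA_TOOL_OPTIONS
    # ">-" folds the lines into one space-separated string
    value: >-
      -XX:+UseG1GC
      -XX:MaxRAMPercentage=70.0
      -XX:InitialRAMPercentage=40.0
      -XX:ActiveProcessorCount=2
      -XX:+ExitOnOutOfMemoryError
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"                 # keep ActiveProcessorCount aligned with this value
    memory: 4Gi              # MaxRAMPercentage is computed against this limit
```

If your image's entrypoint already passes JVM flags explicitly, pick one mechanism so the two sets of options do not conflict.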

Recommended Probe Parameters (Production)

livenessProbe: periodSeconds 20 to 30, timeoutSeconds 5 to 10, failureThreshold ≥ 5
readinessProbe: periodSeconds 15 to 20, timeoutSeconds 5 to 10, failureThreshold ≥ 3
startupProbe: required for services with a long cold start; pair failureThreshold: 30 with periodSeconds: 10

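Verification and Troubleshooting Checklist

Events and causes: run kubectl describe pod <pod> and distinguish Probe failed from OOMKilled.
Health endpoint: run kubectl exec -it <pod> -- wget -qO- http://127.0.0.1:8080/actuator/health to check the response and the response time.
Configuration consistency: the port, the path, and the exposure switch (probes.enabled) must all agree.
Resource profile: check whether CPU or memory is approaching its limit, and whether the GC log shows frequent stop-the-world pauses.
Real-traffic validation: during canary load tests, track probe failure rate, restart count, and P99 latency.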

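Common Pitfalls

Probe path returns 404: probes.enabled is not set, or base-path is configured inconsistently.
Probe port mismatch: the container listens on 8080 but the probe targets 80.
Liveness runs too early: without a startupProbe, slow-starting services are killed by mistake.
Business thread contention: peak traffic crowds the health endpoint out of the thread pool, and the HTTP probe times out.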

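Rollout and Rollback

Canary one group of Pods first and observe for 30 to 60 minutes.
Keep HPA and PDB enabled to preserve availability during scaling and during maintenance.
If anything looks wrong, roll back quickly with kubectl rollout undo deployment/<name>.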
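
For readers who do not yet have a PDB in place, a minimal PodDisruptionBudget might look like the sketch below. The object name, the app label, and the minAvailable value are assumptions for illustration; the selector must match the labels your Deployment actually puts on its Pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gateway-pdb          # hypothetical name
spec:
  minAvailable: 2            # assumed value; choose it relative to your replica count
  selector:
    matchLabels:
      app: gateway           # hypothetical label; must match the Deployment's Pod labels
```

A PDB only limits voluntary disruptions such as node drains; it does not stop restarts caused by failing liveness probes, which is why the probe settings above still matter.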

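This approach has been noticeably more stable on Java services with high load and long cold starts. Getting from "killed by mistake" to "stable rollout" comes down to aligning port and path, enabling the probe endpoints, and adding a startupProbe; a dedicated management port, TCP probes, and sensible JVM flags then improve availability further.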
