Overview
Think of Triton + Kubernetes + Istio as a three-part "enterprise inference platform":
- Triton: high-performance inference server (multi-framework, multi-backend, dynamic batching, HTTP/gRPC interfaces); handles model loading, inference, and scheduling.
- Kubernetes: cluster orchestration, GPU scheduling, Pods/CSI/PVCs, rolling releases, and scaling. Use it to manage Triton instances (Pods), GPU nodes, and the persistent model repository.
- Istio: the service-mesh layer, responsible for traffic management (canary/A-B/staged rollout), mTLS, security, metrics/tracing/circuit breaking, rate limiting, and fine-grained routing. Place it in front of Triton for traffic governance and observability.
The sections below give Java developers practical guidance, typical scenarios, and example YAML/code so you can get started right away.
Key component responsibilities (engineering view)
- Triton: model repository + scheduler (supports TensorRT/ONNX/TensorFlow/PyTorch/Python backends, ensembles, dynamic batching, HTTP and gRPC interfaces). Well suited to unifying models from different frameworks behind a single service endpoint.
- Kubernetes: GPU resource allocation (nvidia.com/gpu), node affinity, PVCs (model repository), deployment strategy, and horizontal scaling. In production, the NVIDIA GPU Operator / device plugin is recommended to simplify installing drivers, plugins, and monitoring.
- Istio: routing at the ingress/sidecar layer (VirtualService), circuit breaking/connection pools (DestinationRule), mTLS, and observability (Prometheus + Kiali + Jaeger). Used for canary releases, traffic control, A/B testing, and traffic mirroring.
Typical use cases (when to choose this stack)
- Low-latency online inference (high concurrency)
- Multi-model / multi-framework unified serving
- Canary / A-B testing / disaster-recovery switchover
- High-throughput batch inference with dynamic batching
- Edge-cloud hybrid (Jetson/edge + central K8s)
Architecture pattern (simplified)
Client (Java service) → Istio IngressGateway → Istio sidecar → Triton Service (K8s Deployment, scheduled onto GPU nodes)
Monitoring chain: Triton /metrics → Prometheus → Grafana / Kiali / Jaeger (integrated via Istio). Triton default ports: HTTP 8000, gRPC 8001, metrics 8002.
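Before sending inference traffic it is common to probe Triton's readiness endpoint (GET /v2/health/ready on the HTTP port, which returns 200 once the server and its models are ready). A minimal sketch of building such a request; the in-cluster host name matches the Service defined later in this doc and is otherwise an assumption:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TritonHealthCheck {
    // Builds (but does not send) a readiness-probe request against
    // Triton's KServe v2 API on the default HTTP port 8000.
    static HttpRequest buildReadinessRequest(String host) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + ":8000/v2/health/ready"))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildReadinessRequest("triton.default.svc.cluster.local");
        // Prints: http://triton.default.svc.cluster.local:8000/v2/health/ready
        System.out.println(req.uri());
    }
}
```

Sending the request with HttpClient and checking for status 200 (as in the infer example below) makes this usable as a liveness/readiness gate in a Java health check.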
Hands-on examples
1) Triton Deployment (K8s, with GPU request & Istio sidecar injection)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
        version: v1
      annotations:
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:{{YOUR_TAG}}
          args: ["tritonserver", "--model-repository=/models"]
          ports:
            - containerPort: 8000
            - containerPort: 8001
            - containerPort: 8002
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "1000m"
              memory: "4Gi"
          volumeMounts:
            - name: model-repo
              mountPath: /models
      volumes:
        - name: model-repo
          persistentVolumeClaim:
            claimName: triton-models-pvc
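The Deployment above references a claim named triton-models-pvc. A minimal sketch of that PVC, assuming a ReadWriteMany/ReadOnlyMany-capable storage class (e.g. NFS or CephFS) so multiple replicas can share one model repository; the size and access mode here are assumptions to adapt to your environment:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: triton-models-pvc
spec:
  accessModes:
    - ReadOnlyMany          # multiple Triton replicas read the same repository
  resources:
    requests:
      storage: 50Gi         # size your model repository accordingly
  # storageClassName: set to a class backed by shared storage (NFS/CephFS/etc.)
```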
2) Service (exposed to Istio / in-cluster)
apiVersion: v1
kind: Service
metadata:
  name: triton
spec:
  selector:
    app: triton
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002
3) Istio: 90/10 canary (VirtualService + DestinationRule)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: triton-dr
spec:
  host: triton
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: triton-vs
spec:
  hosts:
    - triton
  http:
    - route:
        - destination:
            host: triton
            subset: v1
          weight: 90
        - destination:
            host: triton
            subset: v2
          weight: 10
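During a canary it is often useful to let testers deterministically hit v2 regardless of the weights. One common pattern is a header match placed before the weighted route; the header name x-canary below is a hypothetical choice, not an Istio convention:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: triton-vs
spec:
  hosts:
    - triton
  http:
    - match:
        - headers:
            x-canary:        # hypothetical header name; pick your own
              exact: "v2"
      route:
        - destination:
            host: triton
            subset: v2       # requests carrying x-canary: v2 always go to v2
    - route:                 # everyone else gets the 90/10 split
        - destination:
            host: triton
            subset: v1
          weight: 90
        - destination:
            host: triton
            subset: v2
          weight: 10
```

Istio evaluates http routes in order, so the header match must come first.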
4) Java example: calling Triton's HTTP /infer endpoint (synchronous)
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TritonInferClient {
    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        String url = "http://triton.default.svc.cluster.local:8000/v2/models/my_model/infer";
        // NOTE: "data" must contain exactly as many elements as the product of
        // the "shape" dimensions (here 1*3*224*224); it is truncated for brevity.
        String json = "{\n" +
                "  \"inputs\": [\n" +
                "    {\n" +
                "      \"name\": \"INPUT__0\",\n" +
                "      \"shape\": [1,3,224,224],\n" +
                "      \"datatype\": \"FP32\",\n" +
                "      \"data\": [ 0.1, 0.2, 0.3 ]\n" +
                "    }\n" +
                "  ],\n" +
                "  \"outputs\": [\n" +
                "    {\"name\": \"OUTPUT__0\"}\n" +
                "  ]\n" +
                "}";
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println("status: " + resp.statusCode());
        System.out.println("body: " + resp.body());
    }
}
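A frequent source of 400 errors with the v2 protocol is a "data" array whose length does not match the declared "shape". A small sketch of the check worth running before serializing a payload:

```java
public class InferPayloadCheck {
    // The KServe v2 inference protocol requires the flattened "data" array
    // to contain exactly the product of the "shape" dimensions.
    static long elementCount(long[] shape) {
        long n = 1;
        for (long d : shape) n *= d;
        return n;
    }

    public static void main(String[] args) {
        long[] shape = {1, 3, 224, 224};
        System.out.println(elementCount(shape)); // 150528
    }
}
```

For a shape of [1,3,224,224] the payload therefore needs 150528 FP32 values, which is why the three-element array in the example above is only a placeholder.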
Monitoring, scaling, and tuning
- Prometheus metrics: scrape Triton's /metrics endpoint (port 8002)
- Performance tuning: batch size, concurrency/instance count, serialization overhead
- Scaling strategy: HPA, KEDA, warm replicas
- MIG / multi-tenant GPU sharing
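Triton's /metrics endpoint serves the Prometheus text format, with counters such as nv_inference_request_success. A minimal sketch of pulling the numeric value out of one exposition line, useful when wiring a quick check without a full Prometheus client; the sample labels are illustrative:

```java
public class MetricParse {
    // Prometheus text exposition format: <name>{<labels>} <value>
    // Returns the trailing numeric value of one metric line.
    static double parseValue(String line) {
        return Double.parseDouble(line.substring(line.lastIndexOf(' ') + 1));
    }

    public static void main(String[] args) {
        String sample =
            "nv_inference_request_success{model=\"my_model\",version=\"1\"} 42";
        System.out.println((long) parseValue(sample)); // 42
    }
}
```

In production you would let Prometheus scrape port 8002 directly and alert on rates instead of parsing by hand; this is only for ad-hoc inspection.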
Best practices & caveats
- Keep models in a PVC or object storage
- Use canary releases and traffic mirroring
- Separate preprocessing from inference serving
- Prefer gRPC for low-latency scenarios
Recommended rollout (5 steps)
- Start Triton locally with Docker
- Install the GPU device plugin on K8s
- Deploy the Deployment + Service
- Configure the Istio VirtualService / DestinationRule
- Performance test and tune