1. Create a Python environment for running vLLM
conda create -n vllm_env python=3.12 -y
conda activate vllm_env
pip install vllm   (if you are worried about the latest release being unstable, pin a known-good version such as vllm==0.6.3)
Run vllm -v to verify the installation succeeded:
INFO 08-27 10:40:24 [__init__.py:241] Automatically detected platform cuda.
0.10.1.1
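If you prefer to double-check from inside Python, a minimal sketch (assuming torch was pulled in as a vLLM dependency):
import torch
import vllm

print("vLLM version:", vllm.__version__)            # should match the CLI output above
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())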
2. Download the model to deploy into a local directory
huggingface is an overseas source and downloads slowly; the domestic modelscope mirror is much faster. If modelscope is not installed yet, install it first with pip (pip install modelscope).
modelscope download --model openai-mirror/gpt-oss-20b --local_dir ./gpt-oss-20b
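The same download can also be done from Python; a small sketch using the ModelScope API (the local_dir argument mirrors the CLI flag above and is assumed to exist in your modelscope version; older releases only accept cache_dir):
from modelscope import snapshot_download

# downloads openai-mirror/gpt-oss-20b into ./gpt-oss-20b and returns the local path
model_dir = snapshot_download("openai-mirror/gpt-oss-20b", local_dir="./gpt-oss-20b")
print("model downloaded to:", model_dir)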
3. Deploy the local LLM service with the vllm CLI (if this works on the first try, skip straight to Section 6)
vllm serve /xxx/model/gpt-oss-20b   (absolute path to the local model)
Start without any extra arguments first; in theory it should just come up (as we like to say, confidence is a basic skill).
Problem 1: the port is already in use
(APIServer pid=66637) OSError: [Errno 98] Address already in use
Fix: vLLM listens on port 8000 by default; use --port to point it at a free port instead.
Problem 2: EngineCore failed to start
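If you are not sure which ports are free, a tiny helper sketch (plain Python, not part of vLLM) that asks the OS for an unused port you can then pass to --port:
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))          # port 0 lets the OS pick any free port
        return s.getsockname()[1]

print(find_free_port())          # pass this value to --port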
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] EngineCore failed to start.
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] RuntimeError: CUDA error: out of memory
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700]
(EngineCore_0 pid=66895) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 158, in __call__
(EngineCore_0 pid=66895) with torch.cuda.graph(cudagraph, pool=self.graph_pool):
(EngineCore_0 pid=66895) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=66895) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/torch/cuda/graphs.py", line 186, in __exit__
(EngineCore_0 pid=66895) self.cuda_graph.capture_end()
(EngineCore_0 pid=66895) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/torch/cuda/graphs.py", line 84, in capture_end
(EngineCore_0 pid=66895) super().capture_end()
(EngineCore_0 pid=66895) RuntimeError: CUDA error: out of memory
(EngineCore_0 pid=66895) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=66895) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=66895) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_0 pid=66895)
(APIServer pid=66728) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=66728) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Fix: this is related to CUDA graphs; without digging into the deeper root cause, a practical workaround is to disable CUDA graph capture.
Problem 3: trying to fix Problem 2 with "--disable-cuda-graph" fails because that argument does not exist; after a version update its replacement is "--enforce-eager"
vllm: error: unrecognized arguments: --disable-cuda-graph
Fix: use --enforce-eager. enforce_eager is a flag that controls whether vLLM always runs in PyTorch eager mode (immediate execution). It defaults to False, in which case vLLM uses a hybrid of eager mode and CUDA graphs, a combination intended to give the best mix of performance and flexibility.
CUDA graphs are a PyTorch technique for optimizing performance. Disabling them (i.e. setting enforce_eager to True) can cost some throughput but lowers memory requirements. For small models CUDA graphs can give a noticeable speedup; for large models the difference is usually minor.
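For reference, the same switch exists in vLLM's offline Python API; a minimal sketch assuming the same local model path as in the serve command:
from vllm import LLM, SamplingParams

llm = LLM(
    model="/xxx/model/gpt-oss-20b",  # local model path (placeholder, adjust to yours)
    enforce_eager=True,              # skip CUDA graph capture: some speed lost, less memory used
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)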
4. Successful deployment example && GPU usage
vllm serve /xxx/model/gpt-oss-20b --enforce-eager --port 16000
(APIServer pid=69302) INFO 08-25 14:38:04 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:16000
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:36] Available routes are:
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=69302) INFO: Started server process [69302]
(APIServer pid=69302) INFO: Waiting for application startup.
(APIServer pid=69302) INFO: Application startup complete.
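A quick smoke test against the routes listed above, assuming the server runs on localhost:16000 and the requests package is installed:
import requests

base = "http://localhost:16000"
print(requests.get(f"{base}/health").status_code)   # 200 means the engine is up
print(requests.get(f"{base}/v1/models").json())     # lists the served model(s)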
5. Use multi-GPU parallelism to improve inference efficiency
vllm serve /xxx/model/gpt-oss-20b --enforce-eager --port 16000 --tensor-parallel-size 3
This hits the following error: the model has 64 attention heads, and with tensor parallelism the head count (64) must be divisible by tensor-parallel-size, so 3 is not a valid value (see the small sketch after the traceback).
(APIServer pid=74257) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1343, in create_engine_config
(APIServer pid=74257) config = VllmConfig(
(APIServer pid=74257) ^^^^^^^^^^^
(APIServer pid=74257) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
(APIServer pid=74257) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=74257) pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
(APIServer pid=74257) Value error, Total number of attention heads (64) must be divisible by tensor parallel size (3). [type=value_error, input_value=ArgsKwargs((), {'model_co...additional_config': {}}), input_type=ArgsKwargs]
(APIServer pid=74257) For further information visit https://errors.pydantic.dev/2.11/v/value_error
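The constraint behind the error, spelled out in a couple of lines: tensor_parallel_size must evenly divide the number of attention heads, so for this model only a divisor of 64 works:
num_attention_heads = 64
valid_tp_sizes = [n for n in range(1, 9) if num_attention_heads % n == 0]
print(valid_tp_sizes)  # [1, 2, 4, 8] -- 3 is not in the list, hence the ValidationError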
After the correction, the server starts successfully && GPU usage is shown below
vllm serve /xxx/model/gpt-oss-20b --enforce-eager --port 16000 --tensor-parallel-size 2
6. A quick local test
curl http://localhost:16000/v1/models
curl http://localhost:16000/v1/completions -H "Content-Type: application/json" -d '{
"prompt": "世界上一共有多少个国家,排名前十的国家是哪些",
"max_tokens": 1024,
"temperature": 0.7}'
7. Access the service from a remote client
- Server side: the final command with all parameters
vllm serve /xxx/model/gpt-oss-20b \
--host 0.0.0.0 --port 16000 \
--api-key 123456 --dtype auto \
--served-model-name gpt-20b \
--enforce-eager \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9 \
--trust-remote-code
- Client side
from openai import OpenAI

# initialize the OpenAI-compatible client
client = OpenAI(
    base_url="http://x.x.x.x:16000/v1",
    api_key="123456"
)

response = client.chat.completions.create(
    model="gpt-20b",
    messages=[{"role": "user", "content": "How many countries are there in the world?"}]
)
print(response.choices[0].message.content)
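If you want tokens to show up as they are generated, a streaming variant of the same call (reuses the client defined above):
stream = client.chat.completions.create(
    model="gpt-20b",
    messages=[{"role": "user", "content": "How many countries are there in the world?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()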
References:
https://github.com/vllm-project/vllm
https://docs.vllm.ai/en/latest/
https://vllm.hyper.ai/docs/
https://vllm.hyper.ai/docs/inference-and-serving/engine_args