1. Create a Python environment for running vLLM
conda create -n vllm_env python=3.12 -y
conda activate vllm_env
pip install vllm   (if you are worried about the latest release being unstable, pin a known-good version such as vllm==0.6.3)
Run vllm -v to verify the installation succeeded:
INFO 08-27 10:40:24 [__init__.py:241] Automatically detected platform cuda.
0.10.1.1
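If you prefer to double-check from inside Python, a minimal sketch (assuming torch was pulled in as a vLLM dependency):
import torch
import vllm

print("vLLM version:", vllm.__version__)            # should match the CLI output above
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())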
2. Download the model to deploy into a local directory
huggingface is an overseas source and downloads slowly; the domestic modelscope mirror is much faster. If modelscope is not installed yet, install it first with pip (pip install modelscope).
modelscope download --model openai-mirror/gpt-oss-20b --local_dir ./gpt-oss-20b
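The same download can also be done from Python; a small sketch using the ModelScope API (the local_dir argument mirrors the CLI flag above and is assumed to exist in your modelscope version; older releases only accept cache_dir):
from modelscope import snapshot_download

# downloads openai-mirror/gpt-oss-20b into ./gpt-oss-20b and returns the local path
model_dir = snapshot_download("openai-mirror/gpt-oss-20b", local_dir="./gpt-oss-20b")
print("model downloaded to:", model_dir)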
3. Deploy the local LLM service with the vllm CLI (if this works on the first try, skip straight to Section 6)
vllm serve /xxx/model/gpt-oss-20b   (absolute path to the local model)
Start without any extra arguments first; in theory it should just come up (as we like to say, confidence is a basic skill).
Problem 1: the port is already in use
(APIServer pid=66637) OSError: [Errno 98] Address already in use
Fix: vLLM listens on port 8000 by default; use --port to point it at a free port instead.
Problem 2: EngineCore failed to start
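If you are not sure which ports are free, a tiny helper sketch (plain Python, not part of vLLM) that asks the OS for an unused port you can then pass to --port:
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))          # port 0 lets the OS pick any free port
        return s.getsockname()[1]

print(find_free_port())          # pass this value to --port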
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] EngineCore failed to start.
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] RuntimeError: CUDA error: out of memory
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_0 pid=66895) ERROR 08-25 14:25:16 [core.py:700]
(EngineCore_0 pid=66895) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 158, in __call__
(EngineCore_0 pid=66895) with torch.cuda.graph(cudagraph, pool=self.graph_pool):
(EngineCore_0 pid=66895) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=66895) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/torch/cuda/graphs.py", line 186, in __exit__
(EngineCore_0 pid=66895) self.cuda_graph.capture_end()
(EngineCore_0 pid=66895) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/torch/cuda/graphs.py", line 84, in capture_end
(EngineCore_0 pid=66895) super().capture_end()
(EngineCore_0 pid=66895) RuntimeError: CUDA error: out of memory
(EngineCore_0 pid=66895) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=66895) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=66895) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_0 pid=66895)
(APIServer pid=66728) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=66728) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Fix: this is related to CUDA graphs; without digging into the deeper root cause, a practical workaround is to disable CUDA graph capture.
Problem 3: trying to fix Problem 2 with "--disable-cuda-graph" fails because that argument does not exist; after a version update its replacement is "--enforce-eager"
vllm: error: unrecognized arguments: --disable-cuda-graph
Fix: use --enforce-eager. enforce_eager is a flag that controls whether vLLM always runs in PyTorch eager mode (immediate execution). It defaults to False, in which case vLLM uses a hybrid of eager mode and CUDA graphs, a combination intended to give the best mix of performance and flexibility.
CUDA graphs are a PyTorch technique for optimizing performance. Disabling them (i.e. setting enforce_eager to True) can cost some throughput but lowers memory requirements. For small models CUDA graphs can give a noticeable speedup; for large models the difference is usually minor.
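For reference, the same switch exists in vLLM's offline Python API; a minimal sketch assuming the same local model path as in the serve command:
from vllm import LLM, SamplingParams

llm = LLM(
    model="/xxx/model/gpt-oss-20b",  # local model path (placeholder, adjust to yours)
    enforce_eager=True,              # skip CUDA graph capture: some speed lost, less memory used
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)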
4. Successful deployment example && GPU usage
vllm serve /xxx/model/gpt-oss-20b --enforce-eager --port 16000
(APIServer pid=69302) INFO 08-25 14:38:04 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:16000
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:36] Available routes are:
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=69302) INFO 08-25 14:38:04 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=69302) INFO: Started server process [69302]
(APIServer pid=69302) INFO: Waiting for application startup.
(APIServer pid=69302) INFO: Application startup complete.
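A quick smoke test against the routes listed above, assuming the server runs on localhost:16000 and the requests package is installed:
import requests

base = "http://localhost:16000"
print(requests.get(f"{base}/health").status_code)   # 200 means the engine is up
print(requests.get(f"{base}/v1/models").json())     # lists the served model(s)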
5. Use multi-GPU parallelism to improve inference efficiency
vllm serve /xxx/model/gpt-oss-20b --enforce-eager --port 16000 --tensor-parallel-size 3
This hits the following error: the model has 64 attention heads, and with tensor parallelism the head count (64) must be divisible by tensor-parallel-size, so 3 is not a valid value (see the small sketch after the traceback).
(APIServer pid=74257) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1343, in create_engine_config
(APIServer pid=74257) config = VllmConfig(
(APIServer pid=74257) ^^^^^^^^^^^
(APIServer pid=74257) File "/home/swg32/miniconda3/envs/vllm_env/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
(APIServer pid=74257) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=74257) pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
(APIServer pid=74257) Value error, Total number of attention heads (64) must be divisible by tensor parallel size (3). [type=value_error, input_value=ArgsKwargs((), {'model_co...additional_config': {}}), input_type=ArgsKwargs]
(APIServer pid=74257) For further information visit https://errors.pydantic.dev/2.11/v/value_error
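The constraint behind the error, spelled out in a couple of lines: tensor_parallel_size must evenly divide the number of attention heads, so for this model only a divisor of 64 works:
num_attention_heads = 64
valid_tp_sizes = [n for n in range(1, 9) if num_attention_heads % n == 0]
print(valid_tp_sizes)  # [1, 2, 4, 8] -- 3 is not in the list, hence the ValidationError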
After the correction, the server starts successfully && GPU usage is shown below
vllm serve /xxx/model/gpt-oss-20b --enforce-eager --port 16000 --tensor-parallel-size 2
6. A quick local test
curl http://localhost:16000/v1/models
curl http://localhost:16000/v1/completions -H "Content-Type: application/json" -d '{
"prompt": "世界上一共有多少个国家,排名前十的国家是哪些",
"max_tokens": 1024,
"temperature": 0.7}'
7. Access the service from a remote client
- Server side: the final command with all parameters
vllm serve /xxx/model/gpt-oss-20b \
--host 0.0.0.0 --port 16000 \
--api-key 123456 --dtype auto \
--served-model-name gpt-20b \
--enforce-eager \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9 \
--trust-remote-code
- Client side
from openai import OpenAI

# initialize the OpenAI-compatible client
client = OpenAI(
    base_url="http://x.x.x.x:16000/v1",
    api_key="123456"
)

response = client.chat.completions.create(
    model="gpt-20b",
    messages=[{"role": "user", "content": "How many countries are there in the world?"}]
)
print(response.choices[0].message.content)
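If you want tokens to show up as they are generated, a streaming variant of the same call (reuses the client defined above):
stream = client.chat.completions.create(
    model="gpt-20b",
    messages=[{"role": "user", "content": "How many countries are there in the world?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()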
References:
https://github.com/vllm-project/vllm
https://docs.vllm.ai/en/latest/
https://vllm.hyper.ai/docs/
https://vllm.hyper.ai/docs/inference-and-serving/engine_args