Xinference Usage Notes -- 0x01

References

https://inference.readthedocs.io/

Basic environment setup

Installation
Initialize the Python environment (managed with conda)

conda create --name xinference python=3.11
conda activate xinference

Install Xinference (all engines, so every supported model can run)

pip install "xinference[all]"
## Or install only the engine you need:
## Transformers engine
pip install "xinference[transformers]"
## vLLM engine
pip install "xinference[vllm]"
## SGLang engine
pip install "xinference[sglang]"
## MLX engine
pip install "xinference[mlx]"

Running locally

## Set the local data directory (XINFERENCE_HOME), the listening IP, and the port
XINFERENCE_HOME=/opt/llm_engine/xinference xinference-local --host 0.0.0.0 --port 9991

Verify that the server is up: xinference cached -e http://127.0.0.1:9991

(Screenshot: verification output)
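
The same check can be scripted. This is a minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route on the endpoint used above; `models_url` and `server_is_up` are hypothetical helper names, not part of Xinference.

```python
import json
import urllib.error
import urllib.request


def models_url(endpoint: str) -> str:
    """Build the URL of the model-listing route for a server endpoint."""
    return endpoint.rstrip("/") + "/v1/models"


def server_is_up(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers the model-listing route with JSON."""
    try:
        with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False


# Usage against the server started above:
#   print(server_is_up("http://127.0.0.1:9991"))
```

If authentication is enabled on the server, the request is likely rejected and the function returns False even though the server is running.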

List supported models

xinference registrations -t LLM -e http://127.0.0.1:9991
## Parameters
-t: model type; one of LLM (large language model), embedding, rerank, image, audio, video

(Screenshot: list of supported models)

View a model's launch parameters

xinference engine -e http://127.0.0.1:9991 --model-name qwen-chat
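## Parameters
-e: endpoint of the Xinference server
--model-name: name of the model

(Screenshot: parameters available for the model)

List running models & stop a model to free resources

## List the running models
xinference list -e http://127.0.0.1:9991
## Stop a running model and free its resources
xinference terminate --model-uid "qwen2.5-instruct" -e http://127.0.0.1:9991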
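
The two commands can also be driven from Python. This is a sketch only, assuming the Xinference client's `list_models()` returns a dict keyed by model UID and that `terminate_model(uid)` stops one model; `terminate_all` is a hypothetical helper written for these notes.

```python
from typing import Any, List


def terminate_all(client: Any) -> List[str]:
    """Terminate every running model and return the UIDs that were stopped.

    `client` is any object with list_models() and terminate_model(uid);
    list_models() is assumed to return a dict keyed by model UID.
    """
    stopped = []
    # Copy the keys first so the loop is not affected by models disappearing.
    for uid in list(client.list_models()):
        client.terminate_model(uid)
        stopped.append(uid)
    return stopped


# Usage against a live server (requires the xinference package):
#   from xinference.client import Client
#   print(terminate_all(Client("http://127.0.0.1:9991")))
```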

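Loading models

Loading and running through Xinference
On first launch, the model weights are downloaded from Hugging Face by default. In mainland China you can point to a mirror by setting HF_ENDPOINT=https://hf-mirror.com. Once the download finishes, Xinference caches the files, so later launches do not download them again. You can also download from another model hub by setting XINFERENCE_MODEL_SRC when starting the server, for example: XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997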
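
When the server is started from a script, the variables mentioned above can be collected in one place. A minimal sketch; `server_env` is a hypothetical helper, and the paths and port in the usage comment are the ones used elsewhere in these notes.

```python
import os
from typing import Dict, Mapping, Optional


def server_env(
    home: str,
    hf_endpoint: Optional[str] = None,
    model_src: Optional[str] = None,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the Xinference variables set."""
    env = dict(os.environ if base is None else base)
    env["XINFERENCE_HOME"] = home
    if hf_endpoint:
        env["HF_ENDPOINT"] = hf_endpoint  # Hugging Face mirror
    if model_src:
        env["XINFERENCE_MODEL_SRC"] = model_src  # e.g. "modelscope"
    return env


# Usage (starts the server; requires xinference to be installed):
#   import subprocess
#   subprocess.run(
#       ["xinference-local", "--host", "0.0.0.0", "--port", "9991"],
#       env=server_env("/opt/llm_engine/xinference", model_src="modelscope"),
#       check=True,
#   )
```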
For example, to run qwen2.5-instruct:

xinference launch --model-engine vLLM -e http://127.0.0.1:9991 -n qwen2.5-instruct -s 0_5 -f pytorch 
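## Parameters
--model-engine: inference engine used to run the model
-e: endpoint of the Xinference server
-n: model name
-s: model size in billions of parameters (0_5 means 0.5B); the available sizes can be listed with xinference engine -e http://127.0.0.1:9991 --model-name qwen2.5-instruct
-f: model format, e.g. pytorch, ggufv2

(Screenshot: model files being downloaded)

(Screenshot: model running)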
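
The launch can also be done from Python instead of the CLI. This sketch assumes the Xinference client's `launch_model` accepts the keyword arguments shown and returns the model UID; `launch_qwen` is a hypothetical wrapper that mirrors the command-line options above.

```python
from typing import Any


def launch_qwen(client: Any) -> str:
    """Launch qwen2.5-instruct (0.5B, pytorch format) on the vLLM engine.

    `client` is expected to behave like xinference.client.Client.
    """
    return client.launch_model(
        model_name="qwen2.5-instruct",   # -n
        model_engine="vLLM",             # --model-engine
        model_size_in_billions="0_5",    # -s (0_5 means 0.5B)
        model_format="pytorch",          # -f
    )


# Usage against a live server (requires the xinference package):
#   from xinference.client import Client
#   uid = launch_qwen(Client("http://127.0.0.1:9991"))
#   print(uid)
```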

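Notes
When running locally (cluster deployment with a Supervisor node plus Worker nodes is also supported), launching more than one model may fail with the following error:

(Screenshot: error message)

Cause and fixes:

1. Cause: the GPU does not have enough free resources, or each GPU is limited to loading one model.
2. Use the --gpu-idx option to force several models onto the same GPU:
xinference launch --gpu-idx x
3. Launch the lighter embedding model first, then the large language models.
4. Stop a running model to free resources.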
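
Fixes 2 and 3 can be combined in a small script. This is a sketch under assumptions: it presumes the Python client's `launch_model` accepts a `gpu_idx` keyword like the CLI option, and `launch_small_first` is a hypothetical helper. Each spec must carry whatever other arguments the model needs (an LLM usually also needs its engine, size, and format).

```python
from typing import Any, Dict, List


def launch_small_first(
    client: Any, specs: List[Dict[str, Any]], gpu_idx: int
) -> List[str]:
    """Launch all specs on one GPU, with non-LLM models (e.g. embedding) first.

    Each spec is a dict of launch_model keyword arguments. A spec without
    "model_type" is treated as an LLM.
    """
    # False sorts before True, so non-LLM specs come first; the sort is stable.
    ordered = sorted(specs, key=lambda s: s.get("model_type", "LLM") == "LLM")
    return [client.launch_model(gpu_idx=gpu_idx, **spec) for spec in ordered]


# Usage against a live server (requires the xinference package):
#   from xinference.client import Client
#   launch_small_first(
#       Client("http://127.0.0.1:9991"),
#       [
#           {"model_name": "qwen2.5-instruct", "model_type": "LLM",
#            "model_engine": "vLLM", "model_size_in_billions": "0_5",
#            "model_format": "pytorch"},
#           {"model_name": "bce-embedding-base_v1", "model_type": "embedding"},
#       ],
#       gpu_idx=0,
#   )
```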

Using model files that are already downloaded locally
Launching a model directly, without registering it

xinference launch --model_path /opt/hfd/maidalun1020/bce-embedding-base_v1 --model-type embedding -n bce-embedding-base_v1  -e http://127.0.0.1:9991
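## Parameters
--model_path: path to the local model files
--model-type: model type; one of LLM (large language model), embedding, rerank, image, audio, video
-n: model name
-e: endpoint of the Xinference server

Registering a custom model & launching it

1. Define the custom model
JSON file: bce-reranker-base_v1.json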
{
    "model_name": "local_bce-reranker-base_v1",
    "type": "normal",
    "language": ["en", "zh"],
    "model_id": "local_bce-reranker-base_v1",
    "model_uri": "file:///opt/hfd/maidalun1020/bce-reranker-base_v1"
}
2. Register the custom model
xinference register --model-type rerank --file /opt/llm_engine/xinference/model_json/bce-reranker-base_v1.json --persist -e http://127.0.0.1:9991
## Parameters
--model-type: model type
--file: path to the custom model's JSON file
-e: endpoint of the Xinference server
--persist: persist the registration (kept across server restarts)
3. Launch the custom model
xinference launch --model-name local_bce-reranker-base_v1 --model-type rerank  -e http://127.0.0.1:9991
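
The JSON definition from step 1 can be generated rather than written by hand, which helps when registering several local models. A minimal sketch: `rerank_definition` is a hypothetical helper that reproduces the fields shown above, and the registration call in the usage comment assumes the Xinference client's `register_model(model_type, model, persist)` method.

```python
import json
from typing import Any, Dict, List


def rerank_definition(
    name: str, local_dir: str, languages: List[str]
) -> Dict[str, Any]:
    """Build a custom rerank model definition for files in a local directory.

    `local_dir` must be an absolute path, e.g. /opt/hfd/org/model.
    """
    return {
        "model_name": name,
        "type": "normal",
        "language": languages,
        "model_id": name,
        "model_uri": "file://" + local_dir,
    }


# Usage against a live server (requires the xinference package):
#   from xinference.client import Client
#   definition = rerank_definition(
#       "local_bce-reranker-base_v1",
#       "/opt/hfd/maidalun1020/bce-reranker-base_v1",
#       ["en", "zh"],
#   )
#   client = Client("http://127.0.0.1:9991")
#   client.register_model(
#       model_type="rerank", model=json.dumps(definition), persist=True
#   )
```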

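Model inference

Large language models
The examples below assume the qwen2.5-instruct model is already running.

(Screenshot: qwen2.5-instruct)

Via the OpenAI-compatible API
Sample code: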

from openai import OpenAI

# The OpenAI client requires a non-empty api_key; "123" is just a placeholder.
client = OpenAI(base_url="http://127.0.0.1:9991/v1", api_key="123")
response = client.chat.completions.create(
    model="qwen2.5-instruct",  # UID of the running model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(f'response : {response}\n')

(Screenshot: LLM response)
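
Printing the whole response object is useful for debugging, but usually only the reply text is needed. A small sketch, assuming the standard chat completion object returned by the openai package; `answer_text` is a hypothetical helper.

```python
from typing import Any


def answer_text(response: Any) -> str:
    """Return the assistant's reply from an OpenAI chat completion object."""
    # The first choice holds the generated message when n is left at 1.
    return response.choices[0].message.content


# Usage with the `response` object from the sample above:
#   print(answer_text(response))
```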

Via the Xinference client
Sample code:

from xinference.client import Client
client = Client("http://127.0.0.1:9991")
m_uid = "qwen2.5-instruct"
messages = [{"role": "user", "content": "What is the largest animal?"}]
model = client.get_model(m_uid)
response = model.chat(
    messages,
    generate_config={"max_tokens": 1024}
)
print(f'response : {response}\n')

(Screenshot: model response)
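
Unlike the openai package, the Xinference client returns a plain dict. The sketch below assumes that dict follows the OpenAI chat completion layout, including an optional "usage" entry with "total_tokens"; `reply_and_usage` is a hypothetical helper.

```python
from typing import Any, Dict, Optional, Tuple


def reply_and_usage(response: Dict[str, Any]) -> Tuple[str, Optional[int]]:
    """Return (reply text, total token count or None) from a chat result dict."""
    text = response["choices"][0]["message"]["content"]
    usage = response.get("usage") or {}
    return text, usage.get("total_tokens")


# Usage with the `response` dict from the sample above:
#   text, total_tokens = reply_and_usage(response)
#   print(text, total_tokens)
```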

Audio models
Using built-in models
Text-to-speech: ChatTTS

(Screenshot: ChatTTS)

Sample code:

from xinference.client import Client

client = Client("http://127.0.0.1:9991")
m_uid = "ChatTTS"
model = client.get_model(m_uid)
out_audio_file = '/opt/llm_engine/codes/test_001.wav'
input_text = "你好 apple"
response = model.speech(
    input=input_text,
    # The default output format is mp3, which failed at runtime with
    # "Encoder not found for codec: mp3", so the format is set to wav explicitly.
    response_format="wav",
    voice="echo"
)
# speech() returns the audio as bytes; write them to the output file.
with open(out_audio_file, "wb") as file:
    file.write(response)
print('############## END ###################')
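
Because the output format was switched from mp3 to wav, it can be worth confirming that the returned bytes really are a WAV container before saving them. A minimal sketch that only inspects the RIFF/WAVE header and does not decode the audio; `looks_like_wav` is a hypothetical helper.

```python
def looks_like_wav(data: bytes) -> bool:
    """Return True if the bytes start with a RIFF header of type WAVE."""
    # Bytes 0-3 are "RIFF", bytes 4-7 the chunk size, bytes 8-11 "WAVE".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


# Usage with the `response` bytes from the sample above:
#   if not looks_like_wav(response):
#       raise ValueError("server did not return WAV data")
```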

Via the OpenAI-compatible API

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:9991/v1",api_key="123")
m_uid = "ChatTTS"
out_audio_file = '/opt/llm_engine/codes/test_0010.wav'
input_text = "你好 apple"
with client.audio.speech.with_streaming_response.create(
    model=m_uid,
    voice="echo",
    response_format="wav",
    input=input_text
) as response:
    response.stream_to_file(out_audio_file)

print('############## END ###################')

Speech-to-text: whisper-large-v3

Sample code:

from xinference.client import Client

client = Client("http://127.0.0.1:9991")
m_uid = "whisper-large-v3"
model = client.get_model(m_uid)
audio_path = '/opt/llm_engine/codes/common_voice_zh-CN_38026095.mp3'
# transcriptions() takes the raw audio bytes.
with open(audio_path, "rb") as audio_file:
    response = model.transcriptions(audio_file.read())
    print(f'response : {response}\n')
print('############## END ###################')

Via the OpenAI-compatible API

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9991/v1", api_key="123")
# Must match the UID of a model that is currently running.
m_uid = "whisper-large-v3-turbo"
audio_path = '/opt/llm_engine/codes/common_voice_zh-CN_38026095.mp3'
with open(audio_path, "rb") as audio_file:
    response = client.audio.transcriptions.create(model=m_uid, file=audio_file)
    print(f'response : {response}\n')
