The following is a step-by-step tutorial for deploying DeepSeek models locally, covering environment setup, model download, inference deployment, and optimization:
---
### **I. Environment Setup**
#### 1. Hardware Requirements
- **GPU** (recommended):
  - 7B model: at least 10 GB of VRAM (e.g., NVIDIA RTX 3080/4090)
  - 13B/33B models: 16 GB+ of VRAM (e.g., A10/A100)
- Without a GPU, you can run a quantized version of the model on the CPU (much slower). A quick hardware check is sketched below.
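A minimal sketch for verifying that PyTorch can see a CUDA GPU and how much VRAM it exposes (assumes PyTorch is already installed, as described in the next step):
```python
import torch

# Report the detected GPU and its total VRAM, or warn that inference will run on the CPU.
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {name}, VRAM: {vram_gb:.1f} GB")
else:
    print("No CUDA GPU detected; expect slow CPU-only inference.")
```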
#### 2. Software Dependencies
- **Python 3.8+**
- **PyTorch 2.0+** (must match your CUDA version):
```bash
# Example: install PyTorch with CUDA 11.8
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
- **Hugging Face libraries**:
```bash
pip install transformers accelerate sentencepiece huggingface_hub
```
- **Optional optimization libraries**:
```bash
# Quantization support (4/8-bit inference)
pip install bitsandbytes
# FlashAttention acceleration
pip install flash-attn --no-build-isolation
```
---
### **II. Downloading the Model**
#### 1. Obtaining the Model Files
- **Hugging Face Hub** (may require requesting access):
```bash
huggingface-cli login  # log in to your Hugging Face account
git lfs install
git clone https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat
```
- **Official channels**:
  Download the model weights (`.bin` or `.safetensors`) and configuration files (`config.json`) from the DeepSeek website or a partner platform.
#### 2. Model Format
- The Hugging Face format is recommended (containing `pytorch_model.bin` + `tokenizer.json`).
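As an alternative to `git clone`, the files can also be fetched programmatically with `huggingface_hub`; a minimal sketch (the `local_dir` path is just an example, and an interrupted download can be resumed by re-running the call):
```python
from huggingface_hub import snapshot_download

# Download the full model repository into a local directory.
snapshot_download(
    repo_id="deepseek-ai/deepseek-llm-7b-chat",
    local_dir="./deepseek-7b-chat",
)
```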
---
### **III. Model Inference**
#### 1. Basic Inference Code
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "path/to/deepseek-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",          # automatically place layers on GPU/CPU
    torch_dtype=torch.float16,  # half precision to reduce VRAM usage
    # load_in_4bit=True,        # 4-bit quantization (requires bitsandbytes)
)

input_text = "How do I make scrambled eggs with tomatoes?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
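For the chat variant, it is usually better to wrap the conversation in the model's chat template rather than passing raw text; a hedged sketch using the `apply_chat_template` helper from `transformers` (assumes the tokenizer ships a chat template):
```python
# Build the prompt from the tokenizer's built-in chat template, then generate as before.
messages = [{"role": "user", "content": "How do I make scrambled eggs with tomatoes?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker so the model starts replying
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```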
#### 2. Advanced Optimizations
- **vLLM inference engine** (for high-throughput scenarios):
```bash
pip install vllm
```
```python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/deepseek-7b-chat", tensor_parallel_size=2)  # tensor parallelism across 2 GPUs
prompts = ["How do I make scrambled eggs with tomatoes?"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.8))
print(outputs[0].outputs[0].text)
```
- **Quantized loading** (when VRAM is insufficient):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                     # quantize weights to 4-bit
        bnb_4bit_compute_dtype=torch.float16,  # run computation in fp16
    ),
)
```
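The FlashAttention package installed earlier can be enabled at load time; a hedged sketch, assuming a recent `transformers` release and a GPU/model combination that supports FlashAttention-2:
```python
# Load with the FlashAttention-2 kernel for faster attention (requires the flash-attn package
# and an Ampere-or-newer GPU; transformers raises an error if the model does not support it).
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```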
---
### **IV. Deploying as an API Service**
#### 1. Using FastAPI
```python
# Assumes `model` and `tokenizer` have already been loaded as in section III.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Request(BaseModel):
    prompt: str
    max_tokens: int = 200

@app.post("/generate")
async def generate(request: Request):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Start the server with: uvicorn api:app --host 0.0.0.0 --port 8000
```
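Once the server is up, the endpoint can be exercised with a short client; a minimal sketch using the `requests` library (assumes the service is listening on localhost:8000):
```python
import requests

# Call the /generate endpoint defined above and print the model's reply.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Hello, who are you?", "max_tokens": 100},
)
print(resp.json()["response"])
```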
#### 2. Using an OpenAI-Compatible Interface
- **Using FastChat**:
```bash
pip install "fschat[model_worker,webui]"
python -m fastchat.serve.controller
python -m fastchat.serve.model_worker --model-path deepseek-7b-chat
python -m fastchat.serve.openai_api_server --host 0.0.0.0
```
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-7b-chat",
"messages": [{"role": "user", "content": "你好"}]
}'
```
---
### **V. Common Issues**
1. **Out of VRAM**:
   - Enable `load_in_4bit` or `device_map="auto"`.
   - Run entirely on the CPU with `model = model.to('cpu')` (much slower), or offload part of the model to CPU RAM as in the sketch after this list.
2. **Interrupted downloads**:
   - Use the `resume_download=True` parameter of `huggingface_hub`.
   - Download the files manually and point the code at the local path.
3. **Slow inference**:
   - Enable FlashAttention or switch to the vLLM engine.
   - Keep `batch_size=1` to avoid running out of memory (at the cost of throughput).
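For the out-of-VRAM case, the `device_map` placement from `accelerate` can also spill layers into CPU RAM instead of moving the whole model off the GPU; a hedged sketch where the per-device memory limits are illustrative values, not recommendations:
```python
import torch
from transformers import AutoModelForCausalLM

# Cap GPU 0 at roughly 8 GiB and let accelerate place the remaining layers in CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/deepseek-7b-chat",
    device_map="auto",
    max_memory={0: "8GiB", "cpu": "24GiB"},  # illustrative limits
    torch_dtype=torch.float16,
)
```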
---
With the steps above, you can deploy a DeepSeek model locally and adjust the resource allocation and optimization strategy to fit your needs.