This article shows how to build a browser-based, real-time streaming speech transcription system using Alibaba's open-source Qwen3-ASR speech recognition model, the vLLM inference engine, and a Spring Boot backend.
1. Qwen3-ASR Overview
Qwen3-ASR is an open-source ASR model family from the Qwen team. It supports speech recognition in 52 languages and dialects, along with language detection and timestamp prediction.
| Model | Parameters | Positioning | Streaming inference |
|---|---|---|---|
| Qwen3-ASR-1.7B | ~1.7B | Accuracy-first, SOTA performance | ✅ vLLM only |
| Qwen3-ASR-0.6B | ~0.6B | Efficiency-first, TTFT as low as 92 ms | ✅ vLLM only |
The models are released under the Apache 2.0 license; streaming inference is currently supported only through the vLLM backend.
2. Model Deployment
2.1 Installation and Startup
# Install vLLM (requires v0.14.0+)
pip install "vllm[audio]"
# Start the server
vllm serve Qwen/Qwen3-ASR-1.7B --host 0.0.0.0 --port 8000
2.2 Verification
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-ASR-1.7B",
"messages": [{"role": "user", "content": [
{"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"}}
]}]
}'
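For programmatic use, the same request can be assembled with local audio embedded as a base64 data URL instead of a public URL, which is exactly what the backend in section 4 sends to vLLM. A minimal offline sketch using only the Python standard library (actually sending it is left to your HTTP client of choice):

```python
import base64
import json

def build_asr_request(wav_bytes: bytes, model: str = "Qwen/Qwen3-ASR-1.7B") -> dict:
    """Build an OpenAI-compatible chat/completions payload with inline audio.

    The WAV bytes are embedded as a base64 data URL, mirroring the
    audio_url message shape used in the curl example above.
    """
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "audio_url",
                "audio_url": {"url": "data:audio/wav;base64," + audio_b64},
            }],
        }],
    }

if __name__ == "__main__":
    payload = build_asr_request(b"RIFF....WAVEfmt ")  # placeholder bytes, not a real WAV
    print(json.dumps(payload)[:80])
```

Add `"stream": true` to the returned dict to get the token-by-token SSE stream the backend relies on.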
3. System Architecture
┌─────────────────────┐    WebSocket     ┌─────────────────────┐  SSE (stream)  ┌────────────┐
│   Browser (HTML5)   │ ◄══════════════► │ Spring Boot backend │ ◄════════════► │ vLLM Server│
│                     │   WAV upstream   │                     │   /v1/chat/    │ Qwen3-ASR  │
│ AudioWorklet capture│ results downstream│  WebSocket handler │  completions   │   :8000    │
│ PCM → WAV encoding  │                  │  OkHttp SSE client  │  stream=true   │            │
└─────────────────────┘                  └─────────────────────┘                └────────────┘
Core flow: the browser captures microphone PCM via AudioWorklet → the frontend encodes it as WAV → the WAV is sent to the backend over WebSocket → the backend Base64-encodes the audio and calls vLLM with a streaming SSE request → tokens are pushed back to the browser and displayed in real time.
4. Backend Core Code
4.1 Dependencies (key parts of pom.xml)
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-websocket</artifactId>
</dependency>
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp-sse</artifactId>
<version>4.12.0</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.11.0</version>
</dependency>
</dependencies>
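The service in section 4.2 binds its vLLM endpoint through `@Value`. A minimal application.yml sketch with the matching property names (all values are placeholders; vLLM only checks the API key if it was started with `--api-key`):

```yaml
asr:
  vllm:
    base-url: http://localhost:8000   # the vLLM server from section 2
    model: Qwen/Qwen3-ASR-1.7B
    api-key: EMPTY                    # placeholder; ignored unless vLLM enforces a key
```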
4.2 VllmAsrService.java — streaming calls to vLLM
@Service
public class VllmAsrService {
@Value("${asr.vllm.base-url}")
private String baseUrl;
@Value("${asr.vllm.model}")
private String model;
@Value("${asr.vllm.api-key}")
private String apiKey;
private OkHttpClient httpClient;
@PostConstruct
public void init() {
this.httpClient = new OkHttpClient.Builder()
.connectTimeout(30, TimeUnit.SECONDS)
.readTimeout(120, TimeUnit.SECONDS)
.build();
}
public void streamTranscribe(String wavBase64,
Consumer<String> onToken,
Consumer<String> onComplete,
Consumer<Throwable> onError) {
// Build an OpenAI-compatible request body
JsonObject requestBody = new JsonObject();
requestBody.addProperty("model", model);
requestBody.addProperty("stream", true);
JsonObject audioUrlObj = new JsonObject();
audioUrlObj.addProperty("url", "data:audio/wav;base64," + wavBase64);
JsonObject contentItem = new JsonObject();
contentItem.addProperty("type", "audio_url");
contentItem.add("audio_url", audioUrlObj);
JsonArray content = new JsonArray();
content.add(contentItem);
JsonObject msg = new JsonObject();
msg.addProperty("role", "user");
msg.add("content", content);
JsonArray messages = new JsonArray();
messages.add(msg);
requestBody.add("messages", messages);
Request request = new Request.Builder()
.url(baseUrl + "/v1/chat/completions")
.addHeader("Authorization", "Bearer " + apiKey)
.post(RequestBody.create(new Gson().toJson(requestBody),
MediaType.parse("application/json")))
.build();
StringBuilder fullText = new StringBuilder();
EventSources.createFactory(httpClient).newEventSource(request, new EventSourceListener() {
@Override
public void onEvent(EventSource src, String id, String type, String data) {
if ("[DONE]".equals(data.trim())) {
onComplete.accept(fullText.toString());
return;
}
try {
JsonObject delta = JsonParser.parseString(data).getAsJsonObject()
.getAsJsonArray("choices").get(0).getAsJsonObject()
.getAsJsonObject("delta");
if (delta != null && delta.has("content") && !delta.get("content").isJsonNull()) {
String token = delta.get("content").getAsString();
fullText.append(token);
onToken.accept(token);
}
} catch (Exception ignored) {}
}
@Override
public void onFailure(EventSource src, Throwable t, Response resp) {
onError.accept(t != null ? t : new RuntimeException("SSE failure"));
}
});
}
public String bytesToBase64(byte[] bytes) {
return Base64.getEncoder().encodeToString(bytes);
}
}
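The EventSourceListener above does two things per event: detect the `[DONE]` sentinel and extract `choices[0].delta.content`, ignoring malformed or empty deltas. That logic is easier to verify in isolation; here is a sketch driven by canned SSE `data:` payloads rather than a live connection:

```python
import json

def consume_sse_events(events, on_token):
    """Replicates the onEvent logic: stop at [DONE], append delta content.

    `events` is an iterable of SSE data payloads (the part after "data: ").
    Returns the accumulated transcript.
    """
    full = []
    for data in events:
        if data.strip() == "[DONE]":
            break
        try:
            delta = json.loads(data)["choices"][0]["delta"]
        except (json.JSONDecodeError, KeyError, IndexError):
            continue  # skip malformed events, as the Java catch block does
        token = delta.get("content")
        if token:  # missing or null content is skipped
            full.append(token)
            on_token(token)
    return "".join(full)

if __name__ == "__main__":
    canned = [
        '{"choices":[{"delta":{"content":"Hel"}}]}',
        '{"choices":[{"delta":{"content":"lo"}}]}',
        "[DONE]",
    ]
    print(consume_sse_events(canned, lambda t: None))  # Hello
```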
4.3 AsrWebSocketHandler.java — WebSocket handler
@Component
public class AsrWebSocketHandler extends AbstractWebSocketHandler {
private static final int MAX_BINARY_SIZE = 10 * 1024 * 1024;
private final VllmAsrService vllmAsrService;
private final Gson gson = new Gson();
private final ConcurrentHashMap<String, AtomicInteger> counters = new ConcurrentHashMap<>();
public AsrWebSocketHandler(VllmAsrService vllmAsrService) {
this.vllmAsrService = vllmAsrService;
}
@Override
public void afterConnectionEstablished(WebSocketSession session) {
// ★ Key: the buffer limits must be raised; the default 8 KB is not enough
session.setBinaryMessageSizeLimit(MAX_BINARY_SIZE);
session.setTextMessageSizeLimit(1024 * 1024);
counters.put(session.getId(), new AtomicInteger(0));
sendJson(session, "status", "connected", "Connected");
}
@Override
protected void handleBinaryMessage(WebSocketSession session, BinaryMessage message) {
    // Copy the payload out of the ByteBuffer; array() may be unavailable or carry an offset
    ByteBuffer payload = message.getPayload();
    byte[] audio = new byte[payload.remaining()];
    payload.get(audio);
    int chunkId = counters.get(session.getId()).incrementAndGet();
    // Validate the WAV header ("RIFF" at offset 0, "WAVE" at offset 8)
    if (audio.length < 44 || audio[0] != 'R' || audio[8] != 'W') {
        sendError(session, chunkId, "Not a valid WAV file");
        return;
    }
sendJson(session, "chunk_start", String.valueOf(chunkId), "");
vllmAsrService.streamTranscribe(
vllmAsrService.bytesToBase64(audio),
token -> {
JsonObject m = new JsonObject();
m.addProperty("type", "token");
m.addProperty("chunkId", chunkId);
m.addProperty("token", token);
sendText(session, gson.toJson(m));
},
text -> {
JsonObject m = new JsonObject();
m.addProperty("type", "chunk_complete");
m.addProperty("chunkId", chunkId);
m.addProperty("text", text);
sendText(session, gson.toJson(m));
},
err -> sendError(session, chunkId, err.getMessage())
);
}
private void sendJson(WebSocketSession s, String type, String id, String msg) {
JsonObject o = new JsonObject();
o.addProperty("type", type);
o.addProperty("id", id);
o.addProperty("message", msg);
sendText(s, gson.toJson(o));
}
private void sendError(WebSocketSession s, int id, String msg) {
JsonObject o = new JsonObject();
o.addProperty("type", "error");
o.addProperty("chunkId", id);
o.addProperty("message", msg);
sendText(s, gson.toJson(o));
}
private void sendText(WebSocketSession s, String text) {
if (s.isOpen()) {
try { synchronized (s) { s.sendMessage(new TextMessage(text)); } }
catch (IOException ignored) {}
}
}
@Override
public void afterConnectionClosed(WebSocketSession s, CloseStatus status) {
counters.remove(s.getId());
}
}
5. Frontend Core Implementation
5.1 Audio capture: why not MediaRecorder?
| Approach | Problem |
|---|---|
| ❌ MediaRecorder → WebM → decodeAudioData → WAV | Browser WebM decoding often fails silently |
| ✅ AudioWorklet / ScriptProcessor → PCM → hand-written WAV | Zero dependencies, reliable across browsers |
5.2 pcm-worker.js
class PcmCaptureProcessor extends AudioWorkletProcessor {
constructor() {
super();
this._active = true;
this.port.onmessage = (e) => { if (e.data.command === 'stop') this._active = false; };
}
process(inputs) {
if (!this._active) return false;
const ch = inputs[0];
    if (ch && ch[0] && ch[0].length > 0) {
      // Copy once, then transfer that same copy's buffer (the engine reuses the input buffer)
      const pcm = new Float32Array(ch[0]);
      this.port.postMessage({ pcmData: pcm }, [pcm.buffer]);
    }
return true;
}
}
registerProcessor('pcm-capture-processor', PcmCaptureProcessor);
5.3 Frontend JS core logic (abridged)
// ===== WAV encoder: Float32 PCM → standard 16-bit WAV =====
function encodeWav(samples, sr) {
var n = samples.length, dl = n * 2;
var buf = new ArrayBuffer(44 + dl), v = new DataView(buf);
writeStr(v,0,'RIFF'); v.setUint32(4,36+dl,true); writeStr(v,8,'WAVE');
writeStr(v,12,'fmt '); v.setUint32(16,16,true); v.setUint16(20,1,true);
v.setUint16(22,1,true); v.setUint32(24,sr,true); v.setUint32(28,sr*2,true);
v.setUint16(32,2,true); v.setUint16(34,16,true);
writeStr(v,36,'data'); v.setUint32(40,dl,true);
for (var i=0,o=44; i<n; i++,o+=2) {
var s = Math.max(-1, Math.min(1, samples[i]));
v.setInt16(o, s<0 ? s*0x8000 : s*0x7FFF, true);
}
return buf;
}
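For checking the byte offsets against the JS version, the same 44-byte RIFF header can be written with Python's struct module. The sketch below is a field-for-field port of encodeWav (PCM format tag 1, mono, 16-bit, little-endian):

```python
import struct

def _to_int16(s: float) -> int:
    """Clamp to [-1, 1] and scale exactly as the JS encoder does."""
    c = max(-1.0, min(1.0, s))
    return int(c * 0x8000) if c < 0 else int(c * 0x7FFF)

def encode_wav(samples, sample_rate: int = 16000) -> bytes:
    """Float32 samples in [-1, 1] → 16-bit mono PCM WAV bytes."""
    n = len(samples)
    data_len = n * 2
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_len, b"WAVE",
        b"fmt ", 16,           # fmt chunk size
        1,                     # audio format: PCM
        1,                     # channels: mono
        sample_rate,
        sample_rate * 2,       # byte rate = sr * channels * 2
        2,                     # block align
        16,                    # bits per sample
        b"data", data_len,
    )
    return header + struct.pack("<%dh" % n, *(_to_int16(s) for s in samples))
```

Parsing the result back with `struct.unpack` is a quick sanity check that the sample rate lands at offset 24 and the data chunk at offset 44, matching the offsets hard-coded in the JS.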
// ===== Recording: AudioWorklet capture + timed segment upload =====
async function startRecording() {
websocket = new WebSocket('ws://localhost:8080/ws/asr');
websocket.binaryType = 'arraybuffer';
mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
audioContext = new AudioContext({ sampleRate: 16000 });
var source = audioContext.createMediaStreamSource(mediaStream);
  // Capture PCM with an AudioWorklet
await audioContext.audioWorklet.addModule('pcm-worker.js');
workletNode = new AudioWorkletNode(audioContext, 'pcm-capture-processor');
var pcmChunks = [];
workletNode.port.onmessage = function(e) { pcmChunks.push(e.data.pcmData); };
source.connect(workletNode);
workletNode.connect(audioContext.destination);
  // Send one segment every 3 seconds (keep the handle so stopRecording can clear it)
  sendTimer = setInterval(function() {
if (pcmChunks.length === 0) return;
var chunks = pcmChunks.splice(0);
var total = chunks.reduce(function(s,c){ return s+c.length; }, 0);
var merged = new Float32Array(total);
for (var i=0,off=0; i<chunks.length; i++) { merged.set(chunks[i],off); off+=chunks[i].length; }
websocket.send(encodeWav(merged, 16000));
}, 3000);
}
// ===== Parsing ASR output: strip the <|language|>...<|asr_text|> prefix =====
function handleToken(chunkId, token) {
var state = chunkStates[chunkId];
state.rawBuffer += token;
if (!state.prefixDone) {
var idx = state.rawBuffer.indexOf('<|asr_text|>');
if (idx === -1) idx = state.rawBuffer.indexOf('<asr_text>');
if (idx !== -1) {
var markerLen = state.rawBuffer.indexOf('>', idx) + 1;
var prefix = state.rawBuffer.substring(0, markerLen);
      // Extract the language: <|language|>Chinese → "Chinese"
      // (stop at the next '<' so the <|asr_text|> marker is not captured)
      var m = prefix.match(/<\|?language\|?>\s*([^<\s]+)/i);
state.language = m ? m[1] : null;
state.displayText = state.rawBuffer.substring(markerLen);
state.prefixDone = true;
      showLanguageTag(chunkId, state.language); // e.g. show 🇨🇳 Chinese
}
} else {
state.displayText += token;
}
showText(chunkId, state.displayText);
}
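The buffering in handleToken is the fiddly part: tokens arrive in arbitrary splits, so the marker may be cut across several of them. A sketch of the same state machine, with the language match stopping at the next `<` so it cannot swallow the `<|asr_text|>` marker:

```python
import re

class ChunkState:
    """Buffers streamed tokens until the <|asr_text|> marker, then passes text through."""

    def __init__(self):
        self.raw = ""
        self.display = ""
        self.language = None
        self.prefix_done = False

    def feed(self, token: str) -> None:
        if self.prefix_done:
            self.display += token
            return
        self.raw += token
        idx = self.raw.find("<|asr_text|>")
        if idx == -1:
            idx = self.raw.find("<asr_text>")
        if idx == -1:
            return  # marker not complete yet; keep buffering
        marker_end = self.raw.index(">", idx) + 1
        prefix = self.raw[:marker_end]
        m = re.search(r"<\|?language\|?>\s*([^<\s]+)", prefix, re.IGNORECASE)
        self.language = m.group(1) if m else None
        self.display = self.raw[marker_end:]  # anything after the marker is transcript
        self.prefix_done = True
```

Feeding it arbitrarily split tokens reproduces the behavior described above: the language tag is extracted once, and everything after the marker streams straight through.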
6. Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| `mediaDevices` is undefined | Non-secure context (e.g. http://192.168.x.x) | Use http://localhost or configure HTTPS |
| `Illegal invocation` | Accessing `AudioContext.prototype.audioWorklet` directly | Detect support with `typeof AudioWorkletNode !== 'undefined'` |
| WebSocket buffer too small (code=1009) | Default buffer is 8 KB; each WAV segment is ~96 KB | `session.setBinaryMessageSizeLimit(10 * 1024 * 1024)` |
| MediaRecorder → WAV fails silently | Poor `decodeAudioData(webm)` compatibility | Capture PCM with AudioWorklet and encode WAV by hand |
| Output contains a `language ...<asr_text>` prefix | Qwen3-ASR's standard output format | Buffer and parse the stream on the frontend; extract the language, display only the text |
7. Demo
![Live transcription demo](screenshot.png)
8. Summary
| Dimension | Choice |
|---|---|
| ASR model | Qwen3-ASR-1.7B / 0.6B (Apache 2.0 open source) |
| Inference engine | vLLM (currently the only backend with streaming support) |
| Backend | Spring Boot + WebSocket + OkHttp SSE |
| Frontend capture | AudioWorklet, with ScriptProcessor fallback |
| Audio format | 16 kHz / 16-bit / mono WAV |
| Segmentation | Fixed 3–5 s windows (integrate VAD-based endpointing in production) |
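The segmentation row above recommends VAD-based endpointing instead of fixed windows. As a starting point, here is a hedged sketch of naive energy-threshold segmentation over Float32 samples; the thresholds are illustrative, not tuned, and a real deployment would use a proper VAD (e.g. webrtcvad or Silero VAD):

```python
def split_on_silence(samples, sample_rate=16000, frame_ms=30,
                     energy_threshold=1e-4, min_silence_frames=10):
    """Naive energy-based segmentation: cut a segment once enough
    consecutive low-energy frames accumulate.

    Returns a list of (start, end) sample indices. Thresholds here are
    illustrative defaults, not tuned values.
    """
    frame_len = sample_rate * frame_ms // 1000
    segments, seg_start, silence = [], None, 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len  # mean power of the frame
        if energy >= energy_threshold:
            if seg_start is None:
                seg_start = i       # speech onset
            silence = 0
        elif seg_start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # close the segment at the start of the silent run
                segments.append((seg_start, i - (silence - 1) * frame_len))
                seg_start, silence = None, 0
    if seg_start is not None:
        segments.append((seg_start, len(samples)))
    return segments
```

Each returned (start, end) range would then be encoded as WAV and sent over the WebSocket, replacing the fixed 3-second timer in section 5.3.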
References
| Resource | URL |
|---|---|
| Qwen3-ASR GitHub | https://github.com/QwenLM/Qwen3-ASR |
| Qwen3-ASR-1.7B | https://huggingface.co/Qwen/Qwen3-ASR-1.7B |
| vLLM documentation | https://docs.vllm.ai |
If you run into problems, check first: ① that vLLM is up and responding; ② that the page is served from localhost or over HTTPS; ③ that the WebSocket message size limits are large enough. 🎉