Real-Time Speech Transcription with Qwen3-ASR + vLLM

This article shows how to build a browser-based, real-time streaming speech transcription system using Alibaba Cloud's open-source Qwen3-ASR speech recognition model, the vLLM inference engine, and a Spring Boot backend.


1. About Qwen3-ASR

Qwen3-ASR is an open-source ASR model series from the Qwen team, supporting speech recognition, language detection, and timestamp prediction across 52 languages and dialects.

| Model | Parameters | Positioning | Streaming inference |
|---|---|---|---|
| Qwen3-ASR-1.7B | ~1.7B | Accuracy-first, SOTA performance | ✅ vLLM only |
| Qwen3-ASR-0.6B | ~0.6B | Efficiency-first, TTFT as low as 92 ms | ✅ vLLM only |

Both models are released under the Apache 2.0 license; streaming inference is currently supported only through the vLLM backend.


2. Model Deployment

2.1 Installation and startup

```bash
# Install vLLM (requires v0.14.0+)
pip install "vllm[audio]"

# Start the server
vllm serve Qwen/Qwen3-ASR-1.7B --host 0.0.0.0 --port 8000
```

2.2 Verification

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-ASR-1.7B",
    "messages": [{"role": "user", "content": [
      {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"}}
    ]}]
  }'
```

3. System Architecture

```
┌────────────────────┐     WebSocket      ┌────────────────────┐    SSE (stream)   ┌────────────┐
│  Browser (HTML5)   │ ◄══════════════►  │ Spring Boot backend│ ◄══════════════► │ vLLM Server│
│                    │  WAV upstream      │                    │  /v1/chat/        │ Qwen3-ASR  │
│ AudioWorklet       │  tokens downstream │ WebSocket handler  │  completions      │ :8000      │
│ PCM → WAV encoder  │                    │ OkHttp SSE client  │  stream=true      │            │
└────────────────────┘                    └────────────────────┘                   └────────────┘
```

Core flow: the browser captures microphone PCM data via AudioWorklet → the frontend encodes it as WAV → sends it to the backend over WebSocket → the backend Base64-encodes the audio and calls vLLM with a streaming SSE request → tokens are pushed back to the browser one by one for real-time display.


4. Backend Code

4.1 Dependencies (key parts of pom.xml)

```xml
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-websocket</artifactId>
    </dependency>
    <dependency>
        <groupId>com.squareup.okhttp3</groupId>
        <artifactId>okhttp-sse</artifactId>
        <version>4.12.0</version>
    </dependency>
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.11.0</version>
    </dependency>
</dependencies>
```

4.2 VllmAsrService.java — streaming call to vLLM

```java
@Service
public class VllmAsrService {

    @Value("${asr.vllm.base-url}")
    private String baseUrl;

    @Value("${asr.vllm.model}")
    private String model;

    @Value("${asr.vllm.api-key}")
    private String apiKey;

    private OkHttpClient httpClient;

    @PostConstruct
    public void init() {
        this.httpClient = new OkHttpClient.Builder()
                .connectTimeout(30, TimeUnit.SECONDS)
                .readTimeout(120, TimeUnit.SECONDS)
                .build();
    }

    public void streamTranscribe(String wavBase64,
                                  Consumer<String> onToken,
                                  Consumer<String> onComplete,
                                  Consumer<Throwable> onError) {

        // Build the OpenAI-compatible request body
        JsonObject requestBody = new JsonObject();
        requestBody.addProperty("model", model);
        requestBody.addProperty("stream", true);

        JsonObject audioUrlObj = new JsonObject();
        audioUrlObj.addProperty("url", "data:audio/wav;base64," + wavBase64);
        JsonObject contentItem = new JsonObject();
        contentItem.addProperty("type", "audio_url");
        contentItem.add("audio_url", audioUrlObj);
        JsonArray content = new JsonArray();
        content.add(contentItem);
        JsonObject msg = new JsonObject();
        msg.addProperty("role", "user");
        msg.add("content", content);
        JsonArray messages = new JsonArray();
        messages.add(msg);
        requestBody.add("messages", messages);

        Request request = new Request.Builder()
                .url(baseUrl + "/v1/chat/completions")
                .addHeader("Authorization", "Bearer " + apiKey)
                .post(RequestBody.create(new Gson().toJson(requestBody),
                        MediaType.parse("application/json")))
                .build();

        StringBuilder fullText = new StringBuilder();

        EventSources.createFactory(httpClient).newEventSource(request, new EventSourceListener() {
            @Override
            public void onEvent(EventSource src, String id, String type, String data) {
                if ("[DONE]".equals(data.trim())) {
                    onComplete.accept(fullText.toString());
                    return;
                }
                try {
                    JsonObject delta = JsonParser.parseString(data).getAsJsonObject()
                            .getAsJsonArray("choices").get(0).getAsJsonObject()
                            .getAsJsonObject("delta");
                    if (delta != null && delta.has("content") && !delta.get("content").isJsonNull()) {
                        String token = delta.get("content").getAsString();
                        fullText.append(token);
                        onToken.accept(token);
                    }
                } catch (Exception ignored) {
                    // Skip keep-alive or malformed events
                }
            }

            @Override
            public void onFailure(EventSource src, Throwable t, Response resp) {
                onError.accept(t != null ? t : new RuntimeException("SSE failure"));
            }
        });
    }

    public String bytesToBase64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }
}
```
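The three `asr.vllm.*` properties injected above also need to be defined in the application configuration. A minimal `application.properties` sketch (host, port, and key are placeholder values — vLLM only enforces the API key if started with `--api-key`):

```properties
# vLLM server location and model (adjust to your deployment)
asr.vllm.base-url=http://localhost:8000
asr.vllm.model=Qwen/Qwen3-ASR-1.7B
# Any non-empty value works when vLLM runs without --api-key
asr.vllm.api-key=EMPTY
```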

4.3 AsrWebSocketHandler.java — WebSocket handler

```java
@Component
public class AsrWebSocketHandler extends AbstractWebSocketHandler {

    private static final int MAX_BINARY_SIZE = 10 * 1024 * 1024;
    private final VllmAsrService vllmAsrService;
    private final Gson gson = new Gson();
    private final ConcurrentHashMap<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    public AsrWebSocketHandler(VllmAsrService vllmAsrService) {
        this.vllmAsrService = vllmAsrService;
    }

    @Override
    public void afterConnectionEstablished(WebSocketSession session) {
        // ★ Key: raise the message size limits — the 8 KB default is too small
        session.setBinaryMessageSizeLimit(MAX_BINARY_SIZE);
        session.setTextMessageSizeLimit(1024 * 1024);
        counters.put(session.getId(), new AtomicInteger(0));
        sendJson(session, "status", "connected", "Connected");
    }

    @Override
    protected void handleBinaryMessage(WebSocketSession session, BinaryMessage message) {
        // Copy out of the ByteBuffer — array() can fail on read-only or offset-backed buffers
        ByteBuffer payload = message.getPayload();
        byte[] audio = new byte[payload.remaining()];
        payload.get(audio);
        int chunkId = counters.get(session.getId()).incrementAndGet();

        // Validate the WAV header
        if (audio.length < 44 || audio[0] != 'R' || audio[8] != 'W') {
            sendError(session, chunkId, "Not a valid WAV");
            return;
        }

        sendJson(session, "chunk_start", String.valueOf(chunkId), "");

        vllmAsrService.streamTranscribe(
            vllmAsrService.bytesToBase64(audio),
            token -> {
                JsonObject m = new JsonObject();
                m.addProperty("type", "token");
                m.addProperty("chunkId", chunkId);
                m.addProperty("token", token);
                sendText(session, gson.toJson(m));
            },
            text -> {
                JsonObject m = new JsonObject();
                m.addProperty("type", "chunk_complete");
                m.addProperty("chunkId", chunkId);
                m.addProperty("text", text);
                sendText(session, gson.toJson(m));
            },
            err -> sendError(session, chunkId, err.getMessage())
        );
    }

    private void sendJson(WebSocketSession s, String type, String id, String msg) {
        JsonObject o = new JsonObject();
        o.addProperty("type", type);
        o.addProperty("id", id);
        o.addProperty("message", msg);
        sendText(s, gson.toJson(o));
    }

    private void sendError(WebSocketSession s, int id, String msg) {
        JsonObject o = new JsonObject();
        o.addProperty("type", "error");
        o.addProperty("chunkId", id);
        o.addProperty("message", msg);
        sendText(s, gson.toJson(o));
    }

    private void sendText(WebSocketSession s, String text) {
        if (s.isOpen()) {
            try { synchronized (s) { s.sendMessage(new TextMessage(text)); } }
            catch (IOException ignored) {}
        }
    }

    @Override
    public void afterConnectionClosed(WebSocketSession s, CloseStatus status) {
        counters.remove(s.getId());
    }
}
```
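The WAV validation above only inspects two magic bytes (`'R'` at offset 0 from "RIFF", `'W'` at offset 8 from "WAVE"). As a standalone sanity check (illustrative code, not part of the project), here is a plain-Java sketch that builds a minimal 44-byte header for 16 kHz / 16-bit / mono audio and runs it through the same check:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WavHeaderSketch {

    // Build a 44-byte PCM WAV header for the given data length (all fields little-endian)
    static byte[] header(int dataLen, int sampleRate) {
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes()).putInt(36 + dataLen).put("WAVE".getBytes());
        b.put("fmt ".getBytes()).putInt(16)
         .putShort((short) 1)              // audio format: PCM
         .putShort((short) 1)              // channels: mono
         .putInt(sampleRate)
         .putInt(sampleRate * 2)           // byte rate = sampleRate * channels * 16/8
         .putShort((short) 2)              // block align
         .putShort((short) 16);            // bits per sample
        b.put("data".getBytes()).putInt(dataLen);
        return b.array();
    }

    // The same quick check the WebSocket handler performs
    static boolean looksLikeWav(byte[] a) {
        return a.length >= 44 && a[0] == 'R' && a[8] == 'W';
    }

    public static void main(String[] args) {
        byte[] h = header(96000, 16000);      // header for 3 s of 16 kHz 16-bit mono
        System.out.println(looksLikeWav(h));  // prints true
    }
}
```

The two-byte check is deliberately cheap; a stricter server could also verify the "fmt " chunk and the declared sample rate.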

5. Frontend Implementation

5.1 Audio capture: why not MediaRecorder?

| Approach | Verdict |
|---|---|
| ❌ MediaRecorder → WebM → decodeAudioData → WAV | Browser WebM decoding often fails silently |
| ✅ AudioWorklet / ScriptProcessor → PCM → hand-written WAV | Zero dependencies, fully reliable |

5.2 pcm-worker.js

```javascript
class PcmCaptureProcessor extends AudioWorkletProcessor {
    constructor() {
        super();
        this._active = true;
        this.port.onmessage = (e) => { if (e.data.command === 'stop') this._active = false; };
    }
    process(inputs) {
        if (!this._active) return false;
        const ch = inputs[0];
        if (ch && ch[0] && ch[0].length > 0) {
            // Copy the frame once and transfer that same copy's buffer
            const pcm = new Float32Array(ch[0]);
            this.port.postMessage({ pcmData: pcm }, [pcm.buffer]);
        }
        return true;
    }
}
registerProcessor('pcm-capture-processor', PcmCaptureProcessor);
```

5.3 Core frontend JS logic (abridged)

```javascript
// ===== WAV encoder: Float32 PCM → standard 16-bit WAV =====
function writeStr(v, off, s) {
    for (var i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
}

function encodeWav(samples, sr) {
    var n = samples.length, dl = n * 2;
    var buf = new ArrayBuffer(44 + dl), v = new DataView(buf);
    writeStr(v,0,'RIFF'); v.setUint32(4,36+dl,true); writeStr(v,8,'WAVE');
    writeStr(v,12,'fmt '); v.setUint32(16,16,true); v.setUint16(20,1,true);
    v.setUint16(22,1,true); v.setUint32(24,sr,true); v.setUint32(28,sr*2,true);
    v.setUint16(32,2,true); v.setUint16(34,16,true);
    writeStr(v,36,'data'); v.setUint32(40,dl,true);
    for (var i=0,o=44; i<n; i++,o+=2) {
        var s = Math.max(-1, Math.min(1, samples[i]));
        v.setInt16(o, s<0 ? s*0x8000 : s*0x7FFF, true);
    }
    return buf;
}
```

```javascript
// ===== Recording: AudioWorklet capture + timed segment upload =====
async function startRecording() {
    websocket = new WebSocket('ws://localhost:8080/ws/asr');
    websocket.binaryType = 'arraybuffer';

    mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
    audioContext = new AudioContext({ sampleRate: 16000 });
    var source = audioContext.createMediaStreamSource(mediaStream);

    // Capture PCM via AudioWorklet
    await audioContext.audioWorklet.addModule('pcm-worker.js');
    workletNode = new AudioWorkletNode(audioContext, 'pcm-capture-processor');
    var pcmChunks = [];
    workletNode.port.onmessage = function(e) { pcmChunks.push(e.data.pcmData); };
    source.connect(workletNode);
    workletNode.connect(audioContext.destination);

    // Send one segment every 3 seconds (keep the timer id so stopRecording can clearInterval it)
    sendTimer = setInterval(function() {
        if (pcmChunks.length === 0) return;
        var chunks = pcmChunks.splice(0);
        var total = chunks.reduce(function(s,c){ return s+c.length; }, 0);
        var merged = new Float32Array(total);
        for (var i=0,off=0; i<chunks.length; i++) { merged.set(chunks[i],off); off+=chunks[i].length; }
        websocket.send(encodeWav(merged, 16000));
    }, 3000);
}
```

```javascript
// ===== Parse ASR output: strip the <|language|>...<|asr_text|> prefix =====
function handleToken(chunkId, token) {
    // Lazily initialize per-chunk parsing state
    var state = chunkStates[chunkId] ||
        (chunkStates[chunkId] = { rawBuffer: '', displayText: '', prefixDone: false, language: null });
    state.rawBuffer += token;

    if (!state.prefixDone) {
        var idx = state.rawBuffer.indexOf('<|asr_text|>');
        if (idx === -1) idx = state.rawBuffer.indexOf('<asr_text>');
        if (idx !== -1) {
            // Index just past the closing '>' of the marker
            var markerEnd = state.rawBuffer.indexOf('>', idx) + 1;
            var prefix = state.rawBuffer.substring(0, markerEnd);
            // Extract the language, stopping at the next marker: <|language|>Chinese → "Chinese"
            var m = prefix.match(/<\|?language\|?>\s*([^<\s]+)/i);
            state.language = m ? m[1] : null;
            state.displayText = state.rawBuffer.substring(markerEnd);
            state.prefixDone = true;
            showLanguageTag(chunkId, state.language);  // e.g. show 🇨🇳 Chinese
        }
    } else {
        state.displayText += token;
    }
    showText(chunkId, state.displayText);
}
```
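The same prefix format can also be parsed server-side once a chunk completes, e.g. to log the detected language. A minimal Java sketch (the marker format `<|language|>Chinese<|asr_text|>…` is taken from the frontend parser; the class is illustrative, not project code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsrPrefixSketch {

    // Matches "<|language|>Chinese<|asr_text|>actual transcript" (pipes optional)
    private static final Pattern PREFIX =
            Pattern.compile("<\\|?language\\|?>\\s*(\\S+?)\\s*<\\|?asr_text\\|?>(.*)", Pattern.DOTALL);

    // Returns {language, text}; language is null when no prefix is present
    static String[] parse(String raw) {
        Matcher m = PREFIX.matcher(raw);
        if (m.matches()) return new String[] { m.group(1), m.group(2) };
        return new String[] { null, raw };
    }
}
```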

6. Pitfalls

| Problem | Cause | Fix |
|---|---|---|
| `mediaDevices` is undefined | Insecure context (e.g. `http://192.168.x.x`) | Use `http://localhost` or set up HTTPS |
| `Illegal invocation` | Accessing `AudioContext.prototype.audioWorklet` directly | Feature-detect with `typeof AudioWorkletNode !== 'undefined'` |
| WebSocket buffer too small (code=1009) | Default buffer is 8 KB; each WAV segment is ~96 KB | `session.setBinaryMessageSizeLimit(10 * 1024 * 1024)` |
| MediaRecorder → WAV fails silently | Poor `decodeAudioData(webm)` compatibility | Capture PCM directly with AudioWorklet + hand-written WAV encoding |
| Transcript contains a `language ...<asr_text>` prefix | Qwen3-ASR's standard output format | Buffer and parse the stream on the frontend; extract the language, display only the text |

7. Demo

(screenshot: real-time transcription in the browser)

8. Summary

| Dimension | Choice |
|---|---|
| ASR model | Qwen3-ASR-1.7B / 0.6B (open source, Apache 2.0) |
| Inference engine | vLLM (the only backend with streaming support) |
| Backend | Spring Boot + WebSocket + OkHttp SSE |
| Frontend capture | AudioWorklet, with ScriptProcessor fallback |
| Audio format | 16 kHz / 16-bit / mono WAV |
| Segmentation | Fixed 3–5 s windows (integrate VAD-based sentence splitting in production) |

References

| Resource | URL |
|---|---|
| Qwen3-ASR GitHub | https://github.com/QwenLM/Qwen3-ASR |
| Qwen3-ASR-1.7B | https://huggingface.co/Qwen/Qwen3-ASR-1.7B |
| vLLM documentation | https://docs.vllm.ai |

If you run into problems, check these first: ① is vLLM running properly; ② is the browser accessing the page via localhost or HTTPS; ③ is the WebSocket buffer large enough. 🎉

Source Code

GitHub: https://github.com/zll-g/qwen-asr-web.git
