Transcribing Lyrics in Less Commonly Taught Languages with a Few Lines of Code

Preface

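Happy Valentine's Day! With Chinese New Year just around the corner, here is a small write-up that doesn't take much brainpower.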

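I mostly listen to J-Pop, J-Rock, ACG music, and Latin music, so the lyrics are mainly in Japanese, Spanish, and other less commonly taught languages, and I often run into lyrics that are incomplete or missing altogether. For example, a denpa song recently showed up in my daily recommendations, and the lyrics supplied by the app were obviously missing four lines, which had been replaced with 「聞き取れなかった」 ("couldn't make it out"). To be fair, those lines really are hard to make out.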

Since the complete lyrics were nowhere to be found online, it was time to bring in some technical means.

Environment setup

CUDA 12.8.1 for GPU acceleration (RTX 50-series cards cannot use anything older than this)
FFmpeg 7.1.1
A Python 3.12 virtual environment
PyTorch and Torchaudio, 2.7.0+cu128 builds

pip install --force-reinstall torch==2.7.0 torchaudio==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu128
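Before moving on, it can help to confirm that the pieces listed above are actually visible from the virtual environment. The following is a minimal sketch using only the standard library; the helper name check_environment is made up for this post, and it only checks that the modules and the ffmpeg executable can be found, not that their versions match.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "torchaudio"), tools=("ffmpeg",)):
    """Report which Python modules and command-line tools can be found.

    check_environment is a hypothetical helper, not part of any library.
    """
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Once torch imports cleanly, torch.cuda.is_available() should return True if the CUDA build matches the driver and GPU.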

Source separation: demucs

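demucs (https://github.com/facebookresearch/demucs) is a powerful set of music source separation models open-sourced by Facebook. The latest version, demucs v4, uses a Hybrid Transformer architecture with cross-attention between the time-domain and frequency-domain branches, and it gives good results.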

Install demucs:

pip install -U demucs

I use the fine-tuned Hybrid Transformer model (htdemucs_ft) for the separation. Since I only need two tracks, vocals and accompaniment, I pass the --two-stems vocals option.

import demucs.separate

demucs.separate.main(["--mp3", "--two-stems", "vocals", "-n", "htdemucs_ft", "Minatoku Azabu.flac"])
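With --two-stems vocals, demucs writes two files per track, vocals and no_vocals, under separated/<model name>/<track name>/. The sketch below only computes those expected paths so they can be passed to the next step; expected_stems is a made-up helper for illustration, and it assumes the default output directory and file naming have not been overridden.

```python
from pathlib import Path


def expected_stems(track, model="htdemucs_ft", out_dir="separated", ext="mp3"):
    """Return the paths demucs is expected to write for a two-stem vocals split.

    Hypothetical helper: assumes demucs' default layout
    <out_dir>/<model>/<track name without extension>/<stem>.<ext>.
    """
    base = Path(out_dir) / model / Path(track).stem
    return {
        "vocals": base / f"vocals.{ext}",
        "no_vocals": base / f"no_vocals.{ext}",
    }


if __name__ == "__main__":
    for stem, path in expected_stems("Minatoku Azabu.flac").items():
        print(stem, "->", path)
```

The vocals path is the one used as input for the transcription step.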

The result is shown in the figure below:

Lyrics transcription: openlrc (based on faster-whisper)

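openlrc (https://github.com/zh-plus/openlrc) is an open-source, multilingual speech-to-text and translation tool built on faster-whisper (https://github.com/SYSTRAN/faster-whisper). Because faster-whisper runs on the more efficient CTranslate2 inference engine, it is roughly 2x faster than the original whisper at the same fp16 precision, with almost identical VRAM usage.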

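openlrc can also translate the transcribed text. It ships with a built-in prompt for the translator role, but you have to connect an LLM yourself, as shown in the figure below.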

Install openlrc:

pip install openlrc

If PyTorch and Torchaudio stop working (for example, with the error ModuleNotFoundError: No module named 'torchaudio.backend'), rerun the install command from the Environment setup section.

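I use the whisper large-v3 model. The VAD parameters can be tuned as needed; in my tests, the default VAD parameters already work very well for recognizing lyrics.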

from openlrc import LRCer

# VAD (voice activity detection) options
vad_options = {
    "threshold": 0.500,
    "neg_threshold": 0.363,
    "min_speech_duration_ms": 0,
    "max_speech_duration_s": float("inf"),
    "min_silence_duration_ms": 2000,
    "speech_pad_ms": 400
}

lrcer = LRCer(whisper_model='large-v3', compute_type='float16', device='cuda', vad_options=vad_options)

# Set the language to Japanese; skip_trans=True transcribes without translating
lrcer.run('separated/htdemucs_ft/Minatoku Azabu/vocals.mp3', target_lang='ja-jp', skip_trans=True)
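To get a feel for what threshold, neg_threshold, and min_silence_duration_ms do, here is a toy sketch of the segmentation logic. It is a simplified illustration, not the actual faster-whisper or Silero VAD implementation: it takes a list of per-frame speech probabilities, ignores min_speech_duration_ms, max_speech_duration_s, and speech_pad_ms, and the function name and the 32 ms frame length are assumptions made for this example.

```python
def toy_vad_segments(probs, frame_ms=32, threshold=0.5,
                     neg_threshold=0.363, min_silence_duration_ms=2000):
    """Group per-frame speech probabilities into (start_ms, end_ms) segments.

    Toy model only: a segment opens when a frame reaches `threshold` and
    closes once frames below `neg_threshold` have lasted for at least
    `min_silence_duration_ms` without speech resuming.
    """
    segments = []
    start = None          # start time of the currently open segment
    silence_start = None  # where the current run of silence began
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= threshold:
            silence_start = None      # speech resumed, forget the pause
            if start is None:
                start = t             # open a new segment
        elif p < neg_threshold and start is not None:
            if silence_start is None:
                silence_start = t
            if t + frame_ms - silence_start >= min_silence_duration_ms:
                segments.append((start, silence_start))
                start = None
                silence_start = None
        # probabilities between the two thresholds leave the state unchanged
    if start is not None:
        # audio ended while a segment was still open
        segments.append((start, len(probs) * frame_ms))
    return segments
```

In this toy model, a pause shorter than min_silence_duration_ms does not split a segment, which is why a value as large as 2000 ms tends to keep a sung line with short breaths in one piece.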

On a home machine with a GeForce RTX 5090D, the whole process takes under a minute, and VRAM usage peaks at a little over 9 GB.

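Opening the generated lrc file, you can see that the four missing lines have been transcribed, and they fit the tone of the rest of the song fairly well (although 「カレル」, which I suspect is a coined word or a loanword, is still highly ambiguous).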
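To inspect the result programmatically rather than in a text editor, a few lines are enough to read the timestamps. This is a minimal sketch that assumes the common [mm:ss.xx] LRC line format with one timestamp per line; parse_lrc is a made-up helper, and the sample text in the test block is placeholder text, not the actual lyrics.

```python
import re

# One timed lyric line, e.g. "[01:02.50]some text"
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")


def parse_lrc(text):
    """Parse LRC text into a list of (seconds, lyric) tuples.

    Hypothetical helper: lines that do not start with a numeric timestamp
    (blank lines, metadata tags such as [ar:...]) are skipped.
    """
    entries = []
    for line in text.splitlines():
        m = LRC_LINE.match(line.strip())
        if m is None:
            continue
        minutes, seconds, lyric = m.groups()
        entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries


if __name__ == "__main__":
    sample = "[ar:placeholder]\n[00:12.50]first line\n[01:02.00]second line\n"
    for seconds, lyric in parse_lrc(sample):
        print(f"{seconds:7.2f}s  {lyric}")
```

Read the generated file with open(path, encoding='utf-8').read() and pass the text to parse_lrc; the path depends on where openlrc wrote the output on your machine.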

To translate at the same time, you need to provide an API endpoint and set the chatbot_model parameter. Here is an example using DeepSeek.

from openlrc import LRCer, ModelConfig, ModelProvider

deepseek_model = ModelConfig(
    provider=ModelProvider.OPENAI,
    name='deepseek-chat',
    base_url='https://api.deepseek.com/v1',
    api_key='******'
)

# vad_options is the same dict of VAD parameters used for transcription above
lrcer = LRCer(whisper_model='large-v3', compute_type='float16', device='cuda', vad_options=vad_options, chatbot_model=deepseek_model)

lrcer.run('separated/htdemucs_ft/Minatoku Azabu/vocals.mp3', target_lang='en-us', bilingual_sub=True)
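The api_key above is masked. One way to keep the real key out of the script is to read it from an environment variable and pass the result as api_key. The sketch below is only a suggestion; the helper load_api_key and the variable name DEEPSEEK_API_KEY are assumptions made for this example, not something openlrc requires.

```python
import os


def load_api_key(env_var="DEEPSEEK_API_KEY"):
    """Read an API key from the environment, failing early if it is missing.

    Hypothetical helper; the environment variable name is an assumption.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the translation step"
        )
    return key
```

With this in place, the config can use api_key=load_api_key() instead of a hard-coded string.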

Before translating, the LLM generates a guideline:

### Glossary:
- あざぶ (Azabu): A district in Minato Ward, Tokyo. In this context, it is used as a recurring motif and place name. It should be transliterated as "Azabu" and treated as a proper noun.
- カナブ (Kanabu): Likely a stylized or slang term. Given the context, it may refer to a "beetle" (カナブン/kanabun) or be a playful, nonsensical word. Translators should consider the phonetic and poetic intent.
- ひでぶ (Hidebu): Appears to be a nonsensical or coined term, possibly for rhyme or rhythm. It should be transliterated as "hidebu" to preserve the lyrical flow.
- ハーブ (hābu): "Herb." Refers to aromatic plants, possibly used metaphorically.
- ムーブ (mūbu): From English "move." In context, it likely means "movement" or "action" in daily life.
- タッツ (tattsu): Likely a stylized or abbreviated term. Could be a playful reference to "tats" (short for tattoos) or a nonsensical word for poetic effect.
- パレル (pareru): Possibly derived from "pare" (to pare down) or a coined term. Transliterate as "pareru" to maintain the abstract, lyrical quality.
- 港区 (Minato-ku): "Minato Ward." A special ward in Tokyo, known for its upscale areas like Azabu.
- 稽古 (keiko): "Practice" or "training." In context, it may refer to lessons learned from past experiences or studies.
- 納得 (nattoku): "Understanding," "acceptance," or "conviction." Used here in a fragmented, reflective manner.

### Characters:
- There are no specific named characters in the text. The narrative appears to be a first-person poetic monologue from an unnamed speaker reflecting on their experiences and emotions in Azabu, Tokyo. The speaker references themselves and a vague "you" (あなた), but these are not developed characters.

### Summary:
The text is a lyrical, poetic piece centered on the Azabu district in Tokyo's Minato Ward. It depicts late-night scenes and personal reflections, blending reality with dreamlike imagery. The speaker reminisces about past experiences, from adolescence to adulthood, and expresses a mix of nostalgia, longing, and acceptance. Themes include urban life, time passage, memory, and emotional resilience, conveyed through repetitive phrases and abstract metaphors.

.......

The translation result is shown in the figure below.
