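Preface
Happy Valentine's Day! With the Lunar New Year just around the corner, here is a little something that doesn't take much brainpower.
I mostly listen to J-Pop, J-Rock, ACG music, and Latin music, largely in Japanese, Spanish, and other less widely covered languages, so I often run into lyrics that are incomplete or missing altogether. For example, a denpa song that recently showed up in my daily recommendations came with lyrics that clearly skipped four lines and replaced them with 「聞き取れなかった」 ("could not make it out"). To be fair, those lines really are hard to catch.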

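Since the complete lyrics are nowhere to be found online, a bit of technology is called for.
Environment setup
- CUDA 12.8.1 for GPU acceleration (RTX 50-series cards need at least this version)
- FFmpeg 7.1.1
- A Python 3.12 virtual environment
- PyTorch and Torchaudio, version 2.7.0+cu128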
pip install --force-reinstall torch==2.7.0 torchaudio==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu128
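Before moving on, it can help to confirm that the pieces listed above are actually visible from the virtual environment. The following is a minimal sketch using only the standard library plus an optional torch import; the function name check_environment and the report keys are made up for illustration and are not part of any of the tools used here.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Collect basic facts about the tools this walkthrough relies on."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported lazily so the check also runs without torch

            report["torch_version"] = torch.__version__
            report["cuda_available"] = torch.cuda.is_available()
        except Exception as exc:  # a broken install should not crash the check
            report["torch_error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If torch_version does not end in +cu128 or cuda_available is False, rerunning the pip command above is the first thing to try.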
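Source separation: demucs
demucs (https://github.com/facebookresearch/demucs) is a powerful music source separation model open-sourced by Facebook. The latest release, demucs v4, uses a Hybrid Transformer architecture with cross-attention between the time domain and the frequency domain, and it performs well.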

Install demucs:
pip install -U demucs
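Use the fine-tuned Hybrid Transformer model (htdemucs_ft) for the separation. Since I only need two tracks, vocals and accompaniment, I pass the --two-stems vocals option.
import demucs.separate
demucs.separate.main(["--mp3", "--two-stems", "vocals", "-n", "htdemucs_ft", "Minatoku Azabu.flac"])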
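The transcription step later reads the separated vocals from a path under separated/htdemucs_ft/. As a rough sketch of where the two output files should land, assuming demucs's default output folder (separated) and its default model/track/stem layout, the paths can be derived from the input file name. The helper demucs_two_stem_outputs is hypothetical and is not part of demucs.

```python
from pathlib import Path


def demucs_two_stem_outputs(track, model="htdemucs_ft", stem="vocals",
                            ext="mp3", out_dir="separated"):
    """Return the (stem, complement) paths expected from a --two-stems run.

    Assumes the default layout: <out_dir>/<model>/<track name>/<stem>.<ext>
    """
    folder = Path(out_dir) / model / Path(track).stem
    return folder / f"{stem}.{ext}", folder / f"no_{stem}.{ext}"


vocals, backing = demucs_two_stem_outputs("Minatoku Azabu.flac")
print(vocals.as_posix())   # separated/htdemucs_ft/Minatoku Azabu/vocals.mp3
print(backing.as_posix())  # separated/htdemucs_ft/Minatoku Azabu/no_vocals.mp3
```

If a different output folder or file name pattern is passed to demucs, the paths change accordingly.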
The result is shown in the figure below:

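Lyrics transcription: openlrc (based on faster-whisper)
openlrc (https://github.com/zh-plus/openlrc) is an open-source, multilingual speech-to-text and translation tool built on faster-whisper (https://github.com/SYSTRAN/faster-whisper). Because faster-whisper runs on the more efficient CTranslate2 inference engine, at the same fp16 precision it is roughly 2x faster than the original whisper while using almost the same amount of GPU memory.
openlrc can also translate the transcribed text and ships with a built-in translator-role prompt, but you have to connect it to an LLM yourself, as shown in the figure below.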

Install openlrc:
pip install openlrc
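If PyTorch and Torchaudio run into problems (for example, the error ModuleNotFoundError: No module named 'torchaudio.backend'), rerun the install command from the Environment setup section.
Use the whisper large-v3 model. The VAD parameters can be tuned as needed; in my tests the default VAD parameters already work well for lyrics.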
from openlrc import LRCer

# VAD (voice activity detection) options; tune as needed
vad_options = {
    "threshold": 0.500,
    "neg_threshold": 0.363,
    "min_speech_duration_ms": 0,
    "max_speech_duration_s": float("inf"),
    "min_silence_duration_ms": 2000,
    "speech_pad_ms": 400
}
lrcer = LRCer(whisper_model='large-v3', compute_type='float16', device='cuda', vad_options=vad_options)
# Set the language to Japanese; skip_trans=True skips the translation step
lrcer.run('separated/htdemucs_ft/Minatoku Azabu/vocals.mp3', target_lang='ja-jp', skip_trans=True)
On a home machine with a GeForce RTX 5090D, the whole process took under a minute, with GPU memory usage peaking at a little over 9 GB.
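For anyone who does want to tune the VAD options, a rough mental model of threshold, neg_threshold, min_silence_duration_ms, and speech_pad_ms may help. The sketch below is a deliberately simplified illustration, not the implementation faster-whisper actually uses: it assumes a list of per-frame speech probabilities is already available, and the function speech_segments and its frame_ms parameter are hypothetical. It also ignores min_speech_duration_ms and max_speech_duration_s, and it does not merge segments whose padding overlaps.

```python
def speech_segments(probs, frame_ms=32, threshold=0.5, neg_threshold=0.363,
                    min_silence_duration_ms=2000, speech_pad_ms=400):
    """Turn per-frame speech probabilities into padded (start_ms, end_ms) pairs."""
    segments = []
    start = None          # first frame of the currently open segment, if any
    silence_start = None  # frame where the current quiet run began
    for i, p in enumerate(probs):
        if start is None:
            if p >= threshold:          # speech begins above the upper threshold
                start = i
                silence_start = None
        elif p < neg_threshold:         # quiet frame inside an open segment
            if silence_start is None:
                silence_start = i
            quiet_ms = (i - silence_start + 1) * frame_ms
            if quiet_ms >= min_silence_duration_ms:
                segments.append((start, silence_start))  # close at start of silence
                start = None
                silence_start = None
        elif p >= threshold:            # speech resumed, forget the quiet run
            silence_start = None
        # values between the two thresholds leave the state unchanged
    if start is not None:
        segments.append((start, len(probs)))
    total_ms = len(probs) * frame_ms
    return [(max(0, s * frame_ms - speech_pad_ms),
             min(total_ms, e * frame_ms + speech_pad_ms)) for s, e in segments]


# One second of speech, half a second of quiet, one more second of speech
probs = [0.9] * 10 + [0.1] * 5 + [0.9] * 10
print(speech_segments(probs, frame_ms=100, speech_pad_ms=100))
# [(0, 2500)]  -> the 500 ms gap is shorter than 2000 ms, so it stays one segment
print(speech_segments(probs, frame_ms=100, min_silence_duration_ms=300, speech_pad_ms=100))
# [(0, 1100), (1400, 2500)]
```

In this model, a large min_silence_duration_ms such as 2000 keeps sung phrases with short breaths together in one segment, which is plausibly why the settings above suit lyrics.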

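Opening the generated lrc file shows that the four missing lines were transcribed, and their style fits the tone of the rest of the song (although 「カレル」, which I suspect is a coined word or a loanword, remains highly ambiguous).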
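To compare the transcription against the lyrics shown in the player, it can be convenient to load the lrc file as timestamped lines rather than reading it by eye. This is a small sketch that assumes the common [mm:ss.xx]text line format with one timestamp per line; the parse_lrc helper and the sample text are invented for illustration and are not taken from the actual output.

```python
import re

# Matches a timed lyric line such as "[01:02.50]some text";
# metadata tags like "[ar:...]" do not start with digits and are skipped.
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")


def parse_lrc(text):
    """Return a list of (seconds, lyric) pairs from LRC-formatted text."""
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries


sample = "[ar:placeholder artist]\n[00:12.50]first line\n[01:02.00]second line"
for seconds, lyric in parse_lrc(sample):
    print(f"{seconds:7.2f}s  {lyric}")
```

For a real file, read it with UTF-8 encoding and pass the contents to parse_lrc.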

To translate at the same time, you need to provide an API endpoint and set the chatbot_model parameter. Taking DeepSeek as an example:
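from openlrc import LRCer, ModelConfig, ModelProvider

# DeepSeek exposes an OpenAI-compatible API, hence the OPENAI provider
deepseek_model = ModelConfig(
    provider=ModelProvider.OPENAI,
    name='deepseek-chat',
    base_url='https://api.deepseek.com/v1',
    api_key='******'
)
# vad_options is the same dict of VAD settings used in the transcription step above
lrcer = LRCer(whisper_model='large-v3', compute_type='float16', device='cuda', vad_options=vad_options, chatbot_model=deepseek_model)
# Translate into English and keep both languages in the output
lrcer.run('separated/htdemucs_ft/Minatoku Azabu/vocals.mp3', target_lang='en-us', bilingual_sub=True)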
Before translating, the LLM generates a guideline:
### Glossary:
- あざぶ (Azabu): A district in Minato Ward, Tokyo. In this context, it is used as a recurring motif and place name. It should be transliterated as "Azabu" and treated as a proper noun.
- カナブ (Kanabu): Likely a stylized or slang term. Given the context, it may refer to a "beetle" (カナブン/kanabun) or be a playful, nonsensical word. Translators should consider the phonetic and poetic intent.
- ひでぶ (Hidebu): Appears to be a nonsensical or coined term, possibly for rhyme or rhythm. It should be transliterated as "hidebu" to preserve the lyrical flow.
- ハーブ (hābu): "Herb." Refers to aromatic plants, possibly used metaphorically.
- ムーブ (mūbu): From English "move." In context, it likely means "movement" or "action" in daily life.
- タッツ (tattsu): Likely a stylized or abbreviated term. Could be a playful reference to "tats" (short for tattoos) or a nonsensical word for poetic effect.
- パレル (pareru): Possibly derived from "pare" (to pare down) or a coined term. Transliterate as "pareru" to maintain the abstract, lyrical quality.
- 港区 (Minato-ku): "Minato Ward." A special ward in Tokyo, known for its upscale areas like Azabu.
- 稽古 (keiko): "Practice" or "training." In context, it may refer to lessons learned from past experiences or studies.
- 納得 (nattoku): "Understanding," "acceptance," or "conviction." Used here in a fragmented, reflective manner.
### Characters:
- There are no specific named characters in the text. The narrative appears to be a first-person poetic monologue from an unnamed speaker reflecting on their experiences and emotions in Azabu, Tokyo. The speaker references themselves and a vague "you" (あなた), but these are not developed characters.
### Summary:
The text is a lyrical, poetic piece centered on the Azabu district in Tokyo's Minato Ward. It depicts late-night scenes and personal reflections, blending reality with dreamlike imagery. The speaker reminisces about past experiences, from adolescence to adulthood, and expresses a mix of nostalgia, longing, and acceptance. Themes include urban life, time passage, memory, and emotional resilience, conveyed through repetitive phrases and abstract metaphors.
.......
The translation result is shown in the figure below.
