whisperx is an enhanced speech recognition library built on OpenAI Whisper. Compared with the original whisper, it adds word-level time alignment and speaker diarization, and its ASR module replaces the original whisper backend with faster-whisper, which also brings a significant speed-up.
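To follow along with the examples below, whisperx can be installed from PyPI (exact dependency requirements such as torch/CUDA versions vary by release; check the project README):
pip install whisperx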
Before formally starting, it is worth highlighting how the word-alignment implementation differs across languages. The full implementation is at https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py.
The biggest difference is that 5 languages (en, fr, de, es, it) load their alignment models through torchaudio pipelines:
bundle = torchaudio.pipelines.__dict__[model_name]
align_model = bundle.get_model(dl_kwargs={"model_dir": model_dir}).to(device)
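As a minimal, self-contained sketch of this route (the bundle name below is the English one whisperx appears to use; check DEFAULT_ALIGN_MODELS_TORCH in alignment.py for the authoritative mapping):
import torchaudio

device = "cpu"
# For English, the torchaudio bundle is "WAV2VEC2_ASR_BASE_960H" (assumption; see alignment.py)
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
align_model = bundle.get_model().to(device)
labels = bundle.get_labels()  # the CTC character vocabulary the aligner walks over
print(labels[:10])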
Other languages, including Chinese, load their model via Wav2Vec2ForCTC; for example Chinese uses "zh": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn" (see the DEFAULT_ALIGN_MODELS_HF variable in the source code):
processor = Wav2Vec2Processor.from_pretrained(model_name, cache_dir=model_dir)
align_model = Wav2Vec2ForCTC.from_pretrained(model_name, cache_dir=model_dir)
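A minimal sketch of what the alignment step then does with such a model: feed audio through the CTC head to get per-frame character probabilities, which forced alignment walks to produce timestamps (the dummy waveform here is just for illustration):
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_name = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

waveform = np.zeros(16000, dtype=np.float32)  # 1 second of 16 kHz mono audio (placeholder)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # shape: (batch, frames, vocab)
emissions = torch.log_softmax(logits, dim=-1)  # per-frame log-probs over characters
print(emissions.shape)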
For easier reading and study, the modules used internally by the three core capabilities are summarized below:
Capability | Library/Tool | Link/Notes |
---|---|---|
Speech recognition | faster-whisper | https://github.com/SYSTRAN/faster-whisper |
Word-level alignment | Wav2Vec2 | implementation: Wav2Vec2; per-language differences as described above |
Speaker diarization | pyannote | https://github.com/pyannote/pyannote-audio |
Below are concrete usage examples for the three capabilities.
1. Speech to Text
import whisperx
device = "cpu" # cpu 或 cuda
audio_file = "test.mp4" # 支持音频和视频
# step 1.1, load_model(...) 加载ASR模型,常用参数有
# - whisper_arch: tiny, tiny.en, base, base.en, small(默认), small.en, medium, medium.en, large, turbo,
# 实测mac m3上,turbo模型也可以顺畅的运行
# 支持的最新模型列表见 https://github.com/openai/whisper
# - device: 可选值 cpu(默认), cuda
# - compute_type: float16(默认), float32, int8, 注意 device='cpu' 不支持float16
# - language: en, zh, yue, ...等近百种语言,默认会自动检测
# ...
# 完成参数见 https://github.com/m-bain/whisperX/blob/main/whisperx/asr.py 的 load_model方法
model = whisperx.load_model(whisper_arch="small", device=device, compute_type="int8") # 如果需要指定模型目录,可以设置download_root参数
# step 1.2, load_audio(...) 加载音频,audio_file可以使音频也可以是视频
audio = whisperx.load_audio(audio_file)
# step 1.3, 用模型来识别音频(可以通过batch_size参数设置batch_size)
result = model.transcribe(audio)
The structure of result looks like:
{
"segments": [
{
"text": " Don't laugh too fast when others fall.",
"start": 0.031,
"end": 23.099
}
],
"language": "en"
}
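For example, you can dump the transcript with timestamps like this (a small helper sketch, not part of the whisperx API):
for seg in result["segments"]:
    print(f'[{seg["start"]:7.3f} --> {seg["end"]:7.3f}] {seg["text"].strip()}')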
2. Word-Level Alignment
# step 2.1: load the alignment model
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
# step 2.2: run the alignment
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
The structure of result looks like:
{
"segments": [
{
"start": 0.031,
"end": 3.057,
"text": " See, the reality is people out there watching you.",
"words": [ // 同word_segments,但这里是按句进行组织
{ "word": "See,", "start": 0.031, "end": 0.312, "score": 0.889 },
{ "word": "the", "start": 0.332, "end": 0.412, "score": 0.846 }
],
"chars": [ // 字符级别的对齐。注意,只有 return_char_alignments = True,才会有此部分
{ "char": " " }, // 注意并不是所有的 `char` 都有 start end 和score
{
"char": "a",
"start": 21.479, "end": 21.579, "score": 0.745
}, {
"char": "n",
"start": 21.579, "end": 21.599, "score": 0.596
},...
]
}
],
"word_segments": [ // word_segments是在原结果基础上词对齐结果
{ "word": "See,", "start": 0.031, "end": 0.312, "score": 0.889 },
{ "word": "the", "start": 0.332, "end": 0.412, "score": 0.846 }
]
}
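With word-level timestamps you can already build word-accurate subtitles. A minimal sketch that prints one SRT-style cue per word (the helper below is my own, not part of whisperx):
def to_timestamp(t: float) -> str:
    # Format seconds as an SRT timestamp, e.g. 0.031 -> 00:00:00,031
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

for i, w in enumerate(result["word_segments"], start=1):
    # Some words may come back without timestamps (e.g. tokens the aligner
    # could not match), so skip those defensively.
    if "start" in w and "end" in w:
        print(f'{i}\n{to_timestamp(w["start"])} --> {to_timestamp(w["end"])}\n{w["word"]}\n')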
3. Speaker Diarization
Note: whisperx's speaker diarization actually uses pyannote under the hood, and the model requires accepting its license agreement on Hugging Face before use. Agreement page: https://huggingface.co/pyannote/segmentation-3.0
# step 3.1: load the diarization model
diarize_model = whisperx.diarize.DiarizationPipeline(device=device)
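# Note (assumption): if the gated pyannote weights are not already cached locally,
# you will likely need to pass a Hugging Face token via the use_auth_token parameter,
# e.g. (the token value is a placeholder):
# diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token="hf_...", device=device)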
# step 3.2: detect the speakers in the audio
diarize_segments = diarize_model(audio)  # diarize_segments is a DataFrame
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)  # if the number of speakers is known to fall within a range, you can constrain it
# step 3.3: merge the diarization result into the transcription result; note that diarize_segments
# itself is just a table of speaker turns with no word-level alignment info, so this merge attaches
# speaker labels to the already-aligned segments and words
result = whisperx.assign_word_speakers(diarize_segments, result)
An example of the merged result:
{
"segments": [
{
"start": 3.22, "end": 8.248, "text": "放上车",
"words": [
{
"word": "放", "start": 3.22, "end": 16.813, "score": 0.968,
"speaker": "SPEAKER_04"
}, ...
],
"speaker": "SPEAKER_04" // 词和句结果均增加了说话人信息
},
"word_segments": [
{
"word": "放", "start": 3.22, "end": 16.813, "score": 0.968,
"speaker": "SPEAKER_04" // 词和句结果均增加了说话人信息
}, ...
]
}
As a supplement, here is a data sample of the diarize_segments object (its type is a pandas DataFrame):
segment label speaker start end intersection union
0 [ 00:00:03.254 --> 00:00:03.861] A SPEAKER_03 3.254094 3.861594 -183.937406 192.256906
1 [ 00:00:03.895 --> 00:00:05.093] B SPEAKER_03 3.895344 5.093469 -182.705531 191.615656
2 [ 00:00:07.607 --> 00:00:08.468] C SPEAKER_03 7.607844 8.468469 -179.330531 187.903156
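To turn the merged result into a per-speaker transcript, you can walk the segments and print each one under its speaker label (a sketch over the structures shown above; a segment may lack a speaker key if no diarization turn overlapped it):
for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")  # fall back when no speaker was assigned
    print(f'{speaker} [{seg["start"]:.2f} --> {seg["end"]:.2f}]: {seg["text"].strip()}')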