Building a Transformer-Based End-to-End Automatic Speech Recognition System

End-to-end speech recognition in PyTorch
A Transformer-based speech recognition model
Open-source repository
https://github.com/gentaiscool/end2end-asr-pytorch

Introduction

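Since hybrid modeling based on deep neural networks (DNNs) was adopted about a decade ago, the accuracy of automatic speech recognition (ASR) has improved significantly. That breakthrough came mainly from replacing the traditional Gaussian mixture model with a DNN for acoustic likelihood evaluation, while keeping all the other modules, such as the acoustic model, the language model, and the lexicon, which together form a hybrid ASR system. More recently, the speech community has made a further breakthrough by moving from hybrid modeling to end-to-end (E2E) modeling, which uses a single network to convert an input speech sequence directly into an output token sequence. This change is more revolutionary, because it overturns the modular design that traditional ASR systems have relied on for decades.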

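End-to-end models have several main advantages over traditional hybrid models:

First, an end-to-end model optimizes the whole network with a single objective function that is consistent with the ASR objective, whereas a traditional hybrid model optimizes each module separately and cannot guarantee a global optimum. End-to-end models have also been reported to outperform traditional hybrid models in both academia and industry.
Second, because an end-to-end model outputs characters or even words directly, it greatly simplifies the speech recognition pipeline. By contrast, a traditional hybrid model is complex to design and requires a great deal of ASR expert knowledge.
Finally, because an end-to-end model uses a single network, it is more compact than a traditional hybrid model, so it can be deployed on-device with high accuracy and low latency.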

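With the progress of deep neural networks and the support of hardware compute, networks based on RNNs, deep CNNs (DCNNs), attention, and transformers have gradually been applied to speech recognition, with good results.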

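The model used here is based on a low-rank transformer (LRT), which projects the attention keys and values to a lower-dimensional representation. This reduces the transformer's memory footprint and improves its computational efficiency. The method removes more than 50% of the network's parameters and runs about 1.35x faster than the baseline transformer model. The experiments also show that the LRT model performs better on the test set, and it is reported to perform better on several existing datasets without using an external language model or additional acoustic data.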
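To make the parameter saving concrete, here is a generic sketch of how routing a projection through a lower-dimensional bottleneck cuts the number of weights. It is not the repository's code: the helper names are hypothetical, 512 is taken from --dim-model in the training command, and the rank of 64 is an assumption chosen only for illustration.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a full d_in x d_out projection (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights when the projection is factorized as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)


# Illustrative sizes only: 512 matches --dim-model, rank 64 is an assumption.
full = dense_params(512, 512)
factored = low_rank_params(512, 512, rank=64)
saving = 1 - factored / full
print(full, factored, f"{saving:.0%}")
```

The saving for a single layer depends entirely on the chosen rank, so this number should not be read as the model-wide reduction quoted above.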


Deployment

We deploy with Docker first; later the setup can be moved to the local machine or to a device with audio input.
Dockerfile

FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-devel
RUN apt-get update && apt-get install -y libsndfile1
RUN pip install -i https://pypi.douban.com/simple torchaudio  tqdm python-Levenshtein librosa wget
RUN pip install -i https://pypi.douban.com/simple SoundFile numpy==1.19 numba==0.48.0 librosa==0.6.0

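Run the container (first build the image from the Dockerfile above and tag it asr:v2, the name used in the command below)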

docker run --gpus=all  -itd --name asr  --shm-size 12G -v /media/nizhengqi/sdf/wyh/end2end-asr-pytorch:/workspace  asr:v2

Data processing

Download the Chinese dataset (AISHELL-1) from https://www.openslr.org/33/

Dataset layout

Create an Aishell_dataset folder under the working directory end2end-asr-pytorch.
It contains:

transcript: the raw data

transcript_clean, transcript_clean_lang: the processed data

The manifest files that record the train/dev/test split
are stored in end2end-asr-pytorch/manifests:

aishell_dev_lang_manifest.csv aishell_test_lang_manifest.csv aishell_train_lang_manifest.csv

aishell_dev_manifest.csv aishell_test_manifest.csv aishell_train_manifest.csv
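Once preprocessing has run, a quick check of this layout can save a failed training start. The sketch below only tests that the folders and manifest files named above exist. The helper name missing_paths is hypothetical, and it assumes the script is run from the end2end-asr-pytorch working directory.

```python
import os

EXPECTED_DIRS = [
    "Aishell_dataset/transcript",
    "Aishell_dataset/transcript_clean",
    "Aishell_dataset/transcript_clean_lang",
    "manifests",
]
EXPECTED_MANIFESTS = [
    f"manifests/aishell_{split}{suffix}_manifest.csv"
    for suffix in ("", "_lang")
    for split in ("train", "dev", "test")
]


def missing_paths(root: str) -> list:
    """Return the expected directories and manifest files that are absent under root."""
    missing = [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]
    missing += [f for f in EXPECTED_MANIFESTS if not os.path.isfile(os.path.join(root, f))]
    return missing


if __name__ == "__main__":
    # "." assumes the current directory is the end2end-asr-pytorch checkout.
    for path in missing_paths("."):
        print("missing:", path)
```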

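Modify BufferedIncrementalDecoder.decode in /opt/conda/lib/python3.7/codecs.py

This is only a temporary workaround that swallows decoding errors; the dataset contains a fair amount of dirty data.

    def decode(self, input, final=False):
        # decode input (taking the buffer into account)
        try:
            data = self.buffer + input
            (result, consumed) = self._buffer_decode(data, self.errors, final)
            # keep undecoded input until the next call
            self.buffer = data[consumed:]
        except UnicodeDecodeError:
            # temporary workaround: return a sentinel string instead of raising
            result = "012345"
        return result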
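Patching the standard library affects every program in the container, so it may be worth considering an alternative that keeps the workaround inside the preprocessing script. The following is a minimal sketch, not code from the repository: the helper read_text_lines is hypothetical, and it assumes the transcripts are meant to be UTF-8, so any line that fails to decode is treated as dirty and skipped.

```python
def read_text_lines(path: str, encoding: str = "utf-8") -> list:
    """Read a text file line by line, skipping lines that are not valid in `encoding`."""
    lines = []
    with open(path, "rb") as handle:  # read raw bytes so decoding stays under our control
        for raw in handle:
            try:
                lines.append(raw.decode(encoding).rstrip("\r\n"))
            except UnicodeDecodeError:
                continue  # dirty line: skip it instead of patching codecs.py
    return lines
```

With this approach only the undecodable lines are dropped, whereas the sentinel returned by the patched decoder replaces a whole chunk of input.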

Modify the processing code data/aishell.py

Skip the sentinel wherever the script reads a transcript file (the same check is applied at each read):

with open(text_file_path, "r", encoding="utf-8") as text_file:
    for line in text_file.readlines():
        # "012345" is the sentinel returned by the patched decode() for undecodable input
        if line == "012345":
            continue
        print(line)

Fix the file name errors

with open("manifests/aishell_train_manifest.csv", "w+") as train_manifest:
    for i in range(len(tr_file_list)):
        wav_filename = tr_file_list[i]
        text_filename = tr_file_list[i].replace(".wav", "").replace("transcript", "transcript_clean")  # changed
        print(text_filename)

with open("manifests/aishell_dev_manifest.csv", "w+") as valid_manifest:
    for i in range(len(dev_file_list)):
        wav_filename = dev_file_list[i]
        text_filename = dev_file_list[i].replace(".wav", "").replace("transcript", "transcript_clean")

with open("manifests/aishell_test_manifest.csv", "w+") as test_manifest:
    for i in range(len(test_file_list)):
        wav_filename = test_file_list[i]
        text_filename = test_file_list[i].replace(".wav", "").replace("transcript", "transcript_clean")

with open("manifests/aishell_train_lang_manifest.csv", "w+") as train_manifest:
    for i in range(len(tr_file_list)):
        wav_filename = tr_file_list[i]
        text_filename = tr_file_list[i].replace(".wav", "").replace("transcript", "transcript_clean_lang")

with open("manifests/aishell_dev_lang_manifest.csv", "w+") as valid_manifest:
    for i in range(len(dev_file_list)):
        wav_filename = dev_file_list[i]
        text_filename = dev_file_list[i].replace(".wav", "").replace("transcript", "transcript_clean_lang")

with open("manifests/aishell_test_lang_manifest.csv", "w+") as test_manifest:
    for i in range(len(test_file_list)):
        wav_filename = test_file_list[i]
        text_filename = test_file_list[i].replace(".wav", "").replace("transcript", "transcript_clean_lang")
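The effect of the replace() chain is easier to see on a single path. The sketch below wraps the same two calls in a hypothetical helper, transcript_path. The example wav path is made up for illustration and may not match the real directory names on disk.

```python
def transcript_path(wav_path: str, clean_dir: str = "transcript_clean") -> str:
    """Mirror the replace() chain used in data/aishell.py for one wav path."""
    return wav_path.replace(".wav", "").replace("transcript", clean_dir)


# Hypothetical path, used only to show the mapping.
example_wav = "Aishell_dataset/transcript/train/S0002/BAC009S0002W0122.wav"
print(transcript_path(example_wav))
print(transcript_path(example_wav, "transcript_clean_lang"))
```

Note that str.replace rewrites every occurrence, so a path that contains "transcript" more than once would be changed in each place.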

Note the path of the labels file and adjust it to your own setup

with open("data/labels/aishell_labels.json", "w+") as labels_json:

Modify utils/audio.py

def load_audio(path):
    # newer torchaudio uses normalize=True; the original code passed normalization=True
    sound, _ = torchaudio.load(path, normalize=True)
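If the same code has to run against both older and newer torchaudio releases, one option is to try the new keyword and fall back to the old one. This is only a sketch: load_normalized is a hypothetical helper, and the loader is passed in as an argument so that the example does not depend on torchaudio being installed.

```python
def load_normalized(load_fn, path):
    """Call a torchaudio-style loader with whichever normalization keyword it accepts."""
    try:
        return load_fn(path, normalize=True)       # keyword used in the modified line above
    except TypeError:
        return load_fn(path, normalization=True)   # keyword from the commented-out original
```

Inside load_audio this would be called as sound, _ = load_normalized(torchaudio.load, path). Be aware that a TypeError raised for any other reason also triggers the fallback, so pinning one torchaudio version is the more predictable choice.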

Run data/aishell.py

Training

python train.py --train-manifest-list manifests/aishell_train_manifest.csv --valid-manifest-list manifests/aishell_dev_manifest.csv --test-manifest-list manifests/aishell_test_manifest.csv --cuda --batch-size 12 --labels-path data/labels/aishell_labels.json --lr 1e-4 --name aishell_drop0.1_cnn_batch12_4_vgg_layer4 --save-folder save/ --save-every 5 --feat_extractor vgg_cnn --dropout 0.1 --num-layers 4 --num-heads 8 --dim-model 512 --dim-key 64 --dim-value 64 --dim-input 161 --dim-inner 2048 --dim-emb 512 --shuffle --min-lr 1e-6 --k-lr 1
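Two of the numbers in this command are linked to other settings, and a short calculation shows how. The value 161 for --dim-input is what a one-sided spectrogram yields for 16 kHz audio with a 20 ms window, and 512 / 8 = 64 matches --dim-key and --dim-value. The 20 ms window is an assumption about the feature extraction, not something stated above, and the helper names are hypothetical.

```python
def spectrogram_bins(sample_rate: int, window_seconds: float) -> int:
    """Frequency bins of a one-sided STFT whose FFT size equals the window length."""
    n_fft = round(sample_rate * window_seconds)
    return n_fft // 2 + 1


def per_head_dim(dim_model: int, num_heads: int) -> int:
    """Width each attention head gets when the model width is split evenly."""
    if dim_model % num_heads != 0:
        raise ValueError("dim_model must be divisible by num_heads")
    return dim_model // num_heads


# 16 kHz is the AISHELL-1 sample rate; the 20 ms window is an assumption.
print(spectrogram_bins(16000, 0.02))
print(per_head_dim(512, 8))
```

If you change the window size or the number of heads, these dependent flags most likely need to change with them.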

Testing

python test.py --test-manifest-list manifests/aishell_test_manifest.csv --cuda --continue_from save/model
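For Chinese ASR the usual figure of merit is the character error rate (CER). Whether test.py reports exactly this quantity is an assumption. The sketch below shows the standard definition in pure Python so that the numbers can be checked by hand. The python-Levenshtein package installed in the Dockerfile computes the same edit distance much faster.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, a six-character reference recognized with one substituted and one dropped character gives a CER of 2/6.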

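The next steps are to collect and process our own data, continue training, and tune the hyperparameters, which is the more laborious part of the work.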
