Language Model Fusion

Introduction

In the previous articles, we learned that the CTC loss makes a conditional-independence assumption. This can be improved by introducing a language model (LM): the ASR decoder can then make use of the acoustic information as well as the LM. How the two are fused is what we discuss in this article.

LM Fusion

There are several ways to fuse an LM with the ASR model:

  • Shallow fusion
  • Deep fusion [1]
  • Cold fusion [2]

The three approaches differ along two important criteria: 1) early/late model integration: at what point in the ASR model's computation should the LM be integrated? 2) early/late training integration: at what point in the ASR model's training should the LM be integrated [3]?

Fusion mode      Model integration   Training integration
Shallow fusion   late                late
Deep fusion      early               late
Cold fusion      early               early
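
As a concrete illustration of the late-integration case, shallow fusion leaves both models untouched and simply interpolates their log-probabilities at each step of beam search. Below is a minimal sketch in PyTorch-style Python; the tensor shapes and the decoding loop it would live in are assumptions, not the API of any particular toolkit.

    import torch

    def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3):
        # asr_log_probs: (beam, vocab) log P_S2S(y_t | y_<t, x) from the ASR decoder
        # lm_log_probs:  (beam, vocab) log P_LM(y_t | y_<t) from the external LM
        # Shallow fusion ranks beam-search hypotheses by the interpolated score
        #   log P_S2S + lm_weight * log P_LM
        return asr_log_probs + lm_weight * lm_log_probs

    # toy usage with random scores for a beam of 4 over a vocabulary of 1000
    fused = shallow_fusion_step(torch.randn(4, 1000).log_softmax(-1),
                                torch.randn(4, 1000).log_softmax(-1))

For the early-integration case, cold fusion instead feeds a gated projection of the frozen LM's output into the decoder before the softmax. The module below loosely follows the mechanism described in [2]; the layer sizes and names are illustrative.

    import torch
    import torch.nn as nn

    class ColdFusionLayer(nn.Module):
        # Sketch of cold fusion: the decoder state s_t is combined with a
        # projection of the frozen LM's logits through a fine-grained gate,
        # and the fused vector feeds the output projection.
        def __init__(self, dec_dim, vocab_size, lm_proj_dim=256):
            super().__init__()
            self.lm_proj = nn.Linear(vocab_size, lm_proj_dim)           # h_t^LM
            self.gate = nn.Linear(dec_dim + lm_proj_dim, lm_proj_dim)   # g_t
            self.out = nn.Linear(dec_dim + lm_proj_dim, vocab_size)

        def forward(self, dec_state, lm_logits):
            h_lm = self.lm_proj(lm_logits.detach())   # the LM is kept fixed
            g = torch.sigmoid(self.gate(torch.cat([dec_state, h_lm], dim=-1)))
            fused = torch.cat([dec_state, g * h_lm], dim=-1)
            return self.out(fused)                    # logits for the output softmax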

LM Distillation

In shallow fusion, the LM plays the role of a prior while the acoustic model supplies the likelihood, so the decoding criterion has a clean Bayesian interpretation. In deep fusion and cold fusion, the fused score is just the softmax output of a neural network, which is harder to interpret probabilistically. The paper [4] therefore proposes an LM distillation method, learn spelling from teachers (LST).
Besides the 1-of-K hard labels provided by the transcriptions, the RNNLM provides soft labels, which carry the knowledge of the text corpus. The loss function is the Kullback-Leibler divergence (KLD) between the probability estimated by the RNNLM, P_\textrm{LM}, and the probability estimated by the Seq2Seq model, P_\textrm{S2S}. Because P_\textrm{LM} is fixed while the Seq2Seq model is trained, minimizing the KLD is equivalent to minimizing the cross-entropy
L_\textrm{LST}(\theta) = -\sum_{k=1}^K P_\textrm{LM}^k \log P_\textrm{S2S}^k
The overall training loss interpolates the two terms:
L(\theta) = \lambda L_\textrm{CE}(\theta) + (1-\lambda)L_\textrm{LST}(\theta)
where L_\textrm{CE}(\theta) is the usual cross-entropy loss against the ground-truth transcription.
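
A minimal sketch of this combined loss in PyTorch-style Python is shown below; it assumes the Seq2Seq decoder (student) and a frozen RNNLM (teacher) produce per-token logits over the same vocabulary, and the tensor names and shapes are illustrative rather than taken from the paper's code.

    import torch
    import torch.nn.functional as F

    def lst_loss(s2s_logits, lm_logits, targets, lam=0.8):
        # s2s_logits: (N, vocab) logits from the Seq2Seq (student) decoder
        # lm_logits:  (N, vocab) logits from the frozen RNNLM (teacher)
        # targets:    (N,) ground-truth token ids from the transcription
        log_p_s2s = F.log_softmax(s2s_logits, dim=-1)
        with torch.no_grad():                       # P_LM is fixed during training
            p_lm = F.softmax(lm_logits, dim=-1)     # soft labels from the teacher
        # L_LST = -sum_k P_LM^k log P_S2S^k  (cross-entropy against soft labels)
        l_lst = -(p_lm * log_p_s2s).sum(dim=-1).mean()
        # L_CE: the usual cross-entropy against the 1-of-K hard labels
        l_ce = F.cross_entropy(s2s_logits, targets)
        return lam * l_ce + (1.0 - lam) * l_lst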

References


  1. Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H.-C., Bougares, F., Schwenk, H., & Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. http://arxiv.org/abs/1503.03535

  2. Sriram, A., Jun, H., Satheesh, S., & Coates, A. (2018). Cold Fusion: Training Seq2Seq Models Together with Language Models. Proc. Interspeech 2018, 387–391. https://doi.org/10.21437/Interspeech.2018-1392

  3. Toshniwal, S., Kannan, A., Chiu, C., Wu, Y., Sainath, T. N., & Livescu, K. (2018). A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition. 2018 IEEE Spoken Language Technology Workshop (SLT), 369–375.

  4. Bai, Y., Yi, J., Tao, J., Tian, Z., & Wen, Z. (2019). Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition. Proc. Interspeech 2019, 3795–3799. https://doi.org/10.21437/Interspeech.2019-1554
