Introduction
In the previous articles, we learned that the CTC loss makes a conditional-independence assumption between output tokens. This can be mitigated by introducing a language model (LM), so that the ASR decoder exploits the acoustic information as well as the LM. How the two are fused is the topic of this article.
LM Fusion
There are three common approaches to LM fusion: shallow fusion[1], deep fusion[1], and cold fusion[2]. They differ along two important criteria: 1) early/late model integration: at what point in the ASR model's computation should the LM be integrated? 2) early/late training integration: at what point in the ASR model's training should the LM be integrated? [3]
| Fusion mode | Model integration | Training integration |
|---|---|---|
| Shallow fusion | late | late |
| Deep fusion | early | late |
| Cold fusion | early | early |
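For intuition, here is a minimal sketch of the late/late case (shallow fusion) at decoding time, assuming hypothetical arrays of per-token log probabilities from each model; the weight `lam` is a tunable hyperparameter chosen for illustration, not a value from the cited papers.

```python
import numpy as np

def shallow_fusion_step(asr_log_probs, lm_log_probs, lam=0.3):
    """One decoding step of shallow fusion (late/late integration).

    asr_log_probs: (V,) log P_ASR(y_t | x, y_<t) from the acoustic model
    lm_log_probs:  (V,) log P_LM(y_t | y_<t) from the external LM
    lam:           LM weight (hypothetical value; tuned on dev data)
    """
    # Log-linear interpolation: the two models are only combined at
    # decode time, so neither network is retrained.
    return asr_log_probs + lam * lm_log_probs

# Pick the next token greedily from the fused scores
# (a beam search would keep several hypotheses in practice).
asr = np.log(np.array([0.6, 0.3, 0.1]))  # toy 3-word vocabulary
lm = np.log(np.array([0.2, 0.7, 0.1]))
next_token = int(np.argmax(shallow_fusion_step(asr, lm)))
```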
LM Distillation
In shallow fusion, the LM plays the role of a prior while the acoustic model supplies the likelihood, so decoding takes the form of a Bayesian criterion: beam search maximizes $\log P_{\mathrm{S2S}}(y \mid \mathbf{x}) + \lambda \log P_{\mathrm{LM}}(y)$. In deep fusion and cold fusion, the output distribution is instead the softmax of a fused network state, which no longer has such a clean probabilistic interpretation. The paper [4] therefore proposes an LM distillation method.
Besides the 1-of-K hard labels provided by the transcriptions, the RNNLM provides soft labels, which carry the knowledge of the text corpus. The loss function is the Kullback-Leibler divergence (KLD) between the probability $P_{\mathrm{LM}}(y_t \mid y_{<t})$ estimated by the RNNLM and the probability $P_{\mathrm{S2S}}(y_t \mid \mathbf{x}, y_{<t})$ estimated by the Seq2Seq model. Because $P_{\mathrm{LM}}$ is fixed while training the Seq2Seq model, the loss function is equivalent to the cross-entropy form:

$$\mathcal{L}_{\mathrm{KLD}} = -\sum_{t}\sum_{k} P_{\mathrm{LM}}(k \mid y_{<t}) \log P_{\mathrm{S2S}}(k \mid \mathbf{x}, y_{<t})$$
The total training loss interpolates the hard-label and soft-label terms:

$$\mathcal{L} = (1-\lambda)\,\mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{KLD}}$$

where $\mathcal{L}_{\mathrm{CE}}$ is the training loss of the acoustic sequence (cross entropy against the hard labels) and $\lambda \in [0, 1]$ is an interpolation weight.
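As a concrete illustration, here is a minimal PyTorch sketch of this combined loss, assuming the teacher RNNLM's soft labels `lm_probs` have already been computed and detached; the shapes, the function name, and the default weight `lam` are illustrative assumptions, not values taken from [4].

```python
import torch
import torch.nn.functional as F

def distillation_loss(s2s_logits, lm_probs, hard_labels, lam=0.5):
    """Interpolate hard-label CE with soft-label (teacher) CE.

    s2s_logits:  (T, V) decoder logits of the Seq2Seq student
    lm_probs:    (T, V) soft labels from the frozen RNNLM teacher
    hard_labels: (T,)   1-of-K transcription labels
    lam:         interpolation weight (hypothetical default)
    """
    log_p_s2s = F.log_softmax(s2s_logits, dim=-1)
    # The teacher is fixed, so the KLD term reduces to cross entropy
    # against the teacher's soft labels.
    soft_ce = -(lm_probs * log_p_s2s).sum(dim=-1).mean()
    hard_ce = F.cross_entropy(s2s_logits, hard_labels)
    return (1 - lam) * hard_ce + lam * soft_ce
```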
References
1. Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H.-C., Bougares, F., Schwenk, H., & Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. http://arxiv.org/abs/1503.03535
2. Sriram, A., Jun, H., Satheesh, S., & Coates, A. (2018). Cold Fusion: Training Seq2Seq Models Together with Language Models. Proc. Interspeech 2018, 387–391. https://doi.org/10.21437/Interspeech.2018-1392
3. Toshniwal, S., Kannan, A., Chiu, C.-C., Wu, Y., Sainath, T. N., & Livescu, K. (2018). A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition. 2018 IEEE Spoken Language Technology Workshop (SLT), 369–375.
4. Bai, Y., Yi, J., Tao, J., Tian, Z., & Wen, Z. (2019). Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition. Proc. Interspeech 2019, 3795–3799. https://doi.org/10.21437/Interspeech.2019-1554