Introduction
In the previous articles, we learned that the CTC loss makes a conditional-independence assumption between output tokens. This can be mitigated by introducing a language model (LM), so that the ASR decoder exploits the acoustic information as well as the LM. How the two are fused is the topic of this article.
LM Fusion
There are three common approaches to LM fusion: shallow fusion[1], deep fusion[1], and cold fusion[2]. They differ along two important criteria: 1) early/late model integration: at what point in the ASR model's computation should the LM be integrated? 2) early/late training integration: at what point in the ASR model's training should the LM be integrated? [3]
| Fusion mode | Model integration | Training integration |
|---|---|---|
| Shallow fusion | late | late |
| Deep fusion | early | late |
| Cold fusion | early | early |
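For intuition, here is a minimal sketch of the late/late case (shallow fusion) at decoding time, assuming hypothetical arrays of per-token log probabilities from each model; the weight `lam` is a tunable hyperparameter chosen for illustration, not a value from the cited papers.

```python
import numpy as np

def shallow_fusion_step(asr_log_probs, lm_log_probs, lam=0.3):
    """One decoding step of shallow fusion (late/late integration).

    asr_log_probs: (V,) log P_ASR(y_t | x, y_<t) from the acoustic model
    lm_log_probs:  (V,) log P_LM(y_t | y_<t) from the external LM
    lam:           LM weight (hypothetical value; tuned on dev data)
    """
    # Log-linear interpolation: the two models are only combined at
    # decode time, so neither network is retrained.
    return asr_log_probs + lam * lm_log_probs

# Pick the next token greedily from the fused scores
# (a beam search would keep several hypotheses in practice).
asr = np.log(np.array([0.6, 0.3, 0.1]))  # toy 3-word vocabulary
lm = np.log(np.array([0.2, 0.7, 0.1]))
next_token = int(np.argmax(shallow_fusion_step(asr, lm)))
```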
LM Distillation
In shallow fusion, the LM plays the role of a prior while the acoustic model supplies the likelihood, so decoding takes the form of a Bayesian criterion: beam search maximizes $\log P_{\mathrm{S2S}}(y \mid \mathbf{x}) + \lambda \log P_{\mathrm{LM}}(y)$. In deep fusion and cold fusion, the output distribution is instead the softmax of a fused network state, which no longer has such a clean probabilistic interpretation. The paper [4] therefore proposes an LM distillation method.
Besides the 1-of-K hard labels provided by the transcriptions, the RNNLM provides soft labels, which carry the knowledge of the text corpus. The loss function is the Kullback-Leibler divergence (KLD) between the probability $P_{\mathrm{LM}}(y_t \mid y_{<t})$ estimated by the RNNLM and the probability $P_{\mathrm{S2S}}(y_t \mid \mathbf{x}, y_{<t})$ estimated by the Seq2Seq model. Because $P_{\mathrm{LM}}$ is fixed while training the Seq2Seq model, the loss function is equivalent to the cross-entropy form:

$$\mathcal{L}_{\mathrm{KLD}} = -\sum_{t}\sum_{k} P_{\mathrm{LM}}(k \mid y_{<t}) \log P_{\mathrm{S2S}}(k \mid \mathbf{x}, y_{<t})$$
The total training loss interpolates the hard-label and soft-label terms:

$$\mathcal{L} = (1-\lambda)\,\mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{KLD}}$$

where $\mathcal{L}_{\mathrm{CE}}$ is the training loss of the acoustic sequence (cross entropy against the hard labels) and $\lambda \in [0, 1]$ is an interpolation weight.
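As a concrete illustration, here is a minimal PyTorch sketch of this combined loss, assuming the teacher RNNLM's soft labels `lm_probs` have already been computed and detached; the shapes, the function name, and the default weight `lam` are illustrative assumptions, not values taken from [4].

```python
import torch
import torch.nn.functional as F

def distillation_loss(s2s_logits, lm_probs, hard_labels, lam=0.5):
    """Interpolate hard-label CE with soft-label (teacher) CE.

    s2s_logits:  (T, V) decoder logits of the Seq2Seq student
    lm_probs:    (T, V) soft labels from the frozen RNNLM teacher
    hard_labels: (T,)   1-of-K transcription labels
    lam:         interpolation weight (hypothetical default)
    """
    log_p_s2s = F.log_softmax(s2s_logits, dim=-1)
    # The teacher is fixed, so the KLD term reduces to cross entropy
    # against the teacher's soft labels.
    soft_ce = -(lm_probs * log_p_s2s).sum(dim=-1).mean()
    hard_ce = F.cross_entropy(s2s_logits, hard_labels)
    return (1 - lam) * hard_ce + lam * soft_ce
```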
References
1. Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H.-C., Bougares, F., Schwenk, H., & Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. http://arxiv.org/abs/1503.03535
2. Sriram, A., Jun, H., Satheesh, S., & Coates, A. (2018). Cold Fusion: Training Seq2Seq Models Together with Language Models. Proc. Interspeech 2018, 387–391. https://doi.org/10.21437/Interspeech.2018-1392
3. Toshniwal, S., Kannan, A., Chiu, C.-C., Wu, Y., Sainath, T. N., & Livescu, K. (2018). A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition. 2018 IEEE Spoken Language Technology Workshop (SLT), 369–375.
4. Bai, Y., Yi, J., Tao, J., Tian, Z., & Wen, Z. (2019). Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition. Proc. Interspeech 2019, 3795–3799. https://doi.org/10.21437/Interspeech.2019-1554