Background
Automatic Speech Recognition (ASR) uses both an acoustic model (AM) and a language model (LM) to produce the transcript of an input audio signal.
Acoustic Model
Assigns a probability distribution over a vocabulary of characters given an audio frame.
Language Model
- Task 1: Given a sequence of words, it assigns a probability distribution over the next word.
- Task 2: Given a sequence of words, it assigns a probability to the whole sequence.
Given a corpus of tokens $\mathbf{x} = (x_1, \dots, x_T)$, the task of language modeling is to estimate the joint probability $P(\mathbf{x})$ (Task 2). $P(\mathbf{x})$ can be auto-regressively factorized as $P(\mathbf{x}) = \prod_{t} P(x_t \mid \mathbf{x}_{<t})$ by the chain rule. With this factorization, the problem reduces to estimating each conditional factor $P(x_t \mid \mathbf{x}_{<t})$ (Task 1).
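As a small worked example of this factorization (the conditional probabilities below are made-up numbers, not from any real model), the joint probability of a short sequence is the product of its conditional factors:

```python
import math

# Hypothetical conditional probabilities P(x_t | x_<t) for the sequence
# "the cat sat"; the numbers are illustrative only.
conditionals = [
    ("the", 0.08),    # P("the" | <s>)
    ("cat", 0.002),   # P("cat" | "the")
    ("sat", 0.01),    # P("sat" | "the", "cat")
]

# Chain rule: P(x) = prod_t P(x_t | x_<t); summing logs avoids underflow.
log_p = sum(math.log(p) for _, p in conditionals)
print(f"P(x) = {math.exp(log_p):.2e}")   # 1.60e-06
```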
Structure
A proposed structure[1] is shown below. An n-gram model (such as KenLM[2]) is used for decoding and produces the ASR output, while a neural network language model (NN LM) complements it. Beam search during hypothesis formation is governed by the joint score of the acoustic model and KenLM, and the resulting beams are additionally rescored with the NN LM in a single forward pass. Using the NN LM instead of KenLM all the way through would be too expensive, since its inference is much slower.
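A minimal sketch of this two-pass scheme, assuming toy stand-in scorers (the function names `acoustic_score`, `kenlm_score`, `nnlm_score` and the weights `alpha`, `beta` are illustrative placeholders, not the DeepSpeech or reference-[1] API):

```python
# Toy stand-ins for the three scorers; all return log-probability-like values.
def acoustic_score(text):   # score from the acoustic model
    return -0.10 * len(text)

def kenlm_score(text):      # score from the n-gram LM
    return -0.20 * len(text.split())

def nnlm_score(text):       # score from the neural LM
    return -0.15 * len(text.split())

def decode(candidates, alpha=0.5, beta=0.4, beam_width=2):
    def joint(text):        # score that governs hypothesis formation
        return acoustic_score(text) + alpha * kenlm_score(text)

    # First pass: keep the beam_width best hypotheses under the joint score.
    beams = sorted(candidates, key=joint, reverse=True)[:beam_width]
    # Second pass: rescore only the surviving beams with the NN LM
    # (a single forward pass per hypothesis) and pick the best.
    return max(beams, key=lambda t: joint(t) + beta * nnlm_score(t))

print(decode(["the cat sat", "the cat sad", "a cad sat"]))
```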
Language Model
n-gram
The n-gram model is the most widely used type of language model.
Definition: an n-gram model is a probability distribution based on the $(n-1)$th-order Markov assumption:
$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$
Maximum Likelihood Estimation
Let $c(w_{i-n+1} \cdots w_i)$ be the count of the sequence $w_{i-n+1} \cdots w_i$ in the corpus; the ML estimate for $P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$ is:
$$\hat{P}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{c(w_{i-n+1} \cdots w_i)}{c(w_{i-n+1} \cdots w_{i-1})}$$
The problem: $c(w_{i-n+1} \cdots w_i) = 0$ makes $\hat{P}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = 0$.
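A minimal sketch of the ML estimate for a bigram model on a made-up toy corpus, also showing the zero-count problem:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()   # tiny made-up corpus

unigrams = Counter(corpus)                   # c(w1)
bigrams = Counter(zip(corpus, corpus[1:]))   # c(w1 w2)

def p_ml(w2, w1):
    """ML estimate: P(w2 | w1) = c(w1 w2) / c(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_ml("cat", "the"))   # 2/3: "the cat" occurs twice, "the" occurs three times
print(p_ml("dog", "the"))   # 0.0: the unseen bigram gets zero probability (sparsity)
```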
N-Gram Model Problems
- Sparsity: sequences not found in the sample are assigned probability zero, which leads to speech recognition errors.
Solution:
- Smoothing: adjusting the ML estimates to reserve probability mass for unseen events. A typical form is interpolation of n-gram models, e.g., mixing trigram, bigram, and unigram frequencies (a minimal sketch follows this list of problems and solutions).
Some widely used techniques:
- Katz Back-off models (Katz, 1987).
- Interpolated models (Jelinek and Mercer, 1980).
- Kneser-Ney models (Kneser and Ney, 1995).
- Class-based models: create models based on classes (e.g., DAY) or phrases.
- Representation: for a vocabulary of size $|V|$, the number of possible bigrams is $|V|^2$ and the number of possible trigrams is $|V|^3$!
Solution:
- Weighted automata: exploiting sparsity.
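As referenced in the smoothing item above, here is a minimal sketch of interpolated (Jelinek-Mercer style) smoothing that mixes bigram and unigram ML estimates; the fixed weight and toy corpus are illustrative only, since real systems tune the weights on held-out data:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_interp(w2, w1, lam=0.7):
    """P(w2 | w1) = lam * P_ML(w2 | w1) + (1 - lam) * P_ML(w2)."""
    p_bigram = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    p_unigram = unigrams[w2] / total
    return lam * p_bigram + (1 - lam) * p_unigram

print(p_interp("cat", "the"))   # ~0.53: boosted by the observed bigram
print(p_interp("mat", "cat"))   # ~0.03: unseen bigram, still nonzero via the unigram term
```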
In DeepSpeech, an n-gram model called KenLM[2] is used. KenLM uses a hash table with linear probing. Linear probing places at most one entry in each bucket; when a collision occurs, the entry to be inserted goes into the next (higher-index) empty bucket, wrapping around as necessary.
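Below is a minimal sketch of a hash table with linear probing, in the spirit of (but not taken from) KenLM's probing structure; the real implementation is in C++ and stores n-gram probabilities and backoff weights, and resizing/load-factor handling is omitted here:

```python
class ProbingHashTable:
    """Open addressing with linear probing: at most one entry per bucket.
    Resizing / load-factor handling is omitted for brevity."""

    def __init__(self, num_buckets=8):
        self.buckets = [None] * num_buckets   # each bucket holds (key, value) or None

    def insert(self, key, value):
        i = hash(key) % len(self.buckets)
        # On a collision, walk to the next (higher-index) bucket, wrapping
        # around to index 0 when falling off the end.
        while self.buckets[i] is not None and self.buckets[i][0] != key:
            i = (i + 1) % len(self.buckets)
        self.buckets[i] = (key, value)

    def lookup(self, key):
        i = hash(key) % len(self.buckets)
        while self.buckets[i] is not None:
            if self.buckets[i][0] == key:
                return self.buckets[i][1]
            i = (i + 1) % len(self.buckets)
        return None   # reached an empty bucket: the key is absent

table = ProbingHashTable()
table.insert(("the", "cat"), -1.2)    # e.g. a bigram's log10 probability
print(table.lookup(("the", "cat")))   # -1.2
```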
Neural Network
A trainable neural network is used to encode the context into a fixed size hidden state, which is multiplied with the word embeddings to obtain the logits. The logits are then fed into the Softmax function, yielding a categorical probability distribution over the next token. The central problem is how to train a Transformer to effectively encode an arbitrarily long context into a fixed size representation.
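A minimal numpy sketch of the last step described above, assuming the context has already been encoded into a fixed-size hidden state `h`; the dimensions and the tied embedding matrix `E` are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 512

E = rng.normal(size=(vocab_size, d_model))  # word embedding matrix (tied output weights)
h = rng.normal(size=(d_model,))             # fixed-size hidden state encoding the context

logits = E @ h                              # one logit per vocabulary entry
probs = np.exp(logits - logits.max())       # numerically stable softmax
probs /= probs.sum()

print(probs.shape, probs.sum())             # (1000,) and a sum of ~1.0: a distribution over the next token
```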
Vanilla Model
One feasible but crude approximation is to split the entire corpus into shorter segments of manageable size, and only train the model within each segment, ignoring all contextual information from previous segments[3].
Masked Attention: To ensure that the model’s predictions are only conditioned on past characters, we mask our attention layers with a causal attention, so each position can only attend leftward.
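A minimal numpy sketch of such a causal mask applied to raw attention scores, assuming a score matrix of shape (T, T) whose entry (i, j) says how much position i attends to position j:

```python
import numpy as np

T = 5
scores = np.random.default_rng(0).normal(size=(T, T))   # raw attention scores

# Causal mask: position i may only attend to positions j <= i (leftward).
mask = np.triu(np.ones((T, T), dtype=bool), k=1)         # True strictly above the diagonal
scores = np.where(mask, -np.inf, scores)

# Row-wise softmax; masked (future) positions receive exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                              # upper triangle is all zeros
```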
Auxiliary Losses: Training with additional auxiliary losses not only speeds up convergence but also serves as an extra regularizer. The auxiliary losses are active only during training, so a number of the network parameters are used only at training time ("training parameters", as distinguished from "inference parameters"). There are three types of auxiliary losses, corresponding to intermediate positions, intermediate layers, and non-adjacent targets.
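A heavily simplified sketch of the intermediate-layer flavour of these auxiliary losses; the per-layer prediction heads, the fixed 0.5 weight, and the toy cross-entropy below are illustrative assumptions rather than the exact recipe of [3]:

```python
import numpy as np

def cross_entropy(logits, target):
    """Toy cross-entropy of a single target index under softmax(logits)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[target])

rng = np.random.default_rng(0)
vocab, n_layers, target = 50, 4, 7

# Pretend each layer predicts the same next character through its own small
# output head; the intermediate heads exist only at training time
# ("training parameters").
layer_logits = [rng.normal(size=vocab) for _ in range(n_layers)]

main_loss = cross_entropy(layer_logits[-1], target)                 # final-layer loss
aux_losses = [cross_entropy(l, target) for l in layer_logits[:-1]]  # intermediate layers

# Intermediate-layer losses are down-weighted (and annealed away during
# training in [3]); a fixed 0.5 weight keeps this sketch simple.
total_loss = main_loss + 0.5 * sum(aux_losses)
print(total_loss)
```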
Positional Embeddings: The timing information may get lost during propagation through the layers. To address this, the timing signal is replaced with a learned per-layer positional embedding added to the input sequence before each transformer layer, giving a total of L×N×512 additional parameters (512-dimensional embedding vectors, L context positions, N layers).
During evaluation, at each step, the vanilla model also consumes a segment of the same length as in training, but only makes one prediction at the last position. Then, at the next step, the segment is shifted to the right by only one position, and the new segment has to be processed all from scratch. The evaluation procedure is extremely expensive.
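A back-of-the-envelope sketch of why this is expensive: to score T tokens with a context of length L, the vanilla evaluation re-encodes an almost identical length-L segment once per token (the accounting below counts processed positions only and ignores the quadratic cost of attention):

```python
L = 512       # segment (context) length used in training and evaluation
T = 100_000   # number of tokens to evaluate

# Vanilla evaluation: shift the window by one position per prediction and
# re-process the whole segment from scratch every time.
vanilla_positions = T * L

# If each token were processed only once (as during training), the count would be:
single_pass_positions = T

print(vanilla_positions // single_pass_positions)   # 512x more positions processed
```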
Transformer-XL
Transformer-XL (meaning extra long) is introduced to address the limitations of using a fixed-length context[4]. During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment. Although the gradient still remains within a segment, this additional input allows the network to exploit information in the history, leading to an ability of modeling longer-term dependency and avoiding context fragmentation.
Multi-layer Dependency: the recurrent dependency between the hidden states shifts one layer downwards per-segment, which differs from the same-layer recurrence in conventional RNN-LMs.
Faster Evaluation: during evaluation, the representations from the previous segments can be reused. 1,800+ times faster than the vanilla model.
Length-M Memory Extension: We can cache M (more than one) previous segments, and reuse all of them as the extra context when processing the current segment.
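A minimal numpy sketch of this caching scheme for a single layer; `layer` is a toy stand-in for a real Transformer-XL layer, and the stop-gradient step is only indicated in a comment because plain numpy has no autograd:

```python
import numpy as np

rng = np.random.default_rng(0)
d, seg_len, mem_len = 16, 4, 8    # hidden size, segment length, max cached memory length

def layer(h_extended, seg_len):
    """Toy stand-in for one Transformer-XL layer: keys/values would see the
    whole [memory; segment] sequence, queries only the current segment."""
    out = np.tanh(h_extended @ np.eye(d))   # placeholder transformation
    return out[-seg_len:]                   # hidden states of the current positions

memory = np.zeros((0, d))                   # cache starts empty
for step in range(3):                       # three consecutive segments
    segment = rng.normal(size=(seg_len, d))
    extended = np.concatenate([memory, segment], axis=0)   # [SG(memory); segment]
    h = layer(extended, seg_len)
    # Cache the newly computed hidden states for the next segment.  In a real
    # framework gradients are stopped here (SG(.)), and only the most recent
    # mem_len positions are kept (length-M memory).
    memory = np.concatenate([memory, h], axis=0)[-mem_len:]
    print(step, memory.shape)               # (4, 16), then (8, 16), then (8, 16)
```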
Relative Positional Encodings: how can we keep the positional information coherent when we reuse the states? Traditional method: the input to the Transformer is the element-wise addition of the word embeddings and the absolute positional encodings.
In this way, both $x_{\tau,j}$ and $x_{\tau+1,j}$ (position $j$ of two consecutive segments) are associated with the same positional encoding $U_j$, so the model cannot distinguish their positions when the cached states are reused.
In order to avoid this failure mode, the fundamental idea is to only encode the relative positional information in the hidden states. The positional encoding gives the model a temporal clue or “bias” about how information should be gathered, i.e., where to attend. Instead of incorporating bias statically into the initial embedding, one can inject the same information into the attention score of each layer.
In the standard Transformer with absolute positional encodings, the attention score between query position $i$ and key position $j$ decomposes as:
$$A^{\mathrm{abs}}_{i,j} = \underbrace{E_{x_i}^\top W_q^\top W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^\top W_q^\top W_k U_j}_{(b)} + \underbrace{U_i^\top W_q^\top W_k E_{x_j}}_{(c)} + \underbrace{U_i^\top W_q^\top W_k U_j}_{(d)}$$
The proposed relative positional encodings re-parameterize the four terms as:
$$A^{\mathrm{rel}}_{i,j} = \underbrace{E_{x_i}^\top W_q^\top W_{k,E} E_{x_j}}_{(a)} + \underbrace{E_{x_i}^\top W_q^\top W_{k,R} R_{i-j}}_{(b)} + \underbrace{u^\top W_{k,E} E_{x_j}}_{(c)} + \underbrace{v^\top W_{k,R} R_{i-j}}_{(d)}$$
- Replace all appearances of the absolute positional embedding $U_j$ used for computing key vectors in terms (b) and (d) with its relative counterpart $R_{i-j}$.
- Introduce a trainable parameter $u \in \mathbb{R}^d$ to replace the query term $U_i^\top W_q^\top$ in term (c), and similarly a trainable parameter $v \in \mathbb{R}^d$ in term (d). This suggests that the attentive bias towards different words should remain the same regardless of the query position.
- Deliberately separate the two weight matrices $W_{k,E}$ and $W_{k,R}$ for producing the content-based key vectors and location-based key vectors, respectively.
To summarize, the computational procedure for an $N$-layer Transformer-XL with a single attention head, for $n = 1, \dots, N$ (with $h^0_\tau := E_{s_\tau}$, the word embedding sequence of segment $\tau$):
$$\tilde{h}^{n-1}_\tau = \left[\mathrm{SG}(m^{n-1}_\tau) \circ h^{n-1}_\tau\right]$$
$$q^n_\tau,\, k^n_\tau,\, v^n_\tau = h^{n-1}_\tau {W^n_q}^\top,\; \tilde{h}^{n-1}_\tau {W^n_{k,E}}^\top,\; \tilde{h}^{n-1}_\tau {W^n_v}^\top$$
$$A^n_{\tau,i,j} = {q^n_{\tau,i}}^\top k^n_{\tau,j} + {q^n_{\tau,i}}^\top W^n_{k,R} R_{i-j} + u^\top k^n_{\tau,j} + v^\top W^n_{k,R} R_{i-j}$$
$$a^n_\tau = \mathrm{Masked\text{-}Softmax}(A^n_\tau)\, v^n_\tau$$
$$o^n_\tau = \mathrm{LayerNorm}(\mathrm{Linear}(a^n_\tau) + h^{n-1}_\tau)$$
$$h^n_\tau = \mathrm{Positionwise\text{-}Feed\text{-}Forward}(o^n_\tau)$$
Here $\mathrm{SG}(\cdot)$ denotes stop-gradient, $[\cdot \circ \cdot]$ denotes concatenation along the length dimension, and $m^{n-1}_\tau$ is the cached memory from the previous segment(s).
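A minimal numpy sketch of the relative attention score above, computing terms (a)-(d) for a single head; the dimensions are arbitrary, the embeddings are random stand-ins, and for brevity R is indexed by the key position j rather than the offset i - j (the real model uses a relative-shift trick to do this efficiently):

```python
import numpy as np

rng = np.random.default_rng(0)
d, qlen, klen = 16, 4, 12        # head dim, current segment length, memory + segment length

q = rng.normal(size=(qlen, d))   # query vectors W_q E_x for the current segment
k = rng.normal(size=(klen, d))   # content keys W_{k,E} E_x over [memory; segment]
W_kR = rng.normal(size=(d, d))   # separate projection for the positional keys
R = rng.normal(size=(klen, d))   # stand-in for relative position embeddings R_{i-j}
u = rng.normal(size=(d,))        # global content bias, replaces the query in term (c)
v = rng.normal(size=(d,))        # global position bias, replaces the query in term (d)

ac = (q + u) @ k.T               # terms (a) + (c): content addressing + content bias
bd = (q + v) @ (R @ W_kR.T).T    # terms (b) + (d): position addressing + position bias

scores = (ac + bd) / np.sqrt(d)  # relative attention scores, shape (qlen, klen)
print(scores.shape)              # (4, 12)
```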
Summary
In an analytic study[5], it has been empirically shown that a standard LSTM language model can effectively use about 200 tokens of context on two benchmark datasets, regardless of hyperparameter settings such as model size. It is sensitive to word order in the nearby context, but less so in the long-range context. In addition, the model is able to regenerate words from nearby context, but heavily relies on caches to copy words from far away.
References
1. Hrinchuk, O., Popova, M., & Ginsburg, B. (2020). Correction of Automatic Speech Recognition with Transformer Sequence-to-Sequence Model. ICASSP 2020, IEEE International Conference on Acoustics, Speech and Signal Processing, 7074–7078. https://doi.org/10.1109/ICASSP40776.2020.9053051
2. Heafield, K. (2011). KenLM: Faster and Smaller Language Model Queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, 187–197. http://kheafield.com/professional/avenue/kenlm.pdf
3. Al-Rfou, R., Choe, D., Constant, N., Guo, M., & Jones, L. (2019). Character-Level Language Modeling with Deeper Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 3159–3166. https://doi.org/10.1609/aaai.v33i01.33013159
4. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), 2978–2988. https://doi.org/10.18653/v1/p19-1285
5. Khandelwal, U., He, H., Qi, P., & Jurafsky, D. (2018). Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), 284–294. https://doi.org/10.18653/v1/p18-1027