title: Using Synthetic Audio to Improve the Recognition of Out-of-Vocabulary Words in End-to-End ASR Systems
authors: Zheng, Xianrui; Liu, Yulan; Gunceler, Deniz; Willett, Daniel
date: 2020-11-23

Today, many state-of-the-art automatic speech recognition (ASR) systems apply all-neural models that map audio to word sequences, trained end-to-end with one global optimisation criterion in a fully data-driven fashion. These models allow high-precision ASR for domains and words represented in the training material, but have difficulties recognising words that are rarely or never represented during training, i.e. trending words and new named entities. In this paper, we use a text-to-speech (TTS) engine to provide synthetic audio for out-of-vocabulary (OOV) words. We aim to boost the recognition accuracy of a recurrent neural network transducer (RNN-T) on OOV words by using these extra audio-text pairs, while maintaining the performance on non-OOV words. Different regularisation techniques are explored, and the best performance is achieved by fine-tuning the RNN-T on both the original training data and the extra synthetic data with elastic weight consolidation (EWC) applied to the encoder. This yields a 57% relative word error rate (WER) reduction on utterances containing OOV words without any degradation on the whole test set.

Traditional hybrid ASR systems consist of an acoustic model (AM), a language model (LM) and a pronunciation model (lexicon), all of which are trained independently. In contrast, all components of end-to-end (E2E) ASR systems are trained jointly thanks to an integrated modelling structure. Well-known E2E models include Listen, Attend and Spell (LAS) [1] and RNN-T [2]. A challenge for these E2E models is that they require a large amount of labelled audio data to achieve good performance. For words that appear infrequently in the training data (rare words) or not at all (OOV words), E2E models often struggle to produce correct recognitions [3]. Even though E2E models are typically trained to output subword tokens, which can in theory compose some OOV words, in practice words that are OOV or rare in training suffer from recognition errors. If no correct hypothesis appears in the N-best list or the lattice from beam-search inference, second-pass methods such as LM rescoring [4] also struggle to improve recognition accuracy. Hybrid ASR systems do not suffer from the same limitation, as their factorised components allow simpler updates without the need for speech samples: the lexicon can be extended manually, and the LM can be updated with a small amount of targeted text data to support rare or OOV words.

In real-world voice assistant applications, it is common that an ASR system needs to predict rare or OOV words. For example, after a system is released, new trending words and named entities not included in the original training data may become important. In a frequently encountered scenario for music applications, an ASR system needs to support new artists and newly published albums on an ongoing basis. Furthermore, when extending the functionality of an assistant to new domains, it is highly likely that the existing training data does not cover the traffic in these new domains. With hybrid ASR systems, such domain mismatch can be mitigated by updating the LM with targeted text-only data.
However, with E2E models, it is costly in terms of both time and money to collect additional annotated audio data containing rare and OOV words for each application domain. Previous studies improved the tail performance of an E2E ASR system by combining shallow fusion with MWER fine-tuning [3], or by using a density-ratio approach for LM fusion [5]. These methods need to incorporate extra language models during decoding, which increases the amount of computation. Few-shot learning of E2E models was explored in [6], but only on a small-vocabulary command recognition task.

This paper focuses on improving the performance of an existing word-piece based RNN-T model on new trending words that are completely missing from the training data, i.e. trending OOV words, without shallow fusion or second-pass rescoring with an extra LM. In particular, a TTS engine is used to generate audio from text data containing OOV words, and the synthetic data is used to improve the recognition accuracy on those words. Various regularisation techniques for fine-tuning are investigated and shown to be critical both for boosting the performance on OOV words and for minimising the degradation on non-OOV words.

Domain adaptation is a relevant research thread that improves the performance of an ASR model on test data following a different statistical distribution from the training data. To tackle domain mismatch with text-only data from the target domain, the output of E2E ASR models can be interpolated via shallow fusion with an LM trained on the target-domain text [7]. Another approach is to employ TTS to generate synthetic audio from target-domain text. The synthetic audio-text pairs can be used to adapt an E2E model [8, 9] to the target domain, or to train a spelling correction model [9, 10] that corrects the errors of an existing ASR system.

Employing synthetic audio from TTS for ASR training has recently gained popularity thanks to advances in TTS. Previous work [11, 12, 13] studied creating acoustically and lexically diverse synthetic data, exploring the feasibility of replacing or augmenting real recordings with synthetic data during ASR model training without compromising recognition performance. It was found that synthetic audio could improve training convergence when the amount of available real data is as small as 10 hours, but it could not yet replace real speech recordings and reach the same recognition performance given the same text sources [11]. In [14], instead of mapping text to waveforms with an extra vocoder, mel-spectrograms are synthesised directly and used to update an acoustic-to-word (A2W) attention-based sequence-to-sequence model [15, 16]. [14] confirmed that synthetic TTS data can be used to expand the vocabulary of an A2W E2E model during domain adaptation. Another highlight from [9, 14] is that freezing all encoder parameters was found beneficial when fine-tuning the model on synthetic data towards the target domain.

The mismatch in acoustic characteristics between real and synthetic audio can be problematic for ASR model training and fine-tuning. Besides encoder freezing, another approach is to combine real and synthetic audio when fine-tuning the model, which also helps mitigate catastrophic forgetting [9]. A third approach is to add a loss term that prevents the parameters of the adapted model from moving too far away from the baseline model.
This approach is particularly suitable for applications where the established domains covered by the original training data and the new target domain are equally important. The extra loss function can be as simple as the squared sum of the differences between the parameters before and during fine-tuning, or it can be more advanced, such as elastic weight consolidation (EWC) [17].

The E2E model used in this work is the RNN-T [2, 9, 18]. Three methods are considered to fine-tune a baseline model and improve its performance on new trending OOV words while maintaining the overall accuracy in the source domain.

Fine-tuning on both synthetic and real data prevents the model from forgetting real audio. [9] shows that a subset of the real data from the original source domain is needed if no regularisation method is applied during fine-tuning. In particular, [9] kept about 30% of the source-domain data used in training and combined it with the synthetic target-domain data to form the final fine-tuning set. Instead of directly combining a portion of the original real data with the synthetic data, as done in previous studies [9, 11, 19], we propose to sample data on-the-fly from the source-domain real data and the target-domain synthetic data. The sampling distribution is a global, configurable hyperparameter that propagates into each training batch. This allows consistent sample mixing and makes the data combination independent of the absolute sizes of the synthetic and real data.

With encoder freezing (EF), the encoder parameters of a trained RNN-T are fixed, and only the parameters of the decoder and the joint network are updated during fine-tuning. In previous work [9], freezing the encoder gave much better results than not freezing it when fine-tuning the RNN-T on synthetic data only. This paper examines whether encoder freezing brings extra benefit on top of the sampled data combination method explained in Section 3.1.

Encoder freezing only regularises the encoder of the RNN-T, but unrealistic synthetic audio may also have an indirect negative impact on the other components. In addition, the word distributions of the text behind the real recordings and the synthetic audio differ. When applying the method of Section 3.1 during fine-tuning, the prior probability of words previously represented by real recordings will likely decrease, potentially degrading the overall WER in the source domain. Since such changes in word probability are likely to affect the decoder and joint networks more than the encoder, regularisation of the decoder and joint networks may also be required during fine-tuning. We experiment with EWC [17, 20] for this purpose, with its loss function formulated as

$\mathcal{L}_{\mathrm{EWC}} = \frac{\lambda}{2} \sum_i F_i \left(\theta_{\mathrm{new},i} - \theta_{\mathrm{old},i}\right)^2,$

where $\theta_{\mathrm{new},i}$ is the current value of the $i$-th parameter and $\theta_{\mathrm{old},i}$ is the value of that same parameter before fine-tuning; $\theta_{\mathrm{old},i}$ is therefore fixed throughout the fine-tuning process. $F_i$ is the $i$-th diagonal entry of the Fisher information matrix, used to give each parameter a selective constraint, and $\lambda$ controls the strength of the regularisation. $\mathcal{L}_{\mathrm{EWC}}$ is added to the regular RNN-T loss to force the parameters important to the source domain to stay close to the baseline model.

We use 2.3K hours of anonymised far-field in-house data (Train19) as the training data for the baseline RNN-T. A dev set (Dev) and an eval set (Eval) are constructed from more recent data (1K hours each). Since Dev and Eval were drawn from live traffic of a later time period than Train19, some trending words in Dev and Eval may have rarely appeared in Train19.
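To make the sampled data combination of Section 3.1 concrete, the short Python sketch below draws each example of a batch from either the real source-domain pool or the synthetic target-domain pool according to a fixed mixing weight, independently of the pool sizes. This is only an illustrative sketch under stated assumptions, not the authors' implementation; the function and variable names are hypothetical.

```python
import random

def mixed_batches(real_data, synthetic_data, batch_size, synth_weight=0.3, seed=0):
    """Yield batches whose examples are sampled on-the-fly from two pools.

    synth_weight is the probability that any single example in a batch comes
    from the synthetic (target-domain) pool; the remaining examples come from
    the real (source-domain) pool. Because the weight is applied per example,
    the mixture is independent of the absolute sizes of the two pools.
    """
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            pool = synthetic_data if rng.random() < synth_weight else real_data
            batch.append(rng.choice(pool))
        yield batch
```

A weight of 0.3, for example, corresponds to the (70, 30) setting discussed in the experiments: roughly 30% of each batch is synthetic DevOOV-style data and 70% is original training data.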
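The two regularisation methods of Sections 3.2 and 3.3 can likewise be sketched in a few lines of PyTorch-style pseudocode. This is a minimal sketch, assuming the model exposes an `encoder` submodule and that a diagonal Fisher estimate (e.g. mean squared gradients of the baseline loss over source-domain batches, following [17, 20]) has been computed beforehand; `fisher_diag`, `ewc_lambda` and `rnnt_loss` are hypothetical names, not the authors' code.

```python
import torch

def freeze_encoder(model):
    # EF: fix all encoder parameters so that only the decoder and joint
    # networks receive gradient updates during fine-tuning.
    for p in model.encoder.parameters():
        p.requires_grad = False

class EWCRegulariser:
    """Implements L_EWC = (lambda / 2) * sum_i F_i * (theta_new_i - theta_old_i)^2."""

    def __init__(self, model, fisher_diag, ewc_lambda=1.0):
        # Snapshot of the baseline parameters (theta_old), kept fixed during fine-tuning.
        self.theta_old = {n: p.detach().clone() for n, p in model.named_parameters()}
        # Per-parameter diagonal Fisher information, estimated on source-domain data.
        self.fisher = fisher_diag
        self.ewc_lambda = ewc_lambda

    def penalty(self, model):
        loss = 0.0
        for n, p in model.named_parameters():
            if n in self.fisher:
                loss = loss + (self.fisher[n] * (p - self.theta_old[n]) ** 2).sum()
        return 0.5 * self.ewc_lambda * loss

# During fine-tuning, the penalty is simply added to the transducer loss:
#   total_loss = rnnt_loss(batch) + ewc.penalty(model)
```

Restricting the keys of `fisher_diag` to encoder parameters corresponds to "EWC on encoder", while including decoder and joint parameters corresponds to the configurations studied in Section 3.3.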
A list of OOV words is extracted by comparing Dev with Train19, i.e. all the words that do not appear in Train19 but appear at least three times in Dev. The minimum occurrence count of three helps exclude typos from the OOV word list. The utterances in Dev containing any OOV word form the subset DevOOV, which contains 6.5K utterances and accounts for only 0.7% of the Dev set. Similarly, the utterances in Eval containing any OOV word form the subset EvalOOV, which contains 4.3K utterances. To reduce decoding time, Dev and Eval are randomly down-sampled to 200K utterances each, giving DevSub and EvalSub. The utterances in EvalSub not covered by EvalOOV form another subset, EvalSub\OOV, which is used to monitor the recognition accuracy on non-OOV words.

The US English standard voice of Amazon Polly is used to generate synthetic audio from the text of DevOOV with a single voice profile. This TTS system is based on hybrid unit selection; future work could use a better TTS system to reduce the acoustic mismatch between real and synthetic audio.

A baseline RNN-T model is trained on Train19 until convergence. It has 5 LSTM layers in the encoder network (1024×5), 2 LSTM layers in the decoder network (1024×2), and a joint network with one feedforward layer (512×1). The output of the RNN-T is a probability distribution over 4000 word-piece units from a unigram word-piece model [21].

We report the results in all tables with the normalised word error rate (NWER), which is the regular word error rate (WER) divided by a fixed number shared globally in this work, namely the WER of the baseline RNN-T on DevSub. Each method in Section 3 was tested both on its own and in combination with the other methods. The goal is to find the setup that gives the lowest NWER on DevOOV without degrading the performance on DevSub; this setup is then further validated on EvalOOV and EvalSub.

Table 1. NWERs after fine-tuning the baseline on the combination of Train19 and DevOOV. The weight on the left in the column Weights% is the percentage of samples from Train19 and the weight on the right is the percentage of samples from DevOOV. S/R indicates whether real (R) or synthetic (S) audio is paired with the DevOOV text data for fine-tuning.

When fine-tuning the baseline RNN-T to improve the performance on OOV words, Table 1 shows how the NWERs change as we control the sampling weights when combining Train19 and DevOOV. The model performs worst on DevSub when fine-tuned on DevOOV with synthetic audio only, i.e. weights (0, 100). As the percentage of samples from Train19 increases, the NWER on DevSub decreases, showing that combining synthetic data with the original training data can prevent the model from forgetting what it learnt before. Without degradation on DevSub, the best performance on DevOOV is observed with 70% of the fine-tuning data from Train19 and 30% from DevOOV with synthetic audio, achieving a 49% relative improvement on DevOOV compared to the baseline.

To assess the influence of the acoustic mismatch between real and synthetic audio, the last three rows of Table 1 replace the synthetic audio with real recordings for DevOOV during fine-tuning. Fine-tuning on 20% of DevOOV with real audio achieves a 33% relative NWER reduction on EvalOOV compared to the same setup with synthetic audio. This motivates us to use extra methods to reduce the acoustic gap between real recordings and synthetic audio. Applying EF or EWC on the encoder may prevent the model from learning unwanted acoustic characteristics of the synthetic audio; the effect of these regularisations is shown by comparing Table 1 with Tables 2 and 3.
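The OOV word-list and subset extraction described at the start of this section can be illustrated with a minimal sketch. Whitespace tokenisation and lower-casing are assumptions here, and the function names are hypothetical; the real pipeline is not described at this level of detail in the paper.

```python
from collections import Counter

def extract_oov_words(train_transcripts, dev_transcripts, min_count=3):
    """Return words that never appear in the training text but occur at least
    `min_count` times in the dev text (the threshold filters out typos)."""
    train_vocab = {w for line in train_transcripts for w in line.lower().split()}
    dev_counts = Counter(w for line in dev_transcripts for w in line.lower().split())
    return {w for w, c in dev_counts.items()
            if c >= min_count and w not in train_vocab}

def oov_subset(utterances, oov_words):
    """Select (utterance_id, transcript) pairs containing any OOV word,
    i.e. a DevOOV/EvalOOV-style subset."""
    return [(uid, text) for uid, text in utterances
            if any(w in oov_words for w in text.lower().split())]
```

The transcripts of the resulting subset are what would be fed to the TTS engine to produce the synthetic fine-tuning audio.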
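For orientation, the baseline architecture reported above can be sketched roughly as follows. This is not the authors' model: the input feature dimension and the additive tanh joint are assumptions, and the RNN-T loss and inference are omitted; only the layer sizes (5×1024 encoder, 2×1024 decoder, one 512-unit joint layer, 4000 word pieces plus blank) follow the description in the text.

```python
import torch
import torch.nn as nn

class SmallRNNT(nn.Module):
    """Rough structural sketch of the baseline RNN-T described above."""

    def __init__(self, feat_dim=64, vocab_size=4000, hidden=1024, joint=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=5, batch_first=True)
        self.embed = nn.Embedding(vocab_size + 1, hidden)          # +1 for blank
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.enc_proj = nn.Linear(hidden, joint)
        self.dec_proj = nn.Linear(hidden, joint)
        self.joint_out = nn.Linear(joint, vocab_size + 1)

    def forward(self, feats, labels):
        enc, _ = self.encoder(feats)                 # (B, T, hidden)
        dec, _ = self.decoder(self.embed(labels))    # (B, U, hidden)
        # Combine every encoder frame with every prediction step (one common
        # joint formulation; the paper does not specify the exact combination).
        joint = torch.tanh(self.enc_proj(enc).unsqueeze(2) +
                           self.dec_proj(dec).unsqueeze(1))         # (B, T, U, joint)
        return self.joint_out(joint)                 # logits over word pieces + blank
```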
Compared to EF, applying EWC on the encoder further improved the NWER on DevOOV by 2% relative, suggesting that freezing all encoder parameters during fine-tuning is suboptimal. In this (90, 10) weights setup, compared to applying no regularisation on the encoder in Table 1, EWC introduces 10%, 1% and 14% relative NWER reductions on DevOOV, DevSub and EvalOOV respectively.

Table 3. EWC applied on the decoder (D) and joint (J) networks on top of EF.

Table 3 adds EWC regularisation on the decoder and joint networks. EF is used instead of EWC on the encoder because EF does not require careful hyperparameter tuning while performing close to EWC. Compared with Table 2, when fine-tuning on DevOOV only with synthetic audio, regularising the decoder and joint networks with EWC helps mitigate the degradation on DevSub. However, when 90% of the fine-tuning data is sampled from Train19, adding EWC on the decoder and joint networks does not improve the performance on DevSub, while it introduces a 7% relative degradation on DevOOV.

Two models are evaluated on the subsets from Eval. When the original training data is available, the model fine-tuned with the (90, 10) weights setup and EWC applied on the encoder (highlighted in Table 2) is selected. Otherwise, we choose the model fine-tuned on 100% synthetic audio with EF and with EWC on the decoder and joint networks (highlighted in Table 3). Table 4 reports their results on EvalOOV and EvalSub against the baseline. For both models, the improvement previously observed on DevOOV is successfully replicated on EvalOOV.

Fig. 1 shows the relative WER reduction averaged over all OOV words that appear the same number of times in DevOOV, based on the recognition results highlighted in Table 4. Overall, the recognition performance improves more if an OOV word is seen many times during fine-tuning, but the performance also improves for some OOV words that are seen only a few times. The two most frequent OOV words are "coronavirus" and "covid". Even though the word-piece vocabulary could theoretically compose "coronavirus", the baseline model in fact does not recognise any occurrence of "coronavirus" in EvalOOV. With the best model in Table 4, the WER for "coronavirus" drops by 87% relative on EvalOOV.

This paper has shown that using synthetic audio is an effective way to incrementally update an existing RNN-T model to learn OOV words. The best result gives a 57% relative WER reduction on EvalOOV without degradation on EvalSub, indicating that the WERs of OOV words can be significantly reduced while preserving the performance elsewhere. This is accomplished by applying regularisation on the encoder parameters and mixing the original training data with synthetic data during fine-tuning. Our study also shows that when fine-tuning on synthetic data only, applying regularisation on all RNN-T components mitigates the degradation in the overall WER better than applying regularisation on the encoder alone.
[1] Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition
[2] Sequence Transduction with Recurrent Neural Networks
[3] Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus
[4] A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition
[5] A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition
[6] Few-Shot Learning with Attention-Based Sequence-To-Sequence Models
[7] On Using Monolingual Corpora in Neural Machine Translation
[8] Personalization of End-To-End Speech Recognition on Mobile Devices for Named Entities
[9] Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability
[10] A Spelling Correction Model for End-To-End Speech Recognition
[11] Speech Recognition with Augmented Synthesized Speech
[12] Training Neural Speech Recognition Systems with Synthetic Speech Augmentation
[13] You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
[14] Leveraging Sequence-To-Sequence Speech Synthesis for Enhancing Acoustic-To-Word Speech Recognition
[15] End-To-End Continuous Speech Recognition Using Attention-Based Recurrent NN: First Results
[16] Attention-Based Models for Speech Recognition
[17] Overcoming Catastrophic Forgetting in Neural Networks
[18] Streaming End-To-End Speech Recognition for Mobile Devices
[19] Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems
[20] Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension
[21] SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing