Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement

Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka

Date: 2021-06-05

With the surge of online meetings, it has become more critical than ever to provide high-quality speech audio and live captioning under various noise conditions. However, most monaural speech enhancement (SE) models introduce processing artifacts and thus degrade the performance of downstream tasks, including automatic speech recognition (ASR). This paper proposes a multi-task training framework to make SE models harmless to ASR. Because most ASR training samples do not have corresponding clean signal references, we alternately perform two model update steps called SE-step and ASR-step. The SE-step uses clean and noisy signal pairs and a signal-based loss function. The ASR-step applies a pre-trained ASR model to training signals enhanced with the SE model. A cross-entropy loss between the ASR output and the reference transcriptions is calculated to update the SE model parameters. Experimental results with realistic large-scale settings using ASR models trained on 75,000 hours of data show that the proposed framework improves the word error rate for the SE output by 11.82% with little compromise in the SE quality. A performance analysis is also carried out by changing the ASR model, the data used for the ASR-step, and the schedule of the two update steps.

Due to the need for social distancing caused by the COVID-19 pandemic, people and organizations have had to rely on digital technologies to stay connected and work remotely [1]. This has resulted in a surge in the usage of online conferencing tools. Most organizations rely on these tools to conduct day-to-day business, while people rely on them to connect with their family members and friends in these challenging times. As a result, it is becoming more important than ever to provide high-quality speech audio under various household noise conditions.

In recent years, deep neural networks have shown great potential for single-channel speech enhancement (SE), or noise suppression [2, 3, 4, 5, 6, 7, 8]. Although these models substantially remove background noise, most of them significantly degrade the performance of downstream tasks such as automatic speech recognition (ASR), as modern commercial multi-condition trained ASR systems can usually recognize the original noisy speech well [5] and the SE models introduce unseen distortions that are particularly harmful to ASR. When both live captioning and high-quality audio are needed, a common solution is for the local client to send both an enhanced signal for communication and an unaltered signal for transcription. Clearly, there is a benefit to creating a speech enhancement system that can improve speech quality without compromising the ASR accuracy.

In this paper, we propose a framework for optimizing speech enhancement models for both communication and transcription quality by leveraging pre-trained ASR models. The framework aims to build an SE model that achieves superior ASR performance while retaining the same speech quality as an SE model trained solely for SE objectives. Our training framework alternately performs two steps: SE-step and ASR-step.
In the SE-step, the model is trained with parallel data created by artificially mixing clean speech with noise files, as with conventional supervised SE training. For the ASR-step, we use a realistic ASR model trained on a large amount of data. The training audio is passed through the SE network and fed to the ASR network to calculate an ASR-based loss. In both steps, only the SE network is updated while the ASR model parameters remain unchanged. This training scheme allows us to leverage real noisy recordings that do not have corresponding clean speech signals to optimize the SE network.

We evaluate our framework by using both real and synthetic data under various conditions with respect to signal-to-noise ratios (SNRs), types of noise, and recording scenarios. The experimental results show that our proposed framework improves the word error rate (WER) by 11.82% over the state-of-the-art causal DCCRN model [3] for real recordings. In addition, we conduct ablation studies by changing the ASR model, the data for the ASR-step, and the schedule of the two training steps. We use different ASR systems for training and evaluation to show the generalization capability of the proposed framework.

Speech Enhancement: There are regression-based and masking-based approaches for speech enhancement using neural network models [9, 3]. Regression-based approaches try to predict a clean speech signal or its time-frequency (TF) representation from the noisy speech input. Masking-based methods try to estimate a TF mask from the noisy input and apply the predicted mask to the same input to obtain the clean signal. Various architectures, along with many objective functions, have been proposed for solving the SE problem. We focus on the most recent and promising work. Choi et al. [6] proposed a deep complex U-net (DCUNET) that uses real-valued convolution operations on the real and imaginary parts of the input speech STFT. They took advantage of the U-net structure and deep complex networks [10], which were shown to be useful for SE [11]. They employed a scale-invariant signal-to-noise ratio (SI-SNR) loss calculated in the time domain to optimize their model parameters. Hu et al. [3] proposed the deep complex convolution recurrent network (DCCRN). They introduced complex long short-term memory (LSTM) layers in the bottleneck layer and complex batch normalization. The resulting model was significantly faster and had fewer parameters than DCUNET. Kinoshita et al. [5] considered using a convolutional time-domain audio separation network (Conv-TasNet) for SE and proposed Denoising-TasNet, which directly processes a speech waveform with 1D convolutions. Their results show that the time-domain SE approach achieved a relative WER improvement of more than 30% for a robust ASR back-end. However, we could not observe WER improvements in our preliminary evaluation with time-domain approaches using a production-grade ASR back-end trained on both clean and noisy audio. Hence, we do not consider this approach in this work.

ASR Multi-task Training: There are prior studies that employ joint or multi-task training for the front-end and back-end [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. However, they were concerned only with the ASR accuracy and paid little attention to the SE quality or the trade-off between the two tasks. In contrast, we focus on the front-end and optimize it for both tasks.
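Several of the enhancement models discussed above, including DCUNET, are optimized with the time-domain SI-SNR objective. The following is a minimal PyTorch sketch of that loss under its usual definition (zero-mean signals and a scale-invariant target projection); it is provided only as an illustration of that line of work and is not the training objective used in this paper.

```python
import torch

def si_snr_loss(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR, averaged over the batch.

    estimate, reference: (batch, samples) time-domain waveforms.
    """
    # Remove the mean so the measure is invariant to constant offsets.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)

    # Project the estimate onto the reference to obtain the scaled target.
    dot = torch.sum(estimate * reference, dim=-1, keepdim=True)
    energy = torch.sum(reference ** 2, dim=-1, keepdim=True) + eps
    target = dot / energy * reference

    noise = estimate - target
    ratio = torch.sum(target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    si_snr = 10.0 * torch.log10(ratio + eps)
    return -si_snr.mean()  # negate so that minimizing the loss maximizes SI-SNR
```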
This section elaborates on our proposed framework. We first present the multi-task training framework that combines SE and ASR. In the subsections that follow, we describe the components that constitute the proposed framework, i.e., our SE model as well as the SE and ASR loss functions to be used.

The idea behind the proposed method is to optimize an SE model for both noise suppression and ASR. Multi-task training is usually performed by using training samples that have supervision signals (i.e., reference signals and word labels) for all the tasks to be considered. We consider a scenario where each training sample has a supervision signal for only one of the two tasks, SE or ASR. This is because, in practice, most ASR datasets do not contain training samples with reference clean speech signals that could be used for supervised SE training. Also, some SE datasets do not come with human transcriptions. To cope with this scenario, we update the SE model parameters by alternately performing the following two steps:

SE-step: We mix clean speech and noise samples on the fly and use the noisy and clean speech pairs to evaluate a signal difference-based loss function. The SE model parameter gradients with respect to this loss function are computed to update the model parameters.

ASR-step: Noisy training samples in a mini-batch are fed to the SE network. The generated enhanced signals are input to the ASR network. We evaluate a loss function by comparing the ASR model output and the reference transcriptions. The loss is back-propagated all the way down to the SE network, and only the SE model parameters are updated. This is because our objective is to find SE model parameter values that would work for existing well-trained ASR systems, and therefore we do not want the ASR network to adapt to the characteristics of the SE model.

Figure 1 shows the diagram of the proposed framework.

[Figure 1: Proposed multi-task training framework. Ŝ and Ŷ denote the enhanced signal and the predicted transcription, respectively.]

The two-step approach allows the SE model training to take advantage of real noisy speech samples that have only reference transcriptions. At each training iteration, the update step to be used is chosen randomly from a Bernoulli distribution. We refer to the probability of choosing the SE-step as the "SE-step probability". Before performing the multi-task training, the SE model parameters are pre-trained on the SE training dataset.
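To make the alternating schedule concrete, the following is a minimal PyTorch-style sketch of the training loop. The model objects, data loaders, and loss-function names are hypothetical placeholders rather than the actual production implementation; the sketch only illustrates the Bernoulli step selection, the frozen ASR back-end, and the fact that gradients from both steps update the SE model alone.

```python
import random
from itertools import cycle

import torch

def multitask_training(se_model, asr_model, se_loader, asr_loader,
                       se_loss_fn, asr_ce_loss_fn,
                       se_step_prob=0.5, num_iters=100_000):
    """Alternating SE-step / ASR-step schedule (sketch).

    se_loader yields (noisy, clean) waveform pairs mixed on the fly;
    asr_loader yields (noisy, labels) pairs, which may be real recordings
    without clean references. Only the SE model is updated; the pre-trained
    ASR model stays frozen.
    """
    for p in asr_model.parameters():
        p.requires_grad_(False)      # freeze the ASR back-end
    asr_model.eval()

    optimizer = torch.optim.Adam(se_model.parameters(), lr=1e-4)
    se_iter, asr_iter = cycle(se_loader), cycle(asr_loader)  # endless iterators

    for _ in range(num_iters):
        optimizer.zero_grad()
        if random.random() < se_step_prob:
            # SE-step: supervised enhancement on simulated noisy/clean pairs.
            noisy, clean = next(se_iter)
            enhanced = se_model(noisy)
            loss = se_loss_fn(enhanced, clean)     # e.g., the PHASEN loss described below
        else:
            # ASR-step: pass the enhanced audio through the frozen ASR model and
            # back-propagate the cross-entropy loss down to the SE model only.
            noisy, labels = next(asr_iter)
            enhanced = se_model(noisy)
            log_probs = asr_model(enhanced, labels)   # teacher-forced Seq2Seq forward pass (hypothetical interface)
            loss = asr_ce_loss_fn(log_probs, labels)
        loss.backward()
        optimizer.step()
```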
We employ DCCRN [3] as our front-end model because it achieved the best SE performance in our preliminary test, while we also provide results using DCUNET to examine the generalization capability of the proposed framework. DCCRN applies an encoder-decoder architecture with two LSTM layers in between. Instead of using conventional 2D convolutional/deconvolutional layers (conv2D/deconv2D), DCCRN builds on DCUNET [6], which employs complex conv2D/deconv2D followed by complex batch normalization in the encoder and decoder blocks. In addition, DCCRN employs complex LSTM layers instead of conventional LSTM layers. Furthermore, it contains U-Net style skip connections (concatenation) from the encoder to the decoder as in DCUNET.

The DCCRN model takes the real and imaginary parts of a noisy spectrogram as input. It estimates a complex ratio mask (CRM) and applies it to the noisy speech. The masked signal is converted back to the time domain with ISTFT. The CRM can be defined as follows:

\mathrm{CRM} = \frac{Y_r S_r + Y_i S_i}{Y_r^2 + Y_i^2} + j\,\frac{Y_r S_i - Y_i S_r}{Y_r^2 + Y_i^2},

where Y_r and Y_i denote the real and imaginary parts of the noisy spectrogram, respectively, while S_r and S_i denote the real and imaginary parts of the clean spectrogram, respectively.

We consider both non-causal and causal DCCRN configurations. The non-causal model can look ahead and employs bidirectional LSTM layers. The causal model can only process the current and previous frames and employs unidirectional LSTM layers and causal padding for the conv2D/deconv2D layers. The latter is suitable for most applications. For further details of DCCRN, we refer the reader to [3].

For the loss function of the SE-step and the SE model pre-training, we use the PHASEN loss function [4]. It outperformed alternative loss functions, such as SI-SNR and power-spectrogram mean squared errors, in our preliminary tests. The PHASEN loss comprises two parts, the amplitude loss L_a and the phase-aware loss L_p:

L_a = \frac{1}{TF} \sum_{t,f} \left( |S(t,f)|^p - |\hat{S}(t,f)|^p \right)^2,
L_p = \frac{1}{TF} \sum_{t,f} \left| |S(t,f)|^p e^{j\phi(S(t,f))} - |\hat{S}(t,f)|^p e^{j\phi(\hat{S}(t,f))} \right|^2,

where Ŝ and S are the estimated and reference (i.e., clean) spectrograms, respectively, and T and F are the numbers of time frames and frequency bins. Hyper-parameter p is a spectral compression factor and is set to 0.3. Operator ϕ calculates the argument of a complex number.
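A compact PyTorch sketch of this loss, written directly from the two equations above (power-law compression with p = 0.3, an amplitude term, and a phase-aware term), might look as follows. The tensor shapes and the equal weighting of the two terms are our assumptions for illustration, not the paper's exact implementation.

```python
import torch

def phasen_loss(est_spec: torch.Tensor, ref_spec: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """PHASEN-style loss on complex spectrograms of shape (batch, freq, time).

    est_spec: enhanced (estimated) complex STFT; ref_spec: clean reference STFT.
    """
    eps = 1e-8
    est_mag = est_spec.abs().clamp_min(eps) ** p   # power-law compressed magnitudes
    ref_mag = ref_spec.abs().clamp_min(eps) ** p

    # Amplitude loss: MSE between compressed magnitude spectrograms.
    loss_amp = torch.mean((ref_mag - est_mag) ** 2)

    # Phase-aware loss: MSE between compressed complex spectrograms that keep
    # the original phase, i.e. |S|^p * exp(j * phi(S)).
    est_cplx = est_mag * torch.exp(1j * est_spec.angle())
    ref_cplx = ref_mag * torch.exp(1j * ref_spec.angle())
    loss_phase = torch.mean(torch.abs(ref_cplx - est_cplx) ** 2)

    # Equal weighting of the two terms is assumed here.
    return loss_amp + loss_phase
```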
Our back-end model used for the ASR-step training is based on a sequence-to-sequence (Seq2Seq) model using an attention-based encoder-decoder structure [22, 23]. We use the Seq2Seq model because it is simpler than a hybrid ASR system and thus facilitates multi-task training. The model estimates a text sequence C = (c_1, c_2, ...) from the STFT X generated by the SE model. We integrate STFT, log-mel filterbank energy extraction, and global mean-variance normalization into the ASR network to allow the gradients to pass through to the SE model. In the ASR-step, the SE model parameters are updated to minimize the cross-entropy loss between C and the reference label sequence R = (r_1, ..., r_N, r_{N+1} = ⟨eos⟩), where N is the length of the reference sequence R. The special symbol ⟨eos⟩ indicates the sequence end. See Sections 2.3 and 4.4 of [22] for our Seq2Seq ASR model.

We conducted ASR and SE experiments to evaluate the proposed multi-task training framework under realistic settings.

Training Data for SE: We utilized a large-scale and high-quality simulated dataset described in [24], which includes around 1,000 hours of paired speech samples. As a clean speech corpus, the dataset collects 544 hours of speech recordings with high mean opinion score (MOS) values from the LibriVox corpus [25]. The mixtures are created using 247 hours of non-stationary noise recordings from AudioSet+Freesound [26, 27] (187 hours), internal noise recordings (65 hours), and colored stationary noise (1 hour) as noise sources. In addition, the clean speech in each mixture is convolved with an acoustic room impulse response (RIR) sampled from 7,000 measured and simulated responses. See [24] for details of this dataset. The data are available publicly, except for the 65 hours of internal noise recordings.

Training Data for ASR: We trained our Seq2Seq ASR model on 64 million anonymized and transcribed English utterances, totaling 75K hours.

Multi-Task Training Data: We used different training data for the SE-step and the ASR-step, as described earlier. For the ASR-step, we used a subset of the 75K-hour transcribed data. Section 4.3.2 explores considerations for selecting the subset, such as in-domain vs. out-of-domain data and including vs. excluding simulated data. Note that the 75K-hour data for the ASR-step also contained augmented/simulated data and that these simulated data were different from the SE training data.

Evaluation Data: We used both simulated and real test data. The simulated test set comprised 60 hours of simulated audio with SNRs ranging from -10 dB to 30 dB. The simulation was performed as with the training data, while the clean speech signals were taken from LibriSpeech train-clean-100 [28] and convolved with RIRs generated by using different configurations. We added both Gaussian and non-stationary noise. The latter was generated by convolving noise recordings from SoundBible+Freesound (10 hours in total) with simulated RIRs. For the real test data, we employed two sets of noisy audio. The first set consisted of 18 hours of data recorded in an acoustically configurable audio lab where high-fidelity spatial noise sounds were played back from eight loudspeakers. The second set comprised 18 hours of meeting recordings. This test set contained various types of natural noise sounds. Furthermore, we included a clean speech test set consisting of 7,803 words to measure the distortion introduced by the SE model.

Our framework was implemented in PyTorch [29]. The seed (i.e., pre-trained) SE model was trained for 50 epochs with a batch size of 96 using 4 NVIDIA V100 GPUs. In practice, the SE model cannot be fine-tuned for every single back-end, which changes from time to time. Therefore, we used different back-end ASR models for training and evaluation. For ASR evaluation, a high-performance online hybrid ASR model was employed [30, 31]. For multi-task training, the input features for the offline Seq2Seq ASR model were 240-dimensional log mel filterbank features, obtained by stacking 3 frames, each covering 10 ms. Global mean and variance normalization was applied before feeding the features to the ASR model encoder. We used 32K mixed units, with a space symbol between words, as the recognition units [32]. A label-smoothed cross-entropy loss [33] was applied for training.
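As a concrete illustration of the differentiable feature pipeline described above (STFT, log-mel filterbank extraction, frame stacking, and global mean-variance normalization built into the ASR network so that gradients reach the SE model), here is a minimal PyTorch sketch. The specific parameter values (16 kHz audio, FFT size, 80 mel bins per frame so that stacking 3 frames yields 240 dimensions, and precomputed normalization statistics) are our assumptions rather than the paper's exact configuration.

```python
import torch
import torchaudio

class DifferentiableFrontend(torch.nn.Module):
    """Log-mel front-end that keeps the computation graph intact, so the
    ASR cross-entropy loss can be back-propagated to the SE model.

    Assumed values: 16 kHz audio, 25 ms window / 10 ms hop, 80 mel bins,
    3-frame stacking -> 240-dimensional features, precomputed global stats.
    """

    def __init__(self, mean: torch.Tensor, std: torch.Tensor, stack: int = 3):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=512, win_length=400, hop_length=160, n_mels=80)
        self.stack = stack
        self.register_buffer("mean", mean)   # (240,) global mean
        self.register_buffer("std", std)     # (240,) global standard deviation

    def forward(self, enhanced_wave: torch.Tensor) -> torch.Tensor:
        # enhanced_wave: (batch, samples) output of the SE model.
        mel = self.melspec(enhanced_wave)                  # (batch, 80, frames)
        logmel = torch.log(mel + 1e-6).transpose(1, 2)     # (batch, frames, 80)

        # Stack 3 consecutive frames into a single 240-dimensional vector.
        b, t, f = logmel.shape
        t = t - t % self.stack
        stacked = logmel[:, :t, :].reshape(b, t // self.stack, f * self.stack)

        # Global mean-variance normalization with precomputed statistics.
        return (stacked - self.mean) / (self.std + 1e-6)
```

The output of such a module would then be consumed by the Seq2Seq encoder during the ASR-step, with gradients flowing back through the feature extraction into the SE model.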
We evaluated our models using the PESQ [34], STOI [35], SDR [36], and pMOS [37] metrics, where pMOS is a neural-network-based non-intrusive MOS estimator that shows high correlation with human MOS ratings without requiring reference signals. Table 1 shows the evaluation results of the seed models and their multi-task trained versions for both the simulated and real test sets. The WER improvements brought by the multi-task training were obtained with little SE quality degradation. For the real recordings, the pMOS values decreased only very marginally from 3.18 to 3.16, from 3.33 to 3.30, and from 3.20 to 3.17 for DCUNet, DCCRN-non-causal, and DCCRN-causal, respectively. When the SE-steps were not interleaved between the ASR-steps, i.e., the SE-step probability was set to 0.0, the WER for DCCRN-causal was further improved and outperformed the WER for the noisy signals. However, this also sacrificed the SE quality to some extent, reducing pMOS from 3.20 to 3.09. A similar trend was also observed for the simulation set. There seems to be a trade-off between the ASR accuracy and the SE quality. An optimal operating point can be chosen by adjusting the SE-step probability based on the application needs. We further investigate this in Section 4.3.3.

[Figure 2: (a) Performance comparison using different ASR back-end models. "Strong," "medium," and "weak" in different colors represent three Seq2Seq models that have different ASR accuracies on the validation set. Each colored dot represents a checkpoint obtained every 5K iterations of multi-task training. (b) Impact of different ASR-step training datasets. Note that they are subsets of the ASR back-end model training data. "In-domain" means the training data come from an application scenario similar to the evaluation set. (c) Impact of the SE-step probability on pMOS and WER.]

Figure 2 (a) shows how the SE and ASR performance changed depending on the ASR back-end model used for the multi-task training. We used three Seq2Seq models with different model structures, which are denoted as "strong" (green), "medium" (yellow), and "weak" (blue) based on a WER ranking calculated for an ASR validation set. The result shows that stronger ASR back-end models were more effective in closing the WER gap from the "No Enhancement" setting while preserving the SE improvement. Note that the "strong" ASR model was trained on the 75K-hour data containing various noisy signals. This suggests that it is important to use a powerful back-end model instead of a noise-sensitive model trained only on clean signals.

Figure 2 (b) shows the impact that different ASR-step training sets had on the SE model's performance. The "strong" back-end model was used. The result shows that the models trained on the 1,302-hour in-domain data (yellow) and those trained on its 224-hour random subset (red) performed equally well. This indicates that it is sufficient to use a relatively small amount of in-domain data. Meanwhile, combining simulated data (green) benefited the ASR performance compared with using only real data (red), which suggests the importance of acoustic diversity in terms of noise and reverberation conditions. It was also observed that using out-of-domain data (blue) degraded the pMOS score compared with using a similar amount of in-domain data (green), while they led to similar WERs.

We also investigated how the SE model's performance was impacted by the SE-step probability value, which controls how often the SE-step is performed in the multi-task training. We used the "strong" Seq2Seq back-end model and the "224-hour in-domain with simulation" training data for the ASR-step. Figure 2 (c) reveals the trade-off between pMOS and WER as the SE-step probability is varied. Performing the ASR-step more frequently improved the ASR performance at the expense of pMOS compared with the seed DCCRN model. When only the ASR-step was performed, i.e., the SE-step probability was set to 0, the WER became better than that of the "No Enhancement" condition, but the pMOS score was substantially compromised. For a general use case, we would prefer a moderate SE-step probability such as 0.5 to serve both human listening and live captioning, as shown in the last row (DCCRN-causal-MT) of Table 1.
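For reference, the signal-based metrics reported above (PESQ, STOI, and SDR) could be computed with common open-source implementations. The sketch below assumes the third-party pesq, pystoi, and mir_eval Python packages and 16 kHz audio; the paper does not state which implementations were used (except that [36] is cited for SDR), and pMOS is omitted because its neural estimator is not part of these packages.

```python
import numpy as np
from pesq import pesq        # ITU-T P.862 PESQ (python-pesq package, assumed)
from pystoi import stoi      # short-time objective intelligibility (assumed)
import mir_eval              # BSS Eval SDR, as in [36]

def evaluate_pair(reference: np.ndarray, enhanced: np.ndarray, sr: int = 16000) -> dict:
    """Compute PESQ, STOI, and SDR for one reference/enhanced waveform pair."""
    scores = {
        "pesq": pesq(sr, reference, enhanced, "wb"),          # wide-band PESQ
        "stoi": stoi(reference, enhanced, sr, extended=False),
    }
    sdr, _, _, _ = mir_eval.separation.bss_eval_sources(
        reference[np.newaxis, :], enhanced[np.newaxis, :])
    scores["sdr"] = float(sdr[0])
    return scores
```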
In this paper, we proposed a multi-task training framework for deep learning-based SE models to make them friendlier to ASR systems. In the proposed approach, we first pre-trained the SE and ASR models separately. Then, we froze the ASR model parameters and started the multi-task training, where we interleaved two update steps: the SE-step and the ASR-step. The SE-step is the same as conventional supervised SE training using noisy and clean speech pairs. For the ASR-step, we first enhanced the signals with the SE model, then fed the enhanced signals to the ASR model and back-propagated the cross-entropy loss. We controlled the frequency of each step with a probability parameter. The experimental results showed that our framework improved the ASR performance with little or no SE quality degradation. The ASR model used for the multi-task training was different from that used for evaluation, indicating that the resultant SE model can generalize to different back-end conditions. We also presented ablation study results that would guide the development of the proposed approach.

In this paper, we opted to use a mix of public and private data to reflect the scale and complexity of real usage scenarios. We hope our work can encourage the research community to establish an open experimental platform for large-scale SE and ASR investigation.

References
[1] Impact of digital surge during COVID-19 pandemic: A viewpoint on research and practice.
[2] Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR.
[3] DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement.
[4] PHASEN: A phase-and-harmonics-aware speech enhancement network.
[5] Improving noise robust automatic speech recognition with single-channel time-domain enhancement network.
[6] Phase-aware speech enhancement with deep complex U-Net.
[7] Joint time-frequency and time domain learning for speech enhancement.
[8] PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss.
[9] Convolutional-recurrent neural networks for speech enhancement.
[10] Deep complex networks.
[11] SEGAN: Speech enhancement generative adversarial network.
[12] Likelihood-maximizing beamforming for robust hands-free speech recognition.
[13] Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks.
[14] Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning.
[15] Multi-task learning for speech recognition: An overview.
[16] Deep beamforming networks for multi-channel speech recognition.
[17] Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition.
[18] Multichannel signal processing with deep neural networks for automatic speech recognition.
[19] Frequency domain multi-channel acoustic modeling for distant speech recognition.
[20] Speech enhancement using end-to-end speech recognition objectives.
[21] The 2020 ESPnet update: New features, broadened applications, performance improvements, and future plans.
[22] Exploring end-to-end multi-channel ASR with bias information for meeting transcription.
[23] Listen, attend and spell: A neural network for large vocabulary conversational speech recognition.
[24] Towards efficient models for real-time deep noise suppression.
[25] LibriVox: Free public domain audiobooks.
[26] Audio Set: An ontology and human-labeled dataset for audio events.
[27] Freesound Datasets: A platform for the creation of open audio datasets.
[28] LibriSpeech: An ASR corpus based on public domain audio books.
[29] PyTorch: An imperative style, high-performance deep learning library.
[30] High-accuracy and low-latency speech recognition with two-head contextual layer trajectory LSTM model.
[31] Improving layer trajectory LSTM with future context frames.
[32] Advancing acoustic-to-word CTC model.
[33] Towards better decoding and language model integration in sequence to sequence models.
[34] Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs.
[35] An algorithm for intelligibility prediction of time-frequency weighted noisy speech.
[36] mir_eval: A transparent implementation of common MIR metrics.
[37] Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network, in WASPAA.