key: cord-0520612-ir1duyv9
authors: Thakker, Manthan; Eskimez, Sefik Emre; Yoshioka, Takuya; Wang, Huaming
title: Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation
date: 2022-04-02
journal: nan
DOI: nan
sha: 5dec72021d7c3a392cdff182b8d29918d02e4167
doc_id: 520612
cord_uid: ir1duyv9

* These authors contributed equally to this work.

This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is $3\times$ faster than a baseline STFT-based model. In addition, we use KD techniques to develop compressed student models without significantly degrading quality. We also investigate using noisy data without reference clean signals for training the student models, where we combine KD with multi-task learning (MTL) using an automatic speech recognition (ASR) loss. Our results show that E3Net provides better speech and transcription quality with a lower target speaker over-suppression (TSOS) rate than the baseline model. Furthermore, we show that the KD methods can yield student models that are $2$-$4\times$ faster than the teacher and provide reasonable quality. Combining KD and MTL improves the ASR and TSOS metrics without degrading the speech quality.

Online conferencing tools have been widely adopted for conducting business and connecting with family and friends since the beginning of the COVID-19 pandemic. However, background noise and reverberation degrade the speech and transcription quality. As these acoustic distortions can hamper communication and thus productivity, the speech enhancement (SE) field has drawn a lot of renewed attention recently [1, 2].

Personalized speech enhancement (PSE) improves on the general SE approach by using prior knowledge about a target speaker [2, 3, 4, 5]. One exemplary approach to PSE is to extract a speaker embedding vector from a short enrollment audio sample of the target speaker and feed it to an SE model. This enables the SE model to remove interfering speakers in addition to the background noise. PSE provides practical benefits, especially for users who share the same environment with other people during teleconferencing.

To deploy PSE models on the various devices used for online meetings, it is paramount to make the computational cost very low while satisfying other performance requirements. These requirements include causal modeling for real-time processing, high-quality noise and interfering speaker suppression, little negative impact on the automatic speech recognition (ASR) accuracy, and little target speaker over-suppression (TSOS). Since PSE models attempt to remove human voices spoken by interfering speakers, they may occasionally suppress the target speaker by mistake. Preventing TSOS is critical for applying PSE models to real scenarios.

In this paper, we examine multiple approaches to take on this challenge and build real-time PSE models with very low computational cost while maintaining satisfactory accuracy. We investigate this problem from three angles.
End-to-end modeling: We propose a personalized end-to-end enhancement network (E3Net), a faster neural network model that improves the speech quality and WER and reduces TSOS compared with the previously proposed personalized deep complex convolution recurrent network (pDCCRN) [4].

Knowledge distillation (KD): KD is one of the model compression approaches widely used in machine learning. We apply generic KD recipes to PSE by using a large-capacity teacher model and a much smaller student model. We show experimental results for both pDCCRN and E3Net.

Leveraging unpaired noisy speech: Most real-world recordings do not come with clean reference signals. Thus, PSE models are usually trained on clean-noisy speech sample pairs created by simulation, which may cause a mismatch between the training and deployment environments. We examine the effect of applying bigger teacher models to real noisy data and using their outputs as clean references for the student models. Combination with multi-task learning (MTL) using ASR transcriptions [4, 6] is also considered.

Our results show that E3Net provides better quality and runs 3× faster than the pDCCRN model. Furthermore, our ablation study for E3Net suggests that using a large number of filters in the learnable encoder-decoder is required to obtain a high-quality end-to-end model. In addition, the KD experiments show that student models up to 3× faster than their teachers can provide reasonable quality for both the pDCCRN and E3Net architectures. We also show that combining MTL with the KD methods can further reduce TSOS with almost no degradation to speech quality.

Personalized speech enhancement (PSE) is a speech enhancement method conditioned on a cue representing the target speaker. This cue is often provided as a static embedding vector that captures the target speaker's speech characteristics. The main goal of PSE is to filter out interfering speakers in addition to background noise and reverberation. Giri et al. [3] proposed Personalized PercepNet by modifying the original PercepNet model [7] to accept speaker embeddings. Although the method was computationally efficient, it was based on a heuristic model and complex to build. The model was evaluated only for speech communication, and the ASR accuracy was not considered. Eskimez et al. [4] proposed two PSE models, an evaluation metric called target speaker over-suppression (TSOS), and test sets covering various scenarios. TSOS measures the degree of removal of the target speaker's speech segments and is critical for PSE since removing the target speech hampers effective conversations and degrades the transcription quality, as reported in [8]. Furthermore, Taherian et al. [5] extended [4] to multi-channel scenarios by proposing a model that works with any number of microphones and any array geometry. Although the models of [4] can run on PCs in real time, their computational cost was still too high for real usage, as audio processing can use only a tiny fraction of the available resources on devices.

Knowledge distillation (KD), or teacher-student (TS) learning, has been widely explored in the natural language processing [9, 10], computer vision [11, 12], and speech processing [13, 14] domains for training faster and more compact models with supervision signals generated by computationally demanding teacher models. KD was first formalized in [15]. Hinton et al. [16] trained a smaller student network with labels generated by a stronger teacher model comprising an ensemble of models.
Hao et al. [17] used an SNR-based TS method to train an SE model using an ensemble of different models, which were individually trained for different SNR ranges. Kobayashi et al. [18] used KD to train a unidirectional recurrent student network with a bi-directional teacher network, inspired by Born-Again Networks [19]. Kim et al. [20] leveraged TS learning to perform online training of the student network for real-time speech enhancement on devices by using the teacher's pseudo labels. The KD study in [21] showed that KD helped avoid overfitting through regularization, leading to more robust and generalized student networks. Recently, Chen et al. [22] explored various techniques, including objective shifting (OS) and layer-by-layer KD, for speech separation.

In this paper, we first propose a real-time end-to-end PSE network with low computational cost, called E3Net. The end-to-end modeling and optimization facilitate efficient use of the modeling capacity and easy development. We also apply KD recipes to build more efficient PSE models and experimentally analyze their effectiveness. In addition, we explore the use of KD to leverage unpaired noisy samples. Furthermore, we combine KD with multi-task training using speech recognition labels, which also allows the use of the unpaired noisy data [6, 4], to further improve the transcription quality. Extensive experimental results are shown to compare the various techniques. Note that model quantization or pruning, yet another popular approach to model compression, is out of the scope of our investigation as it relies heavily on runtime support.

We approach the problem described in Section 1 from two different perspectives. First, we describe a simple and efficient end-to-end architecture that outperforms the previously proposed pDCCRN [4]. Next, we describe knowledge distillation and multi-task learning methods, which can be applied to any SE/PSE model architecture.

We propose an end-to-end enhancement network (E3Net) that utilizes a learnable encoder and decoder instead of short-time Fourier transform (STFT) features. Figure 1 shows the personalized E3Net model. The network takes a raw waveform as input and processes it with a 1D convolutional layer. Unlike some prior end-to-end enhancement models such as Conv-TasNet [23], we use filter and stride sizes that are equal to the window and hop sizes, respectively, of typical STFT configurations to reduce the computational cost. The learnable encoder extracts linear features and feeds them to a PReLU non-linear activation layer and a layer normalization module. Subsequently, these features are concatenated with a speaker embedding vector and projected to an intermediate feature space by a linear layer followed by a non-linear activation. N LSTM blocks process the intermediate features and feed them to a mask prediction layer, which estimates masks to be applied to the original features with element-wise multiplication. The mask prediction layer consists of a fully connected layer with sigmoid non-linearity. Finally, a learnable decoder recovers a clean signal from the masked features.

The LSTM blocks contain two fully connected layers with PReLU activation, where the second fully connected layer is followed by layer normalization. This two-layer fully connected block first maps the features to a higher dimensional non-linear space with the first layer and then converts them back to a space with the original dimensionality. The output of this block is then fed to a single LSTM layer, followed by another layer normalization module. We add the fully connected block's output to this layer normalization output and apply a final layer normalization module.

The end-to-end approach allows the whole network to be optimized without using redundant representations like the STFT features, facilitating the use of smaller models. Also, using only basic neural network elements such as fully connected and LSTM layers allows us to leverage optimized runtime modules for efficient computation.

Figure 1: Proposed personalized E3Net model. "Cat", "Sum", and "Mut" stand for concatenation, addition, and element-wise multiplication.
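For clarity, the following is a minimal PyTorch sketch of the E3Net structure described above. The layer sizes (2048 encoder filters, 256-dimensional embeddings and LSTMs, a 1024-dimensional hidden layer, and 20 ms / 10 ms windows at 16 kHz) follow the configuration reported later in the experiments; details the text does not pin down, such as the exact normalization axes and whether the mask is applied to the pre- or post-activation encoder features, are assumptions.

```python
# Minimal sketch of the E3Net structure described above (not the authors' code).
import torch
import torch.nn as nn


class LSTMBlock(nn.Module):
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        # Two-layer FC block: project up, then back to the original dimensionality.
        self.fc = nn.Sequential(
            nn.Linear(dim, hidden), nn.PReLU(),
            nn.Linear(hidden, dim), nn.PReLU(), nn.LayerNorm(dim),
        )
        self.lstm = nn.LSTM(dim, dim, batch_first=True)  # unidirectional (causal)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, frames, dim)
        h = self.fc(x)
        y, _ = self.lstm(h)
        y = self.norm1(y)
        return self.norm2(y + h)                 # residual connection around the LSTM


class E3Net(nn.Module):
    def __init__(self, n_filters=2048, dim=256, n_blocks=4,
                 win=320, hop=160, emb_dim=256):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, kernel_size=win, stride=hop)
        self.enc_act = nn.Sequential(nn.PReLU(), nn.LayerNorm(n_filters))
        self.proj = nn.Sequential(nn.Linear(n_filters + emb_dim, dim), nn.PReLU())
        self.blocks = nn.Sequential(*[LSTMBlock(dim) for _ in range(n_blocks)])
        self.mask = nn.Sequential(nn.Linear(dim, n_filters), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=win, stride=hop)

    def forward(self, wav, spk_emb):
        # wav: (batch, samples), spk_emb: (batch, emb_dim)
        lin = self.encoder(wav.unsqueeze(1)).transpose(1, 2)   # linear encoder features
        feats = self.enc_act(lin)                              # PReLU + layer norm
        emb = spk_emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        x = self.proj(torch.cat([feats, emb], dim=-1))
        m = self.mask(self.blocks(x))                          # per-frame sigmoid mask
        return self.decoder((lin * m).transpose(1, 2)).squeeze(1)


# Example: enhance 1 s of 16 kHz audio for one enrolled speaker.
model = E3Net()
enhanced = model(torch.randn(1, 16000), torch.randn(1, 256))
```

Because the sketch uses only convolutions, fully connected layers, and unidirectional LSTMs, it is compatible with the causal, frame-by-frame processing that the real-time requirement implies.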
The main goal of KD, or TS learning, is to train a student model with labels or supervision signals generated by a teacher model. The student is typically smaller than the teacher. KD can also be used to leverage data without references, which are abundantly available. We investigate the following strategies for PSE.

Vanilla KD: This method trains the student network with a supervised objective function in which the teacher model output is used as the ground-truth label. These labels are called pseudo labels. The student model learns to mimic the teacher model. While we also examined objective shifting KD [22] as well as starting KD from a pre-trained supervised model, their performance was similar to that of vanilla KD. Therefore, we report only the vanilla KD results.

Leveraging unlabeled data: Real noisy speech data are usually obtained without the corresponding clean reference signals. To utilize these data, we explore using the unlabeled data with a strong teacher. Throughout the paper, we use the term "unlabeled" to refer to data without a clean reference signal. Specifically, we construct two batches at each training step: 1) simulated noisy data with the corresponding ground-truth clean reference signals; and 2) real noisy data with pseudo labels generated by the teacher network.

Multi-task learning (MTL) using an ASR-based loss function is another approach for utilizing unpaired real noisy signals for training. One exemplary scheme is to alternate two model update steps. The first step, called the SE-step, uses the conventional supervised loss function with simulated data and the corresponding ground-truth clean signals. The second step, called the ASR-step, computes an ASR-related loss by feeding the SE model output to a pre-trained sequence-to-sequence ASR model and updates only the PSE model parameters. This method was shown to work with different ASR models for non-personalized [6] and personalized SE [4]. We apply MTL to the KD student models using the same technique described in our previous work: we train the student network with alternating steps for the ASR and PSE losses. Unlike the previous version, we add a third step, extracting pseudo labels from the unlabeled data using the teacher model and applying a supervised loss with these pseudo labels to the student model.
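To make the procedure concrete, the sketch below shows one outer training iteration combining the supervised SE-step, the ASR-step, and the added pseudo-label (KD) step. The loss functions, batch contents, and ASR model interface are hypothetical placeholders; only the alternation pattern and the rule that the ASR model stays frozen follow the description above.

```python
# Hedged sketch of the three alternating student updates (not the authors' training code).
import torch

def train_iteration(student, teacher, asr_model, se_loss, asr_loss,
                    sim_batch, real_batch, optimizer):
    # The optimizer is assumed to hold only the student's parameters, so the ASR-step
    # back-propagates through the frozen ASR model but updates the student alone.
    teacher.eval()
    noisy, clean, emb = sim_batch            # simulated data with clean references
    r_noisy, r_emb, transcript = real_batch  # real noisy data with ASR transcriptions

    # 1) SE-step: conventional supervised loss on simulated clean/noisy pairs.
    optimizer.zero_grad()
    se_loss(student(noisy, emb), clean).backward()
    optimizer.step()

    # 2) ASR-step: feed the enhanced output to the pre-trained ASR model and
    #    back-propagate the ASR-related loss into the student only.
    optimizer.zero_grad()
    asr_loss(asr_model(student(r_noisy, r_emb)), transcript).backward()
    optimizer.step()

    # 3) KD-step: pseudo labels produced by the larger teacher on the real noisy data.
    optimizer.zero_grad()
    with torch.no_grad():
        pseudo_clean = teacher(r_noisy, r_emb)
    se_loss(student(r_noisy, r_emb), pseudo_clean).backward()
    optimizer.step()
```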
We followed the same data configurations as those employed in [4], using 2000 hours of training and 50 hours of validation data. The clean speech signals for simulation were taken from the DNS Challenge [1], comprising high-quality LibriVox speech data that were selected based on mean opinion score (MOS) values. The noise files were obtained from AudioSet [24] and Freesound [25]. Simulated room impulse responses (RIRs) were used to control the target and interfering speakers' distances to the microphone. In addition, we used a subset (1302 hours) of the 75K-hour ASR training set described in [6] for the experiments requiring unpaired noisy data, as it contained ASR reference labels and was thus usable for both the multi-task learning and KD experiments.

For evaluation, we used our previous test set [4], which included three scenarios: 1) TS1: target speaker + interfering speaker + noise; 2) TS2: target speaker + noise; and 3) TS3: target speaker only. We constructed these sets based on the VCTK corpus [26]. We simulated noisy speech samples for each file and then concatenated all the files of the same target speaker to obtain a single long-duration test file for each speaker. This test set was challenging because the noise and reverberation characteristics changed every few seconds. The average duration of the files was 27.5 minutes.

In addition to the proposed E3Net, we also used pDCCRN to obtain widely applicable insights. For pDCCRN, we used the original setting of [4] for the teacher model: we set the numbers of filters in the encoder and decoder 2D convolutional layers to [16, 32, 64, 128, 128, 128], the kernel sizes to 5 × 2, and the strides to 2 × 1. For the student model, we set the filter numbers to [8, 16, 32, 32, 64, 64] and retained the kernel sizes and strides. For both configurations, the LSTM hidden size was 128. We used the cosine annealing learning rate scheduler with a peak learning rate of $10^{-3}$. The batch size for the pDCCRN models was 8, with a gradient accumulation factor of 2. For STFT, we used a 32 ms window size and a 16 ms hop size.

For E3Net, we set the number of LSTM blocks to 4, the number of filters for the learnable encoder and decoder to 2048, the embedding and LSTM dimensions to 256, and the hidden dimension of the fully connected block to 1024. We used this baseline configuration for the teacher model, except that the number of LSTM blocks was 8 instead of 4. For the student, we set the number of LSTM blocks to 2. In addition, we conducted parameter sweeping experiments to show the impact of the model configuration on E3Net performance. Gradient accumulation was not used for the supervised and KD training since the memory footprint was much smaller with E3Net. We used the same learning rate scheduler as for pDCCRN, except with a peak learning rate of $10^{-4}$ instead of $10^{-3}$. For E3Net, the window and hop sizes were set to 20 ms and 10 ms, respectively, which helps further reduce the processing latency. For the MTL experiments, we loaded supervised models trained for 209K updates and resumed the training.
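As an illustration of how these configurations translate into runtime measurements, the sketch below instantiates the E3Net teacher and student from the earlier sketch and estimates the real-time factor (RTF) on a single CPU thread. The timing protocol (single thread, averaging over repeated runs) mirrors the description in the Table 1 caption, but the measurement code itself is an assumption, not the authors' benchmark script.

```python
# RTF estimation sketch; E3Net refers to the class from the earlier sketch.
import time
import torch

torch.set_num_threads(1)                      # single thread, no parallelization

SR = 16000
WIN, HOP = int(0.020 * SR), int(0.010 * SR)   # 20 ms window, 10 ms hop for E3Net

teacher = E3Net(n_filters=2048, dim=256, n_blocks=8, win=WIN, hop=HOP)
student = E3Net(n_filters=2048, dim=256, n_blocks=2, win=WIN, hop=HOP)

def rtf(model, seconds=10.0, runs=100):
    """Average processing time divided by the audio duration (lower is faster)."""
    wav = torch.randn(1, int(seconds * SR))
    emb = torch.randn(1, 256)
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(wav, emb)
        elapsed = (time.perf_counter() - start) / runs
    return elapsed / seconds

print("teacher RTF:", rtf(teacher), "student RTF:", rtf(student))
```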
The performance of the models was assessed with the metrics described in [4]. Specifically, we measured the speech quality, the transcription quality, the target speaker over-suppression (TSOS), and the real-time factor (RTF) of the models. For speech quality, we used DNSMOS P.835 [27], a neural-network-based mean opinion score (MOS) estimator that shows a good correlation with subjective ratings. We used the word error rate (WER) to measure the transcription quality. TSOS estimates the amount of removed target speaker segments; it is critical for the models to have very low TSOS values to avoid creating disruptions during usage in online meetings. Unlike in our previous work, TSOS was normalized to show the target-speaker-suppressed period in seconds per half hour.

Table 1: Experimental results for the VCTK dataset using three testing scenarios. TS1 includes the target speaker, an interfering speaker, and noise; TS2 includes the target speaker and noise; and TS3 includes only the target speaker. All conditions include reverberation. Lower is better for the WER and TSOS metrics; higher is better for the DNSMOS metric. M stands for millions. TSOS is reported in seconds. RTF is computed using a single thread (no parallelization) on an Intel(R) Xeon(R) W-2133 CPU @ 3.60 GHz, averaged over 100 runs. The $L_{KDonSup}$ and $L_{KDonUnlab}$ models are trained with pseudo labels obtained from simulated and unlabeled data, respectively.

Table 1 shows the main results, including those of KD and MTL. First, let us focus on the comparison between E3Net and pDCCRN. Our experiments found that E3Net with N = 4 struck a fair balance between RTF and quality; therefore, we chose this configuration as our baseline. Compared with the pDCCRN-baseline, the E3Net-baseline not only provided better results for all metrics and test scenarios but also ran 3× faster (RTF reduced from 0.195 to 0.065). Notably, the TSOS improvement was substantial for the TS1 (14.61 to 3.75 seconds) and TS2 (10.46 to 1.83 seconds) scenarios.

Figure 2: E3Net performance for (a) different numbers of encoder/decoder filters and (b) different numbers of LSTM blocks N.

Figure 2 shows the results of the parameter sweeping experiments for E3Net. These plots show the average DNSMOS and WER values computed over all three test sets. We investigated the impact of two hyperparameters while keeping the other parameters fixed: (1) the number of filters in the learnable encoder-decoder and (2) the number of LSTM blocks (N). Figure 2a shows that using more filters in the learnable encoder-decoder improved the results significantly. With a small number of filters (256 or 512), as with the STFT feature dimension, E3Net yielded poor results. Figure 2b shows that using more LSTM blocks also improved the results, especially WER. In addition, as implied by the E3Net result with N = 8 trained for an additional 100K iterations (shown by a square in the plot), we noticed that training E3Net significantly longer could yield substantial gains irrespective of the model configuration, which was not observed for STFT-based models like pDCCRN. While we limited the maximum number of iterations for a fair comparison with pDCCRN, this E3Net behavior deserves future investigation. From these results, we conclude that both over-parameterization of the encoder-decoder and using more LSTM blocks significantly improve the performance of E3Net.

Table 1 also shows that the smaller student models for both architectures resulted in significant degradation compared to the baseline models when trained with only the supervised loss. Models trained using KD ($L_{KDonSup}$) provided better TSOS values while also improving WER and DNSMOS. The networks trained with pseudo labels on the unlabeled data ($L_{KDonUnlab}$) provided improvements similar to those of KD on simulated data ($L_{KDonSup}$), which indicates that the performance gains can be largely attributed to mimicking the behavior of the teacher models rather than to the data used for KD. However, it is noteworthy that the E3Net-student model trained with $L_{KDonUnlab}$ provided a significantly better TSOS than its pDCCRN counterpart.

Combining MTL and KD ($L_{MTL+KDonUnlab}$) improved TSOS and WER significantly for both models, and this was achieved without degrading DNSMOS. In our prior work [4], it was observed that MTL slightly degraded DNSMOS. Combining KD and MTL might have helped regularize the training and prevent this degradation, although further investigation is needed to draw firm conclusions.
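As a side note on the TSOS metric used above, the normalization to seconds per half hour amounts to a simple rescaling by the file duration. The sketch below illustrates this conversion under that assumption; the detection of over-suppressed segments itself follows [4] and is not reproduced here.

```python
# Hedged sketch of the TSOS normalization described above (assumed unit conversion).
def normalized_tsos(suppressed_seconds: float, file_duration_seconds: float) -> float:
    """Target-speaker over-suppression in seconds per 30 minutes of audio."""
    return suppressed_seconds * (30 * 60) / file_duration_seconds

# Example: 3.4 s of over-suppression in a 27.5-minute test file.
print(normalized_tsos(3.4, 27.5 * 60))  # ~3.7 s per half hour
```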
In this work, we proposed E3Net, a novel, fast end-to-end personalized speech enhancement (PSE) architecture. In addition, we applied knowledge distillation (KD) to the E3Net and STFT-based pDCCRN models to compress them and reduce their computational cost while maintaining reasonable quality. Furthermore, we used large-scale unlabeled data to train student networks with pseudo labels produced by larger teacher models. Our results showed that E3Net outperforms pDCCRN while running 3× faster in terms of real-time factor (RTF). The results also showed that the KD recipes can compress the models further (2-4×) with only a slight quality degradation.

[1] Interspeech 2021 deep noise suppression challenge.
[2] ICASSP 2022 deep noise suppression challenge.
[3] Personalized PercepNet: Real-time, low-complexity target voice separation and enhancement.
[4] Personalized speech enhancement: New models and comprehensive evaluation.
[5] One model to enhance them all: Array geometry agnostic multi-channel personalized speech enhancement.
[6] Human listening and live captioning: Multi-task training for speech enhancement.
[7] Personalized PercepNet: Real-time, low-complexity target voice separation and enhancement.
[8] VoiceFilter-Lite: Streaming targeted voice separation for on-device speech recognition.
[9] DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter.
[10] Well-read students learn better: On the importance of pre-training compact models.
[11] Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks.
[12] Be your own teacher: Improve the performance of convolutional neural networks via self distillation.
[13] DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT.
[14] Shrinking Bigfoot: Reducing wav2vec 2.0 footprint.
[15] Model compression.
[16] Distilling the knowledge in a neural network.
[17] SNR-based teachers-student technique for speech enhancement.
[18] Implementation of low-latency electrolaryngeal speech enhancement based on multi-task CLDNN.
[19] Born again neural networks.
[20] Test-time adaptation toward personalized speech enhancement: Zero-shot learning with knowledge distillation.
[21] Can model compression improve NLP fairness.
[22] Ultra fast speech separation model with teacher student learning.
[23] Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation.
[24] Audio Set: An ontology and human-labeled dataset for audio events.
[25] Freesound Datasets: A platform for the creation of open audio datasets.
[26] CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit.
[27] DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors.