High Fidelity Speech Regeneration with Application to Speech Enhancement
Adam Polyak, Lior Wolf, Yossi Adi, Ori Kabeli, Yaniv Taigman
2021-01-31

Speech enhancement has seen great improvement in recent years, mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, prosody features, and identity. We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time and that utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train the model to augment the speech characteristics while preserving the identity of the source, using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that the method obtains valuable improvements over recent baselines.

Speech is the primary means of human communication. The importance of enhancing speech audio for better communication and collaboration has increased substantially amid the COVID-19 pandemic, due to the need for physical distancing. In the domain of speech enhancement, denoising and dereverberation methods have received much of the attention. The vast majority of these methods deal with environmental effects and train masking filters in order to remove unwanted sources, while assuming that the existing vocals are intelligible enough. However, in common settings where the recorded speech comes from a low-fidelity microphone or a poorly treated acoustic space, such methods struggle to reconstruct a clear, natural-sounding voice comparable to one recorded in a professional studio. Speech recognition and generation have seen remarkable progress in recent years, mainly due to advances in the robustness of neural-based Automatic Speech Recognition (ASR) and in neural vocoders. We utilize these advances and introduce a speech regeneration pipeline, in which speech is encoded at the semantic level through ASR, and a speech synthesizer is used to produce an output that is not only cleaner than the input but also scores better on perceptual metrics. Our main contributions are: (i) we present a novel generative model that utilizes ASR and identity information in order to recreate speech in high fidelity through comprehension; (ii) we present quantitative and subjective evaluations in the application of speech enhancement; and (iii) we provide engineering details on how to implement our method efficiently for use in real-time communication. Samples can be found at https://speech-regeneration.github.io.

Speech enhancement and speech dereverberation have been widely explored over the years [1, 2]. Due to the success of deep networks, there has been growing interest in deep learning-based methods for both speech enhancement and dereverberation, operating either in the frequency domain or directly on the raw waveform [3, 4, 5]. Deep generative models such as Generative Adversarial Networks (GANs) or WaveNet have also been suggested for the task [6, 7, 8]. A different way to improve speech quality is to use speech Bandwidth Extension (BWE) algorithms.
In BWE, one is interested in increasing the sampling rate of a given speech utterance. Early attempts were based on Gaussian Mixture Models and Hidden Markov Models [9, 10]; more recently, deep learning methods were suggested for the task [11, 12]. Despite the success of BWE methods, these were mainly applied to lower sampling rates (e.g., 8 kHz, 16 kHz). Recently, several studies suggested audio super-resolution algorithms for upsampling to higher sample rates (e.g., 22.5 kHz, 44.1 kHz). The authors of [13] introduce an end-to-end GAN-based system for speech bandwidth extension for use in downstream automatic speech recognition. The authors of [14] suggest using a WaveNet model to directly output a high-sample-rate speech signal, while the authors of [15] suggest using GANs to estimate the mel-spectrogram and then applying a vocoder to generate the enhanced waveform. Unlike previous BWE methods, which mainly use generative models for a sequence-to-sequence mapping in the waveform or spectrum domains, our model is conditioned on several high-level features. The proposed model utilizes a speech synthesis model [16], an ASR model [17], a pitch extraction model [18], and a loudness feature. Features extracted from an ASR network were utilized in [19] for the task of voice conversion. In addition to differences that arise from the different nature of the task, there are multiple technical differences between the approaches: the generator of [19] is autoregressive, the previous method did not include an identity network and modelled new speakers inaccurately, and, lastly, the loss terms used to optimize the models are different.

Our regeneration pipeline is an encoder-decoder network. First, the raw input speech is passed through a background removal method that masks out non-vocal audio. The output is then passed through several subnetworks to generate disentangled speech representations. The outputs of these subnetworks, together with the spectral features, condition the generative speech decoder network, which synthesizes the final output. The architecture is depicted in Figure 1.

Denote the domain of audio samples by X ⊂ R. The representation of a raw noisy speech signal is therefore a sequence of samples x = (x_1, ..., x_T), where x_t ∈ X for all 1 ≤ t ≤ T and T varies between input sequences. We denote by X* the set of all finite-length sequences over X. Consider a single-channel recording with additive noise: x = y * h + n, where y is the clean signal, h is the acoustic transfer function, n is a non-stationary additive noise at an unknown Signal-to-Noise Ratio (SNR), and * is the convolution operator. Given a training set of n examples, S = {(x_i, y_i)}_{i=1}^{n}, we first remove background noise using a pre-trained state-of-the-art speech enhancement model [5]. Denoting the denoised signal as x̃, we define the set of such samples as S̃ = {(x̃_i, y_i)}_{i=1}^{n}.
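As a concrete illustration of this formulation, the following is a minimal sketch that mixes a clean signal with an acoustic transfer function and additive noise at a chosen SNR. The toy signals, lengths, and the 5 dB target are illustrative assumptions, not values used in the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_noisy(y: np.ndarray, h: np.ndarray, n: np.ndarray, snr_db: float) -> np.ndarray:
    """Form x = y * h + n, with the noise rescaled to a target SNR (illustrative only)."""
    # Convolve the clean speech with the acoustic transfer function (room response).
    reverberant = fftconvolve(y, h, mode="full")[: len(y)]
    # Trim or tile the noise to the same length as the reverberant speech.
    n = np.resize(n, reverberant.shape)
    # Scale the noise so that the mixture reaches the requested SNR.
    speech_power = np.mean(reverberant ** 2) + 1e-12
    noise_power = np.mean(n ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * n

# Toy example at 16 kHz; the arrays stand in for real recordings.
sr = 16000
y = np.random.randn(sr)                                                  # clean utterance placeholder
h = np.exp(-np.linspace(0.0, 8.0, sr // 4)) * np.random.randn(sr // 4)   # toy room response
n = np.random.randn(sr)                                                  # background noise placeholder
x = simulate_noisy(y, h, n, snr_db=5.0)
```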
Our encoding extracts several different representations in order to capture content, prosody, and identity features separately. Specifically, given an input signal x̃, the content representation is extracted using a pre-trained ASR network, E_asr. In our implementation, we use the public implementation [20] of Wav2Letter [17]. The identity representation is obtained in the form of d-vectors [21], using an identity encoder E_id. The d-vector extractor is pre-trained on the VoxCeleb2 [22] dataset, achieving a 7.4% EER for speaker verification on the test split of the VoxCeleb1 dataset. The average activations of the penultimate layer form the speaker representation. Lastly, the prosody representation includes both the fundamental frequency and a loudness feature. The former, F_0(x̃), is extracted using YAAPT [18], which was found to be robust against input distortions. The loudness measurement of the signal, F_loud(x̃), is extracted using A-weighting of the signal frequencies. The F_0 and loudness features are upsampled and concatenated to form the prosody conditioning signal. To summarize, the encoding of a denoised signal is given as E(x̃) = (E_asr(x̃), E_id(x̃), F_0(x̃), F_loud(x̃)).

The generative decoder network is optimized using the least-squares GAN [23], where the decoder G and the discriminator D minimize the following objectives:

L_adv(D) = E[(D(y) − 1)^2] + E[D(x̂)^2],    L_adv(G) = E[(D(x̂) − 1)^2],    (1)

where x̂ = G(z, E(x̃)) is the audio sample synthesized from a random noise vector sampled from a normal distribution, z ∼ N(0, 1). The decoder G is additionally optimized with a spectral distance loss, computed at various FFT resolutions between the decoder output x̂ and the target clean signal y, as suggested in [24]. For a single FFT scale m, the loss component is defined as follows:

L_spec^(m)(x̂, y) = ‖ |S_m(y)| − |S_m(x̂)| ‖_F / ‖ |S_m(y)| ‖_F + (1/N) ‖ log|S_m(y)| − log|S_m(x̂)| ‖_1,    (2)

where ‖·‖_F and ‖·‖_1 denote the Frobenius and L_1 norms, S_m is the Short-Time Fourier Transform (STFT) at scale m, and N is the number of elements. The multi-scale spectral loss L_spec is obtained by summing Equation (2) over a set of FFT scales. Inspired by the recently suggested spectral energy distance formulation of the spectral loss [16], the spectral loss is applied as part of a compound loss term:

L_sed = L_spec(y, G(z_1, E(x̃))) + L_spec(y, G(z_2, E(x̃))) − L_spec(G(z_1, E(x̃)), G(z_2, E(x̃))),    (3)

where z_1, z_2 are two different normally distributed random noise vectors. Intuitively, the energy loss maximizes the discrepancy between outputs generated with different values of z. Replacing L_spec with L_sed improved the quality of the generated audio; specifically, in our experiments, we noticed the removal of metallic effects from the generated audio. Overall, the objective function of the decoder generator G is defined as:

L(G) = L_adv(G) + λ L_sed,    (4)

where λ is a trade-off parameter, set to 4 in our experiments.
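For illustration, the following is a minimal PyTorch sketch of the single-scale spectral loss of Equation (2), the multi-scale sum, and the energy-distance combination of Equation (3). The FFT sizes, hop lengths, and the equal weighting of the two spectral terms are assumptions of this sketch and may differ from the configuration used in the paper.

```python
import torch

def stft_mag(x: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    """Magnitude STFT |S(x)| for a (batch of) waveform(s)."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def spectral_loss(x_hat: torch.Tensor, y: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    """Single-scale loss of Eq. (2): spectral convergence + mean log-magnitude L1."""
    s_hat, s = stft_mag(x_hat, n_fft, hop), stft_mag(y, n_fft, hop)
    convergence = torch.norm(s - s_hat, p="fro") / torch.norm(s, p="fro")
    log_mag = torch.mean(torch.abs(torch.log(s) - torch.log(s_hat)))
    return convergence + log_mag

def multi_spectral_loss(x_hat, y, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum of Eq. (2) over several FFT resolutions (the resolutions here are illustrative)."""
    return sum(spectral_loss(x_hat, y, n_fft, hop) for n_fft, hop in resolutions)

def sed_loss(g1: torch.Tensor, g2: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. (3): g1 and g2 are two generator draws for the same conditioning, different z."""
    return (multi_spectral_loss(g1, y)
            + multi_spectral_loss(g2, y)
            - multi_spectral_loss(g1, g2))
```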
The preliminary enhancement of the input signal x produces the denoised signal x̃, sampled at 16 kHz. The decoder then receives as input the concatenated and upsampled conditioning signal E(x̃), sampled at 250 Hz. G is conditioned on the concatenation of the noise vector z and the speaker identity E_id(x̃), while the input to the model is E_asr(x̃) concatenated with F_0(x̃) and F_loud(x̃). Finally, the proposed model outputs a raw audio signal sampled at 24 kHz.

The architecture of the decoder G is based on the GAN-TTS [25] architecture and consists of seven GBlocks. A GBlock contains a sequence of two residual blocks, each with two convolutional layers. The convolutional layers employ a kernel size of 3 and dilation factors of 1, 2, 4, and 8 to increase the network's receptive field. Before each convolutional layer, the input is passed through Conditional Batch Normalization [26], conditioned on a linear projection of the noise vector and speaker identity, and then a ReLU activation. The final five GBlocks upsample the input signal by factors of 2, 2, 2, 3, and 4, respectively, to reach the target sample rate. Figure 1 includes the hyperparameters of each GBlock.

While GAN-TTS [25] was originally trained with both conditional and unconditional discriminators operating at multiple scales, we found in preliminary experiments that the proposed method can generate high-quality audio using a single unconditional discriminator, D. Moreover, our architecture for D is much simpler than the one proposed in previous work. It consists of a sequence of seven convolutional layers, each followed by a leaky ReLU activation with a leakiness factor of 0.2, except for the final layer. The number of filters in each layer is 16, 64, 256, 1024, 1024, 1024, and 1, with kernel sizes of 15, 41, 41, 41, 41, 5, and 3. Finally, to stabilize the adversarial training, both the discriminator and the decoder employ spectral normalization [27].
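To make the decoder block concrete, here is a simplified PyTorch sketch of a single GBlock: conditional batch normalization driven by a linear projection of the conditioning vector, kernel-size-3 dilated convolutions with dilations 1, 2, 4, and 8, ReLU activations, and optional temporal upsampling. The channel widths, the use of a single residual connection, and the nearest-neighbor upsampling are simplifying assumptions of this sketch, not the authors' implementation; Figure 1 of the paper specifies the actual hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalBatchNorm1d(nn.Module):
    """BatchNorm whose scale and shift are predicted from a conditioning vector (z, identity)."""
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(channels, affine=False)
        self.to_gamma = nn.Linear(cond_dim, channels)
        self.to_beta = nn.Linear(cond_dim, channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); cond: (batch, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(-1)
        beta = self.to_beta(cond).unsqueeze(-1)
        return self.bn(x) * (1.0 + gamma) + beta

class GBlock(nn.Module):
    """Dilated convolutional block (simplified: one residual connection instead of two)."""
    def __init__(self, in_ch: int, out_ch: int, cond_dim: int, upsample: int = 1):
        super().__init__()
        self.upsample = upsample
        dilations = (1, 2, 4, 8)
        channels = [in_ch, out_ch, out_ch, out_ch, out_ch]
        self.norms = nn.ModuleList(
            [ConditionalBatchNorm1d(channels[i], cond_dim) for i in range(4)])
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels[i], channels[i + 1], kernel_size=3, dilation=d, padding=d)
             for i, d in enumerate(dilations)])
        self.skip = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Skip path: upsample in time if needed, then match channels with a 1x1 convolution.
        skip = x
        if self.upsample > 1:
            skip = F.interpolate(skip, scale_factor=self.upsample, mode="nearest")
        skip = self.skip(skip)
        # Main path: (conditional BN -> ReLU -> conv) x 4, upsampling before the first conv.
        h = x
        for i, (norm, conv) in enumerate(zip(self.norms, self.convs)):
            h = F.relu(norm(h, cond))
            if i == 0 and self.upsample > 1:
                h = F.interpolate(h, scale_factor=self.upsample, mode="nearest")
            h = conv(h)
        return h + skip

# Toy usage: 100 conditioning frames upsampled by 2 in time (all shapes are illustrative).
block = GBlock(in_ch=768, out_ch=384, cond_dim=192, upsample=2)
out = block(torch.randn(4, 768, 100), torch.randn(4, 192))  # -> (4, 384, 200)
```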
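The unconditional discriminator is specified almost entirely by the text above, so a short sketch follows. The strides and channel groupings, which the paper does not state, are assumptions borrowed from similar waveform discriminators; the class is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch: int, out_ch: int, kernel: int, stride: int = 1, groups: int = 1) -> nn.Module:
    """Spectrally normalized 1-D convolution with 'same'-style padding."""
    return spectral_norm(nn.Conv1d(in_ch, out_ch, kernel, stride=stride,
                                   padding=kernel // 2, groups=groups))

class WaveDiscriminator(nn.Module):
    """Seven conv layers with 16, 64, 256, 1024, 1024, 1024, 1 filters and kernel sizes
    15, 41, 41, 41, 41, 5, 3; a leaky ReLU (0.2) follows every layer except the last."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            sn_conv(1, 16, 15),
            sn_conv(16, 64, 41, stride=4, groups=4),       # strides/groups are assumptions
            sn_conv(64, 256, 41, stride=4, groups=16),
            sn_conv(256, 1024, 41, stride=4, groups=64),
            sn_conv(1024, 1024, 41, stride=4, groups=256),
            sn_conv(1024, 1024, 5),
            sn_conv(1024, 1, 3),
        ])
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time) raw waveform; returns a sequence of real/fake scores.
        for layer in self.layers[:-1]:
            x = self.act(layer(x))
        return self.layers[-1](x)

scores = WaveDiscriminator()(torch.randn(2, 1, 24000))  # one second of 24 kHz audio
```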
We present a series of experiments evaluating the proposed method using both objective and subjective metrics. We start by presenting the datasets used for evaluation. Next, a comparison to several competitive baselines is presented, and we conclude with a discussion of the computational efficiency of the proposed method.

We evaluated our method for speech regeneration on two different datasets. The first one is the Device and Produced Speech (DAPS) [28] dataset. DAPS comprises twelve different recordings of each sample, spanning two devices and seven acoustic environments. For a fair comparison, we used the same test partition employed by HiFi-GAN [4]. For this benchmark, we optimized the generator on the public VCTK dataset [29]. The second dataset is a standard benchmark in the speech enhancement literature [30]. The dataset contains clean samples and artificially generated noisy counterparts: the clean samples are based on the VCTK dataset [29], while the noises are sampled from [31]. The dataset is split into predefined train, validation, and test sets. In both settings, we resample the audio to 24 kHz.

Evaluation Metrics. We evaluated our method using both objective and subjective metrics. For the subjective metric, we used MUSHRA [32]: we ask human raters to compare samples created from the same test signal. The clean sample is presented to the raters before the processed files and is labeled with a 5.0 score. For the objective metrics, we used two distances proposed in [25]: (i) Fréchet Deep Speech Distance (FDSD), a distance measure calculated between the activations of randomly sampled sets of output and target signals using the DeepSpeech2 [33] ASR model; and (ii) Conditional Fréchet Deep Speech Distance (cFDSD), which is similar to FDSD but computed between each generated output and its matched clean target. Unlike cFDSD, where the generated output should match the target signal, in FDSD the random sets do not have to match the target utterances. Note that the Fréchet distances are computed using a different ASR network than the one used for conditioning the proposed model. We do not employ the PESQ metric [34], which was designed to quantify degradation due to codecs and transmission-channel errors. Its validity for other tasks is questionable, e.g., it shows a low correlation with MOS [35]. Moreover, it is defined for narrowband and wideband signals only.
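The Fréchet distances above follow the standard recipe: fit a Gaussian to the ASR activations of each set of signals and compute the Fréchet distance between the two Gaussians. The sketch below shows that final computation only; extracting the DeepSpeech2 activations is omitted, and the per-utterance pooling implied by the input shape is an assumption of this sketch.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sets of ASR activations.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim), e.g. DeepSpeech2
    activations pooled per utterance (the pooling strategy is an assumption here).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; discard small imaginary residue.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```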
Table 1 shows the performance of our method on the DAPS dataset. We compared our regeneration method to several competitive baselines in the domain of speech enhancement. As can be seen, our method outperforms the baselines in all metrics: the regenerated speech has substantially lower perceptual distances and scored convincingly higher in the subjective evaluation. Results for the noisy VCTK dataset are presented in Table 2. Our method is superior to the baseline models on the noisy VCTK as well; however, the improvements are more modest than on DAPS. This is because VCTK is a less challenging dataset for enhancement tasks: unlike DAPS, whose test split was recorded in real noisy and reverberant environments, the noisy VCTK samples were generated artificially by adding noise files to the clean recordings.

The accessibility of speech enhancement greatly relies on its efficiency and its ability to be applied while streaming. We have efficiently implemented a server-based module, using PyTorch JIT, that is able to fetch speech audio and regenerate it with a real-time factor of 0.94. All modules required to compute the conditioning vector run in parallel, either on an NVIDIA V100 GPU or on an Intel Xeon E5 CPU. The final pipeline currently has a latency of about 100 milliseconds (ms), which accounts for the receptive field, future context, and computation time of every module. All modules operate with a receptive field of up to 40 ms. The ASR network and the denoiser use a future context of 20 ms and 32 ms, respectively. The rest of the modules require no future context and are trained to be fully causal. Such latency is suitable for Voice-over-IP applications.

We present an enhancement method that goes beyond the limitations of a given speech signal by extracting the components that are essential to communication and recreating the audio signal. The method takes advantage of recent advances in ASR technology, as well as in pitch detection and in identity-mimicking TTS and voice conversion technologies. The recreation approach can also be applied when one of the components is manipulated, for example, when editing the content, modifying the pitch in post-production, or replacing the identity. An interesting application is the creation of super-intelligible speech, which would enhance the audience's perception and could be used, e.g., to improve educational presentations.

[1] Noise reduction in speech processing
[2] Speech denoising and dereverberation using probabilistic models
[3] Wave-U-Net: A multi-scale neural network for end-to-end audio source separation
[4] HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks
[5] Real time speech enhancement in the waveform domain
[6] Time-frequency masking-based speech enhancement using generative adversarial network
[7] A WaveNet for speech denoising
[8] MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement
[9] Narrowband to wideband conversion of speech using GMM based transformation
[10] HMM-based frequency bandwidth extension for speech enhancement using line spectral frequencies
[11] Artificial bandwidth extension using a conditional generative adversarial network with discriminative training
[12] Time-domain neural network approach for speech bandwidth extension
[13] Speech audio super-resolution for speech recognition
[14] Speech super-resolution using parallel WaveNet
[15] High-quality speech synthesis using super-resolution mel-spectrogram
[16] A spectral energy distance for parallel speech synthesis
[17] Wav2Letter: An end-to-end ConvNet-based speech recognition system
[18] Yet another algorithm for pitch tracking
[19] TTS skins: Speaker conversion via ASR
[20] Jasper: An end-to-end convolutional neural acoustic model
[21] Generalized end-to-end loss for speaker verification
[22] VoxCeleb2: Deep speaker recognition
[23] Least squares generative adversarial networks
[24] Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
[25] High fidelity speech synthesis with adversarial networks
[26] A learned representation for artistic style
[27] Spectral normalization for generative adversarial networks
[28] Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech? A dataset, insights, and challenges
[29] CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit
[30] Noisy speech database for training speech enhancement algorithms and TTS models
[31] DEMAND: A collection of multi-channel recordings of acoustic noise in diverse environments
[32] CROWDMOS: An approach for crowdsourcing MOS studies
[33] Deep Speech 2: End-to-end speech recognition in English and Mandarin
[34] Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs
[35] A scalable noisy speech dataset and online subjective test framework