key: cord-0599631-0voy9qt1
authors: Zhang, Kanghao; He, Shulin; Li, Hao; Zhang, Xueliang
title: DBNet: A Dual-branch Network Architecture Processing on Spectrum and Waveform for Single-channel Speech Enhancement
date: 2021-05-06

In real acoustic environments, speech enhancement is the arduous task of improving the quality and intelligibility of speech corrupted by background noise and reverberation. Over the past years, deep learning has shown great potential for speech enhancement. In this paper, we propose a novel real-time framework called DBNet, which is a dual-branch structure with alternate interconnection. Each branch incorporates an encoder-decoder architecture with skip connections. The two branches are responsible for spectrum and waveform modeling, respectively. A bridge layer is adopted to exchange information between the two branches. Systematic evaluation and comparison show that the proposed system substantially outperforms related algorithms under very challenging environments. In the INTERSPEECH 2021 Deep Noise Suppression (DNS) challenge, the proposed system ranks in the top 8 of real-time track 1 in terms of the Mean Opinion Score (MOS) of the ITU-T P.835 framework.

With the COVID-19 pandemic, many people are working online, and the demand for reliable real-time speech enhancement algorithms has increased sharply. During these times, we need to ensure clear call quality and effective, low-delay communication and collaboration with others. Our communication is often disturbed by background noise, such as washing machines and passing trucks, as well as by strong reverberation. These degrade the efficiency of our work and communication. Recently, many researchers from academia and industry have made significant contributions to monaural speech enhancement. However, due to the diversity of noise types in reality, even state-of-the-art algorithms cannot handle challenging environments well.

With the development of deep learning, many studies regard speech enhancement as a supervised learning problem [1] [2] [3] and have obtained excellent performance. Usually the input to the neural network is either the time-domain signal or its short-time Fourier transform (STFT). In [2] [4], the authors study the noise reduction problem in the STFT domain. In [1] [5], the framed time-domain signal is directly fed to the neural network. Both the frequency domain and the time domain have their own advantages. The STFT representation is more in line with human auditory perception, and the characteristics of speech are more explicit in it. The time-domain method does not alter the signal with an STFT and thus avoids the well-known invalid-STFT problem. Lim et al. [6] introduced a time-frequency network to jointly optimize the time and frequency domains of a signal for the task of audio super-resolution. They show that combining these two domains can boost audio super-resolution performance and obtain state-of-the-art results both quantitatively and qualitatively.

As we know, an impulse-like noise, as shown in Fig. 1(a) and 1(b), is easy to eliminate in the time domain: only a few samples need to be removed. In the frequency domain, however, the impulse-like noise pollutes the entire frequency band and is difficult to eliminate with a frequency-domain speech enhancement method.
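To make this concrete, here is a minimal NumPy sketch (toy signals of our own, not from the paper) showing that a single corrupted sample spreads its energy over every frequency bin of a frame, whereas a narrowband tone, the contrasting case discussed next, concentrates in a single bin:

```python
import numpy as np

fs = 16000                       # sampling rate (Hz)
frame = np.zeros(320)            # one 20 ms frame (320 samples at 16 kHz)
frame[100] = 1.0                 # impulse-like noise: a single corrupted sample

spec = np.abs(np.fft.rfft(frame))
print(spec.min(), spec.max())    # both ~1.0: energy is spread over every frequency bin

# For contrast, a narrowband tone occupies one bin but every time sample.
t = np.arange(320) / fs
tone = np.sin(2 * np.pi * 1000 * t)
tone_spec = np.abs(np.fft.rfft(tone))
print(np.argmax(tone_spec))      # energy concentrated near the 1 kHz bin
```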
In contrast, for a narrowband-like noise, as shown in Fig. 1(c) and 1(d), the noise is confined to a narrow band, and a frequency-domain method handles it well. In the time domain, the noise and the speech are coupled together and are hard to decouple with a time-domain enhancement method. In this paper, in order to reduce noise better, we propose a novel speech enhancement algorithm called DBNet, which combines the time domain and the frequency domain. DBNet is a dual-branch structure with alternate interconnection. Each branch incorporates an encoder-decoder architecture with skip connections. The two branches are responsible for spectrum and waveform modeling, respectively. A bridge layer is adopted to exchange information between the two branches. Experiments show that the proposed method achieves excellent results on the WSJ0 SI-84 [7] and DNS Challenge [8] datasets.

The rest of the paper is organized as follows. The structure of the proposed DBNet is described in Section 2. Section 3 describes the experimental setup. The experimental results are revealed in Section 4. We conclude this paper in Section 5.

The overall structure of the proposed method is shown in Fig. 2. In the following sections, three important parts, the shifted real spectrum (SRS) module, the grouped LSTM (GLSTM), and the bridge layer, are introduced one by one.

Recently, Liu et al. [9] proposed a new separation target based on the SRS time-frequency representation, which demonstrated the superiority of SRS over the STFT. First, SRS takes the phase into consideration, which improves speech intelligibility and quality. Second, SRS is a spectral representation in the real field rather than the complex field: all elements of the input are real numbers, which reduces the modeling difficulty and is convenient for the information-interaction module of our model. Based on these two advantages, SRS is adopted as our frequency-domain input in this paper.

Dauphin et al. [10] improved on masked convolution for image convolution modeling and proposed the gated convolution (GCNN), which is described as

Y = (X * W_1 + b_1) ⊙ σ(X * W_2 + b_2),

where the W and b terms represent kernels and biases, respectively, * and ⊙ denote convolution and element-wise multiplication, respectively, and σ represents a nonlinear activation function. The GCNN can reduce the vanishing-gradient problem in deep architectures by providing a linear path for the gradients, so we replace the convolutions in the original CRN with gated convolutions. A diagram of the gated convolution is shown in Fig. 3.

Model efficiency is important, and many application scenarios impose strict requirements on processing speed and memory usage. However, due to the introduction of the dual-branch structure, the computation and memory occupied by the model are much higher than those of a standard convolutional recurrent network. Gao et al. [11] proposed a grouped recurrent neural network (RNN) strategy, which reduces model complexity while maintaining performance. The process of a group RNN is shown in Fig. 4. In this paper, the grouped LSTM contains two RNN layers, and each layer has two LSTMs that learn the features within each group. Between the two layers, a frame-level rearrangement is used to establish the inter-group relationships among features, which guarantees the utilization of inter-group correlations to a certain extent.
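As a rough illustration of the gated convolution defined above, the following PyTorch sketch (a minimal re-implementation for illustration, not the authors' code; σ is taken to be a sigmoid, and the layer sizes and input shape are placeholders) computes the gated output:

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Minimal gated convolution: a linear path modulated by a sigmoid gate."""

    def __init__(self, in_ch, out_ch, kernel_size=(1, 3), stride=(1, 2)):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride)  # X * W_1 + b_1
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride)     # X * W_2 + b_2

    def forward(self, x):
        # Element-wise product of the linear path and the gate (sigma = sigmoid here).
        return self.feature(x) * torch.sigmoid(self.gate(x))

# Toy input shaped [batch, channels, time frames, features].
x = torch.randn(2, 64, 100, 161)
y = GatedConv2d(64, 64)(x)
print(y.shape)  # torch.Size([2, 64, 100, 80])
```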
The bridge layer is a linear unit, which is responsible for converting information from one branch to the other. Specifically, the bridge layer consists of vectors with the same length as the frame length. We take the real part of the fast Fourier transform (FFT) as the initialization of these vectors, to fit the case in which SRS is used as the frequency-domain representation.

The encoder has six layers, each consisting of a bridge layer and a gated convolution followed by batch normalization and an ELU nonlinearity. Note that a module is added before the input of the frequency branch to calculate the real-valued frequency-domain representation. The input size of the model is [batch size, 1, seq len, features]. The first layer of the encoder expands the input channels from 1 to 64. After that, each remaining encoder layer performs the following operations: the bridge layer first translates the features from the other branch, then concatenates them with the output of the previous layer along the channel axis, and finally passes the features to a gated convolution. The kernel size and stride of the convolution are set to (1, 3) and (1, 2), respectively.

Similarly, the decoder also has six layers. In addition to the features from the other domain, each layer also receives the skip connection from the corresponding encoder layer and concatenates them with the output of the previous layer along the channel axis. The decoder uses gated deconvolutions to double the feature dimension layer by layer and reconstruct the signal to its original size. The last layer of the decoder reduces the signal to one channel. Finally, the output is converted into speech through the overlap-and-add operation. Note that the speech from the frequency branch is regarded as the final result.

In the early experiments, we used a loss based on the STFT magnitude, which was proposed in [1] and can be described as

L_mag(S, Ŝ) = 1/(T·F) · Σ_{t=1}^{T} Σ_{f=1}^{F} | sqrt(S_r(t, f)^2 + S_i(t, f)^2) − sqrt(Ŝ_r(t, f)^2 + Ŝ_i(t, f)^2) |,

where T and F represent the number of time frames and frequency dimensions, S and Ŝ denote the STFTs of s and ŝ, respectively, and S_r and S_i represent the real and imaginary parts of S, respectively. Note that the output of the network contains two enhanced utterances, one from the time branch and the other from the frequency branch, and they are optimized independently. So the total loss is defined as

L_total = L_mag(S, Ŝ^(T)) + L_mag(S, Ŝ^(F)),

where Ŝ^(T) and Ŝ^(F) denote the STFTs of the enhanced utterances from the time branch and the frequency branch, respectively. However, we found that the magnitude loss introduced a large number of unknown artifacts. Although they do not affect the objective evaluation scores, they lead to poor auditory perception. Therefore, in the DNS Challenge, the magnitude loss is replaced with the phase-constrained magnitude loss proposed in [12], which achieved good subjective evaluation scores in the competition.
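A minimal PyTorch sketch of this magnitude-based objective (our own illustration, not the authors' code; the mean-absolute-error reduction and the 320-sample frame with 160-sample hop are assumptions):

```python
import torch

def stft_magnitude(x, frame_len=320, hop=160):
    """|STFT| computed from the real and imaginary parts, sqrt(S_r^2 + S_i^2)."""
    window = torch.hamming_window(frame_len)
    spec = torch.stft(x, n_fft=frame_len, hop_length=hop, win_length=frame_len,
                      window=window, return_complex=True)
    return spec.abs()

def magnitude_loss(clean, enhanced):
    """Mean absolute difference of STFT magnitudes over all T-F units."""
    return torch.mean(torch.abs(stft_magnitude(clean) - stft_magnitude(enhanced)))

def total_loss(clean, enhanced_time, enhanced_freq):
    """The two branch outputs are optimized independently, so their losses are summed."""
    return magnitude_loss(clean, enhanced_time) + magnitude_loss(clean, enhanced_freq)

s = torch.randn(1, 16000)  # 1 s of "clean" speech at 16 kHz (toy data)
loss = total_loss(s, torch.randn(1, 16000), torch.randn(1, 16000))
print(loss.item())
```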
In this study, we evaluated the performance of our proposed model on the WSJ0 SI-84 dataset [7], which includes 7138 utterances from 83 speakers (42 males and 41 females). We used the utterances of 77 speakers for training and the rest for testing. We used 10000 non-speech sounds from a sound effect library (available at www.sound-ideas.com) [13] and generated 320000 and 3000 utterances for training and validation, respectively, at SNRs uniformly sampled from {-5 dB, -4 dB, -3 dB, -2 dB, -1 dB, 0 dB}. For the test set, two noises (babble and cafeteria) from an Auditec CD (available at http://www.auditec.com) are used to generate 300 mixtures at each SNR of -5 dB, 0 dB, and 5 dB.

We compared the proposed dual-branch network with three baselines, namely CRN [14], GCRN [2], and AECNN [1], which are given as follows:
• CRN: a causal convolutional recurrent network in the T-F domain. The network uses 5 convolution layers as the encoder and 5 deconvolution layers as the decoder. Two LSTM layers are used for sequence modeling. This network receives the magnitude spectrum as input. The number of channels is decreased, and the number of parameters is 4.5M.
• GCRN: a causal gated convolutional recurrent network for complex spectral mapping. The structure is similar to CRN, except that GCRN has two decoders to model the real and imaginary parts, respectively. The input of the network is the complex spectrum. We kept the best configuration in [2], and the number of parameters is 9.76M.
• AECNN: an autoencoder-based fully convolutional neural network in the time domain. The raw waveform is chunked into frames with a large time frame size (1.024 s). We kept the best configuration in [1]. The number of parameters is 18M.
• DBNet: the structures of the two branches are the same. Six (de)convolution blocks are used for the encoder and decoder. The number of channels is 64 for each layer. A kernel size of (1, 3) and a stride of (1, 2) are used for the time and frequency axes. The inputs are time frames and the SRS for the time branch and the frequency branch, respectively. The number of parameters is 2.9M.

All utterances are sampled at 16 kHz. The frames are extracted using a rectangular window and a Hamming window of size 20 ms for the time domain and the frequency domain, respectively. The overlap is 10 ms. The models are trained using the Adam optimizer [15] with a learning rate of 0.001, and the batch size is set to 32 at the utterance level. Note that a random 7-second segment is extracted from an utterance if it is longer than 7 seconds. Shorter utterances are zero-padded to match the size of the longest utterance in the batch.

The performance is evaluated with two objective metrics: short-time objective intelligibility (STOI) [16] and perceptual evaluation of speech quality (PESQ) [17]. STOI values typically range from 0 to 1 and can be roughly interpreted as percent correct. PESQ values range from -0.5 to 4.5. For both metrics, a higher value indicates better performance.

In conclusion, the proposed dual-branch model outperforms both AECNN, which is a time-domain model, and GCRN, which is a frequency-domain model for complex spectrogram mapping, indicating that alternate interconnection of the information in the two domains can significantly improve the performance of the model and improve parameter utilization.

The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. To evaluate the performance of the proposed method in more complicated and realistic acoustic scenarios, the proposed model was trained with the DNS-Challenge wide-band dataset, which contains more complex acoustic scenarios including reverberation, singing, emotions, and non-English speech. The settings for generating the training set are described as follows. The SNRs of the training mixtures vary from -5 dB to 25 dB. Around 30% of the utterances are convolved with the provided synthetic and real room impulse responses (RIRs) before being mixed with different noise signals, and we process speech, noise, and reverberation using the spectral augmentation filters proposed in [18]. Moreover, there is a 5% chance that multiple compound noises appear in one utterance. To meet the requirements of the DNS Challenge, the number of channels is appropriately decreased. In addition, the kernel size of the convolution blocks is set to (2, 3).
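As a rough illustration of how a mixture at a target SNR can be generated (a generic sketch of common practice, not the authors' data-preparation pipeline; it omits the RIR convolution and spectral augmentation steps described above):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-8):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add."""
    noise = noise[:len(speech)]                        # trim noise to the speech length
    speech_power = np.mean(speech ** 2) + eps
    noise_power = np.mean(noise ** 2) + eps
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy example: draw an SNR uniformly from -5 to 25 dB, as in the DNS training setup.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                    # placeholder clean utterance
noise = rng.standard_normal(16000)                     # placeholder noise clip
mixture = mix_at_snr(speech, noise, rng.uniform(-5, 25))
```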
We use DNSMOS [19], a reliable non-intrusive objective speech quality metric, as our evaluation metric at the training stage, and take the Mean Opinion Score (MOS) of the ITU-T P.835 framework as the final result. The results of the evaluation using the ITU-T P.835 criterion [20], which is provided by the organizer, are shown in Table 2. The proposed model outperforms the baseline (NSnet2) by 0.12 in overall DMOS. We then calculated the processing latency of our algorithm according to the competition requirements. In this model, the frame size is T = 20 ms and the overlap between consecutive frames is Ts = 10 ms, so the algorithmic latency is T + Ts = 30 ms, which meets the requirements. We also evaluated the memory access cost (MAC), and the result is 2.847G per second.

In this study, we proposed a novel single-channel speech enhancement system, which consists of two denoising branches in the time domain and the frequency domain. The results show that the proposed model outperforms other advanced models in terms of objective intelligibility and quality scores. We attribute this to the fact that the information in the time domain and the frequency domain is not exactly the same. By the properties of the Fourier transform, convolution in the time domain is equivalent to point-wise multiplication in the frequency domain. Operations in the time domain tend to focus more on local information, while operations in the frequency domain focus more on the relationship between frames. A reasonable combination of the two can achieve better performance. Moreover, the proposed model has fewer parameters, which indicates that the dual-branch structure improves parameter utilization. Subjective results showed that the proposed system ranks in the top 8 in terms of the Mean Opinion Score (MOS) of ITU-T P.835 for real-time track 1 of the INTERSPEECH 2021 Deep Noise Suppression (DNS) challenge.

The authors would like to thank Yongjie Yan, Tailong Zhang, and Pengjie Shen for their valuable comments. This research was partly supported by the National Natural Science Foundation of China (No. 61876214).
[1] A new framework for CNN based speech enhancement in the time domain
[2] Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement
[3] Supervised speech separation based on deep learning: An overview
[4] Real-time monaural speech enhancement with short-time discrete cosine transform
[5] Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain
[6] Time-frequency networks for audio super-resolution
[7] The design for the Wall Street Journal-based CSR corpus
[8] INTERSPEECH 2021 deep noise suppression challenge
[9] Using shifted real spectrum mask as training target for supervised speech separation
[10] Language modeling with gated convolutional networks
[11] Efficient sequence learning with group recurrent networks
[12] Dense CNN with self-attention for time-domain speech enhancement
[13] Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises
[14] A convolutional recurrent neural network for real-time speech enhancement
[15] Adam: A method for stochastic optimization
[16] An algorithm for intelligibility prediction of time-frequency weighted noisy speech
[17] Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment: Part I: Time-delay compensation
[18] A hybrid DSP/deep learning approach to real-time full-band speech enhancement
[19] DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors
[20] A crowdsourcing extension of the ITU-T recommendation P.835 with validation