title: Taylor, Can You Hear Me Now? A Taylor-Unfolding Framework for Monaural Speech Enhancement
authors: Li, Andong; You, Shan; Yu, Guochen; Zheng, Chengshi; Li, Xiaodong
date: 2022-04-30

While deep learning techniques have promoted the rapid development of the speech enhancement (SE) community, most schemes only pursue performance in a black-box manner and lack adequate model interpretability. Inspired by Taylor's approximation theory, we propose an interpretable decoupling-style SE framework, which disentangles complex spectrum recovery into two separate optimization problems, i.e., magnitude and complex residual estimation. Specifically, serving as the 0th-order term in Taylor's series, a filter network is delicately devised to suppress the noise component in the magnitude domain only and obtain a coarse spectrum. To refine the phase distribution, we estimate the sparse complex residual, which is defined as the difference between the target and coarse spectra and measures the phase gap. In this study, we formulate the residual component as the combination of various high-order Taylor terms and propose a lightweight trainable module to replace the complicated derivative operator between adjacent terms. Finally, following Taylor's formula, we can reconstruct the target spectrum by the superimposition of the 0th-order and high-order terms. Experimental results on two benchmark datasets show that our framework achieves state-of-the-art performance over previous competing baselines in various evaluation metrics. The source code is available at github.com/Andong-Lispeech/TaylorSENet.

As a consequence of the COVID-19 pandemic, people and organizations have become increasingly dependent on remote communication techniques to stay connected and conduct business routines. It is thus imperative to ensure high-quality speech when background noise and room reverberation exist. As a resolution, monaural speech enhancement (SE) aims to extract the target speech from the noisy mixture when only a single-channel recording is available. Recently, the advent of deep neural networks (DNNs) has significantly promoted the performance of SE algorithms, which can be roughly categorized into two streams, namely the time domain [Pascual et al., 2017; Luo and Mesgarani, 2019] and the time-frequency (T-F) domain [Yin et al., 2020; Tang et al., 2021]. As speech and noise patterns tend to be more distinguishable after the short-time Fourier transform (STFT), the latter still dominates the mainstream. Previous works simply estimated the magnitude of the spectrum and left the noisy phase unaltered, which inevitably incurred heavy performance restrictions. To address this problem, it is necessary to consider the joint optimization of magnitude and phase. In [Yin et al., 2020], a dual-branch network was designed to separately model the magnitude and the cosine representations of the phase. However, due to the nonstructural characteristic of the phase, the phase branch tends to be sensitive to nonlinear operations. Another typical strategy is to couple the magnitude and phase into Cartesian coordinates and construct real and imaginary (RI) pairs. By virtue of complex spectral mapping (CSM) [Tan and Wang, 2020] or the complex ratio mask (CRM) [Williamson et al., 2015], both the magnitude and phase can be implicitly recovered.
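For readers less familiar with CRM-based enhancement, the sketch below shows how a complex ratio mask recovers magnitude and phase jointly through a complex multiplication in Cartesian coordinates. It is a minimal, generic sketch of the idea behind [Williamson et al., 2015]; all tensor names and shapes are chosen for illustration only.

```python
import torch

def apply_crm(x_real, x_imag, m_real, m_imag):
    """Apply a complex ratio mask M to a noisy spectrum X via the complex
    product M * X, so magnitude and phase are enhanced implicitly."""
    s_real = m_real * x_real - m_imag * x_imag  # Re{M * X}
    s_imag = m_real * x_imag + m_imag * x_real  # Im{M * X}
    return s_real, s_imag

# Illustrative shapes: (batch, freq_bins, frames).
x_r, x_i = torch.randn(2, 1, 161, 100).unbind(0)               # noisy RI pair
m_r, m_i = torch.tanh(torch.randn(2, 1, 161, 100)).unbind(0)   # bounded mask, a stand-in for a network output
s_r, s_i = apply_crm(x_r, x_i, m_r, m_i)
```

Because both mask components act on both RI parts, magnitude and phase are tied into a single optimization target, which is exactly the entanglement discussed next.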
However, such target entanglement causes the compensation effect, i.e., magnitude distortion is inevitably sacrificed to compensate for phase prediction accuracy, especially under low signal-to-noise ratios (SNRs). More recently, a decoupling mapping procedure [Li et al., 2021] was proposed to decouple the complex spectrum estimation into two separate steps. In the first step, only the magnitude prior is estimated, which is coupled with the noisy phase to obtain a coarse complex spectrum. Afterward, with residual learning, another network is utilized to estimate the residual component, which has a sparse distribution in the complex domain and measures the gap between the noisy and target phases. Different from previous literature, this endows the magnitude and phase with separate optimization spaces and therefore alleviates the compensation effect. Figure 1(c)-(d) visualizes the coarse and residual spectra as an example.

In this paper, we rethink spectrum decoupling and formulate it as an approximation problem w.r.t. the input neighborhood space. In other words, if we could access the complex residual and repair the phase in advance, we could theoretically approximate the clean spectrum perfectly from magnitude estimation. This process can be expressed as S = F(X + δ), where {S, X, δ} denote the clean, mixture, and residual components, respectively, and F is the magnitude estimation function. Based on this new formulation, it is intuitive to leverage Taylor's approximation w.r.t. X to estimate the function representation at X + δ. However, in practical implementations, the residual prior is usually inaccessible. In this regard, we propose a new framework called TaylorSENet to explicitly model Taylor's approximation by formulating the main term and the derivative terms as learnable modules. Concretely, the 0th-order module considers the magnitude of the spectrum, while the high-order modules are concerned with complex residual estimation. In this way, the estimate of the complex spectrum can be obtained by the superimposition of the 0th-order non-derivative term and multiple high-order derivative terms. Different from previous SE models operating in a black-box manner, the proposed framework provides each module with better interpretability. To the best of our knowledge, this is the first time that complex spectrum recovery is cast as a Taylor approximation problem in the speech front-end task. Our contributions can be summarized as three-fold:
• We rethink the decoupling-style SE algorithm and abstract it as a Taylor approximation problem.
• We propose an end-to-end framework to simulate the 0th-order and high-order terms of the Taylor unfolding.
• We conduct comprehensive experiments, and the results show that our system achieves state-of-the-art performance on two benchmarks.

T-F Domain Methods. Before entering the deep learning era, traditional denoising algorithms were applied in the T-F domain, as Fourier theory provides a feasible feature representation. Typical methods include spectral subtraction [Boll, 1979], Wiener filtering [Scalart and others, 1996], and statistical-based methods [Ephraim and Malah, 1984]. After the proliferation of DNNs, the denoising task was formulated as a supervised learning problem, where a network is trained to grasp the latent mapping relations between noisy features and clean targets. For a long time, only the magnitude was considered, as the phase distribution is nonstructured and thus difficult to predict.
More recently, increasing evidence shows that the phase also plays a pivotal role in improving perceptual quality [Paliwal et al., 2011]. Among phase-aware SE approaches, GCRN [Tan and Wang, 2020] utilized a UNet-style network to estimate both the real and imaginary (RI) parts of the spectrum. As a modification, DCCRN [Hu et al., 2020] devised a complex-valued UNet to model the complex correlation between the RI components. [Yin et al., 2020] proposed a dual-branch network to model the magnitude filter and the cosine representations of the phase, respectively. CTS-Net [Li et al., 2021] proposed a two-stage mapping regime, where the magnitude was first estimated as a prior to further facilitate the subsequent phase recovery.

Time Domain Methods. Thanks to the development of DNNs, time-domain methods have gained prosperity more recently. A typical option is to directly learn the sample distribution. In DDAEC [Pandey and Wang, 2020], the 1-D waveform was first enframed into a 2-D format and then passed through a UNet structure with layer-wise dense-nets. SEGAN [Pascual et al., 2017] adopted a generative adversarial network (GAN) to predict the waveform directly. Another strategy is to use a pair of learnable encoder and decoder to convert the waveform samples into a latent space and then devise a separation module to distinguish between different sources. Representative works include Conv-TasNet [Luo and Mesgarani, 2019] and DPRNN [Luo et al., 2020].

Multi-stage Learning. Due to the lack of prior information, the performance of a single-stage SE pipeline is heavily limited in complicated acoustic scenarios. In contrast, in a multi-stage pipeline, the original mapping problem is usually decomposed into several separate subtasks, which enables progressive learning. Besides, the previous estimation also serves as a latent prior to guide the subsequent learning process. [Westhausen and Meyer, 2020] combined the complementarity of the T-F and latent domains and proposed a stacked dual-signal transformation network. [Hao et al., 2021] rethought spectrum recovery from the dual optimization of subband and fullband and proposed a two-stage approach to capture both local and global spectral contexts. Based on the glance-and-gaze behavior of humans in visual perception, [Li et al., 2022] proposed to stack multiple glance-gaze modules to reconstruct the spectrum collaboratively.

Given the STFT, the mixture signal in the T-F domain can be formulated as:

X_{k,l} = S_{k,l} + N_{k,l}, (1)

where {X_{k,l}, S_{k,l}, N_{k,l}} ∈ C denote the complex-valued noisy, clean, and noise signals at frequency bin k ∈ {1, ..., K} and time index l ∈ {1, ..., L}. For brevity, we omit the subscripts {k, l} if no confusion arises. The aim of SE is to design an operator to extract the target speech from the noisy mixture, i.e.,

S̃ = F_1(X), (2)

where F_1 denotes the estimation function. Although various networks can be employed to accomplish this process, these methods usually encapsulate the whole recovery process in a black box and thus have weak interpretability in the intermediate stages [Tan and Wang, 2020; Hu et al., 2020]. Recently, a decoupling-style forward pipeline was proposed in [Li et al., 2021] and is given as:

|S̃| = F_mag(|X|), (3)
S̃ = |S̃| e^{jθ_X} + F_com(X), (4)

where F_mag and F_com are the mapping functions for magnitude and residual estimation, respectively, and θ_X denotes the noisy phase.
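To make the decoupling pipeline of Eqns. (3)-(4) concrete, the following sketch wires the two steps together. Here f_mag and f_com are placeholder callables rather than the actual networks of [Li et al., 2021], and the epsilon and shapes are illustrative.

```python
import torch

def decoupled_enhance(x_real, x_imag, f_mag, f_com):
    """Two-step decoupling (Eqns. (3)-(4), schematic): estimate the magnitude,
    couple it with the noisy phase to form a coarse spectrum, then add an
    estimated complex residual to refine the phase."""
    mag = torch.sqrt(x_real ** 2 + x_imag ** 2 + 1e-8)  # |X|
    phase = torch.atan2(x_imag, x_real)                  # theta_X
    mag_est = f_mag(mag)                                 # step 1: |S~| = F_mag(|X|)
    coarse_r = mag_est * torch.cos(phase)                # couple with the noisy phase
    coarse_i = mag_est * torch.sin(phase)
    res_r, res_i = f_com(x_real, x_imag)                 # step 2: complex residual
    return coarse_r + res_r, coarse_i + res_i            # S~ = coarse + residual
```

Because the residual path is optimized separately from the magnitude path, the two objectives no longer compete, which is the compensation-effect argument made above.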
This decoupling procedure disentangles magnitude and phase through step-wise optimization, and can be summarized into two core operations:
• op1: suppress the noise in the magnitude domain while ignoring the phase term, to obtain a coarse estimation.
• op2: estimate the complex residual while fixing the magnitude term, to refine the target spectrum.

In traditional SE algorithms, op1 can be fulfilled by either noise subtraction [Boll, 1979] or noise filtering [Ephraim and Malah, 1984]. For these two techniques, the aforementioned procedure can be respectively expressed as

S = X − Ñ_1 + δ_1, (5)
S = X − Ñ_2 + δ_2, (6)

where Ñ_1 = |Ñ| e^{jθ_X} and Ñ_2 = (1 − M) |X| e^{jθ_X} denote the estimated noise components in traditional noise subtraction and noise filtering algorithms, respectively, M denotes the spectral filter gain, and {δ_1, δ_2} denote the corresponding phase residual terms. Comparing Eqn. (5) and Eqn. (6), we find that they have a similar format, which is intuitive, since op1 essentially encourages an accurate noise power spectral density (NPSD) estimate that is then subtracted from the mixture. Furthermore, denoting F(X) = X − Ñ and X := X + δ, Eqns. (5)-(6) can be abstracted into a more general case:

S = F(X + δ). (7)

Eqn. (7) implies that if we could access the residual term δ and add it to the input X in advance, we would theoretically be able to recover the spectrum perfectly by magnitude estimation. However, in practical scenarios, we usually cannot access the residual prior δ. Therefore, to resolve the above generalized function, we expand Eqn. (7) with an infinite Taylor series at X, given as

S = Σ_{q=0}^{∞} (δ^q / q!) (∂^q F(X) / ∂X^q), (8)

which can be simplified as

S = F(X) + Σ_{q=1}^{∞} (δ^q / q!) (∂^q F(X) / ∂X^q). (9)

As such, we provide a novel perspective on complex spectrum recovery from Taylor's approximation theory. In detail, the 0th-order term is grounded in the estimation of the spectral magnitude, and the high-order terms attempt to approximate the distribution of the residual component.

To adapt the formulation of the Taylor series expansion to a network, it is necessary to derive the correlation between adjacent derivative terms. For practical implementation, we truncate the number of orders in the derivative part to Q and neglect the higher-order parts. In light of [Fu et al., 2021], we first define the qth-order derivative term as

T(q, X, δ) = δ^q (∂^q F(X) / ∂X^q), (10)

where the factorial term is dropped for derivation convenience. To investigate the correlation between the qth-order term T(q, X, δ) and the (q+1)th-order term T(q+1, X, δ), we differentiate Eqn. (10) w.r.t. X. Noting that the evaluation point X + δ is fixed, so that δ can be viewed as a function of X with ∂δ/∂X = −1, and multiplying δ on both sides, we derive the following recursive formula between T(q+1, X, δ) and T(q, X, δ):

T(q+1, X, δ) = q T(q, X, δ) + δ ∂T(q, X, δ)/∂X. (14)

Generally speaking, it is quite difficult to access the derivative operator δ ∂T(q, X, δ)/∂X. To this end, we design a trainable network module, denoted P(q, X, δ), to replace this complicated operator. Note that, as the network weights are purely learned from the training data, the module does not necessarily follow the strict mathematical definition of differentiation, but we empirically find that it indeed performs residual estimation.
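As a sanity check on the truncation behind Eqns. (8)-(9), the toy example below approximates F(x + δ) by a Q-term Taylor series for a scalar function. F = exp is chosen purely because all of its derivatives equal F itself, so the example stays self-contained.

```python
import math

F = math.exp                     # toy stand-in: every derivative of exp is exp
x, delta = 0.5, 0.2              # expansion point and "residual"
target = F(x + delta)

for Q in range(5):               # truncate the series at order Q
    approx = sum(delta ** q * F(x) / math.factorial(q) for q in range(Q + 1))
    print(f"Q={Q}: approx={approx:.6f}  |error|={abs(target - approx):.2e}")
```

The error shrinks rapidly with Q, which mirrors the later ablation finding that a small number of high-order modules is already sufficient.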
We instantiate the proposed Taylor-unfolding framework as shown in Figure 2(a). It mainly comprises two parts, namely the 0th-order module and multiple high-order modules. According to our previous formulation, the 0th-order module targets magnitude estimation: we first convert the input RI components into the magnitude, i.e., |X| = √(X_r² + X_i²), and send it to the 0th-order module to obtain a gain M with range (0, 1) for noise filtering in the magnitude domain, as shown in Eqn. (15). After magnitude filtering, we couple the filtered spectral magnitude with the noisy phase θ_X to yield the coarse complex spectrum. To model the high-order terms, we employ a high-order encoder to directly extract patterns from the RI input; its output feature map R is then concatenated with the output of the last high-order module as the input of the next module for the high-order term update, as shown in Eqns. (16)-(17). This operation is implemented recursively. After all the terms are obtained, following Taylor's formula, we superimpose them to reconstruct the target spectrum. In a nutshell, the overall forward stream is formulated as:

T(0, X, δ) = (M ⊙ |X|) e^{jθ_X}, (15)
P(q, X, δ) = G(Concat(T(q, X, δ), R)), (16)
T(q+1, X, δ) = q T(q, X, δ) + P(q, X, δ), (17)

where q ∈ {1, ..., Q} is the order index, G denotes the mapping function of the derivative operator in the high-order module, and ⊙ denotes the Hadamard product.

In the 0th-order module, we adopt a classical UNet-style encoder-decoder structure, which has been widely utilized in the SE task [Tan and Wang, 2020]. The encoder gradually decreases the feature size with consecutive downsampling operations while extracting spectral features. In contrast, the decoder has a mirrored structure and attempts to recover the original spectral size with deconvolution layers. Nonetheless, high-level semantic information is usually embedded in frame correlations of various lengths, and naive convolution operations often cannot capture such complicated multiscale information. Inspired by the success of U²-Net in the salient object detection field [Qin et al., 2020], we adapt the U²-encoder and U²-decoder with multiple recalibration encoding/decoding layers (RELs/RDLs), as shown in Figure 2(b)(d). Taking the REL as an example, each 2-D gated linear unit (GLU) [Dauphin et al., 2017] is followed by instance normalization (IN) and PReLU. Then a UNet-block is inserted with a residual connection; it takes a UNet-style structure, except that the layer depth varies dynamically with the current input size. The process can be expressed as:

H̃_i = PReLU(IN(GLU(H_i))),
H_{i+1} = UNetBlock(H̃_i) + H̃_i,

where H_i denotes the input feature map of the i-th REL. The rationale is two-fold. First, through the additional feature downsampling-upsampling operations, information at different scales can be effectively grasped. Second, as feature maps close to the input tend to be rather noisy, we can recalibrate the feature map and preserve the target information.

To establish long-term relations between adjacent frames, we stack multiple temporal convolution networks (TCNs) [Luo and Mesgarani, 2019], which comprise multiple temporal convolution modules (TCMs) with increasing dilations along the time axis. Besides, to decrease the computational footprint, we adopt a squeezed version of the TCN (dubbed S-TCN) [Li et al., 2021], as shown in Figure 2(c), where squeezed TCMs (S-TCMs) are leveraged for more compact channels. We also investigate other advanced structures for the 0th-order module, such as transformers [Vaswani et al., 2017] and conformers [Gulati et al., 2020], in the experimental ablation studies (see Section 4.4).
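The forward stream of Eqns. (15)-(17) can be summarized in a few lines of pseudo-PyTorch. The modules zero_order, high_encoder, and operators below are hypothetical stand-ins for the 0th-order module, the high-order encoder, and the Q trainable derivative modules G; the tensor layout and the order indexing are schematic rather than the authors' verbatim implementation.

```python
import torch

def taylor_unfold(x_real, x_imag, zero_order, high_encoder, operators, Q=3):
    mag = torch.sqrt(x_real ** 2 + x_imag ** 2 + 1e-8)
    phase = torch.atan2(x_imag, x_real)
    gain = zero_order(mag)                                     # M in (0, 1), Eqn. (15)
    term = torch.stack([gain * mag * torch.cos(phase),         # 0th-order (coarse) term
                        gain * mag * torch.sin(phase)], dim=1)
    feat = high_encoder(torch.stack([x_real, x_imag], dim=1))  # feature map R
    out = term
    for q in range(1, Q + 1):
        p = operators[q - 1](torch.cat([term, feat], dim=1))   # P(q, X, delta), Eqn. (16)
        term = q * term + p                                    # recursion, Eqn. (17)
        out = out + term                                       # Taylor superimposition
    return out  # stacked RI estimate of the target spectrum
```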
For effective network training, we model the complicated derivative operator with a trainable network module, whose internal structure is presented in Figure 2(e). As Eqn. (14) indicates, the operator involves both the last-order term and the input X, so we utilize both the encoded feature R from the noisy input X and the estimate of the last-order term T(q−1, X, δ) as the input, which is sent to a 1-D convolution. Several S-TCMs are employed as the modeling unit, and δ ∂T(q, X, δ)/∂X is then generated with a linear transformation. Note that our framework also applies to other more advanced network structures and is expected to achieve even better performance, which we leave as future work. Besides, as the derivative operator should theoretically be parameter-invariant, we also investigate the case where the parameters are shared among the different derivative modules (see Section 4.4).

WSJ0-SI84. It consists of 7138 utterances by 83 speakers (42 males and 41 females) [Paul and Baker, 1992]. We randomly select 5428 and 957 clips for training and validation, and another 150 clips by untrained speakers are used for testing. To generate noisy-clean pairs, we sample around 20,000 noises from the DNS-Challenge noise set [Reddy et al., 2020].

DNS-Challenge. The Interspeech 2020 DNS-Challenge corpus covers over 500 hours of clean clips by 2150 speakers and over 180 hours of noise clips [Reddy et al., 2020]. For model evaluation, it provides a non-blind validation set with two categories, namely with and without reverberation, each including 150 noisy-clean pairs. Following the scripts provided by the organizer, we generate around 3000 hours of noisy-clean pairs for training, with SNRs ranging randomly from -5 dB to 15 dB.

Network configuration. In the U²-encoder and U²-decoder, the kernel size and stride of the 2D-GLUs are set as (1, 3) and (1, 2) in the time and frequency axes, respectively, and the kernel size in the UNet-block is set as (2, 3). The number of 2-D convolution channels remains 64 by default. Denoting the number of encoder (decoder) layers in the i-th UNet-block as U_i, we set U = {4, 3, 2, 1, 0}, where 0 means no UNet-block is employed. For the S-TCN and the derivative operators, similar to [Li et al., 2022], two groups of S-TCMs are utilized, each of which includes four S-TCMs with a kernel size of 5 and dilation rates of {1, 2, 5, 9}. Causal convolutions are adopted by zero-padding along the past frames.

Training configuration. All utterances are sampled at 16 kHz. The window size is set as 20 ms, with 50% overlap between adjacent frames, and a 320-point FFT is utilized, leading to 161 dimensions along the feature axis. The model is trained on the PyTorch platform with an NVIDIA V100 GPU. We use the Adam optimizer (β₁ = 0.9, β₂ = 0.999) with a batch size of 8, and the learning rate is initialized as 5e-4. For WSJ0-SI84 we train the model for 60 epochs, while 30 epochs are used for DNS-Challenge. The RI loss together with a magnitude constraint is utilized as the training loss, and the power-spectrum compression strategy is adopted with the compression coefficient empirically set as 0.5 [Li et al., 2022]. The learning rate is halved if the validation loss does not decrease for two consecutive epochs.
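The stated front-end and loss translate into the sketch below: a 20 ms Hann window with 50% overlap and a 320-point FFT at 16 kHz yield 161 frequency bins, and the loss combines RI and magnitude terms under power compression with coefficient 0.5. The loss function is one plausible reading of the description, not the authors' verbatim code.

```python
import torch

N_FFT, HOP, WIN = 320, 160, 320  # 20 ms window, 50% overlap at 16 kHz

def stft_ri(wave):
    spec = torch.stft(wave, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                      window=torch.hann_window(WIN), return_complex=True)
    return spec.real, spec.imag  # each of shape (batch, 161, frames)

def compressed_ri_mag_loss(est_r, est_i, ref_r, ref_i, beta=0.5):
    """RI loss plus magnitude constraint under power compression |.|**beta."""
    def compress(r, i):
        mag = torch.sqrt(r ** 2 + i ** 2 + 1e-8)
        scale = mag ** (beta - 1.0)   # rescale RI so the magnitude becomes mag**beta
        return r * scale, i * scale, mag ** beta
    er, ei, em = compress(est_r, est_i)
    rr, ri, rm = compress(ref_r, ref_i)
    return (er - rr).pow(2).mean() + (ei - ri).pow(2).mean() + (em - rm).pow(2).mean()
```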
Multiple objective metrics are adopted, including narrow-band (NB) PESQ [Rix et al., 2001] and wide-band (WB) PESQ [Rec, 2005] for speech quality, short-time objective intelligibility (STOI) [Taal et al., 2011] and its extended version ESTOI [Jensen and Taal, 2016] for intelligibility, together with the scale-invariant signal-to-noise ratio (SISNR).

We randomly sample around 100 hours of pairs from the WSJ0-SI84 corpus to conduct the ablation study, which spans the following three aspects: (1) Is-shared: whether the parameters are shared among multiple high-order modules; (2) Q: the number of derivative orders; and (3) Zero-type: the network type used in the 0th-order module. We fix the random seed, and PESQ, ESTOI, and SISNR are utilized as the evaluation metrics; the results are shown in Table 1.

Effect of the parameter-shared scheme for high-order modeling. In the aforementioned problem formulation, we adopt a trainable network to model the complicated derivative operator for high-order modeling. It is thus necessary to explore whether the parameters can be shared among modules. Comparing entries 1c-1f with 2c-2f in Table 1, the non-shared case yields relatively better performance than the shared case, especially in terms of ESTOI and SISNR. We can explain this phenomenon from two aspects. First, the former scheme provides more parameter freedom in the optimization space through independent gradient updates, while the latter has to balance all the terms with only one set of parameters. Second, the high-order terms are mainly responsible for phase modification. As a consequence, the non-shared scheme only yields a slightly better PESQ score than the shared case, but relatively more notable improvements in terms of ESTOI and SISNR.

Effect of the order number. As shown in entries 1a-1f of Table 1, when Q = 0, i.e., no high-order module is employed, it is not surprising to observe the worst performance, as only the magnitude is considered and the phase term is neglected. When Q increases from 1 to 3, consistent improvements are achieved in all three metrics, which shows the effectiveness of Taylor-series modeling. However, when Q further increases, the performance tends to saturate and even drops slightly, e.g., at Q = 4. A similar trend is also observed in entries 2c-2f. This might result from the fact that the high-order modules are responsible for complex residual modeling, which has a rather sparse spectral distribution; three orders can therefore be sufficient to approximate the real distribution. To further validate the mechanism of the Taylor-unfolding framework, taking Q = 3 as an example, we visualize the output of each order in Figure 3. The noisy, clean, and final estimated spectra are also presented as references. As we can see, the output of the 0th-order module is quite similar to the clean version, and most noise components are suppressed, which indicates that the 0th-order term indeed serves as a filter to eliminate the noise and captures the overall speech structure in the magnitude domain. Besides, the outputs of the high-order modules appear rather sparse, but we can notice the contour in the harmonic region, indicating that the high-order modules effectively capture the sparse residual structure to refine the phase. Remark that, as we only supervise the final estimation, despite not strictly following the mathematical definition of Taylor's approximation, the network still learns to allocate the roles of the 0th-order and high-order terms as expected.

Effect of network types in the 0th-order module.
To validate the superiority of the network used in the 0th-order module, we investigate other networks, as shown in entries 3a-3c of Table 1. "U" denotes that no UNet-block is utilized, and "Transformer" and "Conformer" denote that the network is replaced by six transformer and conformer encoding layers, respectively. As we can see, a notable performance drop is observed from entry 1d to 3a, suggesting the effectiveness of the UNet-block in feature representation. It is interesting to notice that although the transformer and conformer have recently shown promising performance in ASR and NLP-related tasks, they are inferior to the proposed U² version herein. This might be because they mainly consider global sequential correlations and ignore local spectral patterns, which hampers the overall filter estimation.

4.5 Comparison with State-of-the-Art Methods
WSJ0-SI84. The configurations of entries 1d and 2d in Table 1 are selected for comparison with ten other top-performing baselines, where the superscript "†" denotes that the parameters are shared among the high-order modules. Note that except for PHASEN, all baselines are causal implementations. The quantitative results are shown in Table 2, where NB-PESQ, ESTOI, and SISNR are utilized as the evaluation metrics. We also present the number of parameters, the multiply-accumulate operations (MACs) per second, and the real-time factor (RTF) to evaluate the computational complexity. As we can see, our method achieves the highest metric scores among all the systems with a reasonable number of trainable parameters and computational complexity. Even with causal convolutions, our method still dramatically surpasses PHASEN, a noncausal system, which reveals the superiority of our Taylor-unfolding-based design. It is interesting to notice that when the parameters are shared, the metric scores of our method degrade slightly; nonetheless, it remains comparable to the state-of-the-art systems with relatively fewer trainable parameters.

DNS-Challenge. To verify the superiority of the proposed SE system in more complicated acoustic scenarios, we report the results on the Interspeech 2020 DNS-Challenge corpus, as shown in Table 3. WB-PESQ, NB-PESQ, STOI, and SISNR are used for evaluation. One can see that our method achieves the highest metric performance in both reverberant and anechoic acoustic scenarios, which substantially demonstrates its denoising potential in both environments.

We propose a decoupling-style framework based on Taylor's approximation theory for speech enhancement. Specifically, the original complex spectrum reconstruction is decoupled into two parts, namely magnitude estimation and complex residual estimation. For the former, a magnitude filter is devised to suppress the noise components in the magnitude domain. For the latter, multiple trainable modules are unfolded to simulate the complicated derivative operator and estimate the corresponding high-order terms. Afterward, we recover the target spectrum by superimposition following Taylor's formula. Experiments show that our method achieves state-of-the-art performance over previous top-performing baselines while providing better internal interpretability.
Suppression of acoustic noise in speech using spectral subtraction.
Real-Time Denoising and Dereverberation with Tiny Recurrent U-Net.
Real time speech enhancement in the waveform domain.
Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator.
Unfolding Taylor's Approximations for Image Restoration.
Conformer: Convolution-augmented transformer for speech recognition.
DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement.
An algorithm for predicting the intelligibility of speech masked by modulated noise maskers.
Two Heads are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement.
Glance and gaze: A collaborative learning framework for single-channel speech enhancement.
Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation.
Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation.
Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain.
U2-Net: Going deeper with nested U-structure for salient object detection.
P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs.
The Interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results.
Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs.
Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement.
PHASEN: A phase-and-harmonics-aware speech enhancement network.