key: cord-0047217-i6mg52fh
authors: Bella, Mostafa; Saylani, Hicham
title: A New Sparse Blind Source Separation Method for Determined Linear Convolutive Mixtures in Time-Frequency Domain
date: 2020-06-05
journal: Image and Signal Processing
DOI: 10.1007/978-3-030-51935-3_38
sha: 18c4f311898d9f61dea8cb8b84d1dd4abd3fa1d6
doc_id: 47217
cord_uid: i6mg52fh

This paper presents a new Blind Source Separation method for linear convolutive mixtures, which exploits the sparsity of source signals in the time-frequency domain. This method especially brings a solution to the artifacts problem that affects the quality of signals separated by existing time-frequency methods. These artifacts are in fact introduced by a time-frequency masking operation, used by all these methods. Indeed, by focusing on the case of determined mixtures, we show that this problem can be solved with much less restrictive sparsity assumptions than those of existing methods. Test results show the superiority of our new proposed method over existing ones based on time-frequency masking.

Blind Source Separation (BSS) aims to find a set of N unknown signals, called sources and denoted by s_j(n), knowing only a set of M mixtures of these sources, called observations and denoted by x_i(n). This discipline is receiving increasing attention thanks to the diversity of its fields of application, among which we can cite the audio, biomedical, seismic and telecommunications fields. In this paper, we are interested in so-called linear convolutive (LC) mixtures, for which each mixture x_i(n) is expressed in terms of the sources s_j(n) and their delayed versions as follows:

$$x_i(n) = \sum_{j=1}^{N} \sum_{q=0}^{Q} h_{ij}(q)\, s_j(n-q) = \sum_{j=1}^{N} h_{ij} * s_j(n), \quad i \in [1, M], \tag{1}$$

where:

- h_ij(q) represents the impulse response coefficients of the mixing filter linking the source of index j to the sensor of index i,
- Q is the order of the longest filter,
- the symbol "*" denotes the linear convolution operator.
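As an illustration of the LC mixing model, the following minimal sketch builds M = 2 mixtures from N = 2 sources with two-tap filters (Q = 1). The helper names `convolve` and `mix`, as well as the toy signals and filter taps, are hypothetical and chosen only for readability.

```python
def convolve(h, s):
    """Linear convolution of filter h with signal s, truncated to len(s) samples."""
    out = [0.0] * len(s)
    for n in range(len(s)):
        for q in range(min(len(h), n + 1)):
            out[n] += h[q] * s[n - q]
    return out


def mix(filters, sources):
    """filters[i][j] holds the taps of h_ij, sources[j] holds s_j; returns the x_i."""
    length = len(sources[0])
    mixtures = []
    for row in filters:
        x = [0.0] * length
        for h, s in zip(row, sources):
            for n, v in enumerate(convolve(h, s)):
                x[n] += v
        mixtures.append(x)
    return mixtures


# Toy data (assumed values): an impulse and a delayed, scaled impulse.
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
filters = [[[1.0, 0.5], [0.3, 0.0]],
           [[0.2, 0.1], [1.0, 0.4]]]
mixtures = mix(filters, sources)
```

With impulse-like sources, each mixture is simply a sum of shifted and scaled copies of the filter taps, which makes the result easy to check by hand.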

In the field of BSS, the case of LC mixtures is still of interest, since the performance of existing methods remains modest compared to the particular case of linear instantaneous mixtures, for which Q = 0. BSS methods for LC mixtures can be classified into two main families: the so-called temporal methods, which deal with mixtures in the time domain, and the so-called frequency methods, which deal with mixtures in the time-frequency (TF) domain. The performance of the former is generally very modest, and their assumptions are very restrictive compared to those of the latter. Indeed, based mostly on the independence of source signals, the most efficient temporal methods are comparable to frequency ones only for very short filters (i.e. low Q), and generally require over-determined mixtures (i.e. M > N) [12, 16]. Based mostly on the sparsity of source signals in the TF domain, the frequency methods have shown good performance in the determined case (i.e. M = N) or even in the under-determined case (i.e. M < N), and this despite increasing filter lengths [4, 8, 9, 13-15]. These frequency methods start by transposing Eq. (1) into the TF domain using the short-time Fourier transform (STFT) as follows:

$$X_i(m, k) = \sum_{j=1}^{N} H_{ij}(k)\, S_j(m, k), \quad m \in [0, T-1], \; k \in [0, K-1], \tag{2}$$

where:

- X_i(m, k) and S_j(m, k) are the STFT representations of x_i(n) and s_j(n) respectively,
- K and T are the length of the analysis window (footnote 1) and the number of time windows used by the STFT respectively (footnote 2),
- H_ij(k) is the Discrete Fourier Transform of h_ij(n) calculated on K points.

Among the most efficient and relatively recent frequency methods, we can mention those based on TF masking [2, 4, 6-9, 13-15]. These methods often exploit sparsity by assuming that the source signals are W-disjoint orthogonal, i.e. not overlapping in the TF domain (footnote 3). The principle of these methods is to estimate a separation mask, denoted by M_j(m, k) and specific to each source S_j(m, k), which groups the TF points where only this source is present.

Footnote 1: We assume that the length K of the analysis window is sufficiently larger than the filter order Q (i.e. K > Q).

Footnote 2: It should be noted, however, that the equality in Eq. (2) is only an approximation. It would hold exactly only if the discrete convolution were circular, which is not the case here. We also note that this STFT is generally used with an analysis window other than the rectangular window [2, 4, 6-9, 13-15].

Footnote 3: This means that at most one source is present at each TF point.

The application of the estimated mask M_j(m, k) to one of the frequency observations X_i(m, k) allows us to keep from the latter only the TF points belonging to the source S_j(m, k), and thus to separate it from the rest of the mixture. Depending on the procedure used to estimate the masks, we distinguish between two types of BSS methods based on TF masking: the so-called full-band methods [2, 4, 6, 9], for which the masks are estimated using a clustering algorithm that processes all frequency bins simultaneously, and the so-called bin-wise methods [7, 8, 13-15], for which the masks are estimated using a clustering algorithm that processes only one frequency bin at a time.
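The masking principle can be sketched on a toy TF grid. The example below is only an illustration under strong assumptions: it uses "oracle" binary masks computed from the true sources (which real methods do not have, since they estimate the masks by clustering), and an instantaneous mixture with unit gains. The names `binary_masks` and `apply_mask` are hypothetical.

```python
def binary_masks(S):
    """Oracle masks: masks[j][m][k] is 1 where source j has the largest magnitude, else 0."""
    n_src, n_frames, n_bins = len(S), len(S[0]), len(S[0][0])
    masks = [[[0] * n_bins for _ in range(n_frames)] for _ in range(n_src)]
    for m in range(n_frames):
        for k in range(n_bins):
            dominant = max(range(n_src), key=lambda j: abs(S[j][m][k]))
            masks[dominant][m][k] = 1
    return masks


def apply_mask(mask, X):
    """Keep from the observation X only the TF points selected by the mask."""
    return [[mask[m][k] * X[m][k] for k in range(len(X[0]))] for m in range(len(X))]


# Two sources on a 2-frame x 2-bin grid; they overlap only at TF point (1, 1).
S = [[[1 + 0j, 0j], [0j, 2 + 0j]],
     [[0j, 3 + 0j], [0j, 1 + 0j]]]
# Observation on one sensor, assuming unit mixing gains.
X1 = [[S[0][m][k] + S[1][m][k] for k in range(2)] for m in range(2)]
masks = binary_masks(S)
est = [apply_mask(mask, X1) for mask in masks]
```

At the overlapping point (1, 1), the first estimate keeps the whole observation (value 3 instead of the true value 2) and the second estimate is set to zero (instead of 1). This is the kind of error that appears when the W-disjoint orthogonality assumption does not hold.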

Among the most popular full-band methods, we can cite those proposed in [4, 9], which are based on the clustering of the level ratios and phase differences between the frequency observations X_i(m, k) to estimate the separation masks. However, this clustering is not always reliable, especially when the order Q of the mixing filters increases [4]. Moreover, when the maximum distance between the sensors is greater than half the wavelength of the maximum frequency of the source signals involved, a problem called spatial aliasing is inevitable [4]. The bin-wise methods [7, 8, 13-15] are robust to these two problems. However, these methods require an additional step to solve a permutation problem in the estimated masks when passing from one frequency bin to another, which is a classical problem common to all bin-wise BSS methods.

However, all of these BSS methods based on TF masking (full-band and bin-wise) suffer from an artifact problem that affects the quality of the separated signals and is due to the fact that the W-disjoint orthogonality assumption is not perfectly verified in practice. Indeed, being introduced by the TF masking operation, these artifacts become increasingly troublesome as the spectral overlap of the source signals in the TF domain grows. In [11], the authors proposed a first solution to this problem, which consists of a cepstral smoothing of the spectral masks before applying them to the frequency observations X_i(m, k). An interesting extension of this technique, proposed in [3], consists in applying cepstral smoothing not to the spectral masks but rather to the separated signals, i.e. after applying the separation masks. Since these two techniques [3, 11] had only been validated on a few full-band methods, we recently proposed in [5] to evaluate their effectiveness using a few popular bin-wise methods. However, these two solutions could only reduce one particular type of artifact, called musical noise [3, 5, 11]. In the same vein, in order to avoid the artifacts caused by the TF masking operation, we propose in this paper a new BSS method which also exploits the sparsity of source signals in the TF domain for determined LC mixtures. Indeed, by focusing on the case of determined mixtures, we show that we can avoid TF masking and also relax the W-disjoint orthogonality assumption. Note that the case of determined mixtures was also addressed in [1], but with an assumption which is again very restrictive, namely that each source signal is silent over at least one whole time frame. Thus, our new method makes it possible to carry out the separation while avoiding the artifacts introduced by the TF masking operation, with sparsity assumptions much less restrictive than those of existing methods.

We begin in Sect. 2 by describing our method. We then present in Sect. 3 various experimental results that measure the performance of our method compared to existing methods, and we end in Sect. 4 with a conclusion and perspectives for our work.

The sole sparsity assumption of our method is the following.

Assumption: For each source s_j(n) and for each frequency bin k, there is at least one TF point (m, k) where it is present alone, i.e.:

$$\forall j \in [1, N], \; \forall k, \; \exists\, m \;:\; S_j(m, k) \neq 0 \ \text{and} \ S_l(m, k) = 0 \quad \forall l \neq j. \tag{3}$$

Thus, if we denote by E_j the set of TF points (m, k) that verify assumption (3), called single-source points, then relation (2) gives us:

$$X_i(m, k) = H_{ij}(k)\, S_j(m, k), \quad \forall (m, k) \in E_j, \; i \in [1, M]. \tag{4}$$

Our method proceeds in two steps. The first step, which exploits the probabilistic masks used by Sawada et al. in [14, 15], consists in identifying, for each source of index j and each frequency bin k, the index m_jk such that the TF point (m_jk, k) best verifies Eq. (4), and then in estimating the separating filters, denoted by F_ij(k) and defined by:

$$F_{ij}(k) = \frac{H_{ij}(k)}{H_{1j}(k)}, \quad i \in [2, M], \; j \in [1, N]. \tag{5}$$

The second step consists in recombining the mixtures X_i(m, k) using the separating filters F_ij(k) in order to finally obtain an estimate of the separated sources. The two steps of our method are the subject of Sects. 2.1 and 2.2 respectively.
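To see why a single single-source point per source and per frequency bin is enough for the first step, one can divide relation (4) written for sensor i by the same relation written for sensor 1. The short worked step below assumes that H_1j(k) and S_j(m, k) are non-zero at the selected point, and uses the ratio H_ij(k)/H_1j(k) as the separating filter:

```latex
\frac{X_i(m, k)}{X_1(m, k)}
  = \frac{H_{ij}(k)\, S_j(m, k)}{H_{1j}(k)\, S_j(m, k)}
  = \frac{H_{ij}(k)}{H_{1j}(k)}
  = F_{ij}(k),
  \qquad (m, k) \in E_j .
```

The unknown source value cancels in the ratio, so the filter can be read directly from the observations at that point.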

Since the proposed treatment in this first step of our method is performed independently of the frequency, we propose in this section to simplify the notations by omitting the frequency bin index "k". So using a matrix formulation, the Eq. (2) gives us: 

where W is given by

Each vector Z(m) is modeled by a complex Gaussian density function of the form [14] :

where a_j and σ_j^2 are respectively the centroid (with unit norm, ||a_j|| = 1) and the variance of each cluster C_j. This density function p(Z) can be described by the following mixture model:

where α_j are the mixture ratios and θ = {a_1, σ_1, α_1, ..., a_N, σ_N, α_N} is the parameter set of the mixture model. Then, an iterative algorithm of the expectation-maximization (EM) type is used to estimate the parameter set θ, as well as the posterior probabilities P(C_j|Z(m), θ) at each TF point, which are none other than the probabilistic masks used in [14].

In the expectation step, these posterior probabilities are given by:

In the maximization step, the update of the centroid a_j is given by the eigenvector associated with the largest eigenvalue of the matrix R_j defined by:

The parameters σ_j^2 and α_j are then updated via the following relations:
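For orientation, the EM quantities described above can be written out explicitly. Assuming the line-orientation clustering model of [14], applied to unit-norm vectors Z(m) of dimension M over T time frames, they would read as follows. This is a sketch based on the cited formulation; the notation is an assumption and may differ from the authors' own expressions.

```latex
\begin{aligned}
p(Z \mid a_j, \sigma_j) &= \frac{1}{(\pi \sigma_j^2)^{M}}
   \exp\!\left(-\frac{\lVert Z - (a_j^{H} Z)\, a_j \rVert^2}{\sigma_j^2}\right), \\
p(Z \mid \theta) &= \sum_{j=1}^{N} \alpha_j\, p(Z \mid a_j, \sigma_j),
   \qquad \sum_{j=1}^{N} \alpha_j = 1, \\
P(C_j \mid Z(m), \theta) &= \frac{\alpha_j\, p(Z(m) \mid a_j, \sigma_j)}{p(Z(m) \mid \theta)}, \\
R_j &= \sum_{m} P(C_j \mid Z(m), \theta)\, Z(m)\, Z(m)^{H}, \\
\sigma_j^2 &= \frac{\sum_{m} P(C_j \mid Z(m), \theta)\,
   \lVert Z(m) - (a_j^{H} Z(m))\, a_j \rVert^2}{M \sum_{m} P(C_j \mid Z(m), \theta)}, \\
\alpha_j &= \frac{1}{T} \sum_{m} P(C_j \mid Z(m), \theta).
\end{aligned}
```

The first two lines are the cluster density and the mixture model, the third is the expectation step, and the last three are the maximization-step quantities.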

However, since the EM algorithm used in [14, 15] is sensitive to the initialization, we propose in our method to initialize the masks with those obtained by a modified version of the MENUET method [4]. Specifically, in the clustering step used to estimate the masks, we replaced the k-means algorithm used in [4] by the fuzzy c-means (FCM) algorithm used in [13], in order to obtain probabilistic masks.

3. After the convergence of the EM algorithm, the classical permutation problem between the different frequency bins is solved by the algorithm proposed in [15], which is based on the inter-frequency correlation between the time sequences of posterior probabilities P(C_j|Z(m), θ) in each frequency bin k. In the following, we denote these posterior probabilities by P(C_j|Z(m, k)).

4. Unlike the approach adopted in [14, 15], which uses all the TF points of the estimated probabilistic masks P(C_j|Z(m, k)), we are interested in this step only in identifying one single-source TF point for each source of index j and for each frequency bin k, and therefore a single time frame index, denoted by m_jk, which best verifies our working assumption (4). We define this index m_jk as the index m for which the presence probability of the corresponding source is maximum:

$$m_{jk} = \arg\max_{m} P(C_j \mid Z(m, k)).$$

5. After having identified these "best" single-source TF points (m_jk, k), we finish this first step of our method by estimating the separating filters F_ij(k) defined in (5) by:

$$F_{ij}(k) = \frac{X_i(m_{jk}, k)}{X_1(m_{jk}, k)}. \tag{15}$$
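The last two items of this first step can be sketched for one frequency bin as follows. The sketch assumes that the posterior probabilities are already available (here they are made-up values standing in for the EM output) and that the filter is estimated as the ratio of the observations at the selected frame. The helper names `best_frame` and `estimate_filter` are hypothetical.

```python
def best_frame(posterior_jk):
    """Index m maximizing the posterior P(C_j | Z(m, k)) for one source j and one bin k."""
    return max(range(len(posterior_jk)), key=lambda m: posterior_jk[m])


def estimate_filter(X_i_k, X_1_k, posterior_jk):
    """Estimate F_ij(k) as X_i / X_1 at the selected single-source frame."""
    m_jk = best_frame(posterior_jk)
    return X_i_k[m_jk] / X_1_k[m_jk]


# Toy bin with 3 frames: only source 1 in frame 0, both in frame 1, only source 2 in frame 2.
H = [[1 + 0j, 0.5 + 0.5j], [0.8 - 0.2j, 1 + 0j]]        # H[i][j], assumed values
S = [[2 + 0j, 1 + 1j, 0j], [0j, 1 - 1j, 3 + 0j]]        # S[j][m], assumed values
X = [[sum(H[i][j] * S[j][m] for j in range(2)) for m in range(3)] for i in range(2)]
posteriors = [[0.98, 0.55, 0.01], [0.02, 0.45, 0.99]]    # made-up P(C_j | Z(m, k))
F = [estimate_filter(X[1], X[0], posteriors[j]) for j in range(2)]
```

In this toy case the selected frames are exact single-source points, so the estimates coincide with H_21(k)/H_11(k) and H_22(k)/H_12(k). With real signals the selected point only approximately satisfies relation (4), and the estimate is approximate as well.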

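In this section, for more clarity, we provide the mathematical bases for the second step of our method for two LC mixtures of two sources, i.e. for M = N = 2. The generalization to the case M = N > 2 can be derived directly from this. In this case, the mixing Eq. (1) gives us:

$$x_1(n) = h_{11} * s_1(n) + h_{12} * s_2(n), \qquad x_2(n) = h_{21} * s_1(n) + h_{22} * s_2(n). \tag{16}$$

As we pass to the TF domain, we get:

$$X_1(m, k) = H_{11}(k)\, S_1(m, k) + H_{12}(k)\, S_2(m, k), \qquad X_2(m, k) = H_{21}(k)\, S_1(m, k) + H_{22}(k)\, S_2(m, k).$$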

We use the separating filters F_ij(k), with i = 2 and j = 1, 2, estimated in the first step, to recombine these two mixtures as follows:

Since we have F_21(k) = H_21(k)/H_11(k) and F_22(k) = H_22(k)/H_12(k), based on Eq. (15), we get after all simplifications have been made:

In order to ultimately obtain the contributions of the sources on one of the sensors, we propose to add a post-processing step (as in [1]) which consists in multiplying the signals obtained from this recombination by filters, denoted by G_j(k), as follows:

where

and

.

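After all the simplifications are done, we get:

$$Y_1(m, k) = H_{11}(k)\, S_1(m, k), \qquad Y_2(m, k) = H_{12}(k)\, S_2(m, k).$$

By denoting by y_j(n) the inverse STFT of Y_j(m, k), we get:

$$y_1(n) = h_{11} * s_1(n), \qquad y_2(n) = h_{12} * s_2(n).$$

These signals are none other than the contributions of the source signals s_1(n) and s_2(n) to the first sensor (see the expression of the mixture x_1(n) in (16)).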
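The second step can be sketched for a single TF point as follows. The sign and ordering of the recombination are a matter of convention, so this sketch assumes one convention that is consistent with F_21(k) = H_21(k)/H_11(k) and F_22(k) = H_22(k)/H_12(k): X_2 - F_22 X_1 cancels the second source, X_2 - F_21 X_1 cancels the first, and the post-processing filters are taken as G_1 = 1/(F_21 - F_22) and G_2 = 1/(F_22 - F_21). The function name `separate_bin` and the numerical values are hypothetical.

```python
def separate_bin(X1, X2, F21, F22):
    """Recombine two mixtures at one TF point and rescale to sensor-1 contributions.

    Assumed convention (not necessarily the authors' exact one):
      U1 = X2 - F22 * X1 cancels source 2, U2 = X2 - F21 * X1 cancels source 1,
      G1 = 1 / (F21 - F22) and G2 = 1 / (F22 - F21) map them to H11*S1 and H12*S2.
    """
    U1 = X2 - F22 * X1
    U2 = X2 - F21 * X1
    G1 = 1 / (F21 - F22)
    G2 = 1 / (F22 - F21)
    return G1 * U1, G2 * U2


# Assumed mixing coefficients at one frequency bin, and a TF point where both sources are active.
H11, H12, H21, H22 = 1 + 0j, 0.5 + 0.5j, 0.8 - 0.2j, 1 + 0j
S1, S2 = 1 + 1j, 1 - 1j
X1 = H11 * S1 + H12 * S2
X2 = H21 * S1 + H22 * S2
Y1, Y2 = separate_bin(X1, X2, H21 / H11, H22 / H12)
```

With exact filters, Y1 and Y2 equal H_11 S_1 and H_12 S_2 even though both sources overlap at this TF point, which is why no mask is needed. The sketch breaks down if F_21(k) = F_22(k), i.e. if the two sources have the same ratio of filters between the two sensors at that bin.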

In order to evaluate the performance of our method and compare it to the most popular bin-wise methods known for their good performance, namely the method proposed by Sawada et al. [15] and the UCBSS method [13], we performed several tests on different sets of mixtures. Each set consists of two mixtures of two real audio sources, sampled at 16 kHz and with a duration of 10 s each, obtained using different filter sets. Generated by the toolbox [10], which simulates a real acoustic room characterized by a reverberation time denoted by RT60 (footnote 7), the coefficients h_ij(n) of these mixing filters depend on the distance between the two sensors (microphones), denoted by D, and on the absolute value of the difference between the directions of arrival of the two source signals, denoted by δϕ. For the calculation of the STFT, we used a 2048-sample Hanning window (as analysis window) with 75% overlap.

Footnote 7: RT60 represents the time required for the reflections of a direct sound to decay by 60 dB below the level of the direct sound.

To measure the performance, we used two of the criteria most commonly used by the BSS community, namely the Signal to Distortion Ratio (SDR) and the Signal to Artifacts Ratio (SAR), which are provided by the BSSeval toolbox [17] and are both expressed in decibels (dB). The SDR measures the global performance of any BSS method, while the SAR provides specific information on its performance in terms of the artifacts present in the separated signals.
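For readers unfamiliar with these criteria, the sketch below shows how the two ratios are formed once an estimated source has been decomposed into a target component, an interference component and an artifact component. It is a simplified illustration: the decomposition itself is carried out by the BSSeval toolbox and is assumed here to be given, the noise term is assumed to be absent, and the function name `sdr_sar` and the numerical values are hypothetical.

```python
import math


def energy(x):
    """Sum of squared samples of a real-valued signal."""
    return sum(v * v for v in x)


def sdr_sar(s_target, e_interf, e_artif):
    """SDR and SAR in dB from an already-computed decomposition of one estimated source."""
    distortion = [a + b for a, b in zip(e_interf, e_artif)]
    without_artifacts = [a + b for a, b in zip(s_target, e_interf)]
    sdr = 10 * math.log10(energy(s_target) / energy(distortion))
    sar = 10 * math.log10(energy(without_artifacts) / energy(e_artif))
    return sdr, sar


# Made-up components of one estimated source.
s_target = [1.0, -1.0, 1.0, -1.0]
e_interf = [0.1, 0.1, -0.1, -0.1]
e_artif = [0.0, 0.1, 0.0, -0.1]
sdr, sar = sdr_sar(s_target, e_interf, e_artif)
```

The SDR counts every error component as distortion, whereas the SAR isolates the artifact component, which is why the latter is the more specific indicator for masking-related artifacts.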

For each test, we evaluated the performance of the three methods, in terms of SDR and SAR, over 4 different realizations of the mixtures, corresponding to different sets of source signals. Thus, the values provided below for SDR and SAR represent the average obtained over these 4 realizations (footnote 8).

In the first experiment, we evaluated the performance as a function of the parameters D and δϕ for an acoustic room characterized by RT60 = 50 ms (Table 1). According to Table 1, our method performs better than the other two methods over the 4 realizations of mixtures tested. Indeed, in terms of SDR, the proposed method outperforms these two methods by about 5 dB for D = 0.3 m and by about 3.5 dB for D = 1 m. This performance difference is even more visible in terms of SAR, which confirms that the artifacts introduced by our method are less significant than those introduced by the other two methods.

In our second experiment, we were interested in the behavior of our method with regard to the increase of the reverberation time, while fixing the parameters D and δϕ to D = 0.3 m and δϕ = 55° respectively. Table 2 groups the performance of the three methods in terms of SDR, for RT60 belonging to the set {50 ms, 100 ms, 150 ms, 200 ms} (footnote 9).

According to Table 2, we can see again that the best performance is obtained by our method whatever the reverberation time. However, we note that this performance degrades when RT60 increases. This result, which is common to all BSS methods, is expected and is mainly explained by the fact that the higher the reverberation time, the less the assumption made by these methods on the source signals (here, sparseness in the TF domain) is verified.

Footnote 8: We opted for these 4 realizations instead of only one in order to come as close as possible to a statistical validation of our results.

Footnote 9: I.e. the length of the mixing filters (Q + 1 = fs · RT60) varies from 800 coefficients (for RT60 = 50 ms) to 3200 coefficients (for RT60 = 200 ms).
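The filter lengths associated with the tested reverberation times follow directly from the relation Q + 1 = fs · RT60 with fs = 16 kHz. A minimal check, in which the function name `filter_length` is hypothetical:

```python
def filter_length(fs_hz, rt60_ms):
    """Number of filter coefficients Q + 1 = fs * RT60, with RT60 given in milliseconds."""
    return fs_hz * rt60_ms // 1000


lengths = {rt60: filter_length(16000, rt60) for rt60 in (50, 100, 150, 200)}
# lengths == {50: 800, 100: 1600, 150: 2400, 200: 3200}
```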

In this paper, we have proposed a new Blind Source Separation method for linear convolutive mixtures with a sparsity assumption in the time-frequency domain that is much less restrictive than those of the existing methods [1, 2, 4, 6-9, 13-15]. Indeed, by focusing on the case of determined mixtures, we have shown that our method avoids the problem of artifacts in the separated signals from which most of these methods suffer [2, 4, 6-9, 13-15]. According to the results of the tests performed, the performance of our new method, in terms of SDR and SAR, is better than that obtained by the method proposed by Sawada et al. [15] and by the UCBSS method [13], which are known for their good performance among existing methods. Nevertheless, considering that these results were obtained over 4 different realizations of the mixtures and only for some values of the parameters involved, a larger statistical performance study including all these parameters is desirable to confirm these results. Furthermore, it would be interesting to propose a solution to this problem of artifacts in the case of under-determined linear convolutive mixtures as well.

[1] Alternative structures and power spectrum criteria for blind segmentation and separation of convolutive speech mixtures

[2] Joint mixing vector and binaural model based stereo source separation

[3] Cepstral smoothing of separated signals for underdetermined speech separation

[4] Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors

[5] Réduction des artéfacts au niveau des sources audio séparées par masquage temps fréquence en utilisant le lissage cepstral (in French; English: Reduction of artifacts in audio sources separated by time-frequency masking using cepstral smoothing)

[6] Permutation-free convolutive blind source separation via full-band clustering based on frequency-independent source presence priors

[7] Modeling audio directional statistics using a complex Bingham mixture model for blind source extraction from diffuse noise

[8] Relaxed disjointness based clustering for joint blind source separation and dereverberation

[9] Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures

[10] Prediction of energy decay in room impulse responses simulated with an image-source model

[11] Temporal smoothing of spectral masks in the cepstral domain for speech separation

[12] A survey of convolutive blind source separation methods

[13] Underdetermined convolutive blind source separation via time-frequency masking

[14] A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures

[15] Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment

[16] Blind separation of convolutive mixtures of non-stationary and temporally uncorrelated sources based on joint diagonalization

[17] Performance measurement in blind audio source separation