title: Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of Their Application Domain Dependence
authors: Torcoli, Matteo; Kastner, Thorsten; Herre, Jürgen
date: 2021-10-21
DOI: 10.1109/taslp.2021.3069302

Over the past few decades, computational methods have been developed to estimate perceptual audio quality. These methods, also referred to as objective quality measures, are usually developed and intended for a specific application domain. Because of their convenience, they are often used outside their original intended domain, even if it is unclear whether they provide reliable quality estimates in this case. This work studies the correlation of well-known state-of-the-art objective measures with human perceptual scores in two different domains: audio coding and source separation. The following objective measures are considered: fwSNRseg, dLLR, PESQ, PEAQ, POLQA, PEMO-Q, ViSQOLAudio, (SI-)BSSEval, PEASS, LKR-PI, 2f-model, and HAAQI. Additionally, a novel measure (SI-SA2f) is presented, based on the 2f-model and a BSSEval-based signal decomposition. We use perceptual scores from 7 listening tests about audio coding and 7 listening tests about source separation as ground-truth data for the correlation analysis. The results show that one method (2f-model) performs significantly better than the others on both domains and indicate that the dataset for training the method and a robust underlying auditory model are crucial factors towards a universal, domain-independent objective measure.

Basic Audio Quality (BAQ) defines a general, domain-independent quality criterion to rate the perceived overall quality of a signal being tested [1]. BAQ is one of the main evaluation criteria in audio coding and was also used as an assessment criterion in the field of blind source separation [2]. Listening tests (often referred to as subjective evaluation, e.g. MUSHRA [1]) in a controlled environment are the most reliable method for assessing BAQ. These are, however, time-consuming and costly and cannot easily be carried out at each development stage, e.g. of a new audio codec. The recent social distancing measures due to the COVID-19 pandemic have added another difficulty to conducting listening tests in the laboratory. Therefore, objective evaluation measures are greatly desired, i.e. computational methods that are able to estimate BAQ as closely as possible to the human assessment [3].

These models are usually designed and trained on audio material and distortion types representing the specific domain in which the measures are intended to be used. A good measure is expected to generalize to audio material unseen during its development, as long as the distortions typical of the application domain are encountered. A universal measure would generalize also to unseen distortions from different application domains. Universality or domain independence is often implied to a certain degree, even without clear evidence that this is a valid assumption. An interesting example is PESQ [4], which was finalized 20 years ago for the evaluation of speech quality over telephony systems. It is nowadays widely used for evaluating methods that potentially introduce very different types of distortion, e.g. speech separation based on Deep Neural Networks (DNNs), singing voice extraction, and dereverberation, e.g. for hearing aids [5]-[9].
PESQ has also been proposed as a loss function for supervised learning [10], [11].

The correlation between objective measures and perceptual scores has been studied by many authors, but usually within a specific application domain or with a limited amount of perceptual ground-truth data [7], [12]-[27]. The aim of this paper is to shed some light on these issues with the following contributions:
- State-of-the-art intrusive objective measures are reviewed, with a glance at DNN-based non-intrusive estimates.
- The correlation with ground-truth data from 7 listening tests about audio coding and 7 listening tests about source separation is analyzed, and the generalization of the predictions across different domains is investigated. The listening tests used here are based on MUSHRA, which is suited for assessing the intermediate quality of audio signals; generalizing the results to the estimation of BAQ for signals with small impairments should not be done without further research.
- A novel measure, based on the 2f-model and preceded by a BSSEval-based signal decomposition, is presented for assessing the perceptual relevance of artifacts.

II. Objective Measures

This section reviews state-of-the-art objective measures. The focus of this work is on intrusive measures, i.e. ideal reference signals are used for comparison in order to estimate the audio quality of the signal under test. The only novel contribution of this section is given in Section II-M. A reader with little time and previous knowledge of the state of the art can skip the rest of this section and continue with Section III.

The measures described from Section II-A to Section II-C belong to the speech enhancement domain, while the ones from Section II-D to Section II-G were developed in the field of audio coding. The measures from Section II-H to Section II-M focus on source separation. Section II-N introduces HAAQI, which was designed for hearing aids. Finally, Section II-O discusses recent developments leveraging deep learning. Many of the following measures originally support only mono signals. In this case, we compute the mean of the per-channel outputs when dealing with stereo signals.

A. fwSNRseg

The fwSNRseg [28] quantifies the power ratio of the reference signal and a noise signal, obtained as the difference between the reference and the test signal. The fwSNRseg is computed and weighted for each time frame and each subband of a filterbank with critical-band spacing. The implementation in [5] is used, where the weights are computed from the subband magnitude of the reference raised to the power of 0.2.

B. dLLR

The dLLR [29] is based on the assumption that, over short time intervals, speech can be represented by an all-pole model. Linear Prediction Coefficients (LPCs) are computed for the test signal and for the reference. The two LPC sets predict the reference with certain residual energies. The dLLR is defined as the logarithm of the ratio of these residual energies. We employ the implementation in [5], where the distance is limited to 2 before averaging over time.
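As a concrete illustration, below is a minimal sketch of the dLLR computation, assuming the textbook autocorrelation-method LPC; the reference implementation in [5] differs in windowing and other details, and the frame parameters here are arbitrary choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

def lpc(frame, order=10):
    """LPC via the autocorrelation method (Toeplitz normal equations)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])   # predictor coefficients
    return np.concatenate(([1.0], -a))              # error filter [1, -a1, ..., -ap]

def llr_frame(ref_frame, test_frame, order=10):
    """LLR of one frame: log of the ratio of the residual energies obtained
    when the test and reference LPC sets predict the reference signal."""
    a_ref = lpc(ref_frame, order)
    a_test = lpc(test_frame, order)
    r = np.correlate(ref_frame, ref_frame, mode="full")[len(ref_frame) - 1:]
    R = toeplitz(r[:order + 1])                     # autocorrelation matrix of the reference
    return np.log((a_test @ R @ a_test) / (a_ref @ R @ a_ref))

def dllr(ref, test, frame_len=400, hop=200, order=10):
    """Mean LLR over frames, with each value limited to 2 as in [5]."""
    vals = []
    for start in range(0, min(len(ref), len(test)) - frame_len + 1, hop):
        d = llr_frame(ref[start:start + frame_len],
                      test[start:start + frame_len], order)
        vals.append(min(d, 2.0))
    return float(np.mean(vals))
```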
C. PESQ

PESQ [4], [30], [31] was designed for speech transmitted over telecommunication networks and narrow-band speech codecs. The method comprises a pre-processing stage that mimics a telephone handset. Measures for audible disturbances are computed from the specific loudness of the signals and combined into PESQ scores. From these, a Mean Opinion Score (MOS) [32] is predicted by means of a polynomial mapping function. We use the wideband mode of the ITU reference software [4], which operates at a sampling frequency of 16 kHz. The signals are therefore resampled to this sampling frequency before being fed to the tool. Stereo signals are supported natively. The tool exited with a processing error for 8% of the signals in the PEASS datasets (details in Section III-B); these signals are discarded in the following correlation analysis of PESQ.

D. PEAQ

PEAQ [33], [34] is a measurement scheme for the perceptual evaluation of coded audio signals. Several mid-level perceptual features, called Model Output Variables (MOVs), are derived either by comparing the error signal with estimated masking thresholds or by comparing the internal ear representations of reference and test signal. They are combined by a neural network computing the main output, i.e. the Overall Difference Grade (ODG). Two versions of PEAQ are defined: 1) the Basic version, designed for applications requiring high processing speed, and 2) the Advanced version, designed for higher accuracy at the expense of speed. We use the Basic version by McGill University, publicly available as MATLAB code [35]. Multi-channel signals are natively supported. The individual MOVs were also shown to be good predictors of perceived audio quality for different tasks [21], [36]. We consider the MOVs that exhibited the highest correlation performance in our experiments as well as in [21]: Average Distorted Blocks (ADB), Noise-to-Mask Ratio (NMR), Windowed average of the Modulation Difference #1 (WinModDiff1), and Average Modulation Difference #1 (AvgModDiff1).

E. POLQA

POLQA [23], [37] was developed as a "technology update" for PESQ and was designed to predict the perceived overall speech quality of listening tests that comply with [32] or [38] (the test signals used in this work do not necessarily meet this requirement). POLQA operates in two modes: narrowband or superwideband. We use a proprietary implementation licensed by OPTICOM in the superwideband mode and compare the three main versions of POLQA: Version 1.1 (01/2011), Version 2.4 (09/2014), and Version 3 (03/2018) [39].

F. PEMO-Q

PEMO-Q [40] aims to be a general measure of perceived audio quality for any type of audio signal and audio signal distortion. It is an extension of previous work on speech quality assessment [41]. The measurement scheme compares the internal ear representations of the reference and the test signal, like PEAQ and POLQA. The internal representations are estimated using a psychoacoustic model [42]. Three-dimensional (time, frequency, and modulation) internal representations of the signals are obtained, and the cross-correlation coefficient between the test and reference representations is calculated and used as a measure of the perceived similarity, i.e. the Perceptual Similarity Measure (PSM). A regression function based on subjective data is then applied to map the PSM to the ODG. For consistency with PEASS (Section II-J), we use the PEMO-Q version used by PEASS and publicly available from [43].

G. ViSQOLAudio

ViSQOLAudio [44] is a metric for estimating the quality of general coded audio at 48 kHz, developed from the Virtual Speech Quality Objective Listener (ViSQOL) [45], [46], which focused on speech signals. Both metrics are based on a model of the peripheral auditory system that creates spectrotemporal internal representations of the signals called neurograms. These are compared via an adaptation of the structural similarity index, originally developed for evaluating the quality of compressed images and later adapted to predict intelligibility [47]. Version 3 was recently released [48], [49] and is here referred to as ViSQOLAudioV3. The declared aim of this new version is to "fill the blind spots in the training/validation datasets" so as to have a more general system that would perform better "in the wild". This tool internally down-mixes multi-channel signals to mono.
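The following minimal sketch illustrates, in strongly simplified form, the core idea shared by PEMO-Q- and ViSQOL-style metrics: map both signals to an internal representation and compare the representations with a similarity index. Here a compressed magnitude spectrogram stands in for the actual auditory models of [42] and [44]-[46], and a plain cross-correlation coefficient stands in for PSM/NSIM; none of this should be read as the actual implementations.

```python
import numpy as np
from scipy.signal import stft

def internal_representation(x, fs, n_fft=1024, hop=512):
    """Crude stand-in for an auditory internal representation: a compressed
    magnitude spectrogram (real models add auditory filterbanks, adaptation
    loops, and modulation filtering)."""
    _, _, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(X) ** 0.3   # amplitude compression, loosely mimicking loudness

def similarity(ref, test, fs):
    """PSM-style score: linear cross-correlation coefficient between the two
    internal representations (cf. the PSM of PEMO-Q)."""
    R = internal_representation(ref, fs).ravel()
    T = internal_representation(test, fs).ravel()
    R = R - R.mean()
    T = T - T.mean()
    return float(R @ T / (np.linalg.norm(R) * np.linalg.norm(T) + 1e-12))
```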
H. BSSEval

BSSEval [50] is a multi-criteria performance evaluation toolbox. The toolbox is widely used in the source separation community and was used as the main figure of merit in several community-based evaluation campaigns from 2007 [51] to 2018 [52]. BSSEval projects the estimated source onto the subspace spanned by all reference source signals, including filtered versions thereof. This filter can be time-variant and its length can be adjusted by the user; in practice, a time-invariant 512-tap FIR filter is normally used. The estimated signal is thereby decomposed into a target signal s_target and components meant to relate to different types of error: spatial distortion (e_spatial), interference from other sources (e_interf), and projection error, interpreted as artifacts (e_artif). Energy-based signal-to-error ratios are computed from these components and expressed in dB. Two modes are available: sources (only for mono sources) and source images (i.e. multi-channel sources). We use the images mode, but we do not consider the spatial distortion, i.e. the spatial error is lumped into the target: s_target := s_target + e_spatial. The Source to Distortion Ratio (SDR), Source to Interference Ratio (SIR), and Source to Artifact Ratio (SAR) are considered. We use version 3.0 of the MATLAB toolbox [53]. We limit the range of the output metrics to [−30 dB, 30 dB].

I. SI-BSSEval

Starting from the premise that BSSEval has "generally been improperly used and abused, resulting in misleading results", modified and simpler definitions of the BSSEval measures were proposed in [54]. These are called scale-invariant (SI), i.e. SI-SDR, SI-SAR, and SI-SIR, and they are particularly recommended by their authors for single-channel separation evaluation. The main difference to BSSEval is the usage of a single coefficient α to account for scaling discrepancies, instead of the full 512-tap filter. Hence, a broadband scaling is the only forgiven difference with respect to the reference signal. The measures are defined such that the following relationship holds:

10^(−SI-SDR/10) = 10^(−SI-SIR/10) + 10^(−SI-SAR/10).

We use our own implementation of the measures, following the description in [54]. For stereo signals (not covered in the original paper), we compute α considering all channels during the projection, as per BSSEval. Also in this case, we limit the range of the output metrics to [−30 dB, 30 dB].
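SI-SDR can be written down in a few lines; the sketch below follows the description in [54] and the conventions stated above (α computed over all channels, outputs limited to ±30 dB). SI-SIR and SI-SAR would additionally require projecting the residual onto the subspace spanned by all reference sources, which is omitted here.

```python
import numpy as np

def si_sdr(reference, estimate, clip_db=30.0):
    """Scale-invariant SDR as described in [54]: a single broadband gain is
    the only forgiven difference between estimate and reference. For stereo
    signals, flattening the arrays computes the gain over all channels."""
    s = np.asarray(reference, dtype=float).ravel()
    s_hat = np.asarray(estimate, dtype=float).ravel()
    alpha = np.dot(s_hat, s) / np.dot(s, s)        # optimal scaling coefficient
    target = alpha * s                              # scaled-reference target component
    error = s_hat - target                          # everything else is distortion
    ratio = np.sum(target ** 2) / (np.sum(error ** 2) + 1e-12)
    sdr = 10.0 * np.log10(ratio + 1e-12)
    return float(np.clip(sdr, -clip_db, clip_db))   # range limited as in the text
```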
J. PEASS

PEASS [55] was proposed for the perceptual assessment of separated audio source signals and was developed as a perceptually motivated successor of BSSEval. Perceptual similarity scores are computed in a two-stage fashion: error signals reflecting different types of signal distortion are computed from the estimated source signal similarly to BSSEval, but in a time-frequency-selective manner using a gammatone filterbank. Differently from BSSEval, the perceptual salience of the error signals reflecting the target source, the interfering sources, and the artifacts is assessed with PEMO-Q, i.e. considering the perceptual relevance of the error signals. For this purpose, a reference signal is first generated for each error type by subtracting the corresponding error signal from the estimated source signal. The resulting perceptual similarity scores (q_global, q_target, q_interf, q_artif) are mapped using a small neural network trained on subjective ratings to form measures of different aspects of perceptual audio quality. Herein, the Overall-, Artifact-, and Interference-related Perceptual Scores (OPS, APS, and IPS) are considered. We use Version 2.0.1 of the PEASS software [43], where a 2-layer neural network is used for generating the output metrics. This version includes substantial differences with respect to the original proposal in [55]; the authors report that these modifications "greatly improve correlation with human assessments". Multi-channel signals are natively supported.

K. LKR-PI

LKR-PI [56] is a measure of perceived musical artifacts (a.k.a. artificial noise, musical noise, or birdies) caused by spectral holes or islands. It is based on the measured change in spectral kurtosis between before and after processing. The change in spectral kurtosis is measured in a black-box fashion, i.e. without assumptions on the distribution of the signal power spectra. This is combined with a perceptually motivated pre-processing. For the source separation domain, the measurement is performed only on the leaking interferer, where the spectral kurtosis should not change. This excludes the portions of the signal where the target source is active. In the correlation analysis of LKR-PI, we ignore the signals for which the excluded portion is larger than 95% of the total length. About 40% of the signals in the following evaluation have to be discarded for this domain, while no signal is excluded in the audio coding domain.

L. 2f-Model

The 2f-model [16] estimates the perceived quality of separated source signals, driven by two MOVs from PEAQ Basic: ADB and AvgModDiff1. ADB estimates the amount of noticeable distortions in units of the just noticeable level difference between test and reference signal. AvgModDiff1 assesses differences in the temporal modulation of the loudness envelopes of the two signals. The two MOVs are computed with the publicly available PEAQ Basic version provided by McGill University [35]. The parameters for combining these MOVs into the final 2f-model score were newly computed for this implementation and are available online [57]. This differs slightly from the original proposal in [16], in which an internal PEAQ implementation (and thus a different set of combining parameters) was used.

M. SI-SA2f

For BAQ, any perceived deviation from the reference signal is considered a degradation. However, it is often of interest to assess the presence of artifacts independently from the interferer reduction, e.g. in applications such as source separation. For this purpose, we propose to combine the signal decomposition used in SI-BSSEval with the perceptual model offered by the 2f-model. We name this novel measure SI-SA2f (i.e. a blend of SI-SAR and the 2f-model); it has the same aim as (SI-)SAR, APS, and LKR-PI, i.e. assessing the amount of artifacts independently of the interferer reduction. Starting from the signal under test y and the ground-truth sources, the SI-BSSEval signal decomposition provides the signals s_target, e_interf, and e_artif, where y = s_target + e_interf + e_artif. SI-SA2f is obtained by running the 2f-model on the signal under test y, using s_target + e_interf as the reference signal (see the sketch after this list). Hence:
- If e_artif → 0, then SI-SA2f → 100.
- If e_artif → y (and s_target → 0), then SI-SA2f → 0.
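The construction can be summarized in a few lines of Python. Both si_bss_decompose and two_f_model below are hypothetical placeholders: the former stands for an SI-BSSEval-style decomposition as described above, the latter for a wrapper around the public 2f-model implementation [57].

```python
def si_sa2f(y, reference_sources, si_bss_decompose, two_f_model):
    """SI-SA2f sketch: score only the artifact component of the signal under
    test y, ignoring the interferer level.

    si_bss_decompose(y, reference_sources) -> (s_target, e_interf, e_artif)
        hypothetical SI-BSSEval-style decomposition with
        y == s_target + e_interf + e_artif.
    two_f_model(test, ref) -> score in [0, 100]
        hypothetical wrapper around the public 2f-model implementation [57].
    """
    s_target, e_interf, e_artif = si_bss_decompose(y, reference_sources)
    # The reference presented to the 2f-model keeps the interferer, so the
    # only remaining difference between test and reference is e_artif:
    ref = s_target + e_interf
    return two_f_model(test=y, ref=ref)  # e_artif -> 0 gives ~100; e_artif -> y gives ~0
```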
N. HAAQI

HAAQI [58] was designed to predict music quality for individuals listening through hearing aids. The index is based on a model of the auditory periphery [59], extended to potentially include the effects of hearing loss. This model is fitted to a dataset of quality ratings made by listeners having normal or impaired hearing. The rated signals feature musical content, modified by different types of processing found in hearing aids. Some of these processes are also common in audio coding and source separation. The hearing loss simulation can be bypassed, making the index valid also for normal-hearing listeners; we use HAAQI in this normal-hearing mode. An implementation provided by the original author is used in this investigation. Based on the same auditory model, the authors of HAAQI also proposed a speech quality index (HASQI) and a speech intelligibility index (HASPI); references are given in [58].

O. DNN-Based Approaches

All measures reviewed so far originate outside the deep learning paradigm. DNNs have gained a lot of momentum over the past few years, and DNNs for estimating perceived audio quality have been proposed. Two main approaches can be identified. The first one uses a large amount of subjective perceptual scores to train new DNN-based objective measures [60]. The second approach is to train DNNs to estimate existing measures (such as the ones described in the previous sections), e.g. with the aim of making them completely or partially non-intrusive [61]-[64]. Of the referenced works, only [63] provides the trained DNNs [65], referred to as Waveform Evaluation Networks (WEnets). These are four DNNs, trained to predict PESQ, POLQA, PEMO-Q, or the Short-Time Objective Intelligibility (STOI) without reference signals. The four networks were tested here (with input level normalization active, 3-second segments, 50% overlap, and averaging over the estimated scores for one item). The DNN achieving the best overall performance is reported in the following; this is the one predicting PESQ, referred to as WEnets PESQ.
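To make the second approach concrete, the following is a minimal, purely illustrative sketch of distilling an existing intrusive measure into a non-intrusive DNN. The architecture, shapes, and data below are stand-ins invented for this sketch; actual systems such as WEnets [63] differ substantially.

```python
import torch
import torch.nn as nn

# Hypothetical non-intrusive quality predictor: maps a (batch, 1, time)
# waveform segment to one scalar score, trained to mimic scores that an
# existing intrusive measure produced offline on a corpus of processed signals.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=16), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=16, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in data: 3-second segments at 16 kHz plus teacher scores that would
# come from running the intrusive measure (reference + test) beforehand.
segments = torch.randn(8, 1, 48000)
teacher_scores = torch.rand(8, 1) * 4.0 + 1.0  # e.g. a MOS-like 1-5 range

for step in range(10):  # a few illustrative optimization steps
    optimizer.zero_grad()
    loss = loss_fn(model(segments), teacher_scores)
    loss.backward()
    optimizer.step()
```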
III. Listening Test Datasets

This section describes the datasets of subjective reference ratings, which will be used as ground truth for the correlation analysis in the following sections. An overview of these datasets is given in Table I, together with the number of ratings from each listening test. We consider 7 listening tests in the audio coding domain (from 2 independent sources) and 7 listening tests in the source separation domain (also from 2 independent sources). The ground-truth perceptual scores consist of averages over all ratings given to each signal. All the listening tests followed MUSHRA or MUSHRA-like procedures for assessing the intermediate quality of audio signals. The perceptual scores of the considered listening tests span the full quality scale, from poor to excellent quality, which is an important factor to consider while interpreting the correlation results in the following sections. Further research is required for domains where only a small portion of the quality scale is spanned or where only small impairments are observed.

A. Audio Coding

1) Coding Artifacts [66]: In this set of listening tests, 16 subjects assessed the quality of signals that were distorted in a controlled fashion with different monaural coding artifacts, so as to simulate sub-optimal audio coding operating points. Each distortion was applied to a different set of 8 musical signals, with no overlap between sets. The following 5 distortions were considered, each applied at 5 different coarse quality levels:
- Birdies, i.e. warbling artifacts generated by spectral holes or islands.
- Bandwidth limitation (BW Lim), i.e. low-pass-filtered versions with an adapted crossover frequency.
- Pre-echoes, i.e. fuzzy onsets, imprecise percussion timing, and ghost voice for speech signals.
- Tonality or harmonicity mismatch, i.e. simulating a suboptimal bandwidth extension, where all spectral content above a given crossover frequency was replaced by a scaled copy of the remaining lower part of the spectrum.
- Unmasked noise, i.e. simulating a suboptimal bandwidth extension, where all spectral content above a given crossover frequency is substituted by random noise with the same spectral envelope.

2) USAC Verification Tests [67]: Three verification tests were run to assess the BAQ of Unified Speech and Audio Coding (USAC) [67], in which USAC was compared with AMR-WB+ and HE-AAC v2 at different bit-rates. We consider Test 1 (USAC t1) and Test 3 (USAC t3). Excluding the items used during the listener training, USAC t1 and USAC t3 contain the same 24 audio excerpts. USAC t1 considers only the first [...]

B. Source Separation

1) PEASS [55]: The PEASS dataset was used for the development of PEASS. The dataset contains separated sources and specifically defined anchor signals, including listener ratings on global quality (i.e. BAQ), preservation of the target source, suppression of other sources, and absence of additional artificial noise for each audio signal. The following evaluation considers the ratings regarding the global quality (referred to as PEASS OPS LT) and the ratings on the absence of additional artificial noise (referred to as PEASS APS LT).

2) Subjective Evaluation of Blind Audio Source Separation (SEBASS) [16], [57]: The SEBASS dataset is a collection of five listening tests on the BAQ of separated audio sources from blind and informed source separation systems. These listening tests are referred to as: SASSEC, SiSEC08, PEASS BAQ, SiSEC18, and SAOC DB. In each listening test, except SAOC DB, the listeners rated separated signals submitted as part of community-based signal separation evaluation campaigns, as indicated by the names of the datasets. PEASS BAQ contains the signals from PEASS OPS LT, but with ratings from [16]. The main difference from PEASS OPS LT in terms of listening test design is that the listeners of PEASS BAQ were not instructed to rate the worst item with 0; instructing listeners to rate the worst item with 0 is not compliant with MUSHRA. SAOC DB contains scores investigating the influence on the quality of a separated source when an enhanced t/f rendering architecture, as offered by MPEG Spatial Audio Object Coding (SAOC) [68], is used for acoustic reproduction [69]. Separated source signals from SASSEC were used to drive the enhanced rendering architecture, and the resulting signals were evaluated alongside the original separated signals. The ratings relative to the original separated signals are not considered part of SAOC DB in the following, as ratings for the same signals are already contained in SASSEC. As a technology, SAOC is an interesting case where (informed) source separation and audio coding overlap [68], [70], [71].

IV. Correlation Analysis

In order to assess the performance of the considered objective measures (Section II), a correlation analysis of the measure outputs with the subjective scores (Section III) is carried out. For each listening test, Pearson's and Kendall's correlation coefficients are computed. All signals from the datasets are re-sampled to 48 kHz, or to 16 kHz for PESQ and WEnets PESQ (their highest supported sampling frequency).
Given the ground-truth perceptual scores X for a set of signals and the outputs Y of a measure run on the same signals, the Pearson's correlation coefficient ρ is computed as:

ρ(X, Y) = Σ_i (X_i − X̄)(Y_i − Ȳ) / ( √(Σ_i (X_i − X̄)²) · √(Σ_i (Y_i − Ȳ)²) ),

where i serves as index for the signals in the considered listening test and the over-line indicates the mean over all signals, e.g. X̄ = (1/n) Σ_i X_i for n signals. Pearson's correlation measures the linear correlation between X and Y, i.e. how close their relationship is to a first-order polynomial: ρ(X, Y) = 1 indicates total positive linear correlation, ρ(X, Y) = 0 indicates no correlation at all, and ρ(X, Y) = −1 indicates total negative correlation. As we are not interested in distinguishing between positive and negative correlation, but in how strong the correlation is, the absolute value of ρ is reported, ranging between 0 and 1.

Pearson's ρ can be significantly smaller than 1 even when the ranking of the elements in X is identical to that of the elements in Y. For this reason, we also consider Kendall's rank correlation τ, which is a measure of the ordinal association (or ranking):

τ(X, Y) = K / (n(n − 1)/2),

where n is the number of signals and K corresponds to the number of concordant pairs minus the number of discordant pairs, i.e.:

K = Σ_{i<j} c(X, Y, i, j),

where i and j serve as indices for the signals in the listening test and c(X, Y, i, j) measures the concordance of a pair:

c(X, Y, i, j) = sign((X_i − X_j)(Y_i − Y_j)).

In order to make the values of τ more comparable with those of ρ, τ is mapped to τ′, and the absolute value of τ′ is reported in the following. The mapping is as follows [72]:

τ′ = sin(πτ/2).

These two metrics (ρ, τ′) have the advantage of being scale-independent, which is desired in our analysis, where measures with outputs on different scales are compared. E.g., PEASS and the 2f-model range from 0 to 100 (MUSHRA scores), while PEAQ estimates an ODG ranging from −4.0 to 0. Other metrics (such as the ones suggested in [73], e.g. the Root Mean Square Error) are not scale-independent and are not adopted here.

The statistical significance of the correlation coefficients is also tested (t-test, two-tailed, α = 0.05). Tables III, IV, and V report an asterisk on the coefficients for which the null hypothesis could be rejected.

A meta-analysis is conducted in which the correlation coefficients for a number of experiments (i.e. a subjective data pool) are aggregated into one score, referred to as the aggregated score. This aggregation is done by applying the Fisher-z transform to the correlation coefficients, calculating the mean in this domain (where the sampling distribution of the resulting coefficients is approximately normal), and inverting the transformation [74]. The coefficients corresponding to the datasets used during the development of a given measure are not considered in the computation of the aggregated score for that measure. This is noted by (†) next to the ignored coefficients in Tables III, IV, and V. The Fisher-z transform is defined as:

z(γ) = (1/2) ln((1 + γ)/(1 − γ)) = arctanh(γ),

where γ can be either ρ or τ′. If γ = ρ, the aggregated score for the Pearson's correlation is computed, noted as ρ̄. If γ = τ′, the aggregated score for the Kendall's correlation is computed, noted as τ̄. In Tables III, IV, and V, the objective measures are displayed in descending order according to ρ̄.

The statistical significance of the difference between couples of aggregated scores is also tested in the Fisher-z domain [73]. Also for this statistical test, the two-tailed t value for α = 0.05 is used as the significance threshold. The smallest statistically significant differences are shown by columns A and B in Tables III, IV, and V: the aggregated score ρ̄ for measure A (see symbol reported in column A) is significantly different from the aggregated score for measure B (same symbol in column B), and it is not significantly different from the measures listed in between. Taking Table III as an example, the 2f-model ((φ) in column A) differs significantly from SI-SA2f ((φ) in column B) and from all other measures following it to the end of the list.
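A compact sketch of this evaluation pipeline is given below: per-test coefficients, the τ-to-τ′ mapping [72], and Fisher-z aggregation [74]. The significance tests are omitted, and the example data at the end is invented for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau

def per_test_coefficients(x, y):
    """Absolute Pearson rho and Kendall tau mapped to tau' = sin(pi*tau/2) [72]."""
    rho = abs(pearsonr(x, y)[0])
    tau = kendalltau(x, y)[0]
    return rho, abs(np.sin(np.pi * tau / 2.0))

def aggregate(coefficients):
    """Aggregated score: mean in the Fisher-z domain [74], then inverted.
    z = arctanh(gamma); the inverse transform is tanh."""
    z = np.arctanh(np.clip(coefficients, 0.0, 0.999999))  # guard against z -> inf
    return float(np.tanh(np.mean(z)))

# Usage: one (rho, tau') pair per listening test, then aggregate each column.
tests = [(np.array([80.0, 60.0, 30.0, 10.0]),   # invented perceptual scores X
          np.array([0.9, 0.7, 0.4, 0.2]))]      # invented measure outputs Y
rhos, taus = zip(*(per_test_coefficients(x, y) for x, y in tests))
rho_bar, tau_bar = aggregate(rhos), aggregate(taus)
```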
V. Results

The following presentation of the results is divided into three parts, considering: BAQ in the audio coding domain (Section V-A, Table III); BAQ in the source separation domain (Section V-B, Table IV); and artifacts-only ratings in source separation (Section V-C, Table V). The aggregated scores ρ̄ and τ̄ for some selected measures are summarized in Table II.

A. BAQ in the Audio Coding Domain (Table III)

In the audio coding domain, the best aggregated scores are exhibited by the 2f-model (ρ̄ = 0.90, τ̄ = 0.91). SI-SA2f shows similar aggregated scores (ρ̄ = 0.87, τ̄ = 0.89), but with remarkable differences, especially for pre-echoes, as shown in Fig. 1. Considering the aggregated scores, the novel SI-SA2f outperforms the other artifacts-related measures: (SI-)SAR, APS, and LKR-PI. Both the 2f-model and SI-SA2f were designed in the source separation domain, but the underlying MOVs were developed in the audio coding domain. The lowest correlation coefficients observed for the 2f-model are for the listening test on tonality mismatch (ρ = 0.65, τ = 0.71), for which also the underlying MOVs (ADB and WinModDiff1) show weak correlation. This is one of the most challenging listening tests to predict in this domain, with only NMR, PEAQ ODG, and APS showing ρ and τ > 0.80. Even more challenging is the listening test on unmasked noise, for which only NMR shows ρ and τ > 0.80.

The four considered PEAQ MOVs are among the top ten aggregated scores, showing higher aggregated scores than their combination inside the PEAQ ODG, as also observed in [21]. Among the top five aggregated scores, three are achieved by measures not explicitly calibrated for audio coding, i.e. the 2f-model, SI-SA2f, and HAAQI. Ignoring the MOVs, the best measures from the audio coding domain are found at Rank 8 (PEAQ ODG) and Rank 9 (ViSQOLAudio). The different versions of POLQA are at Rank 15 (v1), 21 (v3), and 24 (v2). WEnets PESQ is not able to mimic PESQ in any test.

SI-SIR shows no correlation, as expected, since there is no interference present in this domain. On the other hand, IPS strongly correlates with some of the artifact types, such as BW Lim and Pre-echoes. This can be observed in detail in Fig. 2 and is surprising, as no interfering signal is present in this case. As expected, q_interf is always at its maximum (= 1.0) for these signals. PEASS Version 2 takes q_interf, q_artif, and q_global as inputs for the final 2-layer neural network producing the IPS. Here, q_global seems to have a decisive impact on the final IPS, even though IPS should only relate to the interference. Over all measures, the best aggregated scores are achieved on the BW Lim dataset, while the worst are achieved on Tonality Mismatch and Unmasked Noise.

B. BAQ in the Source Separation Domain (Table IV)

In the source separation domain, the best measure is again the 2f-model (ρ̄ = 0.86 and τ̄ = 0.82), even if with slightly lower aggregated scores than in the audio coding domain. The 2f-model shows similar performance to ADB, with which no significant difference is observed.
The other aggregated scores up [...]

It might seem counterintuitive at first, but it is to be expected that the measures designed to assess only artifacts (SI-SA2f, APS, (SI-)SAR, and LKR-PI) perform better in the audio coding domain, even though they were designed for source separation. In both domains, listeners assessed BAQ and judged any and all perceived differences between the reference and the test signals, but only the signals in the source separation domain also present non-artifact-related differences with respect to the reference, i.e. an interferer of varying level. In other words, assessing only artifacts and assessing BAQ are the same task in the considered listening tests about audio coding, while they are very different tasks in source separation. The best of these measures is again SI-SA2f (ρ̄ = 0.65). The non-perceptual, signal-based measuring methods, such as BSSEval, again dominate the lower third of the ranking. WEnets PESQ is able to mimic PESQ only for SASSEC and not for the other listening tests.

C. Artifacts-Only Ratings in Source Separation (Table V)

In source separation, it is often of interest to assess the interferer reduction and the presence of distortions, artifacts, and colorations independently rather than jointly [55], [75]. This can give useful diagnostic information, e.g. for supporting the interpretation of a listening test [76] or for controlling the amount of interferer reduction such that a desired artifacts-related quality level is met [77]. As far as perceptual scores assessing exclusively artifacts are concerned, only one dataset is available (PEASS APS LT), in which listeners rated the quality in terms of absence of additional artificial noise. Conclusions should therefore be corroborated on more data for this scenario.

APS yields the highest scores, but it was trained on this dataset. LKR-PI shows no statistically significant difference from APS, but the dataset was used as a validation set for this measure. SI-SA2f is the first system in the list for which the data was completely unknown (ρ = 0.64 and τ = 0.67). Furthermore, the novel SI-SA2f outperforms the remaining artifacts-related measures, (SI-)SAR. Also, SI-SA2f was shown to consistently outperform APS and the other artifacts-related measures in the other considered cases (Table III and Table IV). AvgModDiff1 follows with ρ = 0.55 and τ = 0.46. This MOV assesses the differences in the temporal modulation of the loudness envelopes of the reference and the test signal; these differences can be an indicator of the presence of artifacts-related distortions. All other considered measures either generally assess the differences between test and reference signal or are tailored to assess other specific aspects (e.g. the interferer). As expected, these measures perform poorly on this dataset.

More work is still to be done on both fronts.

VI. Discussion

Purely signal-based measurement methods without perceptual aspects, e.g. SDR, perform worst in both domains, but perceptually motivated measures also show their limits, especially with more modern distortion types, e.g. tonality mismatch and unmasked noise (suboptimal parametric coding of the higher frequency bands). On the other hand, more research and training data are needed for improving the correlation performance, especially in the source separation domain, where the correlation scores are generally lower. OPS from the PEASS toolkit shows high correlation on the source separation datasets only for the PEASS OPS LT and PEASS BAQ datasets. This may indicate that the training dataset was too limited or over-fitted.
Moreover, PEMO-Q is used internally by OPS as its perceptual model, yet PEMO-Q performs overall significantly better than OPS in the source separation domain.

Besides the 2f-model, ADB, PESQ, and PEMO-Q, the performance of the remaining measures shows a certain domain dependence or medium to low correlation in general. Some (e.g. PEAQ ODG) reveal lower performance on the unknown domain. Surprisingly, others achieve better performance on the unknown domain (e.g. POLQA).

When assessing exclusively the perceptual relevance of artifacts, SI-SA2f seems to be the most promising measure. Compared to the measures performing best in this task (APS and LKR-PI, Table V), SI-SA2f was not calibrated or validated on this dataset and showed better robustness on the other datasets (Tables III and IV). More test data is needed, however, for this special case.

Similar but not identical results were observed for the 2f-model and SI-SA2f in the audio coding experiments. Only artifacts-related degradations are present in the listening tests for this domain, so intuition would suggest that the results should be identical for the 2f-model and SI-SA2f. However, the two measures rely on different definitions of artifacts. The 2f-model leverages a perceptually motivated definition as per the PEAQ MOVs, which is a well-established approach in the audio coding community. SI-SA2f uses the purely signal-based definition of BSSEval, which is a popular approach in the source separation community. Here, the artifacts are defined as the projection error remaining when trying to explain the separated target source signal by a linear projection onto the original source signals: everything that cannot be explained by a filtered version of the original source signals is considered an artifact. This is not necessarily congruent with the perceptual definition of artifacts and highlights the importance of a more perceptually motivated signal decomposition.

Finally, WEnets PESQ showed very low correlation in almost all experiments. It has to be noted that WEnets PESQ operates without a reference signal, so the task for this measure is significantly more difficult than for all the other measures. The original work reports a Pearson correlation of 0.97 with PESQ [63], where training and testing signals were speech items processed by different speech codecs followed by noise suppression. This type of material fits PESQ's original domain, but not many of our experiments, where, e.g., music and very different types of processing are also present. This domain mismatch has much more dramatic consequences for the purely data-driven method (WEnets PESQ) than for the perceptually motivated processing steps of PESQ.

VII. Conclusion

By aggregating the correlation coefficients from 13 listening tests covering a range of application domains and distortions, the possibility of a domain-independent model for predicting Basic Audio Quality (BAQ) has been shown. However, only a very limited number of the considered state-of-the-art tools exhibit domain independence along with high correlation scores. The source separation domain appears to be a particularly challenging application domain, with only two measures showing aggregated scores ≥ 0.80. In this application domain, both the interferer level and artifacts and colorations contribute to the final BAQ. The 2f-model showed the best aggregated correlation for both the audio coding domain (ρ̄ = 0.90, τ̄ = 0.91) and the source separation domain (ρ̄ = 0.86, τ̄ = 0.82).
This model uses perceptual features from the audio coding domain, developed more than 20 years ago as part of PEAQ, while their combination is trained with data from the source separation domain. This mix of domains, the validity of the underlying perceptual model, and the varied training data appear to be successful strategies for addressing a wide range of audio quality degradations. The main output of PEAQ itself shows good correlation only for the domain on which it was trained, and it is also often outperformed by its individual underlying features. This suggests that the auditory models themselves still hold their validity (with the only exception of parametric bandwidth extension), while more heterogeneous data is needed for training and calibration. The same conclusion can be drawn when analyzing OPS, which performs well only on the signals on which it was trained.

Besides the 2f-model, also ADB, PESQ, and PEMO-Q show similar correlation scores on both domains along with medium-to-high correlation (0.72 ≤ ρ̄ ≤ 0.77), as can also be observed in Fig. 3. DNN-based methods are still in an early phase of development, mostly because of the difficulty of collecting large amounts of ground-truth subjective scores. Training DNNs to predict the output of an existing measure (e.g. so as to make it non-intrusive) is an alternative that still needs to be proven robust in the actual application. If only artifacts and colorations are to be assessed, regardless of the interferer level, the most promising measure seems to be SI-SA2f, a novel measure based on the 2f-model and preceded by a BSSEval-based signal decomposition.

REFERENCES
[1] Method for the subjective assessment of intermediate quality level of audio systems.
[2] Preliminary guidelines for subjective evaluation of audio source separation algorithms.
[3] Objective assessment of speech and audio quality - technology and applications.
[4] Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs.
[5] Speech Enhancement: Theory and Practice.
[6] Multichannel dereverberation for hearing aids with interaural coherence preservation.
[7] A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research.
[8] Supervised speech separation based on deep learning: An overview.
[9] Singing voice extraction with attention-based spectrograms fusion.
[10] A deep learning loss function based on the perceptual evaluation of the speech quality.
[11] Stable training of DNN for speech enhancement based on perceptually-motivated black-box cost function.
[12] Evaluation of objective quality measures for speech enhancement.
[13] Subjective and objective quality assessment of single-channel speech separation algorithms.
[14] On the effect of artificial distortions on objective performance measures for dialog enhancement.
[15] Comparison of subjective and objective evaluation methods for audio source separation.
[16] An efficient model for estimating subjective quality of separated audio source signals.
[17] BSS Eval or PEASS? Predicting the perception of singing-voice separation.
[18] Evaluation of quality of sound source separation algorithms: Human perception vs quantitative metrics.
[19] A perceptually-motivated approach for low-complexity, real-time enhancement of fullband speech.
[20] An objective metric of human subjective audio quality optimized for a wide range of audio fidelities.
[21] Comparing the effect of audio coding artifacts on objective quality measures and on subjective ratings.
[22] Objective assessment of perceptual audio quality using ViSQOLAudio.
[23] Subjective and objective assessment of perceived audio quality of current digital audio broadcasting systems and web-casting applications.
[24] Can we still use PEAQ? A performance analysis of the ITU standard for the objective assessment of perceived audio quality.
[25] Towards a model of perceived quality of blind audio source separation.
[26] Modeling perceptual similarity of audio signals for blind source separation evaluation.
[27] Evaluating physical measures for predicting the perceived quality of blindly separated audio source signals.
[28] A study of complexity and quality of speech waveform coders.
[29] Objective Measures of Speech Quality.
[30] Mapping function for transforming P.862 raw result scores to MOS-LQO.
[31] Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs.
[32] Methods for subjective determination of transmission quality.
[33] Method for objective measurement of perceived audio quality.
[34] Perceptual quality assessment for digital audio: PEAQ - the new ITU standard for objective measurement of the perceived audio quality.
[35] An examination and interpretation of ITU-R BS.1387: Perceptual evaluation of audio quality.
[36] An objective measure of quality for time-scale modification of audio.
[37] Perceptual objective listening quality prediction.
[38] Subjective performance assessment of telephone-band and wideband digital codecs.
[39] Introduction to POLQA v3, implementing the 3rd edition of ITU-T Rec. P.863.
[40] PEMO-Q - A new method for objective audio quality assessment using a model of auditory perception.
[41] Objective modeling of speech quality with a psychoacoustically validated auditory model.
[42] Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers.
[43] The PEASS toolkit, perceptual evaluation methods for audio source separation.
[44] ViSQOLAudio: An objective audio quality metric for low bitrate codecs.
[45] ViSQOL: The virtual speech quality objective listener.
[46] Robustness of speech quality metrics to background noise and network degradations: Comparing ViSQOL, PESQ and POLQA.
[47] Speech intelligibility prediction using a neurogram similarity index measure.
[48] ViSQOL V3: An open source production ready objective speech and audio metric.
[49] ViSQOL V3 software, revision number 92273f7.
[50] Performance measurement in blind audio source separation.
[51] First stereo audio source separation evaluation campaign: Data, algorithms and results.
[52] The 2018 signal separation evaluation campaign.
[53] BSS Eval: A toolbox for performance measurement in (blind) source separation.
[54] SDR - half-baked or well done?
[55] Subjective and objective quality assessment of audio source separation.
[56] An improved measure of musical noise based on spectral kurtosis.
[57] Subjective evaluation of blind audio source separation database: SEBASS-DB.
[58] The hearing-aid audio quality index (HAAQI).
[59] An auditory model for intelligibility and quality predictions.
[60] Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network.
[61] Predicting algorithm efficacy for adaptive multi-cue source separation.
[62] Referenceless performance evaluation of audio source separation using deep neural networks.
[63] WAWEnets: A no-reference convolutional waveform-based approach to estimating narrowband and wideband speech quality.
[64] Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM.
[65] WAWEnets reference implementations.
[66] Generation and evaluation of isolated audio coding artifacts.
[67] USAC verification test report N12232.
[68] Spatial audio object coding (SAOC) - the upcoming MPEG standard on parametric object based audio coding.
[69] The influence of the rendering architecture on the subjective performance of blind source separation algorithms.
[70] MPEG-D spatial audio object coding for dialogue enhancement (SAOC-DE).
[71] MPEG-H Audio - the new standard for universal spatial/3D audio coding.
[72] JMASM9: Converting Kendall's tau for correlational or meta-analytic analyses.
[73] Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models.
[74] Averaging correlations: Expected values and bias in combined Pearson rs and Fisher's z transformations.
[75] The dimensions of perceptual quality of sound source separation.
[76] The adjustment/satisfaction test (A/ST) for the evaluation of personalization in broadcast services and its application to dialogue enhancement.
[77] Controlling the perceived sound quality for dialogue enhancement with deep learning.