title: A Summary of the ComParE COVID-19 Challenges
authors: Coppock, Harry; Akman, Alican; Bergler, Christian; Gerczuk, Maurice; Brown, Chloe; Chauhan, Jagmohan; Grammenos, Andreas; Hasthanasombat, Apinan; Spathis, Dimitris; Xia, Tong; Cicuta, Pietro; Han, Jing; Amiriparian, Shahin; Baird, Alice; Stappen, Lukas; Ottl, Sandra; Tzirakis, Panagiotis; Batliner, Anton; Mascolo, Cecilia; Schuller, Bjorn W.
date: 2022-02-17

The COVID-19 pandemic has caused massive humanitarian and economic damage. Teams of scientists from a broad range of disciplines have searched for methods to help governments and communities combat the disease. One avenue explored in the machine learning field is the prospect of a digital mass test which can detect COVID-19 from infected individuals' respiratory sounds. We present a summary of the results from the INTERSPEECH 2021 Computational Paralinguistics Challenges: COVID-19 Cough (CCS) and COVID-19 Speech (CSS).

Significant work has been conducted exploring the possibility that COVID-19 yields unique audio biomarkers in infected individuals' respiratory signals [8, 41, 20, 36, 5, 28, 27, 4, 29, 31, 6, 11, 26, 30]. This work has shown promising results, although many remain sceptical, suggesting that models could simply be relying on spurious bias signals in the datasets [12, 11]. These worries have been supported by findings that, when sources of bias are controlled, the performance of the classifiers decreases [17, 13]. Along with this, cross-dataset experiments have reported a marked drop in performance.

The datasets used in these challenges are two curated subsets of the crowd-sourced Cambridge COVID-19 Sounds database [8, 42]. COVID-19 status was self-reported and determined through either a PCR or rapid antigen test. The numbers of positive and negative cases for these selected subsets are detailed in Table 1. The submission dates for both COVID-19 positive and negative case recordings are detailed in Figure 1a. Figure 1b shows the age distribution for both the CSS and CCS challenges.

Of the accepted participant papers, two reported results for both CCS and CSS, while two reported results exclusively for CCS and one exclusively for CSS. In this section, we provide a brief overview of the methodologies used in these accepted works, covering data augmentation techniques, feature types, classifier types, and ensemble strategies. Teams that did not have their work accepted at INTERSPEECH 2021 are named NN_X to preserve anonymity, where NN refers to nomen nescio and X is the order in which they appear in Figure 2. The performance, measured in Unweighted Average Recall (UAR), achieved by these methodologies is summarised in Table 2; UAR has been used as a standard measure in the Computational Paralinguistics Challenges at INTERSPEECH since 2009 [34]. It is the mean of the diagonal of the confusion matrix in percent and is therefore fair towards sparse classes. Note that UAR is sometimes called 'macro-average'; see [24].

To combat the limited size and imbalance of the Cambridge COVID-19 Sounds database, the majority of teams used data augmentation techniques in their implementations. Team Casanova et al. applied noise addition and SpecAugment to augment the challenge dataset [9]. Team Illium et al. applied temporal shifting, noise addition, SpecAugment, and loudness adjustment [19].
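To make these augmentation strategies concrete, the following is a minimal sketch of two of them, additive noise and SpecAugment-style masking, assuming NumPy arrays for the waveform and mel-spectrogram; the function names and parameter values are illustrative and are not taken from the teams' implementations.

```python
import numpy as np

def add_noise(waveform: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def spec_augment(mel_spec: np.ndarray, max_freq_mask: int = 8, max_time_mask: int = 16) -> np.ndarray:
    """SpecAugment-style masking: zero out one random frequency band and one random time span."""
    spec = mel_spec.copy()
    n_mels, n_frames = spec.shape
    f = np.random.randint(0, max_freq_mask + 1)           # width of the frequency mask
    f0 = np.random.randint(0, max(1, n_mels - f))
    spec[f0:f0 + f, :] = 0.0
    t = np.random.randint(0, max_time_mask + 1)           # width of the time mask
    t0 = np.random.randint(0, max(1, n_frames - t))
    spec[:, t0:t0 + t] = 0.0
    return spec
```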
Instead of using data augmentation to manipulate the challenge dataset, team Klumpp et al. used three auxiliary datasets in different languages, with the aim of helping their deep acoustic model better learn the properties of healthy speech [22].

The teams chiefly used spectrogram-level features, including mel-frequency cepstral coefficients (MFCCs) and mel-spectrograms. For higher-level features, the teams used the common feature extraction toolkits openSMILE [15], openXBOW [33], DeepSpectrum [3], and auDeep [16], with a simple support vector machine (SVM) built on top of these features. Team Solera-Urena et al. exploited transfer learning to extract feature embeddings from pre-trained TDNN-F [39], VGGish [18], and PASE+ [32] models with appropriate fine-tuning on the challenge dataset. Team Klumpp et al. aimed to extract their own phonetic features using an acoustic model consisting of convolutional neural network (CNN) and long short-term memory (LSTM) components.

Team Solera-Urena et al. [37] and the challenge baseline [35] fitted an SVM to high-level audio embeddings extracted using TDNN-F [39], VGGish [18], and PASE+ [32] models, and the openSMILE framework [15], respectively. While the challenge baseline [35] searched over the SVM complexity parameter in the range 10^-5 to 1, team Solera-Urena et al. [37] explored different kernels (linear, RBF), data normalisations (zero mean and unit variance, [0,1] range), and class balancing methods (majority class downsampling, class weighting). In addition to the SVM, the baseline explored using the multimodal profiling toolkit End2You [38] to train a recurrent neural network using Gated Recurrent Units (GRUs) with 64 hidden units. Team Casanova et al. [9] utilised the deep models SpiraNet [10], CNN14 [23], ResNet-38 [23], and MobileNetV1 [23], exploring kernel size, convolutional dilation, dropout, the number of fully connected neurons, learning rate, weight decay, and the optimiser. Team Klumpp et al. [22] trained SVM and logistic regression (LR) models to perform COVID-19 classification on top of the phonetic features extracted by their deep acoustic model, exploring the SVM complexity parameter in the range 10^-4 to 1. Team Illium et al. [19] adapted a vision transformer [14] to mel-spectrogram representations of audio signals; the Tree-structured Parzen Estimator (TPE) algorithm [1] was used in [19] for hyperparameter search, mainly exploring embedding size, learning rate, batch size, dropout, the number of heads, and head dimension. Teams Solera-Urena et al. and Casanova et al., as well as the baseline, also reported classification results obtained by fusing their best features and classifiers.

Figure 4 visualises a two-sided significance test (based on a Z-test concerning two proportions, [21], section 5B) employing the CCS and CSS Test sets and the corresponding baseline systems [35]. Various levels of significance (α-values) were used to calculate the absolute deviation on the Test set required for a model to be considered significantly better or worse than the baseline systems. Because a two-sided test is employed, the α-values must be halved to derive the respective Z-score used to calculate the p-value of a model for both sides [21].
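As an illustration only, the following is a minimal sketch of a pooled two-proportion Z-test, treating UAR as a proportion over the Test set samples; the exact formulation followed in [21] and used for Figure 4 may differ, so this sketch should not be expected to reproduce the thresholds quoted below.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z(p1: float, p2: float, n1: int, n2: int):
    """Two-sided Z-test for the difference between two proportions (pooled standard error).
    Returns the z statistic and the two-sided p-value."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(p_pool * (1.0 - p_pool) * (1.0 / n1 + 1.0 / n2))
    z = (p2 - p1) / se
    p_value = 2.0 * (1.0 - norm.cdf(abs(z)))   # two-sided: both tails
    return z, p_value

# Winning CCS submission (75.9 % UAR) vs. the baseline (73.9 % UAR), 208 Test samples each:
z, p = two_proportion_z(0.739, 0.759, 208, 208)
print(f"z = {z:.2f}, p = {p:.2f}")   # far from significance at alpha = 0.01
```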
Consequently, significantly outperforming the best CCS baseline system (73.9 % UAR, 208 Test set samples) at a significance level of α = 0.01 requires an absolute improvement of at least 6.7 %; for CSS (best baseline system with 72.1 % UAR and 283 Test set samples), the required improvement is 6.0 %.

Figure 4: Two-sided significance test on the COVID-19 Cough (4a) and Speech (4b) Test sets with various levels of significance according to a two-sided Z-test.

Note that Null-Hypothesis Significance Testing with p-values as the criterion has been criticised from its beginning; see the statement of the American Statistical Association in [40] and [7]. We therefore provide this plot with p-values as a service for readers interested in this approach, not as a guideline for deciding between approaches. Another way of assessing the 'uncertainty' of performance measures is to compute confidence intervals (CIs). [35] employed two different CIs: first, 1000x bootstrapping of the Test set (random selection with replacement), with UARs based on the same model trained on Train and Dev; in the following, the CIs for these UARs are given first. Second, 100x bootstrapping of the combined Train and Dev sets; the different models obtained from these combinations were used to compute UARs on Test and, subsequently, CIs; these results are given second. Note that for this second type of CI, the Test results are often above the CI, sometimes within, and in a few cases below, as can be seen in [35]; evidently, reducing the variability of the training samples through bootstrapping results, on average, in somewhat lower performance. For CCS, with a UAR of 73.9 %, the first CI was 66.0 %-82.6 %; the second one could not be computed because this UAR is based on a fusion of different classifiers. For CSS, with a UAR of 72.1 %, the CIs were 66.0 %-77.8 % and 70.2 %-71.1 %, respectively. Both Figure 4 and the spread of the reported CIs demonstrate the uncertainty of the results, caused by the relatively low number of data points in the test set.

Figures 2 and 3 detail the rankings of the 19 teams which submitted predictions for the test set. We congratulate [9] for winning the COVID-19 Cough Sub-Challenge with a UAR of 75.9 % on the held-out test set. We note that for the COVID-19 Speech Sub-Challenge, no team exceeded the performance of the baseline, which scored 72.1 % UAR on the held-out test set. Significantly outperforming the baseline system for the cough modality at a significance level of α = 0.01, as detailed in Figure 4, would require an improvement of 6.7 %, an improvement which the winning submission fell short of by 4.7 %. For both Sub-Challenges, teams struggled to outperform the baseline. Speculating on why this could be the case, one could suggest one, or a combination, of the following: COVID-19 detection from audio is a particularly hard task; the baseline score, being already a fusion of several state-of-the-art systems for CCS, represents a performance ceiling, and higher classification scores are not possible for this dataset; or, as a result of the limited size of the dataset, the task lends itself to less data-hungry algorithms, such as the openSMILE-SVM baseline models for CSS.

It is important to analyse the level of agreement in COVID-19 detection between participant submissions. This is shown schematically in Figures 5 and 6; a sketch of how such per-instance agreement and prediction fusion can be computed is given below.
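The following is a minimal sketch of these quantities, the per-instance agreement across submissions, simple majority-vote fusion, and the UAR of the fused predictions, assuming each team's predictions are available as a binary vector over the Test set; the data below are random stand-ins, not the actual submissions.

```python
from typing import Dict
import numpy as np

def uar(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted Average Recall: mean of the per-class recalls."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

def agreement(predictions: Dict[str, np.ndarray], y_true: np.ndarray) -> np.ndarray:
    """For each test instance, the fraction of teams whose prediction matches the label."""
    stack = np.stack(list(predictions.values()))       # shape: (n_teams, n_instances)
    return (stack == y_true[None, :]).mean(axis=0)

def majority_vote(predictions: Dict[str, np.ndarray]) -> np.ndarray:
    """Fuse binary predictions from several teams by simple majority voting."""
    stack = np.stack(list(predictions.values()))
    return (stack.mean(axis=0) >= 0.5).astype(int)

# Illustrative usage with random stand-in data (208 CCS Test instances, 19 teams):
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=208)
preds = {f"team_{i}": rng.integers(0, 2, size=208) for i in range(19)}
per_instance_agreement = agreement(preds, y_true)       # instances missed by all teams have value 0
fused_uar = uar(y_true, majority_vote(preds))            # cf. the fusion results in Figures 8 and 9
```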
From Figures 5 and 6, we can see that there are clearly COVID-19 positive cases which teams across the board are able to predict correctly, but there are also positive cases which all teams missed. These findings are reflected in the minimal performance increases of 0.3 % and 0.8 % for the cough and speech tasks, respectively, obtained when fusing the n best submission predictions through majority voting; the results of fusing the n best models are detailed in Figures 8 and 9. This suggests that models from all teams rely on similar audio features when predicting COVID-19 positive cases. In panels (b) and (c) of Figures 5 and 6, participants were selected if they were displaying at least one symptom (b) or no symptoms (c). These figures can be paired with Figure 7, which details the recall scores for positive cases across the same curated test sets. From this analysis, there does not appear to have been a trend across teams to perform favourably on cases where symptoms were being displayed, or vice versa. While this does not disprove worries that these algorithms are simply cough or symptom identifiers, it does not add evidence in support of this claim.

While this challenge was an important step in exploring the possibility of a digital mass test for COVID-19, it has a number of limitations. A clear limiting factor was the small size of the dataset. While many participants addressed this through data augmentation and regularisation techniques, it restricted the extent to which conclusions could be drawn from the results, particularly when investigating teams' performance on carefully controlled subsets of the data. We look forward to the newly released COVID-19 Sounds dataset [42], which represents a vastly greater source of COVID-19 samples.

A further limitation of this challenge is the unforeseen correlation between low sample rate recordings, below 12 kHz, and COVID-19 status. In fact, all low sample rate recordings in the challenge, for both CCS and CSS, were COVID-19 positive; for CCS and CSS there were 30 and 37 low sample rate cases, respectively. The reason for this is that, at the start of the study, the label in the survey for COVID-19 negative was unclear and could have been interpreted as either 'not tested' or 'tested negative'. For this reason, no negative samples from this time period were used, as can be seen in Figure 1a. This early phase of data collection also coincided with the study allowing lower sample rate recordings, a setting which was later changed to restrict submissions to higher sample rates. This resulted in all the low sample rate recordings being COVID-19 positive. As can be seen in Figures 10, 11, 12 and 13, teams' trained models were able to pick up on the sample rate bias, with most teams correctly predicting all the low sample rate cases as COVID-19 positive. When this is controlled for and low sample rate recordings are removed from the test set, as shown in Figures 12 and 13, teams' performances drop markedly. This was also the case for the challenge baselines, with the fusion of baseline models for CCS falling from 73.8 % to 68.6 % UAR and the openSMILE-SVM baseline for CSS dropping from 72.1 % to 70.9 % UAR. This is a striking example of overlooked bias expressing itself as an identifiable audio feature and leading to inflated classification scores. We regret that this was not discovered earlier; a sketch of the kind of metadata audit that could reveal such a bias is given below.
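As an illustration, the following is a minimal sketch of such an audit, assuming a hypothetical metadata CSV with 'filename' and 'label' columns; the file and column names are illustrative and are not part of the challenge tooling.

```python
import pandas as pd
import soundfile as sf

# Hypothetical metadata file with one row per recording (columns: filename, label).
meta = pd.read_csv("train_metadata.csv")

# Read each recording's sample rate from its header (no audio decoding needed).
meta["sample_rate"] = meta["filename"].apply(lambda f: sf.info(f).samplerate)
meta["low_sr"] = meta["sample_rate"] < 12_000

# Cross-tabulate sample rate group against COVID-19 status:
# a low-sample-rate row containing only positives would reveal the bias described above.
print(pd.crosstab(meta["low_sr"], meta["label"]))
```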
As with most machine learning methods, it remains unclear how to interpret the decision-making process at inference time. This makes it difficult to determine which acoustic features a model is correlating with COVID-19; whether these are true acoustic features caused by the COVID-19 infection or other acoustic biases is still an unanswered question [12]. We also note that this is a binary classification task, in that models only had to decide between COVID-19 positive and negative. This 'closed world fallacy' [7] leads to inflated performance, as models are not tasked with discerning between confounding symptoms such as a heavy cold or asthma. Requiring models to identify COVID-19 out of a wide range of possible conditions and symptoms would be a harder task.

In this challenge, participants were provided with the test set recordings (without the corresponding labels). In future challenges, test set instances should be kept private, requiring participants to submit trained models along with pipeline scripts for inference. Teams' test set predictions can then be generated by the challenge organisers. This would help reduce the possibility of overfitting and foul play. We note that there was no evidence of foul play, e. g., training in an unsupervised manner on the test set, in this challenge.

Another limitation of this challenge was the lack of metadata that the organisers could provide to participants. This tied teams' hands to some extent in evaluating the level of bias in the dataset for themselves, and thus limited their opportunity to implement methods to combat it. This was not intended. However, we now point teams towards the newly open-sourced COVID-19 Sounds database [42], which also provides the collected metadata; it is from this dataset that a subset of samples was taken for this challenge.

This challenge demonstrated that there is a signal in crowd-sourced COVID-19 respiratory sounds which allows machine learning algorithms to fit classifiers achieving moderate detection rates of COVID-19 in infected individuals' respiratory sounds. Exactly what this signal is, however, remains unclear. Whether these signals are truly audio biomarkers uniquely caused by COVID-19 or rather identifiable biases in the datasets, such as confounding flu-like symptoms, is still an open question to be answered next.

Figure 5: Schematic detailing the level of agreement between teams for each test instance for the COVID-19 Cough Sub-Challenge.

Here, we present some results from ablation studies of teams' performance, evaluated on curated subsets of the test set. Figure 7 details the effect of controlling for symptom confounders on teams' performance. Figures 12 and 13 repeat this analysis, controlling instead for sample rate. Figures 10 and 11 detail the level of agreement between teams for the low (10a, 11a) and high (10b, 11b) sample rate test cases. Figures 8 and 9 detail the classification performance of a fusion of teams' predictions on the test set.

Figure 10: Schematic detailing the level of agreement, as in Figure 5, with test instances with either a low sample rate (below 12 kHz) (10a) or a high sample rate (above 12 kHz) (10b).

Figure 11: Schematic detailing the level of agreement, as in Figure 6, with test instances with either a low sample rate (below 12 kHz) (11a) or a high sample rate (above 12 kHz) (11b).

Figure 12: Team performance on two curated test sets from the COVID-19 Cough Sub-Challenge.
12a controls for test samples with a sample rate greater than 12 kHz, and 12b controls for test samples with a sample rate of 12 kHz and below. The metric reported is recall for positive cases; 95 % confidence intervals are shown, calculated via the normal approximation method.

Figure 13: Team performance on two curated test sets from the COVID-19 Speech Sub-Challenge. 13a controls for test samples with a sample rate greater than 12 kHz, and 13b controls for test samples with a sample rate of 12 kHz and below. The metric reported is recall for positive cases; 95 % confidence intervals are shown, calculated via the normal approximation method.

References
[1] Optuna: A next-generation hyperparameter optimization framework. CoRR, abs/1907.10902.
[2] Evaluating the COVID-19 Identification ResNet (CIdeR) on the INTERSPEECH COVID-19 from Audio Challenges.
[3] Bag-of-deep-features: Noise-robust deep feature representations for audio analysis.
[4] A Generic Deep Learning Based Cough Analysis System from Clinically Validated Samples for Point-of-Need COVID-19 Test and Severity Levels.
[5] Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds.
[6] The voice of COVID-19: Acoustic correlates of infection in sustained vowels.
[7] Ethics and Good Practice in Computational Paralinguistics.
[8] Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data.
[9] Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021.
[10] Deep learning against COVID-19: Respiratory insufficiency detection in Brazilian Portuguese speech.
[11] End-to-end convolutional neural network enables COVID-19 detection from breath and cough audio: a pilot study.
[12] COVID-19 Detection from Audio: Seven Grains of Salt. The Lancet Digital Health.
[13] Bias and privacy in AI's cough-based COVID-19 recognition - Authors' reply. The Lancet Digital Health.
[14] An image is worth 16x16 words: Transformers for image recognition at scale.
[15] Recent developments in openSMILE, the Munich open-source multimedia feature extractor.
[16] auDeep: Unsupervised learning of representations from audio with deep recurrent neural networks.
[17] Sounds of COVID-19: exploring realistic performance of audio-based digital testing.
[18] CNN architectures for large-scale audio classification.
[19] Visual Transformers for Primates Classification and Covid Detection.
[20] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app.
[21] Test of Hypothesis - Concise Formula Summary.
[22] The Phonetic Footprint of COVID-19?
[23] PANNs: Large-scale pretrained audio neural networks for audio pattern recognition.
[24] An Introduction to Information Retrieval.
[25] DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics.
[26] Detecting COVID-19 from Breathing and Coughing Sounds using Deep Neural Networks.
[27] The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. Scientific Data.
[28] SARS-CoV-2 Detection From Voice.
[29] IATos: AI-powered pre-screening tool for COVID-19 from cough audio samples.
[30] Project Achoo: A practical model and application for COVID-19 detection from recordings of breath, voice, and cough.
[31] Computer Audition for Fighting the SARS-CoV-2 Corona Crisis - Introducing the Multi-task Speech Corpus for COVID-19.
[32] Multi-task self-supervised learning for robust speech recognition.
[33] openXBOW - Introducing the Passau open-source crossmodal bag-of-words toolkit.
[34] The INTERSPEECH 2009 Emotion Challenge.
[35] The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates.
[36] Coswara - A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis.
[37] Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19.
[38] End2You - The Imperial Toolkit for Multimodal Profiling by End-to-End Learning.
[39] State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations.
[40] The ASA's Statement on p-values: Context, Process, and Purpose. The American Statistician.
[41] Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data.
[42] COVID-19 Sounds: A large-scale audio dataset for digital COVID-19 detection.

We acknowledge funding from the DFG's Reinhart Koselleck project No. 442218748 (AUDI0NOMOUS) and the ERC project No. 833296 (EAR).