Evaluating the COVID-19 Identification ResNet (CIdeR) on the INTERSPEECH COVID-19 from Audio Challenges
Alican Akman, Harry Coppock, Alexander Gaskell, Panagiotis Tzirakis, Lyn Jones, Bjorn W. Schuller
2021-07-30

Abstract: We report on cross-running the recent COVID-19 Identification ResNet (CIdeR) on the two INTERSPEECH 2021 COVID-19 diagnosis from cough and speech audio challenges, ComParE and DiCOVA. CIdeR is an end-to-end deep learning neural network originally designed to classify whether an individual is COVID-positive or COVID-negative based on coughing and breathing audio recordings from a published crowdsourced dataset. In the current study, we demonstrate the potential of CIdeR for binary COVID-19 diagnosis on the COVID-19 Cough and Speech Sub-Challenges of INTERSPEECH 2021, ComParE and DiCOVA. CIdeR achieves significant improvements over several baselines.

The current coronavirus pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has infected a confirmed 126 million people and resulted in 2,776,175 deaths (WHO). Mass testing schemes offer the option to monitor and implement a selective isolation policy to control the pandemic without the need for regional or national lockdowns [1]. However, physical mass testing methods, such as the Lateral Flow Test (LFT), have come under criticism, both because the tests divert limited resources from more critical services [2, 3] and because of suboptimal diagnostic accuracy. Sensitivities of 58 % have been reported for self-administered LFTs [4]; this is unacceptably low when used to detect active virus, a context in which high sensitivity is essential to prevent falsely reassured infected test recipients from re-entering society [5].

Investigating digital mass testing methods is an alternative approach, based on findings that suggest a biological basis for identifiable vocal biomarkers caused by SARS-CoV-2's effects on the lower respiratory tract [6]. This has recently been backed up by empirical evidence [7]. Efforts have been made to collect and classify audio recordings of a range of modalities from COVID-positive and COVID-negative individuals, and several datasets have been released that use applications to collect the breath and cough of volunteers. Examples include 'Coughvid' [8], 'Breath for Science', 'Coswara' [9], 'COVID-19 Sounds', and 'Cough Against COVID' [10]. Challenges such as the INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) [11], with its COVID-19 Cough and Speech Sub-Challenges, and Diagnosing COVID-19 using acoustics (DiCOVA) [12] have been organised with this focus. Several studies have proposed machine learning-based COVID-19 classifiers that exploit distinctive sound properties of positive and negative cases in these datasets. [13] and [14] demonstrate that simple machine learning models perform well on these relatively small datasets. In addition, deep neural networks have been applied in [15, 16, 17, 18], with strong performance on the COVID-19 detection task. While some works combine different modalities by computing their representations separately, CIdeR [19] computes a joint representation of several modalities. To our knowledge, the adaptability of this approach to different types of datasets has not previously been explored or reported.
To this end, we present the results of applying the COVID-19 Identification ResNet (CIdeR), a recently developed end-to-end deep learning neural network optimised for binary COVID-19 diagnosis from cough and breath audio [19], to the two COVID-19 cough and speech challenges of INTERSPEECH 2021: ComParE and DiCOVA.

CIdeR [19] is a 9-layer convolutional residual network; a schematic of the model is shown in Figure 1. Each layer, or block, consists of a stack of convolutional layers with Rectified Linear Units (ReLUs). Batch normalisation [20] also features in the residual units, acting as a source of regularisation and supporting training stability. A fully connected layer terminates the model, yielding a single logit which, after sigmoid activation, can be interpreted as an estimate of the probability of COVID-19. As detailed in Figure 1, the network accepts a varying number of modalities: for example, if a participant has provided cough, deep breathing, and sustained vowel phonation recordings, these can be stacked depth-wise and passed through the network as a single instance.

At training time, a window of s seconds (fixed at 6 s for these challenges) is sampled randomly from the audio recording. If the recording is shorter than s seconds, the sample is padded with repeated copies of itself. The sampled audio is then converted into Mel-Frequency Cepstral Coefficients (MFCCs), resulting in an image whose width is proportional to s times the sample rate and whose height equals the number of MFCCs. Three data augmentation steps are then applied to the sample: first, the pitch of the recording is randomly shifted; second, bands of the Mel spectrogram are masked along the time and Mel-coefficient axes; and finally, Gaussian noise is added. At test time, the audio recording is chunked into a set of s-second clips which are processed in parallel; the mean of the resulting logits is returned as the final prediction.
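For concreteness, the sketch below shows in PyTorch how a CIdeR-style pipeline fits together: per-modality MFCCs stacked depth-wise, residual blocks with batch normalisation, a single-logit head, and test-time chunking with logit averaging. The block count, channel widths, sample rate, and MFCC settings are illustrative assumptions rather than the published configuration, and the pitch-shift/masking/noise augmentations are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 16000   # assumed; the exact rate is not restated in the text
WINDOW_S = 6          # the s-second window, fixed at 6 s for these challenges
N_MFCC = 40           # illustrative; CIdeR's exact MFCC count may differ

mfcc = torchaudio.transforms.MFCC(sample_rate=SAMPLE_RATE, n_mfcc=N_MFCC)

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, plus an identity/projection skip."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(h + self.skip(x))

class CIdeRLike(nn.Module):
    """Residual CNN over depth-wise stacked modality MFCCs -> single logit."""
    def __init__(self, n_modalities=3):
        super().__init__()
        self.blocks = nn.Sequential(
            ResidualBlock(n_modalities, 32),   # modalities enter as input channels
            ResidualBlock(32, 64),
            ResidualBlock(64, 128),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 1)            # sigmoid is applied outside the model

    def forward(self, x):                      # x: (batch, n_modalities, n_mfcc, frames)
        h = self.pool(self.blocks(x)).flatten(1)
        return self.fc(h).squeeze(1)           # one logit per instance

def predict(model, waves):
    """Test-time prediction: chunk each modality into s-second clips,
    score all chunks in one batch, and average the logits."""
    model.eval()
    win = WINDOW_S * SAMPLE_RATE
    # Pad recordings shorter than s seconds by repetition, as at training time.
    waves = [w.repeat(win // w.numel() + 1)[: max(win, w.numel())] for w in waves]
    n_chunks = min(w.numel() for w in waves) // win  # align modalities on the shortest
    feats = torch.stack([
        torch.stack([mfcc(w[i * win:(i + 1) * win]) for w in waves])
        for i in range(n_chunks)
    ])                                         # (n_chunks, n_modalities, n_mfcc, frames)
    with torch.no_grad():
        logits = model(feats)
    return torch.sigmoid(logits.mean()).item() # mean logit -> probability of COVID-19
```

The depth-wise stacking means a multimodal instance is treated as one multi-channel image, so the first convolution already mixes information across modalities, which is the joint-representation property highlighted above.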
The DiCOVA team ran baseline experiments for the Track 1 (coughing) sub-challenge, though only the score of the best performing model (an MLP) was reported; for the Track 2 (deep breathing/vowel phonation/counting) sub-challenge, no baseline results were provided. Baseline results were provided for the ComParE challenge, but only the Unweighted Average Recall (UAR) was reported, rather than the Area Under the Receiver Operating Characteristic curve (ROC AUC). To allow comparison across challenges, we therefore created new baseline results for the ComParE sub-challenges and the DiCOVA Track 2 sub-challenge, using the same baseline methods described for the DiCOVA Track 1 sub-challenge. The three baseline models applied to all four sub-challenge datasets were Logistic Regression (LR), a Multi-Layer Perceptron (MLP), and Random Forest (RF), with the same hyperparameter configurations as specified in the DiCOVA baseline algorithm [12].

To provide a baseline comparison for the CIdeR Track 2 results, we built a multimodal baseline model. We followed a strategy similar to the provided DiCOVA baseline algorithm when extracting the features for each modality; however, rather than training separate models for the different modalities, we concatenated the input features from the separate modalities and fed this combined feature set to the baseline models LR, MLP, and RF. We used 39-dimensional MFCCs to represent the input sounds. For LR, we used an L2 penalty term. For MLP, we used a single hidden layer of size 25 with a tanh activation and L2 regularisation, trained with the Adam optimiser at a learning rate of 0.0001. For RF, we built the model with 50 trees, splitting on the Gini impurity criterion.

ComParE hosted two COVID-19-related sub-challenges: the COVID Cough Sub-Challenge (CCS) and the COVID Speech Sub-Challenge (CSS). Both are subsets of the crowdsourced Cambridge COVID-19 sound database [13, 21]. CCS consists of 926 cough recordings from 397 participants, each providing 1-3 forced coughs, for a total of 1.63 hours of audio. CSS is made up of 893 recordings from 366 participants, totalling 3.24 hours; participants were asked to recite the phrase "I hope my data can help manage the virus pandemic" in their native language 1-3 times. The train-test splits for both sub-challenges are detailed in Table 1.

DiCOVA likewise hosted two COVID-19 audio diagnostic sub-challenges, both drawn from the crowdsourced Coswara dataset [9]. The first, Track 1, comprised 1,274 forced cough recordings from 1,274 individuals, totalling 1.66 hours. The second, Track 2, was a multi-modality challenge in which 1,199 individuals each provided three separate recordings: deep breathing, sustained vowel phonation, and counting from 1 to 20, for a total of 14.9 hours. The train-test splits are detailed in Table 2.

The results of the experiments with CIdeR and the three baseline models are detailed in Table 3. CIdeR performed strongly across all four sub-challenges, achieving AUCs of 0.799 and 0.787 in the DiCOVA Track 1 and Track 2 sub-challenges, respectively, and 0.732 and 0.787 in the ComParE CCS and CSS sub-challenges. In the DiCOVA cough sub-challenge, CIdeR significantly outperformed all three baseline models based on 95 % confidence intervals calculated following [22], and in the DiCOVA breathing and speech sub-challenge it achieved a higher AUC, although the improvement over the baselines was not significant. Conversely, while CIdeR performed significantly better than all three baselines in the ComParE speech sub-challenge, again based on 95 % confidence intervals following [22], it performed no better than the baselines in the ComParE cough sub-challenge. One can speculate that this stems from the small dataset sizes favouring the more classical machine learning approaches, which need less training data.
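The confidence intervals above follow [22], i.e., Hanley and McNeil's standard error for the area under an ROC curve. A minimal sketch of a 95 % interval computed with their formula is given below; the example counts are hypothetical, and the challenge organisers' exact procedure may differ in detail.

```python
import math

def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """95% CI for an AUC via the Hanley & McNeil (1982) standard error [22].

    auc   : observed area under the ROC curve
    n_pos : number of positive (COVID-19) test examples
    n_neg : number of negative test examples
    """
    q1 = auc / (2.0 - auc)            # P(two random positives both outrank one negative)
    q2 = 2.0 * auc**2 / (1.0 + auc)   # P(one positive outranks two random negatives)
    se = math.sqrt(
        (auc * (1.0 - auc)
         + (n_pos - 1) * (q1 - auc**2)
         + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    )
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

# Example: an AUC of 0.799 with hypothetical 50 positive and 150 negative test samples.
print(auc_confidence_interval(0.799, 50, 150))
```

Because the standard error scales with the inverse of the positive and negative counts, small test sets with few COVID-positive participants produce wide intervals, which is exactly the limitation discussed next.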
A key limitation of both the ComParE and DiCOVA COVID-19 challenges is the size of the datasets: both contain very few COVID-positive participants. The certainty of the results is therefore limited, which is reflected in the large 95 % confidence intervals detailed in Table 3. This issue is compounded by the demographics of the datasets. As detailed in [13] and [12] for the ComParE and DiCOVA datasets, respectively, not all demographics are represented evenly; most notably, there is poor coverage of age and ethnicity, and both datasets are skewed towards male participants. In addition, the crowdsourced nature of the datasets introduces confounding variables. Audio is a difficult modality to control, as it carries a great deal of information about the surrounding environment. Because both datasets were crowdsourced, there could have been correlations between ambient sounds and COVID-19 status; for example, sounds characteristic of hospitals or intensive care units may have been present more often in COVID-19-positive recordings than in COVID-19-negative ones. Moreover, as the ground-truth labels for both datasets were self-reported, participants presumably knew at the time of recording whether they had COVID-19. One could postulate that individuals who knew they were COVID-19-positive were more fearful at the time of recording than COVID-19-negative participants, an audio characteristic known to be identifiable by machine learning models [23]. The audio features identified by the model may therefore not be audio biomarkers specific to the disease.

We note that both the DiCOVA Track 1 and ComParE CCS sub-challenges used cough recordings, so there was an opportunity to utilise both training sets. However, training on the two datasets together did not yield better performance on either challenge's test set; moreover, a model that performed well on one challenge's test set showed a marked drop in performance on the other's. We ran cross-dataset experiments to analyse this effect further (a minimal sketch of the protocol is given at the end of this section). For these experiments, we also included the COUGHVID dataset [8], in which COVID-19 labels were assigned by experts rather than derived from a clinically validated test. The results in Table 4 show that models trained on each individual dataset generalise poorly to the excluded datasets. This is a worrying finding, as it suggests that audio markers useful for COVID-19 classification in one dataset are not useful, or not present, in another. It agrees with the concerns presented in [24] that current COVID-19 audio datasets are plagued with biases, allowing machine learning models to infer COVID-19 status not from audio biomarkers uniquely produced by COVID-19 but from other correlations in the data, such as nationality, comorbidity, and background noise.

Future work: One of the most important next steps is to collect and evaluate machine learning COVID-19 classification on a larger dataset that is more representative of the population. To achieve optimal ground truth, audio recordings should be collected at the time the Polymerase Chain Reaction (PCR) test is taken, before the result is known. This would ensure full blinding of participants to their COVID-19 status and exclude environmental audio biases from the dataset. The Cycle Threshold (CT) of the PCR test should also be recorded: CT correlates with viral load [25] and would therefore enable researchers to determine the model's classification performance at varying viral loads. This relationship is critical in assessing the usefulness of any model in the context of a mass testing scheme, since the ideal model would detect a viral load lower than the level that confers infectiousness [26, 27]. Finally, studies similar to [7], directly comparing acoustic features of COVID-positive and COVID-negative participants, should be conducted on all publicly available datasets.
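As referenced above, here is a minimal sketch of the cross-dataset (leave-one-dataset-out style) protocol: train on one corpus, evaluate on every other. The file names are placeholders, and the classifier simply reuses the RF baseline configuration for illustration; this is not the exact setup behind Table 4.

```python
from itertools import permutations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical pre-extracted (features, labels) arrays for each corpus.
datasets = {
    "dicova":   (np.load("dicova_X.npy"),   np.load("dicova_y.npy")),
    "compare":  (np.load("compare_X.npy"),  np.load("compare_y.npy")),
    "coughvid": (np.load("coughvid_X.npy"), np.load("coughvid_y.npy")),
}

# Train on each corpus in turn and score on every excluded corpus.
for train_name, test_name in permutations(datasets, 2):
    X_tr, y_tr = datasets[train_name]
    X_te, y_te = datasets[test_name]
    clf = RandomForestClassifier(n_estimators=50, criterion="gini").fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"train={train_name:9s} test={test_name:9s} AUC={auc:.3f}")
```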
Cross-running CIdeR on the two 2021 INTERSPEECH COVID-19 diagnosis from cough and speech audio challenges has demonstrated the model's adaptability across multiple modalities. With little modification, CIdeR achieves competitive results in all challenges, advocating the use of end-2-end deep learning models for audio processing thanks to their flexibility and strong performance.

The support of the EPSRC Center for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged, along with the UKRI CDT in Safe & Trusted AI. The authors further acknowledge funding from the DFG (German Research Foundation) Reinhart Koselleck-Project AUDI0NOMOUS (grant agreement No. 442218748) and the Imperial College London Teaching Scholarship.

References
[1] Weekly covid-19 testing with household quarantine and contact tracing is feasible and would probably end the epidemic
[2] Covid-19: Concerns persist about purpose, ethics, and effect of rapid testing in Liverpool
[3] Newsdesk: Covid-19 testing in Slovakia
[4] Covid-19: Innova lateral flow test is not fit for test and release strategy, say experts
[5] Lateral flow tests need low false positives for antibodies and low false negatives for virus
[6] A framework for biomarkers of COVID-19 based on coordination of speech-production subsystems
[7] The voice of COVID-19: Acoustic correlates of infection
[8] The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
[9] Coswara: A database of breathing, cough, and voice sounds for COVID-19 diagnosis
[10] Cough Against COVID: Evidence of COVID-19 signature in cough sounds
[11] The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 cough
[12] DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics
[13] Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data
[14] COVID-19 patient detection from telephone quality speech data
[15] COVID-19 artificial intelligence diagnosis using only cough recordings
[16] SARS-CoV-2 detection from voice
[17] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app
[18] Detecting COVID-19 from breathing and coughing sounds using deep neural networks
[19] End-2-End COVID-19 detection from breath & cough audio
[20] Batch normalization: Accelerating deep network training by reducing internal covariate shift
[21] Exploring automatic COVID-19 diagnosis via voice and symptoms from crowdsourced data
[22] The meaning and use of the area under a receiver operating characteristic (ROC) curve
[23] Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network
[24] COVID-19 detection from audio: Seven grains of salt
[25] Duration of infectiousness and correlation with RT-PCR cycle threshold values in cases of COVID-19, England
[26] Seventy-third SAGE meeting on COVID-19
[27] Covid-19: Rapid antigen detection for SARS-CoV-2 by lateral flow assay: A national systematic evaluation for mass-testing