title: DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics
authors: Muguli, Ananya; Pinto, Lancelot; Nirmala, R.; Sharma, Neeraj; Krishnan, Prashant; Ghosh, Prasanta Kumar; Kumar, Rohit; Bhat, Shrirama; Chetupalli, Srikanth Raj; Ganapathy, Sriram; Ramoji, Shreyas; Nanda, Viral
date: 2021-03-16

The DiCOVA challenge aims at accelerating research in diagnosing COVID-19 using acoustics (DiCOVA), a topic at the intersection of speech and audio processing, respiratory health diagnosis, and machine learning. The challenge is an open call for researchers to analyze a dataset of sound recordings, collected from COVID-19 infected and non-COVID-19 individuals, for a two-class classification task. The recordings were collected via crowdsourcing from multiple countries, through a website application. The challenge features two tracks, one focusing on cough sounds, and the other on a collection of breath, sustained vowel phonation, and number-counting speech recordings. In this paper, we introduce the challenge, provide a detailed description of the tasks, and present a baseline system.

The COVID-19 pandemic has emerged as a significant health crisis. At the time of writing (15-June-2021), more than 175 million cases and more than 3.8 million deaths have been reported by the World Health Organization (WHO) from about 200 countries across the world [1]. Physical distancing and wide-scale population testing have served as key measures to contain the pandemic. The testing methods in use can be broadly divided into molecular and antibody testing. In molecular testing, chemical reagents are used to detect constituents of the SARS-CoV-2 virus, such as nucleic acids and proteins, in an individual's throat or nasal swab sample. The reverse transcription polymerase chain reaction (RT-PCR) is one such testing method, and it currently serves as the gold standard for COVID-19 testing. However, the cost of machinery, time, and expertise have limited the scalability of this method. The rapid antigen test (RAT) is another molecular testing method which alleviates the time limitation of RT-PCR but has a high false-negative rate (low sensitivity). Swab-based molecular tests also require close contact between the participant and the health worker, violating physical distancing and posing a serious practical challenge. In summary, there is a need for alternative methodologies to diagnose COVID-19 infection that are efficient in terms of time, cost, and ease of use, and hence scalable.

The WHO [1] lists dry cough, breathing difficulty, chest pain, and fatigue as symptoms of the infection, manifested between 2 and 14 days after exposure to the virus. This was also validated by a modeling study that analyzed the symptoms reported by 7178 COVID-19 positive individuals [2]. Chest X-ray (and CT) scans of many COVID-19 infected individuals have revealed infection in the lungs [3], and effort is being directed at evaluating the feasibility of early diagnosis using imaging techniques.

Figure 1: Illustration of the distribution of the 80-plus challenge registrants (or teams) across countries and professional affiliations.
Interestingly, the respiratory medical literature suggests that sounds emanating from the coordinated release of air pressure through the lungs, such as breathing, cough, and speech, are intricately tied to changes in the anatomy and physiology of the respiratory system [4]. A lung infection can affect the inspiratory and expiratory capacity. This, in addition to the presence of cough, can result in difficulty in producing sustained phonation and/or continuous speech [5, 6]. This is the scientific principle on which studies analyzing vocal sounds have shown some success in detecting respiratory ailments such as pertussis [7], chronic obstructive pulmonary disease (COPD) [8], and tuberculosis [9]. Based on this biological plausibility, we hypothesize that evaluating the accuracy of detecting COVID-19 from the acoustics of respiratory sounds merits research. Success would provide an excellent point-of-care, quick, easy-to-use, and cost-effective tool to diagnose COVID-19 infection, and consequently help contain its spread. Altogether, it could supplement the molecular testing methods for COVID-19 detection and screening.

The DiCOVA Challenge is designed to accelerate research efforts in this direction through the creation and release of an acoustic signal dataset, and by inviting researchers to build detection models and report performance on a blind test set. Since its release on 04-Feb-2021, the DiCOVA Challenge has created widespread interest amongst researchers. We have received registrations from more than 80 teams, coming from various countries and professional affiliations (see Figure 1). In this paper, we present an overview of the topic, the tasks in the challenge, and the baseline system.

Figure 2: In each track, the dataset is grouped into non-COVID and COVID subjects. The non-COVID subjects are either healthy, have symptoms (cough/cold), or have pre-existing respiratory ailments (chronic lung disease, asthma, or pneumonia). The COVID subjects are either symptomatic or asymptomatic COVID-19 positive. The distribution of age, gender, and the splits of the development dataset is also shown.

Since the onset of the COVID-19 pandemic, several attempts have been made to evaluate the potential of sound-based screening (and diagnosis). These attempts [10, 11, 12, 13, 14, 15, 16, 17] have primarily focused on cough sounds and are works in progress. Brown et al. [16] use cough and breathing sounds from 141 COVID-19 patients, extract a collection of short-time frame-level acoustic features along with embeddings from a VGGish network, and pass these through a logistic regression classifier; an area under the curve (AUC) of 80% is reported. The study by Imran et al. [15] uses sound samples from 48 COVID-19 patients and reports a sensitivity of 94% (and a specificity of 91%) using a convolutional neural network (CNN) architecture fed with mel-spectrogram features as input. The study by Bagad et al. [17] uses cough samples from 376 COVID-19 patients and a CNN architecture based on ResNet18 with the short-time magnitude spectrogram as input, and reports an AUC of 72%. Altogether, these studies are encouraging. Their limitations include: (i) a different COVID-19 patient population in each study, (ii) varied evaluation methodologies, (iii) small population sizes, and (iv) a lack of insight into the acoustic feature differences between healthy and COVID-19 individuals.
The DiCOVA Challenge aims to encourage multiple research groups to analyze the same dataset, to evaluate system performance using fixed metrics, and to establish benchmarks for future system development.

The DiCOVA Challenge dataset is derived from the Coswara dataset [18], a crowd-sourced dataset of sound recordings from COVID-19 positive and non-COVID-19 individuals. The Coswara data is collected using a web application, launched in April 2020 and accessible through the internet by anyone around the globe. The volunteering subjects are advised to record their respiratory sounds in a quiet environment. Each subject provides nine audio recordings, spanning breathing, cough, sustained vowel phonation, and number-counting speech. For the challenge, the subjects have been divided into two groups, namely,

• non-COVID: subjects who are either healthy, have symptoms such as cold or cough, or have pre-existing respiratory ailments (asthma, pneumonia, chronic lung disease), and confirm that they are not COVID-19 positive.
• COVID: subjects who confirm that they are COVID-19 positive, either symptomatic or asymptomatic.

The Track-1 and Track-2 development datasets are composed of 1040 subjects (965 non-COVID) and 990 subjects (930 non-COVID), respectively. A breakdown of the subject population with respect to symptoms, age group, and gender is shown in Figure 2. The Coswara data collection is via crowd-sourcing, which means the quality of the audio files is highly variable; this makes the data a good representation of audio collected in the wild. A majority of the audio files are clean, as confirmed via informal listening. More than 90% of the collected files have a sampling rate of 48 kHz and are stored as WAV files. For the challenge datasets, all audio recordings have been resampled to 44.1 kHz and compressed as FLAC files.

The Track-1 audio files correspond to cough sound signals. Each audio file is derived from one unique subject and has one or more cough bouts. In total, there are 1040 recordings.

The DiCOVA Challenge features two tracks. Below, we present the task and the instructions associated with each track. A participant can choose to participate in one or both tracks.

Track-1: The goal is to use cough sound recordings from COVID-19 and non-COVID-19 individuals for the task of COVID-19 detection.
• The Track-1 development dataset is composed of cough audio data from 1040 subjects. The dataset also contains lists corresponding to a 5-fold cross-validation split. The distribution of COVID and non-COVID subjects in these splits is shown in Figure 2(a). All participants are required to adhere to these lists and report the average performance over the five validation sets (a minimal sketch of this fold-averaged scoring is given after the track descriptions).
• A separate blind evaluation dataset is provided to all participants. The participants are required to report their COVID-19 detection scores as probabilities.
• This is the primary track of the challenge. A baseline system is provided, and an online leaderboard is set up for all participants to report and compare their performance.

Track-2: The goal is to use breathing, sustained phonation, and speech sound recordings from COVID-19 and non-COVID-19 individuals for any kind of detailed analysis that can contribute towards COVID-19 detection.
• The dataset also contains five train-validation splits. The distribution of COVID and non-COVID subjects in these splits is shown in Figure 2(b).
• The participants are encouraged to design COVID-19 detection systems using the above splits.
• This track has no baseline system and no leaderboard. A non-blind test set is provided to all participants.
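The fold-averaged reporting required in Track-1 is straightforward to implement. Below is a minimal sketch, assuming the per-fold ground-truth labels and model scores are already available as arrays; the function name and data layout are our own and not part of the challenge tools.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def fold_averaged_auc(folds):
        """folds: list of (labels, scores) pairs, one per validation fold.
        labels: 0 for non-COVID, 1 for COVID; scores: COVID probabilities."""
        aucs = [roc_auc_score(labels, scores) for labels, scores in folds]
        return float(np.mean(aucs)), float(np.std(aucs))

    # Hypothetical usage over the five validation folds:
    # mean_auc, std_auc = fold_averaged_auc([(y1, s1), (y2, s2), (y3, s3),
    #                                        (y4, s4), (y5, s5)])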
Participants are free to use any other data for data augmentation, transfer learning, etc., with the exception of the publicly available Project Coswara dataset.

Both Track-1 and Track-2 are binary classification tasks. With a focus on COVID-19 detection, performance is evaluated using the traditional detection metrics, namely, the true positive (TP) and false positive (FP) rates, computed over a range of decision thresholds between 0 and 1 with a step size of 0.0001. For Track-1, each participant is required to submit a COVID-19 probability score for every audio file (corresponding to a subject) in the blind test set. In the evaluation, we use the probability scores to compute the receiver operating characteristic (ROC) curve, and use the area under the curve (AUC) to quantify model performance. An AUC > 50% indicates better-than-chance performance, and an AUC closer to 100% indicates ideal performance. We also compute the model specificity at 80% sensitivity.

For the baseline system, the audio data is pre-processed by normalizing the amplitude range to ±1. Subsequently, a simple sample-level sound activity detection (SAD) is applied: any audio sample with an absolute value greater than 0.01 (together with a margin of ±50 ms around it) is kept, and the remaining samples are discarded. Further, the initial and final 20 ms of audio are discarded to remove abrupt start and end bursts due to device noise. Then, 39-dimensional mel-frequency cepstral coefficients (MFCCs) [19], along with the delta and delta-delta coefficients, are extracted using a window of 1024 samples and a hop of 441 samples. The librosa python library [20] is used for this computation.

Three different classifier models are trained for the two-class classification task of COVID-19 versus non-COVID-19 detection. The models are trained using the extracted features and a class-balanced loss function, separately for each of the five training data splits. The implementation uses the scikit-learn python library [21]. The classifier models include the following.

• Random Forest (RF): The random forest classifier is trained with 50 trees in the forest and the Gini impurity criterion to measure split quality.

To obtain a classification score for an audio file: (i) pre-processing with amplitude normalization and SAD is performed, (ii) frame-level MFCC features are extracted, (iii) frame-level probability scores are computed using the trained model, and (iv) all frame scores are averaged to obtain a single COVID-19 probability score for the audio file (baseline results are shown in Fig. 3).

The Track-2 test dataset release contains 209 audio files (21 COVID-19) for each of the three sound categories. Here, the RF model gave better performance than the other models in all three sound categories. Its performance was best for breathing (76.85% AUC) and worst for speech (65.27% AUC). Minimal, illustrative sketches of the baseline pipeline and the evaluation metrics follow below.
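First, the pre-processing stage. The sketch below follows the stated constants (a 0.01 amplitude threshold, a ±50 ms margin, and 20 ms trimmed from each end); it is our reading of the description, not the official baseline code.

    import numpy as np

    def preprocess(x, sr, threshold=0.01, margin_ms=50, trim_ms=20):
        """Amplitude normalization followed by sample-level SAD."""
        x = x / (np.max(np.abs(x)) + 1e-10)        # normalize range to +/-1
        trim = int(trim_ms * 1e-3 * sr)
        x = x[trim:len(x) - trim]                  # drop 20 ms at each end
        margin = int(margin_ms * 1e-3 * sr)
        active = (np.abs(x) > threshold).astype(float)
        # keep samples above the threshold, plus a +/- margin around each
        keep = np.convolve(active, np.ones(2 * margin + 1), mode="same") > 0
        return x[keep]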
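Next, feature extraction with librosa. The text leaves it open whether "39-dimensional" counts only the static MFCCs or the full static-plus-delta stack; the sketch assumes the former (set n_mfcc=13 for the latter reading).

    import numpy as np
    import librosa

    def extract_features(x, sr=44100):
        """Frame-level MFCCs with delta and delta-delta coefficients,
        using a 1024-sample window and 441-sample hop (~23 ms / 10 ms)."""
        mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=39,
                                    n_fft=1024, hop_length=441)
        d1 = librosa.feature.delta(mfcc)            # delta coefficients
        d2 = librosa.feature.delta(mfcc, order=2)   # delta-delta coefficients
        return np.vstack([mfcc, d1, d2]).T          # (num_frames, num_dims)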
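The RF classifier and the file-level score aggregation, steps (i)-(iv) above, can then be sketched as follows, reusing the preprocess and extract_features functions from the previous sketches. Setting class_weight="balanced" is one way to realize the class-balanced objective mentioned above; the training arrays here are random placeholders standing in for real frame-level features and labels.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder frame-level training data; replace with features extracted
    # from the training portion of one cross-validation fold.
    rng = np.random.default_rng(0)
    train_frames = rng.standard_normal((1000, 117))
    train_labels = rng.integers(0, 2, 1000)        # 0: non-COVID, 1: COVID

    clf = RandomForestClassifier(n_estimators=50, criterion="gini",
                                 class_weight="balanced", random_state=0)
    clf.fit(train_frames, train_labels)

    def score_file(x, sr):
        """Pre-process, extract frame features, score frames, and average."""
        feats = extract_features(preprocess(x, sr), sr)
        frame_probs = clf.predict_proba(feats)[:, 1]   # P(COVID) per frame
        return float(frame_probs.mean())               # file-level score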
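Finally, the evaluation metrics. The sketch mirrors the stated threshold sweep (0 to 1 in steps of 0.0001) and computes the AUC by trapezoidal integration, together with the specificity at 80% sensitivity; it illustrates the metric definitions and is not the official scoring script.

    import numpy as np

    def evaluate(labels, scores):
        """Return (AUC, specificity at 80% sensitivity) from file scores."""
        labels, scores = np.asarray(labels), np.asarray(scores)
        thresholds = np.arange(0.0, 1.0001, 0.0001)
        tpr = np.array([(scores[labels == 1] >= t).mean() for t in thresholds])
        fpr = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
        # trapezoidal area under the ROC; fpr decreases as threshold rises
        auc = float(np.sum(0.5 * (tpr[:-1] + tpr[1:]) * -np.diff(fpr)))
        # best specificity among operating points with sensitivity >= 80%
        spec_at_80 = float((1.0 - fpr)[tpr >= 0.80].max())
        return auc, spec_at_80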
The uniqueness of the dataset makes the DiCOVA Challenge the first of its kind at the INTERSPEECH conference. The practical and timely relevance of the task encourages a focused effort from researchers across the globe, and from diverse fields such as respiratory sciences, speech and audio processing, and machine learning. Along with the dataset, we also provide the baseline system software to all participants, which we expect will serve as an example data processing pipeline. Further, participants are encouraged to explore different kinds of features and models of their own choice to obtain significantly better performance than the baseline system.

We thank Anand Mohan for his enormous help in the web design and data collection efforts. Thanks also to the Department of Science and Technology, Government of India.

References
[1] WHO Coronavirus Disease (COVID-19) Dashboard.
[2] Real-time tracking of self-reported symptoms to predict potential COVID-19.
[3] Thoracic imaging tests for the diagnosis of COVID-19.
[4] Speech breathing across the life span and in disease.
[5] Speech breathing in patients with lung disease.
[6] Perceived phonatory effort and phonation threshold pressure across a prolonged voice loading task: a study of vocal fatigue.
[7] A cough-based algorithm for automatic diagnosis of pertussis.
[8] TussisWatch: a smartphone system to identify cough episodes as early symptoms of chronic obstructive pulmonary disease and congestive heart failure.
[9] Detection of tuberculosis by automatic cough sound analysis.
[10] UK COVID-19 Sounds App.
[11] NYU Breathing Sounds for COVID-19.
[12] EPFL Cough for COVID-19 Detection.
[13] CMU Sounds for COVID Project.
[15] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app.
[16] Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data.
[17] Cough against COVID: evidence of COVID-19 signature in cough sounds.
[18] Coswara: a database of breathing, cough, and voice sounds for COVID-19 diagnosis.
[19] Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences.
[20] librosa: audio and music signal analysis in Python.
[21] Scikit-learn: machine learning in Python.