key: cord-0534001-3ok8ijcu authors: Sharma, Neeraj Kumar; Muguli, Ananya; Krishnan, Prashant; Kumar, Rohit; Chetupalli, Srikanth Raj; Ganapathy, Sriram title: Towards sound based testing of COVID-19 -- Summary of the first Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge date: 2021-06-21 journal: nan DOI: nan sha: dce758b60200fd0a384eca891eacd288a0ac9831 doc_id: 534001 cord_uid: 3ok8ijcu

The technology development for point-of-care tests (POCTs) targeting respiratory diseases has witnessed a growing demand in the recent past. Investigating the presence of acoustic biomarkers in modalities such as cough, breathing and speech sounds, and using them for building POCTs, can offer fast, contactless and inexpensive testing. In view of this, over the past year, we launched the "Coswara" project to collect cough, breathing and speech sound recordings via worldwide crowdsourcing. With this data, a call for the development of diagnostic tools was announced at Interspeech 2021 as a special session titled "Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge". The goal was to bring together researchers and practitioners interested in developing acoustics-based COVID-19 POCTs by enabling them to work on the same set of development and test datasets. As part of the challenge, datasets with breathing, cough, and speech sound samples from COVID-19 and non-COVID-19 individuals were released to the participants. The challenge consisted of two tracks. Track-1 focused only on cough sounds, and participants competed in a leaderboard setting. In Track-2, breathing and speech samples were provided to the participants, without a competitive leaderboard. The challenge attracted more than 85 registrations, with 29 final submissions for Track-1. This paper describes the challenge (datasets, tasks, baseline system) and presents a focused summary of the various systems submitted by the participating teams. An analysis of the results from the top four teams showed that a fusion of the scores from these teams yields an area-under-the-curve of 95.1% on the blind test data. By summarizing the lessons learned, we foresee the challenge overview in this paper helping to accelerate technology for acoustics-based POCTs.

The viral respiratory infection caused by the novel coronavirus, SARS-CoV-2, termed the coronavirus disease 2019 (COVID-19), was declared a pandemic by the World Health Organization (WHO) in March 2020. The current understanding of COVID-19 prognosis suggests that the virus infects the nasopharynx and then spreads to the lower respiratory tract [1]. One of the key strategies to combat the rapid spread of infection across populations is to perform rapid and large-scale testing. The prominent COVID-19 testing methodologies currently take a molecular sensing approach. The current gold-standard technique, termed reverse transcription polymerase chain reaction (RT-PCR) [2], relies on nasopharyngeal or throat swab samples. The swab sample is treated with chemical reagents enabling isolation of the ribonucleic acid (RNA), followed by deoxyribonucleic acid (DNA) formation, amplification, and analysis, facilitating the detection of the COVID-19 genome in the sample. However, this approach has several limitations. The swab sample collection procedure violates physical distancing [3]. The processing of these samples requires a well-equipped laboratory, with readily available chemical reagents and expert analysts.
Further, the turnaround time for test results can vary from several hours to a few days. The protein-based rapid antigen test (RAT) [4] improves on the speed of detection while being inferior to RT-PCR in performance. The RAT also requires chemical reagents. In view of the above-mentioned limitations of RT-PCR/RAT testing, there is a need to design highly specific, rapid and easy-to-use point-of-care tests (POCTs) that could identify infected individuals in a decentralized manner. Using acoustics to develop such a POCT would overcome various limitations in terms of speed, cost and scalability, and would allow remote testing.

Acoustics-based diagnosis of pertussis [10], tuberculosis [11], childhood pneumonia [12], and asthma [13] has been explored using cough sounds recorded with portable devices. As COVID-19 is an infection affecting the respiratory pathways [14], researchers have recently made efforts towards acoustic data collection. A list of acoustic datasets is provided in Table 1. Building on these datasets, a few studies have also evaluated the possibility of COVID-19 detection using acoustics. Brown et al. [9] used cough and breathing sounds jointly and attempted a binary classification task of separating COVID-19 infected individuals from healthy individuals. The dataset was collected through crowd-sourcing, and the analysis was done on 141 COVID-19 infected individuals. The authors reported a performance of 80-82% AUC (area under the curve). Agbley et al. [15] demonstrated 81% specificity (at 43% sensitivity) on a subset of the COUGHVID dataset [5]. Imran et al. [16] studied cough sound samples from four groups of individuals, namely, healthy, and those with bronchitis, pertussis, and COVID-19 infection. They reported an accuracy of 92.64%. Laguarta et al. [17] used a large sample set of COVID-19 infected individuals and reported an AUC performance of 90%. Andreu-Perez et al. [18] created a more controlled dataset by collecting cough sound samples from patients visiting hospitals, and reported 99% AUC on a pool of 2339 COVID-19 coughs.

Although these studies are encouraging, they suffer from several limitations. Some of them are based on privately collected, small datasets. Further, the ratio of COVID-19 patients to healthy (or non-COVID) subjects is different in every study, and the performance metrics also differ across studies. Some studies report performance per cough bout, and others per patient. Further, most of the studies have not benchmarked their methods on other open-source datasets, making it difficult to compare the various propositions.

We launched the "Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge" [19] with two primary goals. Firstly, to encourage speech and audio researchers to analyze the acoustics of cough and speech sounds for a problem of immediate societal relevance. The challenge was launched under the umbrella of Interspeech 2021, and participants were given an option to submit their findings to a special session in this flagship conference. Secondly, and more importantly, to provide a benchmark for monitoring progress in acoustics-based diagnostics of COVID-19. The development and (blind) test datasets were provided to the participants to facilitate the design of classifier systems. A leaderboard was created, allowing participants to rank their performance against others.
This paper describes the details of the challenge, including the dataset and the baseline system (Section 2), and provides a summary of the various submitted systems (Section 3). An analysis of the scores submitted by the top teams (Section 4) and the insights gained from the challenge (Section 5) are also presented.

The DiCOVA challenge was launched on 4 February 2021 and lasted until 23 March 2021. A timeline of the challenge is shown in Figure 1. Participation was through a registration process followed by the release of the development and test datasets. A remote server based scoring system with a leaderboard (https://competitions.codalab.org/competitions/29640) was created. This provided near real-time ranking of teams and monitoring of progress on the blind test set. The call for participation in the challenge attracted more than 85 registrations. Further, 29 teams made final submissions on the blind test set.

The challenge dataset is derived from the Coswara dataset [6], a crowd-sourced dataset of sound recordings. The Coswara data is collected using a website (https://coswara.iisc.ac.in/). Volunteers of all age groups and health conditions were requested to record their sound data in a quiet environment using a mobile, web-connected device. Participants initially provide demographic information such as age and gender. Their current health status is recorded in the form of a questionnaire covering symptoms as well as pre-existing conditions such as respiratory ailments and co-morbidities. The web-based tool also records the result of any COVID-19 test conducted and the possibility of exposure to the virus through primary contacts. The acoustic data from each subject contains 9 audio categories, including shallow and deep breathing, heavy cough, sustained phonation of the vowel [i], and number counting.

The DiCOVA Challenge used a subset of the Coswara dataset, sampled from the data collected between April 2020 and February 2021. The sampling included only the age group of 15-80 years. Subjects with a health status of "recovered" (COVID-19 positive in the past but fully recovered from the infection) and "exposed" (suspecting exposure to the virus) were not included in the dataset. Further, subjects with audio recordings of duration less than 500 ms were discarded. The resulting curated subject pool was divided into the following two groups.
• non-COVID: Subjects self-reported as healthy, having symptoms such as cold/cough/fever, or having pre-existing respiratory ailments (like asthma, pneumonia, chronic lung disease), but who had not tested positive for COVID-19.
• COVID: Subjects self-declared as COVID-19 positive (symptomatic with mild/moderate infection, or asymptomatic).

The DiCOVA 2021 challenge featured two tracks. The Track-1 dataset is composed of (heavy) cough sound recordings from 1040 subjects. The Track-2 dataset is composed of deep breathing, vowel [i], and number counting (normal pace) speech recordings from 992 subjects. For each track, the dataset was divided into a development set (training with validation) and a test set. An illustration of the important metadata details in the development set is provided in Figure 2. About 70% of the subjects were male. The majority of the participants lie in the age group of 15-40 years. Also, the dataset is highly imbalanced, with less than 10% of the participants belonging to the COVID category. We retained this class imbalance in the challenge as it reflects the typical real-world POCT scenario.
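To make the curation steps above concrete, the sketch below shows one way the metadata filtering could be expressed. The file name, column names, and status strings are hypothetical placeholders, not the actual Coswara schema.

```python
import pandas as pd

# Hypothetical metadata file and column names; the actual Coswara fields may differ.
meta = pd.read_csv("coswara_metadata.csv")

# Keep only 15-80 year olds, and drop the "recovered" and "exposed" health statuses.
meta = meta[(meta["age"] >= 15) & (meta["age"] <= 80)]
meta = meta[~meta["covid_status"].isin(["recovered", "exposed"])]

# Discard subjects whose recordings are shorter than 500 ms (duration assumed in seconds).
meta = meta[meta["duration_sec"] >= 0.5]

# Binary labels: COVID (self-declared positive) vs. non-COVID (everyone else).
meta["label"] = meta["covid_status"].str.contains("positive").astype(int)
```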
The crowd-sourced dataset collection acts as a good representation of real-world data, with sensor variability arising from diverse recording devices. For the challenge, we re-sampled all audio recordings to 44.1 kHz and compressed them to FLAC (Free Lossless Audio Codec) format for ease of distribution. The two tracks are described below.
• Track 1: The task was based on cough samples. This was the primary track of the challenge, with most teams participating only in this track. We released the baseline system for this track as well. The development dataset release contained a five-fold validation setup. A leaderboard website was hosted for the challenge, enabling teams to evaluate their system performance (validation and blind test). The participating teams were required to submit the COVID probability score for each audio file in the validation and test sets. The tool computes the area-under-curve (AUC) and specificity/sensitivity. Every team was provided a maximum of 25 tickets for evaluation over the course of the challenge.
• Track 2: Track-2 explored the use of recordings other than cough for the task of COVID-19 diagnostics. The audio recordings released in this track are composed of breathing, sustained phonation of the vowel [i], and number counting (1-20). The development and (non-blind) test sets were released concurrently, without any formal leaderboard-style evaluation or competition.

The data and the baseline system setup were provided to the registered teams after signing the terms and conditions document. As per the document, the teams were not allowed to use the publicly available Coswara dataset.

The focus of the task (Track-1 and Track-2) was binary classification. As the dataset was imbalanced, we chose not to use accuracy as an evaluation metric. Each team submitted COVID probability scores (∈ [0, 1], with higher values indicating a higher likelihood of COVID-19 infection) for the list of validation/test audio recordings. For performance evaluation, we used the scores with the ground truth labels to compute the receiver operating characteristic (ROC) curve. The curve was obtained by varying the decision threshold between 0 and 1 with a step size of 0.0001. The area under the resulting ROC curve was used as a performance measure for the classifier, where the area was computed using the trapezoidal method. The AUC formed the primary evaluation metric. Further, the specificity (true negative rate) at a sensitivity (true positive rate) greater than or equal to 80% was used as a secondary evaluation metric.
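A minimal NumPy sketch of the two metrics as described above (threshold sweep in steps of 0.0001 and the trapezoidal rule) is given below; in practice, scikit-learn's roc_curve and auc utilities yield the same AUC.

```python
import numpy as np

def auc_and_specificity(scores, labels, sens_target=0.80):
    """ROC AUC (trapezoidal rule) and specificity at sensitivity >= sens_target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = labels == 1, labels == 0
    thresholds = np.arange(0.0, 1.0 + 1e-9, 0.0001)   # sweep 0 to 1 in steps of 0.0001
    tpr = np.array([(scores[pos] >= t).mean() for t in thresholds])  # sensitivity
    fpr = np.array([(scores[neg] >= t).mean() for t in thresholds])  # 1 - specificity
    order = np.argsort(fpr)
    auc = np.trapz(tpr[order], fpr[order])            # area under the ROC curve
    ok = tpr >= sens_target
    spec_at_sens = (1.0 - fpr[ok]).max() if ok.any() else 0.0
    return auc, spec_at_sens
```

The secondary metric simply reports the best specificity among operating points whose sensitivity is at least 80%.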
The baseline system was implemented using the scikit-learn Python library [20].
Pre-processing: For every audio file, the signal was normalized in amplitude. Using a sound activity detection threshold of 0.01 and a buffer size of 50 ms on either side of a sample, any region of the audio signal with amplitude lower than the threshold was discarded. Also, the initial and final 20 ms snippets of the audio were removed to avoid abrupt start and end activity in the recordings.
Feature extraction: Frame-level mel-frequency cepstral coefficient (MFCC) features of dimension 39 × 1 were extracted from each pre-processed recording.
Classifiers: The following three classifiers were designed.
• Logistic Regression (LR): The LR classifier was trained for 25 epochs. The binary cross entropy (BCE) loss with an ℓ2 regularization strength of 0.01 was used for optimizing the model.
• Multi-layer perceptron (MLP): A single-layer perceptron model with 25 hidden units and tanh() activation was used. Similar to the LR model, the BCE loss with an ℓ2 regularization of strength 0.001 was optimized for parameter estimation. The loss was optimized using the Adam optimizer with an initial learning rate of 0.001. The COVID samples were over-sampled to compensate for the data imbalance (weighted BCE loss).
• Random Forest (RF): A random forest classifier was trained with 50 trees using the Gini impurity criterion for tree growing.
Inference and performance: To obtain a classification score for an audio recording: (i) the file is pre-processed, (ii) frame-level MFCC features are extracted, (iii) frame-level probability scores are computed using the trained model(s), and (iv) all the frame scores are averaged to obtain a single COVID probability score for the audio recording. For evaluation on the test set files, the probability scores from the five validation models (for a classifier) are averaged to obtain the final score. The performance of the three classifiers used in the baseline system is provided in Table 2. Further, among the categories of acoustic sounds, the breathing samples provided the best AUC performance (76.85%), followed by the vowel sound [i] (75.47%). The baseline system code was provided to the participants as a reference for setting up a processing and scoring pipeline.

A total of 28 teams participated in the Track-1 leaderboard. Out of these, 20 teams submitted system reports describing the explored approaches. In this section, we provide a brief overview of the submissions, emphasizing the feature extraction approaches, data augmentation methods, classifier types, and model performances. The Track-2 submissions did not require a system report. Hence, we limit the discussion of the system highlights to the Track-1 submissions only.

The performance summary of all the submitted systems on the validation and the blind test data is given in Figure 4. Figure 4(a) depicts a comparison of the validation and test results. Interestingly, there is a slight positive correlation between test and validation performance. For some teams, the validation performances exceed 90% AUC. Deducing from the system reports, these outliers are primarily due to training/over-fitting on the validation data. Figure 4(b) depicts the best AUC posted by the 29 participating teams (including the baseline) on the blind test data. The best performance on the test data was 87.07% AUC, a significant improvement over the baseline AUC (69.85%). In total, 23 out of the 28 teams reported a performance better than the baseline system. We refer to the teams with IDs corresponding to their rank on the leaderboard, that is, the best AUC performance as T-1, and so on.

Table 3: Summary of submitted systems in terms of feature and model configurations. The specificity (%) is reported at a sensitivity of 80%. Here, † denotes a report accepted at Interspeech 2021, and * denotes a team that did not give consent for public release of its report.

The teams designed and experimented with a wide spectrum of features and classifiers. A concise highlight is shown in Table 3. We elaborate on this below. A majority of the teams used mel-spectrograms, mel-frequency cepstral coefficients [35], or equivalent rectangular bandwidth (ERB) [36] spectrograms (15 submissions out of 21). Further, the openSMILE features [37], which consist of statistical measures extracted from low-level acoustic feature descriptors, were explored by 4 teams. A few teams explored features derived using Teager energy based cepstral coefficients (TECC [38]; T-15), and a pool of short-term features such as short-term energy, zero-crossing rate, and voicing (T-5, T-14, T-27).
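As an illustration of the two most common front-ends mentioned above, the librosa-based sketch below extracts a log-mel spectrogram and 39-dimensional MFCCs (13 static coefficients plus deltas and delta-deltas, one common way to arrive at 39 × 1 frame-level features). The file name, window, and hop sizes are illustrative assumptions, not the challenge's exact configuration.

```python
import numpy as np
import librosa

# Challenge audio was distributed as 44.1 kHz FLAC; "cough.flac" is a placeholder name.
y, sr = librosa.load("cough.flac", sr=44100)

# Log-mel spectrogram, the most widely used front-end among the submissions.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=441, n_mels=64)
log_mel = librosa.power_to_db(mel)

# 13 MFCCs with delta and delta-delta coefficients -> (39, num_frames) feature matrix.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=441)
feats = np.vstack([mfcc, librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)])
```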
Other teams resorted to using embeddings derived from pre-trained neural networks as features. These included VGGish [39], DeepSpectrum [40], OpenL3 [41], and YAMNet [42] embeddings (T-7, T-12), and x-vectors [43] (T-15). The teams explored various classifier models. These included classical machine learning models, such as decision trees, random forests (RFs), and support vector machines (SVMs), and modern deep learning models, such as convolutional, recurrent, and residual network architectures. The fusion of scores from different classifier architectures was explored by multiple teams (T-3, T-4, T-6, T-10, T-11, T-12, T-13). The fusion of multiple features was explored by T-13. Further, T-2 and T-3 investigated score fusion of the outputs obtained from the models tuned on the five validation folds.

Data augmentation is a popular strategy in which external or synthetic audio data is used in training deep network models. Five teams reported using this strategy, by including the publicly available COUGHVID cough dataset [5], adding Gaussian noise at varying SNRs, or performing audio manipulations (pitch shifting, time-scaling, etc., via tools such as audiomentations). A few teams also used data augmentation approaches to circumvent the problem of class imbalance. These included T-1 using mixup [45], T-3, T-9, and T-11 using SpecAugment [46], T-2, T-5, and T-9 using additive noise, T-21 using sample replication, and T-5 using vocal tract length perturbation (VTLP) [47], to increase the sample count of the minority class. Besides these, other training strategies included gender-aware training (T-21), the focal loss [48] objective function (T-2, T-8, T-11), and hyper-parameter tuning using the model search algorithm TPOT [49] (T-7). In the next section, we discuss in detail the approaches used by the four top-performing teams.

Team T-1 [21] focused on a multi-layered CNN architecture. Special emphasis was laid on having a small number of learnable parameters. Every audio segment was trimmed or zero-padded to 7 s. For feature extraction, this segment was represented using 15-dimensional MFCC features per frame, yielding a 15 × 302 feature matrix. A cascade of CNN and fully connected layers, with max-pooling and ReLU non-linearities, was used in the neural network architecture. For data augmentation, the team used the audiomentations tool. The classifier was trained using the binary cross entropy (BCE) loss to output a COVID probability score. Unlike several other participating teams, the team did not report performing any system combination.

Table 4: A comparison of AUC and sensitivity of the top four teams, their score fusion, and the baseline system.

Team T-2 [22] focused on using a residual network (ResNet) model with spectrogram images as features. To overcome the limitations of data scarcity and imbalance, the team resorted to three key strategies. Firstly, data augmentation was done by adding Gaussian noise to the spectrograms. Secondly, the focal loss function was used instead of the cross-entropy loss. Thirdly, a ResNet50 pre-trained on ImageNet was fine-tuned on the DiCOVA development set, and an ensemble of four models was used to generate the final COVID probability scores.
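Several teams (T-2, T-8, T-11) adopted the focal loss to cope with the heavy class imbalance. The snippet below is a generic PyTorch sketch of the binary focal loss of Lin et al. [48], not any team's exact implementation; the alpha and gamma values are common defaults, not values reported by the teams.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so the scarce COVID-positive
    class contributes more to the gradient. `targets` are float labels in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```

Setting gamma to 0 and alpha to 0.5 reduces this to (half of) the ordinary BCE loss, which makes the effect of the focusing term easy to ablate.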
Team T-3 [23] used long short-term memory (LSTM) models. Motivated by generative modeling of the mel-spectrogram to capture informative features of cough, the team proposed using auto-regressive predictive coding (APC) [50]. The APC objective is used to pre-train the initial LSTM layers operating on the input mel-spectrogram. The additional layers of the full network, composed of BLSTM and fully connected layers, were trained using the DiCOVA development set. As the number of model parameters was high, the team also used data augmentation with the COUGHVID dataset [5] and the SpecAugment [46] tool. Binary cross entropy was chosen as the loss function. The final COVID-19 probability score was obtained as an average over several similar models, trained on development data subsets or sampled at different checkpoints during training.

Team T-4 [24] explored classical machine learning models such as random forests (RF), support vector machines (SVM), and multi-layer perceptrons (MLP) rather than deep learning models. The features used were the 6373-dimensional openSMILE functional features [37]. The openSMILE features were z-score normalized to prevent feature domination. The hyper-parameters of the models were tuned to obtain the best results. The SVM models alone provided an AUC of 85.1% on the test data. The RF and the MLP scored AUCs of 82.15% and 75.65%, respectively. The final scores were obtained by a weighted average of the probability scores from the RF and SVM models, with weights of 0.25 and 0.75, respectively.

Here, we present a fairness analysis of the scores generated by the top four teams. We particularly focus on gender-wise and age-wise performance on the test set. Figure 5 depicts this performance. Interestingly, all four teams gave a better performance for female subjects. Similarly, the test dataset was divided into two groups based on subjects with age < 40 and age ≥ 40. Here, the top two teams had a considerably higher AUC for age ≥ 40 subjects, while T-3 had a lower AUC for this age group and T-4 had the highest. In summary, the performance of the top four teams did not reflect the bias in the development data (70% male participants, largely in the age ≤ 40 group).

The systems from the top four teams differ in terms of features, model architectures, and data augmentation strategies. We consider a simple arithmetic-mean fusion of the scores from the top four teams. Let $p_{ij}$, $1 \leq i \leq N$, $1 \leq j \leq T$, be the COVID probability score predicted by the $j$-th team submission for the $i$-th subject in the test data. Here, $N$ denotes the number of subjects in the test set, and $T$ is four. The scores are first calibrated by correcting for the range as
$$\hat{p}_{ij} = \frac{p_{ij} - p_{\min,j}}{p_{\max,j} - p_{\min,j}},$$
where $p_{\min,j} = \min\{p_{1j}, \ldots, p_{Nj}\}$ and $p_{\max,j} = \max\{p_{1j}, \ldots, p_{Nj}\}$. The fused scores are obtained as
$$p_{i}^{\mathrm{fused}} = \frac{1}{T} \sum_{j=1}^{T} \hat{p}_{ij}.$$

Figure 6: Illustration of the ROCs obtained on the test set for the top four teams. The ROC associated with the hypothetical score fusion system obtained using the top four teams is also shown.

The ROC obtained using these fused prediction scores is denoted by Fusion in Figure 6. This gives an AUC of 95.10%, a significant improvement over each of the individual system results. Table 4 depicts the sensitivity of the top four systems, the fusion, and the baseline (MLP) at 95% specificity. The fused model surpasses all the other models and achieves a sensitivity of 70.7%.
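A minimal NumPy sketch of this min-max calibration and arithmetic-mean fusion is given below; the array names are illustrative, and the per-team score files are assumed to be already aligned by subject.

```python
import numpy as np

def min_max_calibrate(scores):
    """Map one team's scores onto [0, 1], as in the calibration step above."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def fuse_scores(score_matrix):
    """score_matrix: shape (N, T), one column of COVID probability scores per team.
    Returns the arithmetic mean of the per-team calibrated scores."""
    calibrated = np.column_stack([min_max_calibrate(score_matrix[:, j])
                                  for j in range(score_matrix.shape[1])])
    return calibrated.mean(axis=1)
```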
The challenge problem statement for Track-1 required the design of a binary classifier. A clear problem statement, with a well-defined evaluation metric (AUC), encouraged a significant number of registrations. This included more than 85 teams from around the globe. We also noticed a good representation from both industry and academia. The 28 teams which completed the challenge came from 9 different countries. Additionally, 8 teams were affiliated with industry. Among the submissions, 23 out of the 28 teams exhibited a performance well above the baseline system AUC (see Figure 4(b)). Altogether, the challenge provided a common platform for interested researchers to explore a timely diagnostic problem of immense societal impact. The results indicate potential in using acoustics for COVID-19 POCT development. The challenge turnaround time was 49 days, and the progress made by different teams in this short time span highlighted the efforts undertaken by the community. Several works in this challenge will be presented at the DiCOVA Special Session, Interspeech 2021 (to be held during 30 Aug-3 Sept 2021). Following the peer review process, the special session will feature 11 accepted papers.

The World Health Organization (WHO) has stated that a sensitivity of ≥ 70% (at a specificity of 95%) is necessary for an acceptable POCT tool [3]. The top four teams fell short of this benchmark (see Table 4), indicating that there is scope for further development. However, a simple combination of the scores from the systems of these teams achieves this benchmark. This suggests that collaboration between multiple teams can be leveraged for improved tool development. Such a sound-based diagnostic tool for COVID-19 would offer multiple advantages in terms of speed, cost and remote testing.

The challenge, being the first of its kind, also had its own limitations. The dataset provided was largely imbalanced, with a majority of the samples belonging to the non-COVID class. Although the imbalance reflects the prevalence of the infection in the population, it would be ideal to reduce this imbalance in future challenges. The Coswara dataset [6] is being regularly updated, and as of June 2021, it contains data from approximately 200 COVID-19 positive individuals and 2000 non-COVID individuals. However, at the time of the challenge, the COVID-positive class contained only about 120 subjects. A majority of the DiCOVA dataset samples came from India. While the cultural dependence of cough and breathing is not well established, it would be ideal to evaluate performance on datasets collected from multiple geographical sites. Towards this, future challenges can include demographically balanced datasets, with close collaboration between multiple sites involved in the data collection efforts.

The task in the challenge was simplified to a binary classification setting. However, in a practical scenario, there are multiple respiratory ailments resulting from bacterial, fungal, or viral infections, with each condition potentially leaving a unique biomarker. Future evaluations of respiratory ailments may target multi-class categorization, which will also widen the usability of the tool. The data did not contain information regarding the progression of the disease (or the time elapsed since the positive COVID-19 test). Also, the participants in the "recovered" and "exposed" categories were not analyzed in the challenge. The leaderboard and system highlights reported were limited to the cough recordings only. As seen in Table 2, analysis using breathing and speech signals can also yield performance comparable to that observed with cough recordings. In addition, the Coswara tool [6] also records symptom data from the participants.
The combination of all acoustic categories with symptom data in developing the tool might further push the performance of these tools beyond the regulatory requirements. In the DiCOVA challenge, the performance ranking of the teams was based on the AUC metric, which conveys only the model's ability to perform binary classification. However, the challenge did not emphasize model interpretability and explainability as key requirements. In a healthcare scenario, the interpretability of the model decisions may be as important as the accuracy. Hence, future challenges should encourage this aspect. Additionally, in the future, it is important to focus on the reproducibility of the models as well as a lower memory/computational footprint, which will benefit the rapid development of a tool.

References:
In situ detection of SARS-CoV-2 in lungs and airways of patients with COVID-19
Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
Scaling up COVID-19 rapid antigen tests: promises and challenges
The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
Coswara: a database of breathing, cough, and voice sounds for COVID-19 diagnosis
Virufy COVID-19 Open Cough Dataset
Novel coronavirus cough database: NoCoCoDa
Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data
A cough-based algorithm for automatic diagnosis of pertussis
Detection of tuberculosis by automatic cough sound analysis
Cough sound analysis can rapidly diagnose childhood pneumonia
Development of machine learning for asthmatic and healthy voluntary cough sounds: a proof of concept study
Epidemiology of COVID-19: A systematic review and meta-analysis of clinical characteristics, risk factors, and outcomes
Wavelet-based cough signal decomposition for multimodal classification
AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app
COVID-19 artificial intelligence diagnosis using only cough recordings
A generic deep learning based cough analysis system from clinically validated samples for point-of-need COVID-19 test and severity levels
DiCOVA challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics
Scikit-learn: Machine learning in Python
The Brogrammers DiCOVA 2021 Challenge System Report
DiCOVA-Net: Diagnosing COVID-19 using acoustics based on deep residual network for the DiCOVA challenge 2021
Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation
Detecting COVID-19 from audio recording of coughs using Random Forests and Support Vector Machines
Samsung R&D Bangalore DiCOVA 2021 challenge system report
Diagnosis of COVID-19 using Auditory Acoustic Cues
COVID-19 detection using recorded coughs in the 2021 DiCOVA challenge
Investigating the Feature Selection and Explainability (to appear in Proc. Interspeech)
Recognising COVID-19 from coughing using ensembles of SVMs and LSTMs with handcrafted and deep audio features
A residual network based deep learning model for detection of COVID-19 from cough sounds
Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis
Cough-based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information
The DiCOVA 2021 Challenge: an encoder-decoder approach for COVID-19 recognition from coughing audio
Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences
Bark and ERB bilinear transforms
openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor
Analysis of reverberation via Teager energy features for replay spoof speech detection
CNN architectures for large-scale audio classification
Snore sound classification using image-based deep spectrum features
Look, listen, and learn more: Design choices for deep audio embeddings
X-vectors: Robust DNN embeddings for speaker recognition
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
mixup: Beyond empirical risk minimization
SpecAugment: A simple data augmentation method for automatic speech recognition
Vocal tract length perturbation (VTLP) improves speech recognition
Focal loss for dense object detection
Scaling tree-based automated machine learning to biomedical big data with a feature set selector
Representation learning with contrastive predictive coding

Acknowledgments: The authors would like to thank the Department of Science and Technology (DST), Government of India, for providing financial support to the Coswara Project through the RAKSHAK programme. The authors would like to thank the Organizing Committee of Interspeech 2021 for giving us the opportunity to host this challenge under the umbrella of ISCA. The authors would like to express their gratitude to Anand Mohan for the design of the web-based data collection platform, and to Dr. Nirmala R., Dr. Shrirama Bhat, Dr. Lancelot Pinto, and Dr. Viral Nanda for their coordination in data collection.