A Cough-based Deep Learning Framework for Detecting COVID-19
Hoang Van Truong; Lam Pham
2021-10-07

In this paper, we propose a deep learning-based framework for detecting COVID-19 positive subjects from their cough sounds. In particular, the proposed framework comprises two main steps. In the first step, we generate a feature representing the cough sound by combining embedding features extracted from a pre-trained model with handcrafted features, referred to as the front-end feature extraction. Then, the combined features are fed into different back-end classification models for detecting COVID-19 positive subjects. The experimental results on the Second 2021 DiCOVA Challenge - Track 2 dataset achieve the top-2 ranking with an AUC score of 81.21 on the blind Test set, improving the challenge baseline by 6.32 and performing competitively with the state-of-the-art systems.

The cumulative number of COVID-19 positive subjects reported globally is now over 231 million, and the cumulative number of deaths from COVID-19 exceeds 4.7 million [1]. Furthermore, the COVID-19 crisis now spans more than 200 countries, and the number of infections per day is still counted in the thousands with no sign of decreasing. One of the effective solutions to prevent and control the current epidemic is large-scale COVID-19 testing, which has been widely applied in many countries. Indeed, detecting COVID-19 positive subjects early is very useful for self-observation, isolation, and effective treatment. However, conducting a large number of rapid antigen or RT-PCR tests is very costly in both time and money. As a result, the DiCOVA Challenges were designed to find scientific and engineering insights into the question: can COVID-19 be detected from the cough, breathing, or speech sound signals of an individual? In particular, while the First 2021 DiCOVA Challenge [2] provides a dataset of cough sounds, the Second 2021 DiCOVA Challenge [3] provides cough, speech, and breathing sounds. The audio recordings are gathered from both COVID-19 positive and non-COVID-19 individuals (https://competitions.codalab.org/competitions/34801#learn the details). Given the cough, speech, and breathing recordings, the research community can propose systems for detecting COVID-19, which could potentially be deployed on edge devices as a COVID-19 testing solution.

Focusing on cough sounds, recent studies show that it is possible to detect COVID-19 by evaluating coughing. For example, the machine learning-based framework proposed in [4], which uses handcrafted features and a Support Vector Machine (SVM) model, achieved an AUC score of 85.02 on the First 2021 DiCOVA dataset [2]. On the same dataset, the deep learning-based framework proposed in [5], which uses a ConvNet model incorporated with data augmentation, achieved the best AUC score of 87.07 and claimed the first position on the First 2021 DiCOVA Challenge leaderboard. Focusing on feature extraction, Madhu et al. [6] combined Mel-frequency cepstral coefficients (MFCC) with delta features, where the delta features are extracted from a pipeline using a Long Short-Term Memory (LSTM) network, a Gabor filter bank, and the Teager energy operator (TEO), in that order.
Using the combined feature and a LightGBM model, the authors achieved an AUC score of 76.31 on the First 2021 DiCOVA dataset [2]. Similarly, Vincent et al. [7] conducted extensive experiments to evaluate the role of feature extraction. In particular, they proposed three types of features: (1) handcrafted features extracted with the openSMILE toolkit [8], (2) deep features extracted from pre-trained VGGish networks trained on AudioSet [9], and (3) deep features extracted from standard pre-trained models (ResNet50, DenseNet121, MobileNetV1, etc.) trained on the ImageNet dataset. They obtained their best AUC score of 72.8 on the First 2021 DiCOVA dataset [2] by using deep features extracted from a pre-trained VGG16 (trained on AudioSet) with a back-end LSTM-based classifier. Recently, a benchmark dataset of cough sounds for detecting COVID-19 [10, 11], recorded on mobile phones, has been published. Notably, the current result of 98% accuracy on this dataset shows the potential of cough analysis as an effective COVID-19 testing solution.

In this paper, we also explore cough sounds and propose a framework for detecting COVID-19. Our main contributions are: (1) through extensive experiments, we show that a combination of handcrafted features and embedding-based features is effective for representing the cough sound input, and (2) we propose a robust framework which can be further developed on edge devices as a COVID-19 testing application. Our experiments were conducted on the Second 2021 DiCOVA Challenge Track-2 dataset (the Track-2 dataset contains only cough sounds).

The remainder of this paper is organized as follows: Section 2 presents the Second 2021 DiCOVA Challenge as well as the Track-2 dataset, the evaluation setting, and the metrics. Section 3 presents the proposed deep learning framework. Section 4 presents and analyses the experimental results. Finally, Section 5 presents the conclusion and future work.

As we focus on cough sounds, which were also the subject of the First 2021 DiCOVA Challenge [2], only the Track-2 dataset is explored in this paper. The Second 2021 DiCOVA Challenge Track-2 dataset provides a Development set of 965 audio recordings and a blind Test set of 471 audio recordings. All audio recordings are at least 500 milliseconds long and were recorded at different sample rates. While the Development set is used for training and selecting the best model, the blind Test set is used for evaluating and comparing the submitted systems. The Development set contains 793 negative labels and 172 positive labels, i.e. it is an imbalanced dataset [12]. To evaluate on the Development set, the challenge requires five-fold cross-validation [3]; each fold comprises Train and Valid subsets as shown in Fig. 3. The evaluation result on the Development set is the average of the results over all five folds. To evaluate on the blind Test set, the obtained results are submitted to the Second 2021 DiCOVA Challenge for evaluation, ranking, and comparison with the other submitted systems.

The Area Under the ROC Curve (AUC) is used as the primary evaluation metric in the Second 2021 DiCOVA Challenge. The curve is obtained by varying the decision threshold between 0 and 1 with a step size of 0.0001. Additionally, the Sensitivity (Sen.) and the Specificity (Spec.), which are computed at every threshold value, are used as secondary evaluation metrics (note that Spec. is required to be equal to or greater than 95%). The leaderboard evaluates the submitted systems on the blind Test set as well as on the average performance of the five-fold cross-validation on the Development set (Avg. AUC) [3].
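As an illustration of these metrics, the sketch below computes the AUC and the sensitivity under the 95% specificity constraint from predicted probabilities. It is a minimal sketch assuming NumPy and scikit-learn; the function name and the handling of the threshold sweep reflect our own reading of the metric description, not the official challenge scoring code.

```python
# Minimal sketch of the challenge metrics, assuming y_true holds 0/1 labels
# and y_prob holds predicted COVID-19 probabilities. This follows our reading
# of the metric description, not the official scoring script.
import numpy as np
from sklearn.metrics import roc_auc_score

def challenge_metrics(y_true, y_prob, spec_floor=0.95, step=1e-4):
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    auc = roc_auc_score(y_true, y_prob)

    best_sen = 0.0
    for thr in np.arange(0.0, 1.0 + step, step):   # thresholds 0..1, step 0.0001
        pred = (y_prob >= thr).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fn = np.sum((pred == 0) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0))
        fp = np.sum((pred == 1) & (y_true == 0))
        sen = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        if spec >= spec_floor:                     # keep only thresholds with Spec. >= 95%
            best_sen = max(best_sen, sen)
    return auc, best_sen
```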
The overall framework architecture is described in Fig. 1. As the audio recordings have different sample rates, they are first resampled to 44.1 kHz with a mono channel. Then, the resampled recordings are fed into the front-end feature extraction, where embedding-based features and handcrafted features are extracted and concatenated to obtain the combined features. To deal with the imbalanced dataset mentioned in Section 2.1, the SVM-based SMOTE method [13] is applied on the combined features to ensure an equal number of positive and negative samples. Finally, the features after data augmentation are fed into different back-end classification models for detecting COVID-19 positive cases. Illustrative code sketches of the front-end feature extraction and the back-end classification step are given at the end of this section.

In the front-end feature extraction, we create a combined feature by concatenating handcrafted features and embedding features extracted from pre-trained models. Regarding the handcrafted features, 64 Mel-frequency cepstral coefficients (MFCCs), 12 Chroma features, a 128-bin Mel spectrogram, 1 zero-crossing rate, 1 gender flag, and 1 duration value are used in this paper. These handcrafted features are chosen because they are widely adopted in speech processing and proved robust in the First 2021 DiCOVA Challenge [6, 7, 4]. To extract them, we use Librosa [14], a powerful audio signal processing library. As the MFCC, Chroma, and Mel spectrogram are two-dimensional features, they are converted into a one-dimensional shape before being concatenated with the other features.

Regarding the embedding features, we evaluate different embeddings extracted from different pre-trained models: YAMNet [15], wav2vec 2.0 [16], TRILL [17], and the ComParE 2016 feature set [18] extracted with the openSMILE toolkit [8]. Because these pre-trained models have proved effective for a wide range of classification tasks (for example, the TRILL model pre-trained on AudioSet [9] proved robust for a wide range of non-semantic speech tasks such as speaker identity, language, and emotional state [17]), these embeddings are expected to work well on the 2021 DiCOVA Track-2 dataset of cough sounds. When a cough recording is fed into a pre-trained model, a two-dimensional embedding is extracted. We then compute the mean and standard deviation across the time dimension and concatenate them to obtain a one-dimensional embedding. This embedding is then concatenated with the handcrafted features mentioned above to create the combined feature. Finally, the combined features are scaled into the range [0, 1] before data augmentation and feeding into the back-end classification models.

For the back end, we evaluate different classification models: Light Gradient Boosting Machine (LightGBM) [20], Random Forest (RF), Support Vector Machine (SVM), Multi-layer Perceptron (MLP), and Extra Trees Classifier (ETC). The settings of these back-end classification models are described in Table 1, and all models are implemented with the Scikit-Learn toolkit [19]. To obtain results, each classification model is run with 10 seeds numbered from 0 to 9, and the output of each cross-validation session is calculated by soft voting [21] across the seeds. The GTX 1080 Titan GPU environment is used for running the classification experiments.
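The first sketch below illustrates the handcrafted branch with Librosa. It assumes 44.1 kHz mono input and mean pooling over time to flatten the two-dimensional MFCC, Chroma, and Mel features (the exact flattening is not specified above), and the gender flag is passed in as recording metadata; names and defaults are illustrative rather than the exact configuration used in our experiments.

```python
# Sketch of the handcrafted branch using Librosa. Mean pooling over time is
# one assumed way to flatten the 2-D features; the gender flag comes from
# recording metadata.
import numpy as np
import librosa

def handcrafted_features(path, gender_flag):
    y, sr = librosa.load(path, sr=44100, mono=True)               # resample to 44.1 kHz mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=64)            # shape (64, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)              # shape (12, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # shape (128, T)
    zcr = librosa.feature.zero_crossing_rate(y)                   # shape (1, T)
    duration = librosa.get_duration(y=y, sr=sr)                   # length in seconds
    return np.concatenate([
        mfcc.mean(axis=1),        # 64 values
        chroma.mean(axis=1),      # 12 values
        mel.mean(axis=1),         # 128 values
        zcr.mean(axis=1),         # 1 value
        [float(gender_flag), duration],
    ])
```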
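The next sketch shows the embedding branch and the feature combination. The `extract_embedding` callable is a placeholder for whichever pre-trained model (YAMNet, wav2vec 2.0, TRILL, or openSMILE ComParE) produces the two-dimensional (frames x dimensions) embedding; only the mean/standard-deviation pooling and the concatenation with the handcrafted vector from the previous sketch follow the description above.

```python
# Sketch of the embedding branch: pool a 2-D (frames x dims) embedding into a
# 1-D vector by concatenating its mean and standard deviation over time, then
# append it to the handcrafted features from the sketch above.
# `extract_embedding` is a placeholder for the chosen pre-trained model.
import numpy as np

def pool_embedding(embedding_2d):
    emb = np.asarray(embedding_2d)                       # shape: (num_frames, emb_dim)
    return np.concatenate([emb.mean(axis=0), emb.std(axis=0)])

def combined_feature(path, gender_flag, extract_embedding):
    hand = handcrafted_features(path, gender_flag)       # handcrafted branch
    emb = pool_embedding(extract_embedding(path))        # embedding branch
    # Min-max scaling into [0, 1] is applied afterwards, fitted on the training folds.
    return np.concatenate([hand, emb])
```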
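Finally, the sketch below shows one cross-validation fold of the back-end step with LightGBM as the classifier: SVM-based SMOTE balances the training portion, one model is trained per seed, and soft voting is implemented as averaging the predicted probabilities across the 10 seeds. The hyperparameters of Table 1 are not reproduced, re-running SMOTE per seed is an assumption, and the other back ends (RF, SVM, MLP, ETC) can be substituted for `LGBMClassifier`.

```python
# Sketch of one cross-validation fold: SVM-based SMOTE on the training data,
# one LightGBM model per seed, and soft voting (probability averaging) across
# seeds. Table 1 hyperparameters are not reproduced here.
import numpy as np
from imblearn.over_sampling import SVMSMOTE
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

def run_fold(X_train, y_train, X_valid, y_valid, n_seeds=10):
    seed_probs = []
    for seed in range(n_seeds):
        X_bal, y_bal = SVMSMOTE(random_state=seed).fit_resample(X_train, y_train)
        clf = LGBMClassifier(random_state=seed)
        clf.fit(X_bal, y_bal)
        seed_probs.append(clf.predict_proba(X_valid)[:, 1])
    y_prob = np.mean(seed_probs, axis=0)   # soft voting across the 10 seeds
    return roc_auc_score(y_valid, y_prob), y_prob
```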
Table 4 presents the performance comparison across the top-10 systems submitted to the Second 2021 DiCOVA Challenge Track-2. As shown in Table 4, our best system, which uses handcrafted & TRILL-based embedding features with a LightGBM model, achieves the top-2 ranking, reporting an AUC score of 81.21, a Sen. score of 48.33, and a Spec. score of 95.13 on the blind Test set, and an Avg. AUC score of 77.18 on the Development set. Notably, our Sen. result on the blind Test set and our Avg. AUC on the Development set achieve the top-1 ranking. These results indicate that our proposed system is robust, competitive, and has the potential to be further deployed on edge devices for detecting COVID-19.

This paper presents a deep learning-based framework for detecting COVID-19 positive subjects by exploring their cough sounds. Through extensive experiments on the Second 2021 DiCOVA Challenge Track-2 dataset, we showed that our best model, which uses a combination of handcrafted & TRILL-based embedding features and a LightGBM model, achieves the top-2 ranking of the challenge and is competitive with the state-of-the-art systems. Our further research will focus on different sound representations such as Chroma features, Spectral Contrast, and Tonnetz [22], as well as on exploring the breathing and speech sounds provided by the Second 2021 DiCOVA Challenge.

I would like to express deep gratitude to the organizers and all the teams taking part in the Second DiCOVA Challenge competition.

References
[1] WHO Coronavirus Disease (COVID-19) Dashboard.
[2] DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics.
[3] The Second DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics.
[4] Detecting COVID-19 from audio recording of coughs using random forests and support vector machines.
[5] COVID-19 diagnosis from cough acoustics using ConvNets and data augmentation.
[6] Panacea cough sound-based diagnosis of COVID-19 for the DiCOVA 2021 Challenge.
[7] Recognising COVID-19 from coughing using ensembles of SVMs and LSTMs with handcrafted and deep audio features.
[8] openSMILE: the Munich versatile and fast open-source audio feature extractor.
[9] Audio Set: An ontology and human-labeled dataset for audio events.
[10] Artificial intelligence model detects asymptomatic COVID-19 infections through cellphone-recorded coughs.
[11] COVID-19 artificial intelligence diagnosis using only cough recordings.
[12] Learning from imbalanced data sets.
[13] SMOTE: Synthetic minority over-sampling technique.
[14] librosa: Audio and music signal analysis in Python.
[15] Sound classification with YAMNet.
[16] wav2vec 2.0: A framework for self-supervised learning of speech representations.
[17] Towards learning a universal non-semantic representation of speech.
[18] The ComParE 2016 feature set.
[19] Scikit-learn: Machine learning in Python.
[20] LightGBM: A highly efficient gradient boosting decision tree.
[21] Soft voting-based ensemble approach to predict early stage DRC violations.
[22] Unsupervised detection of anomalous sound for machine condition monitoring using fully connected U-Net.