key: cord-0506485-0xpa488n
authors: Pahar, Madhurananda; Klopper, Marisa; Reeve, Byron; Warren, Rob; Theron, Grant; Diacon, Andreas; Niesler, Thomas
title: Automatic Tuberculosis and COVID-19 cough classification using deep learning
date: 2022-05-11
journal: nan
DOI: nan
sha: af8f8bd658082bc191fd192ec85c8a2780e5b183
doc_id: 506485
cord_uid: 0xpa488n

We present a deep learning based automatic cough classifier which can discriminate tuberculosis (TB) coughs from COVID-19 coughs and healthy coughs. Both TB and COVID-19 are respiratory diseases, have cough as a predominant symptom, and claim thousands of lives each year. The cough audio recordings were collected in both indoor and outdoor settings, and were also uploaded via smartphones by subjects around the globe, and thus contain various levels of noise. The data comprise 1.68 hours of TB coughs, 18.54 minutes of COVID-19 coughs and 1.69 hours of healthy coughs from 47 TB patients, 229 COVID-19 patients and 1498 healthy subjects, and were used to train and evaluate a CNN, an LSTM and a Resnet50. These three deep architectures were also pre-trained on 2.14 hours of sneezes, 2.91 hours of speech and 2.79 hours of noise for improved performance. The class imbalance in our dataset was addressed by applying the SMOTE data-balancing technique and by using performance metrics such as the F1-score and the AUC. Our study shows that the highest F1-scores, 0.9259 and 0.8631, were achieved by a pre-trained Resnet50 for the two-class (TB vs COVID-19) and three-class (TB vs COVID-19 vs healthy) cough classification tasks, respectively. The application of deep transfer learning improved the classifiers' performance and made them more robust, as they generalise better over the cross-validation folds. Their performance exceeds the TB triage test requirements set by the World Health Organisation (WHO). The features producing the best performance include higher-order MFCCs, suggesting that the differences between TB and COVID-19 coughs are not perceivable by the human ear. This type of cough audio classification is non-contact and cost-effective, and can easily be deployed on a smartphone, making it an excellent tool for both TB and COVID-19 screening.

I. INTRODUCTION

Tuberculosis (TB) is a bacterial infectious disease that affects the human lungs. It is prevalent in low-income settings, and 95% of all TB cases are reported in developing countries [1], [2]. Modern diagnostic tests are costly, as they rely on specialised equipment and laboratory procedures [3]-[5]. Suspected patients are tested when they meet the symptom criteria for TB investigation, but the results indicate that most of them cough due to other lung ailments; in fact, most TB suspects do not suffer from TB [6]. COVID-19 (coronavirus disease 2019) was declared a global pandemic by the World Health Organisation (WHO) on March 11, 2020. At the time of writing, there are 513.9 million global COVID-19 cases and, sadly, the pandemic has claimed the lives of 6.2 million people [7]. Many suspected TB patients in developing countries are therefore very likely to be suffering from COVID-19, and experimental evidence suggests that healthy people cough less than those who are sick with lung ailments [8]. Hence, there is a need for automated, non-contact, low-cost and easily accessible tools that screen for both TB and COVID-19 on the basis of cough audio. One of the major symptoms of respiratory diseases such as TB and COVID-19 is a cough [9], [10].
Depending on the nature of the respiratory disease, the airway is either obstructed or restricted, and this can affect the acoustic properties of the coughs. This allows cough audio to be used by machine learning algorithms, and many studies including our own [11]-[13] have used it to discriminate both TB [14] and COVID-19 [15] from healthy coughs. TB coughs are rare, and thus datasets are small and not publicly available. Successful studies [8], [11], [16] have found experimentally that shallow classifiers such as a multilayer perceptron (MLP) or a logistic regression (LR) model work well in detecting TB in cough audio. COVID-19 data, in contrast, are widely available [17]-[19], and many recent studies have successfully applied deep neural network (DNN) classifiers to detect COVID-19 in cough audio [15], [20], [21].

In this study, we present a deep learning based automatic cough classifier which discriminates TB coughs from COVID-19 coughs. We have used both public and private datasets and, as COVID-19 coughs are under-represented in our data, we have applied the synthetic minority over-sampling technique (SMOTE) to create new datapoints and balance the dataset. We use both the area under the ROC curve (AUC) and the F1-score as performance metrics for our three DNN classifiers (CNN, LSTM and Resnet50), and nested cross-validation to make the best use of our dataset. The highest F1-score of 0.9042 was achieved by a Resnet50 classifier in discriminating TB coughs from COVID-19 coughs. Inspired by our previous research [22], we have also made use of sneezes, speech and noise to pre-train these three deep architectures. This improved the F1-score of the two-class classification task to 0.9259, with more robust performance across the cross-validation folds. The corresponding AUC was 0.9245, with 96% sensitivity at 80% specificity, exceeding the TB triage test requirement of 90% sensitivity at 70% specificity set by the WHO. We have further investigated the three DNN classifiers' performance in a three-class classification task, in which healthy coughs were added as a third class. Initially, an F1-score of 0.8578 was achieved by the Resnet50; after applying transfer learning, the same architecture improved to 0.8631 in discriminating TB, COVID-19 and healthy coughs.

Section II details the datasets used for pre-training the DNN classifiers as well as the datasets used for the two-class and three-class classification and for fine-tuning those classifiers. Section III explains the features extracted from the audio, and Section IV describes the classification and hyperparameter optimisation process. Section V summarises the results and Section VI discusses them. Finally, Section VII concludes this study.

II. DATA

We have made use of both public and private data in this study. The TASK, Brooklyn, Sarcos and Wallacedene datasets were compiled by ourselves as part of research projects concerning cough monitoring and cough classification. The Coswara, ComParE, Google Audio Set & Freesound and LibriSpeech datasets were compiled from publicly available data. Coughs labelled 'TB', 'COVID-19' and 'healthy' are used for the classification tasks. Coughs were excluded from the pre-training data altogether, because coughs without these three labels may originate from other diseases, and we use cough audio only for the classification tasks (two-class and three-class) and for fine-tuning the pre-trained DNNs. All recordings were downsampled to 16 kHz.
A. Data for classification

The following six datasets of coughs with TB, COVID-19 and healthy labels were available for experimentation and are described in Table I.

1) TASK dataset: This dataset contains 6000 continuous cough recordings and 11393 non-cough sounds such as laughter, doors opening and objects moving [22]. It was collected at TASK, a TB research centre near Cape Town, South Africa, from patients undergoing TB treatment [23]. The data were compiled to develop cough detection algorithms and to monitor patients' long-term recovery in a multi-bed ward environment, using a smartphone with an attached external microphone [24].

2) Brooklyn dataset: Cough audio was compiled from 17 TB and 21 healthy subjects in order to develop a TB cough audio classifier that discriminates TB coughs from healthy coughs [8]. The recordings were made inside a controlled indoor booth, using an audio field recorder and a RØDE M3 microphone.

3) Wallacedene dataset: This dataset was collected to extend the previous TB cough audio classification study [8] to the discrimination of TB coughs from other sick coughs in a real-world noisy environment [11]. Here, the cough recordings were collected using a RØDE M1 microphone and an audio field recorder in an outdoor booth located at a busy primary healthcare clinic. It contains 402 coughs from 16 TB patients and includes more environmental noise, and therefore has a poorer signal-to-noise ratio, than the Brooklyn dataset.

4) Coswara dataset: This publicly available dataset was developed specifically for the purpose of developing COVID-19 classification algorithms [18], [25]. Data collection is web-based, and participants record their vocal audio, including coughs, using their smartphones. In this study, we used the deep coughs from 92 COVID-19 positive and 1079 healthy subjects, located on five different continents [12], [13]. All recordings were pre-processed to remove periods of silence to within a margin of 50 ms using a simple energy detector [12] (a sketch of such a detector is given at the end of this subsection).

5) ComParE dataset: This dataset was provided as part of the 2021 Interspeech Computational Paralinguistics ChallengE (ComParE) [19]. It contains 119 COVID-19 positive and 398 healthy subjects whose cough recordings were sampled at 16 kHz.

6) Sarcos dataset: This dataset was collected in South Africa as part of our own COVID-19 research [12], [13] and contains coughs from 18 COVID-19 positive subjects. The audio was pre-processed in the same way as the Coswara data.

Summary of data used for classification: Table I shows that our data contain only 18.54 minutes of COVID-19 cough audio, compared to 1.68 hours of TB coughs and 1.69 hours of healthy coughs, indicating that the COVID-19 labelled data are under-represented. As such a data imbalance can detrimentally affect a neural network's performance [26], [27], we have applied SMOTE [28], which oversamples the minority class by creating additional synthetic samples rather than, for example, randomly oversampling during training (a sketch of this balancing step is given at the end of this subsection). SMOTE has been applied successfully in the past to address training-set class imbalance in cough detection [22] and cough classification [12]. The TASK dataset contains only 14 patients, but the length of cough audio per patient is much longer than in the other two datasets. The audio signals and spectrograms of a TB, a COVID-19 and a healthy cough are shown in Figure 1; there are very few obvious visual differences between these three coughs.
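As a concrete illustration of the SMOTE balancing step, the sketch below oversamples a minority class of flattened feature matrices using the imbalanced-learn package. The array shapes and class counts are illustrative assumptions, not values from Table I, and in practice SMOTE is applied to the training folds only, never to held-out test data.

```python
# A minimal sketch of SMOTE balancing, assuming flattened per-cough features.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
n_feat = (3 * 26 + 2) * 150                 # a flattened (3M+2, S) feature matrix
X = rng.normal(size=(600, n_feat))          # 600 cough events (illustrative)
y = np.array([0] * 550 + [1] * 50)          # under-represented class, e.g. COVID-19

# SMOTE synthesises new minority samples by interpolating between a minority
# sample and one of its k nearest minority-class neighbours.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print(np.bincount(y_bal))                   # -> [550 550]
```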
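The silence removal applied to the Coswara and Sarcos recordings is described only as a simple energy detector with a 50 ms margin; the following is a minimal sketch of one such detector, assuming 16 kHz mono input. The 10 ms analysis frame and the -40 dB threshold are illustrative choices; only the 50 ms margin comes from the text.

```python
# A hedged sketch of an energy-based silence trimmer; not the authors' code.
import numpy as np

def trim_silence(x, sr=16000, frame_ms=10, margin_ms=50, thresh_db=-40.0):
    """Keep the span of frames whose short-time energy exceeds a threshold
    relative to the loudest frame, padded by a margin on either side."""
    frame = int(sr * frame_ms / 1000)
    n = len(x) // frame
    if n == 0:
        return x
    energy = np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    active = np.flatnonzero(db > thresh_db)   # frames above the energy threshold
    if active.size == 0:
        return x
    margin = int(sr * margin_ms / 1000)       # keep 50 ms around detected activity
    start = max(active[0] * frame - margin, 0)
    end = min((active[-1] + 1) * frame + margin, len(x))
    return x[start:end]
```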
In line with the visual similarity seen in Figure 1, an informal subjective test was conducted in which approximately 20 university students were asked to identify the sick and healthy coughs just by listening to the recordings. The results showed that the human auditory system is unable to detect disease or differentiate sick coughs from healthy coughs by listening alone.

B. Data for pre-training

Our classifier training is limited because cough audio data are not abundantly available. Hence, we use three other types of audio for pre-training: sneezes, speech and noise from the Google Audio Set & Freesound, LibriSpeech and TASK datasets, as described in Table II.

1) Google Audio Set & Freesound: The Google Audio Set dataset contains manually labelled excerpts from 1.8 million YouTube videos belonging to 632 audio event categories [29]. The Freesound audio database is a collection of tagged sounds uploaded by contributors from various parts of the world [30]. The recordings come from many different individuals under widely varying recording conditions and noise levels. From these, we compiled a collection of recordings comprising 1013 sneezes, 2326 speech excerpts and 1027 other non-vocal sounds such as restaurant chatter, running water and engine noise. This manually annotated dataset has been used successfully in developing cough detection algorithms [31].

2) LibriSpeech: From the freely available LibriSpeech corpus [32], utterances by 28 male and 28 female speakers were selected as a source of speech audio with very little noise.

3) Summary of data used for pre-training: In total, the data described in Table II include 1013 sneezes (13.34 minutes of audio), 2.91 hours of speech from both male and female participants, and 2.98 hours of noise. As sneezing is under-represented, we again applied SMOTE to create additional synthetic samples. In total, therefore, a dataset containing 7.84 hours of audio recordings with three class labels (sneeze, speech, noise) was used to pre-train the three DNN classifiers.

III. FEATURE EXTRACTION

Features such as mel-frequency cepstral coefficients (MFCCs), the zero-crossing rate (ZCR) [33] and kurtosis [33] were extracted from the audio recordings and used as the input to the DNN classifiers. Feature combinations containing MFCCs rather than linearly spaced log filterbanks [8] showed better performance in our previous TB [11] and COVID-19 [12], [22] classification tasks, so we extracted MFCCs, along with their first- and second-order differences, the ZCR and the kurtosis for both the classification and the pre-training tasks. MFCCs are the features of choice for detecting and classifying vocal audio such as speech [34] and coughs [31].

Overlapping frames were used to extract the features, where the frame overlap is chosen so that the audio signal is always divided into a certain exact number of frames, thus representing the entire audio event by a fixed number of frames. This allows an image-like fixed input dimension to be maintained for classification while preserving the overall temporal structure of the sound. The input feature matrix has dimensions (3M+2, S) for M MFCCs along with their M velocity and M acceleration coefficients. Such fixed two-dimensional features are particularly useful for training DNN classifiers and performed well in our previous experiments [12], [22]. The frame length (F), the exact number of frames (S) and the number of lower-order MFCCs (M) are used as feature extraction hyperparameters, listed in Table III.
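A sketch of this fixed-dimension feature extraction is given below, using librosa and scipy. The hop length is derived so that every recording yields exactly S frames, which is our reading of the overlap rule described above; the values M=26, S=150 and F=1024 are one configuration from the ranges in Table III, and the code is an illustration rather than the authors' implementation.

```python
# A minimal sketch, assuming a 16 kHz mono signal x of at least a few frames.
import librosa
import numpy as np
from scipy.stats import kurtosis

def extract_features(x, sr=16000, M=26, S=150, F=1024):
    """Return a (3M+2, S) matrix: M MFCCs, their velocity and acceleration,
    plus per-frame zero-crossing rate and kurtosis."""
    # Choose the hop so the recording is covered by exactly S frames of length F.
    hop = max(1, int(np.ceil((len(x) - F) / (S - 1)))) if len(x) > F else 1
    x = np.pad(x, (0, max(0, F + (S - 1) * hop - len(x))))        # exact coverage
    frames = librosa.util.frame(x, frame_length=F, hop_length=hop)[:, :S]  # (F, S)
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=M, n_fft=F,
                                hop_length=hop, center=False)[:, :S]       # (M, S)
    d1 = librosa.feature.delta(mfcc)             # velocity (first difference)
    d2 = librosa.feature.delta(mfcc, order=2)    # acceleration (second difference)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=0)) > 0, axis=0)    # (S,)
    kurt = kurtosis(frames, axis=0)                                        # (S,)
    return np.vstack([mfcc, d1, d2, zcr[None, :], kurt[None, :]])  # (3M+2, S)
```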
Table III shows that the number of extracted MFCCs (M) lies between 13 and 65, which varies the amount of spectral information retained from each audio event. Each audio signal is divided into between 70 and 200 frames, each consisting of between 512 and 4096 samples, corresponding to between 32 ms and 256 ms of audio at the 16 kHz sampling rate used in our experiments; this range is considered because different phases of a cough carry different information.

IV. CLASSIFICATION AND HYPERPARAMETER OPTIMISATION

We have used three DNN classifiers in this study: a CNN [35], an LSTM [36] and a Resnet50 [37]. We have refrained from experimenting with shallow classifiers, as deep architectures combined with the SMOTE data-balancing technique yielded better results in our previous experiments [12], [22]. For our initial set of experiments, we used these three DNN classifiers for two-class (TB vs COVID-19) and three-class (TB vs COVID-19 vs healthy) classification; the classifier hyperparameters are given in Table IV. Classifier training was stopped when the performance had not improved for 10 epochs.

Finally, for improved performance, we applied transfer learning, which has improved classification performance in our previous studies [12], [13]. Here, the DNN classifiers are pre-trained on the dataset described in Section II-B and then fine-tuned on the classification datasets described in Section II-A. The feature extraction hyperparameters were adopted from our previous studies [12], [13], while the hyperparameters of the CNN and LSTM were determined during the cross-validation process. These hyperparameters are given in Table V. A standard Resnet50, as described in Table 1 of [37], with a 512-unit dense layer, was used for the transfer learning. The transfer learning process for a CNN is illustrated in Figure 2 (a sketch follows below).

The feature extraction process and the classifiers have a number of hyperparameters, listed in Tables III and IV. They were optimised using a leave-p-out cross-validation scheme [38]. The train and test split ratio was 4:1, due to its effectiveness in medical applications [39]. This 5-fold cross-validation process made the best use of our dataset by using all subjects in both training and testing the classifiers, while enforcing a strict no-patient-overlap between cross-validation folds (sketched below). The F1-score was the optimisation criterion in the cross-validation folds and the performance indicator of the classifiers [40], [41]. The mean per-frame probability that a cough is from a COVID-19 positive subject, denoted Ĉ, is given in Equation (1):

$$\hat{C} = \frac{1}{S} \sum_{i=1}^{S} P(Y = 1 \mid X_i, \theta), \qquad (1)$$

where $P(Y = 1 \mid X_i, \theta)$ is the output of the classifier with parameters $\theta$ for the feature vector $X_i$ of the $i$-th frame.

V. RESULTS

The average F1-score, along with its standard deviation (σF1) over the outer folds during cross-validation, is shown in Tables VI and VII. The hyperparameters producing the highest F1-score over the inner loops are noted as the 'best classifier hyperparameters' in Tables VI and VII.

TABLE II: Datasets used in pre-training. Classifiers are pre-trained on 7.84 hours of audio recordings annotated with three class labels: sneeze, speech and noise. These data do not contain any coughs.

For the initial classification task in Table VI, the Resnet50 architecture performed best, producing the highest mean F1-score of 0.9042 and a mean AUC of 0.9190, with a σF1 of 0.83.
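A minimal sketch of the patient-wise outer cross-validation loop described in Section IV is given below, using scikit-learn's GroupKFold to enforce the no-patient-overlap constraint. The `patient_ids` array and the `build_and_eval` callback are hypothetical names introduced for illustration, and the inner hyperparameter search is only indicated by a comment.

```python
# A sketch, assuming per-event features X, labels y and per-event patient IDs.
import numpy as np
from sklearn.model_selection import GroupKFold

def outer_cv_f1(X, y, patient_ids, build_and_eval, n_splits=5):
    """5-fold outer loop with no patient overlap between folds (roughly a
    4:1 train/test split per fold). The inner hyperparameter search, omitted
    here, would repeat the same grouped split within the training portion."""
    scores = []
    for train, test in GroupKFold(n_splits=n_splits).split(X, y, groups=patient_ids):
        # build_and_eval trains a classifier and returns its held-out F1-score.
        scores.append(build_and_eval(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores)), float(np.std(scores))  # mean F1 and sigma_F1
```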
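The pre-train/fine-tune procedure of Section IV (Figure 2) can be sketched in Keras as follows; the models in this study are TensorFlow-based, but the layer sizes, optimiser settings and the 128-unit dense layer here are illustrative assumptions that do not reproduce Tables IV and V. The pre-trained backbone is kept, and only the final softmax layer is replaced before fine-tuning on cough data.

```python
# A hedged sketch of pre-training followed by fine-tuning; not the authors' code.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_backbone(input_shape=(80, 150, 1)):  # a (3M+2, S) feature "image", M=26
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
    ], name="backbone")

# Training stops when performance has not improved for 10 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)

# 1) Pre-train on the three-class sneeze / speech / noise data of Section II-B.
backbone = build_backbone()
pretrain = models.Sequential([backbone, layers.Dense(3, activation="softmax")])
pretrain.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# pretrain.fit(X_pre, y_pre, validation_split=0.1, epochs=200, callbacks=[early_stop])

# 2) Keep the pre-trained backbone, attach a fresh head, fine-tune on coughs.
finetune = models.Sequential([backbone, layers.Dense(2, activation="softmax")])
finetune.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# finetune.fit(X_cough, y_cough, validation_split=0.1, epochs=200, callbacks=[early_stop])
```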
Returning to Table VI: although the CNN and LSTM produced lower F1-scores and AUCs, their σF1 values were also lower, at 0.61 and 0.49 respectively, suggesting better generalisation and robustness across the folds for the less deep architectures. This also indicates that very deep architectures such as the Resnet50, although able to perform better, are prone to over-fitting. The best feature hyperparameters were 26 MFCCs, 1024-sample frames and 150 frames per audio event such as a cough.

To counter this over-fitting, we applied transfer learning and observed a slight improvement in the DNN classifiers' performance. For the pre-trained Resnet50 classifier, the F1-score and AUC increased to 0.9259 and 0.9245 respectively, while σF1 decreased to 0.03. A similar trend was observed for the CNN and LSTM classifiers, whose F1-scores and AUCs also increased along with lower σF1 values. Although the CNN outperformed the LSTM initially, the LSTM outperformed the CNN after transfer learning was applied. The mean ROC curves for the initial and the pre-trained Resnet50 are shown in Figure 3. These two systems achieve 96% and 93% sensitivity respectively at 80% specificity; they therefore exceed the community-based TB triage test requirement of 90% sensitivity at 70% specificity set by the WHO [42].

TABLE V: Feature extraction hyperparameters were adopted from our previous related work [12], [13], while classifier hyperparameters were optimised on the pre-training data using cross-validation.

Fig. 3: ROC curves for discriminating TB coughs from COVID-19 coughs. An AUC of 0.9190 is achieved by the Resnet50, and the highest AUC of 0.9245 is achieved after applying transfer learning to the same Resnet50 architecture. Both systems achieve 96% and 93% sensitivity respectively at 80% specificity, thus exceeding the community-based TB triage test requirement of 90% sensitivity at 70% specificity set by the WHO.

We observe a similar pattern in the three-class classification. Table VII shows that the highest F1-score of 0.8578 was achieved by the Resnet50 classifier with a σF1 of 0.67, using the best feature hyperparameters of 39 MFCCs, 1024-sample frames and 120 frames per cough. The CNN and LSTM produced F1-scores of 0.8220 and 0.8125 with σF1 values of 0.41 and 0.49 respectively; both their F1-scores and their σF1 values are lower than those of the Resnet50. As this is a three-class classification, accuracy replaces the AUC in Table VII. Again, the signs of over-fitting are clear in these results, and we apply transfer learning next.

The application of transfer learning improved the three-class classification performance by a small margin. The F1-score of the Resnet50 rose to 0.8631 and its σF1 decreased to 0.11. The performance of the CNN and LSTM also improved, with F1-scores of 0.8455 and 0.8427 achieved by these two DNN classifiers respectively, and their σF1 values are much lower: 0.07 and 0.09 respectively. Although the pre-trained CNN and LSTM models produced lower F1-scores, their σF1 values are also lower, unlike in the previous two-class classification. This shows that the application of transfer learning helps make a classifier more robust.

VI. DISCUSSION

Although many previous studies have shown that both TB and COVID-19 coughs can be discriminated from healthy coughs, here we show that there are unique disease signatures present in cough audio which enable machine learning classifiers to discriminate TB coughs from COVID-19 coughs.
We have found experimentally that when cough data are limited, classifier performance can be poor and the classifiers are prone to over-fitting. Very deep architectures generally produce higher mean F1-scores, but at the expense of higher variance across the cross-validation folds. Our study shows that transfer learning using vocal data that do not even include coughs can be used to improve classifiers' performance in disease classification.

VII. CONCLUSION

In this study, a deep learning based cough classifier which can discriminate between TB coughs, COVID-19 coughs and healthy coughs has been presented. A subjective test confirmed that respiratory disease cannot be identified just by listening to the cough audio. The cough recordings contain various types and levels of background noise, as they were collected inside a TB research centre and a recording booth, and by using smartphones from subjects around the globe. The cough data include 47 TB subjects, 229 COVID-19 subjects and 1498 healthy subjects, contributing 1.68 hours, 18.54 minutes and 1.69 hours of audio respectively. As the application of transfer learning yielded better performance in our previous studies, a separate dataset containing 2.14 hours of sneezes, 2.91 hours of speech and 2.79 hours of noise (door slamming, engines running, etc.) was used to pre-train three deep neural networks: a CNN, an LSTM and a Resnet50. The class imbalance in our dataset was addressed by applying the SMOTE data-balancing technique during training and by using performance metrics such as the F1-score and the AUC. The classifiers were evaluated using a 5-fold nested cross-validation scheme.

The experimental results show that the highest F1-score of 0.9259 was achieved by a pre-trained Resnet50 for the two-class (TB vs COVID-19) cough classification task, and the highest F1-score of 0.8631 was achieved by a pre-trained Resnet50 for the three-class (TB vs COVID-19 vs healthy) task. The pre-trained Resnet50 architecture also produced the highest AUC of 0.9245, with 96% sensitivity at 80% specificity, which exceeds the TB triage test requirement of 90% sensitivity at 70% specificity. The results also show that the application of transfer learning improves performance and generalises better over the cross-validation folds, making the classifiers more robust. The best feature hyperparameters include higher-order MFCCs, suggesting that the auditory patterns responsible for disease classification are not perceivable by the human auditory system. This type of cough audio classification is non-contact and cost-effective, and can easily be deployed on a smartphone; it can therefore be a useful tool for both TB and COVID-19 screening, especially in a developing-country setting.

As future work, we are investigating the length of cough audio required for effective classification. We are also compiling a larger dataset containing both TB and COVID-19 patients to improve the existing cough classification models, and are deploying the TensorFlow-based models on an Android platform.

REFERENCES

[1] Tuberculosis; who is most at risk?
[2] The global tuberculosis epidemic and progress in care, prevention, and research: an overview in year 3 of the End TB era
[3] Feasibility, acceptability, and cost of tuberculosis testing by whole-blood interferon-gamma assay
[4] Direct susceptibility testing for multi drug resistant tuberculosis: a meta-analysis
[5] Testing for tuberculosis
[6] Chronic wet cough: protracted bronchitis, chronic suppurative lung disease and bronchiectasis
[7] COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE)
[8] Detection of Tuberculosis by Automatic Cough Sound Analysis
[9] Persistent symptoms in patients after acute COVID-19
[10] Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China
[11] Automatic cough classification for tuberculosis screening in a real-world environment
[12] COVID-19 cough classification using machine learning and global smartphone recordings
[13] COVID-19 detection in cough, breath and speech using deep transfer learning and bottleneck features
[14] Use of cough sounds for diagnosis and screening of pulmonary disease
[15] COVID-19 Artificial Intelligence Diagnosis using only Cough Recordings
[16] Cough detection algorithm for monitoring patient recovery from pulmonary tuberculosis. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society
[17] The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
[18] Coswara: A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis
[19] The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates
[20] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app
[21] Automated detection of COVID-19 cough
[22] Deep Neural Network based Cough Detection using Bed-mounted Accelerometer Measurements
[23] Automatic Non-Invasive Cough Detection based on Accelerometer and Audio Signals
[24] Accelerometer-based bed occupancy detection for automatic, non-invasive long-term cough monitoring
[25] DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics
[26] Experimental perspectives on learning from imbalanced data
[27] Learning from imbalanced data: open challenges and future directions
[28] SMOTE: synthetic minority over-sampling technique
[29] Audio Set: An ontology and human-labeled dataset for audio events
[30] Freesound technical demo
[31] A comparative study of features for acoustic cough detection using deep architectures
[32] LibriSpeech: An ASR corpus based on public domain audio books
[33] Voiced/unvoiced decision for speech signals based on zero-crossing rate and energy
[34] Coding and Decoding Speech using a Biologically Inspired Coding System
[35] ImageNet classification with deep convolutional neural networks
[36] Long short-term memory
[37] Deep residual learning for image recognition
[38] Leave-p-Out Cross-Validation Test for Uncertain Verhulst-Pearl Model With Imprecise Observations
[39] Effect of dataset size and train/test split ratios in QSAR/QSPR multiclass classification
[40] Anomaly Detection: How to Artificially Increase Your F1-Score with a Biased Evaluation Protocol
[41] An introduction to ROC analysis
[42] World Health Organization

ACKNOWLEDGEMENTS

We would like to thank the South African Centre for High Performance Computing (CHPC) for providing computational resources on their Lengau cluster for this research, and gratefully acknowledge the support of Telkom South Africa. We also thank the Clinical Mycobacteriology & Epidemiology (CLIME) clinic team for assisting in data collection, especially Sister Jane Fortuin and Ms. Zintle Ntwana.
We also especially thank Igor Miranda, Corwynne Leng, Renier Botha, Jordan Govendar and Rafeeq du Toit for their support in data collection and annotation. The content and findings reported are the sole deduction, view and responsibility of the researchers and do not reflect the official position and sentiments of the SAMRC, EDCTP2, the European Union or the funders.