key: cord-0701381-1apq2kui
authors: Han, Jing; Brown, Chloe; Chauhan, Jagmohan; Grammenos, Andreas; Hasthanasombat, Apinan; Spathis, Dimitris; Xia, Tong; Cicuta, Pietro; Mascolo, Cecilia
title: Exploring Automatic COVID-19 Diagnosis via Voice and Symptoms from Crowdsourced Data
date: 2021-02-10
journal: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414576
sha: 46e2bd753b4b38777789f762d3783a8b815b2a87
doc_id: 701381
cord_uid: 1apq2kui

The development of fast and accurate screening tools, which could facilitate testing and reduce the need for more costly clinical tests, is key to managing the current COVID-19 pandemic. In this context, some initial work shows promise in detecting diagnostic signals of COVID-19 from audio sounds. In this paper, we propose a voice-based framework to automatically detect individuals who have tested positive for COVID-19. We evaluate the performance of the proposed framework on a subset of data crowdsourced from our app, containing 828 samples from 343 participants. By combining voice signals and reported symptoms, an AUC of 0.79 has been attained, with a sensitivity of 0.68 and a specificity of 0.82. We hope that this study opens the door to rapid, low-cost, and convenient pre-screening tools that can automatically detect the disease.

On 11 March 2020, the World Health Organisation declared the COVID-19 outbreak a global pandemic. At the time of writing this paper, more than 37 million confirmed COVID-19 cases and one million deaths have been reported globally. In addition to developing drugs and vaccines for treatment and protection [1, 2], scientists and researchers are also investigating primary screening tools that should ideally be accurate, cost-effective, rapid, and easily accessible to the general public. Amongst the efforts towards rapid screening [3, 4], audio-based diagnosis appears promising, mainly due to its non-invasive and ubiquitous character, which would allow for individual pre-screening 'anywhere', 'anytime', in real time, and available to 'anyone' [5]. Many applications have recently been developed for monitoring health and wellbeing via intelligent speech and sound analysis [6, 7, 8].

COVID-19 is an infectious disease, and most people infected with COVID-19 experience mild to moderate respiratory illness [9]. On the one hand, COVID-19 symptoms vary widely, including cough, dyspnea, fever, headache, loss of taste or smell, and sore throat [10]. On the other hand, many of these symptoms affect respiration and voice production, and hence can be recognised via speech and sound analysis. Such symptoms include shortness of breath, dry or wet cough, dysphonia, and fatigue, to name but a few. As a consequence, several research works have recently been published aiming to provide sound-based automatic diagnostic solutions [4, 11, 12].

* Ordered alphabetically, equal contribution. This work was supported by ERC Project 833296 (EAR).

In this paper, we propose machine learning models for voice-based COVID-19 diagnosis. More specifically, we analyse a subset of data from 343 participants crowdsourced via our app, and show the discriminatory power of voice for the diagnosis. We demonstrate how voice can be used as a signal to distinguish symptomatic individuals who tested positive for COVID-19 from (tested) non-COVID-19 individuals who have also developed symptoms akin to COVID-19.
We further show performance improvements by combining sounds and symptoms for the diagnosis, yielding a specificity of 0.82 and an AUC of 0.79.

With the advent of COVID-19, researchers have started to explore whether respiratory sounds could be diagnostic [5]. For instance, in [4], breathing and cough sounds were targeted, and the researchers demonstrate that COVID-19-positive individuals are distinguishable from healthy controls as well as from asthmatic patients. In [13], an interpretable COVID-19 diagnosis framework was devised to distinguish COVID-19 cough from other types of cough. Likewise, in [12], a detectable COVID-19 signature was found in cough sounds, which can help increase testing capacity. However, none of the aforementioned efforts analysed the potential of voice. Recently, the feasibility of COVID-19 screening using voice has been introduced in [14]. Similarly, in [15], significant differences in several voice characteristics were observed between COVID-19 patients and healthy controls. Moreover, in [16], speech recordings from hospitalised COVID-19 patients were analysed to categorise the health state of the patients. Our work differs from these studies in that we utilise an entirely crowdsourced dataset, and hence have to deal with data complexities such as recordings in different languages and varied environmental noise. Furthermore, we jointly analyse the voice samples and symptom metadata, and show that better performance can be obtained by combining them. Our study confirms that even in the most challenging scenario of in-the-wild crowdsourced data, voice is a promising signal for the pre-screening of COVID-19.

This section presents a description of the data acquisition, the preprocessing, and the tasks of interest. We note that the data collection and study have been approved by the Ethics Committee of the Department of Computer Science and Technology at the University of Cambridge.

The crowdsourced data is collected via our "COVID-19 Sounds App" (www.covid-19-sounds.org). It has three versions, web-based, Android, and iOS, with the aim of reaching a large number of users while maintaining their anonymity. When using the app, users are asked to record and submit their breathing, coughing, and voice samples, report symptoms if any, and provide some basic demographic and medical information. The app also asks users whether they have tested positive (or negative) for the virus, and whether they are in hospital. For more details of our data collection framework, the reader is referred to [4]. Fig. 1 illustrates some of the symptom- and voice-collection screens from the iOS app.

As of 14 October 2020, data from 13,722 unique users (4,690 from the web app, 6,334 from Android, and another 2,698 from iOS) had been collected. In this study, we explore data from two groups of participants, i.e., users who declared having tested positive for COVID-19, and those who tested negative. As a consequence, data from 343 participants were selected for our analysis. In particular, 140 participants tested positive, 199 tested negative, one transitioned from an initially positive test to a negative one, and another three transitioned the other way round, from negative to positive. Note that in our selected subset of data, negative participants, like positive ones, declared their symptoms to varying extents. Likewise, there are asymptomatic positive participants who selected "None" when asked about their symptoms.
A comparison of the percentage occurrence of 11 symptoms ("None" excluded) between positive and negative participants is depicted in Fig. 2. Loss of smell or taste appears to be reported more frequently among positive participants than negative ones, while the differences in percentage occurrence between positive and negative participants are rather small across the other reported symptoms.

Recently, the potential of respiratory sounds for COVID-19 diagnosis has been explored in our previous work as well as by other researchers. However, few research works have yet investigated the possibility of detecting COVID-19 infection from voice. In this study, we focus on voice-based analysis for disease diagnosis, and the data preprocessing workflow is detailed as follows. First, all voice recordings from the selected users were converted to mono signals with a sampling rate of 16 kHz. Recordings that do not contain any speech signal were discarded. Then, we considered applying segmentation. As mentioned previously, each recording consists of multiple repetitions of the given sentence by the same user, varying from one to three times. However, in our preliminary analysis, we noticed that the effect of segmentation was negligible, and that segmentation might eliminate possible breathing differences and temporal dynamics between repetitions. For this reason, we retained the unsegmented samples for further analysis, while trimming the leading and trailing silence from each recording as in [4]. Lastly, audio normalisation was applied to each recording, aiming to eliminate the volume inconsistency across participants caused by varied devices or different distances between the mouth and the microphone. After preprocessing, we obtained a total of 828 voice samples (326 positive and 502 negative) from 343 participants, who mostly come from the UK, Portugal, Spain, and Italy.
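As a minimal sketch of this preprocessing pipeline (mono conversion, resampling to 16 kHz, silence trimming, and volume normalisation), the snippet below uses the librosa and soundfile libraries; the 25 dB trimming threshold, the peak-normalisation step, and all function and file names are illustrative assumptions rather than the authors' exact settings.

```python
# Sketch of the voice preprocessing described above (assumed parameters).
import librosa
import numpy as np
import soundfile as sf

def preprocess_recording(in_path: str, out_path: str, sr: int = 16000) -> None:
    # Load as mono and resample to 16 kHz, as described above.
    y, _ = librosa.load(in_path, sr=sr, mono=True)
    # Trim leading and trailing silence (the 25 dB threshold is an assumption).
    y, _ = librosa.effects.trim(y, top_db=25)
    # Peak-normalise to reduce volume differences caused by varied devices
    # and mouth-to-microphone distances (normalisation method is an assumption).
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    sf.write(out_path, y, sr)

# Hypothetical usage:
# preprocess_recording("voice_raw.wav", "voice_16k_clean.wav")
```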
In this study, a series of binary classification tasks is developed. In particular, based on the dataset collected, we train models for the following clinically meaningful tasks:

• Task 1: Distinguish individuals who declared that they tested positive for COVID-19 from individuals who declared that they tested negative. This is the general scenario, and we refer to this task as 'Pos. vs. Neg.'

• Task 2: Distinguish individuals who recently tested positive for COVID-19 (within the last 14 days) from individuals who tested negative, specifically those with a negative test and no reported symptoms. We refer to this task as 'newPos. vs. Neg. w/o sym.' This setting follows our previous work [4], so as to compare the capability of voice samples with that of breathing and cough sounds for COVID-19 diagnosis.

• Task 3: Distinguish asymptomatic individuals who tested positive for COVID-19 from individuals who tested negative, specifically healthy controls who do not report any symptoms. This task is devised to investigate whether asymptomatic carriers of the disease can be identified from their voice. This is of particular concern given the high rate of asymptomatic infection reported in the population [17]; identifying asymptomatic individuals may therefore play a significant role in controlling the ongoing pandemic [17]. We refer to this task as 'Pos. w/o sym. vs. Neg. w/o sym.'

• Task 4: Distinguish symptomatic individuals who declared that they tested positive for COVID-19 and have developed at least one symptom from individuals who declared that they tested negative but suffer from one or more symptoms. This task is considered with the aim of understanding the feasibility of voice analysis to differentiate COVID-19 from other diseases, such as the common flu. We refer to this task as 'Pos. w/ sym. vs. Neg. w/ sym.'

In addition to voice-based analysis, we explore the reported symptoms to provide complementary information. In particular, for symptomatic individuals, their voice and their symptoms are integrated as inputs to the models. More specifically, another three tasks are investigated:

• S_only: Distinguish symptomatic positive individuals from symptomatic negative users by using their symptoms only.

• (V+S)_FF: Distinguish symptomatic positive individuals from symptomatic negative users via feature-level fusion, by concatenating voice features and symptom-based features as inputs to a single model.

• (V+S)_DF: Distinguish symptomatic positive individuals from symptomatic negative users via decision-level fusion, by combining the predictions from a voice-based model and a symptom-based model. In our case, the final decision is the prediction from the model with the highest probability estimate for a given instance.

In this section, a comprehensive evaluation is performed to investigate the performance of the tasks defined in Section 3.3. We describe the features, the experimental setup, and the result analysis, respectively.

In this study, we applied an established acoustic feature set, namely the INTERSPEECH 2009 Computational Paralinguistics Challenge (ComParE) set [18], extracted with the open-source openSMILE toolkit [19]. For each audio file, 12 functionals were applied to 16 frame-level descriptors and their corresponding delta coefficients, resulting in a total of 384 features. In particular, the 16 frame-level descriptors chosen are Zero-Crossing Rate (ZCR), Root Mean Square (RMS) frame energy, pitch frequency (F0), Harmonics-to-Noise Ratio (HNR), and Mel-Frequency Cepstral Coefficients (MFCCs) 1-12, covering prosodic, spectral, and voice quality features [18]. For more details about these features, please refer to [18]. Moreover, we combined the voice-based analysis with symptoms for COVID-19 diagnosis. Specifically, the 11 most common symptoms of COVID-19 were chosen, as shown in Fig. 2. To convert these symptoms into feature vectors, one-hot encoding was utilised, resulting in an 11-dimensional symptom-based feature vector for each sample, where each dimension indicates the presence (1) or absence (0) of a particular symptom.

Following feature extraction, we used Support Vector Machines (SVMs) with a linear kernel as the classifiers for all tasks, due to their widespread usage and robust performance in intelligent audio analysis [18, 20]. The complexity parameter C was set to 0.01 based on our preliminary research. Code was implemented using the scikit-learn library in Python. For each task, 5-fold cross-validation was performed while the subject-independent constraint was kept, ensuring that data points from the same participant do not appear in both the training and test splits. Further, to deal with the imbalanced data during training, data augmentation via the Synthetic Minority Oversampling Technique (SMOTE) [21] was carried out to create synthetic observations of the minority class.
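The following is a minimal sketch of this training protocol using scikit-learn and imbalanced-learn: a linear SVM with C = 0.01, subject-independent 5-fold cross-validation, and SMOTE applied to the training folds only. The use of GroupKFold as the subject-independent splitter, the feature standardisation step, and the probability-calibrated SVC are our own assumptions; the paper does not specify these details.

```python
# Sketch of the classification setup described above (some details assumed).
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cross_validate(X, y, subject_ids, n_splits=5, seed=42):
    """Subject-independent cross-validation for one task.

    X: (n_samples, n_features) acoustic (or fused) feature matrix,
    y: binary labels (1 = tested positive), subject_ids: one id per sample.
    Returns a list of (true labels, positive-class probabilities) per fold.
    """
    X, y = np.asarray(X), np.asarray(y)
    fold_outputs = []
    splitter = GroupKFold(n_splits=n_splits)  # keeps each participant in one split
    for train_idx, test_idx in splitter.split(X, y, groups=subject_ids):
        # Standardise features using training-fold statistics only (assumption).
        scaler = StandardScaler().fit(X[train_idx])
        X_train, X_test = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
        # Oversample the minority class in the training fold only (SMOTE).
        X_train, y_train = SMOTE(random_state=seed).fit_resample(X_train, y[train_idx])
        # Linear SVM with C = 0.01, as stated in the paper.
        clf = SVC(kernel="linear", C=0.01, probability=True, random_state=seed)
        clf.fit(X_train, y_train)
        fold_outputs.append((y[test_idx], clf.predict_proba(X_test)[:, 1]))
    return fold_outputs
```

The per-fold probability estimates returned here can then be scored with the evaluation metrics described below.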
To validate the recognition performance of the voice-based models for disease diagnosis under various scenarios, we selected the following standard evaluation metrics: sensitivity (also known as recall or true positive rate (TPR), calculated as TP/(TP + FN)), specificity (also referred to as true negative rate (TNR), calculated as TN/(TN + FP)), the area under the ROC curve (ROC-AUC), which considers both sensitivity and specificity at various probability thresholds, and the area under the precision-recall curve (PR-AUC). For each model, the mean and standard deviation of each metric across all five folds were computed.

Experiment results are presented in Table 1.

Table 1: Performance in terms of sensitivity (SE), specificity (SP), area under the receiver operating characteristic curve (ROC-AUC), and area under the precision-recall curve (PR-AUC) for the voice-based diagnosis. For each measurement, the mean and standard deviation across 5-fold cross-validation are reported.

For Task 1, when distinguishing positive-tested individuals from negative ones without taking their symptoms into account, the model achieves a sensitivity of 62% and a specificity of 74%. Further, when distinguishing recently tested positive individuals from healthy controls without any symptoms (Task 2), the ROC-AUC and PR-AUC both increase from around 75% to 79%, while the sensitivity improves from 62% to 70% and the specificity from 74% to 75%. This indicates that voice signals carry a detectable COVID-19 signature. For comparison, in [4], an analysis based on cough and breathing sounds achieved a sensitivity of 69% and a ROC-AUC of 80%, albeit on a different subset of users. The performance obtained from human voice is thus quite comparable to that of cough and breathing for COVID-19 diagnosis; hence, it would be interesting to analyse all three sound types jointly to obtain a more comprehensive picture.

Next, when distinguishing asymptomatic patients from healthy controls (Task 3), we observe a noticeable decrease in sensitivity, from 70% to 40%, indicating that a high proportion of asymptomatic patients are misclassified as healthy participants. The ROC-AUC also drops from 79% to 65%. This is in line with the findings of a recent study [12], where researchers achieved a ROC-AUC of 67% when identifying COVID-19 coughs from asymptomatic individuals. It implies that with the current features and model, it is difficult to identify asymptomatic patients from their voice alone. However, when distinguishing symptomatic COVID-19 patients from non-COVID-19 controls who have developed similar symptoms (Task 4), our model achieves better performance than in Task 3, attaining a ROC-AUC of 77%. This demonstrates the potential of exploiting voice as a primary screening tool; such a tool could rapidly pre-screen symptomatic individuals at low cost before clinical testing.

In addition, when taking the symptoms into account, we trained another three models. In particular, both feature-level and decision-level fusion were explored. The former concatenates the audio features and the encoded symptoms into a single feature vector as input, while the latter chooses the prediction with the higher probability estimate from the two independently trained models.
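Below is a minimal sketch of how these metrics and the decision-level fusion rule could be computed with scikit-learn, assuming each unimodal model outputs a positive-class probability per instance. Interpreting "highest probability estimate" as the larger maximum class probability (equivalently, the probability furthest from 0.5) is our reading of the rule, and average precision is used as a common estimate of PR-AUC; all names are illustrative.

```python
# Sketch of the evaluation metrics and decision-level fusion described above.
import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, ROC-AUC, and PR-AUC for one fold."""
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),                      # TP / (TP + FN)
        "specificity": tn / (tn + fp),                      # TN / (TN + FP)
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),  # common PR-AUC estimate
    }

def decision_level_fusion(p_voice, p_symptom):
    """Per instance, keep the prediction of the more confident model, i.e. the
    one whose positive-class probability is further from 0.5 (equivalently,
    whose maximum class probability is larger) -- our reading of the rule."""
    p_voice, p_symptom = np.asarray(p_voice), np.asarray(p_symptom)
    use_voice = np.abs(p_voice - 0.5) >= np.abs(p_symptom - 0.5)
    return np.where(use_voice, p_voice, p_symptom)
```

Feature-level fusion, by contrast, simply concatenates the 384-dimensional acoustic vector with the 11-dimensional symptom vector before training a single SVM.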
The corresponding results are shown in Table 2. Comparing the results, the best performance is achieved by decision-level fusion: it outperforms each unimodal model, attaining 79% in ROC-AUC and PR-AUC, 68% in sensitivity, and 82% in specificity. This shows the promise of combining voice and symptoms in our analysis. However, note that performance varies across folds, leading to wide standard deviations for our models; this can also be seen from the ROC curves displayed in Fig. 3. We believe that this can be alleviated with more training data, as shown in our previous work [4].

In this paper, voice-based models are proposed to discriminate COVID-19-positive cases from healthy controls. The effectiveness of our models is evaluated on a crowdsourced dataset, and the results highlight the great potential of developing an early-stage, voice-based screening tool for disease diagnosis. In addition to voice analysis, this work further explores fusion strategies to combine voice and reported symptoms, which yields encouraging results. For future work, we plan to incorporate other sounds, such as breathing and coughing, alongside voice. In addition, we will investigate the impact of the disease on voice by analysing the correlation of voice characteristics before and after infection. Furthermore, our data collection is ongoing, and we will improve the robustness of our models by training on a larger pool of users.

References:
[1] Developing COVID-19 vaccines at pandemic speed.
[2] Pharmacologic treatments for coronavirus disease 2019 (COVID-19): A review.
[3] Artificial intelligence-enabled rapid diagnosis of patients with COVID-19.
[4] Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data.
[5] COVID-19 and computer audition: An overview on what speech & sound analysis could contribute in the SARS-CoV-2 corona crisis.
[6] Fine-grained sleep monitoring: Hearing your breathing with smartphones.
[7] Snore-GANs: Improving automatic snore sound classification with synthesized data.
[8] Speech landmark bigrams for depression detection from naturalistic smartphone speech.
[9] Clinical characteristics of coronavirus disease 2019 in China.
[10] Symptom clusters in COVID-19: A potential clinical prediction tool from the COVID Symptom Study app.
[11] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app.
[12] Cough against COVID: Evidence of COVID-19 signature in cough sounds.
[13] Pay attention to the cough: Early diagnosis of COVID-19 using interpretable symptoms embeddings with cough sound signal processing.
[14] SARS-CoV-2 detection from voice.
[15] Voice quality evaluation in patients with COVID-19: An acoustic analysis.
[16] An early study on intelligent analysis of speech under COVID-19: Severity, sleep quality, fatigue, and anxiety.
[17] Prevalence of asymptomatic SARS-CoV-2 infection: A narrative review.
[18] The INTERSPEECH 2009 emotion challenge.
[19] openSMILE: The Munich versatile and fast open-source audio feature extractor.
[20] Prediction-based learning for continuous emotion recognition in speech.
[21] SMOTE: Synthetic minority over-sampling technique.