title: Sounds of COVID-19: exploring realistic performance of audio-based digital testing
authors: Han, Jing; Xia, Tong; Spathis, Dimitris; Bondareva, Erika; Brown, Chloe; Chauhan, Jagmohan; Dang, Ting; Grammenos, Andreas; Hasthanasombat, Apinan; Floto, Andres; Cicuta, Pietro; Mascolo, Cecilia
date: 2021-06-29

Researchers have been battling with the question of how we can identify Coronavirus disease (COVID-19) cases efficiently, affordably and at scale. Recent work has shown how audio-based approaches, which collect respiratory audio data (cough, breathing and voice), can be used for testing; however, there is a lack of exploration of how biases and methodological decisions impact these tools' performance in practice. In this paper, we explore the realistic performance of audio-based digital testing of COVID-19. To investigate this, we collected a large crowdsourced respiratory audio dataset through a mobile app, alongside recent COVID-19 test results and symptoms intended as ground truth. Within the collected dataset, we selected 5,240 samples from 2,478 participants and split them into different participant-independent sets for model development and validation. Among these, we controlled for potential confounding factors (such as demographics and language). The unbiased model takes features extracted from breathing, cough and voice signals as predictors and yields an AUC-ROC of 0.71 (95% CI: 0.65-0.77). We further explore different unbalanced distributions to show how biases and participant splits affect performance. Finally, we discuss how the realistic model presented could be integrated into clinical practice to realize continuous, ubiquitous, sustainable and affordable testing at population scale.

Since its outbreak in early December 2019, over 169 million cases of the novel coronavirus disease have been reported, including 3.5 million deaths. Researchers and scientists have made considerable strides in developing treatments and vaccines for COVID-19, and effective and easily accessible tests have been key to tracing infected people quickly. Currently, the most commonly used and first-line diagnostic tool for COVID-19 is the reverse transcription polymerase chain reaction (RT-PCR) assay, which detects the presence of viral ribonucleic acid (RNA) in swab samples [1, 2]. RT-PCR tests are highly sensitive in laboratory testing (over 95% diagnostic sensitivity and specificity); however, they have been found to perform differently across commercial kits, with sensitivity ranging from 75% to 100% and, in the worst case, reaching as low as 38% [3, 4, 5]. Moreover, the sample analysis process is involved, time-consuming, and limited to approved laboratories with highly trained staff, leading to limited testing capacity and failing to meet the rapid increase in demand. It is crucial that the pandemic response overcomes these challenges and enables timely testing at a massive scale. This requires fast, affordable, sustainable and effective testing methods, which can be repeated over time by individuals to track progression. This would help contain the current spread but also suppress resurgence and minimise health risks. Within this context, in the past year researchers have developed and published multiple models for COVID-19 prediction using audio [6, 7, 8, 9, 10, 11].
In particular, advances in machine learning have demonstrated the potential of automated auscultation of respiratory sounds and brought about new possibilities for fully automated COVID-19 screening [12, 13, 14, 15, 16, 17, 18, 19]. For instance, a systematic review by Wynants et al. reports that the AUC-ROC (Area Under the Receiver Operating Characteristic curve) performance of over 75 existing COVID-19 prediction models [20] lies in the range of 0.70 to 0.99. There is, however, a lack of studies exploring the biases and model evaluation processes which affect these performance results, potentially inflating them unrealistically. Such issues include:

• Potential underlying data biases or study limitations that are not reported sufficiently, where models were developed and evaluated with limited data which might not be representative of the target population (e.g., 19 subjects in [19], 51 subjects in [21], and 88 subjects in [15]).
• Risk of model overfitting, especially when deploying complex modelling strategies (e.g., a 100% accurate diagnosis of asymptomatic COVID-19 individuals was reported in [14]).
• Methodological flaws (e.g., using the same users during model development and validation [22]) which would be unrealistic in a practical clinical setting, resulting in an artificial performance boost.
• Lack of systematic comparison with other respiratory diseases such as asthma and bronchitis, with models that only distinguish COVID-19 from healthy controls [23].

Due to these issues, many researchers have raised concerns about the feasibility and effectiveness of such models if deployed in real settings [20, 24, 25]. In this work, we investigate the limits of audio-based COVID-19 testing with the aim of creating the foundation of realistically applicable audio tools. The aim of this study is twofold: first, based on a large crowdsourced dataset, to investigate the performance of an audio-based testing method when working with, to the best of our knowledge, unbiased data and a methodological design based on realistic assumptions (e.g., an independent user split); second, to explore the impact of biases and pipeline design on performance. For this purpose, we first gathered crowdsourced respiratory sound data from the general population via smartphones. We carefully prepared data for model development and validation by selecting representative audio samples from self-declared COVID-19 positive or negative participants. Subsequently, we developed a deep learning model on a portion of the data and then validated its predictive performance on an independent population. In particular, we adhered to the TRIPOD reporting guideline [26], aiming to report in a complete, transparent and usable manner. Our discussion explores these biases and how machine learning model hyperparameters could be tuned depending on the use of the tool (e.g., on symptomatic or asymptomatic populations) and public health needs.

Dataset preparation and statistics. For data gathering purposes, an app was developed and released in April 2020 to crowdsource participants' demographics, medical history, symptoms, COVID-19 test results and audio recordings: three voluntary cough sounds, three to five breathing sounds, and three speech recordings in which the user was asked to read a specific sentence. The self-reported COVID-19 test results included receiving a positive/negative test result, or not having been tested before the recording. More details can be found in Methods in the Appendix.
Our app is a multi-language tool, but in this study we focus only on audio samples from English-speaking participants (77.7% of the overall participants) to avoid language-related bias. Audio quality checks were conducted to filter out incomplete or noisy samples. Finally, 2,478 participants (514 positive and 1,964 negative) with 5,240 samples were included in the experiments, as shown in Fig. 1a (more detailed data selection criteria can be found in Methods). Demographic statistics for the experimental data are presented in Fig. 1b-d: 56% of participants in the selected data were male, the majority were aged 20-49, and half had never smoked. In addition, as shown in Fig. 1e, 84% of the participants who tested positive reported symptoms like fever or cough, while the others did not report any symptoms at recording time. 51% of the negative participants reported no symptoms, while 49% had symptoms such as dry/wet cough, fever, dizziness, etc.

From the 2,478 English-speaking participants, we prepared a training and validation set consisting of 800 participants with balanced COVID-19 status and other demographics to optimise the parameters of our deep learning model, as labelled by the yellow box in Fig. 1a; the rest of the data were used for evaluation, namely the testing set pool (green box in Fig. 1a). To inspect the performance in different realistic deployment scenarios, we first held out a representative testing set with balanced and demographically controlled participants, containing 100 positive and 100 negative participants. Furthermore, we randomly selected positive and negative participants from the testing pool to form new groups with various prevalence levels, medical histories and smoking statuses, to validate our model holistically. Apart from the controlled training and testing data, to simulate the impact of unrealistic experimental settings and bias, we also prepared training and testing sets with improper data splitting and with various biases introduced into the data. Details can be found in Methods.
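To make the participant-independent protocol concrete, below is a minimal sketch of one way such a split could be implemented. It uses scikit-learn's GroupShuffleSplit purely as an illustration (the study's actual selection additionally balances COVID-19 status and demographics, as described in Methods), and the DataFrame columns are hypothetical.

```python
# Illustrative sketch (not the authors' exact selection code): participant-level
# splitting, so that no participant contributes samples to both the development
# and the evaluation partitions. Column names are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

samples = pd.DataFrame({
    "participant_id": ["p1", "p1", "p2", "p3", "p3", "p4"],
    "label":          [1,    1,    0,    1,    1,    0],   # 1 = COVID-19 positive
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, test_idx = next(splitter.split(samples, samples["label"],
                                        groups=samples["participant_id"]))

dev, test = samples.iloc[dev_idx], samples.iloc[test_idx]
# Sanity check: the two partitions share no participants.
assert set(dev["participant_id"]).isdisjoint(set(test["participant_id"]))
```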
Accuracy is reported as ROC-AUC (Area Under the Receiver Operating Characteristic curve), sensitivity and specificity.

COVID-19 detection performance. On the demographic-representative testing set of 200 participants (see the age and gender distributions in Appendix Table 1), our deep learning model with three sound types yielded a ROC-AUC of 0.71 (0.65-0.77) (Fig. 2a), with a sensitivity of 0.65 (0.58-0.72) and a specificity of 0.69 (0.62-0.76) (Fig. 2b). The combination of the three sound types outperformed any single modality: a ROC-AUC of 0.66 (0.60-0.71) was achieved on cough, 0.62 (0.56-0.68) on breathing, and 0.61 (0.55-0.67) on voice. Moreover, breathing yielded the highest sensitivity of 0.64 (0.56-0.71), while cough showed the highest specificity of 0.66 (0.58-0.73). This indicates that all modalities are informative and that their combination leads to the best performance. We further tested the performance of the model on different demographic subgroups within this testing set (Fig. 2c), which shows similar results across different age and gender distributions: ROC-AUCs were all above 0.65, and sensitivity and specificity were similar for each group. Accuracy on the over-60 subgroup is slightly higher, but we suspect that the increased performance might be a result of the limited number of participants in this group.

We also inspected how symptoms impact the model performance by dividing the 200 participants in the testing set into asymptomatic and symptomatic subgroups. As presented in Fig. 2c, for both subgroups our model yielded ROC-AUCs above 0.66. Yet, from the comparison we can also observe that our model performs better at distinguishing asymptomatic negative participants (specificity = 0.85 (0.77-0.92)) and symptomatic positive participants (sensitivity = 0.67 (0.59-0.74)). For the more challenging cases, i.e. predicting symptomatic negative and asymptomatic positive cases, we achieved a lower accuracy: a sensitivity of 0.50 (0.25-0.76) for asymptomatic and a specificity of 0.56 (0.45-0.66) for symptomatic participants. A potential explanation could be that asymptomatic positive participants might not manifest changes in audio characteristics, and thus are harder to detect. Further discussion of the real implications and applications can be found in Discussion.

Model performance at varied prevalence rates. According to a recent statistical study, the prevalence rate of COVID-19 ranges from 0.12% to 33% worldwide [27]. Therefore, in addition to testing our model in a balanced setting (50% prevalence level, Fig. 2), we evaluated its performance in various prevalence scenarios. To simulate this, we re-sampled participants from the testing pool (Fig. 1a) and lowered the proportion of COVID-19 positives to 5%, 10% and 20% (Fig. 3a). The performance does not degrade compared to that at 50% prevalence (Fig. 2b): ROC-AUCs of 0.71 (95% CI 0.66-0.75), 0.69 (0.65-0.74), and 0.69 (0.65-0.74) were achieved at the 5%, 10%, and 20% prevalence levels, respectively. This is a promising result, suggesting the potential of AI-enabled COVID-19 screening in the real world.

Model performance across health and smoking status. One of the most important concerns to address is whether these audio models for COVID-19 testing might be confusing COVID-19 with other illnesses or respiratory pathologies. To investigate this further, we split the participants of the testing pool into several non-overlapping groups. Results are presented in Fig. 3. The first controlled criterion is medical history. We selected participants who reported having asthma or HBP (high blood pressure), and those who declared no medical history. We compared the model performance and found that all metrics reach comparable levels of accuracy on average: the specificity for the asthma group was 0.62 (0.55-0.68), for the HBP group 0.69 (0.62-0.76), for the no-medical-history group 0.65 (0.62-0.76) (Fig. 3a), and for a mix of participants 0.69 (0.62-0.76) (Fig. 2b). A Kruskal-Wallis H test [28] on the three negative groups' predicted probabilities yielded a p-value of 0.62 (>0.05), indicating that the predictions come from the same distribution. This supports the assumption that medical history does not confound our model. It is worth noting that the reduced sensitivity for the asthma and HBP groups might be caused by the very limited number of testing samples, leading to relatively large performance fluctuations. The second controlled criterion is the reported smoking status. The variance of the performance across groups was marginal (Fig. 3a): specificity for those who never smoked was 0.66 (0.63-0.68), for those who had quit smoking 0.67 (0.62-0.71), and for current smokers 0.63 (0.58-0.68). Similar to medical history, the predicted probabilities for these three groups are presented in Fig. 3b, with a p-value of 0.51 (>0.05) from the Kruskal-Wallis H test. Sensitivity for smokers was slightly lower, at 0.47 (0.31-0.66), which might be explained by the fact that five of the 22 COVID-19-positive smokers were asymptomatic (23% in this group against 16% in Fig. 1e). As our model is better at predicting symptomatic COVID-19 correctly, this explains the slight drop in the overall sensitivity for this group.
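The statistical check used in these group comparisons is straightforward to reproduce; the sketch below runs a Kruskal-Wallis H test on placeholder probability arrays rather than the study's actual predictions.

```python
# Minimal sketch of the group comparison described above: a Kruskal-Wallis H test
# on the predicted probabilities of three negative subgroups. The probability
# arrays are placeholders, not the study's data.
import numpy as np
from scipy.stats import kruskal

probs_asthma  = np.array([0.31, 0.42, 0.28, 0.55, 0.37])
probs_hbp     = np.array([0.29, 0.44, 0.35, 0.50, 0.33])
probs_healthy = np.array([0.30, 0.40, 0.27, 0.52, 0.36])

statistic, p_value = kruskal(probs_asthma, probs_hbp, probs_healthy)
# A p-value above 0.05 would be consistent with the three sets of predictions
# coming from the same distribution.
print(f"H = {statistic:.3f}, p = {p_value:.3f}")
```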
Model performance with unrealistic evaluation and biases. To show how bias and unrealistic experimental design impact model performance, we re-selected data and purposefully introduced various biases that previous works might have had, generating another four training and testing sets that attempt to artificially inflate the results. The artificially created biases are as follows: 1) using sample-level random splits (random-splits for short) instead of participant-independent splits (user-splits for short) for training and testing; 2) introducing gender bias into the data by selecting 85% of the negative participants to be female; 3) introducing age bias into the negative group, with two biased variants: selecting only negative participants aged over 39 (Group 1) or only those aged under 39 (Group 2); and 4) replacing some English-speaking participants with Italian-speaking participants and making the proportion of Italian speakers relatively higher in the positive group. Details of the data used for comparison can be found in Methods and Appendix Fig. 4-5. We trained the model without changing the network structure, and unless otherwise mentioned, the results are based on the combination of the three sound types (breathing, cough and voice).

Fig. 4 presents the key findings (a detailed comparison can be found in Appendix Tables 2-5). From Fig. 4a, random-splits yielded a higher accuracy than user-splits, with the performance gains coming from the overlapping participants whose data had already been seen during training: a sensitivity of 0.84 (0.75-0.92) and a specificity of 0.78 (0.68-0.87), since personal sound traits are easy for the model to memorise. However, this is less realistic, as in real-world scenarios the model should ideally generalise to a new, unseen population. This may also validate our hypothesis that some previous works reported optimistic performance by using this random-split protocol. Demographic bias, in either age or gender, also appears to lead to biased results. The overall ROC-AUC might be boosted, as shown in Appendix Tables 3-4, but a great difference between sensitivity and specificity can be observed in some subgroups. For instance, a sensitivity of 0.23 (0.14-0.33) but a specificity of 0.93 (0.90-0.97) were obtained on the biased (female) group, as shown in Fig. 4b, because positive females were under-represented in the training set and this model tends to treat female participants as negative. Similar results can be observed in the age-biased groups (see Fig. 4c). In the group where the negative training participants were aged over 39, the model yielded a higher specificity than sensitivity on the aged-over-60 participants in the testing set. Conversely, the model trained on data biased towards aged-under-39 negative participants yielded a higher specificity on the younger group (see Fig. 4c). As for the language bias, where Italian-speaking positive participants were over-represented in training, sensitivity drops as low as 0.25 (0.15-0.36) in the English subgroup and specificity is close to 0 in the Italian subgroup (Fig. 4d); this bias particularly impacted the voice modality (see Fig. 4f) and slightly influenced cough (see Fig. 4e). In contrast, our controlled model (Fig. 4) shows consistent sensitivity and specificity across all subgroups, presenting a realistic value for model application.
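As a hedged illustration of how one of these deliberately biased cohorts could be constructed (this is not the study's actual data pipeline), the sketch below samples a negative group that is 85% female from a hypothetical participant table.

```python
# Hypothetical illustration of constructing a gender-biased negative cohort,
# roughly mirroring the 85% female negative group described above.
# Column names and the DataFrame are placeholders, not the study's schema.
import pandas as pd

def sample_biased_negatives(participants: pd.DataFrame,
                            n_total: int = 500,
                            female_fraction: float = 0.85,
                            seed: int = 0) -> pd.DataFrame:
    negatives = participants[participants["label"] == 0]
    n_female = int(n_total * female_fraction)
    females = negatives[negatives["gender"] == "female"].sample(n_female, random_state=seed)
    males = negatives[negatives["gender"] == "male"].sample(n_total - n_female, random_state=seed)
    return pd.concat([females, males])
```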
Comparison with other studies. For digital technologies to penetrate clinical practice, it is pivotal that studies become more explainable and that the models are resilient to the data noise, variability and bias present in real data.

Demographic bias: In our study design and data selection we concerned ourselves with potential confounding factors and tried to rule out selection bias, as these may lead to unrealistically inflated results. Specifically, we split positive and negative samples into three partitions for model training, optimisation, and testing, while adjusting the data partitioning and maintaining similar distributions of age and gender across the different data splits to control for potential confounding variables (see Appendix Table 1). This is different from some prior studies in the literature, in which the selection of the data is unclear and a cohort diagram is lacking [14, 15]. More importantly, we further performed an experimental analysis to explore the effect of demographic bias on the model.

Language bias: With the potential of COVID-19 digital testing to be applicable worldwide, it is important to explore the effect of language bias on different audio-based data (such as cough, breathing and voice). To disentangle the possible confounding effect of language, we restricted our analysis to English-speaking data, which gives the most realistic perspective of the capabilities of audio-based diagnostics for COVID-19. In addition, similarly to demographic bias, we carried out experiments to test the effect of language bias when the model was trained with unbalanced multi-language data.

User independence: Moreover, in some prior studies, cross-validation was applied for performance validation: this is generally done when data are scarce and user samples become very important. Data from the same participant might then be used for both model training and validation [22, 29]: while this might be considered acceptable when testing theoretical machine learning techniques, if a user appears in both the training and testing sets, such models typically do not generalise well, making them poorly suited to a realistic setting. With the luxury of a large dataset, we could choose to perform user-independent validation, where participants' data used for model validation are not included and remain unseen during model training. We are confident that this is a more realistic approach, which could inform future in-the-wild audio-based screening.

Limitations of our study. Several limitations of our work should be acknowledged. COVID-19 is known to often manifest as respiratory symptoms, which are also common to other relatively widespread diseases, as well as among the smoking population. Therefore, we conducted an in-depth analysis to establish whether our model could be influenced by other respiratory pathology. Specifically, we evaluated the ability of our model to correctly identify a COVID-19 infection in participants who also indicated asthma or high blood pressure in their medical history (as these are reasonably large cohorts in our data collection), compared with a cohort who indicated no other medical conditions. We also tested the model on participants with a variety of reported smoking statuses (e.g., few to many cigarettes per day).
However, we note that we have not had the opportunity to test against a wider variety of specific respiratory infections, such as influenza or rhinovirus, since they were not prevalent when our data were collected and it is difficult to obtain a reliable ground truth for them. It is possible, however, that the participants who reported a cough and had a negative COVID-19 test result were indeed suffering from some respiratory condition at the time of the sample collection. Also, as our models did not fully control for all potential confounding factors, such as race, and included far fewer elderly participants, future studies should investigate these biases. In addition, although in the present study language was well controlled (all English), it is still unclear whether and how different accents would affect the model, and we lack the information needed to study this. Our data are crowdsourced: we rely on the trustworthiness of the responses from individuals, especially with respect to their COVID-19 testing status. The scale of the data helps amortise the noise generated by the crowdsourcing process while, at the same time, showing the robustness of the approach to uncontrolled conditions. Our data, while aiming to match the cohort to the target population as closely as possible, lack clinical validation. Thus, additional external validation should be performed to assess the generalisation of the prediction model before it is applied in clinical practice.

Potential implications for practice. The model's trainable parameters are optimised based on a default threshold of 0.5 on the final softmax output layer (see Fig. 5): this value is used to classify predictions as COVID-19 positive or negative. While our model could be used on the general population for COVID-19 digital testing, we explore different application contexts where this threshold, as a hyper-parameter, can be adjusted for a more optimal outcome: we report the ROC curve and the sensitivity/specificity under different decision thresholds for the asymptomatic and symptomatic groups (participants who did not and did declare symptoms, respectively) in Appendix Fig. 1 and Fig. 2. Specifically, when applying the model with the aim of screening the asymptomatic population for risk of exposure, a lower threshold can be used (Appendix Fig. 1b) to guarantee a higher Youden index (defined as sensitivity + specificity − 1) and a higher sensitivity compared to the threshold of 0.5, so that potential COVID-19 infections are covered as exhaustively as possible and false positives can easily be filtered out by further clinical testing. If instead the targeted group is symptomatic (Appendix Fig. 2b), to limit false positives, a higher specificity can be achieved by slightly increasing the threshold to maximise the Youden index. In this study, as we have limited samples for validation in our dataset, we only demonstrate the performance on the test set under different threshold settings as a proof of concept. For clinical use, further investigation is required into how to adjust and calibrate this threshold to meet different testing criteria.
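A minimal sketch of this kind of threshold tuning, using placeholder labels and scores rather than the study's data, is shown below: the decision threshold is chosen to maximise the Youden index on held-out predictions.

```python
# Sketch of the threshold adjustment discussed above: pick the decision threshold
# that maximises the Youden index (sensitivity + specificity - 1).
# y_true / y_score are placeholders for labels and model probabilities.
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.2, 0.4, 0.35, 0.8, 0.1, 0.65, 0.45, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
youden = tpr - fpr                      # equals sensitivity + specificity - 1
best = np.argmax(youden)
print(f"threshold={thresholds[best]:.2f}, "
      f"sensitivity={tpr[best]:.2f}, specificity={1 - fpr[best]:.2f}")
```

Lowering the chosen threshold trades specificity for sensitivity, which matches the screening-oriented use case described above; raising it does the opposite.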
Finally, audio-based predictive models could be combined with signatures from other biological signals, such as heart rate [30], as well as self-reported symptoms [6], for improved accuracy; however, this would require crowdsourcing additional data from the participants.

Conclusion. In this work, we have developed and validated a deep learning method for detecting COVID-19 solely by analysing human sounds collected via mobile or web applications. In particular, the crowdsourced data have been collected and processed to make the results reliable, by controlling potential confounding factors between COVID-19 positive and negative cases. We analysed the presented model's predictive performance for detecting COVID-19 infection, which may bring insights into the adoption of digital health technologies in the COVID-19 era. Moreover, we analysed the risks of modelling with various biased data, which led to overestimated performance. This demonstrates that biased data or modelling should be avoided in order to rigorously validate digital testing tools for clinical efficacy.

Data collection and preparation. Our data were crowdsourced via a data gathering framework released in April 2020, in multiple languages and for multiple platforms (a webpage, an Android app, and an iOS app). The collected data consist of participants' age, gender, medical history, current symptoms, and three types of audio recordings: three voluntary cough sounds, three to five inhalation-exhalation sounds, and the participant reading a standard sentence from the screen three times. Participants were asked whether they had been tested for COVID-19, and an optional geo-location sample was collected. The mobile apps also prompted the participant to input symptoms and sounds every two days. No identifiable information was collected. As of 26th April 2021, a total of 36,364 participants had contributed 75,201 samples to our project. We used samples with self-reported COVID-19 test results as ground truth for the experiments; hence, 61,615 samples without reported test results were excluded. A further 110 samples, whose COVID-19 test results had been obtained more than two weeks before the recordings were made, were also discarded because of the delay between testing and recording. Our data were sourced in multiple languages (English, Italian, Spanish, Portuguese, etc.) and the number of samples in each language varied. To avoid language bias, for the main results of this paper we used English audio samples only, with 8,102 non-English samples excluded. Lastly, we manually checked the quality of each recording, deleting in total another 134 samples that were either incomplete (recordings shorter than 2 seconds), silent, or distorted with poor audio quality. As a result, 5,240 samples from 2,478 participants were used for the majority of the experiments.
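The exclusion steps above can be expressed as a simple filtering pass; the sketch below is written against a hypothetical metadata table, so the column names are illustrative rather than the project's actual schema.

```python
# Hedged sketch of the sample-exclusion criteria described above, applied to a
# hypothetical per-sample metadata DataFrame (column names are illustrative).
import pandas as pd

def select_samples(meta: pd.DataFrame) -> pd.DataFrame:
    # Keep only samples with a self-reported COVID-19 test result as ground truth.
    meta = meta[meta["test_result"].isin(["positive", "negative"])]
    # Discard samples recorded more than 14 days after the reported test.
    meta = meta[meta["days_since_test"] <= 14]
    # Restrict to English-speaking participants to avoid language bias.
    meta = meta[meta["language"] == "en"]
    # Drop recordings that failed the manual quality check
    # (shorter than 2 s, silent, or distorted).
    meta = meta[meta["quality_ok"]]
    return meta
```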
Data used and experiment design for bias evaluation. In addition to the above-mentioned data, we also prepared four datasets with known biases to evaluate the impact of confounding factors on audio-based COVID-19 testing, by selecting from all eligible samples with COVID-19 test results. More specifically, the following strategies were used to generate the data:

• Splitting: Our balanced training and testing sets contained 1,000 participants (800 for training & validation, 200 for testing) and 1,486 samples (1,162 for training & validation and 329 for testing), i.e. about 1.5 samples per participant on average. Instead of splitting training and testing by participant, for this comparison group we randomly shuffled all samples and split them into training and testing according to the original ratio (1162:329).
• Gender bias: To simulate the scenario where the COVID-19 positive rate differs significantly between gender groups, which raises the concern that the model is detecting gender instead of COVID-19, we manually selected 500 positive and 500 negative participants from the total of 2,478 participants (blue box in Fig. 1) with a biased gender distribution. Specifically, 56% of the positive group are male and the remaining 44% are female, while in the negative group females account for 85% and males for 15%. Age demographics are kept balanced and the total number of participants is unchanged, as shown in Appendix Fig. 4.
• Age bias: Using the same approach, for the negative participants we purposefully selected 1) only those aged over 39, and 2) only those aged under 39, to simulate scenarios in which participants were not drawn from the whole population. The revised distributions can be found in Appendix Fig. 3.
• Language bias: Rather than using all English speakers, to investigate the effect of language we replaced some English-speaking participants with Italian-speaking participants. Specifically, we used more positively-tested Italian speakers than negatively-tested ones. As a result, the positive group mainly consists of Italian speakers, introducing the bias that participants who speak Italian are more likely to be COVID-19 positive. The detailed percentages can be found in Appendix Fig. 5.

Data processing. For data pre-processing, all the collected audio recordings were resampled to 16 kHz and converted to mono. These recordings were then cropped by removing the silence periods at the beginning and the end of the recording, after which each sample was normalised.

Model architecture. We implemented a Convolutional Neural Network (CNN) based model for COVID-19 classification, as shown in Appendix Fig. 5. The network receives one sample with three audio recordings as input: breathing, cough, and voice from one participant. A spectrogram is computed for each of the recordings and fed into a VGGish subnetwork. VGGish is a state-of-the-art pre-trained CNN, through which we leverage and transfer the knowledge learnt from an external, massive general-audio dataset [31]. Each VGGish block transforms the input spectrogram into a latent feature vector; the features from the three sound types are then concatenated and finally fed into a binary classifier. This model design allows the three sound types to be analysed jointly. Specifically, the model is composed of three parts (Fig. 5): (1) Input layers: each audio recording is first chunked into non-overlapping segments of 0.96 seconds. A log-mel spectrogram is computed for each segment with a window size of 25 ms, a window hop of 10 ms, and a periodic Hanning window. 64 mel bins are adopted, covering the frequency range from 125 Hz to 7,500 Hz. A small offset is used to convert the mel spectrogram into log scale, resulting in a log-mel spectrogram of size 64 × 96 per chunk. (2) Feature extraction layers: the main component of the model is VGGish, a CNN-based network with cascaded convolutional layers, max-pooling, and fully connected layers. This network transforms each input spectrogram frame into a 128-dimensional feature vector. An average pooling layer is then employed to aggregate all frames within one audio recording into one fixed-length latent feature vector. The sizes of the CNN kernels and the number of hidden states of the fully connected layers are kept consistent with the original work [31]. (3) Prediction layers: the resulting latent feature vectors for the three modalities are concatenated and fed into the binary classifier, which consists of two dense layers (with 96 and 2 hidden states, respectively) with non-linear ReLU and softmax activation functions, respectively. The output of the model is a continuous score in the range 0 to 1 (i.e., the probability of the participant being infected with COVID-19), which can then be converted into a binary prediction (0: negative, 1: positive) with a threshold of 0.5.
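To make the input pipeline concrete, the following is an illustrative sketch of how such 64 × 96 log-mel examples could be computed. It uses librosa as an assumption for the feature extraction (the original pipeline follows the VGGish reference implementation, whose exact padding and offset details may differ slightly).

```python
# Illustrative sketch of the input features described above: 16 kHz mono audio,
# 25 ms / 10 ms log-mel spectrogram with 64 mel bins (125-7500 Hz), framed into
# non-overlapping 96-frame (~0.96 s) examples.
import numpy as np
import librosa

def logmel_examples(path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr, mono=True)
    y, _ = librosa.effects.trim(y)                     # strip leading/trailing silence
    y = y / (np.max(np.abs(y)) + 1e-8)                 # peak normalisation
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        window="hann", n_mels=64, fmin=125, fmax=7500)
    logmel = np.log(mel + 0.01)                        # small offset before the log
    frames = logmel.T                                  # (num_frames, 64)
    n = frames.shape[0] // 96
    return frames[: n * 96].reshape(n, 96, 64)         # (num_examples, 96, 64)

# Each 96 x 64 example would then be passed through the VGGish feature extractor.
```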
To improve the robustness and generalisation ability of our deep learning model, the following techniques were employed:

• Transfer learning: Our collected dataset is relatively small compared to the number of parameters in the proposed deep neural network. In light of this, we harness transfer learning to improve the representation ability: the VGGish layers are initialised from a pre-trained model designed for audio classification tasks.
• Differential learning rates: Both the VGGish and dense layers are jointly updated using our audio data. However, we used a small learning rate for the parameter updates of the VGGish part of the network, and a learning rate 10 times larger for the dense layers. Specifically, the learning rate was set to 1e-6 for VGGish and 1e-5 for the final dense layers.
• Avoiding over-fitting: We utilised learning rate decay (factor = 0.9) and L2 regularisation (penalty coefficient = 1e-6).
• Two-phase training: To use the data more efficiently, we first trained the model on the training set and identified the best hyper-parameters based on the averaged sensitivity and specificity at the 15th epoch on the validation set, and then we merged the training and validation sets to fine-tune the model until the training performance remained unchanged.

The parameters of the deep learning model were updated by iterative gradient back-propagation with a binary cross-entropy loss on the training set. The training batch size was 1. The whole framework was implemented in Python 3.6 and TensorFlow 1.15. Model training was performed on an Nvidia Quadro RTX 8000 GPU.

Experiment design. We performed participant-independent splitting, which means that samples from a participant included in the training set for parameter estimation were not used for model evaluation. We randomly selected 80% of all positive participants for model learning and sampled the same number of negative participants while maintaining similar demographic distributions in the two groups (see Appendix Table 1). This minimises the bias introduced by data collected in a crowdsourced manner. 10% of participants were held out for hyper-parameter searching, such as the size of the dense layers. Once a final trained model was obtained, its performance was evaluated on data with different demographics, prevalence levels, and health conditions. Measures of performance include the Area Under the Receiver Operating Characteristic curve (AUC-ROC), sensitivity, and specificity. For all metrics, we calculated 95% confidence intervals (95% CIs) using bootstrap resampling with 1,000 bootstrap samples drawn with replacement [32].
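A minimal sketch of such a bootstrap confidence interval, using placeholder labels and model outputs rather than the study's results, is shown below.

```python
# Sketch of the bootstrap confidence intervals described above: resample the test
# set with replacement 1,000 times and take the 2.5th/97.5th percentiles of the
# metric. y_true / y_score are placeholders for labels and model outputs.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # sample with replacement
        if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [2.5, 97.5])
```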
Data availability. The data are sensitive, as voice recordings can be de-anonymised. Anonymised data will be made available for academic research upon request directed to the corresponding author. Institutions will need to sign a data transfer agreement with the University of Cambridge to obtain the data, after which a copy of the data will be transferred to the requesting institution; we already have the data transfer agreement in place.

Ethics. The study was approved by the ethics committee of the Department of Computer Science at the University of Cambridge, with ID #722. Our app displays a consent screen, where we ask the user's permission to participate in the study by using the app. The legal basis for processing any personal data collected for this work is the performance of a task in the public interest, namely academic research. More information is available at https://covid-19-sounds.org/en/privacy.html.

Code availability. Python code and the parameters used for training the neural networks will be made available on GitHub for reproducibility purposes.

Acknowledgements. This work was supported by ERC Project 833296 (EAR) and the UK Cystic Fibrosis Trust. We thank everyone who volunteered their data.

Author contributions. AF, CM, PC designed the study. AH, AG, CB, DS, JC designed and implemented the app to collect the sample data. AG designed and implemented the server infrastructure. EB, JH, TX selected the data for analysis. TX developed the neural network models, conducted the experiments and generated all tables and figures in the manuscript. JH, TX performed the statistical analysis. CB, DS, EB, JH, TX, TD wrote the first draft of the manuscript. All authors vouch for the data, analyses, and interpretations. All authors critically reviewed, contributed to the preparation of the manuscript, and approved the final version.

Competing interests. All authors declare no competing interests.

References.
[1] Virology, transmission, and pathogenesis of SARS-CoV-2
[2] Analytical sensitivity and efficiency comparisons of SARS-CoV-2 RT-qPCR primer-probe sets
[3] Evaluation of seven commercial RT-PCR kits for COVID-19 testing in pooled clinical specimens
[4] Positive rate of RT-PCR detection of SARS-CoV-2 infection in 4880 cases from one hospital in
[5] Comparison of commercial RT-PCR diagnostic kits for COVID-19
[6] Real-time tracking of self-reported symptoms to predict potential COVID-19
[7] Artificial intelligence-enabled rapid diagnosis of patients with COVID-19
[8] Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets
[9] CovidCTNet: An open-source deep learning approach to identify COVID-19 using CT image
[10] AI-based analysis of CT images for rapid triage of COVID-19 patients
[11] Deep COVID DeteCT: an international experience on COVID-19 lung detection and prognosis using chest CT
[12] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app
[13] Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data
[14] COVID-19 artificial intelligence diagnosis using only cough recordings
[15] SARS-CoV-2 detection from voice
[16] Exploring automatic COVID-19 diagnosis via voice and symptoms from crowdsourced data
[17] A generic deep learning based cough analysis system from clinically validated samples for point-of-need COVID-19 test and severity levels
[18] End-to-end convolutional neural network enables COVID-19 detection from breath and cough audio: a pilot study
[19] Detection of COVID-19 through the analysis of vocal fold oscillations
[20] Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal
[21] An early study on intelligent analysis of speech under COVID-19: severity, sleep quality, fatigue, and anxiety
[22] Exploring self-supervised representation ensembles for COVID-19 cough classification
[23] Hi Sigma, do I have the coronavirus?: Call for a new artificial intelligence approach to support health care professionals dealing with the COVID-19 pandemic
[24] Is my cough COVID-19? The Lancet
[25] Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
[26] Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): explanation and elaboration
[27] SARS-CoV-2 infections in 171 countries and over time. medRxiv
[28] Kruskal-Wallis test. The Corsini Encyclopedia of Psychology
[29] Machine learning based COVID-19 detection from smartphone recordings: cough, breath and speech
[30] Wearable sensor data and self-reported symptoms for COVID-19 detection
[31] CNN architectures for large-scale audio classification
[32] Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians