key: cord-0432354-41l1fo1b
authors: Han, Jing; Qian, Kun; Song, Meishu; Yang, Zijiang; Ren, Zhao; Liu, Shuo; Liu, Juan; Zheng, Huaiyuan; Ji, Wei; Koike, Tomoya; Li, Xiao; Zhang, Zixing; Yamamoto, Yoshiharu; Schuller, Bjorn W.
title: An Early Study on Intelligent Analysis of Speech under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety
date: 2020-04-30
journal: nan
DOI: nan
sha: c4fcd41b1c5dbc6c799ace09080118338ec026b2
doc_id: 432354
cord_uid: 41l1fo1b

The COVID-19 outbreak was announced as a global pandemic by the World Health Organisation in March 2020 and has affected a growing number of people in the past few weeks. In this context, advanced artificial intelligence techniques are brought to the fore to help fight against, and reduce the impact of, this global health crisis. In this study, we focus on developing potential use-cases of intelligent speech analysis for patients diagnosed with COVID-19. In particular, by analysing speech recordings from these patients, we construct audio-only-based models to automatically categorise the health state of patients from four aspects, namely the severity of illness, sleep quality, fatigue, and anxiety. For this purpose, two established acoustic feature sets and support vector machines are utilised. Our experiments show that an average accuracy of .69 is obtained in estimating the severity of illness, which is derived from the number of days in hospitalisation. We hope that this study can foster an extremely fast, low-cost, and convenient way to automatically detect the COVID-19 disease.

The emergence and spread of the novel coronavirus and the related COVID-19 disease is deemed a major public health threat for almost all countries around the world. Moreover, explicitly or implicitly, the coronavirus pandemic has had an unprecedented impact on people across the world. To combat the COVID-19 pandemic and its consequences, clinicians, nurses, and other care providers are battling on the front line. Apart from that, scientists and researchers from a range of research domains are also stepping up in response to the challenges raised by this pandemic. For instance, several different kinds of drugs and vaccines are being developed and trialled to treat the virus or to protect against it [1, 2, 3, 4, 5], and, meanwhile, methods and technologies are being designed and investigated to accelerate diagnostic testing [6, 7]. In the data science community in particular, massive efforts have been, and are still being, made to mine information in a data-driven manner. For example, a number of works have been proposed to promote automatic screening by analysing chest CT images [8, 9, 10, 11]. In [8], for instance, the deep model COV-Net was developed to extract visual features to detect COVID-19. However, no research work has yet been done to explore sound-based COVID-19 assessment.

From the perspective of sound analysis, as coronavirus disease is a respiratory illness, abnormal breathing patterns of patients might intuitively be a potential indicator for diagnosis [12]. According to the latest clinical research, the severity of the COVID-19 disease can be categorised into three levels, namely mild, moderate, and severe illness [13]. For each level, various typical respiratory symptoms can be observed, from dry cough in mild illness, to shortness of breath in moderate illness, and further to severe dyspnea, respiratory distress, or tachypnea (respiratory frequency > 30 breaths/min) in severe illness [13].
Meanwhile, all these breathing disorders lead to abnormal variations in articulation. Consequently, it can be of great interest to use automatic speech and voice analysis, which is non-invasive and low-cost, to aid COVID-19 diagnosis. In addition, there could be many meaningful and powerful audio-based tools and applications which have so far been underestimated and hence underutilised. Very recently, scientists elaborated several potential use-cases in the fight against the spread of COVID-19 via intelligent speech and sound analysis [14]. Specifically, these envisioned directions are grouped into three categories, i.e., audio-based risk assessment, audio-based diagnosis, and audio-based monitoring, for example of spread, social distancing, treatment and recovery, and patient wellbeing [14]. Despite the importance of analysing voice or speech signals to battle this pandemic, no empirical research work has been done to date. To fill this gap, we present an early study on the intelligent analysis of speech under COVID-19. To the best of our knowledge, this is the first work in this direction. In particular, we take a data-driven approach to automatically detect the patients' symptom severity as well as their physical and mental states. It is our hope that this step can help develop a rapid, cheap, and easy way to diagnose the COVID-19 disease, and assist medical doctors.

3. Today is the twelfth day since I stayed in the hospital.
4. I wish I could rehabilitate and leave hospital soon.

Moreover, three self-report questions were answered by each patient, regarding his or her sleep quality, fatigue, and anxiety. Specifically, participants rated their sleep quality, fatigue, and anxiety by choosing from three different levels (i.e., low, mid, and high). Furthermore, regarding demographic information, another four characteristics of the patients were collected, namely age, gender, height, and weight. Note that the height and weight information of 13 patients was not provided. A statistical overview of the data can be seen in Table 1. Furthermore, the distribution of the self-reported sleep quality, fatigue, and anxiety, grouped by gender, is illustrated in Fig. 1.

Figure 1: Distribution of the 51 COVID-19 patients' self-reported questionnaires regarding their health states in three categories, namely sleep quality, fatigue, and anxiety. For each category, the patient was asked to select one of three degrees (i.e., low, mid, and high).

Once the COVID-19 audio data were collected, a series of data preprocessing steps was implemented. Specifically, we carried out the following four steps: data cleansing, hand-annotation of voice activities, speaker diarisation, and speech transcription. Details are provided below. Data Cleansing: as the recordings were collected in the wild, there were a few unsuccessful recordings in which the patient failed to provide any speech other than noisy background. In such cases, the recordings were discarded from further analysis. As a result, the recordings from one female patient were discarded, resulting in data from only 51 subjects for further processing. Voice Activity Detection: for each recording, the presence or absence of human voice was then detected manually in Audacity 3. This is because, for some recordings, there were silent periods in the first and/or the last few seconds. Note that we only removed the unvoiced parts at the beginning and the end of each recording, where no audible breathing took place.
Hence, only voiced segments (e.g., speech, breathing, and coughing) from the recordings were retained. Speaker Diarisation: among the remaining voiced segments, there is speech from individuals other than the targeted patient. For this reason, we manually checked and annotated the speaker identities for each voiced segment, indicating whether the voice was produced by the patient or by someone else. Speech Transcription: after annotating the speaker identities, the voiced segments from the targeted diagnosed patients were converted into text transcriptions. Note that, as the collection was in the wild, beyond the aforementioned five sentences, some spontaneous recordings with impromptu and unscripted content were also spoken by the patients.

After data preprocessing, we obtained a total of 378 segments. For this preliminary study, we focus only on the scripted segments from the patients, leading to 260 recordings for further analysis. The distribution of the five sentences is provided in Table 2. It can be seen that the distribution is imbalanced. The reason is that some patients recorded the same content more than once, while others failed to supply all five recordings. These 260 audio segments from the 51 COVID-19 infected patients were then converted to mono signals with a sampling rate of 16 kHz for further analysis.

In this section, we detail the experiments that were executed to verify the feasibility of audio-only-based COVID-19 patient state assessment. More specifically, we first describe the experimental setup, including the applied acoustic feature sets as well as the related evaluation strategies. Afterwards, we elaborate on the experimental performance for COVID-19 severity estimation, as well as the prediction performance for three self-reported COVID-19 patient status attributes, namely sleep quality, fatigue, and anxiety degrees. Last, we discuss the limitations of the current study and outline future work plans and directions.

Two established acoustic feature sets were considered in this study, namely the Computational Paralinguistics Challenge (ComParE) set and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS). Specifically, these feature sets were extracted with the openSMILE toolkit [15]. The ComParE feature set is a large-scale brute-force set utilised in the series of INTERSPEECH Computational Paralinguistics Challenges since 2013 [16, 17]. It contains 6 373 static features obtained by computing various statistical functionals over 65 low-level descriptor (LLD) contours [16]. These LLDs consist of spectral (relative spectra auditory bands 1-26, spectral energy, spectral slope, spectral sharpness, spectral centroid, etc.), cepstral (Mel frequency cepstral coefficients 1-14), prosodic (loudness, root mean square energy, zero-crossing rate, F0 via subharmonic summation, etc.), and voice quality features (probability of voicing, jitter, shimmer, and harmonics-to-noise ratio). For more details, the reader is referred to [16]. Different from the large-scale ComParE set, the other feature set applied in this work, eGeMAPS, is considerably smaller. It consists of only 88 features derived from 25 LLDs. In particular, these features were chosen for their capability to describe affective physiological changes in voice production. For more details about these features, please refer to [18]. In this work, we carried out four audio-based classification tasks.
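Before turning to the individual tasks, the following is a minimal sketch of how such segment-level functionals can be extracted in Python. It assumes the opensmile Python wrapper of the openSMILE toolkit and its eGeMAPSv02 variant of the feature set; the file name is a hypothetical placeholder, and the exact configuration used in this study may differ.

```python
import opensmile

# Segment-level functionals of the 88-dimensional eGeMAPS set; swapping in
# opensmile.FeatureSet.ComParE_2016 yields the 6 373-dimensional ComParE set.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# One row of functionals per recording, returned as a pandas DataFrame.
features = smile.process_file("patient_recording.wav")
print(features.shape)  # (1, 88) for eGeMAPS functionals
```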
First, we performed COVID-19 severity estimation based on the number of days of hospitalisation. The hypothesis is that a COVID-19 patient is generally very sick at the early stage of hospitalisation and then recovers step by step. As a consequence, the patients were approximately grouped into three categories, i.e., the high-severity stage for the first 25 days, the mid-severity stage between 25 and 50 days, and the low-severity stage after 50 days. Besides, three further classification tasks considered in this study are to predict the self-reported sleep quality, fatigue, and anxiety levels of COVID-19 patients, the potential of which has been highlighted in [14].

For these classification tasks, we implemented Support Vector Machines (SVMs) with a linear kernel function as the classifiers for all experiments, due to their widespread usage and appealing performance in intelligent speech analysis [19, 20]. Specifically, a series of complexity constants C was evaluated in {10⁻⁷, 10⁻⁶, ..., 10⁻¹, 10⁰}. Further, to deal with the imbalanced data during training, a class weighting strategy was employed to automatically adjust the C values in proportion to the importance of each class. The SVMs were implemented in Python based on the scikit-learn library. Moreover, for all experiments in this study, Leave-One-Subject-Out (LOSO) cross-validation was carried out to satisfy the speaker-independence evaluation constraint. In this context, all 260 instances were divided into 51 speaker-independent folds, with each fold containing only instances from one patient. Under the LOSO evaluation scheme, one of the 51 folds was used as the test set and the other folds were put together to form a training set to train an SVM model. This process was then repeated 51 times until every fold had been used as the test set. Note that, for each fold, an online standardisation was applied to the test set by using the means and variances of the respective training partition. Then, the average performance was computed over the predictions of all instances. In this work, we utilise three of the most frequently used measures, i.e., the Unweighted Average Recall (UAR), the overall accuracy (also known as Weighted Average Recall, WAR), and the F1 score (also known as F-score or F-measure), which is the harmonic mean of precision and recall.

In Table 3, we report the performance of the best SVM models for the two selected feature sets, respectively. In particular, the best model was chosen, based on UAR, from the SVMs trained with different C values. It can be seen that the large feature set, ComParE, performs slightly better than eGeMAPS for severity estimation, achieving a UAR of .68, an accuracy of .69, and an F1-score of .66. Moreover, we further inspected the audio recordings from patients with varied severity levels. An illustration is given in Fig. 2. In particular, three recordings were taken from three different patients, who were asked to say the same content. The first patient failed to produce the sentence due to his severe symptoms. The second sample is from a female patient. She successfully spoke the whole content following the given template; however, she had to pause several times to take a heavy breath before carrying on with the remaining content. In contrast, the third patient managed to produce the same recording more clearly and fluently. Considering audio-based sleep quality, fatigue, and anxiety estimation, we further trained SVM models for each task separately.
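Before turning to those results, the sketch below shows one way the evaluation protocol described above, a LOSO loop with per-fold standardisation and a class-weighted linear SVM, could be realised in scikit-learn. It is an illustrative sketch rather than the exact implementation used in this study, and the feature matrix, labels, and patient identifiers are dummy placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Dummy stand-ins for the real data: 260 segments, 88 functionals,
# three-class labels, and 51 patient identifiers.
rng = np.random.default_rng(0)
X = rng.normal(size=(260, 88))
y = rng.integers(0, 3, size=260)
groups = rng.integers(0, 51, size=260)

def loso_evaluate(X, y, groups, C):
    """LOSO evaluation of a class-weighted linear SVM for one value of C."""
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        # Online standardisation: statistics come from the training folds only.
        scaler = StandardScaler().fit(X[train_idx])
        clf = LinearSVC(C=C, class_weight="balanced", max_iter=100000)
        clf.fit(scaler.transform(X[train_idx]), y[train_idx])
        preds[test_idx] = clf.predict(scaler.transform(X[test_idx]))
    uar = recall_score(y, preds, average="macro")   # unweighted average recall
    war = accuracy_score(y, preds)                  # overall accuracy (WAR)
    f1 = f1_score(y, preds, average="macro")
    return uar, war, f1

# Select the complexity constant by the highest UAR, as described above.
results = {C: loso_evaluate(X, y, groups, C) for C in (10.0 ** p for p in range(-7, 1))}
best_C = max(results, key=lambda C: results[C][0])
```

Here, class_weight="balanced" plays the role of the class weighting strategy, scaling the penalty of each class inversely to its frequency in the training folds.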
Corresponding results are shown in Table 4, where the performance of the best models for each task and each feature set is provided. Similarly, the best performance was taken where the highest UAR was obtained, and the corresponding accuracy and F1-score are given.

Figure 2: Illustration of the spectrograms of recordings by COVID-19 diagnosed patients with high (top), mid (middle), and low (bottom) severity degrees. All were requested to speak the same content, i.e., the second sentence in the template (cf. Section 2).

When comparing the three tasks, the best performance is achieved for sleep quality classification, reaching up to .61 UAR. Then, for anxiety prediction, a UAR of .56 is attained. When it comes to fatigue prediction, the best UAR is only .46, which is, however, above chance level (.33 for a three-class classification). Further, when comparing the two selected feature sets, the compact eGeMAPS set consistently outperforms the large-scale ComParE feature set. On the one hand, these results reveal the effectiveness of the eGeMAPS set for audio-based sleep quality, fatigue, and anxiety detection. On the other hand, the inferior performance based on ComParE might be due to the low number of training samples.

In this preliminary study, experiments were carried out based on speech recordings from COVID-19 infected and hospitalised patients. The results have demonstrated the feasibility and effectiveness of audio-only-based COVID-19 analysis, specifically in estimating the severity level of the disease, and in predicting the health and wellbeing status of patients, including sleep quality, fatigue, and anxiety. Nonetheless, there are still many ways to extend the present study. First, due to time limitations, the collected data set is relatively small, and it lacks control-group data from both healthy subjects and patients with other respiratory diseases. These data collection efforts are still in progress to enable a more comprehensive analysis in the future. In addition, AI techniques can be considered to tackle the data scarcity issue, such as data augmentation via generative adversarial networks [21, 22, 23]. Given more data, the performance of our models is expected to further improve and become more robust. Second, only functional features computed over whole segments were investigated. However, abnormal respiratory symptoms might be instantaneous and occur only within a short period of time. In this context, analysing low-level features in successive frames with sequential modelling might bring further performance improvements. Moreover, in addition to conventional handcrafted features, deep representation learning algorithms might be explored to learn representative and salient data-driven features for COVID-19 related tasks. These include deep latent representation learning [24], self-supervised learning [25], and transfer learning [26], to name but a few. Further, in this study, the severity estimation based on days in hospitalisation is only a rough approximation, as we lack other clinical information; with additional clinical indicators such as chest CT findings [27, 28, 29], more objective and accurate labels could be attained. Last but not least, in this paper, SVMs were trained separately for the four tasks. Considering the potential correlation between the severity of the disease and the patient's sleep quality (or mood), a multi-task learning model might help to effectively exploit the mutually dependent information across these tasks [30, 31].
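To illustrate this multi-task direction, which is future work rather than part of the presented system, the following is a minimal sketch assuming a small PyTorch network: one encoder shared across tasks over the acoustic functionals, a separate classification head per task, and a summed cross-entropy loss. All dimensions and data are dummy placeholders.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared encoder over acoustic functionals with one softmax head per task."""
    def __init__(self, n_features=88, n_classes_per_task=(3, 3, 3, 3)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(0.3))
        self.heads = nn.ModuleList(
            [nn.Linear(64, n) for n in n_classes_per_task])

    def forward(self, x):
        shared = self.encoder(x)
        return [head(shared) for head in self.heads]

# Dummy batch: 8 segments with 88 eGeMAPS functionals, and one 3-class label
# per task (severity, sleep quality, fatigue, anxiety).
x = torch.randn(8, 88)
targets = [torch.randint(0, 3, (8,)) for _ in range(4)]

model = MultiTaskNet()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Joint loss: summing the per-task cross-entropies lets the gradients of all
# four tasks flow through the shared encoder.
logits = model(x)
loss = sum(criterion(l, t) for l, t in zip(logits, targets))
optimiser.zero_grad()
loss.backward()
optimiser.step()
```

The shared encoder is where the tasks exchange information; per-task loss weights could additionally be introduced to balance the four objectives.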
At the time of writing this paper, the world has reported a total of 3 020 117 confirmed COVID-19 cases and 209 799 fatalities, according to a dashboard developed and maintained by Johns Hopkins University 4. To leverage the potential of computer audition in the fight against this global health crisis, experiments have, for the first time, been performed based on the speech of 51 COVID-19 infected and hospitalised patients from China. In particular, audio-based models have been constructed and assessed to predict the severity of the disease, as well as health- and wellbeing-relevant mental states of the patients, including sleep quality, fatigue, and anxiety. The experimental results have shown the great potential of exploiting audio analysis in the fight against the spread of COVID-19. In the future, we will continue the data collection process, as well as collect relevant clinical reports, for a comprehensive understanding of the patient state. In addition, we will attempt to introduce interpretable models and techniques to make the predictions more traceable, transparent, and trustworthy [32].

We express our deepest sorrow for those who left us due to COVID-19; they are lives, not numbers. We further express our highest gratitude and respect to the clinicians and scientists, and anyone else who is helping to fight against COVID-19 these days while helping us maintain our daily lives. This work was partially supported by the Zhejiang Lab's International Talent Fund for Young Professionals (Project HANAMI), P. R. China, the JSPS Postdoctoral Fellowship for Research in Japan (ID No. P19081) from the Japan Society for the Promotion of Science (JSPS), Japan, and the Grants-in-Aid for Scientific Research (No. 19F19081 and No. 17H00878) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.
[1] Research and development on therapeutic agents and vaccines for COVID-19 and related human coronavirus diseases
[2] Traditional Chinese medicine for COVID-19 treatment
[3] COVID-19: Immunopathology and its implications for therapy
[4] News feature: Avoiding pitfalls in the pursuit of a COVID-19 vaccine
[5] Developing Covid-19 vaccines at pandemic speed
[6] Fast and simple high-throughput testing of COVID-19
[7] Innovative screening tests for COVID-19 in South Korea
[8] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT
[9] COVID-CAPS: A capsule network-based framework for identification of COVID-19 cases from X-ray images
[10] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
[11] COVID-ResNet: A deep learning framework for screening of COVID-19 from radiographs
[12] Abnormal respiratory patterns classifier may contribute to large-scale screening of people infected with COVID-19 in an accurate and unobtrusive manner
[13] Features, evaluation and treatment coronavirus (COVID-19), in StatPearls
[14] COVID-19 and computer audition: An overview on what speech & sound analysis could contribute in the SARS-CoV-2 corona crisis
[15] openSMILE - the Munich versatile and fast open-source audio feature extractor
[16] The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism
[17] The INTERSPEECH 2019 computational paralinguistics challenge: Styrian dialects, continuous sleepiness, baby sounds & orca activity
[18] The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing
[19] At the border of acoustics and linguistics: Bag-of-audio-words for the recognition of emotions in speech
[20] Prediction-based learning for continuous emotion recognition in speech
[21] Generative adversarial nets
[22] Snore-GANs: Improving automatic snore sound classification with synthesized data
[23] Adversarial training in affective computing and sentiment analysis: Recent advances and perspectives
[24] Emotion recognition in speech with latent discriminative representations learning
[25] Learning problem-agnostic speech representations from multiple self-supervised tasks
[26] Learning image-based representations for heart sound classification
[27] Chest CT findings in coronavirus disease-19 (COVID-19): Relationship to duration of infection
[28] Time course of lung changes on chest CT during recovery from 2019 novel coronavirus (COVID-19) pneumonia
[29] Relation between chest CT findings and clinical conditions of coronavirus disease (COVID-19) pneumonia: A multicenter study
[30] Cross-corpus acoustic emotion recognition from singing and speaking: A multi-task learning approach
[31] Jointly predicting arousal, valence and dominance with multi-task learning
[32] Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)