key: cord-0617996-5q7u6t7f authors: Rashid, Meemnur; Alman, Kaisar Ahmed; Hasan, Khaled; Hansen, John H.L.; Hasan, Taufiq title: Respiratory Distress Detection from Telephone Speech using Acoustic and Prosodic Features date: 2020-11-15 journal: nan DOI: nan sha: 0010f32ac43ca5864983343672936b536431ae90 doc_id: 617996 cord_uid: 5q7u6t7f

With the widespread use of telemedicine services, automatic assessment of health conditions via telephone speech can significantly impact public health. This work summarizes our preliminary findings on automatic detection of respiratory distress using well-known acoustic and prosodic features. Speech samples are collected from de-identified telemedicine phonecalls from a healthcare provider in Bangladesh. The recordings include conversational speech samples of patients with mild or severe respiratory distress or asthma symptoms talking to doctors. We hypothesize that respiratory distress may alter speech features such as voice quality, speaking pattern, loudness, and speech-pause duration. To capture these variations, we utilize a set of well-known acoustic and prosodic features with a Support Vector Machine (SVM) classifier for detecting the presence of respiratory distress. Experimental evaluations are performed using a 3-fold cross-validation scheme, ensuring patient-independent data splits. We obtained an overall accuracy of 86.4% in detecting respiratory distress from the speech recordings using the acoustic feature set. Correlation analysis reveals that the top-performing features include loudness, voice rate, voice duration, and pause duration.

Respiratory diseases, including asthma, chronic obstructive pulmonary disease (COPD), acute lower respiratory tract infection, and lung cancer [1], are among the leading causes of death and disability worldwide. According to the WHO, about 339 million people suffer from asthma [2].
COPD was responsible for about 3.17 million deaths in 2015, and in 2016, about 251 million people were affected by this disease [3]. Cystic fibrosis, another respiratory disease, often associated with bronchiectasis, affects about 30,000 people in the US [4]. Shortness of breath, stubborn cough, wheezing, and chest pain are typical symptoms of respiratory diseases [5]. COVID-19, first identified in late 2019, is also a respiratory disease [6] and has already claimed over 1 million lives worldwide. Early detection of symptoms such as respiratory distress can be vital in tracking various respiratory diseases, including COVID-19.

Early diagnosis is critical in effectively treating most respiratory diseases, including COPD and asthma [7]. Spirometry is the most commonly used test for an initial respiratory assessment of a patient. Confirmatory diagnosis of various respiratory disorders may require additional examination such as a chest x-ray or other laboratory tests. It is well known that shortness of breath affects the human speech production mechanism [8] and, in theory, respiratory distress should be detectable from speech signal analysis alone. Thus, automatic initial assessment of respiratory function from speech can be valuable in low-resource settings where there is a lack of experienced general practitioners.

Previous work on speech signal analysis for respiratory assessment has been limited. Relevant research in this area includes the detection of breathing sound [9], wheezing [10], and cough [11]. In [12], methods are presented for estimating and visualizing the breathing pattern from speech recordings. Although the authors claim that the system can be used for detecting pathological conditions, only healthy volunteer data were used for evaluation. In [13], the authors introduced "Spirocall", a method of performing spirometry using standard telephone calls.
This method uses a 3D printed whistle and transmits the sound of the breathing effort over the telephone channel. Although the method is promising, it still requires a 3D printed device and the presence of trained personnel to be effective.

In this work, we present our preliminary findings on detecting respiratory distress (or shortness of breath) from conversational telephone speech. Voice recordings were collected from a telemedicine provider, and the patients' personal identifying information was removed. We hypothesized that respiratory distress would affect the speech sound and rhythm and therefore utilized a set of acoustic and prosodic features for classification between patients having respiratory distress and healthy subjects.

Speech production involves airflow from the lungs through the larynx, vocal cord vibration, and resonance in the oral and nasal cavities. Since the lung is a vital organ for speech production, respiratory disease is expected to cause physical changes in the speech production pathway and affect the speech signal itself [14]. Previous research [15] shows that the lung volumes and breathing patterns during speech differ from quiet respiration, and alterations in speech breathing are disease- and task-specific. Asthmatic patients tend to show an increased duration of silence between speech segments, lower syllable rates, and increased time in non-speech ventilatory activity [16]. Thus, we can assume that respiratory distress will affect the speaking rate and speech breathing pattern [8]. In [17], a high degree of correlation was found between the FEV1/FVC ratio obtained from spirometry and the harmonics-to-noise ratio (HNR) obtained from human speech for asthma patients. This motivates us to consider traditional acoustic features to analyze speech signals to detect respiratory distress symptoms.

The speech recordings used for this study are collected from Digital Healthcare Solutions (DHS), a leading telemedicine service provider in Bangladesh.
The telemedicine service operates through direct phone calls between the patient and physicians. For this study, a total of 88 phone-call recordings are collected. A patient's recordings were included if the patient called in to report suffering from respiratory distress. The speech recordings can be categorized into three classes: (i) patients reporting severe respiratory distress who are advised to visit a hospital urgently, (ii) patients reporting mild respiratory distress who generally have a history of breathing difficulty, and (iii) healthy control subjects. The data is summarized in Table 1. Patients included in (i) and (ii) may have a chronic condition such as asthma. However, our study's goal is to detect the condition of respiratory distress (or shortness of breath), not the actual disease. For this reason, the disease information is not used for our analysis even if it is available for some patients.

The recordings are collected at a sampling rate of 44.1 kHz. Every telemedicine phonecall consists of the speech of the patient (or their representative, e.g., a relative or guardian) and the physician. The recordings also contain occasional cough and wheezing sounds from the patient or other background noise. Accordingly, the audio recordings are annotated in six categories: (i) patient, (ii) doctor, (iii) patient representative, (iv) cough sound, (v) wheezing sound, and (vi) background noise. A typical annotated phonecall recording, including some of these audio events, is shown in Fig. 2. The total number of annotated audio segments from all categories is 4522, as summarized in Table 2. In this work, only male patients are considered. The patients' ages range from 25 to 65 years.

A total of 1957 utterances (Table 1) of different lengths were prepared for cross-validation experiments. All 1957 segmented speech signals are distributed into 3 folds for cross-validation (Fig. 3). The train-test split for each fold is kept at approximately 70-30.
We ensure that data from the same patient are never used for both training and testing, so that the algorithms do not learn to recognize the speakers.

The telephone calls typically include background noise and a static hum due to the recording instrument. Therefore, before feature extraction, we performed a speech enhancement operation to reduce the background noise. This step is a prerequisite for extracting the prosodic feature set. We use the voice activity detection (VAD) approach presented in [18], as provided in the MATLAB voicebox toolkit [19].

We use the Interspeech 2010 Paralinguistic Challenge feature set proposed in [20]. The feature set contains 38 low-level descriptors (LLDs) and 21 functionals, yielding a total of 1582 acoustic features [21], as summarized in Table 3. The LLDs are extracted at 100 frames per second with a diverse set of short-time windows and smoothed by simple moving average low-pass filtering with a window length of 3 frames. Afterward, the first-order regression coefficients are calculated, followed by the 21 functionals for every instance in the dataset [21]. These features are extracted using the openSMILE toolkit [22].

For the prosodic feature set, we only utilize features related to speaking rhythm. The following four LLDs are computed: (i) voice rate, (ii) voice duration, (iii) pause duration, and (iv) voice-to-pause ratio. A window size of 1 s is used, and the same 21 functionals used for the acoustic feature set are calculated to obtain an 84-dimensional feature set (4 LLDs × 21 functionals).

The extracted features have a high dynamic range, and thus a normalization step is necessary for effective classification [23]. In this study, we use standard mean and variance normalization for each feature. Normalization of both training and test features is done for each fold using the mean and variance calculated from the training data of the corresponding fold.

An SVM classifier is trained for binary classification using the LIBSVM tool [24]. The two classes are (i) distressed and (ii) normal.
The "Distress" class contains data of patients who reported having mild or severe respiratory distress. The "Normal" class contains the speech segments of healthy control subjects. A linear kernel is used for the SVM model, with a fixed seed for every fold to ensure that the results are reproducible.

The proposed method's performance is evaluated with respect to accuracy, sensitivity, specificity, F1-score, and AUC (area under the ROC curve) for each fold. The mean and standard deviation of each performance metric are calculated across the folds. The acoustic and prosodic feature sets are first evaluated separately. In the final stage, the feature sets are fused by concatenation, and correlation-based feature reduction is performed to reduce the feature dimension to 251. The results are summarized in Table 4.

From the results, we observe that for acoustic features, the classifier shows a mean accuracy of 86.4 (±2.1)% with the best sensitivity, specificity, F1-score, and AUC. The acoustic features have consistent performance across the 3 folds, including identical AUC values, indicating the subject/patient invariance of the classifier. Compared to the acoustic features, the prosodic features performed poorly. This feature set resulted in an accuracy of only 56.5 (±2.5)%, which is only slightly better than a random classifier. The fusion of the features and feature selection did not provide any significant improvement in the overall performance.

To relate the performance of individual features to the physiological aspects of speech production, we combined the acoustic and prosodic feature sets and performed a correlation-based feature ranking. The analysis returned 251 top-ranked features, including 242 acoustic features and 9 prosodic features. To simplify the analysis, we look at the LLDs included among the top 251 features and provide the rankings of the top 10 LLDs in Table 5.
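The text does not specify which correlation-based ranking method was used. As a rough illustration only, the sketch below assumes the simplest variant: each feature is scored by the absolute Pearson correlation between its values and the binary class label, and the highest-scoring features are kept. The function names, the toy data, and the feature names are hypothetical and are not taken from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def rank_features(rows, labels, names, top_k):
    """Return the top_k (name, correlation) pairs, sorted by |correlation| with the label."""
    columns = list(zip(*rows))  # one tuple of values per feature
    scored = [(name, pearson(col, labels)) for name, col in zip(names, columns)]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top_k]

# Toy data (assumption): 4 speech segments, 3 made-up features.
# Label 1 = distress, label 0 = normal.
rows = [[0.2, 5.0, 0.0], [0.9, 5.0, 1.0], [0.1, 5.0, 1.0], [0.8, 5.0, 1.0]]
labels = [0, 1, 0, 1]
names = ["loudness_mean", "constant_feature", "pause_duration_max"]
top = rank_features(rows, labels, names, top_k=2)
```

In the setting described here, `top_k` would be 251 and the rows would hold the concatenated acoustic and prosodic features. A redundancy-aware selector (one that also penalizes correlation between features) would be an equally plausible reading of "correlation-based" and would yield a different ranking.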
We observe that features that can be physiologically connected to respiratory distress are included, such as loudness, voice rate, voice duration, and pause duration. Loudness is the most prominent feature in detecting respiratory distress. Spectral and cepstral features are also found to be important in the ranking analysis, supporting our hypothesis that voice quality can be altered due to respiratory distress (e.g., increased hoarseness).

This study presented findings from our preliminary analysis on respiratory distress detection from telephone speech collected from real telemedicine phonecalls from actual patients reporting adverse health conditions. We have utilized a set of acoustic and prosodic features for binary classification between a patient reporting respiratory distress and a healthy control subject. Experiments using the proposed feature set and an SVM classifier showed promising results, achieving above 85% in all of the performance metrics, namely accuracy, sensitivity, specificity, and F1-score, in a 3-fold cross-validation experiment. The top-ranked features obtained by correlation analysis were found to be physiologically meaningful. The proposed method could significantly impact the automatic early detection of respiratory diseases through telemedicine phonecalls in low-resource settings. A further large-scale clinical study is required to confirm the diagnostic usability of the proposed approach.
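The evaluation protocol used in this work (patient-independent folds, normalization with training-fold statistics, and per-fold metrics) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the round-robin assignment of patients to folds, all function names, and the toy values are hypothetical, and the classifier itself (a linear SVM in the paper) is left out.

```python
import statistics

def patient_independent_folds(patient_ids, n_folds=3):
    """Assign each segment to a fold so that all segments of one patient share a fold."""
    fold_of_patient = {}
    for pid in patient_ids:
        if pid not in fold_of_patient:
            # Assumption: round-robin by order of first appearance.
            fold_of_patient[pid] = len(fold_of_patient) % n_folds
    return [fold_of_patient[pid] for pid in patient_ids]

def fit_normalizer(train_rows):
    """Per-feature mean and standard deviation, estimated on the training data only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant feature has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def normalize(rows, means, stds):
    """Apply mean and variance normalization using previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F1 for labels in {0, 1} (1 = distress)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    ratio = lambda num, den: num / den if den else 0.0
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Toy usage with made-up values.
folds = patient_independent_folds(["p1", "p1", "p2", "p3", "p3", "p4"])
means, stds = fit_normalizer([[1.0, 10.0], [3.0, 10.0]])   # training segments of one fold
test_norm = normalize([[5.0, 12.0]], means, stds)          # test segments reuse training statistics
scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

For each fold, the segments assigned to that fold form the test set and the rest form the training set. The normalizer is fitted on the training part only, any binary classifier is trained on the normalized training features, and the per-fold metrics are then averaged across folds.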
[1] The global impact of respiratory disease
[2] Chronic respiratory diseases
[3] Chronic obstructive pulmonary disease (COPD)
[4] The top 8 respiratory illnesses and diseases
[5] Symptoms and signs of respiratory disease
[6] Human coronavirus NL63 infection and other coronavirus infections in children hospitalized with acute respiratory disease in Hong Kong, China
[7] Detection of asthma and chronic obstructive pulmonary disease in primary care
[8] Speech breathing patterns in health and chronic respiratory disease
[9] An effective algorithm for automatic detection and exact demarcation of breath sounds in speech and song signals
[10] Design of wearable breathing sound monitoring system for real-time wheeze detection
[11] Cough detection algorithm for monitoring patient recovery from pulmonary tuberculosis
[12] Deep sensing of breathing signal during conversational speech
[13] SpiroCall: Measuring lung function over a phone call
[14] Voice changes in patients with chronic obstructive pulmonary disease
[15] Speech breathing in patients with lung disease
[16] Speech segment durations produced by healthy and asthmatic subjects
[17] Speech signal analysis as an alternative to spirometry in asthma diagnosis: investigating the linear and polynomial correlation coefficient
[18] A statistical model-based voice activity detection
[19] Voicebox is a MATLAB toolbox for speech processing
[20] The Interspeech 2010 paralinguistic challenge
[21] Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing
[22] Recent developments in openSMILE, the Munich open-source multimedia feature extractor
[23] Prediction of cognitive load from speech with the VOQAL voice quality toolbox for the Interspeech 2014 computational paralinguistics challenge
[24] LIBSVM: A library for support vector machines