title: Neural Networks for Pulmonary Disease Diagnosis using Auditory and Demographic Information
authors: Hosseini, Morteza; Ren, Haoran; Rashid, Hasib-Al; Mazumder, Arnab Neelim; Prakash, Bharat; Mohsenin, Tinoosh
date: 2020-11-26

Pulmonary diseases impact millions of lives globally every year. The recent outbreak of COVID-19, a novel pulmonary infection, has more than ever brought the attention of the research community to machine-aided diagnosis of respiratory problems. This paper is thus an effort to exploit machine learning for the classification of respiratory problems, and proposes a framework that employs as much correlated information (auditory and demographic information in this work) as a dataset provides, to increase the sensitivity and specificity of a diagnosing system. First, we use deep convolutional neural networks (DCNNs) to process and classify a publicly released pulmonary auditory dataset. We then take advantage of the demographic information within the dataset and show that the accuracy of the pulmonary classification increases by 5% when the model is trained on the auditory information in conjunction with the demographic information. Since demographic data can be extracted using computer vision, we suggest using another parallel DCNN to estimate the demographic information of the subject under test as viewed by the processing computer. Lastly, as a step toward bringing the healthcare system to users' fingertips, we measure the deployment characteristics of the auditory DCNN model on the processing components of an NVIDIA TX2 development board.

In 2016, pulmonary diseases were among the top 10 causes of death: ranked 1st for low-income and 5th for high-income countries [1]. Recently, with the outbreak of COVID-19 as a novel pulmonary infection, a tremendous amount of attention has been directed to controlling the pandemic, with countries taking extreme measures to diagnose infected patients. Measures such as extensive testing and early-stage diagnosis help to locate and contain the infection, and are reportedly the most effective preventive actions to control the contagion in a pandemic.

Pulmonary problems encompass a wide range of chronic and infectious diseases, and because of the common organ they affect, the lung, they develop respiratory symptoms whose auditory signals, recorded from various medical devices, are among the first to be scrutinized by a medical expert. As an example, COVID-19 develops symptoms such as dry cough, fever, fatigue, dyspnea, and shortness of breath that vary in severity at different stages of the disease progression, and correlate differently with ethnicity, gender, and age groups [10]. More than 70% of confirmed COVID-19 patients have reported fever in tandem with a dry cough [23]. Meanwhile, clinical case records indicate that the young population is less likely to develop COVID-19-relevant symptoms, contrary to the elderly, who are the most vulnerable group [9]. Traditionally, when someone feels symptoms, they either call a doctor or are seen by medical experts at walk-in clinics, where extensive use of vital signs and visual and auditory information is made to reach diagnostic decisions.
Such practice during a pandemic or in remote locations is impractical because of the limited capacity of facilities and human resources at health centers and, ironically, can accelerate the spread of the infection. On the other hand, governments and organizations have called for people to stay at home during the pandemic, which by itself has caused confusion and created another barrier. Thus, early-stage, clinic-independent machine assistance is critical for the initial diagnosis of the disease and/or for assessing its severity.

Our goal in this research is to allow machine learning algorithms running on general computing processors (e.g., those in cell-phones and tablets) to assess patients similar to what doctors do at triage and in telemedicine, using passively recorded audio and/or video and self-declared information, to bring proactive healthcare to users' fingertips and to estimate the urgency of whether they need to attend clinics and be further examined with more specialized test-kits or facilities. Our vision is to provide a framework that enables early detection for anyone, anywhere.

Figure 1: The proposed framework to classify respiratory problems has two DCNN components that process data from a user under test. Part of the information is auditory, such as the audio recorded from a medical electronic device like a microphone or a stethoscope, and part is demographic, such as age, gender, and ethnicity, which can either be estimated using a computer vision algorithm or entered manually. The framework is flexible and scalable in the sense that it can incorporate new sensors easily, allowing the system to be tailored to a variety of situations, such as in-home consultations, clinical visits, or even symptom detection in public settings using non-contact sensors.

We develop our work on a publicly released respiratory sound database that includes both auditory and demographic information recorded from 126 subjects, covering 7 pulmonary classes including a healthy condition, with two sets of annotations. More specifically, since the respiratory sound dataset includes both types of information per patient, we examine how the absence/presence of the demographic information impacts the total accuracy of the model. To complete the framework, we develop another deep neural network that estimates demographic information, including age and gender, from captured face images, so that it can be correlated with the auditory signals recorded from the subject under test to achieve a higher sensitivity and specificity of diagnosis. The main contributions of this work include:

• Statistically analyze the information in a public respiratory sound database, to justify extracting a reasonably balanced dataset out of it.
• Train a DCNN on the extracted auditory dataset without considering the demographic information.
• Train another DCNN model on a face image dataset annotated with age, gender, and ethnicity, so as to estimate demographic information of a subject viewed by a computer.
• Train on the auditory dataset in conjunction with the demographic information.
• Deploy the first DCNN model to the TX2 embedded system and measure its implementation characteristics on CPU and GPU.

With the advancement of machine learning and deep learning algorithms, audio-based biomedical diagnosis and anomaly detection have recently become an active area of research. Important applications of audio-based diagnosis using deep learning include detection of sleep apnea, recognition of cough tone, and classification of heart sounds, to name a few. Early research [7] shows that machine learning (ML) tools trained on a limited unpublished dataset can distinguish, from cough sounds alone, COVID-19 patients from subjects who are healthy or have upper-respiratory coughs, with a high accuracy of 96.8%. [?] introduces end-to-end convolutional neural networks for cough and dyspnea detection. The authors in [3] used both DCNNs and recurrent neural networks (RNNs) to classify cough sounds that they collected using chest-mounted sensors. The authors in [13] used deep learning to detect sleep apnea. Classification of heart sounds into normal and abnormal classes was conducted in [19] using DCNNs. The authors in [4] and [15] used DCNNs and RNNs, respectively, to classify lung sounds. Most of these works report high levels of accuracy on unpublished datasets that are not accessible to the research community. The 2017 International Conference on Biomedical Health Informatics (ICBHI) [17] issued a benchmark dataset of respiratory sounds to facilitate research on respiratory sound classification. Since then, researchers have proposed various algorithms [5, 11, 14, 16] using different deep learning techniques to classify respiratory cycle anomalies, such as the precise locations of wheezes and crackles within the cycle of each respiratory sound recording. The authors in [12] proposed a digital stethoscope that provides an immediate diagnosis of respiratory diseases, developing a modified bi-ResNet architecture using STFT and wavelet feature extraction. A log-quantized deep CNN-RNN model for respiratory sound classification was proposed in [2] for memory-limited wearable devices.

Figure 2: Statistics of the respiratory sound database that contains auditory samples from 126 patients: A) a 2D histogram of 7 pulmonary classes with respect to 10 age groups; B) breakdown of the pulmonary classes in the dataset; C) breakdown of each pulmonary class with respect to age groups; D) breakdown of each pulmonary class with respect to the four recording medical devices; E) our selection of 52 and 11 individual subjects for the train and test datasets, respectively, covering 5 pulmonary classes recorded with the Meditron Electronic Stethoscope.

The framework, depicted in Fig. 1, leverages audio/video to extract medically relevant information and combines the extracted features with other entered/self-declared patient data. The audio processing incorporates an ML approach, such as a DCNN, that extracts symptomatic features like crackles and wheezes of lung sounds from a given window of recorded sound of a subject under test. In the video processing path, the captured RGB images are given to a ResNet-34 DCNN to process and estimate the user's other demographic and symptomatic features, such as age and gender.
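To make the video path concrete, below is a minimal PyTorch sketch of such a demographic branch: a ResNet-34 backbone with two small heads, one for gender and one for age group. The head sizes (2 genders, 10 age groups, matching the age binning of Fig. 2), the input resolution, and the class/variable names are our assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class DemographicNet(nn.Module):
    """Illustrative demographic branch: ResNet-34 backbone, two heads."""
    def __init__(self, n_age_groups=10):
        super().__init__()
        backbone = models.resnet34(weights=None)  # assumption: torchvision backbone
        backbone.fc = nn.Identity()               # expose the 512-d feature vector
        self.backbone = backbone
        self.gender_head = nn.Linear(512, 2)      # gender classification
        self.age_head = nn.Linear(512, n_age_groups)  # age-group classification

    def forward(self, x):                         # x: (B, 3, 200, 200) face crops
        feats = self.backbone(x)
        return self.gender_head(feats), self.age_head(feats)

# Single-image usage example
gender_logits, age_logits = DemographicNet()(torch.randn(1, 3, 200, 200))
```

Predicting a coarse age group rather than an exact age keeps the head small and, as reported later, is also the granularity at which the demographic information is fused with the audio model.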
The extracted audio/video features, along with the other relevant entered data, are concatenated toward the final layers, and with the addition of a few more neural network layers, or an ensemble of classifiers at the last layer, a probability vector of diagnoses for the user under test is reported. Both audio and video data can extend the scope of clinically reported symptoms to more diverse features that may be invisible to human perception. For example, when listened to by a trained machine, the features extracted from the sound of a patient's cough can go beyond terms like "dry" or "productive" that are commonly reported in clinical case records.

For the auditory dataset, we used a public respiratory sound database [18], which includes 920 recordings acquired from 126 participants, annotated with 8 types of respiratory conditions: URTI, Healthy, Asthma, COPD, LRTI, Bronchiectasis, Pneumonia, and Bronchiolitis. The recordings were collected using four types of medical equipment: an AKG C417L Microphone, a 3M Littmann Classic II SE Stethoscope, a 3M Littmann 3200 Electronic Stethoscope, and a Welch Allyn Meditron Master Elite Electronic Stethoscope. The duration of the recordings ranges from 10 to 90 seconds, dominated by 20s samples. Fig. 2-A, B, and C plot the distribution of the subjects with respect to their diagnosed disease and the age groups it impacts, and Fig. 2-D shows the contribution of each of the 4 medical devices to the recordings. Among the four recording devices, the Meditron Electronic Stethoscope is the only one whose recordings encompass all of the 8 pulmonary conditions except Asthma, and it was used for 63 of the 126 participants. The recordings from the other 3 devices are mostly taken from COPD-diagnosed participants. By eliminating Asthma, Pneumonia, and LRTI, which have few or no samples within the Meditron recordings, we extracted a subset encompassing all 63 Meditron participants and split it into a semi-balanced train set of 52 and a test set of 11 participants, covering 5 pulmonary classes. Fig. 2-E plots the selected train/test dataset based on the total duration of each class. The database also provides demographic information for the 126 participants, along with a second annotation that marks the beginning/end of respiratory cycles and the precise locations of crackle and wheeze events per recording. Based on the second annotation, we counted the total number of respiratory cycles to estimate the slowest and average respiratory cycles within the dataset, and to decide on a window size for cutting the recordings into smaller frames. Table 1 summarizes the number of subjects, the duration of recordings, and the number of respiratory cycles per pulmonary class within both the train and test datasets.

UTKFace database. The UTKFace dataset [22] is a large-scale dataset consisting of over 20,000 face images with annotations of age (ranging from 0 to 116 years), gender, and ethnicity. In [8], a ResNet-34 [6] is trained on the UTKFace dataset; we reproduced those training results and report the accuracy for precise age as well as for age groups on a random 20% test split.

For data augmentation of the respiratory sound database, every recorded audio sample is cut into frames with a duration of 5s and a stride of 1s, which means every two adjacent frames overlap by 4s, and every 20s recorded sample results in sixteen 5s frames.
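The following is a short sketch of this framing step. It assumes a 44.1 kHz sampling rate, so a 5s frame has 220,500 samples, matching the 1×220500 input size described below; the function name and defaults are ours for illustration.

```python
import numpy as np

def frame_audio(wave, sr=44100, win_s=5.0, stride_s=1.0):
    """Cut a recording into overlapping frames: 5 s windows with a 1 s
    stride, so adjacent frames overlap by 4 s."""
    win, stride = int(win_s * sr), int(stride_s * sr)
    n = 1 + max(0, (len(wave) - win) // stride)
    return np.stack([wave[i * stride : i * stride + win] for i in range(n)])

frames = frame_audio(np.zeros(20 * 44100))  # a 20 s recording
print(frames.shape)                         # (16, 220500): sixteen 5 s frames
```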
Therefore, the total 2000 seconds of the training dataset generates 1600 frames, and the total 460s of testing data generates 368 frames of 5s samples. The choice of 5s frames was made empirically by experimenting with frames ranging from 1s to 10s. For data augmentation of the UTKFace images, we use common image augmentation techniques such as flipping, shifting, and resizing the images within the dataset.

We used a ResNet-34 DCNN [6] for the UTKFace RGB images of size 200×200, and an EnvNet-like [21] DCNN for the respiratory sound frames of size 1×220500. For the EnvNet-like DCNN, the input from the audio recordings is a one-dimensional vector whose size depends on the window selected for the framework. To best utilize the one-dimensional input, we use two one-dimensional convolution layers to extract relevant features, followed by a non-overlapping max-pooling operation to downsample the feature map. The subsequent layers include two-dimensional convolutional layers with max-pooling layers in between for efficient classification of the diseases. Finally, the fully connected layers summarize the required feature information and generate an output that classifies the 5 types of pulmonary conditions in our extracted dataset.

The classification accuracy of age and gender estimation with ResNet-34 is reported in Table 2. Although the DCNN model does not precisely classify age within the test dataset, it is able to classify gender and to estimate age groups when the range of the groups is widened. This corresponds to the next subsection, where we combine the auditory information with the age group of the subjects rather than their precise age.

We first conduct a set of experiments to explore the EnvNet-based DCNN configuration that achieves the highest accuracy. Then, we combine the audio dataset with the age groups it was recorded from, as depicted in Fig. 1.

Table 4: Deploying the EnvNet model to commercial off-the-shelf devices including a dual-core Denver CPU, a quad-core ARM A57 CPU, and a combination of ARM CPU + Pascal GPU from the NVIDIA TX2 board.

Table 3 compares the two sets of experiments, indicating that the COPD and healthy conditions are diagnosed with higher accuracy, resulting in a total test accuracy increase of 5% when the demographic information is taken into account.

The framework is intended to be flexibly deployable to general-purpose devices, where the ML models trained within the framework can be deployed onto processing machines ranging from front-end edge devices to back-end computer servers. Trading off computation complexity against classification accuracy, trained ML models can be deployed to edge devices (e.g., a cell-phone or tablet) to process data locally if information privacy is a concern, or otherwise to cloud servers that can process data with more elaborate, up-to-date models that yield higher quality metrics. All DCNN models have at least two hardware-level characteristics, the model size and the number of compute operations per inference, both of which are upper-bounded by the resources of the platform they are deployed to, or by the inference deadline. When putting all the components of the framework together, both the hardware resource constraints and the diagnosis latency should meet the application goals.
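For illustration, below is a minimal PyTorch sketch of an EnvNet-style model as described above: two 1-D convolutions on the raw waveform, non-overlapping max pooling, a reshape that treats the learned channels as a frequency-like axis, 2-D convolutions, and fully connected layers ending in 5 classes. All layer widths and kernel sizes are our assumptions, not the paper's exact configuration; the final print gives a rough model-size (parameter-count) check.

```python
import torch
import torch.nn as nn

class EnvNetLike(nn.Module):
    """Sketch of an EnvNet-style DCNN for 5-class respiratory sound frames
    of shape (batch, 1, 220500), i.e. 5 s of audio at 44.1 kHz."""
    def __init__(self, n_classes=5):
        super().__init__()
        # Two 1-D convolutions extract low-level features from the raw
        # waveform; non-overlapping max pooling downsamples the feature map.
        self.front = nn.Sequential(
            nn.Conv1d(1, 40, kernel_size=8), nn.ReLU(),
            nn.Conv1d(40, 40, kernel_size=8), nn.ReLU(),
            nn.MaxPool1d(160),  # stride defaults to kernel size: non-overlapping
        )
        # The 40 learned channels are treated as a frequency-like axis,
        # so the remaining layers are ordinary 2-D convolutions.
        self.back = nn.Sequential(
            nn.Conv2d(1, 50, kernel_size=(8, 13)), nn.ReLU(),
            nn.MaxPool2d(3),
            nn.Conv2d(50, 50, kernel_size=(1, 5)), nn.ReLU(),
            nn.MaxPool2d((1, 3)),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),  # infers the flattened size at first call
            nn.Linear(256, n_classes),
        )

    def forward(self, x):        # x: (B, 1, 220500)
        x = self.front(x)        # -> (B, 40, 1378)
        x = x.unsqueeze(1)       # -> (B, 1, 40, 1378): channels become height
        return self.back(x)

model = EnvNetLike()
logits = model(torch.randn(2, 1, 220500))          # -> (2, 5) class scores
print(sum(p.numel() for p in model.parameters()))  # rough model-size check
```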
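Likewise, the inference-deadline side can be probed by timing single-frame (batch size 1) inference. The sketch below reuses the EnvNetLike model from the previous block and is a simple wall-clock probe, not the measurement methodology used on the TX2 board.

```python
import time
import torch

def measure_latency(model, device="cpu", n_runs=20):
    """Mean wall-clock time of one forward pass at batch size 1."""
    model = model.to(device).eval()
    x = torch.randn(1, 1, 220500, device=device)  # one 5 s audio frame
    with torch.no_grad():
        for _ in range(3):                        # warm-up runs
            model(x)
        if device != "cpu":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        if device != "cpu":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

print(f"CPU latency per frame: {measure_latency(EnvNetLike()):.3f} s")
```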
Having set the batch size equal to 1, the trained models obtained from the previous section are deployed on two mobile CPUs, the Denver (dual-core) and the ARM Cortex-A57 (quad-core), as well as an embedded CPU+GPU implementation, with different frequency settings. All experiments were performed on the TX2 development board, which provides precise on-board power measurement. Table 4 summarizes the implementations, indicating that, given a 5s frame of recording in memory, the least power-dissipating implementation (Denver at a low frequency) takes 10 seconds to classify one frame, whereas the most energy-efficient implementation (CPU+GPU) dissipates approximately 10× more power but classifies the same frame within 0.1 seconds.

The work most related to ours that develops a DCNN on the same respiratory sound database is [20], which reports an overall accuracy of 97%. The main difference between the two works is that our model uses additional information in tandem with the audio data, and proposes a framework that combines as much of the correlated information within the dataset as possible to increase the diagnosis accuracy. The other difference is that our selected dataset is semi-balanced among 5 classes of respiratory sounds recorded from one unique medical device that was used uniformly across 63 subjects diagnosed with 7 out of the 8 classes within the database, whereas the dataset selection in [20] is heavily dominated by COPD recordings, a major portion of which, as depicted in Fig. 2-D, were recorded by two medical devices that sampled only from COPD-diagnosed participants. Table 5 provides a comparison and a summary of the total number of augmented samples per class within the two works.

In an attempt to exploit machine learning algorithms to classify respiratory problems, we proposed a framework that employs as much correlated information as a dataset provides, and showed that by combining auditory and demographic information over a reasonably balanced dataset selected from a publicly released respiratory sound database, the diagnosis accuracy of the trained deep convolutional neural networks (DCNNs) increases by 5%. Since the demographic data can be extracted and estimated using computer vision, we suggest using another DCNN that works in parallel to the auditory signal-processing DCNN to estimate the demographic information of the subject under test. Lastly, we deployed our DCNN models on a dual-core Denver CPU, a quad-core ARM Cortex-A57, and a heterogeneous CPU+GPU implementation on the NVIDIA TX2 development board to measure the hardware characteristics of deploying the model to an embedded device.
References
[1] The Top 10 Causes of Death.
[2] Deep Neural Network for Respiratory Sound Classification in Wearable Devices Enabled by Patient Specific Model Tuning.
[3] Deep neural networks for identifying cough sounds.
[4] Classification of lung sounds using convolutional neural networks.
[5] Convolutional neural networks based efficient approach for classification of lung diseases.
[6] Deep residual learning for image recognition.
[7] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app.
[8] FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age.
[9] Are children less susceptible to COVID-19?
[10] Clinical features of COVID-19 in elderly patients: A comparison with young and middle-aged patients.
[11] Detection of Adventitious Respiratory Sounds based on Convolutional Neural Network.
[12] LungBRN: A Smart Digital Stethoscope for Detecting Respiratory Disease Using bi-ResNet Deep Learning Algorithm.
[13] Tracheal sound analysis using a deep neural network to detect sleep apnea.
[14] Convolutional neural networks learning from respiratory data.
[15] Deep auscultation: Predicting respiratory anomalies and diseases via recurrent neural networks.
[16] Robust Deep Learning Framework For Predicting Respiratory Anomalies and Diseases.
[17] A respiratory sound database for the development of automated classification.
[18] An open access database for the evaluation of respiratory sound classification algorithms.
[19] Classification of heart sound recordings using convolution neural network.
[20] Lung Disease Classification using Deep Convolutional Neural Network.
[21] Learning environmental sounds with end-to-end convolutional neural network.
[22] Age progression/regression by conditional adversarial autoencoder.
[23] Incidence, clinical characteristics and prognostic factor of patients with COVID-19: a systematic review and meta-analysis.