key: cord-0195013-oeavr2yq authors: Singh, Vishwanath Pratap; Kumar, Shashi; Jha, Ravi Shekhar; Pandey, Abhishek title: SRIB Submission to Interspeech 2021 DiCOVA Challenge date: 2021-06-15 journal: nan DOI: nan sha: d2e2046482f68f7584f5c33fdf898d3c8163054d doc_id: 195013 cord_uid: oeavr2yq
* Represents equal contribution.

The COVID-19 pandemic has resulted in more than 125 million infections and more than 2.7 million casualties. In this paper, we attempt to classify COVID vs. non-COVID cough sounds using signal processing and deep learning methods. Air turbulence, vibration of tissues, movement of fluid through the airways, and the opening and closure of the glottis are some of the sources of the acoustic signal produced during cough. Does COVID-19 alter the acoustic characteristics of breath, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. In this paper, we incorporate novel data augmentation methods for cough sounds, multiple deep neural network architectures and training methods, and handcrafted features. Our proposed system gives a 14% absolute improvement in area under the curve (AUC). The proposed system was developed as part of the Interspeech 2021 special session and challenge on diagnosing COVID-19 using acoustics (DiCOVA). Our proposed method secured the 5th position on the leaderboard among 29 participants.

COVID-19, caused by the novel coronavirus SARS-CoV-2, has claimed thousands of lives and affected millions of people around the world, with the number of deaths and infections growing day by day. According to [1], cough is the most prevalent symptom of COVID-19, accounting for 60.4% of reported occurrences. The real-time polymerase chain reaction (RT-PCR) test is the standard diagnostic for COVID-19, but it is time-consuming. Thus, deep learning methods that classify non-COVID and COVID cough sounds can play a critical role in COVID-19 diagnosis.

Cough is a powerful natural respiratory defense mechanism that clears the central airways of the human breathing system. The characteristics of the cough sound depend on the airflow and the configuration of the tissue elements involved, and are therefore likely to change during the cough as the physiological configuration evolves [2, 3]. The temporal pattern of the cough sound can be analyzed in three phases: the first starts with an explosive burst at the opening of the glottis; the second is a period of noisy sound with a slow decay of the noise as flow reduces due to glottal closure; a transient forms the third phase [4]. These sounds provide enough information to distinguish wet from dry cough, and ailment cough from voluntary cough [5]. The ability to characterize the cough sounds of unhealthy speakers should therefore be helpful in the diagnosis of COVID-19.

Recently, several cough and breath data sets have been published. Examples include 'Coughvid' [7], 'Breath for Science' [8], 'Coswara' [9], and 'CoughAgainstCovid' [10]. Multiple deep learning methods have been proposed on these data sets to detect COVID-19 from cough and breath sounds [11, 12, 13, 14, 15]. However, the limited availability of training data remains a challenge in developing such systems. In this paper, we experiment with different data augmentation methods, handcrafted features, small-footprint novel deep neural network architectures, and efficient model selection methods for evaluation.
The system is developed as part of the Interspeech 2021 special session and challenge on diagnosing COVID-19 using acoustics (DiCOVA). Our proposed method secured the 5th position on the leaderboard among 29 participants. This paper is organised as follows. In Section 2, we explain the shared data and give an overview of the challenge. In Section 3 and Section 4, we present the data augmentation methods and handcrafted features, respectively. In Section 5, we discuss the model training methods and deep neural network architectures. Results and analysis are presented in Section 6. Conclusions and future work are provided in Section 7.

The shared data set contains 1040 audio files stored in .FLAC format, along with train and validation lists for 5 folds. In each fold, the train set contains 822 audio files and the remaining 218 audio files form the validation set. Models are trained on all 5 folds without mixing audio across folds and are evaluated on the respective validation sets. Validation folds are used to select the best model across the folds. A separate blind test set containing 233 audio files is also shared for the final evaluation on the leaderboard. The shared data is part of the Coswara data set [9]. Note that only 75 of the shared audio files are COVID-19 positive, of which 50 are part of the train set and 25 are part of the validation set. We also experimented with the 'Coughvid' [7] data set and obtained results.

First, the raw cough signal is passed through an energy-threshold-based speech activity detector and then normalized between -1 and 1. We examined spectrograms of cough sounds from speakers suffering from COVID-19 and observed marked differences: in non-COVID cough sounds, a higher proportion of the signal energy lies in the higher frequency range compared to COVID-19 cough sounds, as can be seen in Fig. 3 and Fig. 4. We experiment with 39-dimensional mel frequency cepstral coefficients (MFCC) and 24-dimensional handcrafted features to capture the time-domain and frequency-domain variability between COVID-19 negative and positive cough sounds. The handcrafted features include per-frame energy of the signal (1D), fundamental frequency (1D), first four formants (4D), alpha ratio (1D) with cut-off frequency at 1400 Hz, relative average perturbation (1D), spectral flatness (1D), kurtosis (1D), spectral contrast (7D), second-order spectral polynomial (2D), spectral centroid (1D), spectral roll-off (1D), spectral bandwidth (1D), root mean square value of the signal (1D), and zero crossing rate of the signal (1D). These features are designed to capture source and vocal tract variabilities. Delta and double-delta coefficients of the feature vector are then computed to form a 189-dimensional (3 × 63) feature.

As noted above, only 75 cough audio files are COVID-19 positive, of which 50 are part of the train set in each fold. This makes the COVID-negative to COVID-positive ratio roughly 16:1 during training, which leads to suboptimal classifier performance. We incorporate a novel spectrum interpolation method to increase the proportion of COVID-19 cough sound samples in the train set. First, we obtain the 1024-point discrete Fourier transform (DFT) of each COVID-19 positive audio. Then, for each audio we obtain its 5 nearest neighbours based on the Euclidean distance between their DFTs. Finally, linear combinations of the DFT of each COVID-19 positive audio and its 5 nearest neighbours are used to augment the COVID-19 cough sounds. The linear combination coefficients are drawn from a uniform distribution between 0 and 1. In this way we obtain 5 augmented spectra for each COVID-19 positive audio. We also experiment with a single nearest-neighbour spectrum interpolation, which yields 1 augmented spectrum per COVID-19 positive audio. Results for both of these methods are presented.
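Below is a minimal sketch of this nearest-neighbour spectrum-interpolation augmentation, assuming NumPy waveforms. For brevity it operates on 1024-sample segments (so the DFT length equals the segment length), measures neighbour distance on DFT magnitudes, and mixes pairs with a convex combination whose coefficient is drawn uniformly from [0, 1]; these are simplifying assumptions, and the function and variable names are illustrative rather than the authors' exact implementation.

```python
import numpy as np

def spectrum_interpolation_augment(pos_waves, n_fft=1024, k=5, seed=0):
    """Augment positive-class cough segments by interpolating their spectra.

    pos_waves: iterable of 1-D waveforms (truncated/zero-padded to n_fft samples).
    Returns k augmented waveforms per input waveform.
    """
    rng = np.random.default_rng(seed)
    # 1024-point DFT of every positive segment (kept complex: magnitude + phase).
    specs = np.stack([np.fft.fft(w, n=n_fft) for w in pos_waves])

    # Pairwise Euclidean distances between DFT magnitudes.
    mags = np.abs(specs)
    dists = np.linalg.norm(mags[:, None, :] - mags[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)              # exclude self-matches

    augmented = []
    for i in range(len(specs)):
        neighbours = np.argsort(dists[i])[:k]    # k nearest positive segments
        for j in neighbours:
            alpha = rng.uniform(0.0, 1.0)        # mixing coefficient in [0, 1]
            mixed = alpha * specs[i] + (1.0 - alpha) * specs[j]
            # Back to the time domain via the inverse DFT (real part).
            augmented.append(np.real(np.fft.ifft(mixed, n=n_fft)))
    return augmented
```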
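Returning to the frame-level front end described earlier, the following sketch extracts the descriptors that librosa exposes directly (39 MFCCs, spectral contrast, centroid, roll-off, bandwidth, flatness, RMS, zero-crossing rate, and an F0 track) and stacks them with their deltas and double-deltas. Formants, alpha ratio, relative average perturbation, kurtosis and the spectral-polynomial terms are omitted because librosa does not provide them directly; the default hop and window sizes are assumptions, so this is not the exact 63-dimensional configuration reported above.

```python
import numpy as np
import librosa

def frame_features(y, sr=44100):
    """Per-frame features plus delta and double-delta expansion."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=39)            # (39, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)      # (7, T)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)      # (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)        # (1, T)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)    # (1, T)
    flatness = librosa.feature.spectral_flatness(y=y)             # (1, T)
    rms = librosa.feature.rms(y=y)                                # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)                   # (1, T)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)[None, :]        # (1, T)

    feats = [mfcc, contrast, centroid, rolloff, bandwidth, flatness, rms, zcr, f0]
    T = min(f.shape[1] for f in feats)        # guard against off-by-one frame counts
    base = np.vstack([f[:, :T] for f in feats])
    # Delta and double-delta expansion -> 3x the base dimensionality.
    return np.vstack([base,
                      librosa.feature.delta(base),
                      librosa.feature.delta(base, order=2)])
```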
Additive-noise-based augmentation has proven beneficial for automatic speech recognition [16]. Noise-based augmentation improves performance when the acoustic environment (background noise) differs between training and test time, which will be the case if the system is used for diagnosis of COVID-19. We explore this method to augment the cough data set. The basic process of noisy training for deep neural networks is as follows: sample noise signals from real-world recordings and mix them with the original training data. Our noise set contains 1130 noise samples from kitchen, digital appliances, babble, music, traffic and other backgrounds. The signal-to-noise ratio (SNR) was drawn uniformly between 5 and 20 dB. Noise-based augmentation was performed on the entire data set.

Vocal Tract Length Perturbation (VTLP) is used in speech recognition to add speaker-to-speaker variability that results primarily from differences in vocal tract length [17]. In this paper, we experiment with VTLP for cough sound augmentation. The warp factor was randomly selected between 0.85 and 1.15.

We experiment with standard machine learning algorithms such as support vector machine (SVM) and random forest classifiers, and with deep neural network architectures such as LSTM, CNN and ResNet. We vary the number of layers, the sequence length, and the class weights in the cross entropy (CE) loss. We train LSTM-based models in four different configurations. The first type is a regular uni-directional LSTM, termed "uni". The second type is a bi-directional LSTM, termed "bidir". In the model type termed "seq-to-concat", we concatenate the output vectors of all elements in a sequence to form a super-vector. For example, if the output dimension of the LSTM layer is n and the sequence length is seqL, the dimension of the super-vector is n * seqL. This super-vector is then passed through the final fully connected layer to predict class probabilities. The "seq-to-concat" model thus processes seqL frames before making a prediction. In the "seq-to-last1" model, only the output corresponding to the last timestep of a sequence is passed on, so the model makes its prediction after processing all frames in the sequence; here the final fully connected layer takes an n-dimensional vector as input. Note that both of these model types, "seq-to-concat" and "seq-to-last1", make segment-level predictions. All LSTM models use 20 frames of initial context before processing the seqL frames of each sequence. The initial context frames set the hidden state of the LSTM layer before it processes the actual inputs for which the loss is back-propagated.

In CNN-based models, we use 1-D convolution layers with filter length 5. The feature vector of each frame is reshaped into 63x3 before being passed to the CNN, i.e., the input is a 63-dimensional vector with 3 channels. Each convolution layer is followed by batch normalization and a ReLU activation. We also use residual networks [18], namely ResNet-18 and ResNet-34, in our experimentation. To reduce the number of learnable weights, we decrease the filter size to 3 and use 32 output channels for every convolution layer of the default residual network configuration.
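As a concrete illustration of the additive-noise augmentation described above, the sketch below scales a noise clip to a target SNR drawn uniformly from 5-20 dB and adds it to a cough waveform. The noise source, resampling, and file handling are omitted, and the names are illustrative.

```python
import numpy as np

def mix_at_random_snr(clean, noise, snr_range=(5.0, 20.0), rng=None):
    """Add a noise segment to a cough waveform at a random SNR (in dB)."""
    rng = rng or np.random.default_rng()
    # Tile or crop the noise so it covers the whole cough recording.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    snr_db = rng.uniform(*snr_range)                 # uniform in [5, 20] dB
    p_clean = np.mean(clean ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale so that 10*log10(p_clean / p_scaled_noise) equals snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```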
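To make the LSTM variants concrete, here is a minimal PyTorch sketch of how we read the "seq-to-concat" configuration: the first 20 context frames only warm up the hidden state, the outputs of the next seqL frames are concatenated into an n * seqL super-vector, and one fully connected layer produces the two class logits. The layer sizes and class name are illustrative, not the exact configurations reported in Table 1.

```python
import torch
import torch.nn as nn

class SeqToConcatLSTM(nn.Module):
    """Segment-level classifier: concatenate LSTM outputs over seqL frames."""
    def __init__(self, feat_dim=189, hidden_dim=64, seq_len=40, context=20):
        super().__init__()
        self.context = context
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim * seq_len, 2)   # super-vector -> 2 classes

    def forward(self, x):
        # x: (batch, context + seq_len, feat_dim)
        out, _ = self.lstm(x)                # hidden state warmed up on context frames
        out = out[:, self.context:, :]       # keep only the seq_len scored frames
        super_vec = out.reshape(out.size(0), -1)   # (batch, hidden_dim * seq_len)
        return self.fc(super_vec)            # segment-level logits

# example: a batch of 8 segments, 20 context + 40 scored frames of 189-dim features
logits = SeqToConcatLSTM()(torch.randn(8, 60, 189))
```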
In this challenge, our main task is to maximize the area under the curve (AUC). It is known that minimizing the CE loss does not necessarily maximize AUC. We therefore formulate an AUROC loss function. We pick one non-COVID cough sound at random and let Vn be its predicted value; similarly, we pick one COVID cough sound at random and let Vp be its predicted value [19, 20, 21]. The AUC is then proportional to the probability that the predicted values are in the right order, that is, Vn < Vp. This score is not differentiable because it does not make smooth transitions, so we apply a continuous approximation using the sigmoid function. Specifically, we compute a binary cross entropy (BCE) loss under the assumption that sigmoid(Vp − Vn) belongs to class 1, i.e., −log sigmoid(Vp − Vn). For stable model training, the final loss we use is given in Eq. 1. The model architecture we use with the AUROC loss has only one output node. From the data set, we pick one non-COVID cough sound sequentially and one COVID cough sound randomly. We pass both cough sounds through the model and average the outputs of the final fully connected layer for each cough individually, giving a 1-D output value for the COVID and the non-COVID cough. We then calculate the loss described in Eq. 1 and back-propagate the gradients to train the model.

Following [22], we use the variational lower bound (ELBO) of the joint VAE (JVAE) given in Eq. 2; to train the JVAE model, we minimize the negative of this ELBO. In the original JVAE formulation, all the conditional distributions are modelled by diagonal Gaussian distributions. We propose to model pθ(y|x, z) by a binomial distribution instead. Minimizing the negative of E_{qφ(z|x)}[log pθ(y|x, z)] in the ELBO (Eq. 2) then reduces this term to a binary cross entropy (BCE) loss. In practice, the actual loss used to train the JVAE network is given in Eq. 3. We choose the values of the hyper-parameters in Eq. 3 following [22]: λ1 is taken as 1, λ2 as 10 and λ3 as 0.1.
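The following is a sketch of the pairwise AUROC surrogate described above: for a (non-COVID, COVID) pair with scalar scores Vn and Vp, the BCE loss with sigmoid(Vp − Vn) assigned to class 1 is −log sigmoid(Vp − Vn). The exact stabilised loss of Eq. 1 is not reproduced in the available text, so any additional stabilising terms are omitted here; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def auroc_pair_loss(v_pos, v_neg):
    """Pairwise AUC surrogate: BCE on sigmoid(Vp - Vn) with target class 1,
    i.e. -log(sigmoid(Vp - Vn)). v_pos and v_neg are the 1-D averaged outputs
    of the single-node model for a COVID and a non-COVID cough, respectively."""
    diff = v_pos - v_neg
    target = torch.ones_like(diff)
    # Numerically stable form of -log(sigmoid(diff)).
    return F.binary_cross_entropy_with_logits(diff, target)

# example with hypothetical scores for one (covid, non-covid) pair
loss = auroc_pair_loss(torch.tensor([0.7]), torch.tensor([0.2]))
loss_value = loss.item()   # ≈ 0.4741 = -log(sigmoid(0.5))
```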
Patient-level receiver operating characteristic (ROC) curves on the validation set are presented in Fig. 1 and Fig. 2. It can be observed that the proposed data augmentation methods and architectures give a significant improvement in the area under the curve (AUC). Detailed results are shown in Table 1. The first column gives the type of model used. The "fold" column indicates the dataset fold on which the model is trained and validated; for all experiments, we train our classifier on each of the 5 folds and then choose the best model. The hidden dimension of the layer before the final fully connected layer is given in the "hidden dim" column. The variation in model architecture type is given in the "type" column; details of these variations are discussed in Section 5. In the CE loss, the value in the "wtPos" column is multiplied with the loss corresponding to COVID-positive cough data; this is done to counter the imbalance of COVID vs. non-COVID cough sounds in the dataset. We normalize the input feature vector with the mean before passing it through the classifier during both training and testing. The "utt-wise" token in the "norm" column means the mean is calculated per utterance. We also calculate the mean of the whole dataset beforehand, termed the global mean; the "global" token in the "norm" column specifies that this global-mean vector is used for feature normalization. This is done to ensure that the handcrafted features do not vanish for any input cough sound, and that the inter-frame variations in the handcrafted features and their absolute values remain expressive. The sequence length parameter of the LSTM models is reported in the "seq length" column; we experiment with sequence lengths of 20, 30, 40 and 50. For CNN and residual networks, batches are not prepared sequentially, so we report the number of input feature channels in the "inCh" column. The Adam optimizer in the PyTorch toolkit offers weight regularization in the form of weight decay. We use weight decay with a factor of 10^-3 for training some models; the rest are trained without weight decay, indicated as 0 in the "weight decay" column. We report the AUC of the classifier models in the "AUC" column, and the specificity at > 80% sensitivity in the "specificity" column of Table 1.

In this paper we explored various deep learning architectures, data augmentation techniques, and feature extraction methods to improve cough-sound-based COVID-19 detection. Our proposed system gives a 14% absolute improvement in area under the curve (AUC) compared to the baseline system released as part of the challenge. Future work includes experiments with attention-based architectures on raw audio. It will also be interesting to explore the developed method for other ailments with cough symptoms.

Identification of Risk Factors and Symptoms of COVID-19: Analysis of Biomedical Literature and Social Media Data
Preliminary analysis of cough sounds
IIIT-S CSSD: A Cough Speech Sounds Database
Acoustic analysis of cough
Discriminant feature vectors for characterizing ailment cough vs. simulated cough
A Framework for Biomarkers of COVID-19 Based on Coordination of Speech Production Subsystems
The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
Coswara - A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis
Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data
The voice of COVID-19: Acoustic correlates of infection
COVID-19 Patient Detection from Telephone Quality Speech Data
Artificial Intelligence Diagnosis using only Cough Recordings
SARS-CoV-2 Detection From Voice
AI4COVID-19: AI Enabled Preliminary Diagnosis for COVID-19 from Cough Samples via an App
Noisy training for deep neural networks in speech recognition
Vocal Tract Length Perturbation (VTLP) improves speech recognition
Deep residual learning for image recognition
Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic
Joint distribution learning in the framework of variational autoencoders for far-field speech enhancement