key: cord-0611149-mjv6gi3h authors: Kamble, Madhu R.; Patino, Jose; Zuluaga, Maria A.; Todisco, Massimiliano title: Exploring auditory acoustic features for the diagnosis of the Covid-19 date: 2022-01-22 journal: nan DOI: nan sha: b23db0f5a10490006329c42759f400484886f0da doc_id: 611149 cord_uid: mjv6gi3h

The current outbreak of a coronavirus has quickly escalated into a serious global problem that has been declared a Public Health Emergency of International Concern by the World Health Organization. Infectious diseases know no borders, so when it comes to controlling outbreaks, timing is essential: threats must be detected as early as possible, before they spread. After a first successful DiCOVA challenge, the organisers released a second DiCOVA challenge with the aim of diagnosing COVID-19 through the use of breath, cough and speech audio samples. This work presents the details of our automatic system for COVID-19 detection using breath, cough and speech recordings. We developed different front-end auditory acoustic features combined with a bidirectional Long Short-Term Memory (bi-LSTM) classifier. The results are promising and demonstrate the highly complementary behaviour of the auditory acoustic features across the Breathing, Cough and Speech tracks, giving an AUC of 86.60% on the test set.

Coronavirus disease, so-called COVID-19, is an infectious disease caused by the recently discovered coronavirus SARS-CoV-2. This disease has spread rapidly worldwide over the past year, causing a global crisis with serious health, social and economic consequences. To put an end to this pandemic, various initiatives are being carried out worldwide, including the development of new systems for rapid diagnosis of the disease. Recently, the DiCOVA 2021 Challenge [1] was organised to promote research into the development of systems for the detection of COVID-19 from recordings of respiratory sounds. Several systems have been proposed to detect the COVID-19 signature within acoustic indicators [2, 3, 4, 5, 6, 7]. Only a few of them focused on the study of acoustic cues, with most placing the emphasis on classifiers. The study reported in [8] explores Autoregressive Predictive Coding (APC) to pre-train a unidirectional LSTM, together with spectral augmentation. In [9], the authors used the ComParE 2016 feature set and two classical machine learning models, namely Random Forests and Support Vector Machines (SVMs). The use of breathing patterns for the diagnosis of COVID-19 is studied in [10]. COVID-19 detection by means of Contextual Attention Convolutional Neural Networks and gender information is studied in [11].

The first DiCOVA challenge consisted of two tracks: Track-1 focused on diagnosing COVID-19 using cough sounds, while Track-2 focused on a collection of breathing, sustained vowel phonation and number-counting speech recordings. As a follow-up to the first successful DiCOVA 2021 challenge, the second DiCOVA challenge has been organised [12]. The second challenge comprises four different tracks, namely breathing, cough, speech and fusion. The organisers provided a baseline system for the second challenge based on a log Mel spectrogram front-end and a bidirectional Long Short-Term Memory (bi-LSTM) back-end. In this paper, we describe and propose a system for automatic COVID-19 detection evaluated on all four tracks of the second DiCOVA challenge.

The first author is supported by the RESPECT project funded by the French Agence Nationale de la Recherche (ANR).
Our system focuses more on features than on classifiers, using four perceptually-motivated acoustic features at the front-end. The features we explored are Teager energy operator cepstral coefficients (TECCs), Instantaneous Amplitude Cepstral Coefficients (IACCs), Constant Q Cepstral Coefficients (CQCCs) and Filterbank Constant Q Transform (FBCQT) features [13, 14, 15], along with the bi-LSTM classifier. The remainder of this paper is organised as follows. Section 2 presents the technical details of the auditory acoustic features used for the detection of COVID-19. Section 3 describes the second DiCOVA challenge database. The experimental setup and the results are presented in Section 4 and Section 5, respectively. Finally, the main conclusions of this work and future research lines are drawn in Section 6.

In this section we discuss the acoustic features used to diagnose COVID-19 from breath, cough and speech. Studies have shown that humans do not perceive frequencies on a linear scale: we are better at detecting differences at lower frequencies than at higher frequencies [16]. The Mel spectrogram is obtained by computing a short-time Fourier transform (STFT) for each frame, mapping the resulting energy/amplitude spectrum from the linear frequency scale to the logarithmic Mel scale, and then passing it through a filterbank to obtain a feature vector; these values can be roughly interpreted as the distribution of the signal energy along the Mel frequency scale.

The Teager Energy Operator (TEO), ψ{·}, tracks a running estimate of the instantaneous energy fluctuations of a narrowband signal. For the i-th bandpass-filtered signal x_i[n], it is given as [17, 18, 19]:

ψ{x_i[n]} = x_i^2[n] − x_i[n−1] x_i[n+1],

where x_i[n] is the i-th bandpass-filtered signal. The Teager energy profile obtained from each bandpass filter is then passed to the Energy Separation Algorithm (ESA) to isolate the Instantaneous Amplitude (IA), a_i[n], and the Instantaneous Frequency (IF), Ω_i[n] [17, 18, 19]. The block diagram of the Teager Energy Cepstral Coefficients (TECC) and Energy Separation Algorithm Instantaneous Amplitude Cepstral Coefficients (ESA-IACC) feature sets is shown in Figure 1. The TECC feature set is computed as in our earlier studies [13, 20] and the ESA-IACC feature set according to [14, 21].

The constant Q transform (CQT) is a perceptually motivated approach to time-frequency analysis introduced by Youngberg and Boll [22] in 1978 and refined over the last few decades by Brown [23]. In contrast to Fourier-based approaches, the CQT gives a greater frequency resolution at lower frequencies and a greater temporal resolution at higher frequencies, which emulates human auditory perception. The CQT of a discrete signal x(n) is defined by:

X^{CQ}(k, n) = Σ_{j = n − ⌊N_k/2⌋}^{n + ⌊N_k/2⌋} x(j) a_k^*(j − n + N_k/2),

where k = 1, 2, ..., K is the frequency bin index, a_k(n) are the basis functions, * denotes the complex conjugate and N_k is a variable window length. The centre frequencies f_k are defined according to f_k = 2^{(k−1)/B} f_1, where f_k is the centre frequency of bin k, f_1 is the centre frequency of the lowest-frequency bin and B is the number of bins per octave. The filter selectivity Q, which reflects the ratio between centre frequency and bandwidth, is constant and defined as:

Q = f_k / (f_{k+1} − f_k) = (2^{1/B} − 1)^{−1}.

In practice, B determines the time-frequency resolution trade-off. Similarly to the log Mel spectrogram (MelSPEC), FBCQT is calculated by filtering X^{CQ}(k, n) with a filterbank composed of n_fb triangular filters equally spaced along the linear scale, and then taking the logarithm of the energy in each band.
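As a concrete illustration of the TEO-based front-end described above, the sketch below computes TECC-style coefficients from subband signals that have already been bandpass filtered (e.g. by a Mel-spaced Gabor filterbank): the discrete TEO is applied per subband, followed by frame-wise averaging, a logarithm and a DCT. This is a minimal sketch rather than the exact pipeline of [13]; the frame length, hop size, edge handling and flooring constant are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct


def teager_energy(x):
    """Discrete Teager energy operator: psi{x}[n] = x[n]^2 - x[n-1] * x[n+1]."""
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]   # replicate edge samples (assumption)
    return np.abs(psi)                  # keep energies non-negative before the log


def tecc_like(subband_signals, sr, frame_ms=25, hop_ms=10, n_ceps=40):
    """TECC-style cepstra: per-subband Teager energy -> frame averaging -> log -> DCT.

    subband_signals: iterable of bandpass-filtered signals, one per filter.
    Returns an array of shape (n_frames, n_ceps).
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = np.stack([teager_energy(np.asarray(x, dtype=float))
                         for x in subband_signals])        # (n_bands, n_samples)
    n_frames = 1 + (energies.shape[1] - frame) // hop
    log_e = np.zeros((n_frames, energies.shape[0]))
    for t in range(n_frames):
        seg = energies[:, t * hop: t * hop + frame]
        log_e[t] = np.log(seg.mean(axis=1) + 1e-12)        # log mean Teager energy per band
    return dct(log_e, type=2, norm='ortho', axis=1)[:, :n_ceps]
```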
Constant Q cepstral coefficients (CQCCs) were introduced recently and applied successfully in the context of fake audio detection [15]. CQCC features are based on a combination of the constant Q transform (CQT) and cepstral analysis. The cepstral coefficients are calculated from the constant Q transform, which implies a resampling of the frequency axis from the geometric scale to a linear scale, according to:

CQCC(p) = Σ_{l=1}^{L} log |X̄^{CQ}(l)|^2 cos[ p (l − 1/2) π / L ],

where X̄^{CQ} is the linearised CQT-derived spectrum, l is the linear-scale index and p = 0, ..., L − 1. The full CQCC extraction algorithm is described in [15].

After the first successful DiCOVA Challenge, the organisers released the second DiCOVA challenge focusing on four different tracks, namely breathing, cough, speech and fusion [12]. The training/validation set for all the tracks contains 965 audio files stored in .FLAC format at a 44.1 kHz sampling frequency. Each audio file corresponds to a single subject. This set comprises recordings from 172 COVID-19 positive subjects and 793 COVID-19 negative subjects. Gender and age of the subjects are also provided as additional metadata. Validation is performed using 5-fold cross-validation over the training lists. The test set consists of 471 audio files with the same format as the training/validation set, but with the COVID-19 status hidden from the participants. The details of the respective tracks are reported below:

Track 1: Breathing - The goal of this track is to analyse the breathing signals from COVID-19 positive and negative subjects and exploit the key differences that can contribute towards the detection of the disease. In total, the samples provided by the organisers include the data of 965 subjects, further split into training and validation sets. The dataset contains lists corresponding to a 5-fold cross-validation split.

Track 2: Cough - The goal of this track is to use cough sound recordings from COVID-19 positive and negative subjects. The training/validation set is composed of cough audio data from 965 subjects. The dataset also contains lists corresponding to a 5-fold cross-validation split.

Track 3: Speech - Similar to Track 1, this track aims to detect COVID-19 using the speech signals from positive and negative subjects. The data distribution and the 5-fold cross-validation are the same as in the previous tracks.

Track 4: Fusion - This track allows the use of all three recording types (breathing, cough and speech) from each subject.

The organisers of the challenge provide a baseline system for the four tracks based on the log Mel spectrogram. The back-end classifier used is a bidirectional Long Short-Term Memory (bi-LSTM) network. Classification performance is evaluated using traditional detection metrics, namely the true positive rate (TPR) and the false positive rate (FPR) over a range of decision thresholds. From these metrics, the probability scores for each audio file are used to compute the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) metric quantifies the model performance [24].

We performed the experiments on the second DiCOVA Challenge database. The five acoustic features discussed in Section 2 have been used along with a classifier that cascades two bidirectional long short-term memory (bi-LSTM) layers and a fully connected neural network in an encoder-decoder style. The encoder consists of two bi-LSTM layers with 128 units in both the forward and backward directions. This is followed by a fully connected network comprising 256 nodes in the first layer and 64 nodes with a tanh(·) non-linearity in the second layer. Finally, a single-node output passed through a sigmoid non-linearity gives the COVID-19 probability score for the input feature matrix.
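A PyTorch sketch of a classifier matching this description is given below. It is a plausible reconstruction, not the organisers' baseline or our exact training code; the mean pooling over time, the activation after the 256-node layer and the class name are assumptions not specified in the text.

```python
import torch
import torch.nn as nn


class CovidBiLSTM(nn.Module):
    """Encoder-decoder style classifier: two bi-LSTM layers (128 units per direction)
    followed by a 256 -> 64 (tanh) -> 1 (sigmoid) fully connected head."""

    def __init__(self, feat_dim):
        super().__init__()
        self.encoder = nn.LSTM(input_size=feat_dim, hidden_size=128, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(2 * 128, 256), nn.ReLU(),   # activation here is an assumption
            nn.Linear(256, 64), nn.Tanh(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):            # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)       # (batch, time, 2 * 128)
        h = h.mean(dim=1)            # temporal pooling (assumption)
        return self.decoder(h)       # COVID-19 positive probability per recording


# Hypothetical usage with 120-D TECC features and 300 frames per recording:
# model = CovidBiLSTM(feat_dim=120)
# prob = model(torch.randn(8, 300, 120))   # shape (8, 1)
```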
The parameters used to extract the acoustic features are detailed hereafter.

MelSPEC: The MelSPEC feature set was extracted, as for the baseline system, as a 64-dimensional log Mel spectrogram with ∆ and ∆∆ coefficients, resulting in a 192-D feature vector.

TECC: The TECC feature set was extracted using a 40-filter Mel-spaced Gabor filterbank with f_min = 10 Hz and f_max = f_s/2 [13]. From the subband-filtered signals we obtain 40-D static features, appended with their ∆ and ∆∆ coefficients, resulting in a 120-D feature vector.

IACC: The ESA-IACC feature set was extracted using the same parameters as the TECC feature set, except for the frequency scale of the Gabor filterbank, which here is linearly spaced. In addition, the ESA-IACC feature set is computed with a pre-processing step and cepstral mean normalisation (CMN) for the COVID-19 classification task.

CQCC: The CQCC features are extracted with a maximum frequency of F_max = F_NYQ, where F_NYQ is the Nyquist frequency of the 44.1 kHz-sampled audio. The minimum frequency is set to F_min = F_max/2^9 ≈ 43 Hz (9 being the number of octaves). The number of bins per octave B is set to 96. Only 20 static coefficients (with log-energy) were considered, resulting in a 60-dimensional (D) feature vector (including 20 ∆ and 20 ∆∆ coefficients).

FBCQT: The Filterbank CQT feature set was extracted as a 63-dimensional log linearised CQT with a maximum frequency of F_max = F_NYQ/2 and a minimum frequency of F_min = F_max/2^10 ≈ 43 Hz, with ∆ and ∆∆ coefficients resulting in a 189-D feature vector.

The results in terms of AUC obtained on the validation folds are reported in Table 1 for the Breathing and Cough tracks and in Table 2 for the Speech and Fusion tracks. For each fold, the classifier is trained on the training data and evaluated on the validation data. The average validation AUC denotes the average over the AUCs of the 5 folds. The acoustic features considered have their own strengths and weaknesses, and therefore the AUC for some folds and tracks is better than for others. For all the tracks of the validation set, FBCQT gave a higher AUC than the other features. In particular, for the breathing track FBCQT gave an average AUC of 80.52%, and for the Cough, Speech and Fusion tracks it yielded average AUCs of 79.60%, 81.04% and 84.18%, respectively. The IACC and FBCQT features obtained the same AUC of 81.04% for the speech track.

We now focus on the results obtained on the blind test set that we submitted to the challenge; for evaluation on the test set, the COVID-19 positive likelihood score produced for each file was submitted to the challenge. Last but not least, we also report fusion experiments to understand the complementary information present in each acoustic feature under investigation. Systems were selected each time by adding the next worst in terms of AUC according to the Fusion track results reported in Table 2. Fusion results are shown in Table 4. Unexpectedly, the best combination is the MelSPEC + IACC + FBCQT feature set, giving an AUC of 86.60%. All the combinations outperform the single system based on MelSPEC, except IACC + FBCQT.
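The exact fusion rule is not stated in the text; a simple and common choice, used here purely for illustration, is to average the per-system COVID-19 probability scores before computing the AUC. The function and variable names below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def fuse_and_score(system_scores, labels):
    """Score-level fusion by averaging per-system probabilities, evaluated with AUC.

    system_scores: dict mapping a system name to an array of per-file scores,
                   all aligned to the same file list as `labels` (0/1 ground truth).
    """
    fused = np.mean(np.stack(list(system_scores.values())), axis=0)
    return roc_auc_score(labels, fused)


# Hypothetical usage for the MelSPEC + IACC + FBCQT combination:
# auc = fuse_and_score({"MelSPEC": s_mel, "IACC": s_iacc, "FBCQT": s_fbcqt}, y_true)
```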
This paper reports on the exploration of acoustic cues using different auditory-based features for the diagnosis of COVID-19. In particular, the systems presented are based on five different acoustic features built on the Mel frequency scale, the Teager energy operator, speech demodulation and the constant Q transform. For a proper comparison of these features, we use the same back-end consisting of a bi-LSTM network. The FBCQT system outperforms all the other proposed systems on the validation set across all the tracks; however, on the test set it demonstrated a low capacity for generalisation, with limited accuracy. Fusion experiments showed that the features considered are highly complementary. The best fusion gave an AUC of 86.60% on the test set for the MelSPEC + IACC + FBCQT feature combination, which places our result between the challenge baseline system (84.70%) and the challenge winner system (88.44%).

References
[1] DiCOVA Challenge: Dataset, Task, and Baseline System for COVID-19 Diagnosis Using Acoustics.
[2] PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge.
[3] Diagnosis of COVID-19 Using Auditory Acoustic Cues.
[4] Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis.
[5] Investigating Feature Selection and Explainability for COVID-19 Diagnostics from Cough Sounds.
[6] COVID-19 Detection from Spectral Features on the DiCOVA Dataset.
[7] Recognising Covid-19 from Coughing Using Ensembles of SVMs and LSTMs with Handcrafted and Deep Audio Features.
[8] Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation.
[9] Detecting COVID-19 from Audio Recording of Coughs Using Random Forests and Support Vector Machines.
[10] The DiCOVA 2021 Challenge - An Encoder-Decoder Approach for COVID-19 Recognition from Coughing Audio.
[11] Cough-Based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information.
[12] The Second DiCOVA Challenge: Dataset, Task, and Baseline System for COVID-19 Diagnosis Using Acoustics, submitted to IEEE Intl. Conference on Acoustics, Speech and Signal Processing.
[13] Analysis of Reverberation via Teager Energy Features for Replay Spoof Speech Detection.
[14] Effectiveness of Speech Demodulation-Based Features for Replay Detection.
[15] Constant Q Cepstral Coefficients: A Spoofing Countermeasure for Automatic Speaker Verification.
[16] Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences.
[17] On Separating Amplitude from Frequency Modulations Using Energy Operators.
[18] On Amplitude and Frequency Demodulation Using Energy Operators.
[19] Discrete-Time Speech Signal Processing: Principles and Practice.
[20] Detection of Replay Spoof Speech Using Teager Energy Feature Cues.
[21] Amplitude and Frequency Modulation-Based Features for Detection of Replay Spoof Speech.
[22] Constant-Q Signal Analysis and Synthesis.
[23] Calculation of a Constant Q Spectral Transform.
[24] An Introduction to ROC Analysis.