EIHW-MTG: Second DiCOVA Challenge System Report

Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller

2021-10-18

This work presents an outer product-based approach to fuse the embedded representations generated from the spectrograms of cough, breath, and speech samples for the automatic detection of COVID-19. To extract deep learnt representations from the spectrograms, we compare the performance of a CNN trained from scratch and a ResNet18 architecture fine-tuned for the task at hand. Furthermore, we investigate whether the patients' sex and the use of contextual attention mechanisms are beneficial. Our experiments use the dataset released as part of the Second Diagnosing COVID-19 using Acoustics (DiCOVA) Challenge. The results suggest the suitability of fusing breath and speech information to detect COVID-19. An Area Under the Curve (AUC) of 84.06 % is obtained on the test partition when using a CNN trained from scratch with contextual attention mechanisms. When using the ResNet18 architecture for feature extraction, the baseline model scores the highest performance with an AUC of 84.26 %.

Digital health systems powered by Artificial Intelligence (AI) have the potential to revolutionise health care systems worldwide by improving the early diagnosis of diseases and the monitoring of patients towards personalised treatment plans. Previous works in the literature explored the use of AI-based techniques in a wide range of medical problems, including the detection of coughs or sneezes [1], the analysis of breath signals [2], and the recognition of mental illnesses, such as depression [3, 4] or Post-Traumatic Stress Disorder (PTSD) [5]. Such technologies do not aim to replace medical diagnostic tools, but rather to provide highly scalable, cost-effective pre-screening solutions that optimise medical resources. In the current pandemic context caused by the outbreak of the Coronavirus Disease 2019 (COVID-19), we envision the use of new technologies to help monitor the spread of this virus. As the COVID-19 symptomatology affects the human respiratory system, it is reasonable to expect respiratory-related sounds to contain salient information for the detection of this disease. Hence, there is an opportunity to develop new, digital solutions exploiting respiratory sounds to detect patients with COVID-19.

This work focuses on the automatic detection of patients with COVID-19 in the context of the Second Diagnosing COVID-19 using Acoustics (DiCOVA) Challenge [6, 7]. We use the spectrogram representation of cough, breath, and speech samples to train neural networks composed of two main blocks: the first block extracts embedded representations from the spectrograms, while the second block performs the actual classification. The embedded representations from the different sound types are extracted with dedicated Convolutional Neural Networks (CNNs). We explore the use of an outer product-based approach to fuse the extracted representations with the goal of enriching the information available for the final classification. Additionally, we investigate whether using the patients' sex as a priori information and introducing contextual attention mechanisms into the network can be beneficial for the task at hand.
In this work, we use the dataset released as part of the Second DiCOVA Challenge [6, 7]. This dataset contains acoustic samples of COVID-19 positive and negative (healthy) patients from three different sound types produced by the human respiratory system: coughs, breaths, and speech. Although the sampling rate of the acoustic samples provided is 44.1 kHz, an initial exploration of the dataset revealed the existence of samples without frequency content in the upper frequencies of the spectrogram. This observation suggests that some audio samples were originally recorded at a different, lower sampling rate, and upsampled before distribution. This is a plausible hypothesis given the nature of the dataset, which was recorded in the wild, via crowdsourcing, using the patients' own devices. The available samples are distributed in two partitions, and the Challenge organisers require assessing the performance of the models on the training partition using a pre-defined 5-fold cross-validation approach. Each patient recorded a cough, a breath, and a speech sample. The total duration of the dataset is 14 h 45 min 23 sec (cf. Table 1). The dataset contains information from a total of 1 436 patients (cf. Table 2): 965 belonging to the training partition, and 471 to the test partition. The training data is imbalanced both in terms of sex (242 females and 723 males) and COVID-19 status (172 positives and 793 negatives). Similarly, the test data is also imbalanced in terms of sex (119 females and 352 males), whilst the COVID-19 status distribution is not disclosed to the Challenge participants.

The respiratory sounds are first downsampled to 16 kHz to overcome the disparity between recording devices, preventing our networks from detecting COVID-19 based on the presence or absence of frequency content in the upper frequencies of the spectrogram (cf. Section 2.1). This work focuses on fusing the information embedded in the different sounds recorded by each patient. As each sound has a different duration, we determine the longest recording from each patient and use this information to extend the shorter recordings via repetition, so that all samples from a patient have the same duration. Next, we window each respiratory sound separately into frames of 5 sec length with a 50 % overlap. We compute the magnitude of the Short-Time Fourier Transform (STFT) of each individual frame using a window length of 4096 samples (256 ms) and a hop size of 128 samples (8 ms) to obtain its spectrogram representation. The spectrograms are generated using a logarithmic frequency scale and the magma colour map. Once normalised, each spectrogram is stored on disk as a colour image of 224 × 224 pixels. The generated spectrograms from each sound type are standardised before being fed into the models for training. The standardisation parameters (µ and σ) are computed from all the spectrograms of the current sound type that belong to the training partition. To mitigate the effect of training the models with imbalanced COVID-19 data (cf. Table 2), we augment the generated spectrograms corresponding to the COVID-19 positive patients via replication to balance the training data. Although we considered other data augmentation strategies, such as filtering or additive noise, we decided not to alter the original samples in any way, as it is not yet clear which acoustic information is relevant for the task at hand. The replication approach may introduce redundancy in the training material; however, we believe it can still be useful in this case, as the number of positive and negative samples is significantly different.
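To make the pre-processing pipeline concrete, the following is a minimal Python sketch under the parameters stated above (16 kHz, 5 s frames with 50 % overlap, a 4096-sample window, a 128-sample hop, and log-frequency magma spectrograms saved as 224 × 224 images). The use of librosa and matplotlib, the function names, and the output file layout are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of the spectrogram pre-processing described above (assumed implementation details).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

SR = 16000                   # target sampling rate after downsampling
FRAME_LEN = 5 * SR           # 5 s frames
HOP = FRAME_LEN // 2         # 50 % overlap between frames
N_FFT, STFT_HOP = 4096, 128  # 256 ms window, 8 ms hop at 16 kHz

def load_and_extend(path, target_len):
    """Load a recording at 16 kHz and extend it via repetition to target_len samples."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    reps = int(np.ceil(target_len / len(y)))
    return np.tile(y, reps)[:target_len]

def frame_to_spectrogram_image(frame, out_path):
    """Render one 5 s frame as a log-frequency, magma-coloured spectrogram image (224 x 224)."""
    S = np.abs(librosa.stft(frame, n_fft=N_FFT, hop_length=STFT_HOP))
    S_db = librosa.amplitude_to_db(S, ref=np.max)
    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)   # 224 x 224 pixels
    ax = fig.add_axes([0, 0, 1, 1])
    ax.set_axis_off()
    librosa.display.specshow(S_db, sr=SR, hop_length=STFT_HOP,
                             y_axis='log', cmap='magma', ax=ax)
    fig.savefig(out_path)
    plt.close(fig)

def preprocess_patient(paths, out_dir):
    """paths maps a sound type ('cough', 'breath', 'speech') to one patient's wav file."""
    # Longest recording of this patient determines the common duration (loaded twice for brevity).
    target_len = max(len(librosa.load(p, sr=SR, mono=True)[0]) for p in paths.values())
    for sound_type, path in paths.items():
        y = load_and_extend(path, target_len)
        for i, start in enumerate(range(0, len(y) - FRAME_LEN + 1, HOP)):
            frame_to_spectrogram_image(y[start:start + FRAME_LEN],
                                       f"{out_dir}/{sound_type}_{i:03d}.png")
```

Per-sound-type standardisation and the replication-based balancing would then be applied to the resulting images before training.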
This section describes the network architectures implemented and investigated in this work. The networks are composed of two main blocks: the first block extracts deep learnt representations from the spectrograms of the cough, $f_C$, breath, $f_B$, and speech, $f_S$, samples, while the second block performs the actual classification.

For the feature extraction block, we compare two different architectures. The first architecture implements two convolutional layers with 16 and 4 filters, respectively, with a kernel size of 3 × 3 and a stride of 1. Following each convolutional layer, we use batch normalisation, and the output is transformed using a Rectified Linear Unit (ReLU) activation function. A 2-dimensional max pooling layer and a 2-dimensional adaptive average pooling layer are implemented at the end of the first and second convolutional block, respectively. This way, we force the output of the feature extraction block to produce 4 features per filter. The second architecture uses the ResNet18 architecture [8] without its last layer. Specifically, we use the pre-trained weights to initialise the network and fine-tune them during training for the task at hand. An additional linear layer is included in this architecture to reduce the dimensionality of the features from 512 to 16. The learnt features from both architectures have the same dimensionality and are finally flattened into a 1-dimensional representation.

The deep learnt representations from each sound type are extracted using a dedicated feature extraction block. In this work, we investigate the fusion of these embedded representations using an outer product-based approach, which can be mathematically defined as:

$$F = \begin{bmatrix} f_C \\ 1 \end{bmatrix} \otimes \begin{bmatrix} f_B \\ 1 \end{bmatrix} \otimes \begin{bmatrix} f_S \\ 1 \end{bmatrix}, \quad (1)$$

where each embedded representation is extended with a constant 1 before the outer product is computed. When the three sound types are fused together, the outer product generates a cube with the following properties: i) the original representations are preserved in the edges of the cube, ii) each face of the cube contains information from the fusion of 2 sound types, and iii) the inner part of the cube fuses the information from all three sound types. The fused representation is flattened before being fed into the final, classification block of the network. This fusion layer is implemented when training multi-type models, which combine at least two sound types, and omitted when training mono-type models, which consider a single sound type to infer the COVID-19 status.

The classification block of the network contains two fully connected layers, preceded by a dropout layer with probability 0.3. The number of input neurons in this block depends on the number of sound types selected for training. Nevertheless, the number of output neurons of the first layer is fixed to 8. The output of this first layer is transformed using a ReLU activation function. The transformed representation is finally fed into the second layer of this block, which contains two output neurons with a Softmax activation function. This way, the outputs of the network can be interpreted as probability scores.
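As an illustration of the fusion in Eq. (1), the following PyTorch sketch appends a constant 1 to each 16-dimensional embedding (the dimensionality stated above), computes the outer product cube, and flattens it for the classification block. The function name and batch size are illustrative; this is a sketch, not the authors' exact implementation.

```python
# Illustrative outer product-based fusion of three sound-type embeddings, following Eq. (1).
import torch

def outer_product_fusion(f_c, f_b, f_s):
    """Fuse (batch, 16) cough, breath, and speech embeddings into a flattened fusion cube."""
    def pad_one(f):
        # Append a constant 1 so the original representations survive on the cube's edges.
        return torch.cat([f, torch.ones(f.size(0), 1, device=f.device)], dim=1)
    f_c, f_b, f_s = pad_one(f_c), pad_one(f_b), pad_one(f_s)   # each (batch, 17)
    cube = torch.einsum('bi,bj,bk->bijk', f_c, f_b, f_s)       # (batch, 17, 17, 17)
    return cube.flatten(start_dim=1)                           # input to the classification block

# Usage with random embeddings of the stated dimensionality (16 per sound type).
f_c, f_b, f_s = (torch.randn(8, 16) for _ in range(3))
fused = outer_product_fusion(f_c, f_b, f_s)                    # shape: (8, 4913)
```

For two sound types, the same construction yields a matrix (one face of the cube) instead of the full cube.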
This model expands the baseline model described in Section 2.3.1 to consider the sex of the patients when inferring their COVID-19 status. Specifically, a binary-encoded representation of the patient's sex is fed into the second layer of the classification block of the network. The number of input features to the classification block depends on the number of sound types to be fused. Introducing the sex information in the first layer of this block would make it difficult to determine whether the performance of the network is conditioned by the patient's sex or by the number of input features. Thus, we opted for feeding this information into the second layer of the classification block, where the number of neurons corresponding to the sound representations is fixed.

This model also expands the baseline model described in Section 2.3.1, but, in this case, it uses a dedicated contextual attention mechanism at the output of each feature extraction block. The aim of this mechanism is to help highlight the salient information in the learnt embedded representations. Denoting the learnt embedded representations as $f_N$, where $N \in \{C, B, S\}$ depending on the input sound type, the contextual attention mechanism is mathematically defined as:

$$\alpha_N = \mathrm{softmax}\big(\tanh(W f_N + b) \odot u_c\big), \qquad \tilde{f}_N = \alpha_N \odot f_N,$$

where $\odot$ denotes the element-wise product, and $W$, $b$, and $u_c$ are parameters to be learnt by the network. The parameter $u_c$ can be interpreted as the context vector. The attention-based representation obtained, $\tilde{f}_N$, is then fed into the classification block of the network when training mono-type models, or fused when training multi-type models.
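A minimal PyTorch sketch of the contextual attention block is given below, following the element-wise formulation written above. Treating f_N as the flattened 16-dimensional representation, as well as the module and parameter names, are assumptions made for illustration, not necessarily the authors' exact design.

```python
# Sketch of a contextual attention block with a learnt context vector u_c.
import torch
import torch.nn as nn

class ContextualAttention(nn.Module):
    """Re-weights a learnt representation f_N using a trainable context vector u_c."""
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                  # W and b
        self.context = nn.Parameter(torch.randn(dim))    # context vector u_c

    def forward(self, f):
        e = torch.tanh(self.proj(f))                     # tanh(W f_N + b)
        alpha = torch.softmax(e * self.context, dim=-1)  # attention weights alpha_N
        return alpha * f                                 # attention-based representation

# Usage: one dedicated attention block per sound type, applied before fusion or classification.
attn_breath = ContextualAttention(dim=16)
f_tilde_b = attn_breath(torch.randn(8, 16))              # shape: (8, 16)
```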
For a fair comparison, all models are trained under the exact same conditions. We use the Categorical Cross-Entropy as the loss to minimise, and Adam as the optimiser with a fixed learning rate of $10^{-3}$. As model performances are assessed in terms of the Area Under the Curve (AUC), we define $\mathcal{L}_{\mathrm{AUC}} = 1 - \mathrm{AUC}$ as the validation loss to monitor during the training process. Network parameters are updated in batches of 64 samples, and the models are trained for a maximum of 100 epochs. We implement an early-stopping mechanism that stops training when the validation loss does not improve for 15 consecutive epochs. We follow the 5-fold cross-validation approach defined by the Challenge organisers to evaluate the models. As each fold is trained for a different number of epochs, when modelling all the training material we prevent overfitting by setting the number of training epochs to the mean number of epochs processed across the folds, rounded up to the next integer.

Table 3: AUC measurements (%) obtained from the mono- and multi-type models trained using a dedicated CNN-based network (Baseline). These models consider the patient's sex for the analysis (Sex), use contextual attention mechanisms (C. Att.), or their combination (Sex & C. Att.).

The results obtained using the specific CNNs and the ResNet18-based CNNs are summarised in Tables 3 and 4, respectively. One of the main insights from our experiments is that the fusion of breath and speech samples outperforms both the multi-type models combining the other sound types and the mono-type models in 3 out of the 4 scenarios investigated with the specific CNNs, and in 2 out of the 4 scenarios investigated with the ResNet18-based CNNs. Likewise, when we look at the mono-type models, we observe that the models using the breath and the speech samples score higher results than the models using coughs only. We also observe that considering the patients' sex only improves the performance of the cough-based mono-type models, while it barely has an effect on the breath-based models and negatively impacts the performance of the speech-based models.

Although there is no clear pattern to determine the suitability of considering the patients' sex and/or using contextual attention, we note that the models surpassing the baseline with the specific CNNs use one of the three variants in most of the cases. The contextual attention-based model fusing breath and speech samples obtains the best performance with an AUC of 84.06 %. With the ResNet18-based CNNs, the baseline models obtain the best AUC scores in most of the cases. The baseline model fusing breath and speech samples scores the best AUC of 84.26 %. Although the transfer learning approach obtains the best performance, the specific CNNs obtain similar results with a simpler structure. Further experiments are needed to better understand the impact of the patients' sex in the fused scenarios, as we hypothesise that its effect is diminished by the magnitude difference between the sex representation and the deep learnt features at the intermediate layer of the classification block.

[1] CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms.
[2] The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks.
[3] A Hierarchical Attention Network-Based Approach for Depression Detection from Transcribed Clinical Interviews.
[4] AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition.
[5] A Multimodal Approach for Predicting Changes in PTSD Symptom Severity.
[6] Coswara - A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis.
[7] The Second DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics.
[8] Deep Residual Learning for Image Recognition.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 826506 (sustAGE), and from the Spanish Ministry of Science and Innovation under the Musical AI project (PID2019-111403GB-I00).