Title: The EIHW-GLAM Deep Attentive Multi-model Fusion System for Cough-based COVID-19 Recognition in the DiCOVA 2021 Challenge
Authors: Zhao Ren, Yi Chang, Björn W. Schuller
Date: 2021-08-06

Abstract: Aiming to automatically detect COVID-19 from cough sounds, we propose a deep attentive multi-model fusion system evaluated on the Track-1 dataset of the DiCOVA 2021 challenge. Three kinds of representations are extracted, including hand-crafted features, image-from-audio-based deep representations, and audio-based deep representations. Afterwards, the best models on the three types of features are fused at both the feature level and the decision level. The experimental results demonstrate that the proposed attention-based fusion at the feature level achieves the best performance (AUC: 77.96%) on the test set, resulting in an 8.05% improvement over the official baseline.

To describe our system, the methodology is briefly introduced in Section 1.1, and the pre-processing procedure for the audio signals is explained in Section 1.2. Afterwards, the single-model feature extraction and classification methods are described in Section 1.3, and the fusion methods for multiple models are given in Section 1.4. Finally, the experimental results are presented and analysed in Section 1.5.

1.1 Methodology

Our proposed system takes three major types of representations as input: i) hand-crafted features, ii) high-level image-from-audio-based representations extracted by models pre-trained on natural image datasets, and iii) high-level audio-based representations extracted by models pre-trained on audio datasets. The extracted features are then classified, either by training a feed-forward deep neural network (DNN) on the hand-crafted features or by fine-tuning the transferred models on the high-level representations. The outputs of all classifiers are the probabilities of audio samples being COVID-19 positive. Finally, the multiple classifiers are assembled with a feature-level or decision-level fusion.

1.2 Pre-Processing

In our experiments, the officially provided audio recordings of Track-1 of the DiCOVA 2021 challenge [1], with a sampling rate of 44.1 kHz, are resampled to 16 kHz. To train the deep learning models, all audio signals are cut into shorter segments with a length of 57,600 frames. Notably, the log Mel spectrograms (cf. Figure 1) with a length of 224 time frames are generated using a window size of 512 and an overlap of 256. For each audio sample, the probabilities of all its segments are averaged to obtain the final prediction. Further, to make the input size consistent with the hyperparameters of the pre-trained audio-based and image-based models used for transfer learning (cf. Section 1.3.2), the number of Mel bins is set to 64 and 128, respectively.

Before the extracted features and log Mel spectrograms are fed into the deep learning models, mixup [2, 3] is employed to augment the training data. Specifically, an augmented example can be represented by x = α x_1 + (1 − α) x_2 and y = α y_1 + (1 − α) y_2, where (x_1, y_1) and (x_2, y_2) are two examples drawn from the original training data, and α is sampled from a Beta distribution. Both the original data and the mixup-augmented data are then processed by the deep learning models during training.
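For concreteness, the following is a minimal PyTorch sketch of how the mixup augmentation described above could be implemented; the function name and the Beta parameter value are illustrative and not taken from the original system.

```python
import numpy as np
import torch

def mixup(x1, y1, x2, y2, beta=0.2):
    """Mix two training examples: inputs and labels are combined with a
    coefficient alpha drawn from a Beta distribution, as described above."""
    alpha = float(np.random.beta(beta, beta))  # alpha ~ Beta(beta, beta); beta=0.2 is illustrative
    x = alpha * x1 + (1.0 - alpha) * x2        # mixed input (e.g., a log Mel spectrogram segment)
    y = alpha * y1 + (1.0 - alpha) * y2        # mixed (soft) label
    return x, y

# Illustrative usage on two spectrogram segments (64 Mel bins x 224 frames)
x1, x2 = torch.randn(1, 64, 224), torch.randn(1, 64, 224)
y1, y2 = torch.tensor(1.0), torch.tensor(0.0)  # COVID-19 positive / negative
x_mix, y_mix = mixup(x1, y1, x2, y2)
```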
1.3 Single-Model Feature Extraction and Classification

The extraction and classification of the hand-crafted features and of the transfer-learning-based representations are introduced in Section 1.3.1 and Section 1.3.2, respectively.

1.3.1 Hand-Crafted Features

Hand-crafted features have been widely used and have achieved good performance in audio/speech classification tasks, such as heart sound classification [4] and cold speech detection [5]. In this regard, three feature sets are used in our study: a 2,600-dimensional log Mel feature set, a 1,400-dimensional Mel Frequency Cepstral Coefficients (MFCC) feature set, and the 6,373-dimensional Computational Paralinguistics Challenge (ComParE) 2016 feature set [4]. For the log Mel and MFCC feature sets, 26 Mel bins and 14 coefficients are respectively calculated as the low-level descriptors (LLDs), and 100 functionals are then applied to these LLDs. All of these features are extracted with the open-source openSMILE toolbox [6]. To predict the COVID-19 class (negative/positive), the hand-crafted features are fed into a DNN model consisting of three linear layers with 1,024, 256, and 1 output neurons, respectively.

1.3.2 Transfer-Learning-Based Representations

To tackle small-scale datasets with deep learning topologies, transfer learning [7] has shown potential for extracting highly abstract representations using models pre-trained on large-scale datasets. Both image-based and audio-based pre-trained models are utilised to extract features from the Track-1 dataset of the DiCOVA 2021 challenge.

Image-based models. A large number of deep learning models have proven effective on large-scale natural image datasets such as ImageNet [8]. In this work, we choose VGG11 [9] and ResNet34 [10] to extract high-level representations from the log Mel spectrograms. In particular, the final two linear layers of VGG11 are replaced with three new linear layers with 1,024, 256, and 1 output neurons. As ResNet34 has only a single linear layer as its final layer, it is replaced with three new linear layers configured as in VGG11, and it is updated along with the final convolutional block.

Audio-based models. When transferring pre-trained image-based models to audio classification tasks, the data difference between images and audio waves might be a bottleneck for improving the performance [11]. Therefore, deep learning models trained on large-scale audio databases (e. g., AudioSet [12]) are also applied to extract representations from the log Mel spectrograms. We initialise our models with the parameters of the pre-trained CNN14 16k and ResNet38 [2], which have shown state-of-the-art performance on AudioSet. The first linear layer in both models is replaced with a trainable linear layer with 1,024 output neurons, followed by two further trainable linear layers with 256 and 1 output neurons.
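To illustrate the head replacement described above, the following is a minimal sketch using the torchvision ResNet34 as the image-based example; the ReLU activations between the new layers and the channel-repetition of the one-channel spectrogram are assumptions not stated in the text, and the audio-based models are adapted analogously.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained ResNet34 and replace its single final linear
# layer with three new linear layers (1,024, 256, and 1 output neurons).
# Newer torchvision versions prefer the `weights=` argument instead of `pretrained=True`.
model = models.resnet34(pretrained=True)
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 1024),
    nn.ReLU(),                     # activation between new layers is an assumption
    nn.Linear(1024, 256),
    nn.ReLU(),
    nn.Linear(256, 1),             # single logit: COVID-19 positive probability after a sigmoid
)

# Fine-tune only the final convolutional block (layer4) and the new head;
# all earlier layers keep their frozen pre-trained ImageNet weights.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("layer4") or name.startswith("fc")

# A log Mel spectrogram segment (128 Mel bins x 224 frames) is fed as a
# 3-channel "image", here simply by repeating the single channel (an assumption).
spec = torch.randn(4, 1, 128, 224)           # (batch, channel, Mel bins, frames)
logits = model(spec.repeat(1, 3, 1, 1))      # (4, 1) raw logits
```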
1.4 Multi-Model Fusion

Given the trained models on the hand-crafted features and on the transfer-learning-based representations, a set of fusion approaches is employed to further improve the performance. In particular, an attention mechanism [13, 14] is proposed in our study to fuse the multiple models. The feature-level and decision-level fusion methods are introduced in Section 1.4.1 and Section 1.4.2, respectively.

1.4.1 Feature-Level Fusion

We apply feature-level fusion at the output of the second linear layer of the best model of each of the three methods in Section 1.3 (i. e., the DNN model on hand-crafted features, an image-based model, and an audio-based model), using max fusion, average fusion, and attention-based fusion.

Max fusion selects the element-wise maximum over the three learnt representations and outputs a vector of length 256. This feature vector is finally processed by a linear layer, which outputs a probability value for each audio segment. Average fusion computes the element-wise average of the three representations, yielding a 256-dimensional vector, which is then fed into a linear layer.

Attention-based fusion aims to weight the contribution of each unit of the three representations. An additional one-dimensional convolutional layer with a kernel size of 1 and 256 output channels, followed by a sigmoid function, is applied to the representations. Afterwards, the normalised output of the convolutional layer is multiplied with the representations, the results of the multiplication are summed, and the sum is finally fed into a linear layer to calculate the probability.

1.4.2 Decision-Level Fusion

Apart from the feature-level fusion, decision-level fusion operates on the output activation of the final linear layer. The max, average, and attention-based decision-level fusion methods are as follows. Max fusion at the decision level selects the maximum of the three probabilities. Average fusion computes the average of the three probabilities obtained for each audio segment. Attention-based fusion processes the output of the second linear layer with two one-dimensional convolutional layers, each with a kernel size of 1 and one output channel. One of the convolutional layers is followed by a sigmoid function and a normalisation. The normalised output is then multiplied with the output of the other convolutional layer, and the result of the multiplication is summed up to give the final probability.
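Since no equation is given for the attention-based feature-level fusion, the following PyTorch sketch shows one plausible reading of the description above; the module name, the explicit normalisation over the three models, and the tensor layout are assumptions rather than the exact configuration used in our system.

```python
import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    """Sketch of the feature-level attention fusion: the three 256-dimensional
    representations are stacked, a 1-D convolution (kernel size 1, 256 output
    channels) plus a sigmoid produces per-unit attention weights, the weighted
    representations are summed over the three models, and a final linear layer
    outputs the logit for an audio segment."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attention = nn.Conv1d(dim, dim, kernel_size=1)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, feats):                                   # feats: (batch, 256, 3)
        weights = torch.sigmoid(self.attention(feats))          # per-unit attention scores
        weights = weights / weights.sum(dim=-1, keepdim=True)   # normalise over the three models (assumed)
        fused = (weights * feats).sum(dim=-1)                   # (batch, 256) fused representation
        return self.classifier(fused)                           # (batch, 1) logit

# Illustrative usage with the 256-dimensional outputs of the three best models
hand_crafted, image_based, audio_based = (torch.randn(8, 256) for _ in range(3))
fusion = AttentionFeatureFusion()
logits = fusion(torch.stack([hand_crafted, image_based, audio_based], dim=-1))
```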
1.5 Experimental Results

Our proposed approaches are evaluated on the Track-1 (cough sounds) dataset of the DiCOVA 2021 challenge [1]. The dataset comprises 1,040 audio files (965 non-COVID) recorded from COVID-19-positive and COVID-19-negative individuals. The majority of the individuals are COVID-negative males aged 15-30 years [1]. All cough sound recordings are sampled at 44.1 kHz and compressed in the FLAC format. The dataset is split into five folds for cross-validation, where each fold has a training and a validation set. Additionally, the blind test set consists of 233 audio samples. The evaluation metrics are the sensitivity, the specificity, and the Area Under the Curve (AUC) [1, 15].

During training, the model parameters are updated for 30 epochs with a batch size of 16. The Adam optimiser with a learning rate of 0.001 is empirically chosen for the training process. As the loss function, BCEWithLogitsLoss is used, with the weight set to the ratio of the number of COVID-positive samples to the number of COVID-negative ones in the training data. Moreover, after every 10 epochs, the learning rate is decayed by a factor of 0.1 to stabilise the training procedure. To recognise COVID-19 on the test set, the whole dataset of 1,040 samples is used for training the models.

The results on the validation and test sets are given in Table 1. Compared to the official baseline (average validation AUC: 68.81%, test AUC: 69.91%), the results of the three single-model methods are comparable or slightly lower on the validation and test sets. However, the multi-model fusion approaches achieve improvements over the single-model methods and mostly outperform the baseline system. In particular, the proposed feature-level and decision-level attention fusion approaches perform better than both the max and the average fusion methods. Finally, the feature-level attention fusion achieves an AUC of 77.96% on the test set, an 8.05% improvement over the baseline.

In future efforts, more data pre-processing and augmentation techniques, such as signal enhancement and Generative Adversarial Networks (GANs) [16], could be explored. Meanwhile, more databases of cough sounds will be considered to augment the training data. Furthermore, COVID-19-related databases with more modalities, e. g., speech signals, will be employed for multi-modal COVID-19 detection to further improve the performance.

We express our deepest sorrow for those who left us due to COVID-19; they are lives, not numbers. We further express our highest gratitude and respect to the clinicians, scientists, and everyone else helping to fight against COVID-19 and maintain our daily lives. This work was supported by the Horizon 2020 Marie Skłodowska-Curie Actions Initial Training Network European Training Network (MSCA-ITN-ETN) project under grant agreement No. 766287 (TAPAS).

References

[1] DiCOVA Challenge: Dataset, task, and baseline system for COVID-19
[2] PANNs: Large-scale pretrained audio neural networks for audio pattern recognition
[3] Mixup: Beyond empirical risk minimization
[4] The INTERSPEECH 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heart beats
[5] Analysis and classification of cold speech using variational mode decomposition
[6] openSMILE: The Munich versatile and fast open-source audio feature extractor
[7] A survey on transfer learning
[8] ImageNet: A large-scale hierarchical image database
[9] Very deep convolutional networks for large-scale image recognition
[10] Deep residual learning for image recognition
[11] Audio for audio is better? An investigation on transfer learning models for heart sound classification
[12] Audio Set: An ontology and human-labeled dataset for audio events
[13] Attention-based convolutional neural networks for acoustic scene classification
[14] CAA-Net: Conditional atrous CNNs with attention for explainable device-robust acoustic scene classification
[15] Understanding receiver operating characteristic (ROC) curves
[16] SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty