key: cord-0166770-yu4eqmny
title: A Temporal-oriented Broadcast ResNet for COVID-19 Detection
authors: Jing, Xin; Liu, Shuo; Parada-Cabaleiro, Emilia; Triantafyllopoulos, Andreas; Song, Meishu; Yang, Zijiang; Schuller, Björn W.
date: 2022-03-31
sha: 46d39d390b54f3facaee55b2532805af8d147012
doc_id: 166770
cord_uid: yu4eqmny

Detecting COVID-19 from audio signals, such as breathing and coughing, can serve as a fast and efficient pre-testing method to reduce virus transmission. Motivated by the promising results of deep learning networks in modelling time sequences, and since applications that rapidly identify COVID-19 in the wild should require low computational effort, we present a temporal-oriented broadcasting residual learning method that achieves efficient computation and high accuracy with a small model size. Based on the EfficientNet architecture, our novel network, named Temporal-oriented ResNet (TorNet), is built from a broadcasting learning block, the Alternating Broadcast (AB) Block, which contains several Broadcast Residual Blocks (BC ResBlocks) and a convolution layer. With the AB Block, the network obtains useful audio-temporal features and higher-level embeddings effectively, with much less computation than Recurrent Neural Networks (RNNs), which are typically used to model temporal information. TorNet achieves 72.2% Unweighted Average Recall (UAR) on the INTERSPEECH 2021 Computational Paralinguistics Challenge COVID-19 Cough Sub-Challenge, thereby showing competitive results with a higher computational efficiency than other state-of-the-art alternatives.

COVID-19 cases are still rising in several countries, indicating that the pandemic remains a major health challenge for our world [1]. Although rapid testing methods exist, their efficiency is often limited by the capacity of the testing equipment. In addition, as their production depends on the availability of materials, limited resources might lead to crowding that, in turn, paradoxically increases infection rates. Hence, ubiquitous, low-cost methods for detecting COVID-19 are still being explored. In the realm of Artificial Intelligence, Deep Neural Networks (DNNs) have grown in popularity in recent years, setting the state of the art in a variety of tasks, including COVID-19 detection from audio signals, e. g., patients' breathing and coughing [2, 3].

The temporal component is an essential characteristic of audio signals. Thus, learning discriminative representations containing temporal information is crucial to achieve a better classification network when working with audio [2, 4]. To make full use of temporal information, Recurrent Neural Networks (RNNs) and variants with Long Short-Term Memory (LSTM) [5, 6, 7] have been successfully developed. However, RNNs are computationally more intensive and require more storage than a typical Convolutional Neural Network (CNN). Previous works have also shown that transformers can exploit the temporal properties of audio to obtain better detection results than RNNs [8, 9, 10]. Still, over-parametrised transformer-based deep networks might be prone to overfitting and, similar to CNNs, computationally inefficient. The successful application of RNNs and transformers to audio data illustrates the importance of temporal features for audio tasks. Nevertheless, unlike CNNs, the complexity of these network structures increases the computational cost and reduces training stability.
In the present work, we propose a temporal broadcast residual convolution block, i. e., the Alternating Broadcast Block (AB Block), in which the 2D features are averaged along the frequency dimension to guide the network's focus towards temporal features. Inspired by the EfficientNet [11] architecture (made up of repeated blocks and based on residual learning), we introduce a new deep learning network named Temporal-oriented ResNet (TorNet), which contains several AB Blocks to make full use of the temporal information in the audio segments. Furthermore, we adopt Instance Normalisation [12] to help the network find the relevant feature regions of the Mel-spectrogram, thereby improving the classification results. We evaluate the efficiency of TorNet on the detection of COVID-19 from coughing signals, using the audio dataset of the INTERSPEECH 2021 Computational Paralinguistics Challenge's COVID-19 Cough Sub-Challenge [13]. For reproducibility purposes, the source code of our work is freely available¹.

The remainder of our paper is organised as follows: We summarise the related research in Section 2. Then, we present our network architecture and describe the experimental settings in Sections 3 and 4, respectively. In Section 5, we discuss the results. Finally, in Section 6, we conclude with a brief summary and outline future directions.

Data representations such as Mel-Spectrograms can be seen from two different perspectives: either as an image, or as an audio sequence. This duality leads to the use of a variety of DNN architectures typical of both Computer Vision (CV) and the audio domain [14, 15]. On the one hand, previous work has shown that, with Mel Frequency Cepstral Coefficients (MFCCs) and log Mel-Spectrograms, 1D audio data can be transformed into 2D matrices [16]. This makes it possible to directly apply CNNs, typical of CV, which have become mainstream in Computer Audition. For the task of COVID-19 detection, Chang et al. [10] studied the performance of classical CNNs pretrained on the FluSense database, collected to track influenza-related indicators, such as coughs and sneezes [17]. Similarly, Casanova et al. [18] employed transfer learning from pretrained audio neural networks with different data augmentation techniques. On the other hand, as audio data is inherently a type of temporal sequence [19], RNNs [7] and LSTMs [20] have been widely adopted to handle the temporal information in several tasks. For instance, Hassan et al. [21] and Pahar et al. [22] evaluated the role of different audio features as input for LSTM-based classification of COVID-19. Similarly, Yan et al. [6] introduced the Spatial Attentive ConvLSTM-RNN (SACRNN), able to identify the most valuable features through an embedded temporal attention mechanism. Various efforts have also explored more efficient CNNs using residual network approaches and ensembles on audio data [23, 24, 25]. In particular, Byeonggeun et al. [26] used a residual broadcast block to retrieve temporal features by averaging the frequency features. Finally, Zhang et al. [27] proposed a hierarchical structure called pyramidal temporal pooling (PTP), which retrieves temporal information by stacking a global PTP layer on multiple local ones.

In this section, we propose the Temporal-oriented ResNet (TorNet), a modified version of the broadcasting-residual network [26] tailored to audio data, which we present for COVID-19 recognition.
In addition, we propose the Alternating Broadcast Block (AB Block), which contains several Broadcast Residual Blocks (BC ResBlocks) [26] that broadcast the temporal information to the whole feature map, plus a convolution layer that provides a better overview of the features. Finally, we use frequency-wise Instance Normalisation for better domain generalisation [28].

The original ResNet [29] block is described by y = x + f(x), with f(x) being the residual function, and x and y denoting the input and output features, respectively. Normally, f utilises 2D spatio-temporal features (i. e., 2D convolutions). To emphasise temporal features, we exploit 1D-temporal features in addition to the 2D ones. To highlight the frequency convolution over all blocks, an auxiliary 2D residual connection from the 2D features is added. To summarise, the BC ResBlock can be presented as:

y = x + f2(x) + BC(f1(avgpool(f2(x)))),   (1)

where avgpool denotes average pooling over the frequency dimension and BC the broadcasting operation. In Equation 1, the 2D feature part f2 consists of a 3x1 frequency depth-wise convolution followed by SubSpectral Normalisation [30], which splits the input frequency into multiple groups and normalises them separately. To obtain frequency-based temporal features, we apply SubSpectral Normalisation instead of Batch Norm. The resulting 2D features are then averaged over the frequency dimension. f1 is a combination of a 1x3 temporal convolution with Batch Norm and Swish activation [31], followed by a 1x1 point-wise convolution with a channel dropout rate of p = 0.5. The broadcasting operation then expands the feature map from R^{1×w} back to R^{h×w}. A normal BC ResBlock (cf. left in Figure 1) remaps the temporal information onto the original feature map, so it has the same input and output dimensions. When the numbers of input and output channels differ, a transition block is used instead, with the following modifications: 1. an additional stage with Batch Norm and ReLU activation handles the change in channel size; 2. there is no identity shortcut. With the BC ResBlock, it is thus possible to project the features to a higher dimensionality while broadcasting the temporal information to the whole feature map.

As shown in Figure 2, we propose a flexible structure, the Alternating Broadcast Block (AB Block), which mainly contains a set of BC ResBlocks and a convolution layer. The AB Block can easily be widened or deepened, even when facing a large amount of data, by simply adding a larger number of normal BC ResBlocks. Note that the first BC ResBlock must be a transition BC ResBlock when the numbers of input and output channels differ. As shown in Figure 1 (left), average pooling is used before the temporal depth-wise (DW) convolution, which causes information loss in the frequency dimension (an inevitable side effect of using the BC ResBlock). To reduce the impact of this loss, the last layer of the AB Block is a convolution layer, followed by a Batch Norm layer and a ReLU activation. The main task of this convolution layer is to capture the global information of the temporal-based feature map, while retaining the local information learnt in the previous layers and projecting it to a higher dimensionality than the original inputs. Thus, with the proposed block, we obtain enhanced frequency-aware temporal 2D features.

To achieve a better domain generalisation, we apply Instance Normalisation [12] (IN), an approach that normalises across each channel of each training example. Since IN does not rely on batch information, its implementation is kept the same for both the training and testing phases.
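To make the block structure concrete, the following is a minimal PyTorch sketch of a normal BC ResBlock following Equation 1. It is for illustration only and not identical to our released code: approximating SubSpectral Normalisation by a grouped BatchNorm and requiring the frequency dimension to be divisible by the number of sub-bands are assumptions of this sketch.

import torch
import torch.nn as nn


class SubSpectralNorm(nn.Module):
    # Approximation of SubSpectral Normalisation [30]: the frequency axis is split
    # into S sub-bands, and each sub-band is normalised separately via BatchNorm.
    def __init__(self, channels, sub_bands=5):
        super().__init__()
        self.sub_bands = sub_bands
        self.bn = nn.BatchNorm2d(channels * sub_bands)

    def forward(self, x):                       # x: [B, C, F, T], F divisible by sub_bands
        b, c, f, t = x.shape
        x = x.view(b, c * self.sub_bands, f // self.sub_bands, t)
        x = self.bn(x)
        return x.view(b, c, f, t)


class NormalBCResBlock(nn.Module):
    # Normal BC ResBlock: y = x + f2(x) + BC(f1(avgpool(f2(x)))), cf. Equation 1.
    def __init__(self, channels, sub_bands=5, dropout=0.5):
        super().__init__()
        # f2: 3x1 frequency depth-wise convolution + SubSpectral Normalisation
        self.f2 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(3, 1),
                      padding=(1, 0), groups=channels),
            SubSpectralNorm(channels, sub_bands),
        )
        # f1: 1x3 temporal depth-wise convolution + Batch Norm + Swish,
        # then a 1x1 point-wise convolution with channel dropout
        # (p = 0.5 here; the experiments use p = 0.1 in most blocks)
        self.f1 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 3),
                      padding=(0, 1), groups=channels),
            nn.BatchNorm2d(channels),
            nn.SiLU(),                          # Swish activation
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Dropout2d(dropout),              # channel dropout
        )

    def forward(self, x):                       # x: [B, C, F, T]
        x2 = self.f2(x)                         # 2D frequency-aware features
        x1 = x2.mean(dim=2, keepdim=True)       # average over frequency -> [B, C, 1, T]
        x1 = self.f1(x1)                        # 1D temporal features
        return x + x2 + x1                      # x1 is broadcast back to [B, C, F, T]

A transition variant would additionally change the number of channels (with Batch Norm and ReLU activation) and drop the identity shortcut, as described above; the AB Block then stacks several such blocks and closes with a convolution layer, Batch Norm, and ReLU.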
We apply IN along the frequency dimension as formulated below:

x̂_ncft = (x_ncft − µ_nf) / √(σ²_nf + ε),

where µ_nf, σ_nf ∈ R^{N×F} are the mean and standard deviation of the input feature x ∈ R^{N×C×F×T}, computed over the channel and time dimensions, and N, C, F, and T denote the batch size, the number of input channels, the frequency dimension, and the time dimension, respectively. ε is a small value added to the denominator for numerical stability.

After exploring several choices and combinations, we design the Temporal-oriented ResNet (TorNet) for the COVID-19 detection task as shown in Figure 3. Details on the TorNet structure are given in Table 1. As shown in Figure 3, TorNet contains four main stages. The first stage has a 3 × 3 convolution layer followed by a 2 × 2 max-pooling layer at the front to downsample both the time and frequency dimensions. The second stage is a typical residual block with two AB Blocks, where every AB Block doubles the number of channels while halving the frequency dimension to obtain a higher-level embedding. In the residual shortcut, we add a Batch Norm layer and use max-pooling to control the size of the receptive field. An Instance Normalisation layer follows between stage 2 and stage 3. Stage 3 shares the same structure as stage 2 with minor differences, i. e., the number of channels is doubled and the dimensions of the feature map do not change. After the second IN layer, the feature map is reshaped into a 3D tensor [batch size, time, out channels × N_mel]. Finally, two fully connected layers are added as classification layers.

In the INTERSPEECH 2021 Computational Paralinguistics Challenge [13], the COVID-19 Cough Sub-Challenge (CCS) was based on a subset of the crowd-sourced Cambridge COVID-19 Sound database [32], whose goal is to promote the development of systems able to diagnose COVID-19 from audio data. The CCS database consists of 929 cough recordings (1.63 hours) from 397 participants presenting either a positive or a negative COVID-19 test. Participants were asked to provide one to three forced coughs in each recording². All recordings in the CCS database were resampled to 16 kHz and converted to mono/16 bit. The official training, validation, and test sets of the ComParE challenge are used in all our experiments.

For data pre-processing, we standardise the length of the audio data to 10 seconds; shorter samples are repeated until they match the target length. As input features, we use 40-dimensional log Mel-Spectrograms with a 64 ms window length and a 16 ms frame shift. We also extract the deltas and delta-deltas of the log Mel-Spectrograms and concatenate them as additional input channels. The resulting input size for TorNet is [batch size, 3, 40, 512].
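For illustration, a minimal sketch of this input preparation using librosa follows; the library choice and the FFT size are assumptions for illustration rather than a description of our released code (64 ms and 16 ms correspond to 1024 and 256 samples at 16 kHz).

import librosa
import numpy as np

def prepare_input(path, sr=16000, target_s=10, n_mels=40,
                  win_length=1024, hop_length=256):
    # Load the cough recording and repeat it until it reaches the 10 s target length.
    y, _ = librosa.load(path, sr=sr)
    target = sr * target_s
    y = np.tile(y, int(np.ceil(target / len(y))))[:target]
    # 40-dimensional log Mel-Spectrogram with a 64 ms window and a 16 ms frame shift.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win_length,
                                         win_length=win_length,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # First- and second-order deltas, stacked as additional input channels.
    delta = librosa.feature.delta(log_mel)
    delta2 = librosa.feature.delta(log_mel, order=2)
    return np.stack([log_mel, delta, delta2])   # shape: [3, n_mels, T]

Note that, with these exact parameters, a 10 s clip yields roughly 626 frames, so a final crop or a slightly different framing would be needed to obtain the 512 frames used above.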
For all models, we use the Adam optimiser with an epsilon value of 10^-8, a mini-batch size of 16, and a learning rate of 10^-5. As indicated by Chang et al. [30], the number of SSN sub-bands in the AB Blocks is set to 5, and the dropout rate is p = 0.1 throughout, except for the last layer, where it is set to p = 0.5. We also tried data augmentation methods (mixup, SpecAugment), but observed no noticeable performance gain in our task, so we omit them for brevity. All models were developed in PyTorch 1.8.1 and trained on a single Nvidia RTX 3090 GPU.

To verify the efficiency of the proposed TorNet, as well as the effectiveness of the temporal features, we developed four additional ResNet-based methods for comparison. These methods are as follows:
• ResNet-10: has a structure identical to TorNet, but uses a convolution layer and max-pooling to control the size of the feature map; it serves as the baseline model.
• ResNet-10 + LSTM: introduces a standard LSTM layer at the output of ResNet-10.
• ResNet-10 + LSTM + Attention: adds a 4-head multi-head attention layer on top of the LSTM output.

Our experimental results on the binary task of COVID-19 detection are presented in Table 2. The upper part of Table 2 reports results from the literature: Casanova et al. [18] achieved 75.9% UAR based on a large-scale transfer learning model. Their CNN14 model is pre-trained on AudioSet, which implies a longer training time and higher computational effort. Indeed, CNN14 has 79.67 million parameters, almost 18 times as many as TorNet (4.46 million), showing that TorNet offers a considerably higher computational efficiency. Similarly, the baseline fusion framework of the CCS Sub-Challenge fuses the best individual models to obtain the final result (73.9% UAR), which also entails a far higher computational complexity than our TorNet. Overall, this shows that TorNet can achieve competitive performance without pre-training or fusion, while using far fewer computational resources.

In comparison to the official 'standard' baseline (End2You), our baseline results show that the ResNet structure still has good robustness on audio data (ResNet-10: 66.9% UAR). Meanwhile, we add LSTM and transformer structures after ResNet-10 to extract temporal information. The results show that the extraction of temporal information improves the final UAR, as shown by the combination of ResNet-10 + LSTM (cf. 70.5%, in bold, in the middle part of Table 2).

Since our goal is to investigate to what extent it is possible to model temporal information while improving computational efficiency with DNNs, we also set up four comparison experiments based on TorNet. In these, we keep all training parameters consistent in order to assess the impact of the different modules, i. e., the convolution layer in the AB Block, the normal BC ResBlock, and Instance Normalisation, on the overall performance of TorNet. The lower part of Table 2 contains an ablation study of the components introduced in this work. TorNet without the last convolution layer in the AB Block achieves only 65.5% UAR, while introducing the convolution layers yields a performance improvement of nearly 5.0% (cf. 70.2%, underlined in the lower part of Table 2). This is because, in each AB Block, the BC ResBlock can broadcast the temporal features to the original feature map, but loses a portion of the frequency features. By introducing an extra convolution layer, we counteract the influence of this loss of granular detail and obtain a better overview of the feature map, which results in a sizeable performance increase. At the same time, the IN layer introduced along the frequency dimension provides better domain generalisation, leading to a performance improvement of approximately 2% in the same training environment: from 70.2% (without IN) to 72.2% (with IN).

In this work, we proposed an AB Block that can efficiently exploit the temporal information in audio sequences. It contains multiple BC ResBlocks as well as a convolution layer to capture temporally enhanced features. Based on the AB Block with residual learning, we proposed a flexible, lightweight, and time-oriented network, TorNet. TorNet has a typical ResNet structure, but we replace the convolution module with the AB Block.
Competitive results highlight the high computational efficiency and robustness of TorNet, a promising architecture that offers new insights for the detection of COVID-19. Future work could target the application of TorNet to other domains, such as speech emotion recognition or acoustic scene classification.

[1] WHO Coronavirus (COVID-19) Dashboard, 2022
[2] COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS-CoV-2 Corona Crisis
[3] A Generic Deep Learning Based Cough Analysis System from Clinically Validated Samples for Point-of-Need Covid-19 Test and Severity Levels
[4] AI-based Human Audio Processing for COVID-19: A Comprehensive Overview
[5] Emotion Recognition in Public Speaking Scenarios Utilising an LSTM-RNN Approach with Attention
[6] Coughing-Based Recognition of COVID-19 with Spatial Attentive ConvLSTM Recurrent Neural Networks
[7] Motivic Pattern Classification of Music Audio Signals Combining Residual and LSTM Networks
[8] AST: Audio Spectrogram Transformer
[9] Parallelising CNNs and Transformers: A Cognitive-based Approach for Automatic Recognition of Learners' English Proficiency
[10] CovNet: A Transfer Learning Framework for Automatic COVID-19 Detection From Crowd-Sourced Cough Sounds
[11] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
[12] Instance Normalization: The Missing Ingredient for Fast Stylization
[13] The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates
[14] Deep Learning for Audio Signal Processing
[15] Learning Multi-resolution Representations for Acoustic Scene Classification via Neural Networks
[16] Frustration Recognition from Speech during Game Interaction Using Wide Residual Networks
[17] FluSense: A Contactless Syndromic Surveillance Platform for Influenza-Like Illness in Hospital Waiting Areas
[18] Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021
[19] Extending Temporal Feature Integration for Semantic Audio Analysis
[20] Exploiting Time-frequency Patterns with LSTM-RNNs for Low-bitrate Audio Restoration
[21] COVID-19 Detection System using Recurrent Neural Networks
[22] COVID-19 Cough Classification using Machine Learning and Global Smartphone Recordings
[23] End-to-end Audio-scene Classification from Raw Audio: Multi Time-frequency Resolution CNN Architecture for Efficient Representation Learning
[24] 1D/2D Deep CNNs vs. Temporal Feature Integration for General Audio Classification
[25] Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning
[26] Broadcasted Residual Learning for Efficient Keyword Spotting
[27] Pyramidal Temporal Pooling With Discriminative Mapping for Audio Classification
[28] Batch-Instance Normalization for Adaptively Style-Invariant Neural Networks
[29] Deep Residual Learning for Image Recognition
[30] Subspectral Normalization for Neural Audio Data Processing
[31] Searching for Activation Functions
[32] Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data
[33] Visual Transformers for Primates Classification and Covid Detection