title: A Squeeze-and-Excitation and Transformer based Cross-task System for Environmental Sound Recognition
authors: Bai, Jisheng; Chen, Jianfeng; Wang, Mou; Ayub, Muhammad Saad
date: 2022-03-16

Environmental sound recognition (ESR) is an emerging research topic in audio pattern recognition. Many tasks have been proposed that rely on computational systems for ESR in real-life applications. However, current systems are usually designed for individual tasks and are neither robust nor applicable to other tasks. Cross-task systems, which promote unified knowledge modeling across various tasks, have not been thoroughly investigated. In this paper, we propose a cross-task system for three different tasks of ESR: acoustic scene classification, urban sound tagging, and anomalous sound detection. An architecture named SE-Trans is presented that uses attention mechanism-based Squeeze-and-Excitation and Transformer encoder modules to learn the channel-wise relationships and temporal dependencies of the acoustic features. FMix is employed as the data augmentation method to improve the performance of ESR. Evaluations for the three tasks are conducted on recent databases of the DCASE challenges. The experimental results show that the proposed cross-task system achieves state-of-the-art performance on all tasks. Further analysis demonstrates that the proposed cross-task system can effectively utilize acoustic knowledge across different ESR tasks.

Humans can automatically recognize sounds, but it is challenging for machines. Audio pattern recognition (APR) is a growing research area where signal processing and machine learning methods are used to understand the surrounding sound. APR is of great importance in automatic speech recognition, automatic music transcription, and environmental sound recognition (ESR). More recently, ESR has attracted much attention because various applications are emerging in our daily life. For example, heart sound has been used as a biometric to identify a person in a real-time authentication system [1]. In surveillance systems, the detection of gunshots or glass breaking can be used to report danger in time [2], and the recognition of baby crying can be used as a safety measure for babies [3]. A particularly meaningful application is using speech and non-speech audio to detect COVID-19 [4]. The Detection and Classification of Acoustic Scenes and Events (DCASE) challenges, which focus on machine listening research for acoustic environments, have dramatically promoted the development of ESR in the past few years [5]. The series of DCASE challenges has provided a set of tasks, such as acoustic scene classification (ASC) [6], urban sound tagging (UST) [7], and anomalous sound detection (ASD) [8], encouraging participants to develop strong computational systems [9]. A computational system for ESR usually consists of two stages: feature extraction, in which audio is transformed into a feature representation, and sound recognition, in which a mapping between the feature representations and labels of sound classes is learned by a classifier [10]. Log Mel spectrograms, which represent the audio signal using a perceptually motivated frequency scale [11], have been the predominant feature representation in this field [12, 13].
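For illustration, log Mel features of the kind described above could be extracted roughly as in the sketch below, using librosa. The sample rate, FFT size, hop length, and number of Mel bands here are illustrative assumptions, not the settings used in this paper.

```python
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=22050, n_fft=2048, hop_length=512, n_mels=64):
    """Load an audio file and return a log Mel spectrogram of shape (n_mels, frames)."""
    y, sr = librosa.load(path, sr=sr, mono=True)        # load and resample to a common rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )                                                    # power Mel spectrogram
    return librosa.power_to_db(mel, ref=np.max)          # convert to a log (dB) scale
```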
Recently, deep learning methods have achieved great success in many pattern recognition fields, including image classification [14], speech processing [15], natural language processing (NLP) [16], and ESR. Deep neural network (DNN) based classifiers, such as convolutional neural networks (CNNs), convolutional recurrent neural networks (CRNNs), and Transformers, have become the dominant approaches in ESR and outperform conventional machine learning methods [17, 18]. Moreover, data augmentation has become a necessary part of an ESR system to mitigate overfitting during the training stage and further improve performance [19].

In the DCASE challenges, ESR systems are usually developed for concrete, individual tasks, but the unification of the systems designed for these tasks has not been adequately studied. Unified systems across different domains facilitate knowledge modeling from these fields, and such unification has been explored in NLP (e.g., BERT [20]), computer vision (CV) (e.g., the ViT backbone [21]), and APR (e.g., PANNs [22]). For these reasons, it is important to explore a generic system with robust performance and wide applicability for different ESR tasks, in which common acoustic knowledge can be modeled. Systems that span several ESR tasks are called cross-task systems [23]. The authors of [23] investigated the performance of CNN-based systems on several tasks of the DCASE 2018 and 2019 challenges and found that a 9-layer CNN with average pooling is a good model for most of the tasks.

Recently, the attention mechanism has attracted tremendous interest in numerous pattern recognition fields as a component of DNNs. Attention yields more powerful representations by allowing the network to dynamically focus on the informative parts of the signal. In ESR, different attention mechanisms have been applied to environmental sound classification [24], sound event detection [25], and ASC [26]. Yet, to the best of our knowledge, the attention mechanism has not been studied or exploited in cross-task systems.

In this paper, we propose a cross-task system based on attention mechanisms for modeling three subtasks of ESR, i.e., ASC, UST, and ASD. First, we use Squeeze-and-Excitation (SE) modules after the convolution layers in the main architecture. This allows the model to learn the importance of channel-wise features and effectively enhances the acoustic information between channels. Next, we adopt the Transformer encoder in our architecture, where the multi-head self-attention (MHSA) mechanism is applied to efficiently model the temporal dependencies. The resulting main architecture, based on SE and Transformer encoder modules, is called SE-Trans. In addition, we propose to use FMix as the data augmentation method, which effectively augments the training data by randomly mixing irregular areas of two samples. The main contributions of this paper can be summarized as follows:

• We propose a cross-task system that can generally model various audio patterns across several tasks of ESR.
• We incorporate two attention-based modules, i.e., SE and the Transformer encoder, into the main architecture to enhance channel-wise acoustic information and capture temporal dependencies.
• We are the first to introduce FMix into ESR as a data augmentation method, and we find that FMix can effectively augment the training data and significantly improve the performance of ESR tasks.
• We conduct experiments on the latest datasets of the DCASE challenges, and the results show that our cross-task system achieves state-of-the-art performance for ASC, UST, and ASD.

The rest of the paper is organized as follows: Section 2 reviews the related work. Section 3 describes the components of the proposed cross-task system. Section 4 presents the experiments, and the results and discussions are given in Section 5. Finally, we conclude the paper in Section 6.

The DCASE challenges have been held successfully seven times since 2013. The organizers present several tasks that are close to different aspects of real-life applications, providing public datasets, metrics, and evaluation frameworks [9]. Among these tasks, ASC, UST, and ASD, each of which has been organized for at least two consecutive years, have attracted much interest.

ASC aims to identify the acoustic scene of a recording among predefined classes using signal processing and machine learning methods. The acoustic scenes are usually recorded in real-life environments such as squares, streets, and restaurants. Many applications can potentially use ASC systems, e.g., wearable devices [27], robotics [28], and smart home devices [29]. Early works primarily used conventional machine learning methods, such as Gaussian mixture models and support vector machines. CNN-based approaches have gradually become the mainstream for designing systems and have achieved top performance in recent years [30, 31].

The goal of UST is to predict whether an urban sound is present in a recording. Various urban sounds occur around us all the time in big cities, and some, such as traffic noise, can be harmful under long exposure. Sounds of New York City (SONYC) is a research project investigating data-driven approaches to mitigate urban noise pollution [32]. SONYC collected over 100 million recordings between 2016 and 2019, and the researchers organized UST tasks in DCASE 2019 and DCASE 2020 to encourage computational methods for automatically monitoring noise pollution. The winning systems of both the DCASE 2019 and DCASE 2020 UST tasks used CNNs as the primary classifier architecture [33, 34].

ASD aims to detect anomalous acoustic signals in particular environments. ASD for machine condition monitoring (MCM) is an emerging task that identifies whether the sound produced by a target machine is normal or anomalous. Successful ASD methods can reduce the loss caused by machine damage and speed up the adoption of essential technologies in industry. In factories, anomalous sounds rarely occur and are often unavailable, so the main challenge is to detect unknown anomalous sounds when only normal sound samples are provided. To address this problem, tasks named "unsupervised detection of anomalous sounds for MCM" were organized in DCASE 2020 and DCASE 2021 [35, 8]. In [36], the authors proposed a self-supervised density estimation method using normalizing flows and machine IDs to detect anomalies. A MobileFaceNet was trained in a self-supervised manner to detect anomalous sounds and achieved great performance in DCASE 2021 [37].

Recently, using deep learning for the above ESR tasks has become the trend and achieved state-of-the-art performance. Yet, these methods mostly focus on specific tasks, and we have found little literature studying unification in this field.
The study of cross-task systems seeks a general system that performs well on various ESR tasks, and deep learning based cross-task systems have been investigated. Kong et al. proposed DNN-based baselines with the same structure for the DCASE 2016 challenge [38]. The DNNs take Mel filter bank features as input and outperform the official baselines on many, but not all, tasks. For the DCASE 2018 challenge, they created a CNN-based cross-task baseline system for all five tasks [39]. CNNs with 4 and 8 layers were investigated across the tasks using the same network configuration, and the deeper 8-layer CNN performed better than the 4-layer CNN on almost all tasks. Further, Kong et al. proposed generic cross-task baseline systems in which CNNs with 5, 9, and 13 layers were studied [23]. The results of these CNN models on five tasks of DCASE 2019 showed that the 9-layer CNN with average pooling achieves good performance in general.

CNNs were designed specifically for images and have become the dominant models in CV [40]. Because of their ability to extract features from spectrograms, CNNs have also been the most popular architecture in ESR. Most recently, CNNs have incorporated attention mechanisms, such as SE and self-attention, to focus on key parts of the input, capture long-range dependencies, and reduce computational complexity. The combination of CNNs and attention has outperformed many previous methods and has become a new research direction. SENet [41] was proposed to explicitly learn the relationships between channels and pay more attention to the more important feature maps; implementing SE effectively improves classification performance with little increase in the number of parameters. In addition, the Transformer has achieved state-of-the-art performance in many research domains. The MHSA modules in the Transformer can model long input sequences and process them in parallel. A combination of CNNs and Transformers has been proposed to model both local and global dependencies of an audio sequence for speech recognition [42].

Most state-of-the-art approaches in ESR use data augmentation in the training stage to overcome the overfitting caused by the lack of environmental sound data. These data augmentation methods can be categorized into two classes, depending on the representation format of the sound. The first class operates on the sound waveform, e.g., changing the speed, volume, or pitch [19]. The drawback of these methods is that they require complex operations and considerable time to generate enough samples. The other class operates on the spectrogram and can produce enough data without much extra time. These methods are derived from image augmentation techniques and follow two main strategies. The first strategy removes or masks some information in the images or spectrograms (e.g., cutout [43] or SpecAugment [44]); it can be undesirable for some classification tasks because part of the information may be lost during training. The second strategy, called mixed sample data augmentation (MSDA), augments the training data by combining samples according to a specific policy, as in mixup [45] and FMix [46]. MSDA methods can generate more unseen data, force the model to learn more robust features, and ultimately improve classification performance.

The processing stages of the proposed cross-task system are shown in Fig. 1.
First, the system takes audio recordings as input and transforms them into acoustic features. Then, FMix is applied to the acoustic features to generate mixed acoustic features. Next, the SE-Trans, which consists of SE-blocks and a Transformer encoder, is trained to recognize the acoustic features under different conditions. For ASC, each audio recording is recognized as a specific acoustic scene. For UST, each audio recording is tagged with different urban sound classes. For ASD, each audio recording is annotated as normal or anomalous.

The first part of the proposed system is the SE-blocks. We denote $X \in \mathbb{R}^{T \times F}$ as an acoustic feature transformed from an audio recording, where $T$ is the number of time frames and $F$ is the number of frequency bins. $X$ is further reshaped into $\mathbf{X} \in \mathbb{R}^{T \times F \times 1}$, which is processed consecutively by a convolutional layer (Conv), batch normalization (BN), an SE layer, and a rectified linear unit (ReLU), twice in each SE-block. We assume that the output of BN is $\mathbf{X} \in \mathbb{R}^{C \times T \times F}$, where $C$ is the number of channels of the convolutional layer, $T$ is the number of time frames, and $F$ is the number of frequency bins. In an SE layer, $\mathbf{X}$ is first squeezed over the time-frequency dimensions $T \times F$ in each channel:
$$z_c = F_{sq}(\mathbf{x}_c) = \frac{1}{T \times F} \sum_{i=1}^{T} \sum_{j=1}^{F} x_c(i, j),$$
where $F_{sq}$ is the global average pooling function and $z_c$ is the channel-wise value of the squeezed vector $\mathbf{z} \in \mathbb{R}^{C}$. A channel-wise relationship is then excited from $\mathbf{z}$ by a gating mechanism:
$$\mathbf{w} = \sigma\big(\mathbf{W}_2\,\delta(\mathbf{W}_1 \mathbf{z})\big),$$
where $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weights of two fully connected (FC) layers, $r$ is a hyperparameter, $\mathbf{w}$ is the channel-wise weight vector, $\delta$ is the ReLU activation, and $\sigma$ is the sigmoid activation. The channel-wise feature maps of $\mathbf{X}$ are then rescaled by $\mathbf{w}$:
$$\tilde{\mathbf{x}}_c = F_{scale}(\mathbf{x}_c, w_c) = w_c\,\mathbf{x}_c,$$
where $\mathbf{x}_c$ and $w_c$ are the $c$-th channel of $\mathbf{X} \in \mathbb{R}^{C \times T \times F}$ and $\mathbf{w}$, respectively, and $F_{scale}$ is a channel-wise multiplication. Finally, an average pooling layer (Avg Pool) is applied to reduce the size of the feature maps. A flowchart of the SE layer is shown in Fig. 2.

A global average pooling layer is used after the SE-blocks to obtain a proper input shape for the Transformer encoder. We denote the input of the Transformer encoder as $\mathbf{E} \in \mathbb{R}^{T \times d}$, where $T$ is the number of time frames. The second part of the SE-Trans is the Transformer encoder. The Transformer is a sequence-to-sequence model that usually contains an encoder and a decoder; since our proposed cross-task system is used for classification tasks, we only use the encoder, which consists of several encoder layers. The dimensions $T$ and $d$ of $\mathbf{E}$ are the sequence length and the feature size of the encoder layer, respectively. In each attention head, $\mathbf{E}$ is linearly projected into queries, keys, and values, and the attention is formulated as:
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_i \mathbf{K}_i^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_i, \qquad \mathbf{Q}_i = \mathbf{E}\mathbf{W}_i^{Q}, \ \mathbf{K}_i = \mathbf{E}\mathbf{W}_i^{K}, \ \mathbf{V}_i = \mathbf{E}\mathbf{W}_i^{V},$$
where $d_k$ is the dimension of the keys. The attentions of all heads are concatenated and linearly projected again to obtain the multi-head output:
$$\mathrm{MHSA}(\mathbf{E}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,\mathbf{W}^{O},$$
where $\mathbf{W}^{O} \in \mathbb{R}^{d \times d}$ is a linear transformation matrix. After that, residual connections and layer normalization (LN) [47] are employed:
$$\mathbf{E}' = \mathrm{LN}\big(\mathbf{E} + \mathrm{MHSA}(\mathbf{E})\big).$$
The output is fed into a feed-forward network (FFN), followed by another residual connection and LN, to obtain the final output of the Transformer encoder:
$$\mathbf{O} = \mathrm{LN}\big(\mathbf{E}' + \mathrm{FFN}(\mathbf{E}')\big),$$
where $\mathbf{O} \in \mathbb{R}^{T \times d}$ is the output of the Transformer encoder.

In this section, we describe the loss functions used for ASC, UST, and ASD in the system. Since the ground truth and prediction of a recording contain only one of the acoustic scene classes, ASC is a multi-class classification task. We denote the output of SE-Trans as $\hat{\mathbf{y}} \in \mathbb{R}^{K}$, where $K$ is the number of classes. The loss function of ASC is the categorical cross-entropy loss, defined as:
$$L_{\mathrm{ASC}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k,$$
where $\hat{y}_k$ is the estimated label and $y_k$ is the true label of the $k$-th class.
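To make the SE and Transformer encoder components above concrete, the following is a minimal PyTorch sketch of an SE layer and an SE-Trans-style classifier. The channel count, reduction ratio, number of heads and encoder layers, and the single conv/SE block are illustrative assumptions; this does not reproduce the authors' exact SE-Trans configuration.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation: reweight channels by globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: squeeze to C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W2: back to C
            nn.Sigmoid(),                                # channel-wise weights w
        )

    def forward(self, x):                   # x: (batch, C, T, F)
        z = x.mean(dim=(2, 3))              # squeeze: global average over T x F
        w = self.fc(z)                      # excite: channel-wise gating weights
        return x * w.unsqueeze(-1).unsqueeze(-1)  # rescale each channel map

class SETransSketch(nn.Module):
    """SE-style conv block followed by a Transformer encoder over time frames."""
    def __init__(self, n_classes, channels=64, n_heads=4, n_layers=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.BatchNorm2d(channels),
            SELayer(channels), nn.ReLU(inplace=True),
            nn.AvgPool2d(2),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):                   # x: (batch, 1, T, F) log Mel input
        h = self.conv(x)                    # (batch, C, T', F')
        h = h.mean(dim=3).transpose(1, 2)   # average over frequency -> (batch, T', C)
        h = self.encoder(h)                 # multi-head self-attention over time
        return self.head(h.mean(dim=1))     # pool over time, then classify
```

In this sketch the frequency axis is averaged out before the encoder, so the self-attention operates purely over time frames, mirroring the role the paper assigns to the Transformer encoder.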
For UST, the ground truth and prediction of a sound recording may contain multiple classes, so it is a multi-label classification task. The loss function of UST is the binary cross-entropy loss, where the network output passes through the sigmoid activation function to obtain $\hat{y}_k$, and the loss is defined as:
$$L_{\mathrm{UST}} = -\sum_{k=1}^{K} \big[\, y_k \log \hat{y}_k + (1 - y_k)\log(1 - \hat{y}_k) \,\big].$$
For ASD in MCM, we adopt a self-supervised learning strategy and train the model to differentiate the machine IDs of one machine type. Therefore, ASD can be seen as a supervised multi-class classification task. The loss function is the same categorical cross-entropy loss as used for ASC, where $\hat{y}$ and $y$ in ASD refer to the estimated and true machine ID, respectively.

In the proposed cross-task system, we employ FMix as the data augmentation method to improve generalization and prevent overfitting of the neural networks. FMix is applied to the time-frequency representations in the training stage for the ESR tasks. First, a random complex tensor $\mathbf{Z} \in \mathbb{C}^{T \times F}$ is sampled, for which both the real and imaginary parts are independent and Gaussian. We then scale each component according to its frequency via the parameter $\delta$, such that higher values of $\delta$ correspond to increased decay of high-frequency information. Next, we perform an inverse Fourier transform on the complex tensor and take the real part to obtain a grey-scale image $\mathbf{g}$. We set the top $\lambda$ proportion of the image to 1 and the rest to 0 to obtain the binary mask:
$$\mathrm{mask}(\lambda, \mathbf{g})_{ij} =
\begin{cases}
1, & g_{ij} \in \mathrm{top}(\lambda T F, \mathbf{g}) \\
0, & \text{otherwise},
\end{cases}$$
where $\mathrm{top}(n, \mathbf{g})$ denotes the operation that returns a set containing the top $n$ elements of the input $\mathbf{g}$, $\mathbf{g}$ refers to the grey-scale image, and $\mathrm{mask}(\lambda, \mathbf{g})$ refers to a binary mask with mean $\lambda$. Finally, the mixed sample is obtained from two input features $X_1$ and $X_2$ as:
$$\tilde{X} = \mathrm{mask}(\lambda, \mathbf{g}) \odot X_1 + \big(1 - \mathrm{mask}(\lambda, \mathbf{g})\big) \odot X_2,$$
where $\odot$ denotes the Hadamard product.

We evaluated the proposed cross-task system for all the tasks on the latest datasets of the DCASE challenges. The details of the datasets, experimental setups, baseline systems, and evaluation metrics for ASC, UST, and ASD are described in this section.

The dataset used for ASC is the development set of DCASE 2021 Task 1 Subtask A [48]. The organizers used different devices to simultaneously capture audio in 10 acoustic scenes: airport, shopping mall, metro station, street pedestrian, public square, street traffic, tram, bus, metro, and park. The development set contains data from 9 devices: A, B, and C (3 real devices) and S1-S6 (6 simulated devices). The total number of recordings in the development set is 16,930. The dataset is divided into a training set of 13,962 recordings and a testing set of 2,968 recordings. Complete details of the development set are shown in Table 1.

Besides FMix, two more data augmentation methods were analyzed. The first is SpecAugment, in which frequency bins and time frames of the spectrograms are randomly masked with random width and height. The second is mixup, in which the operations on training samples are expressed as:
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$
where $x_i$ and $x_j$ are the input features, $y_i$ and $y_j$ are the corresponding target labels, and $\lambda \in [0, 1]$ is a random number drawn from the beta distribution.

We used three types of systems as baselines in the experiments. The first type is the official baseline systems provided by the task organizers. The official baseline system for ASC is a 3-layer CNN model with 16, 16, and 32 feature maps in its convolutional layers [49]. The second type is the CNN-based systems proposed by Kong et al. in the study of cross-task systems [23].
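As an illustration of the FMix procedure described above, the sketch below generates a low-frequency binary mask and mixes two spectrograms with it. The decay value, the exact frequency filtering, the Beta-sampled mixing ratio, and the function interfaces are simplifying assumptions and do not reproduce the reference FMix implementation [46].

```python
import numpy as np

def fmix_mask(shape, lam, decay=3.0):
    """Binary mask whose '1' region covers roughly a lam fraction of the image,
    obtained by thresholding low-pass-filtered random Fourier noise."""
    h, w = shape
    # random complex spectrum: independent Gaussian real and imaginary parts
    spec = np.random.randn(h, w) + 1j * np.random.randn(h, w)
    # attenuate high frequencies; a larger decay keeps only smoother structure
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    freq = np.sqrt(fy ** 2 + fx ** 2)
    spec = spec / np.maximum(freq, 1.0 / max(h, w)) ** decay
    grey = np.real(np.fft.ifft2(spec))               # grey-scale image g
    # set the top lam proportion of pixels to 1 and the rest to 0
    k = max(int(round(lam * h * w)), 1)
    thresh = np.sort(grey.ravel())[::-1][k - 1]
    return (grey >= thresh).astype(np.float32)

def fmix_pair(x1, x2, y1, y2, alpha=1.0):
    """Mix two spectrograms (and their one-hot/multi-hot labels) with an FMix-style mask."""
    lam = np.random.beta(alpha, alpha)
    mask = fmix_mask(x1.shape, lam)
    x = mask * x1 + (1.0 - mask) * x2                # Hadamard mixing of the features
    y = lam * y1 + (1.0 - lam) * y2                  # labels weighted by the mask area
    return x, y
```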
The third type is a CRNN-based system, which targets the audio tagging task and achieves great performance [25]. This system uses the same architecture as CNN9, where the frequency axis of the output of the last convolutional layer is averaged. A bidirectional gated recurrent unit (biGRU) and a time-distributed fully connected layer are then applied to predict the presence of sound classes. Mixup is exploited during the training stage, and the system is named CNN-biGRU-Avg. Finally, our proposed cross-task system is compared with five baseline systems: the official baseline systems, CNN5, CNN9, CNN13, and CNN-biGRU-Avg.

To evaluate the performance of the system, we first compute the accuracy (ACC) and the precision (P) and recall (R) of class $c$ as follows:
$$\mathrm{ACC} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}}, \qquad P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c},$$
where $N_{\mathrm{correct}}$ and $N_{\mathrm{total}}$ are the numbers of correctly classified and total recordings, and $TP_c$, $FP_c$, and $FN_c$ are the true positives, false positives, and false negatives of class $c$. We resampled all recordings to 20,480 Hz and applied the STFT to them.

Three types of systems were used as baseline systems for UST as well. The official baseline system of DCASE 2020 Task 5 uses a multi-layer perceptron model, which consists of one hidden layer of size 128 and an autopool layer [50]. OpenL3 embeddings are taken as the input.

We use the macro area under the precision-recall curve (macro-AUPRC) as the classification metric. The macro-precision (macro-P) and macro-recall (macro-R) are defined as:
$$\text{macro-P} = \frac{1}{K} \sum_{k=1}^{K} P_k, \qquad \text{macro-R} = \frac{1}{K} \sum_{k=1}^{K} R_k,$$
where $P_k$ and $R_k$ are the precision and recall of class $k$. We varied the threshold from 0 to 1 to compute different macro-P and macro-R values, and calculated the area under the resulting P-R curve to obtain the macro-AUPRC. Moreover, the micro-AUPRC and micro-F1 score are used as additional metrics.

The dataset used for ASD is the development set of DCASE 2021 Task 2 [51, 52], consisting of normal and anomalous sounds of 7 types of machines. Each recording is a single-channel, 10-second audio clip recorded in a real test environment. Fig. 4 shows an overview of the development set for ASD.

There are also three types of baseline systems for ASD. The official baseline system of DCASE 2021 Task 2 is a MobileNetV2-based system [8]. This system is trained in a self-supervised manner using the IDs of the machines. Besides the official baseline system, the remaining two types of systems (CNN5, CNN9, CNN13, and CNN-biGRU-Avg) are the same as described in Sec. 4.1.3.

This task is evaluated with the area under the curve (AUC) of the receiver operating characteristic (ROC). We first define the anomaly score $A(x)$ as:
$$A(x) = \frac{1}{N} \sum_{n=1}^{N} -\log \hat{y}_n,$$
where $N$ is the number of input features extracted from the log Mel spectrogram by shifting a context window by 8 frames, and $\hat{y}_n$ is the softmax output of the network for the $n$-th input feature. The AUC can then be defined as:
$$\mathrm{AUC} = \frac{1}{N_{+} N_{-}} \sum_{i=1}^{N_{+}} \sum_{j=1}^{N_{-}} \mathcal{H}\big(A(x_{j}^{-}) - A(x_{i}^{+})\big),$$
where $x^{+}$ and $x^{-}$ are normal and anomalous test input features, $N_{+}$ and $N_{-}$ are the numbers of normal and anomalous test samples, respectively, and $\mathcal{H}(x)$ returns 1 when $x > 0$ and 0 otherwise. Moreover, the partial AUC (pAUC) is used as an additional metric, which is calculated from a portion of the ROC curve over a pre-specified range of interest.

In this section, we demonstrate the results of the experiments and give further discussion of the systems from two aspects: the cross-task aspect, where generality is analyzed, and the subtask aspect, where individuality is analyzed. For the cross-task aspect, we compare the general performance of the proposed cross-task system with the other baseline systems, and further investigate the importance of the SE and Transformer modules and the data augmentation methods in our system.

First, we compare the proposed cross-task system with state-of-the-art systems on the different tasks, i.e., ASC, UST, and ASD.
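As a concrete illustration of the ASD scoring and evaluation described above, the following sketch computes a frame-averaged anomaly score from softmax outputs and the AUC/pAUC over a test set with scikit-learn. The max_fpr value and the use of scikit-learn's standardized partial AUC are assumptions for illustration and are not numerically identical to the challenge's pAUC computation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def anomaly_score(softmax_probs):
    """Average negative log of the softmax outputs over the N context windows
    of one recording (higher score = more anomalous)."""
    eps = 1e-12
    return float(np.mean(-np.log(np.asarray(softmax_probs) + eps)))

def auc_and_pauc(labels, scores, max_fpr=0.1):
    """AUC and a partial AUC restricted to low false-positive rates.

    labels: 1 for anomalous, 0 for normal test recordings.
    scores: one anomaly_score() value per recording.
    Note: sklearn's max_fpr partial AUC is standardized (McClish correction),
    so it approximates, rather than reproduces, the challenge's pAUC definition.
    """
    auc = roc_auc_score(labels, scores)
    pauc = roc_auc_score(labels, scores, max_fpr=max_fpr)
    return auc, pauc
```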
Table 2 shows that the proposed cross-task system surpasses the performance of the official baselines.

In real applications, both the number of parameters and the performance have to be considered, and there must be a trade-off between them. Therefore, we further analyzed the model complexity of the aforementioned systems.

Figure 6: Examples of spectrograms augmented by SpecAugment, mixup and FMix for ASC, UST and ASD.

For the subtask aspect, we further illustrate the performance of the proposed system on individual tasks. In this section, we investigate the class-wise performance of UST using different MSDA methods. First of all, Fig. 8 shows the number of recordings of each class.

In Sec. 5.1.4, we showed some examples of spectrograms and analyzed the cross-task performance of different data augmentation methods. In this section, we compare the performance of ASD using FMix and SpecAugment; the ROC curves of different machine types are illustrated in Fig. 9. As shown in Fig. 9, FMix performs best for most of the machine types, and the anomalous sounds of toyCar, toyTrain, and fan can be correctly detected. These sounds have continuous acoustic characteristics, indicating that FMix can improve the performance on these machines. The performance achieved by SpecAugment is relatively poor on many machine types. This supports the assumption in Sec. 5.1.4 that some key acoustic features on the spectrograms are randomly covered when applying SpecAugment.

Figure 9: ROC curves of different machine types with different data augmentation methods. Without data augmentation (w/o data aug.), source domain (source), target domain (target).

This paper proposes a cross-task system to generally model acoustic knowledge across three different tasks of ESR. In the system, an architecture based on two types of attention mechanisms, named SE-Trans, is presented. This architecture exploits SE and Transformer encoder modules to learn the channel-wise importance and the long-sequence dependencies of the acoustic features. We also adopt FMix to augment the training data and extract robust sound representations efficiently. Experiments show that our proposed cross-task system achieves state-of-the-art performance for ASC, UST, and ASD with low computational resource demand. Further analysis explores the generality and individuality of acoustic modeling for ESR and illustrates the effectiveness and robustness of the proposed system.
References

[1] Heart sound as a biometric
[2] An abnormal sound detection and classification system for surveillance applications
[3] Baby cry sound detection: A comparison of hand crafted features and deep learning approach
[4] AI-based human audio processing for COVID-19: A comprehensive overview
[5] Detection and classification of acoustic scenes and events
[6] Acoustic scene classification: Classifying environments from the sounds they produce
[7] SONYC-UST-V2: An urban sound tagging dataset with spatiotemporal context
[8] Description and discussion on DCASE 2021 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions
[9] Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge
[10] Computational analysis of sound scenes and events
[11] Sound event detection: A tutorial
[12] Real-time monophonic and polyphonic audio classification from power spectra
[13] A novel acoustic scene classification model using the late fusion of convolutional neural networks and different ensemble classifiers
[14] Rethinking the inception architecture for computer vision
[15] Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation
[16] Sequence to sequence learning with neural networks
[17] An MFCC-GMM approach for event detection and classification
[18] Non-speech environmental sound classification using SVMs with a new set of features
[19] Spectral images based environmental sound classification using CNN with meaningful data augmentation
[20] Pre-training of deep bidirectional transformers for language understanding
[21] An image is worth 16x16 words: Transformers for image recognition at scale
[22] PANNs: Large-scale pretrained audio neural networks for audio pattern recognition
[23] Cross-task learning for audio tagging, sound event detection and spatial localization
[24] Attention based convolutional recurrent neural network for environmental sound classification
[25] Sound event detection of weakly labelled data with CNN-Transformer and automatic threshold optimization
[26] Attention-based convolutional neural networks for acoustic scene classification
[27] Techniques and applications of wearable augmented reality audio
[28] Robotic discovery of the auditory scene
[29] Acoustic scene classification: An overview of DCASE 2017 challenge entries
[30] Integrating the data augmentation scheme with various classifiers for acoustic scene modeling
[31] Designing acoustic scene classification models with CNN variants
[32] SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution
[33] Urban sound tagging using convolutional neural networks
[34] Incorporating auxiliary data for urban sound tagging
[35] Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring
[36] Flow-based self-supervised density estimation for anomalous sound detection
[37] Anomalous sound detection using CNN-based features by self-supervised learning
[38] Deep neural network baseline for DCASE challenge
[39] DCASE 2018 challenge Surrey cross-task convolutional neural network baseline, Tech. rep., DCASE2018 Challenge
[40] An attentive survey of attention models
[41] Squeeze-and-excitation networks
[42] Conformer: Convolution-augmented transformer for speech recognition
[43] Improved regularization of convolutional neural networks with cutout
[44] SpecAugment: A simple data augmentation method for automatic speech recognition
[45] mixup: Beyond empirical risk minimization
[46] FMix: Enhancing mixed sample data augmentation
[47] Layer normalization
[48] Acoustic scene classification in DCASE 2020 challenge: Generalization across devices and low complexity solutions
[49] Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems
[50] SONYC urban sound tagging (SONYC-UST): A multilabel dataset from an urban acoustic sensor network
[51] Sound dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions
[52] ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions
[53] DCASE 2021 task