title: Face Mask Recognition from Audio: The MASC Database and an Overview on the Mask Challenge
authors: Mohamed, Mostafa M.; Nessiem, Mina A.; Batliner, Anton; Bergler, Christian; Hantke, Simone; Schmitt, Maximilian; Baird, Alice; Mallol-Ragolta, Adria; Karas, Vincent; Amiriparian, Shahin; Schuller, Björn W.
date: 2021-10-04
journal: Pattern Recognit
DOI: 10.1016/j.patcog.2021.108361

The sudden outbreak of COVID-19 has resulted in tough challenges for the field of biometrics, due to its spread via physical contact and the regulations on wearing face masks. Given these constraints, voice biometrics can offer a suitable contact-less biometric solution; they can benefit from models that classify whether a speaker is wearing a mask or not. This article reviews the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational PARalinguistics challengE (ComParE), which focused on the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not. First, we report the collection of the Mask Augsburg Speech Corpus (MASC) and the baseline approaches used to solve the problem, achieving a performance of 71.8 % Unweighted Average Recall (UAR). We then summarise the methodologies explored in the submitted and accepted papers, which mainly used two common patterns: (i) phonetic-based audio features, or (ii) spectrogram representations of audio combined with Convolutional Neural Networks (CNNs) typically used in image processing. Most approaches enhance their models by adopting ensembles of different models and by attempting to increase the size of the training data using various techniques. We review and discuss the results of the participants of this sub-challenge, where the winner scored a UAR of 80.1 %. Moreover, we present the results of fusing the approaches, leading to a UAR of 82.6 %. Finally, we present a smartphone app that can be used as a proof of concept demonstration to detect in real time whether users are wearing a face mask; we also benchmark the run-time of the best models.

• Introduction of the Mask Augsburg Speech Corpus (MASC) database.
• Summary of the Mask Sub-Challenge (MSC) and its baseline approaches.
• Explanation and comparison of the approaches of the top participants in the challenge.
• Summary of the results of the Mask Sub-Challenge from ComParE 2020.
• Novel fusion results, obtained by fusing the approaches of the best participants.
• Discussion of the approaches and the results regarding several aspects.
• A proof of concept demonstration Android app.
• Benchmarking of the serving run-time of the top models.

Wayman [1] defines biometric authentication or biometrics as "the automatic identification or identity verification of an individual based on physiological and behavioural characteristics". According to this study, several biometric characteristics are available for biometric system designers to choose from, including but not limited to "fingerprints, voice, iris, retina, hand, face, handwriting, keystroke, and finger shape" [2]. These can be subdivided into those that require physical contact of some sort for the characteristic to be verified, such as fingerprints, handprints, or handwriting, and contact-less ones that do not, such as voice, iris, retina, or face.
The sudden outbreak of the COVID-19 pandemic presents a significant challenge to the field of biometrics in two ways. First, the virus stays active on surfaces for a long period of time [3], discouraging users from physically interacting with biometric devices shared between multiple people. Consequently, contact-less biometrics have become more crucial in the presence of COVID-19. Second, the virus also spreads in an airborne fashion [3], prompting health authorities worldwide to urge the general public to wear face masks regularly to reduce its spread [4]. This everyday use prevents existing facial identification systems from functioning properly, whether they are personal (such as those found in personal computers or mobile phones) or public (found in personnel-restricted areas, such as hospitals or airports). For these reasons, voice biometrics are among the few suitable contact-less biometrics, as masks impact them less than facial biometrics [5, 6]. Furthermore, voice biometrics can be convenient in various contexts, e.g., in health care [7]. They are easy to use, since any smartphone equipped with a microphone can be utilised [7], and they do not require special training for the users, because there are many scenarios in which users already operate their smartphones by way of speech communication [7].

Speaker identification and verification systems have been researched for a long time [8]; several benchmark datasets are available in this domain [9, 10]. Deep Learning (DL) voice biometrics systems have recently been proposed as well [11]. Saeidi et al. [6] found that the performance of speaker identification systems deteriorates when the conditions under which the systems are trained are mismatched with the conditions under which they are evaluated. In the context of masks, this means that it is best to identify mask-wearing speakers with models trained on audio from mask-wearing speakers, and non-mask-wearing speakers with models trained on audio recorded without masks. Consequently, automatically classifying whether a speaker is wearing a mask or not based on voice characteristics can improve voice biometrics systems.

The effect of wearing a mask has been thoroughly studied in other contexts; it has been shown to impact the human-to-human perception of speech, although research results have been contradictory as to whether this impact is significant for non-hearing-impaired people [12] or not [13]. Llamas et al. [14] conducted a thorough investigation of the acoustic effects of different types of face coverings and whether they affect intelligibility; their findings seem to agree with those of Kawase et al. [15], who conclude that the impact of mask-wearing seems to stem from the loss of the visual information that the brain uses to compensate for degradation in auditory information, if not from a direct effect of the facial coverings on the acoustics themselves. Some studies [16, 17] analysed the effects of wearing a mask from an acoustic perspective. They found that the affected frequencies lie within the range of 1-8 kHz, with the greatest impact within the range of 2-4 kHz. These ranges overlap with the ranges relevant for voice biometrics, namely < 1 kHz and 3-4.5 kHz [18]. Audio models that predict whether a speaker is wearing a mask or not can offer insights into the relevant effects of mask-wearing by examining the audio features employed by the models.
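To make the band effects reported above more tangible, the following minimal Python sketch (not part of the challenge pipeline) compares the average spectral energy within the 2-4 kHz band between two recordings of the same speaker; the file names are hypothetical placeholders, and librosa is assumed to be available.

```python
import numpy as np
import librosa

def band_energy_db(path, f_lo=2000.0, f_hi=4000.0, sr=16000, n_fft=1024):
    """Average log-magnitude energy of a recording inside [f_lo, f_hi] Hz."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=n_fft))          # (freq_bins, frames)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)  # bin centre frequencies
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return float(librosa.amplitude_to_db(spec[band]).mean())

# Hypothetical file names standing in for a masked and an unmasked recording.
for label, path in [("mask", "speaker01_mask.wav"), ("clear", "speaker01_clear.wav")]:
    print(label, f"{band_energy_db(path):.1f} dB in 2-4 kHz")
```

In line with the attenuation reported by [16, 17], one would expect a somewhat lower value for the masked recording in this band.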
Machine Learning (ML), and DL especially, have gained much momentum during the last decade. In the field of image processing, Convolutional Neural Networks (CNNs) [19, 20, 21] have been used for image classification. Similarly, CNNs [22], Recurrent Neural Networks (RNNs) [23], and generic audio features [24, 25] have also been used for audio classification. Given their capabilities, ML and DL have been used to tackle several issues related to the COVID-19 pandemic and other medicine-related problems like cancer detection [26]. Shuja et al. [27] survey a wide range of datasets concerned with several aspects related to COVID-19, including datasets of medical images, e.g., chest X-ray scans, and audio sounds, e.g., coughing and breathing. There are surveys of ML-based COVID-19 diagnosis by way of speech [28] or by way of medical images [29]; examples of speech-based applications can be found in [30, 31]. DL has also been successfully applied in biometrics; however, advances in voice biometrics are not as pronounced as those in face biometrics [32, 33], which necessitates bridging the gap between the two domains.

In audio processing, spectrograms (visual representations of audio signals) are often employed. They allow the modelling of audio problems using computer vision techniques, where CNNs have recently shown substantial advancements [34]. For example, Deep Spectrum utilises pretrained CNNs to extract salient visual features from spectrograms and has proven effective in several tasks like snoring-sound classification [35]. Another technique is transfer learning, which uses pretrained CNNs and enhances them further by fine-tuning them to be better suited for specific tasks; this is also employed in [22, 36], where large-scale training of CNNs on audio data was performed from scratch. There is a convergence of several DL-based methodologies between the image and audio domains, and consequently, audio processing has made use of the recent advancements in computer vision using DL.

Among the contributions of this article is giving insight into the effects of wearing masks and their impact on audio signal processing.

The article is structured as follows: we review the MSC in Section 2, including its database, evaluation, and baseline features. Then, we present the approaches and results of the participants in Section 3. Their strengths and weaknesses are discussed in Section 4, as well as their usefulness and limitations for voice biometrics. Furthermore, in Section 5, we showcase an Android-based smartphone app that can be used as a proof of concept to deploy audio-based face mask recognition models; a benchmark of the run-time of the top models is included. Concluding remarks are given in Section 6.

The participants performed different tasks while wearing a mask as well as without: they read the story "Der Nordwind und die Sonne" ("The Northwind and the Sun") out loud, answered some questions, repeated prerecorded words after listening to them, read words commonly used in medical operation rooms, drew a picture and talked about it, and described pictures, e.g., of food, sports activities, families, kids, or locations. Both free speech and the reading of a defined word list from the medical field are included in the data. The corpus is monolingual (German). Additional meta-data about the identity of the speakers and the speaking tasks were saved for each track. In order to prepare MASC for MSC, the audio was first downsampled to 16 kHz and converted to mono/16 bit.
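As a rough illustration of this preprocessing step (16 kHz, mono, 16 bit), the following sketch shows one possible way to convert a recording with librosa and soundfile; the file names are placeholders, and the exact toolchain used by the corpus authors is not specified in the text.

```python
import librosa
import soundfile as sf

def to_masc_format(src_path: str, dst_path: str, target_sr: int = 16000) -> None:
    """Resample to 16 kHz, mix down to mono, and write 16-bit PCM WAV."""
    y, _ = librosa.load(src_path, sr=target_sr, mono=True)  # resamples and mixes down
    sf.write(dst_path, y, target_sr, subtype="PCM_16")      # 16-bit output

to_masc_format("raw_track.wav", "track_16k_mono.wav")       # hypothetical file names
```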
The data was then partitioned into the Train, Dev, and Test sets. Visually, the masked signal seems to be a bit more blurred; perceptually, the difference is rather indistinguishable in this setting (see Figure 1: high-quality microphone, sound-proof room). This will surely change in unfavourable listening conditions, at a distance, or with environmental noise.

In MSC, the performance of the binary classification problem is evaluated using the Unweighted Average Recall (UAR), which is given by

UAR = \frac{1}{2}\left(\frac{TP_{mask}}{TP_{mask} + FN_{mask}} + \frac{TP_{clear}}{TP_{clear} + FN_{clear}}\right),

where TP_{mask} and FN_{mask} are the counts of true positives and false negatives for the mask class, and TP_{clear} and FN_{clear} are the corresponding counts for the clear class; UAR is thus the unweighted mean of the two class-wise recalls. With the mask class treated as the positive ('match') class, the False Match Rate (FMR) and the False Non-Match Rate (FNMR) can be written as

FMR = \frac{FP_{mask}}{FP_{mask} + TN_{mask}}, \qquad FNMR = \frac{FN_{mask}}{TP_{mask} + FN_{mask}},

where FP_{mask} and TN_{mask} are the counts of false positives and true negatives with respect to the mask class. It must be highlighted that measures like FMR and FNMR in biometrics typically refer to different contexts, where the classification problem is the acceptance or rejection of transactions or authentication attempts [40]. Nevertheless, we will report the values of these measures for the final results.

In this subsection, we describe the different features that are used for the baseline approach; a comprehensive explanation and an exploration of the hyperparameters are provided by Schuller et al. [38]. The best fusion of the baseline systems achieved a UAR of 71.8 %.

Bag-of-Audio-Words (BoAW): Computed on top of the low-level descriptors of ComParE, the BoAW approach has proven its effectiveness in a large variety of audio classification tasks, e.g., acoustic event detection [45].

Deep Spectrum: The Deep Spectrum feature extraction toolkit is applied to obtain high-level deep visual features from the input audio data utilising pretrained CNNs [35]. Mel-spectrograms of the audio are passed through a pretrained ResNet50 [21] (trained on ImageNet), and the activations of the 'avg_pool' layer are extracted, resulting in a 2 048-dimensional feature vector. Deep Spectrum features have been shown to be effective, e.g., for speech processing [46] and audio-based medical applications [47].

auDeep: On the basis of recurrent Sequence-to-Sequence Autoencoders, auDeep learns unsupervised representations of the audio signals, which serve as a further baseline feature set.

In this section, we elaborate on the individual approaches of the participants and on the results of fusing their approaches. Many approaches incorporate the baseline features [38] as extra models for their ensembles. Table 2 gives an overview of the performance of all approaches, with some of their highlights. In Figure 3, we show an abstract form of a pipeline adopted in one way or another by most approaches. The final predictions of the individual models are typically combined by majority voting; some approaches use ML models for ensembling, such as Support Vector Machines (SVMs) [53], for the final predictions.

We categorise the approaches according to two main criteria: whether they are algorithm-based or feature-based, and whether they are generic or specific. Generic approaches describe methods that can be adapted to other tasks with only minor changes, while specific approaches deal with methods that have components particularly tailored for the task at hand. A comparison of the approaches regarding these two aspects is shown in Figure 4, which places the contributions of Szep and Hariri [54], Montacié and Caraty [55], Koike et al. [56], Markitantov et al. [57], Klumpp et al. [58], Yang et al. [59], Ristea and Ionescu [60], Schuller et al. [38], and Illium et al. [61] along these two axes. It may not come as a surprise that most of the approaches are localised in the upper right quadrant: most of the approaches employed are not created for this specific problem but are generic and have been developed for other (types of) problems.
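Before turning to the individual approaches, the evaluation measures defined above can be made concrete. The following minimal Python sketch computes UAR together with FMR and FNMR (under the mapping above, with "mask" taken as the positive class) from binary labels; the toy labels are hypothetical, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 1 = "mask", 0 = "clear" (hypothetical predictions, not challenge data).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# UAR is the unweighted mean of the per-class recalls ("macro" averaging).
uar = recall_score(y_true, y_pred, average="macro")

# Counts with "mask" taken as the positive / 'match' class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
fnmr = fn / (tp + fn)   # mask samples missed (classified as clear)
fmr = fp / (fp + tn)    # clear samples falsely classified as mask

print(f"UAR = {uar:.3f}, FMR = {fmr:.3f}, FNMR = {fnmr:.3f}")
```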
We now describe the approaches found in the contributions to the challenge that were accepted for the conference (two further approaches can be found in [62, 63], whose authors did not participate in the challenge; we compare a few aspects of them to the participants' approaches). We first deal with the algorithm-based ones, followed by the feature-based ones; in each case, we first describe the approach chosen by the authors in detail, and then roughly assign it a position in the two-dimensional space of Figure 4. Finally, we discuss the key strengths, weaknesses, and findings in Section 4.

Szep and Hariri [54] mainly use several spectrogram features: first, they adopt 3-channel spectrograms with different bandwidths (wide and narrow), with cutoffs at different noise levels: 0 dB and -70 dB. Additionally, they use transfer learning and fine-tuning on three standard image classification CNNs, namely VGG19 [19], DenseNet121 [20], and ResNet101 [21]. This results in 12 combinations of models and features, which they ensemble. Furthermore, they merge the Train and Dev sets, train five times using 5-fold cross-validation, and ensemble the models resulting from each fold. The use of cross-validation allows the approach to make use of more of the available data. For the data generation part, they utilise simple image augmentation techniques, e.g., rotation (up to 3 degrees) and warping.

The procedure followed in [54] is generic, since it is not tailored to the given task and can be used as is for other speech classification tasks. Furthermore, it adopts standard components, mainly state-of-the-art image classification models and features based on several variants of spectrograms, which are generic components that can be applied to any audio data.

The approach in [56] is generic, because it consists of several components widely utilised in DL to enhance models and reduce the effect of overfitting. Ensembling with the baseline approaches is not algorithm-based; however, the method is still effective without this component.

Markitantov et al. [57] submitted five different models to the MSC. These models are all based on two models, ResNet18v1 and ResNet18v2, which are variations of the standard ResNet18 [21]. They make use of four parallel ResNet18s, which are connected to fully-connected layers at the end. The models take log-Mel spectrograms with 64 Mel bands as input and were cross-validated using a variation of k-fold cross-validation, with the Train and Dev sets shuffled together and split into k/2 stratified segments. Their best model is an ensemble of two versions of ResNet18v2, each trained with a different optimisation algorithm. The approaches introduced in [57] are all generic audio-based approaches that depend on variations of the standard ResNet18 model. As such, they can easily be used for other audio tasks without much change.

Ristea and Ionescu [60] use spectrograms as audio representation; however, they employ the real and imaginary components as two separate channels, as opposed to their magnitude as a single channel, which is commonly used.

Many augmentation techniques and CNN architectures are explored, and the best combination is used. The augmentations are: speed, loudness, time-shift, random noise, or SpecAugment [51]. From these, time-shifting has proved to be the best suited for the task. The method provides a generic framework for audio classification tasks and does not depend on the task at hand in particular, but can be applied to other audio classification tasks.
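Several of the systems above follow the same transfer-learning pattern: an ImageNet-pretrained CNN receives (log-)Mel spectrograms rendered as images and is fine-tuned for the binary mask/clear decision. The sketch below illustrates that pattern with torchvision's ResNet18; the hyperparameters and tensors are illustrative and do not reproduce any particular submission.

```python
import torch
import torch.nn as nn
from torchvision import models

# Generic pattern used by several teams: take an ImageNet-pretrained CNN and
# replace its classification head with a 2-class output (mask / clear).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # illustrative learning rate
criterion = nn.CrossEntropyLoss()

def train_step(batch_images: torch.Tensor, batch_labels: torch.Tensor) -> float:
    """One optimisation step on a batch of 3-channel spectrogram 'images' (N, 3, H, W)."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(batch_images), batch_labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Smoke test with random tensors standing in for spectrogram crops.
print(train_step(torch.randn(4, 3, 224, 224), torch.tensor([0, 1, 1, 0])))
```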
Montacié and Caraty [55] build three different types of models. This approach has some generic components, like MBS, which uses the baseline features, and partly specific features like the phonetic features; partly, because the single phones are language-specific, whereas phone classes are rather language-generic. Therefore, it can be conceived of as a hybrid approach.

Klumpp et al. [58] try to reduce the problem to a phoneme discrimination task (strictly speaking, it is rather a phone and not a phoneme recogniser). This is a specific method that is tailored to the task at hand. An essential part of the training pipeline is based on the assumption of distinguishing between mask and clear. It might not be straightforward to use this same methodology as is for other audio processing tasks that have nothing to do with speech.

The method followed by [59] is generic, as it depends on generic audio features, namely several representations of the ComParE feature set combined with Fisher Vectors (FV), which can be applied to audio processing in general.

The participants' results are iteratively fused by majority vote, gradually adding the next best system. In case of a stalemate, the label of the best single system (position 1) is used, which leads to the same UAR when fusing the first two approaches. Since the submitted labels are binary, other fusion techniques cannot be applied. Figure 5 shows that a fusion of the best five classification systems leads to an absolute improvement of 2.5 % UAR compared to the best single approach, resulting in a final and best MASC Test set UAR of 82.6 %. (Note that combining the best results of participants by using late fusion with simple majority voting has often been slightly superior to the results of the winning system in former ComParE challenges, see, e.g., [71].)

In Figure 5, performance rises until it peaks at five fusions; then it declines more or less slowly to the level of the winning system. In our experience from earlier challenges, fusion often does not pay off when the winning system itself employs several fusion steps. For MSC, the following four systems obviously contribute to modelling variety in the data and thus to the performance. (Figure 5 caption: the systems are ordered as in Table 2; the fusion calculation only considers classification systems of the 21 participating teams, including not submitted/rejected papers, and not the original baseline system provided by Schuller et al. [38]; position 2 is empty because, in the case of a tie, the winning system is chosen; the best fusion, of five systems, is given in dark green.)

Figure 6 visualises a two-sided significance test ([67], chapter 5B) based on the MASC Test set and the corresponding baseline system [38].

Based on the characteristics of the approaches detailed in Section 3, we discuss here individual aspects of the different approaches. An essential key ingredient in all approaches is the use of ensembles, i.e., combining several distinct approaches by employing majority voting, averaging the results, or merging the features and training an extra classifier (an SVM in most cases). Ensembling turned out to be successful in all of the approaches; typically, ensembling reduces the overall variance of the approach and consequently obtains better results [70]. The top approach in [54], for instance, ensembles 60 models in total (20 instances of each of its three CNN architectures).
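The late fusion described above is a plain majority vote over the binary label submissions, with ties broken in favour of the best single system. A minimal sketch of this scheme, with hypothetical toy labels rather than actual submissions, could look as follows:

```python
import numpy as np

def majority_vote(predictions: np.ndarray, tie_break: np.ndarray) -> np.ndarray:
    """Late fusion of binary label matrices.

    predictions: (n_systems, n_samples) array of 0/1 labels, one row per system.
    tie_break:   (n_samples,) labels of the best single system, used on a stalemate.
    """
    votes_for_one = predictions.sum(axis=0)
    n_systems = predictions.shape[0]
    fused = (votes_for_one * 2 > n_systems).astype(int)   # clear majority for label 1
    ties = votes_for_one * 2 == n_systems                 # even split between 0 and 1
    fused[ties] = tie_break[ties]                         # fall back to the best system
    return fused

# Toy example with four hypothetical systems over five samples.
preds = np.array([[1, 0, 1, 1, 0],
                  [1, 1, 0, 1, 0],
                  [0, 1, 1, 0, 0],
                  [0, 0, 1, 1, 1]])
print(majority_vote(preds, tie_break=preds[0]))
```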
A common aspect of DL is attempting to increase the size of the training data by employing different techniques; small data sets are prone to overfitting, hence using more data is always recommended for a better generalisation of CNN-based models. However, due to the uniqueness and specificity of MASC, the participants could not simply acquire additional data, except for Montacié and Caraty [55], who utilised MSSC. This led the other participants to attempt data generation techniques instead.

Phonemes/phones are obviously highly relevant parameters, impacted differently by being filtered through a mask; this is shown by the high performance obtained by Montacié and Caraty [55], who, similarly to [63], break down the results to show the impact on groups of phonemes/phones.

The authors of [54] suspect that the performance of log-scale spectrograms is degraded compared to linear-scale spectrograms because the log-scale spectrograms focus more on the lower frequency range (< 1 kHz) than on the higher frequencies. This agrees with the attenuation within the range 1-8 kHz due to wearing masks, as concluded by [16, 17]. The extracted ranges strongly intersect with the ranges relevant for speaker identification, namely < 1 kHz and 3-4.5 kHz [18]. Together with earlier studies, these findings suggest that wearing a mask indeed has a general acoustic effect, and a particular effect on tasks in voice biometrics such as speaker identification.

Klumpp et al. [58] conducted an analysis of which phoneme groups were most affected by using masks, the top four groups being unvoiced plosives, fricatives, approximants, and vibrants. We have seen in Figure 2 that fricatives are affected by filtering through the mask; approximants and vibrants have 'weak' and variable characteristics and might be prone to the same influences. Montacié and Caraty [55] investigated which phoneme groups are more predictive of using masks or not; they concluded that the top four groups are diphthongs, laterals, central vowels, and back vowels. The frequency ranges modelled by [54] and the phoneme classes employed by [58] and [55] cannot be fully mapped onto each other. This might be due to the different types of modelling. Nevertheless, overall this demonstrates the relevance of clustering according to phonetic knowledge.

A practical aspect that is not considered in the presented approaches is their run-time. The approaches are solely focused on the final performance, which often leads to utilising many models to increase the performance. A complete analysis in this regard is not available; however, we assume that the two approaches with the highest performance are probably also the ones with the worst run-time. It is not very surprising that these methods, which ensemble many models, are costly for inference. Furthermore, such an approach is tailored because, in a sense, it performs a form of brute-forcing over many possible features, and it obtains the best ones for the task at hand.

Moreover, when image processing is applicable, e.g., in multimodal biometrics, it is plausible that it yields better results for mask recognition than audio processing; e.g., Mohan et al. [74] achieve over 98 % for classifying whether a person is wearing a mask or not. As a result, this would surpass the models presented in this work. On the other hand, wearing a mask still strongly challenges face biometrics compared to voice biometrics [5, 6], even if face biometrics are better at classifying masks. This opens the space for applications to switch from using face biometrics to using voice biometrics, or to multimodal biometrics combining voice biometrics and other non-face contact-less biometrics; in these two scenarios, the presented models would be of direct aid. In particular, they would be used to automatically select speaker identification or verification models that are fine-tuned for dealing either with mask-wearing speakers or with non-mask-wearing speakers, which is expected to deliver the best results [6].
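Coming back to the data generation aspect discussed above: masking-based augmentation such as SpecAugment [51] can be sketched in a few lines. The function below applies one random frequency mask and one random time mask to a toy mel-spectrogram; the mask widths are illustrative and not tuned for MASC.

```python
import numpy as np

def spec_augment(mel, freq_mask=8, time_mask=20, rng=None):
    """Minimal SpecAugment-style augmentation: one frequency mask and one time mask.

    mel: (n_mels, n_frames) spectrogram; masked regions are set to the spectrogram mean.
    """
    rng = rng or np.random.default_rng()
    out = mel.copy()
    fill = mel.mean()

    f = rng.integers(0, freq_mask + 1)                 # frequency-mask width
    f0 = rng.integers(0, max(1, mel.shape[0] - f))
    out[f0:f0 + f, :] = fill

    t = rng.integers(0, time_mask + 1)                 # time-mask width
    t0 = rng.integers(0, max(1, mel.shape[1] - t))
    out[:, t0:t0 + t] = fill
    return out

augmented = spec_augment(np.random.rand(64, 300))      # toy 64-band mel spectrogram
print(augmented.shape)
```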
Current smartphones are pocket-sized computers. They can provide biometric functionalities, replacing dedicated devices. Enabling users to access biometric information on their own smartphone has the additional benefit of limiting the need for physical interaction with shared devices, which could help reduce the spread of, for instance, COVID-19 through contaminated surfaces. As a proof of concept demonstrator, we have implemented an Android-based smartphone app to deploy the audio-based face mask detection models summarised in this work in real-life scenarios. The app implements a microphone functionality for users to record their own voice (cf. Figure 7a); the code of the application is available open-source. Once the recording is completed, the media file is transferred over the network to a dedicated server. Upon receipt, we extract the audio component of the media file. Any of the models summarised in this work could be used on the server side; depending on the prediction returned to the app, it shows either the screen of Figure 7b or the one of Figure 7c.

Furthermore, we benchmark the CNN architectures ResNet101 [21], DenseNet121 [20], and VGG19 [19] (used by the top participants [54]) by deploying each model using the Docker setup of TensorFlow Serving on a device with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, an Nvidia RTX 2080 GPU, and 64 GB of RAM. In order to make use of the parallelisation due to batching, we configure the server to wait for 1 ms and to group all received requests into batches (of at most 64 frames) for inference. Table 3 reports the results; the requests are issued on the same local host as the server, in order to diminish the arbitrary effects of network delays. Latency is measured by sending 100 requests sequentially and measuring the average time until a response is returned. Throughput is measured by sending 100 requests concurrently and measuring the total time until all of them are processed; we repeat this 10 times and report the average. Deploying the best ensemble of models used by Szep and Hariri [54] would slow down these values by a factor of 60 (20 instances of each of the 3 models) and would need much more GPU memory. The dedicated server can be avoided altogether by deploying an offline model directly in the app. However, we did not implement this, because it is computationally costly for users without a dedicated mobile GPU, as seen by comparing CPU against GPU.
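The latency and throughput protocol just described can be reproduced with a small client script. The sketch below assumes a hypothetical TensorFlow Serving REST endpoint and a placeholder input shape; it is not the exact client used for Table 3.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

# Hypothetical TensorFlow Serving REST endpoint; model name and input shape are placeholders.
URL = "http://localhost:8501/v1/models/mask_model:predict"
PAYLOAD = json.dumps({"instances": np.random.rand(1, 224, 224, 3).tolist()})

def one_request() -> None:
    requests.post(URL, data=PAYLOAD, timeout=30).raise_for_status()

def mean_latency(n: int = 100) -> float:
    """Average seconds per request when requests are sent one after another."""
    start = time.perf_counter()
    for _ in range(n):
        one_request()
    return (time.perf_counter() - start) / n

def throughput(n: int = 100, workers: int = 32) -> float:
    """Requests per second when n requests are issued concurrently."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: one_request(), range(n)))
    return n / (time.perf_counter() - start)

if __name__ == "__main__":
    print(f"latency:    {mean_latency() * 1000:.1f} ms/request")
    print(f"throughput: {throughput():.1f} requests/s")
```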
In this article, we reviewed the Mask Sub-Challenge (MSC) of ComParE 2020, its MASC database and baseline systems, and the approaches of the participants, who attempted to enlarge the training data mostly by data augmentation, by generation with GANs, or by using Mixup. Furthermore, we presented an analysis of these approaches along two axes: specific against generic characteristics, as well as whether they were based more on algorithms such as deep neural networks or more on hand-crafted (mostly phonetic) features. We then presented a fusion of the top approaches: fusing the top five participants led to the best results. The presented advances could be of benefit to future voice biometrics approaches. We discussed several aspects of the presented approaches. In particular, three key ingredients are necessary for the success of the models, namely ensemble learning, transfer learning, and data generation (mostly by using data augmentation); all top models incorporated at least two of those in some form.

The results obtained show that there is indeed a difference between using a mask and not using a mask for speech processing, which suggests an impact on voice biometrics. This impact will likely be higher in realistic scenarios; note that the MASC scenario was optimal for recordings (a clean, controlled environment, high-quality microphones, and no background noise). We discussed a few practical aspects of the models and some limitations, as well as the suitability of the presented models for voice biometrics. Furthermore, we presented a proof of concept Android app for smartphones. Together with the aforementioned results, this app motivates voice biometrics applications that can benefit from the classification task at hand; such an application can be improved accordingly by automatically choosing a suitable speaker identification model based on the mask-wearing prediction. Also, we benchmarked the run-time of the top participants' models, if they are to be used for serving. Finally, the findings suggest that applications can start to rely more on voice biometrics in the future, especially with the regulations on wearing masks.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 826506 (sustAGE).

References

Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor, in: Proceedings of the ACM International Conference on Multimedia, Association for Computing Machinery
A Definition of
An Introduction to Biometric Authentication Systems
Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1
Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: a systematic review and meta-analysis
Biometrics in the Era of COVID-19: Challenges and Opportunities
Speaker Recognition For Speech Under Face Cover
Voice Biometrics Technologies and Applications for Healthcare: an overview
Speaker Identification and Verification Using Gaussian Mixture Speaker Models
The Speakers in the Wild (SITW) Speaker Recognition Database
The MIT Mobile Device Speaker Verification Corpus: Data Collection and Preliminary Experiments, in: Odyssey - The Speaker and Language Recognition Workshop
Voice Biometrics: Deep Learning-based Voiceprint Authentication System
The effects of surgical masks on speech perception in noise
Speech Understanding Using Surgical Masks: A Problem in Health Care?
Effects of different types of face coverings on speech acoustics and intelligibility
Recruitment of fusiform face area associated with listening to degraded speech sounds in auditory-visual speech perception: a PET study
Acoustic voice characteristics with and without wearing a facemask
Comparison of the Acoustic Effects of Face Masks on Speech
Frequency Analysis of Speaker Identification, in: A Speaker Odyssey - The Speaker Recognition Workshop
Very Deep Convolutional Networks for Large-Scale Image Recognition
Proceedings Computer Vision and Pattern Recognition (CVPR)
Deep Residual Learning for Image Recognition
CNN Architectures for Large-Scale Audio Classification
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
On the Acoustics of Emotion in Audio: What Speech, Music and Sound
Real-time Speech and Music Classification by Large Audio Feature Space Extraction
Deep learning for image-based cancer detection and diagnosis - A survey
COVID-19 open source data sets: A comprehensive survey
An Overview on Audio, Signal, Speech, & Language Processing for COVID-19
Machine and Deep Learning Towards COVID-19 Diagnosis and Treatment: Survey, Challenges, and Future Directions
Detecting COVID-19 from Breathing and Coughing Sounds using Deep Neural Networks
Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data
Biometrics recognition using deep learning: A survey
Deep Learning for Biometrics: A Survey
Recent advances in convolutional neural networks
Snore Sound Classification Using Image-based Deep Spectrum Features
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
Affect, and Personality in Speech and Language Processing
The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks, in: Proceedings INTERSPEECH, ISCA
An Introduction to Information Retrieval
Introduction to Biometrics
The INTERSPEECH 2009 Emotion Challenge
The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism
openXBOW - Introducing the Passau Open-Source Crossmodal Bag-of-Words Toolkit
Robust Sound Event Classification Using LBP-HOG Based Bag-of-Audio-Words Feature Representation
Deep Representation Learning Techniques for Audio Signal
Deep Unsupervised Representation Learning for Audio-Based Medical Applications
Autoencoders for Unsupervised Representation Learning from Audio
auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks
Sparse Autoencoder-Based Feature Transfer Learning for Speech Emotion Recognition
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
mixup: Beyond Empirical Risk Minimization
Pattern Recognition and Machine Learning
Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion, in: Proceedings INTERSPEECH, ISCA
Phonetic, Frame Clustering and Intelligibility Analyses for the INTERSPEECH 2020 ComParE Challenge
Learning Higher Representations from Pre-Trained Deep Models with Data Augmentation for the ComParE 2020 Challenge Mask Task
Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges
Deep Recurrent Phonetic Models
Exploration of Acoustic and Lexical Cues for the INTERSPEECH 2020 Computational Paralinguistic Challenge
Are you Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs
Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms
Mask Detection and Breath Monitoring from Speech: on Data Augmentation, Feature Representation and Modeling
Identifying Surgical-Mask Speech Using Deep Neural Networks on Low-Level Aggregation
The Hieroglyphs: building speech applications using CMU Sphinx and related resources
Random Forests
Image Classification with the Fisher Vector: Theory and Practice
Test of Hypothesis - Concise Formula Summary, ms
The ASA's Statement on p-values: Context, Process, and Purpose
Ethics and Good Practice in Computational Paralinguistics, Transactions on Affective Computing (2020)
The Superiority of the Ensemble Classification Methods: A Comprehensive Review
Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge
Deep feature augmentation for occluded image classification
Efficient densely connected convolutional neural networks
Face Mask Detection for Resource-Constrained Endpoints

Mohamed received his M.Sc. degree in Computer Science at the University of Freiburg, Germany. Currently, he is an external Ph.D. student at the University of Augsburg and a Senior Research Data Scientist at SyncPilot GmbH. His main research interests are in applying deep learning to various applications.

Nessiem received his M.Sc. degree in Computer Science at the University of Augsburg. He is an external Ph.D. student at the University of Augsburg and a Senior Research Data Scientist at SyncPilot GmbH. His interests generally relate to the usage of deep learning within industrial applications and specifically within the field of affective computing.

He is co-editor/author of two books and author/co-author of more than 300 technical articles, with an h-index of 48 and more than 11 000 citations. His main research interests are all (cross-linguistic) aspects of prosody and

His research is focused on machine learning applied to the field of bioacoustics, in order to analyse animal recordings, in particular killer whale underwater signals, to identify significant communication patterns, correlating them to the respective animal behaviour in order to decode animal communication.

USA, and is currently a Ph.D. student at EIHW, University of Augsburg. His research interests include the computational understanding of human affect and health states using multimodal and ubiquitous computing solutions.

He is currently a Ph.D. student at EIHW, University of Augsburg, and a member of the Ph.D. candidate program at BMW. His research focuses on multimodal affect recognition.

Her interests lie in the field of affective computing and speech recognition, focusing on data collection and new machine learning approaches.

A. degree from Columbia University's Computer Music Centre, and is currently a Ph.D. student at EIHW, University of Augsburg. Her research is focused on intelligent audio analysis in the domain of both speech and general audio.

Currently, he is a habilitation candidate at EIHW, University of Augsburg. His main research focus is deep learning, unsupervised representation learning, and transfer learning for machine perception.

He received his diploma degree in Electrical Engineering from RWTH Aachen University. His research is focused on signal processing, intelligent audio analysis, machine learning, and crossmodal affect recognition.
Schuller received his diploma, Ph.D. degree, habilitation, and the title of Adjunct Teaching Professor in Machine Intelligence and Signal Processing, all in EE/IT, from TUM in Munich, Germany. He is Full Professor of Artificial Intelligence and the Head of GLAM at Imperial College London, UK, as well as Full Professor and Chair of EIHW at the University of Augsburg, Germany.

The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.