title: Classification of Speech with and without Face Mask using Acoustic Features
authors: Das, Rohan Kumar; Li, Haizhou
date: 2020-10-08

The understanding and interpretation of speech can be affected by various external factors. The use of a face mask is one such factor that can obstruct speech during communication. This may degrade automatic speech processing and also affect human perception. Knowing whether a speaker wears a mask may be useful for modeling speech in different applications. With this motivation, finding whether a speaker wears a face mask from a given speech signal is included as a task in the Computational Paralinguistics Evaluation (ComParE) 2020. We study novel acoustic features based on linear filterbanks, instantaneous phase and long-term information that can capture the artifacts for classification of speech with and without face mask. These acoustic features are used along with the state-of-the-art baselines of ComParE functionals, bag-of-audio-words, DeepSpectrum and auDeep features for ComParE 2020. The studies reveal the effectiveness of the acoustic features, and their score-level fusion with the ComParE 2020 baselines leads to an unweighted average recall of 73.50% on the test set.

Speech is a natural way of human-human and human-robot communication [1]. However, understanding and interpreting the message conveyed via speech is affected by external factors, such as background noise and obstruction at either the speech source or the receiver, which degrade the performance of various automatic systems and also adversely affect human perception. There are various practical scenarios where wearing a face mask is required; surgeons working in operating theaters and forensic investigations are among the most common. In addition, the current world situation due to the COVID-19 pandemic has led most people to wear face masks in their daily life. Wearing a face mask thus presents an adverse external factor for speech communication.

Face mask detection has previously been performed using image processing techniques to detect protocol breaches in the operating room [2]. However, such classification has never been investigated using speech [3]. Wearing a face mask may increase the vocal effort, as the mask attenuates the speech energy, as reported in a study conducted on oral examination data post-SARS [4]. Studies also show that the spectral properties of fricatives are affected while using face masks [5]. We believe that exploring different aspects of acoustic cues from a given speech signal can be helpful for detecting the presence of a face mask. The literature shows that surgical masks have little effect on speech understanding by human listeners [6]. A speech recognition study [7] on data collected in surgery rooms showed a high word error rate. Along a similar direction, speaker recognition studies have also been performed with different face masks [8]-[10]. However, it is also found that the use of face masks does not have a large impact on speaker recognition performance [9], [10]. Further, the identification of different face mask types showed that most of them could not be identified correctly [9].

The Computational Paralinguistics Evaluation (ComParE) challenge series is devoted to spearheading novel explorations for various paralinguistic studies [11].
It has been running successfully for more than a decade since its inception [12]. Advances in the fields of paralinguistics and computer science have pushed forward the state-of-the-art systems for various studies [13], [14]. The latest edition, ComParE 2020, runs three tasks, one of which is to find out whether a speaker is wearing a face mask from a given speech segment [3]. In this paper, we report the participation of the NUS team in this task of ComParE 2020.

We consider three novel acoustic features capturing different acoustic properties of a signal. They are linear frequency cepstral coefficients (LFCC) [15], instantaneous frequency cosine coefficients (IFCC) [16] and constant-Q cepstral coefficients (CQCC) [17]. LFCC captures spectral information using linearly spaced filterbanks, whereas IFCC captures the instantaneous phase of a signal. On the other hand, the CQCC features are derived using the long-term constant-Q transform (CQT). All three acoustic features have previously shown their effectiveness for different detection tasks [17]-[21]. We also consider the widely popular mel frequency cepstral coefficient (MFCC) [22] feature as a contrast system, along with the ComParE 2020 baselines using the ComParE functional feature set, bag-of-audio-words (BoAW), DeepSpectrum and auDeep features [23]-[27]. Further, we perform a score-level fusion of the acoustic feature based systems and the four ComParE 2020 baselines for the challenge submission.

The rest of the paper is organized as follows. Section II describes the three acoustic features studied for detecting the presence of face masks. In Section III, the details of the experiments are described. The results and analysis are reported in Section IV. Finally, Section V concludes the work.

In this section, we discuss the three acoustic features considered to capture the artifacts for classification of speech with and without face masks. We describe each of them in detail below.

Short-term processing of speech signals followed by computation of the log power spectrum is one of the most common ways of capturing acoustic artifacts from a speech signal. The discrete cosine transform (DCT) of the log power spectrum is taken to derive the cepstral coefficients in various speech processing applications. Further, filterbanks are used to obtain a compact representation of the high-dimensional spectrum before the cepstral features are derived. MFCC is one of the most widely used acoustic features; it considers triangular filterbanks on a non-linear logarithmic mel scale, where the filters are placed densely in the low-frequency regions [22]. This is motivated by human auditory perception [28]. However, the same may not be applicable for machines to capture discriminative artifacts for classification or detection tasks, as reported in [29]. Therefore, we use linear filterbank based features. The LFCC feature replaces the mel filterbanks in MFCC by linearly spaced triangular filters [15]. As they focus on the artifacts uniformly along the frequency axis, they have previously been found to be useful for detection tasks [15]. Therefore, we consider LFCC as one of the features to capture the acoustic properties of the speech signal in this work.
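To make the LFCC pipeline concrete, a minimal sketch of linear-filterbank cepstral feature extraction is given below. It is an illustrative NumPy/SciPy implementation under assumed settings (20 ms frames, 10 ms shift, 30 coefficients, as used later in Section III); the exact filterbank design and normalization in our experiments may differ.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal.windows import hamming

def lfcc(signal, fs, frame_len=0.02, frame_shift=0.01,
         n_filters=30, n_ceps=30):
    """Illustrative LFCC extraction: framing -> power spectrum ->
    linearly spaced triangular filterbank -> log -> DCT."""
    flen, fhop = int(frame_len * fs), int(frame_shift * fs)
    nfft = int(2 ** np.ceil(np.log2(flen)))
    win = hamming(flen, sym=False)

    # Frame the signal (no voice activity detection for 1-second segments)
    n_frames = 1 + (len(signal) - flen) // fhop
    frames = np.stack([signal[i * fhop:i * fhop + flen] * win
                       for i in range(n_frames)])

    # Short-term power spectrum
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2

    # Linearly spaced triangular filters between 0 and fs/2
    edges = np.linspace(0, fs / 2, n_filters + 2)
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    # Log filterbank energies followed by DCT give the cepstral coefficients
    fb_energy = np.log(power @ fbank.T + 1e-10)
    return dct(fb_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

The MFCC contrast feature follows the same pipeline, with the filter edges placed on the mel scale instead of the linear scale.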
Most acoustic features are derived from the magnitude of the power spectrum. We consider that phase patterns of speech, in particular of aspirated plosives, are affected by the mask filter material. We would therefore like to study the IFCC feature, which is derived from the analytic phase of a signal [16]. The issue of phase wrapping is avoided by using Fourier transform properties to obtain the instantaneous frequency. The instantaneous frequency $\theta'[n]$ of a discrete-time signal can be derived as

$$\theta'[n] = \frac{2\pi}{N}\,\mathrm{Re}\!\left[\frac{F_d^{-1}\{\,k\,Z[k]\,\}}{F_d^{-1}\{\,Z[k]\,\}}\right],$$

where $k = 1, 2, \ldots, K$ represents the frequency bin index, $N$ is the length of the narrowband signal, $F_d^{-1}$ indicates the inverse discrete Fourier transform, and $Z[k]$ is the discrete Fourier transform of the analytic signal $z[n]$, obtained from the narrowband component of the given signal [16], [30]. The DCT is then applied on the instantaneous frequency components to obtain the IFCC features. These features carry long-range acoustic information, as short-term processing is performed only at the end to obtain frame-level features. As they are derived using the phase of a signal, they show acoustic properties complementary to many common features obtained from the magnitude spectrum. These features have also been successfully used for detection of spoofing attacks and orca activity [18], [20], [31].

We consider another aspect of acoustic property captured by long-term processing. The CQT is a long-term window transform [32] and differs from traditional features derived by short-term processing over a window of a few milliseconds. Unlike the discrete Fourier transform, it has not only a higher frequency resolution for lower frequencies, but also a higher temporal resolution for higher frequencies. In addition, the geometrically distributed center frequencies of the filters and the octave structure make it unique, especially for detection and classification tasks [17], [33]. Previous studies used the CQT to derive CQCC features that have been found effective for detection tasks [17], [20]. We believe they can also help to capture useful artifacts for classification of speech with and without face mask. We note that uniform resampling is applied to the CQT-based log power spectrum, followed by DCT, to derive the CQCC features. For a given signal $x(n)$, its constant-Q transform $Y(k, n)$ is computed as

$$Y(k, n) = \sum_{j = n - \lfloor N_k/2 \rfloor}^{n + \lfloor N_k/2 \rfloor} x(j)\, a_k^{*}\!\left(j - n + \frac{N_k}{2}\right),$$

where $k = 1, 2, \ldots, K$ represents the frequency bin index, $N_k$ are the variable window lengths, $a_k^{*}(n)$ denotes the complex conjugate of $a_k(n)$, and $\lfloor\cdot\rfloor$ denotes rounding towards negative infinity. The basis functions $a_k(n)$ are complex-valued time-frequency atoms and are defined in [17].

Fig. 1 shows a comparison of speech with and without a mask along with the corresponding spectrograms and pyknograms. A pyknogram is a scatter plot denoting the time-frequency representation of instantaneous frequencies from different filters in the filterbank [34]. We find that the effect of wearing a mask is not very visible at the waveform level. However, it is observed that the spectrogram of speech without a mask has more prominent energy trajectories than that of speech with a mask. In addition, the pyknograms of speech with and without a mask are different, showing that phase information is another useful artifact. We believe that the three features discussed above, capturing different acoustic properties, can capture the attenuation in the frequency components of the speech signal due to the wearing of face masks. The information thus captured has definite potential to classify speech with and without mask.
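To illustrate the phase based and long-term features, the sketch below computes the instantaneous frequency of a narrowband signal from its analytic signal using the FFT relation above, and a rough CQCC-style representation from a constant-Q spectrogram. This is a simplified approximation rather than the reference implementations of [16] and [17]: the narrowband filterbank decomposition used for IFCC is omitted, and librosa's CQT, SciPy's resampling, and the chosen parameters (fmin, bins per octave, hop length) are assumptions of this sketch.

```python
import numpy as np
import librosa
from scipy.signal import hilbert, resample
from scipy.fft import dct

def instantaneous_frequency(x):
    """Instantaneous frequency of a narrowband signal x via its analytic
    signal z[n], using theta'[n] = (2*pi/N) Re{IDFT{k Z[k]} / IDFT{Z[k]}}.
    For IFCC, this is applied per filterbank channel and a DCT is then
    taken across channels at the frame level."""
    z = hilbert(x)                       # analytic signal z[n]
    N = len(z)
    Z = np.fft.fft(z)
    k = np.arange(N)
    num = np.fft.ifft(k * Z)
    den = np.fft.ifft(Z)                 # equals z[n] up to numerical error
    return (2 * np.pi / N) * np.real(num / (den + 1e-12))

def cqcc_like(x, fs, n_ceps=30, n_uniform=128):
    """Rough CQCC-style features: CQT -> log power spectrum ->
    uniform resampling along frequency -> DCT (parameters illustrative)."""
    C = librosa.cqt(x, sr=fs, hop_length=160, fmin=60.0,
                    n_bins=144, bins_per_octave=24)
    log_power = np.log(np.abs(C) ** 2 + 1e-10)      # (n_bins, n_frames)
    # Resample the geometrically spaced frequency axis onto a uniform one
    uniform = resample(log_power, n_uniform, axis=0)
    return dct(uniform, type=2, axis=0, norm='ortho')[:n_ceps].T
```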
This section reports the experiments carried out for the mask task of ComParE 2020. The database details and experimental setup are discussed in the following subsections.

We used the Mask Augsburg Speech Corpus (MASC) released by the organizers of ComParE 2020 for the studies [3]. The recordings are from 32 native German speakers wearing surgical masks from Lohmann and Rauscher. The corpus is gender balanced with 16 male and 16 female speakers, whose ages range from 20 to 41 years. The recordings were conducted in a studio environment using a large-diaphragm condenser microphone. Although the original recordings were made at 48 kHz with 24 bits, the challenge participants are provided with a 16 kHz mono, 16-bit version. The duration of the corpus is around 10 hours. The data collection was made with and without masks worn by the participants, who answered questions, read words known for their usage in medical operating rooms, drew a picture and talked about it, and described pictures. Further, the recordings are segmented into non-overlapping 1-second segments for the challenge mask task of finding whether a speaker is wearing a mask or not. The collection of 1-second segments is then partitioned into train, development and test sets, which are released for the mask task of ComParE 2020. Labels indicating whether the speech was recorded with or without a face mask are given for the segments of the train and development sets, whereas the test set is blinded for the challenge. Different explorations can be carried out using the train and development sets to choose the best performing systems to apply on the test set. Table I presents a summary of the corpus released for the mask task of ComParE 2020. The unweighted average recall (UAR) is used as the metric for reporting the challenge results, following the ComParE 2020 protocol [3].

The ComParE 2020 organizers provide four state-of-the-art baseline systems for the mask task. Among these, the ComParE functional feature based system is the official baseline, as in previous editions [23]. These features are obtained using the openSMILE toolkit [35], [36]. Another baseline with BoAW features is learned using the low-level descriptors (LLDs) of the ComParE feature set as well as the deltas of the LLDs. The openXBoW toolkit is used to extract the BoAW features for different codebook sizes [24]. The third baseline, with DeepSpectrum features generated by the DeepSpectrum toolkit, considers a pre-trained ResNet50 model [25], [37]. Lastly, unsupervised representation learning with recurrent sequence-to-sequence autoencoders (S2SAE) is used to obtain the auDeep features with the auDeep toolkit, which serves as the fourth baseline [26], [27]. All four baselines use a support vector machine (SVM) as the classifier to classify speech with and without mask. The organizers also performed a majority voting based fusion among these baselines that serves as the benchmark system on the test set.

We now discuss the systems with the acoustic features considered in this work. The LFCC and IFCC features are obtained for every frame of 20 ms with a shift of 10 ms. As the speech segments are of 1-second duration, we do not perform any voice activity detection. The parameters for CQT followed by CQCC feature extraction follow those given in [17]. We consider ∆ and ∆∆ coefficients for each acoustic feature along with the 30 static coefficients to obtain a 90-dimensional feature representation. The MFCC features, widely used in various speech processing applications, are also considered as a contrast acoustic feature for this study. They are extracted with settings similar to those of LFCC; however, mel filterbanks are used in place of the linear filterbanks.
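The 90-dimensional representation can be formed by stacking the 30 static coefficients with their first and second derivatives. A minimal sketch is shown below; it approximates the usual regression-based delta computation with a simple gradient along the time axis, which is an assumption of this illustration.

```python
import numpy as np

def add_deltas(static):
    """Stack static features with their Delta and Delta-Delta coefficients.
    static: (n_frames, n_coeffs) array, e.g. (T, 30); returns (T, 90)."""
    delta = np.gradient(static, axis=0)    # first-order time derivative
    delta2 = np.gradient(delta, axis=0)    # second-order time derivative
    return np.concatenate([static, delta, delta2], axis=1)
```

Applied to any of the 30-dimensional static features sketched above, this yields a 90-dimensional representation of the kind described here.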
Once the acoustic features are extracted, we use Gaussian mixture models (GMMs) to build, for each feature type, two models of 512 mixtures each: one from the features of speech with mask and one from speech without mask [38]. The choice of GMM for this task follows our previous work on orca activity detection [20]. During testing, the acoustic features of a given test speech segment are extracted and the likelihood is computed against the two models. The test speech is then classified into the category showing the higher likelihood score.

In general, the combination of multiple systems with complementary information helps to improve performance [39]-[41]. Hence, we performed fusion of the various acoustic feature based systems as well as the four baselines in this work. The fusion is carried out with logistic regression at the score level using the Bosaris toolkit [42], which applies the weights of the various systems learned on the development set to the unseen test set.

We re-implemented the four baseline systems of the ComParE 2020 mask task and report their results in Table II. It is noted that the results on the test set are quoted from [3]. The best configuration of each baseline is chosen by tuning various parameters and is shown in bold font. The majority voting based fusion of these best baselines, with a UAR of 71.8%, serves as the benchmark challenge performance.

We now focus on the acoustic feature based systems and their fusion with other systems. We perform fusion studies with the acoustic features and the four ComParE 2020 baselines. Table III shows the score-level fusion of the four acoustic features as well as the four baselines and their comparison to the challenge baselines. We observe that the score-level fusion of the baselines leads to an improved result compared to the challenge benchmark using majority voting based fusion. This may be due to the fact that majority voting fuses the predicted class outputs of the individual systems, whereas score-level fusion employs a weighted scheme to fuse the various systems. We also observe that the score-level fusion of the acoustic features performs better than the individual features, which reveals the complementary acoustic properties captured by each of them. Finally, the score-level fusion of the acoustic features and the four baselines achieves a UAR of 73.50% on the test set, which beats the benchmark baseline by 1.70% absolute.
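A minimal sketch of this back-end is given below: two GMMs are trained per feature type, each test segment is scored by the difference of average log-likelihoods, the scores of several systems are fused with logistic regression, and the fused system is evaluated with UAR. It uses scikit-learn as a stand-in for both the GMM training and the Bosaris-style fusion, so the training options and calibration details differ from the actual setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def train_gmms(frames_mask, frames_clear, n_mix=512):
    """Train one GMM on frames from speech with mask and one on frames
    from speech without mask (frames: (n_frames, feat_dim) arrays)."""
    gmm_mask = GaussianMixture(n_components=n_mix, covariance_type='diag',
                               max_iter=100).fit(frames_mask)
    gmm_clear = GaussianMixture(n_components=n_mix, covariance_type='diag',
                                max_iter=100).fit(frames_clear)
    return gmm_mask, gmm_clear

def segment_score(gmm_mask, gmm_clear, segment_frames):
    """Log-likelihood difference for one test segment; positive -> 'mask'."""
    return gmm_mask.score(segment_frames) - gmm_clear.score(segment_frames)

def fuse_and_evaluate(dev_scores, dev_labels, test_scores, test_labels):
    """Score-level fusion with logistic regression: weights are learned
    on the development scores and applied unchanged to the test scores.
    dev_scores/test_scores: (n_segments, n_systems) arrays."""
    fuser = LogisticRegression().fit(dev_scores, dev_labels)
    pred = fuser.predict(test_scores)
    # Unweighted average recall = macro-averaged recall over the two classes
    return recall_score(test_labels, pred, average='macro')
```

In this sketch the logistic regression plays the role of the Bosaris fusion: its weights are estimated on the development scores of the individual systems and then applied to the test scores.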
This work is devoted to studying novel acoustic features that capture different acoustic properties of a signal to classify speech with and without mask for the ComParE 2020 challenge participation. We focused on features derived using linear filterbanks, instantaneous phase and long-term information of the signal. Among these, the linear filterbank and long-term information based features are found to be more effective for the challenge mask task. The acoustic feature based systems are fused with the ComParE 2020 baselines at the score level for the challenge submission. The submitted system outperformed the challenge benchmark system based on majority voting of the best baselines, showing the usefulness of both score-level fusion and the complementary information of the acoustic features for classification of speech with and without mask.

References
[1] Human-Agent and Human-Robot Interaction Theory: Similarities to and Differences from Human-Human Interaction
[2] Mask and maskless face classification system to detect breach protocols in the operating room
[3] The INTERSPEECH 2020 computational paralinguistics challenge: Elderly emotion, breathing & masks
[4] The impact of wearing a face mask in a high-stakes oral examination: An exploratory post-SARS study in Hong Kong
[5] Speaking under cover: The effect of face-concealing garments on spectral properties of fricatives
[6] Speech understanding using surgical masks: A problem in health care?
[7] Distant talking speech recognition in surgery room: The DOMHOS project
[8] The 'audio-visual face cover corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
[9] Speaker recognition for speech under face cover
[10] Analysis of face mask effect on speaker recognition
[11] The computational paralinguistics challenge
[12] Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge
[13] Paralinguistics in speech and language: State-of-the-art and the challenge
[14] Affective and behavioural computing: Lessons learnt from the first computational paralinguistics challenge
[15] A comparison of features for synthetic speech detection
[16] Significance of analytic phase of speech signals in speaker verification
[17] A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients
[18] Instantaneous phase and excitation source features for detection of replay attacks
[19] Long range acoustic features for spoofed speech detection
[20] Instantaneous phase and long-term acoustic cues for orca activity detection
[21] Long range acoustic and deep features perspective on ASVspoof 2019
[22] Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences
[23] The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism
[24] openXBOW: Introducing the Passau open-source crossmodal bag-of-words toolkit
[25] Snore sound classification using image-based deep spectrum features
[26] auDeep: Unsupervised learning of representations from audio with deep recurrent neural networks
[27] A fusion of deep convolutional generative adversarial networks and sequence to sequence autoencoders for acoustic scene classification
[28] Auditory patterns
[29] Significance of subband features for synthetic speech detection
[30] Computing the discrete-time 'analytic' signal via FFT
[31] Spoof detection using source, instantaneous frequency and cepstral features
[32] Calculation of a constant Q spectral transform
[33] Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification
[34] Speech formant frequency and bandwidth tracking using multiband energy demodulation
[35] openSMILE: The Munich versatile and fast open-source audio feature extractor
[36] Recent developments in openSMILE, the Munich open-source multimedia feature extractor
[37] Deep residual learning for image recognition
[38] Robust text-independent speaker identification using Gaussian mixture speaker models
[39] Combining source and system information for limited data speaker verification
[40] Different aspects of source information for limited data speaker verification
[41] Exploring different attributes of source information for speaker verification with limited test data
[42] The BOSARIS toolkit: Theory, algorithms and code for surviving the new DCF