Title: Mask Detection and Breath Monitoring from Speech: on Data Augmentation, Feature Representation and Modeling
Authors: Haiwei Wu, Lin Zhang, Lin Yang, Xuyang Wang, Junjie Wang, Dong Zhang, Ming Li
Date: 2020-08-12

This paper introduces our approaches for the Mask and Breathing Sub-Challenges in the Interspeech COMPARE Challenge 2020. For the mask detection task, we train deep convolutional neural networks with filter-bank energies, gender-aware features, and speaker-aware features. Support vector machines follow as the back-end classifiers for binary prediction on the extracted deep embeddings. Several data augmentation schemes are used to increase the quantity of training data and improve our models' robustness, including speed perturbation, SpecAugment, and random erasing. For the speech breath monitoring task, we investigate different bottleneck features based on the Bi-LSTM structure. Experimental results show that our proposed methods outperform the baselines and achieve 0.746 PCC and 78.8% UAR on the Breathing and Mask evaluation sets, respectively.

Besides linguistic information, speech delivers various kinds of paralinguistic information, including language, accent, gender, channel, emotion, psychological states, etc. [1]. To explore the automatic identification of paralinguistic attributes in audio data, the Interspeech COMPUTATIONAL PARALINGUISTICS CHALLENGE (COMPARE) has been held at the Interspeech conference every year since 2009 [2]. In 2020, the twelfth COMPARE [3] focuses on three sub-challenges: the Elderly Emotion Sub-Challenge (ESC), the Breathing Sub-Challenge (BSC), and the Mask Sub-Challenge (MSC). COMPARE encourages participants to develop task-dependent and task-independent features and techniques for the different tasks. This year, for both the MSC and BSC tasks, the organizers provide three traditional feature sets: the OPENSMILE acoustic feature set [4] (6373-dimensional for the MSC, 130-dimensional for the BSC), Bag-of-Audio-Words (BoAW) extracted by OPENXBOW [5], and AUDEEP [6]. Finally, a support vector machine (SVM) based classifier/regressor is employed. Besides the above task-independent features, for the MSC, the baseline system extracts a 2048-dimensional DEEP SPECTRUM feature [7, 8, 9] from a pre-trained convolutional neural network, which achieves the best unweighted average recall (UAR) of 70.8% among all single systems on the development set. For the BSC, a sequential regression problem, the baseline system provides an end-to-end deep sequence modeling approach (CNN-LSTM) [10, 11], achieving the highest Pearson's correlation coefficient (PCC) of 0.731 on the development set.

In recent years, deep learning based methods have achieved state-of-the-art performance in many paralinguistic tasks [12, 8, 13, 14, 15]. Convolutional neural network (CNN) [8, 13, 14, 15] and long short-term memory (LSTM) [12] structures are playing an increasingly important role in feature extraction and modeling. Thus, our work concentrates on these deep neural network (DNN) based approaches. Given that the BSC task is a sequence-to-sequence regression task, we focus on the high-performance Bi-LSTM network. In addition, we experiment with bottleneck features to investigate whether phonetic information is useful for predicting breathing states.
For the MSC task, we implement two kinds of convolutional neural network systems to extract high-level embeddings from the filter-bank energies (Fbank). An SVM is employed on top of these embeddings to make decisions [15]. We also investigate three data augmentation approaches, which achieve significant improvements in our end-to-end framework.

The rest of this paper is organized as follows: tasks and databases are presented in Section 2, features and modeling are given in Section 3, and experimental results are provided in Section 4. Section 5 concludes the work.

The Mask Sub-Challenge is a binary classification task to identify whether a speaker is wearing a facial protective mask. As COVID-19 spreads around the world, many people have started to wear masks to protect themselves. Researchers have found that wearing a mask affects speech production, as a result of muscle constriction, increased vocal effort, and transmission loss [18]. Experiments on speaker recognition [19] and speech recognition [20] show that wearing a mask degrades the performance of existing speech systems, which indicates the necessity of detecting whether the speaker is wearing a mask or not. The Mask Augsburg Speech Corpus (MASC) [3] is used for the MSC task. In this database, the audio of 32 native German speakers is segmented into chunks of one-second duration for training and evaluation.

The Breathing Sub-Challenge is a sequential regression task that predicts a temporal breathing signal from recorded speech. Breathing condition is related to the speaker's hesitation, duration, and emphasis of utterances [21, 22]. Algorithms that can reliably monitor breathing states can provide vital information for doctors in respiratory and speech planning, and help singers better control their breath sounds. For the BSC task, a subset of the UCL Speech Breath Monitoring (UCL-SBM) database [3] is used. All 49 speakers reported English as their primary language, covering a wide range of regional accents, sociolects, and ages. For each speaker, four minutes of spontaneous speech is recorded in a quiet office space.

This section describes our feature extraction, data augmentation, and modeling for the MSC and BSC. For both tasks, we employ log-Fbank as our input acoustic features. For the MSC, gender-aware and speaker-aware features act as a complement to the log-Fbank. Speed perturbation, SpecAugment [16], and random erasing [17] are adopted for data augmentation. Deep convolutional networks are trained to extract embeddings, followed by a back-end SVM for classification. In the BSC task, we also explore bottleneck (BN) features under the framework of Bi-LSTM.

Facial movements vary across genders [23, 24] and individuals; thus, wearing a mask may have different effects for different genders and speakers. As mentioned in Section 2.1, previous studies point out that wearing a mask can degrade the speech signal and decrease the performance of speaker recognition [19]. Whether gender and speaker characteristics influence mask detection, however, is uncertain. To further explore this question, we introduce gender-aware and speaker-aware features into mask detection in this paper. Motivated by the work of [25] in emotion recognition, we automatically extract gender-aware features from a pre-trained ResNet-based gender classifier. In our work, these features are derived from the penultimate linear layer of a gender classifier network trained on the VoxCeleb1 dataset. The ResNet structure is almost the same as described in [26], except that the number of output nodes of the penultimate linear layer is 100. To introduce speaker information into our training, we follow the configuration of [26] to train a deep speaker model and extract embeddings as our speaker-aware features. During the optimization of the mask speech classifier, the gender-aware and speaker-aware features are fused at different levels.
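To make the extraction procedure concrete, the PyTorch sketch below registers a forward hook on the 100-dimensional penultimate linear layer of a small stand-in classifier and reads the activations off as the utterance-level embedding. The GenderClassifier class, its layer names, and the dummy input shape are illustrative placeholders only; the actual gender and speaker networks follow [26] and are trained on VoxCeleb1.

```python
import torch
import torch.nn as nn

# Stand-in for the pre-trained gender classifier described above: a ResNet-style
# front-end (abstracted here as a single conv + pooling block) followed by two
# linear layers, with a 100-dimensional penultimate layer. The class and layer
# names are placeholders for illustration, not the authors' actual code.
class GenderClassifier(nn.Module):
    def __init__(self, emb_dim=100, num_classes=2):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.penultimate = nn.Linear(32, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_classes)

    def forward(self, x):                     # x: (batch, 1, frames, mel bins)
        h = self.frontend(x).flatten(1)       # (batch, 32)
        h = self.penultimate(h)               # (batch, 100), used as the embedding
        return self.classifier(h)

model = GenderClassifier().eval()             # in practice, load VoxCeleb1-trained weights here

cache = {}
def save_embedding(module, inputs, output):
    cache["embedding"] = output.detach()      # keep the 100-d penultimate activations

model.penultimate.register_forward_hook(save_embedding)

with torch.no_grad():
    _ = model(torch.randn(1, 1, 100, 64))     # dummy one-second Fbank segment

gender_embedding = cache["embedding"]         # shape: (1, 100)
```

The same hook-based extraction applies to the speaker model; only the pre-trained network and its output dimensionality change.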
In the BSC, besides raw waveform and log-Fbank features, we investigate bottleneck features as the input of our model. We utilize the BUT/Phonexia bottleneck (BN) feature extractor [27, 28] to generate deep phonetic features. BN features are extracted from a narrow hidden layer of an acoustic model whose targets are phonemes. These features contain phonetic information and are suitable for many areas such as speech recognition, speaker verification, and language identification. In this toolkit [27, 28], the acoustic model is a stacked bottleneck network with two stages. The first stage is an ordinary bottleneck network, and the second is built on the bottleneck output of the first with a broader context. The BN extractor package provides three trained neural network models, among which we choose FisherMono and FisherTri, trained on the English Fisher corpus with monophone-state and triphone-state targets, respectively.

Data augmentation is a common approach for creating corrupted versions of the training data and hence increasing its quantity and diversity. In the MSC task, we adopt three schemes of data augmentation for our training: speed perturbation, SpecAugment [16], and random erasing. The effect of the different data augmentation approaches is illustrated in Figure 1.

[Figure 1: Effects of different data augmentation methods: original Fbank, speed perturbation, SpecAugment [16], and random erasing [17].]

Speed perturbation is a simple data augmentation approach that has proven effective in speech recognition, speaker verification, and paralinguistic attribute recognition [15, 29]. It can be easily implemented without any additional noise data. In practice, we apply speed perturbation with factors of 0.9, 1.0, and 1.1 to augment the data and pool the results together for model training. Random cropping and repeated padding along the time axis are applied to maintain the size of the input features.

SpecAugment [16] is a simple but effective augmentation technique that has been successfully applied in speech and speaker recognition [16, 30]. It partially masks the acoustic features in the time and frequency domains to train a model that is robust to distorted features. The scheme acts directly on the Fbank and is suitable for on-the-fly augmentation. We choose the following deformations to augment the training data:
1. Frequency masking is applied to the Fbank over a consecutive range of frequency channels [f0, f0 + f), where the bandwidth f is chosen from a uniform distribution over [0, F], f0 is chosen from [0, v - f), and v is the feature dimension.
2. Time masking is applied along the time axis over a consecutive range of frames [t0, t0 + t), where the number of masked frames t is chosen from a uniform distribution over [0, T], and t0 is the starting frame, selected from [0, u - t), with u the total number of frames.
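A minimal sketch of these two masking operations is given below; it follows the (F, T) notation above and operates on a (frames x mel-bins) log-Fbank matrix. It is an illustration of the scheme in [16], not the exact implementation used in the experiments; the default (F, T) = (12, 20) corresponds to the setting chosen later in Section 4.

```python
import numpy as np

def spec_augment(fbank, F=12, T=20, num_freq_masks=1, num_time_masks=1):
    """Apply SpecAugment-style frequency and time masking to a
    log-Fbank matrix of shape (num_frames, num_mel_bins)."""
    fbank = fbank.copy()
    u, v = fbank.shape  # u: number of frames, v: feature dimension

    for _ in range(num_freq_masks):
        f = np.random.randint(0, F + 1)           # mask bandwidth drawn from [0, F]
        f0 = np.random.randint(0, max(v - f, 1))  # start channel drawn from [0, v - f)
        fbank[:, f0:f0 + f] = 0.0

    for _ in range(num_time_masks):
        t = np.random.randint(0, T + 1)           # number of masked frames drawn from [0, T]
        t0 = np.random.randint(0, max(u - t, 1))  # start frame drawn from [0, u - t)
        fbank[t0:t0 + t, :] = 0.0

    return fbank
```

Calling such a function on each example inside the data loader yields the on-the-fly augmentation described above.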
Time warping is also proposed as one of the strategies of SpecAugment. In our case, we have already applied speed perturbation, which is very close to time warping, so we do not consider it in this paper.

Random erasing is a data augmentation method proposed by [17] in image processing and has been successfully applied to object detection, image classification, and person re-identification. Its motivation is similar to that of SpecAugment: both mask the features to increase the robustness of models to deformations. Instead of covering a band on the frequency or time axis, random erasing randomly selects a rectangular region of the features and replaces its values with zero. In the training phase, each input acoustic feature in a batch is either kept unchanged or has a rectangular region of arbitrary size erased. The position, width, and height of the masking rectangle are randomly selected within pre-defined ranges. For each training batch, different corrupted versions of the features can be generated, which enhances the robustness and generalization ability of the models.

Similar to the baseline deep spectrum system, we also extract embedding features from a deep convolutional neural network. Unlike the baseline system, which is pre-trained on an image corpus, ours is trained directly on the masked/clear targets in an end-to-end manner. In our work, we implement two networks to extract deep representations: a modified version of ResNet and a DenseNet. The deep ResNet structure is implemented following [29, 31] and has three main components: a ResNet front-end module, two parallel global pooling layers, and two fully-connected layers. The ResNet module is composed of a series of residual blocks and projects the input Fbank to a feature map F ∈ R^(C×H×W). Then, a global average pooling (GAP) layer and a global standard deviation pooling (GSP) layer are applied to each channel, and their outputs, the per-channel means and standard deviations, are concatenated into a 2C-dimensional vector. This pooled vector is then fed into the fully-connected layers to make predictions. The embedding features are extracted from the output of the penultimate fully-connected layer. Our DenseNet [32] structure follows the torchvision implementation. DenseNet connects each layer to every other layer in a feed-forward manner and thus has the potential to reduce the problem of vanishing gradients. The deep embedding features are extracted from the output of the average pooling layer. After training the models, we feed the extracted embeddings to a back-end SVM for predictions. The framework is illustrated in Figure 2.

[Figure 2: Framework of the ResNet embedding system with a back-end classifier for the MSC task.]

The BSC task can be regarded as a sequence-to-sequence regression task. We implement a Bi-LSTM network, a typical sequential modeling method that considers speech context from both directions. In our work, two Bi-LSTM layers are stacked, each with 256 units per direction and a dropout rate of 0.6, followed by a fully-connected layer. The final activation function is the tanh function. Through this structure, the input features are transformed into a sequence of predictions. Then, we compute the cosine distance between the predictions and the ground-truth upper belt signal as the loss function to update the model.
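A minimal PyTorch sketch of this sequence regressor is shown below. The layer sizes follow the description above (two stacked Bi-LSTM layers with 256 units per direction, dropout 0.6, a fully-connected output layer, and tanh); the input feature dimension and the exact form of the cosine-distance loss, written here as one minus cosine similarity, are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BreathBiLSTM(nn.Module):
    """Sequence-to-sequence regressor: acoustic features -> breathing signal."""
    def __init__(self, feat_dim=40):  # feat_dim is a placeholder input dimension
        super().__init__()
        self.blstm = nn.LSTM(input_size=feat_dim, hidden_size=256,
                             num_layers=2, batch_first=True,
                             bidirectional=True, dropout=0.6)
        self.fc = nn.Linear(2 * 256, 1)

    def forward(self, x):                          # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)                       # (batch, frames, 512)
        return torch.tanh(self.fc(h)).squeeze(-1)  # (batch, frames)

def cosine_distance_loss(pred, target):
    # One minus cosine similarity between predicted and reference belt signals.
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()

# Dummy forward/backward pass on a 6000-frame sequence.
model = BreathBiLSTM(feat_dim=40)
x = torch.randn(2, 6000, 40)
y = torch.randn(2, 6000)
loss = cosine_distance_loss(model(x), y)
loss.backward()
```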
The MSC task is a binary classification task that identifies whether the speaker is wearing a mask or not. The metric of this task is unweighted average recall (UAR). We investigate different data augmentation schemes and fusion methods for the gender/speaker-aware features in our work. Our augmentation schemes include speed perturbation, SpecAugment, and random erasing. The parameters (F, T) of SpecAugment are chosen to be (12, 20), which achieves the best performance among the candidates {(8, 12), (10, 16), (12, 20), (14, 24), (16, 28)} in our preliminary experiments. In the random erasing method, to explore a suitable range for the proportion of erased area relative to the input features, we try the candidates {(0.02, k)}, k = 0.1, 0.15, 0.2, 0.25, 0.3, and find that (0.02, 0.2) is an appropriate choice for our task.

Both the gender-aware and speaker-aware features are extracted from pre-trained classifiers. In our experiments, we introduce these features at two different levels. The first is the feature level, in which the gender/speaker embeddings are stacked on top of the Fbank (Feat-level). The second is the pooling level, in which the gender/speaker-aware features are concatenated with the output of the pooling layer (Emb-level).

Categorical cross-entropy is taken as the loss function. Networks are optimized using stochastic gradient descent (SGD) with Nesterov momentum 0.9 for 100 epochs. During training, the learning rate is initialized to 0.01 and reduced by a factor of 10 when the training loss plateaus. Embeddings are extracted from the trained networks and then fed into an SVM classifier with an RBF kernel and default parameters. We fuse systems by averaging the output probabilities of the different models.

Table 1 shows the contribution of gender-aware and speaker-aware features under different fusion methods. Fusing the gender-aware and speaker-aware features at the feature level gives a considerably better result than the alternatives, suggesting that both gender and speaker information are effective for mask detection. It indicates that wearing a mask may have different effects for different genders and speakers. The UAR score of the embedding-level fusion is close to the baseline, which may be caused by redundancy of the gender/speaker information in the embedding layer.

Experimental results are shown in Table 2. The ResNet-based system achieves better performance than the DenseNet-based system with or without augmentation. All the data augmentation methods we apply in the MSC improve the performance significantly. SpecAugment achieves a larger gain than random erasing on the development set. Combining SpecAugment and random erasing does not bring any further improvement, which suggests that their effects are not complementary. Our final submitted system fuses the systems marked with (*) in Tables 1 and 2. It significantly outperforms the baseline system on the test set, which indicates that our approach is robust and effective.

The BSC task can be viewed as a sequence-to-sequence task. PCC is used as the metric. We train a two-layer stacked Bi-LSTM network with Fbank and BN features to predict the upper belt signal from speech. Models are optimized using Adam for 100 epochs with a batch size of 16. We use the BUT/Phonexia bottleneck (BN) feature extractor to extract the 80-dimensional FisherMono and FisherTri features. In the Fbank system, the four minutes of speech are transformed into a 6000-frame acoustic feature sequence using a 60 ms window and a 40 ms shift. For the BNF system, which uses a 10 ms shift, we stack four consecutive frames together to again generate a 6000-frame feature sequence.
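As a quick sanity check of this framing arithmetic, the snippet below reproduces the 6000-frame counts for both front-ends, assuming exactly four minutes of audio and non-overlapping stacking of the four BN frames.

```python
# A quick check of the framing arithmetic described above (assuming exactly
# four minutes of audio and that the four BN frames are stacked without overlap).
duration_ms = 4 * 60 * 1000          # four minutes of speech, in milliseconds

fbank_frames = duration_ms // 40     # Fbank front-end: 40 ms shift (60 ms window)
print(fbank_frames)                  # 6000

bn_frames = duration_ms // 10        # BN front-end: 10 ms shift
stacked_bn_frames = bn_frames // 4   # four consecutive frames stacked together
print(stacked_bn_frames)             # 6000
```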
Comparing the results of the BNF features in Table 3, we find that the BNF system based on triphone states works slightly better than the one based on monophone states. Both achieve higher scores than the baseline system but do not show an evident advantage over Fbank. On the test set, our fused system outperforms the baseline system.

This paper describes our submitted systems for the MSC and BSC in the Interspeech COMPARE Challenge 2020. For the MSC task, embeddings are extracted from deep convolutional neural networks as representations and fed into a back-end SVM classifier for binary classification. We investigate speed perturbation, SpecAugment, and random erasing as our data augmentation schemes and use the gender/speaker embeddings to further enhance the performance. For the BSC task, we explore Fbank and phonetic features based on the Bi-LSTM structure. Experimental results prove the effectiveness of our data augmentation approaches for the deep embedding systems in the MSC. The phonetic features perform better than the baseline but show no advantage over Fbank. Our proposed methods outperform the baselines and achieve 0.746 PCC and 78.8% UAR on the BSC and MSC evaluation data, respectively.

References
[1] Paralinguistics in speech and language: state-of-the-art and the challenge.
[2] Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge.
[3] The INTERSPEECH 2020 Computational Paralinguistics Challenge: elderly emotion, breathing & masks.
[4] openSMILE: the Munich versatile and fast open-source audio feature extractor.
[5] openXBOW: introducing the Passau open-source crossmodal bag-of-words toolkit.
[6] auDeep: unsupervised learning of representations from audio with deep recurrent neural networks.
[7] Bag-of-deep-features: noise-robust deep feature representations for audio analysis.
[8] Snore sound classification using image-based deep spectrum features.
[9] Sentiment analysis using image-based deep spectrum features.
[10] Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network.
[11] End2You: the Imperial toolkit for multimodal profiling by end-to-end learning.
[12] Using attention networks and adversarial augmentation for Styrian dialect continuous sleepiness and baby sound recognition.
[13] End-to-end deep learning framework for speech paralinguistics detection based on perception aware spectrum.
[14] An end-to-end deep learning framework for speech emotion recognition of atypical individuals.
[15] The DKU-LENOVO systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge.
[16] SpecAugment: a simple data augmentation method for automatic speech recognition.
[17] Random erasing data augmentation.
[18] Effects of different types of face coverings on speech acoustics and intelligibility.
[19] Analysis of face mask effect on speaker recognition.
[20] Distant talking speech recognition in surgery room: the DOMHOS project.
[21] The interplay of linguistic structure and breathing in German spontaneous speech.
[22] Variability and consistency in speech breathing during reading: lung volumes, speech intensity, and linguistic factors.
[23] Gender and age differences in facial expressions.
[24] Three dimensional analysis of facial movement in normal adults: influence of sex and facial shape.
[25] Gender-aware CNN-BLSTM for speech emotion recognition.
[26] On-the-fly data loader and utterance-level aggregation for speaker and language recognition.
[27] BUT/Phonexia bottleneck feature extractor.
[28] Multilingually trained bottleneck features in spoken language recognition.
[29] Exploring the encoding layer and loss function in end-to-end speaker and language recognition system.
[30] Investigation of SpecAugment for deep speaker embedding learning.
[31] DIHARD II is still hard: experimental results and discussions from the DKU-LENOVO team.
[32] Densely connected convolutional networks.

Acknowledgments
We want to thank Antonia Hamilton and Alexis Macintyre from University College London for sharing the speech breathing dataset with us for this paper. This research is funded in part by the National Natural Science Foundation of China.