key: cord-0635489-vh1r700w authors: Pahar, Madhurananda; Klopper, Marisa; Reeve, Byron; Warren, Rob; Theron, Grant; Diacon, Andreas; Niesler, Thomas title: Wake-Cough: cough spotting and cougher identification for personalised long-term cough monitoring date: 2021-10-07 journal: nan DOI: nan sha: b443b37d486d7c9e26617f37006c5a67170eb0f3 doc_id: 635489 cord_uid: vh1r700w

We present 'wake-cough', an application of wake-word spotting to coughs using a Resnet50 and the identification of coughers using i-vectors, for the purpose of a long-term, personalised cough monitoring system. Coughs recorded in a quiet (SNR 73±5 dB) and a noisy (SNR 34±17 dB) environment were used to extract i-vectors, x-vectors and d-vectors, which served as features for the classifiers. The system achieves 90.02% accuracy with an MLP when discriminating 51 coughers using 2-sec long cough segments in the noisy environment. When discriminating between 5 and 14 coughers using longer (100 sec) segments in the quiet environment, this accuracy rises to 99.78% and 98.39% respectively. Unlike for speech, i-vectors outperform x-vectors and d-vectors in identifying coughers. These coughs were added as an extra class to the Google Speech Commands dataset, and features were extracted by preserving the end-to-end time-domain information in each event. The highest accuracy of 88.58% is achieved in spotting coughs among 35 other trigger phrases using a Resnet50. Wake-cough represents a personalised, non-intrusive cough monitoring system that is also power-efficient, since the wake-word detection method keeps a smartphone-based monitoring device mostly dormant. This makes wake-cough extremely attractive in multi-bed ward environments for monitoring patients' long-term recovery from lung ailments such as tuberculosis and COVID-19.

Wake-words (WW) are trigger phrases that enable keyword spotting (KWS) systems to initiate tasks such as speech recognition, providing a bridge between the end-user and either the device or the cloud [1].
For example, widely-used trigger phrases for voice assistants on smart devices include Google's 'OK Google', Apple's 'Hey Siri', Amazon's 'Alexa' and Microsoft's 'Hey Cortana' [2]. These are highly sensitive in both quiet and noisy environments [3], making them extremely useful in hands-free situations such as driving [4]. Coughing is the forceful expulsion of air to clear the airway and a common symptom of respiratory diseases such as tuberculosis (TB) [5], asthma [6], pertussis [7] and COVID-19 [8], which can be identified using machine learning classifiers. To successfully implement the cough as a WW on commercial smartphones, it is necessary to accurately identify the cougher [9] in both noisy and quiet environments, and to identify the cough among various other commonly used trigger phrases [10]. Vocal audio such as speech can be identified using i-vectors, which represent speaker and channel variability in a low-dimensional space derived by factor analysis [11]. Performance can be improved by using x-vectors [12] and d-vectors [13], which use data augmentation and DNN-based embeddings to model the speaker. Coughers have been identified using x-vectors on natural coughs in an open-world setting for 8 male and 8 female subjects, after applying data augmentation to address background noise [14], and using d-vectors on forced coughs [15]. Here, we identify both natural and forced coughs among the other trigger phrases in the Google Speech Commands dataset [16], while also identifying the coughers in noisy and quiet environments using i-vectors, x-vectors and d-vectors.

For the cougher identification task, two datasets, referred to as TASK and Wallacedene (Table 1), were manually annotated using ELAN [17]. The TASK dataset was collected inside a multi-bed ward at a 24h tuberculosis (TB) clinic near Cape Town, South Africa, and contains natural coughs [18].
A plastic enclosure attached to the bedframes holds a Samsung Galaxy J4 smartphone connected to a BOYA BY-MM1 cardioid microphone (Figure 1); the distance between the cougher and the microphone was between 30 and 150 cm. The dataset includes 6000 cough events, sampled at 22.05 kHz and collected from 14 adult male patients over a 6-month period, totalling 3.16 hours of cough audio with an average SNR of 73±5 dB. No other patient information was collected, due to ethical constraints. The Wallacedene dataset was collected inside an outdoor booth next to a busy primary health clinic in Wallacedene, near Cape Town, South Africa, representing a real-world environment in which a TB test would likely be deployed [19] (Figure 1). Patients were asked to count from 1 to 10, then cough, take a few deep breaths, and cough again, thus producing forced coughs. The counts were used as speech to provide a baseline against which to compare the performance of cougher identification. The audio, sampled at 44.1 kHz, was recorded using a RØDE M3 condenser microphone from 38 males and 13 females, keeping a 10 to 15 cm gap between the microphone and the patient. Environmental noise was present in both the cough and the speech recordings, with average SNRs of 34 dB and 33 dB respectively, each with a standard deviation of 17 dB (Table 1). In summary, TASK, containing only coughs, was collected in a quiet environment, while Wallacedene, containing both coughs and speech (counting from 1 to 10), was collected in a noisy environment. Table 1 shows that the TASK dataset is less noisy and contains much longer cough audio for each subject, whereas the Wallacedene dataset is noisier but has cough and speech audio from a larger number of subjects. All audio recordings were downsampled to 16 kHz, as required by the Kaldi ASR toolkit [20]. For cough spotting, we randomly selected 3795 coughs from the TASK and Wallacedene datasets. Each cough was normalised to a 1-sec duration by either trimming or padding with silence, and downsampled to 16 kHz.
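As an illustration, the trim-or-pad normalisation described above could be sketched as follows. This is a minimal sketch, not the authors' code: the function name and defaults are our own, and `scipy.signal.resample` stands in for whatever resampler was actually used.

```python
import numpy as np
from scipy.signal import resample

def normalise_cough(audio, sr, target_sr=16000, target_dur=1.0):
    """Resample a cough to target_sr and fix its length to target_dur
    seconds by trimming or zero-padding (i.e. padding with silence)."""
    # Resample to the target sampling rate
    n_resampled = int(round(len(audio) * target_sr / sr))
    audio = resample(audio, n_resampled)
    # Trim or pad with silence to exactly target_dur seconds
    n_target = int(target_sr * target_dur)
    if len(audio) >= n_target:
        return audio[:n_target]
    return np.pad(audio, (0, n_target - len(audio)))
```

Applied to a cough recorded at 22.05 kHz, this always yields exactly 16,000 samples, regardless of the original event length.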
These 'cough' events were added as an extra class to version 2 of the Google Speech Commands dataset, which contains 109,624 1-sec long events sampled at 16 kHz and belonging to 35 classes [16]. All events were mixed with the background noises (Section 5.8 of [16]) at a randomly selected SNR between 34 and 73 dB (Table 1). A subset of this dataset was also created with only the 42,341 events belonging to the 10 classes that can be used as commands in IoT or robotics applications [16]. For the task of spotting the cough as a trigger phrase, we denote these two datasets SC-36 and SC-11, containing 36 and 11 classes respectively. For cougher identification, we extracted x-vectors and i-vectors using extractors pre-trained on the under-resourced languages [21] spoken by the subjects in the TASK and Wallacedene datasets (Figure 2). The t-sec long audio from each of the N coughers is concatenated, following the data preparation requirements of the Kaldi ASR toolkit [20]. An i-vector is generated for each non-overlapping 0.1 sec of audio from each utterance ID, giving a dimension of (t × 10, 100) for each cougher [11]. A unique x-vector is generated for each 1.5 sec of utterance with a 0.75 sec overlap, with dimension (1, 512) [12]; thus, for each t sec of audio from each cougher, the x-vectors have dimension (t/0.75, 512). We also extracted d-vectors using an extractor pre-trained on the VCC 2018, VCTK, LibriSpeech and CommonVoice English datasets and generalised using an end-to-end loss function [13]. Every t sec of cough audio is split into non-overlapping 0.5 sec clips, producing d-vectors of dimension (t/0.5, 256) for every cougher. Hence the i-vectors provide both more vectors per second and more coefficients in total than the x-vectors and d-vectors. The number of subjects (N) and the cough length (t) were the hyperparameters in the cougher identification task (Table 3). For speakers, we used all the counts, leaving only N as a hyperparameter.
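The embedding dimensions above follow directly from the framing settings. A small sketch (the helper name is ours, not part of any toolkit) makes the arithmetic explicit:

```python
def embedding_shape(t_sec, kind):
    """Shape (n_vectors, dim) of the embeddings extracted from t_sec
    seconds of audio, using the frame settings described in the text."""
    if kind == "i":                    # one 100-dim i-vector per 0.1 s
        return (int(t_sec * 10), 100)
    if kind == "x":                    # 512-dim x-vector per 0.75 s shift
        return (int(t_sec / 0.75), 512)
    if kind == "d":                    # 256-dim d-vector per 0.5 s clip
        return (int(t_sec / 0.5), 256)
    raise ValueError(f"unknown embedding kind: {kind}")
```

For t = 100 sec this gives (1000, 100) i-vectors, (133, 512) x-vectors and (200, 256) d-vectors per cougher, showing why the i-vectors offer the finest temporal granularity.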
For the TASK and Wallacedene datasets, N was varied (Table 3). LR, LDA, SVM and MLP classifiers were used to identify coughers, while CNN, LSTM and Resnet50 classifiers were used to spot coughs as a trigger phrase. Table 3 lists the hyperparameters considered for these classifiers; they were optimised using 5-fold cross-validation, and the standard deviation across the outer folds is reported as σ_ACC in Table 4. For the Resnet50, the 50-layer architecture of [22] was used. Table 4 shows the results using the best two features for both the TASK (less noisy) and Wallacedene (noisier) datasets. The highest accuracy (99.78%) was achieved by an MLP using i-vectors to identify coughers from 100 sec (t = 100) of cough audio collected from each of 5 coughers. When the number of coughers was increased to 10 and 14, the accuracy of the MLP decreased to 98.87% and 98.39% respectively for i-vectors (Table 4 and Figure 4). The MLP achieves 95.11% accuracy using these i-vectors when discriminating 14 coughers (Table 4). All classifiers also performed well in identifying both coughers and speakers on the noisier Wallacedene dataset. Speaker identification is used as the baseline, and Table 4 shows that x-vectors produced better classification scores than i-vectors for speaker identification, as also found by others [12]. The highest accuracy (99.91%) was achieved by the MLP when discriminating only 5 speakers using x-vectors. This accuracy drops to 98.14% when differentiating 30 speakers, and further to 95.24% when discriminating all 51 speakers in the Wallacedene dataset. For a smaller number of coughers, such as 5, the MLP achieved the highest accuracy of 98.49% using i-vectors. The accuracy of the MLP dropped to 97.82%, 96.69%, 94.87% and 93.32%, and σ_ACC increased sharply, as the number of coughers was increased to 15, 25, 40 and 51 respectively.
These scores show that although cougher identification is not as accurate as speaker identification, the performance is close, especially for smaller numbers of subjects. The results also show that cougher identification on the less-noisy TASK dataset is more accurate than on the noisier Wallacedene dataset (Fig. 4: classifier performance; the accuracy of the MLP classifier decreases as more subjects are discriminated). Although longer cough audio from each subject generally improves classifier accuracy, similar performance (95.11% and 90.02% accuracy on the less-noisy and noisy data respectively) is achieved for coughs as short as 2 sec (Figure 3). Although the margin is small, i-vectors performed better than x-vectors in cougher identification. The MLP is the classifier of choice, and it shows a lower σ_ACC across the cross-validation folds for the less-noisy data than for the noisier data. d-vectors were outperformed by i-vectors and x-vectors for both speech and cough, as also found by [23], and were therefore excluded from Table 4. Coughs were successfully spotted among the other trigger phrases in both the SC-11 and SC-36 datasets. Table 5 shows that, although the LSTM and CNN performed well, the best performance, 92.73% accuracy (ACC) with a mean Cohen's Kappa (K) of 0.9218 on SC-11 and 88.58% accuracy with K of 0.8757 on SC-36, was achieved by the Resnet50. The confusion matrix of the best SC-11 system (Figure 5) exhibits high accuracy in spotting coughs. Table 5 also shows that the best results of the CNN and Resnet50 were obtained mostly

We propose a system that uses the cough as a wake-word by spotting coughs among other trigger phrases and identifying the cougher. A less-noisy and a noisier dataset, containing 14 and 51 subjects respectively, were used to extract i-vectors, x-vectors and d-vectors to classify the cougher.
The best performance was achieved by an MLP, showing that as many as 51 coughers can be identified with 90.02% accuracy using i-vectors extracted from as little as 2 sec of audio per cougher in the noisy environment. We also found that, unlike speakers, coughers were better identified using i-vectors. Coughs can also be spotted as wake-words among 35 other keywords in the Google Speech Commands dataset with 88.58% accuracy, using a Resnet50 on features that preserve end-to-end time-domain information. Wake-cough represents a means of personalised, long-term cough monitoring that is non-intrusive and, due to the use of wake-word detection methods, power-efficient, since a smartphone-based monitoring device can remain mostly dormant. It therefore represents an attractive and viable means of monitoring a patient's long-term recovery from lung ailments such as TB and COVID-19. Its ability to discriminate between coughers also makes it attractive in multi-bed ward environments for monitoring patients' recovery.

References
[1] Monophone-Based Background Modeling for Two-Stage On-Device Wake Word Detection
[2] Convolutional Neural Networks for Small-footprint Keyword Spotting
[3] Towards Data-Efficient Modeling for Wake Word Spotting
[4] Small-footprint keyword spotting using deep neural networks
[5] Detection of tuberculosis by automatic cough sound analysis
[6] A signal processing approach for the diagnosis of asthma from cough sounds
[7] A cough-based algorithm for automatic diagnosis of pertussis
[8] COVID-19 cough classification using machine learning and global smartphone recordings
[9] Deep neural network based wake-up-word speech recognition with two-stage detection
[10] A novel Wake-Up-Word speech recognition system, Wake-Up-Word recognition task, technology and evaluation
[11] Improving DNN speaker independence with i-vector inputs
[12] X-Vectors: Robust DNN Embeddings for Speaker Recognition
[13] Generalized End-to-End Loss for Speaker Verification
[14] Whosecough: In-the-Wild Cougher Verification Using Multitask Learning
[15] Speaker recognition with cough, laugh and "Wei", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
[16] Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
[17] ELAN: a professional framework for multimodality research
[18] Deep Neural Network based Cough Detection using Bed-mounted Accelerometer Measurements
[19] Automatic Cough Classification for Tuberculosis Screening in a Real-World Environment
[20] The Kaldi Speech Recognition Toolkit, IEEE 2011 Workshop on Automatic Speech Recognition and Understanding
[21] Multilingual bottleneck features for improving ASR performance of code-switched speech in under-resourced languages
[22] Deep residual learning for image recognition
[23] Improved deep speaker feature learning for text-dependent speaker recognition