key: cord-0190797-eynlwhfn authors: Ren, Zhao; Nguyen, Thanh Tam; Nejdl, Wolfgang title: Prototype Learning for Interpretable Respiratory Sound Analysis date: 2021-10-07 journal: nan DOI: nan sha: 43c2cf03ecf4659346564c801e0eab6dd773356d doc_id: 190797 cord_uid: eynlwhfn

Remote screening of respiratory diseases has been widely studied as a non-invasive and early instrument for diagnosis purposes, especially in the pandemic. The respiratory sound classification task has been realized with numerous deep neural network (DNN) models due to their superior performance. However, in the high-stake medical domain where decisions can have significant consequences, it is desirable to develop interpretable models that provide understandable reasons for physicians and patients. To address this issue, we propose a prototype learning framework that jointly generates exemplar samples for explanation and integrates these samples into a layer of DNNs. The experimental results indicate that our method outperforms the state-of-the-art approaches on the largest public respiratory sound database.

Respiratory sound classification is the task of automatically identifying adventitious sounds as a tool to assist physicians in screening lung diseases such as pneumonia and asthma [1]. Unlike traditional auscultation, computer-aided auscultation of respiratory sounds provides a remote and non-invasive instrument for early diagnosis of patients at home or outside of hospitals. Owing to its promising prospects, respiratory sound classification has received considerable attention [2, 3, 4, 5].

Recently, deep neural networks (DNNs) have achieved great success in a wide range of areas. Due to their powerful capability, DNN-based models have also shown prominent performance in respiratory sound classification [4, 6]. However, a key limitation of these DNN-based respiratory sound classification models is that they are not explainable by nature, which is problematic in high-stake domains where decisions can have significant consequences, such as disease diagnosis.

Prototype learning, emerging as a novel interpretable machine learning paradigm that imitates the human reasoning process, has attracted many recent works [7, 8, 9]. The basic idea of prototype learning is to explain the classification by comparing the inputs to a few prototypes, which are similar examples in the application domain [10, 11, 12]. Unlike post-hoc explanation methods that only approximate original models [13], prototype learning holds vast potential to improve the classification quality via nearest neighbor classifiers or kernel-based classifiers [14, 7]. With these efforts, the paradigm of prototype-based explanations has demonstrated some promising results showing that the so-called accuracy-interpretability trade-off [13] can be overcome. However, despite the benefits of prototype learning, little attention has been given to the audio domain.

To address this issue and fully inherit the power of DNNs, we propose a prototype learning method for respiratory sound classification that integrates a prototypical layer into the training of an audio-driven convolutional neural network (CNN). Our framework takes as input the log Mel spectrogram of an audio signal as well as its delta and delta-delta, because these features perform better than raw audio signals for DNN models [15]. Through a prototype layer that calculates the similarity between the internal feature map and the prototypes, the prototypes are learnt at the intermediate level to represent each class.
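As a rough illustration of these input features, the following minimal sketch stacks the log Mel spectrogram with its delta and delta-delta into a three-channel input using librosa; the spectrogram settings (window length 256, hop length 128, 128 Mel bins, 4 kHz sampling rate) are taken from the implementation details reported later, and the function name is illustrative rather than part of the released code.

```python
import librosa
import numpy as np

def extract_input_features(audio: np.ndarray, sr: int = 4000) -> np.ndarray:
    """Sketch: log Mel spectrogram plus its delta and delta-delta,
    stacked as a three-channel input for the CNN encoder."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=256, hop_length=128, n_mels=128)
    log_mel = librosa.power_to_db(mel)
    delta = librosa.feature.delta(log_mel)
    delta2 = librosa.feature.delta(log_mel, order=2)
    return np.stack([log_mel, delta, delta2], axis=0)  # shape: (3, 128, frames)
```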
Our work relates closely to interpretable models such as k-nearest neighbors [16], attention mechanisms [11, 17], and post-hoc explanation methods [18, 19]. However, it is difficult for humans to generalize from such interpretations due to the lack of a quantifiable link between the classification result and the explanation [7]. Our work relates most closely to Zinemanas et al. [14], who proposed a network architecture that builds prototype-based explanation into an autoencoder. Unlike their model, our model employs cosine similarity between examples and prototypes, and applies attention-based similarity at the time-frequency level rather than the frequency level only. Also, our model can overcome the class imbalance problem, which is common in disease diagnosis [2, 20]. To the best of our knowledge, this is the first attempt to develop an interpretable respiratory sound classification framework.

We propose a prototype learning method to enhance the power of DNNs as well as the explainability of case-based reasoning. Constructing prototypes as explanations brings several benefits: (i) the learnt prototypes yield a concise representation and can be projected to the original data, (ii) it is easier to visually compare a classified audio sample with the exemplar examples (i.e., prototypes), and (iii) a prototype can be a new case, so that physicians can understand more about the diseases.

With the extracted log Mel spectrograms as well as their deltas and delta-deltas as the input, three prototype learning approaches are employed in our work: i) Prototype-1D, ii) Prototype-2D with vanilla similarity, and iii) Prototype-2D with attention-based similarity (see Fig. 1). Before learning the prototypes in each approach, a CNN model is employed as an encoder for analysing the respiratory sounds due to CNNs' strong capability of extracting highly abstract representations.

As the high-level representations include more class-related information than the low-level ones, the prototype layer is trained after a global max pooling layer for 1D prototypes (see Fig. 1(a)). Given an instance (x, y) (x: input, y: label), the intermediate representation before the prototype layer is denoted by f(x). Through the prototype layer, a set of prototypes P_l, l ∈ [1; L], is learnt, where L is the number of classes. In each P_l, p_i, i ∈ [1; N], is a prototype with the same size as f(x), where N is the number of prototypes per class. The cosine metric is then used to measure the similarity between f(x) and p_i:

S_i = cos(f(x), p_i) = (f(x) · p_i) / (‖f(x)‖ ‖p_i‖). (1)

The similarity is further fed into a layer normalisation (LN) layer, a fully connected (FC) layer, and a softmax layer for classification.

Although Prototype-1D can generate prototypes, it is challenging for 1D prototypes to represent the time and frequency information. As 2D prototypes can better represent the time-frequency information than 1D prototypes, the prototype layer is placed after the CNN encoder for the similarity measurements (see Fig. 1(b-c)).

Vanilla Similarity. Similar to the Prototype-1D learning approach, the calculated similarities are sent to the next layers for the classification task (see Fig. 1(b)). Element-wise Similarity calculates the similarity scores between each pair of time-frequency bins in f(x) and p_i. When the channel number of f(x) is C and its spatial size is (T, R), the element-wise similarity is calculated by

S_{i,t,r} = cos(f_{:,t,r}(x), p_{i,:,t,r}), (2)

where t ∈ [1; T], r ∈ [1; R], and f_{:,t,r}(x) and p_{i,:,t,r} denote the C-dimensional channel vectors at the time-frequency bin (t, r).
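A minimal PyTorch sketch of the two similarity computations in Eqs. (1) and (2) is given below; the tensor shapes and function names are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def prototype_similarity_1d(f_x: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Eq. (1): cosine similarity between the pooled representation f(x),
    shape (B, D), and the 1-D prototypes, shape (L*N, D) -> (B, L*N).
    The result is then passed to LayerNorm, an FC layer, and softmax."""
    return F.cosine_similarity(f_x.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)

def elementwise_similarity_2d(f_x: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
    """Eq. (2): per-bin cosine similarity over the channel dimension between
    f(x), shape (B, C, T, R), and one 2-D prototype p_i, shape (C, T, R),
    giving a (B, T, R) similarity map."""
    return F.cosine_similarity(f_x, prototype.unsqueeze(0).expand_as(f_x), dim=1)
```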
Average Similarity computes an average score across the similarities between all time-frequency bins of p_i and each bin of f(x). Since only part of an audio signal may contain the class-related characteristics, the average similarity is a score between each f(x) bin and the whole prototype p_i:

S_{i,t,r} = (1 / (T R)) Σ_{t'∈[1;T]} Σ_{r'∈[1;R]} cos(f_{:,t,r}(x), p_{i,:,t',r'}). (3)

Maximum Similarity follows the same idea as the average similarity, but selects the most similar p_i bin for each f(x) bin to structure the similarity scores at all time-frequency bins. The maximum similarity is computed by

S_{i,t,r} = max_{t'∈[1;T], r'∈[1;R]} cos(f_{:,t,r}(x), p_{i,:,t',r'}). (4)

Attention-based Similarity. Apart from the vanilla similarity, the attention-based similarity (see Fig. 1(c)) is employed to learn weighted similarity scores. The calculated similarity scores are processed by a softmax function σ(·) to obtain the attention feature maps. As the average similarity is computed as a global score across all p_i bins for each f(x) bin, it is not applicable to the attention-based similarity. Therefore, we introduce the element-wise and maximum similarities with the attention mechanism. Element-wise Similarity is the attention-weighted counterpart of the vanilla element-wise similarity in Eq. (2); Maximum Similarity is calculated at each f(x) bin f_{c,t,r}(x) and its most similar p_i bin p_{i,c,t_max,r_max} from Equation (4).

During training of the above prototype learning models, the prototypes are learnt as part of the model parameters. The loss function of the neural networks is finally defined by

L = L_NLL + α L_dv, (7)

where L_NLL is the negative log likelihood (NLL) loss, computed after the output of the neural networks in Fig. 1 is passed through a logarithm function, L_dv denotes the diverse loss, and α is a constant value. L_dv aims to reduce the distances among prototypes which represent the same class and to increase the distances among prototypes of different classes. The average similarity is experimentally used to evaluate the similarities between each pair of prototypes. In each set of prototypes P_l for class l, the similarities between each two different prototypes inside P_l are calculated and averaged, leading to the denominator of Eq. (8); between each two different sets of prototypes P_{l1} and P_{l2}, the averaged similarity is also computed, forming the numerator of Eq. (8):

L_dv = S̄_inter / S̄_intra, (8)

where S̄_inter and S̄_intra denote these averaged between-class and within-class prototype similarities, respectively.
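As a rough illustration of Eqs. (7) and (8), the sketch below combines a class-weighted NLL loss with a diverse loss term in PyTorch. For simplicity, the pairwise prototype similarities are computed with plain cosine similarity on flattened prototypes rather than the average similarity described above, and all names, shapes, and the prototype ordering are assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def diverse_loss(prototypes: torch.Tensor, num_classes: int,
                 n_per_class: int, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of L_dv: averaged between-class prototype similarity divided by
    averaged within-class prototype similarity. Prototypes are assumed to be
    ordered class by class along the first dimension, shape (L*N, ...)."""
    p = F.normalize(prototypes.flatten(1), dim=-1)        # (L*N, D)
    sim = p @ p.t()                                       # pairwise cosine similarities
    labels = torch.arange(num_classes).repeat_interleave(n_per_class)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(labels.numel(), dtype=torch.bool)
    between = sim[~same].mean()
    within = sim[same & off_diag].mean() if n_per_class > 1 else sim.new_tensor(1.0)
    return between / (within + eps)

def total_loss(log_probs: torch.Tensor, targets: torch.Tensor,
               prototypes: torch.Tensor, num_classes: int, n_per_class: int,
               class_weights: torch.Tensor = None, alpha: float = 0.1) -> torch.Tensor:
    """Eq. (7): L = L_NLL + alpha * L_dv, with class-weighted NLL against
    class imbalance (alpha = 0.1 as in the experiments)."""
    l_nll = F.nll_loss(log_probs, targets, weight=class_weights)
    return l_nll + alpha * diverse_loss(prototypes, num_classes, n_per_class)
```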
Data. The Scientific Challenge database released at the International Conference on Biomedical and Health Informatics (ICBHI) 2017 [3] is the largest publicly available collection of audio samples for respiratory sound classification. In total, 920 audio recordings were collected from seven chest locations (i.e., trachea, anterior left, anterior right, posterior left, posterior right, lateral left, and lateral right) of 126 participants with four devices (i.e., one microphone and three stethoscopes). The audio recordings have different sampling rates: 4 kHz, 10 kHz, and 44.1 kHz. The recordings yield 6 898 respiratory cycles, each of which was annotated with one of four classes: normal, crackle, wheeze, and both (i.e., crackle + wheeze). The database was split into a training set (60 %) and a test set (40 %) for the competition. To optimise the model hyperparameters, we further divide the training set into two subject-independent data sets: a train set (70 %) and a development set (30 %) (see Table 1).

Evaluation Metrics. Although the database contains four labels, it is a common practice to differentiate abnormal cases (crackles, wheezes, and both) from normal cases. Therefore, the following standard benchmarks are used: sensitivity (SE), the number of correctly classified abnormal cases over the total number of abnormal cases; specificity (SP), the ratio of correctly classified normal cases over all normal cases; and the average score (AS), the official score of the ICBHI challenge [3], computed as the average of SE and SP. Due to class imbalance, we also report the unweighted average recall (UAR) as the generic classification benchmark instead of accuracy [4, 21].

Implementation Details. At the preprocessing stage, all audio recordings are resampled to 4 kHz due to the various sampling rates of the ICBHI database. A fifth-order Butterworth band-pass filter (100 Hz-1 800 Hz) is then applied to exclude noise components, e.g., heart sounds [3]. The respiratory cycles with different durations are unified into audio signals with a fixed length of 4 s. At the training stage, 4 s segments are randomly selected from each data sample to improve flexibility; at the testing stage, the middle 4 s segment is selected to avoid potential silence at the start and the end. The log Mel spectrograms are further extracted from the audio signals with a window length of 256, a hop length of 128, and 128 Mel bins, as they incorporate several properties of the human auditory system [22]. The CNN encoder consists of four convolutional blocks with output channel numbers of 64, 128, 256, and 512, where each block contains two convolutional layers with the same output channel number followed by a local max pooling layer with a kernel size of 2 × 2. For the classification task, the model consisting of the CNN encoder followed by a global max pooling layer and an FC layer is called 'CNN-8'. During training, the CNNs are optimised by an 'Adam' optimiser with an initial learning rate of 0.001 and a batch size of 16. To stabilise the optimisation, the learning rate is reduced by a factor of 0.9 every 200 iterations, and the training procedure is stopped at the 10 000-th iteration. To mitigate the class imbalance problem, each class in L_NLL is given a weight inversely proportional to the number of samples of that class. The value of α is experimentally set to 0.1.

Reproducibility Environment. Our experiments are run on NVIDIA GeForce GTX 1080 Ti graphics cards. The PyTorch code is released at: https://github.com/L3S/PrototypeSound.

We compare our proposed approach with the following state-of-the-art (SOTA) methods on the ICBHI database. Table 2 presents the results. Our approach performs better than all of the SOTA methods when comparing the AS scores, and significantly outperforms the MFCC-HG approach (p < .001 in a one-tailed z-test). Table 3 shows our ablation study on the prototype layers and batch normalisation in Prototype-2D with the vanilla similarities. In general, the performance of most prototype learning variants is comparable to that of the basic CNN-8 model. In particular, several prototype learning variants increase the UAR values, leading to higher SE values on the abnormal classes (crackle, wheeze, and both). Both high accuracy and interpretability are thus preserved in our approach. The batch normalisation procedure is also analysed in our models with the vanilla similarities. In Table 3, batch normalisation leads to improvements for the vanilla element-wise similarity and the vanilla average similarity, whereas it results in very low performance (i.e., SP = 0) for the vanilla maximum similarity. In this regard, we select the best batch normalisation setting for each vanilla similarity for further experiments.
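For clarity, a small sketch of how the reported scores can be computed from respiratory-cycle predictions is given below; the label encoding (0 = normal, 1 = crackle, 2 = wheeze, 3 = both) is an assumption, and SE is taken here to count correctly classified abnormal cycles, following the official ICBHI definition.

```python
import numpy as np

def icbhi_metrics(y_true, y_pred, normal_label: int = 0, num_classes: int = 4) -> dict:
    """Sketch of the reported metrics:
    SE  - correctly classified abnormal cycles / all abnormal cycles,
    SP  - correctly classified normal cycles / all normal cycles,
    AS  - (SE + SP) / 2, the official ICBHI score,
    UAR - unweighted average of the per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    abnormal = y_true != normal_label
    se = float(np.mean(y_pred[abnormal] == y_true[abnormal]))
    sp = float(np.mean(y_pred[~abnormal] == normal_label))
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in range(num_classes) if np.any(y_true == c)]
    return {"SE": se, "SP": sp, "AS": (se + sp) / 2, "UAR": float(np.mean(recalls))}
```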
Number of Prototypes. We compare the effect of the number of prototypes for each approach in Fig. 2. The performances of the proposed models are comparable when N varies from 1 to 5, indicating that generating one prototype per class is sufficient.

Similarity Comparison. Prototype-2D with vanilla element-wise similarity mostly performs better than the other approaches, perhaps due to its capability of generating prototypes with time-frequency information while requiring fewer parameters than the models with attention-based similarity. When comparing the three vanilla similarities, the vanilla element-wise similarity always outperforms the other two.

As the generated prototypes contain multiple channels and have a small spatial size due to the local pooling layers, it is challenging to visualize them directly. Hence, we project the prototypes to their closest inputs of the models by searching for the closest intermediate representation f(x):

min_{j∈[1;J]} dist(p_i, f(X_j)), dist(p_i, f(X_j)) = e^{-S}, (9)

where X_j is the j-th of the J instances, dist is the distance, and S denotes the similarity. Herein, the log Mel spectrograms obtained by the projection procedure for our best model on the test set are depicted in Fig. 3. The projection of prototypes is helpful to analyse the characteristics of each class of respiratory sounds. We can see that the normal respiratory cycles are regular, while the others are not. As crackle sounds are attributed to sudden bursts of air within bronchioles, they are explosive, transient, and non-musical [27]. The example in Fig. 3(b) reflects this character of crackle sounds. Different from crackle sounds, wheeze sounds are continuous and commonly observed in patients with obstructive airways diseases, e.g., asthma and chronic obstructive pulmonary disease (COPD) [28]. Musical wheeze sounds are sinusoidal in the time domain and are superimposed on normal breath sounds [28]. In Fig. 3(c), the regularity of the sounds reflects the nature of wheeze sounds. The wheeze log Mel spectrogram has smaller coefficients over a range of Mel frequencies than the crackle one, probably indicating that the wheeze sound is weaker. In Fig. 3(d), the musicality of the wheeze sound is difficult to observe when wheeze and crackle sounds occur simultaneously.

The prototype learning paradigm, which is widely used for example-based explanation and case-based reasoning, has recently been adapted to classification to jointly improve classification performance and result interpretability. This paper developed a prototype learning framework for interpretable respiratory sound classification by generating prototypical feature maps that were integrated into the training of the predictive model. Besides increasing predictive performance, the learnt prototypes can introduce new cases that assist physicians in learning from automatic diagnosis and making informed decisions. In future work, we plan to explore other types of explanations, such as concepts and criticisms [14], as well as reconstruct the original audio signals.
[1] Automatic adventitious respiratory sound analysis: A systematic review
[2] Methods for adventitious respiratory sound analyzing applications based on smartphones: A survey
[3] An open access database for the evaluation of respiratory sound classification algorithms
[4] Contrastive embedding learning method for respiratory sound classification
[5] Transformer-based CNNs: Mining temporal context information for multi-sound COVID-19 diagnosis
[6] Adventitious respiratory classification using attentive residual neural networks
[7] Towards scalable and unified example-based explanation and outlier detection
[8] This looks like that: Deep learning for interpretable image recognition
[9] Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions
[10] ProtoPShare: Prototypical parts sharing for similarity discovery in interpretable image classification
[11] ProtoAttend: Attention-based prototypical learning
[12] Interpretable and steerable sequence learning via prototypes
[13] Neural prototype trees for interpretable fine-grained image recognition
[14] An interpretable deep learning model for automatic sound classification
[15] Coughing-based recognition of Covid-19 with spatial attentive ConvLSTM recurrent neural networks
[16] Robust CFAR radar detection using a k-nearest neighbors rule
[17] CAA-Net: Conditional atrous CNNs with attention for explainable device-robust acoustic scene classification
[18] Ada-SISE: Adaptive semantic input sampling for efficient explanation of convolutional neural networks
[19] Interpretable image recognition with hierarchical prototypes
[20] CovNet: A transfer learning framework for automatic COVID-19 detection from crowd-sourced cough sounds
[21] Generating and protecting against adversarial attacks for deep speech-based emotion recognition models
[22] Should deep neural nets have ears? The role of auditory features in deep learning approaches
[23] Hidden Markov model based respiratory sound classification
[24] Automatic detection of patient with respiratory diseases using lung sound analysis
[25] An automated lung sound preprocessing and classification system based on spectral analysis methods
[26] LungBRN: A smart digital stethoscope for detecting respiratory disease using bi-ResNet deep learning algorithm
[27] Feature extraction using time-frequency/scale analysis and ensemble of feature sets for crackle detection
[28] Wheeze detection based on time-frequency analysis of breath sounds