Lung Sound Classification Using Co-tuning and Stochastic Normalization

Truc Nguyen and Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria
(e-mail: t.k.nguyen@tugraz.at; pernkopf@tugraz.at)

August 4, 2021

This work was supported by the Vietnamese-Austrian Government scholarship and the Austrian Science Fund (FWF) under project number I2706-N31.

Abstract—In this paper, we use pre-trained ResNet models as backbone architectures for the classification of adventitious lung sounds and respiratory diseases. The knowledge of the pre-trained model is transferred by using vanilla fine-tuning, co-tuning, stochastic normalization and the combination of co-tuning and stochastic normalization. Furthermore, data augmentation in both time domain and time-frequency domain is used to account for the class imbalance of the ICBHI and our multi-channel lung sound dataset. Additionally, we apply spectrum correction to account for the variations of the recording device properties in the ICBHI dataset. Empirically, our proposed systems mostly outperform all state-of-the-art lung sound classification systems for adventitious lung sounds and respiratory diseases on both datasets.

I. INTRODUCTION

Respiratory diseases have become one of the main causes of death in society. According to the World Health Organization (WHO), the "big five" respiratory diseases, i.e. asthma, chronic obstructive pulmonary disease (COPD), acute lower respiratory tract infections, lung cancer and tuberculosis, cause the death of more than 3 million people each year worldwide. Currently, CoViD-19, a special form of viral pneumonia related to the coronavirus first identified in Wuhan (China) in 2019 [1], has caused globally more than 158 million infections and 3,296,000 deaths [2]. On March 11, 2020, the WHO officially announced that CoViD-19 had reached global pandemic status. Furthermore, according to [3], the "big five" lung diseases, except lung cancer, have increased during the CoViD-19 pandemic. These respiratory diseases are characterised by highly similar symptoms, i.e. adventitious breathing sounds, which can be a confounding factor during diagnosis [1]. Due to their severe consequences, particularly in the case of CoViD-19, an early and accurate diagnosis of these types of diseases has become crucial.

Lung sounds convey relevant information related to pulmonary disorders with adventitious breathing sounds such as crackles, wheezes, or both [4], [5]. In the last decades, to facilitate a more objective assessment of lung sounds for the diagnosis of pulmonary diseases and conditions, computational methods, i.e. computational lung sound analysis (CLSA) [6], [7], have been developed. CLSA systems automatically detect and classify adventitious lung sounds by using digital recording devices, signal processing techniques and machine learning algorithms. They are also carefully evaluated in real-life scenarios and can be used as portable, easy-to-use devices without the necessity of expert interaction; this is especially beneficial when facing infectious diseases such as CoViD-19.
In CLSA systems, there are two popular classification tasks, namely (i) adventitious lung sound classification and (ii) respiratory disease classification. In adventitious lung sound classification, the recognition of normal and abnormal sounds (i.e. either crackles, wheezes, or both) is important, while for respiratory disease classification, several category schemes have been considered, e.g. binary classification (healthy and pathological), ternary chronic classification (healthy, chronic and non-chronic diseases) or six-class classification of distinct pathologies. Such systems have been evaluated on non-public datasets such as R.A.L.E. [8] or multi-channel lung sound data [9] (ours) and on public datasets, i.e. the ICBHI 2017 dataset [5] or the Abdullah University Hospital 2020 dataset [10]. Due to limitations in the amount and quality of available data, the reported performance of lung sound classification systems may be over-estimated and their generalization ability limited. To deal with these challenges, different feature extraction methods [11], [12], [13], [14], conventional machine learning [12], [15], [16], [17], [18], deep learning [19], [20], [21], [22], data augmentation, and transfer learning from ImageNet [23], [24] or audio scene datasets [25] have been explored.

In this work, we improve the generalization ability and model performance of adventitious lung sound classification and respiratory disease classification systems using the ICBHI 2017 dataset and our multi-channel lung sound dataset. We exploit transfer learning approaches such as co-tuning [26] for different architectures of residual neural networks (ResNets). We use ResNet models pre-trained on the ImageNet classification task as backbone architectures, which require a 3-channel input, i.e. RGB color images. Therefore, the spectrograms are converted into 3 channels for the model input. In particular, log-mel spectrograms are replicated into three channels for the adventitious lung sound task or converted into RGB color spectrograms for respiratory disease classification. The pre-trained models are systematically exploited in the following configurations:

• Firstly, we fine-tune the pre-trained model on the target domain and update all top (i.e. feature representation) layers and bottom (i.e. task-specific) layers. We call this vanilla fine-tuning (a minimal code sketch follows this list).
• Secondly, we apply co-tuning for transfer learning [26], in which the representation layers and task-specific layers of both the source domain and the target domain are collaboratively fine-tuned. Co-tuning further updates the task-specific layers of the pre-trained model using a learned category relationship between source and target domains.
• Thirdly, we replace the Batch Normalization (BN) layers, which suffer from poor performance in case of a data distribution shift between training and test data. We introduce stochastic normalization (StochNorm) [27] in each residual block of the pre-trained backbone architecture. StochNorm is a parallel structure normalizing the activations of each channel by either mini-batch statistics or moving statistics to avoid over-dependence on sample statistics during training. Thus, it can be considered a regularization method. Furthermore, fine-tuning inherits additional prior knowledge from the moving statistics of the pre-trained network compared to vanilla fine-tuning. Both properties help to avoid over-fitting on small datasets such as the ICBHI and our lung sound dataset.
• Finally, we combine co-tuning and stochastic normalization to take advantage of both techniques.
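To make the first strategy concrete, the following is a minimal vanilla fine-tuning sketch, assuming a torchvision ResNet50 backbone; the learning rate and momentum mirror the settings reported in Section IV, but the snippet is illustrative rather than the exact training code.

```python
# Minimal vanilla fine-tuning sketch (assumptions: torchvision backbone,
# 4-class ALSC head; lr and momentum follow Section IV).
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes: int = 4) -> nn.Module:
    model = models.resnet50(pretrained=True)   # representation layers F
    # Replace the ImageNet-specific head by a randomly initialized head H
    # whose output space matches the target labels.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_finetune_model()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    # x: (batch, 3, H, W) log-mel spectrograms replicated to three channels.
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return float(loss)
```

The same head-replacement pattern applies to the other ResNet depths explored below.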
In addition, we apply data augmentation in both time domain and time-frequency domain to account for the class imbalance in the datasets. Furthermore, we use spectrum correction for lung sound classification to compensate for the recording device variations in the ICBHI dataset. The main contributions of the paper are:

• We propose robust classification systems for adventitious lung sounds and respiratory diseases for the ICBHI and our multi-channel lung sound dataset.
• We exploit the transferred knowledge of pre-trained models by vanilla fine-tuning, co-tuning, stochastic normalization and a combination of co-tuning and stochastic normalization.
• We introduce spectrum correction to improve the generalization ability by accounting for recording device differences.
• In addition to commonly used data augmentation techniques, we double the size of the training dataset by flipping samples in the target domain. This enhances the performance of adventitious lung sound classification.
• We review state-of-the-art adventitious lung sound and respiratory disease classification systems for the ICBHI and our multi-channel lung sound dataset.

The outline of the paper is as follows: In Section II, we introduce the lung sound databases. In Section III, we present our lung sound classification systems. In Section IV, we present the experimental setup, including the evaluation metrics, and the experimental results. We review related works in Section V. Finally, we conclude the paper in Section VI.

II. LUNG SOUND DATABASES

The ICBHI 2017 database [5] consists of 920 annotated audio recordings from 126 subjects corresponding to patient pathological conditions, i.e. healthy and seven distinct disease categories (Pneumonia, Bronchiectasis, COPD, upper respiratory tract infection (URTI), lower respiratory tract infection (LRTI), Bronchiolitis, Asthma). The audio was recorded using different stethoscopes, i.e. AKGC417L, Meditron, Litt3200 and LittC2SE. The recording duration ranges from 10s to 90s and the sampling rate from 4000Hz to 44100Hz. Each recording is composed of a certain number of breathing cycles with corresponding annotations of the beginning and the end, and of the presence/absence of crackles and/or wheezes. The annotations of the database allow splitting the audio recordings into respiratory cycles. The cycle duration ranges from 0.2s to 16s, with an average of 2.7s. The database includes 6898 respiratory cycles with 3642 normal cycles, 1864 cycles containing crackles, 886 containing wheezes, and 506 containing both crackles and wheezes. We propose a classification system for the following tasks:

• ALSC: Adventitious lung sound classification (ALSC) is separated into two sub-tasks on respiratory cycles. The first one is a 4-class task classifying respiratory cycles into four classes (Normal, Crackles, Wheezes, and both Crackles and Wheezes). The second sub-task is a 2-class task distinguishing normal and abnormal lung sounds, where the abnormal class includes Crackles, Wheezes, and both Crackles and Wheezes. We evaluate our system on the official ICBHI data split, in which the ICBHI challenge divided the dataset into 60% for training and 40% for testing. Both sets contain different patients.
• RDC: Respiratory disease classification (RDC) also consists of two sub-tasks, operating on audio recordings. The first one is a 3-class task classifying audio recordings into the three groups Healthy, Chronic Diseases (i.e. COPD, Bronchiectasis and Asthma) and Non-Chronic Diseases (i.e. URTI, LRTI, Pneumonia and Bronchiolitis).
The second sub-task is a 2-class task (healthy/unhealthy), where the unhealthy class comprises the seven diseases.

The multi-channel lung sound database [9], [19], [28] has been recorded in a clinical trial. It contains lung sounds of 16 healthy subjects and 7 patients diagnosed with idiopathic pulmonary fibrosis (IPF). We used our 16-channel lung sound recording device (see Fig. 1) to record lung sounds over the posterior chest at two different airflow rates, with 3-8 respiratory cycles within 30s. The lung sounds were recorded with a sampling frequency of 16kHz. The sensor signals are filtered with a Bessel high-pass filter with a cut-off frequency of 80Hz and a slope of 24dB/oct. We extracted full respiratory cycles using the airflow signal from all recordings. We manually annotated the respiratory cycles in cooperation with respiratory experts from the Medical University of Graz, Austria. The numbers of breathing cycles with/without IPF are shown in Table I.

III. LUNG SOUND CLASSIFICATION SYSTEMS

The proposed systems include two key stages, i.e. feature processing and classification, as shown in Fig. 2. For the recording-wise tasks, the final prediction is obtained by majority voting [29] over the predicted labels of the individual segments with a length of 8 seconds belonging to the same recording.

We use the audio pre-processing and feature extraction techniques presented in [22] for both datasets. Audio recordings are resampled to 16kHz for the ALSC tasks of the ICBHI challenge and our dataset, while the RDC tasks use a sampling rate of 4kHz. Similar to our previous works on ALSC for ICBHI and our multi-channel dataset [22], [30], the respiratory cycles are split without overlap into segments. Furthermore, we apply sample padding in time-reversed order to achieve fixed-length segments without abrupt signal changes. For the RDC task of the ICBHI dataset, recordings are split into segments of the same length using 50% overlap; different segment lengths are investigated in Section IV. We then also apply sample padding to segments shorter than the fixed length. This means that the same splitting and sample padding procedure is applied to respiratory cycles and recordings to obtain the fixed-length segments for the ALSC and RDC tasks, respectively.

We use a window size of 512 samples for the fast Fourier transform (FFT) with 50% overlap between windows. The number of mel frequency bins is chosen as 50 and 45 for the ICBHI dataset and our multi-channel dataset, respectively. The logarithmic scale is applied to the magnitude of the mel spectrograms. The log-mel spectrograms are normalized to zero mean and unit variance. These spectrograms are then duplicated into three channels to match the input size of the pre-trained ResNet model for the ALSC task. For the RDC task of the ICBHI dataset, however, we convert the spectrogram into an RGB color image and enlarge the image to twice its size using linear interpolation.

We observe a different frequency response across devices, which results in a performance degradation for under-represented devices. Hence, we calibrate the features of the audio segments by applying spectrum correction instead of training or fine-tuning the model for a specific device [31], [32]. The spectrum correction or calibration proposed in [33], first applied to acoustic scene classification, scales the frequency response of the recording devices. In particular, calibration coefficients are calculated for each device based on data from reference devices. Table II shows the recorded data portion of each recording device of the ICBHI dataset.
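As an illustration, the segmentation and feature extraction steps above can be sketched as follows; librosa and the exact padding loop are assumptions, and the normalization here is per spectrogram.

```python
# Sketch of the segmentation and log-mel pipeline (assumptions: librosa for the
# mel spectrogram; per-spectrogram normalization; 16 kHz ALSC settings).
import numpy as np
import librosa

SR = 16000            # sampling rate for the ALSC tasks
SEG_LEN = 8 * SR      # fixed segment length of 8 s
N_FFT, HOP, N_MELS = 512, 256, 50   # 50% window overlap, 50 mel bins (ICBHI)

def pad_time_reversed(x: np.ndarray, length: int) -> np.ndarray:
    # Append the time-reversed signal until the fixed length is reached,
    # avoiding abrupt signal changes at the padding boundary.
    while len(x) < length:
        x = np.concatenate([x, x[::-1][: length - len(x)]])
    return x

def logmel_3ch(cycle: np.ndarray) -> np.ndarray:
    seg = pad_time_reversed(cycle, SEG_LEN)[:SEG_LEN]
    mel = librosa.feature.melspectrogram(
        y=seg, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    logmel = np.log(mel + 1e-10)
    logmel = (logmel - logmel.mean()) / (logmel.std() + 1e-8)
    # Replicate to three channels to match the pre-trained ResNet input.
    return np.stack([logmel] * 3, axis=0)
```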
The magnitude spectrum s_i^k of each segment i recorded by device k is the spectrum averaged along the time axis over all FFT windows. The mean device spectrum is s̄^k = (1/N_k) Σ_{i=1}^{N_k} s_i^k, where N_k is the number of segments of device k. The reference spectrum s_ref is furthermore averaged over all mean device spectra of the reference devices, s_ref = (1/|D|) Σ_{k∈D} s̄^k, where D contains the indices of the reference devices. We investigate different sets of reference devices based on their prominence, i.e. only one device (either AKGC417L or Meditron), both AKGC417L and Meditron, or all recording devices. The scaling coefficient vector of each device is the element-wise fraction (i.e. per frequency bin) of the reference spectrum and the corresponding device spectrum, c^k = s_ref / s̄^k. The magnitude of the STFTs of each device is scaled with the corresponding coefficient vector c^k over the frequency bins. We empirically observed that the normalization in the spectrogram domain is more successful than in the log-mel domain.

The ICBHI 2017 dataset is extremely imbalanced: around 53% of the respiratory cycles belong to the normal class and 86% of the audio recordings belong to COPD. Furthermore, in our multi-channel lung sound dataset, around 71% of the respiratory cycles are annotated as normal. Therefore, we use data augmentation in both time domain and time-frequency domain in order to balance the training dataset and prevent over-fitting.

1) Time Domain: For ALSC of the ICBHI dataset, we use time stretching to increase/reduce the sampling rate of an audio signal without affecting its pitch [34]. It is used to double the number of segments of the wheeze class and of the combined wheeze and crackle class. We use a random sampling rate uniformly distributed within ±10% of the original sampling rate. For RDC of ICBHI, time stretching is used for all classes to double the number of samples. Furthermore, on the doubled training set, further data augmentation methods, i.e. volume adjusting, noise addition, pitch adjusting and speed adjusting, are randomly applied based on a predefined probability.

Fig. 2. Proposed transferred knowledge systems using co-tuning for transfer learning or stochastic normalization.

2) Time-Frequency Domain:
• Vocal tract length perturbation (VTLP) selects a random warp factor α for each recording and maps the frequency f of the signal bandwidth to a new frequency f′ [35]. We select α from a uniform distribution α ∼ U(0.9, 1.1) and set the maximum signal bandwidth to F_hi = [3200, 3800]. VTLP is applied directly to the mel filter bank rather than distorting each spectrogram. VTLP is used to enlarge the dataset for all classes in both tasks, for both the original training set and the time-stretched data.
• Additionally, we double the log-mel features by adding the flipped log-mel features (along the frequency axis) for the ALSC tasks and the crackle detection task on our dataset.
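Before turning to the classification stage, the spectrum correction defined at the beginning of this section can be summarized in a short sketch; the container types and the small epsilon guard are illustrative assumptions.

```python
# Sketch of the device spectrum correction (assumptions: magnitude STFTs of
# shape (freq_bins, frames); epsilon guard added for numerical safety).
import numpy as np

def device_mean_spectrum(stfts: list) -> np.ndarray:
    # s_bar_k: average each segment spectrum over time, then over all segments.
    return np.mean([np.abs(S).mean(axis=1) for S in stfts], axis=0)

def correction_coefficients(mean_spectra: dict, reference: list) -> dict:
    # s_ref: mean over the mean spectra of the chosen reference devices.
    s_ref = np.mean([mean_spectra[k] for k in reference], axis=0)
    # c_k: element-wise ratio per frequency bin.
    return {k: s_ref / (s_bar + 1e-12) for k, s_bar in mean_spectra.items()}

def apply_correction(S: np.ndarray, c_k: np.ndarray) -> np.ndarray:
    # Scale the magnitude STFT of a segment before the mel projection.
    return S * c_k[:, None]
```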
1) Transfer Learning: Deep neural networks (DNNs) trained from scratch require large amounts of data. As data collection is time-consuming for lung sounds, transferring pre-trained parameters from DNNs trained on other datasets, e.g. ImageNet, is advantageous: less data of the target task is required, faster training is enabled, and usually a better performance is achieved after fine-tuning the model on the target task [36]. Therefore, fine-tuning brings great benefit to the research community. We are given a DNN M_0 pre-trained on a source dataset D_s, which is to be adapted to a target dataset D_t. In this work, D_s is ImageNet and D_t is the ICBHI 2017 dataset or our multi-channel lung sound dataset. Only D_t and the pre-trained model M_0 are available during fine-tuning. D_s and D_t are different domains, which may have different input spaces X_s and X_t, corresponding to different output spaces Y_s and Y_t, respectively. Therefore, M_0 cannot be directly applied to the target data. It is common practice to split M_0 into two parts: a general representation function F_{θ^0} (parametrized by θ^0) and a task-specific function G_{θ_s^0} (parametrized by θ_s^0), which denotes the last layers of the pre-trained model. Usually, the representation function is retained and the task-specific function is replaced by a randomly initialized function H_{θ_t} (parametrized by θ_t) whose output space matches Y_t. Hence, we optimize

    min_{θ, θ_t} Σ_{(x, y_t) ∈ D_t} l( H_{θ_t}( F_θ(x) ), y_t ),

where l(·) is a loss function such as the cross-entropy for classification and θ is initialized with θ^0. We call this vanilla fine-tuning. The pre-trained parameters θ^0 provide a good starting point for the optimization. This means that vanilla fine-tuning on a target dataset benefits from transferring the knowledge of the representation part F_{θ^0} of the source dataset. In this work, we explore ResNet architectures of different depths, i.e. ResNet18, ResNet34, ResNet50 and ResNet101, as neural network backbones.

2) Co-tuning: Co-tuning for transfer learning enables full knowledge transfer of pre-trained models using a two-step framework [26]. The first step learns the relationship between source categories and target categories from the pre-trained model with calibrated predictions. In the second step, target labels (one-hot labels) and source labels (probabilistic labels translated by the category relationship) collaboratively supervise the fine-tuning process. Co-tuning empirically proves its ability to enhance performance compared to vanilla fine-tuning of ImageNet pre-trained models [26]. In this work, we apply co-tuning to fully exploit the ImageNet pre-trained models for significantly different datasets such as the ICBHI and our multi-channel lung sound dataset. The co-tuning block in Fig. 2 shows the source output layer G_{θ_s}, the target output layer H_{θ_t}, the ResNet50 backbone F_θ and the category relationship, i.e. the conditional distribution p(y_s | y_t) relating the output spaces. During training, the category relationship p(y_s | y_t) is needed to translate target labels y_t into probabilistic source categories y_s, which are used to fine-tune the task-specific function G_{θ_s}. The gradient of G_{θ_s} can be back-propagated into F_θ. Both outputs y_t and y_s collaboratively supervise the transfer learning process, described as

    min_{θ, θ_s, θ_t} Σ_{(x, y_t) ∈ D_t} [ l( H_{θ_t}( F_θ(x) ), y_t ) + λ · l( G_{θ_s}( F_θ(x) ), p(y_s | y_t) ) ],

where λ trades off the target and source supervision. The variables θ and θ_s are initialized from the pre-trained weights θ^0 and θ_s^0. In this way, the pre-trained parameters θ^0 and θ_s^0 are fully exploited in the collaborative training. During inference, the task-specific layers G_{θ_s} are removed to avoid additional cost.
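A minimal sketch of this collaborative objective is given below, assuming a 1000-class ImageNet source head and a precomputed category relationship `rel` of shape (num_target_classes, 1000); the prediction calibration of [26] is omitted.

```python
# Sketch of the co-tuning loss (assumptions: `backbone` is F with the ImageNet
# head removed, `source_head`/`target_head` are G and H, `rel` approximates
# p(y_s | y_t) and is estimated on a held-out validation split).
import torch
import torch.nn.functional as F

def cotuning_loss(backbone, source_head, target_head,
                  x: torch.Tensor, y_t: torch.Tensor,
                  rel: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    feat = backbone(x)                       # shared representation F(x)
    target_loss = F.cross_entropy(target_head(feat), y_t)
    # Translate hard target labels into probabilistic source labels.
    y_s = rel[y_t]                           # (batch, 1000) soft labels
    log_p = F.log_softmax(source_head(feat), dim=1)
    source_loss = -(y_s * log_p).sum(dim=1).mean()
    return target_loss + lam * source_loss
```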
The category relationship p(y_s | y_t) is computed from the output of the task-specific function G_{θ_s^0} (i.e. a probability distribution over the source categories Y_s) and the target labels Y_t in one of two ways:

• Direct approach: The category relationship is determined as the average of the predictions of the pre-trained source model over all samples of each target category, i.e.

    p(y_s | y_t = y) ≈ (1 / |{(x, y_t) ∈ D_t : y_t = y}|) Σ_{(x, y_t) ∈ D_t : y_t = y} M_0(x),

where the pre-trained model M_0 is considered a probabilistic model approximating the conditional distribution, M_0(x) ≈ p(y_s | x).
• Reverse approach: When the categories of the pre-trained dataset are diverse enough to compose a target category, we can use a reverse approach. We learn the mapping y_s → y_t from (M_0(x_t), y_t) pairs, where y_t are the target labels and M_0(x) ≈ p(y_s | x) is a probability distribution over the source categories Y_s. Then p(y_s | y_t) can be calculated from p(y_t | y_s) by Bayes' rule.

In addition, according to [26], it is necessary to calibrate the neural network, i.e. the probability output of the pre-trained model, to enhance performance.

3) Stochastic Normalization: In [27], stochastic normalization is proposed to avoid over-fitting during fine-tuning on small datasets. It replaces the Batch Normalization (BN) layers. It implements a two-branch architecture with one branch normalized by mini-batch statistics and the other branch normalized by moving statistics (specified in detail below). A stochastic selection mechanism similar to Dropout is used between the two branches to avoid over-dependence on particular sample statistics. This can be interpreted as an architectural regularization. Furthermore, StochNorm uses the moving statistics of the pre-trained model as initial statistics to better exploit the prior knowledge of the pre-trained network.

Given a mini-batch of feature maps of a channel, z = {z_i}_{i=1}^m, and a moving statistics update rate α ∈ (0, 1), the normalization in the two branches during training is calculated as

    ẑ_{i,0} = (z_i − μ) / sqrt(σ² + ε),    ẑ_{i,1} = (z_i − μ̃) / sqrt(σ̃² + ε),

where the mean μ and variance σ² of the current mini-batch of size m are used as usual in the first branch, while the second branch uses the moving statistics μ̃ and σ̃² of the training data, updated as

    μ̃ ← (1 − α) μ̃ + α μ,    σ̃² ← (1 − α) σ̃² + α σ².

The moving statistics are initialized with the corresponding parameters of the pre-trained model. During forward propagation, either ẑ_{i,0} or ẑ_{i,1} is randomly selected in each channel of the normalization layers and at each training step, i.e.

    ẑ_i = (1 − s) ẑ_{i,0} + s ẑ_{i,1},

where s is the branch-selection variable generated from a Bernoulli distribution, s ∼ Bernoulli(p). The learnable scale and shift parameters γ and β are applied after the stochastic selection as usual,

    z̃_i = γ ẑ_i + β.

Stochastic normalization in Fig. 2 uses a ResNet backbone in which the BN layers are replaced by StochNorm.

4) Combination of Co-tuning and StochNorm: We empirically evaluate the combination of co-tuning and StochNorm for lung sound classification. To do so, the category relationship is first calculated based on the pre-trained ResNet models, followed by replacing the BN layers with StochNorm modules in the backbone of the original co-tuning architecture (i.e. replacing the green ResBlocks of co-tuning by the blue ResBlocks of stochastic normalization in Fig. 2). After that, the fine-tuning of co-tuning is performed on the new architecture.
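A simplified StochNorm layer for 2-D feature maps might look as follows; reusing the buffers of PyTorch's BatchNorm2d and the selection probability p = 0.5 are assumptions, and initializing the moving statistics from the pre-trained BN parameters is left to the caller.

```python
# Sketch of a stochastic normalization layer (assumption: p = 0.5; the moving
# statistics should be initialized from the replaced pre-trained BN layer).
import torch
import torch.nn as nn

class StochNorm2d(nn.BatchNorm2d):
    def __init__(self, num_features: int, p: float = 0.5, **kwargs):
        super().__init__(num_features, **kwargs)
        self.p = p

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return super().forward(z)  # inference: moving statistics only
        mu = z.mean(dim=(0, 2, 3))
        var = z.var(dim=(0, 2, 3), unbiased=False)
        # Branch 0: mini-batch statistics; branch 1: moving statistics.
        z0 = (z - mu[None, :, None, None]) / \
             torch.sqrt(var + self.eps)[None, :, None, None]
        z1 = (z - self.running_mean[None, :, None, None]) / \
             torch.sqrt(self.running_var + self.eps)[None, :, None, None]
        # Per-channel Bernoulli selection between the two branches.
        s = torch.bernoulli(
            torch.full((1, z.size(1), 1, 1), self.p, device=z.device))
        z_hat = (1.0 - s) * z0 + s * z1
        # Update moving statistics with rate alpha = self.momentum.
        with torch.no_grad():
            self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mu)
            self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)
        # Learnable affine transform applied after the stochastic selection.
        return self.weight[None, :, None, None] * z_hat + \
               self.bias[None, :, None, None]
```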
IV. EXPERIMENTS

In this section, we first provide the details of the experimental setup. Furthermore, we empirically evaluate the following cases:

• Transfer learning of different pre-trained ImageNet ResNet models on the ICBHI dataset.
• Ablation studies for the respiratory segment length, spectrum correction and flipping data augmentation.
• Transfer learning of different ResNet models pre-trained on ImageNet and ICBHI for our multi-channel lung sound dataset.

Our systems for the ALSC and RDC tasks on the ICBHI dataset are also compared against state-of-the-art works for the official ICBHI data split and five-fold cross-validation. Additionally, we compare our best system for crackle detection to our previous work on the multi-channel lung sound dataset.

We use the evaluation metrics supported by the ICBHI Challenge [5] for ALSC with 4 classes. The evaluation is based on respiratory cycles using the sensitivity (SE), the specificity (SP), the average score (AS), defined as the average of sensitivity and specificity, and the harmonic score (HS), defined as the harmonic mean of sensitivity and specificity. For 2 classes, we determine SE and SP as in [20] and [14], and AS and HS as in [5]. Similarly, for RDC with 3 and 2 classes, a recording-wise evaluation is performed using SE and SP as in [20] and [14], and AS and HS as in [5]. Furthermore, for our multi-channel lung sound dataset, we calculate the precision (P+), the sensitivity or recall (Se), and the F1-score (F1) as specified in [19]. Precision indicates how many of the respiratory cycles recognized as crackles actually contain crackles. Sensitivity indicates how many of the respiratory cycles containing crackles are actually recognized as crackles. The F1-score is the harmonic mean of precision and sensitivity.

We evaluate our ALSC system for 4 and 2 classes on the official ICBHI 2017 dataset split, which consists of 60% of the recordings for training and 40% for testing. Each patient is either in the training or in the test set. The reported performance is the average accuracy of five independent runs. For the RDC task with 3 and 2 classes, our proposed system is evaluated on both the official dataset split and five-fold cross-validation. Again, data of each patient is not shared between folds. For five-fold cross-validation, one fold is used as test set and the remaining folds are used for training. As co-tuning requires a validation set to compute the category relationship, we randomly select 20% of the samples from the training set. For cross-validation, the average of the best performances on the test sets is reported.

Due to the limited number of data samples in our multi-channel lung sound dataset, we use 7-fold cross-validation with the recordings of each IPF subject appearing exactly once in the test set. Each subject is assigned to either the training, the validation or the test set. The best model is selected based on the best accuracy on the validation set. The reported performance of the system is the average accuracy over the seven folds using the same data splits.

Experiments are implemented in PyTorch [37]. For vanilla fine-tuning, the learning rate and the number of epochs are set to 0.001 and 150 for all tests, respectively. The fine-tuning of the co-tuning and stochastic normalization techniques updates the weights after each mini-batch. The learning rates of the feature representation layers and the last layer are set to 0.001 and 0.01, respectively. The fine-tuning process optimizes the cross-entropy loss using SGD with a momentum of 0.9. The batch size is 32 for all experiments.
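For reference, a minimal sketch of the cycle-wise ICBHI scores described above, assuming class index 0 denotes the normal class:

```python
# Sketch of the ICBHI scores (assumption: class index 0 is "normal" and all
# other indices are adventitious classes, following the challenge definition [5]).
import numpy as np

def icbhi_scores(y_true: np.ndarray, y_pred: np.ndarray):
    normal = y_true == 0
    sp = np.mean(y_pred[normal] == 0)                 # specificity (SP)
    se = np.mean(y_pred[~normal] == y_true[~normal])  # sensitivity (SE)
    avg_score = 0.5 * (se + sp)                       # average score (AS)
    harm_score = 2.0 * se * sp / (se + sp + 1e-12)    # harmonic score (HS)
    return se, sp, avg_score, harm_score
```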
1) Effect of transfer learning techniques: We evaluate vanilla fine-tuning (VanillaFineTuning), co-tuning (CoTuning), stochastic normalization (StochNorm) and the combination of co-tuning and stochastic normalization (CoTuning-StochNorm) for different ResNet architectures pre-trained on the ImageNet dataset for the ALSC task of 4 classes (see Fig. 3) and the RDC task of 3 classes (see Fig. 5) on the official ICBHI dataset split. These systems use a segment length of 8s, spectrum correction using reference data s_ref of all devices, and all data augmentation methods introduced in Section III.

Fig. 3 shows that ResNet50 is the best performing backbone architecture for these transfer learning techniques on the 4-class ALSC task. ResNet101 also performs well, except for vanilla fine-tuning. Co-tuning achieves the best performance of ∼58% compared to the other techniques. Although CoTuning and StochNorm significantly improve the performance over VanillaFineTuning, the combination of co-tuning and StochNorm is not able to outperform the individual techniques on this task.

Fig. 4 visualizes the average pooling outputs of the ResNet50 architecture for the different transfer learning techniques, projected to 2D by t-distributed stochastic neighbourhood embedding (t-SNE) [38]. The distributions of the training set using vanilla fine-tuning, co-tuning, stochastic normalization and the combination of co-tuning and stochastic normalization are shown in a), b), c) and d), respectively. Compared to vanilla fine-tuning (a) and stochastic normalization (c), the distributions of the 4 classes using co-tuning (b) and the combination of co-tuning and stochastic normalization (d) exhibit a larger margin between the classes. This shows that collaborative fine-tuning using the category relationship of source and target domain is useful for the adventitious lung sound classification task.

In Fig. 5, we see that the different transfer learning techniques using the ResNet101 model achieve the best performance for the 3-class RDC task. The ResNet50 model works better than the others for vanilla fine-tuning. CoTuning-StochNorm and StochNorm achieve a better performance compared to CoTuning and VanillaFineTuning. This demonstrates the efficiency of stochastic normalization in the fine-tuning process for the RDC task.

2) Respiratory segment length: The length of the respiratory cycles in the ICBHI dataset varies over a wide range. Hence, we split the cycles into segments and apply sample padding in order to obtain fixed-length segments. We evaluate different segment lengths for the ResNet50 model fine-tuned by co-tuning with data augmentation in both time domain and time-frequency domain and with spectrum correction. Results are shown in Table III. The best average score (AS) for 4 classes is obtained with 8s fixed-length segments. We therefore also use 8s as the fixed length for the other tasks of the ICBHI and our lung sound dataset.

3) Spectrum correction: We experiment on the ICBHI dataset without spectrum calibration and with spectrum calibration using different reference spectra s_ref, determined by one or more devices. No-Calib denotes that no spectrum correction is applied. Calib-Dev1 and Calib-Dev2 denote calibration using data of the AKGC417L and the Meditron device, respectively. Calib-Dev1Dev2 denotes calibration using data of both the AKGC417L and Meditron devices, and Calib-AllDev denotes spectrum adaptation using reference data of all four devices. From Table IV, we can see that co-tuning of the ResNet50 model using reference data of all devices achieves the best performance; it is 1.62% (absolute) better than without spectrum calibration. Thus, we apply spectrum calibration for both adventitious lung sound classification and respiratory disease classification.

4) Flipping data augmentation: We apply data augmentation in the time domain and VTLP in order to balance the dataset. Here, we focus on the influence of the feature flipping data augmentation (see Section III). Fig. 6 shows that, when the system does not use spectrum correction, doubling the size of the augmented training set by the flipping technique consistently performs well for vanilla fine-tuning, co-tuning and stochastic normalization. It significantly improves the performance of vanilla fine-tuning and stochastic normalization by about 3% and 2%, respectively. For co-tuning, the flipping data augmentation achieves an improvement of 1% accuracy. Furthermore, we can see from Fig. 6 that the combination of spectrum calibration and flipping data augmentation consistently enhances the robustness of the adventitious lung sound classification systems.
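The flipping augmentation itself is a one-line operation; a minimal sketch, assuming log-mel features of shape (mel_bins, frames):

```python
# Sketch of the frequency-axis flipping augmentation.
import numpy as np

def flip_frequency(logmel: np.ndarray) -> np.ndarray:
    # Mirror the mel-frequency axis; the time structure is preserved.
    return logmel[::-1, :].copy()

def double_with_flips(features: list) -> list:
    # Double the training set by appending the flipped variants.
    return features + [flip_frequency(f) for f in features]
```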
5) Effect of pre-trained models on the multi-channel lung sound dataset: According to the above evaluation of the transfer learning techniques for different residual neural networks on the 4-class ALSC task, co-tuning achieves the best performance. Thus, we evaluate the effect of pre-trained models using co-tuning (CoTuning) for the 2-class ALSC task on our multi-channel lung sound dataset, shown in Fig. 7. Co-tuning using the ImageNet pre-trained model always outperforms co-tuning using the ICBHI pre-trained model, and the smaller ResNet architectures tend to work better for co-tuning. We can also see from Fig. 7 that the system with a ResNet34 backbone achieves the best performance, followed by ResNet18, ResNet50 and ResNet101. In addition, Fig. 7 shows that transferring the knowledge of full pre-trained models of the ICBHI and ImageNet datasets by co-tuning to our small lung sound dataset can achieve a better accuracy than vanilla fine-tuning (VanillaFineTuning) using the ImageNet pre-trained model.

Overall, the best segment length for the lung sound classification tasks is 8s. Spectrum correction is useful to improve the performance of our ALSC and RDC systems on the ICBHI dataset, as it corrects for the different frequency responses of the recording devices. The flipping data augmentation enhances the ALSC performance on both the ICBHI and our multi-channel lung sound dataset. The new transfer learning methods always outperform vanilla transfer learning. Co-tuning works better for the ALSC task, while StochNorm and its combination with co-tuning achieve a higher performance for the RDC task. Furthermore, ResNet34 and ResNet50 are more suitable for the ALSC tasks, while the larger ResNet101 model tends to be more robust for the RDC task in most transfer learning settings.

1) Comparison for the ICBHI dataset: Table VI and Table VII show the comparison of our best systems using the different transfer learning techniques with state-of-the-art systems (see Section V for more details on these systems) for the ALSC and RDC tasks, respectively. Our best systems are presented in bold and the highest scores in bold italic. It is notable that the performance on the official 60/40 ICBHI split without common patients in both sets is significantly lower than that of a random 80/20 split, i.e. 5-fold cross-validation with overlap of the same patients in both sets. Despite the same fixed segment length, the RDC systems always achieve a considerably higher performance than the ALSC systems across the different sub-tasks. The RDC tasks operate on full audio recordings, which contain many respiratory cycles, while the ALSC tasks are processed and evaluated on individual respiratory cycles.

We evaluate our proposed system on the official ICBHI split for the 4-class and 2-class ALSC tasks. Our best systems with the different fine-tuning techniques outperform the other ALSC systems. Our system using co-tuning of the pre-trained ResNet50 model achieves the highest ICBHI average scores of 58.29% and 64.74% for the 4-class and 2-class ALSC task, respectively. Our RDC systems are evaluated on the official ICBHI split and with the 5-fold cross-validation method.
On the official dataset split of the 3-class RDC task, our system achieves the best performance with the pre-trained ResNet101 architecture combined with stochastic normalization, obtaining an ICBHI average score of 92.72%. For the 5-fold cross-validation evaluation, stochastic normalization of the ResNet101 model obtains the best average score of 95.73%, which is around 5% better than the state-of-the-art systems. On the 2-class RDC task, our systems using stochastic normalization achieve average scores of 93.77% and 98.20% for the official split and 5-fold cross-validation, respectively. Our best 2-class RDC system outperforms all compared systems on the former evaluation, but scores 1.02% lower than the system in [39] on the latter.

2) Comparison for our multi-channel lung sound dataset: Table V compares our best systems using different transfer learning techniques with our previous system using fine-tuning of a multi-input CNN model [30] on the multi-channel lung sound dataset. Our best transfer learning systems outperform the previous system. The co-tuning system using the ResNet34 model pre-trained on ImageNet achieves the best performance, closely followed by the StochNorm system using the pre-trained ResNet50 model. The best F1-score is 2.82% better than that of the multi-input fine-tuned system [30].

V. RELATED WORK

We review recent works on ALSC and RDC using the ICBHI 2017 dataset and works on binary ALSC (i.e. crackle detection) using the multi-channel lung sound database. In general, it is difficult to compare the scores of some of the proposed methods on ICBHI, as a substantial number of works do not use the official data split or use different evaluation metrics. There are two main directions: (i) conventional classifiers using low-level features in the time or frequency domain, and (ii) deep neural networks and robust machine learning techniques using spectral features.

1) Conventional approach: Jakovljevic et al. [12] used hidden Markov models and Gaussian mixture models on MFCCs, which were computed after spectral subtraction for noise suppression. Their system achieved an average score of 39.56% on the official ICBHI train/test split and 49.5% with 10-fold cross-validation for the 4-class ALSC task. Serbes et al. [40] proposed a 4-class ALSC system using support vector machines (SVMs) on STFT and wavelet features. It achieved an average score of 49.86% on the official ICBHI data split. Furthermore, Chambres et al. [11] proposed a system using a boosted tree method on low-level spectral features, i.e. bark bands, energy bands, mel bands and MFCCs, as well as other features, i.e. rhythm and tonal features, for both the ALSC and the binary RDC task. Their performance on the official ICBHI split was 49.63% and 85% for the 4-class ALSC and 2-class RDC task, respectively. A binary RDC system using the RUSBoost algorithm, which combines random under-sampling and boosting (with a decision tree as base classifier), was also introduced [47]. The input of the classifier are features selected from MFCCs, the discrete wavelet transform (DWT) and time domain features. The proposed system was evaluated on the authors' own ICBHI dataset split and achieved an average score of 87.1%. In addition, Mukherjee et al. [39] developed a method to detect patients with respiratory infections. They extracted features based on Linear Predictive Coefficients (LPC) for a multilayer perceptron classifier (MPC).
The method was evaluated on the ICBHI dataset using 5-fold cross-validation and achieved 99.22% accuracy for the 2-class RDC task.

2) Deep learning approach: Deep learning systems use CNNs, recurrent neural networks (RNNs) and hybrid architectures. They are combined with machine learning techniques such as data augmentation, ensemble methods and transfer learning to enhance robustness. Among the RNN-based systems, Kochetov et al. [31] proposed a system using a noise masking RNN (NMRNN) with MFCC features to classify lung sound cycles into four categories. The performance was evaluated with 5-fold cross-validation. It is the first work which considers the effect of the recording devices on the performance. They achieved scores of 64.8% and 68.5% with training data from all devices and from the most frequently occurring recording device (i.e. AKGC417L), respectively. Furthermore, a CNN for MFCCs was used in [45], which achieved average scores of 61% and 83% for the 4-class ALSC and the 3-class RDC task with a random train-test split of 80% and 20%. In [20], Perna et al. introduced different RNN architectures, such as long short-term memory (LSTM), gated recurrent units (GRU), bidirectional LSTM (BiLSTM) and bidirectional GRU (BiGRU), for MFCC features to perform the 4-class and 2-class ALSC and the ternary and 2-class RDC tasks. The results on a random 80%/20% train-test ICBHI split with overlap of the same patients in both sets are average scores of 74% and 81% for the 4-class and 2-class ALSC task, respectively. The average performance on the RDC tasks with 3 and 2 classes is 84% and 91%, respectively.

Furthermore, CNNs and hybrid architectures have been used. In [22], we proposed a lung sound classification system using a snapshot ensemble of CNNs for log-mel spectrograms. We applied temporal stretching and vocal tract length perturbation (VTLP) for data augmentation to deal with the class imbalance of the ICBHI dataset. Our system achieved average scores of 78.4% and 83.7% on a random train-test split of 80% and 20% with common patients in both sets for the ALSC task of 4 and 2 classes, respectively. Acharya et al. [24] introduced a deep CNN-RNN model for mel spectrograms to classify adventitious lung sounds into four classes. The performance with 5-fold cross-validation was 66.31%. When this system was combined with a patient-specific model tuning strategy, its performance increased to an average score of 71.81%. Similarly, Pham et al. [46], [14] introduced lung sound classification systems for anomalous sounds and respiratory diseases. In the first work [46], they proposed various deep learning architectures mainly based on CNNs and RNNs using gammatone-filtered spectrograms. They used an 80%/20% dataset split, where data from one subject may exist in both training and test set. An average ensemble of these systems achieved average scores of 80% and 86% for the 4-class and 2-class ALSC, respectively. The proposed CNN mixture-of-experts (CNN-MoE) model was suitable for the RDC task with 3 and 2 classes, with a performance of 90% and 91%, respectively. In [14], they proposed a CNN-MoE neural network for different feature types, i.e. MFCCs, log-mel, gammatone-filter and constant-Q-transform spectrograms. The gammatone-filter spectrogram was suggested for the ALSC tasks, while the log-mel spectrogram worked better for the RDC task. The average score of the 4-class ALSC task was 47% on the official ICBHI dataset split.
With 5-fold cross-validation and data of the same patients in both sets, their performance was 78.6% and 84% for the ALSC task of 4 and 2 classes, respectively. On the 3-class RDC task, they achieved an average score of 85% on the official ICBHI dataset split and 91% with 5-fold cross-validation.

Recently, CNN-based systems with diverse architectures, i.e. VGGNets and ResNets, have been increasingly introduced. Minami et al. [41] proposed a 4-class ALSC system using a VGG16 neural network on the combination of STFT spectrograms and scalograms. The performance was an average score of 54% on the official ICBHI dataset. Ma et al. proposed two ALSC systems for four classes [42], [43]. The first used an improved Bi-ResNet deep learning architecture based on STFT and wavelet features [42], while the second used a ResNet architecture with non-local blocks and mixup data augmentation [43]. The proposed systems achieved 50.16% and 52.26% on the official data split, respectively. The latter work [43] was also evaluated with 5-fold cross-validation and achieved an average score of 64.21%. Yang et al. [44] proposed a 4-class ALSC system combining the ResNet18 architecture with squeeze-and-excitation and spatial attention blocks using STFT spectrogram features. They obtained an average score of 49.55% on the official ICBHI dataset split. Demir et al. proposed 4-class ALSC systems using pre-trained models for STFT spectrograms converted into color images. In the first approach [23], the pre-trained model was used as a feature extractor combined with an SVM classifier. In the second approach [23], the pre-trained model was fine-tuned on the ICBHI dataset. They achieved accuracies of 65.5% and 63.09% with 10-fold cross-validation, respectively. In [48], they introduced a parallel-pooling CNN model for deep feature extraction, combined with a linear discriminant analysis (LDA) classifier and random subspace ensembles (RSE). The performance of the proposed system was 71.5% with 10-fold cross-validation; note, however, that the evaluation metrics differ. Additionally, Gairola et al. [32] proposed the RespireNet model, based on ResNet34 and fully connected layers, with a set of techniques, i.e. device-specific fine-tuning, concatenation-based augmentation, blank region clipping and smart padding, to improve the accuracy. The average score for the 4-class ALSC task was 56.2% and 68.5% for the official ICBHI dataset split and 5-fold cross-validation, respectively. They also evaluated the proposed system on the 2-class ALSC task and obtained 77.0% accuracy with 5-fold cross-validation.

3) Multi-channel lung sound dataset: In [19], Messner et al. introduced an event detection approach with bidirectional gated recurrent neural networks (Bi-GRNNs) using MFCCs to identify crackles in respiratory cycles. The proposed system was evaluated on the first version of the multi-channel lung sound dataset, including 10 lung-healthy subjects and 5 patients with IPF. The performance was an F-score of 72% with 5-fold cross-validation. In [28], a classification framework using the lung sound signals of all recording channels was introduced to identify healthy and pathological breathing cycles. The lung sounds of one breathing cycle from all recording channels were first transformed into STFT spectrograms. Then, the spectrograms were stacked into one compact feature vector. These features were fed into a CNN-RNN model for classification. Its score was 92% for the 7-fold cross-validation evaluation. We proposed a multi-input CNN model based on transfer learning for the ALSC task of crackles and normal sounds, namely crackle detection [30], on the multi-channel lung sound dataset.
The multi-input CNN model shares the network architecture of the pre-trained CNN model trained on the ICBHI dataset for respiratory cycles and their corresponding respiratory phases. Our system achieved an F-score of 84.71% with the 7-fold cross-validation evaluation.

VI. CONCLUSION

We propose robust fine-tuning frameworks to classify adventitious lung sounds and recognize respiratory diseases from lung auscultation recordings using the ICBHI and our multi-channel lung sound datasets. The transferred knowledge of pre-trained models from different ResNet architectures is exploited by vanilla fine-tuning, co-tuning, stochastic normalization and the combination of co-tuning and stochastic normalization. Furthermore, spectrum correction and flipping data augmentation are introduced to improve the robustness of our systems. Empirically, our proposed systems outperform almost all state-of-the-art systems for adventitious lung sound and respiratory disease classification. In addition, we also evaluate our adventitious lung sound classification approach using co-tuning on our multi-channel lung sound dataset to detect crackles using different pre-trained models of the ImageNet and ICBHI datasets. The best co-tuning system for 2-class lung sound classification achieves a performance that is 2.82% better than our previous work using a multi-input convolutional neural network. We also review state-of-the-art classification systems for adventitious lung sounds and respiratory diseases using the ICBHI dataset and our multi-channel lung sound dataset.

REFERENCES

• Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
• The "big five" lung diseases in CoViD-19 pandemic – a Google Trends analysis
• Auscultation of the respiratory system
• A respiratory sound database for the development of automated classification
• Computerized lung sound analysis as diagnostic aid for the detection of abnormal lung sounds: a systematic review and meta-analysis
• Automatic adventitious respiratory sound analysis: A systematic review
• RALE: A computer-assisted instructional package
• A robust multichannel lung sound recording device
• A dataset of lung sounds recorded from the chest wall using an electronic stethoscope
• Automatic detection of patient with respiratory diseases using lung sound analysis
• Hidden Markov model based respiratory sound classification
• Triple-classification of respiratory sounds using optimized S-transform and deep residual networks
• CNN-MoE based framework for classification of respiratory anomalies and lung disease detection
• Classification of lung sounds in patients with asthma, emphysema, fibrosing alveolitis and healthy lungs by using self-organizing maps
• Pattern recognition methods applied to respiratory sounds classification into normal and wheeze classes
• Wheezing recognition algorithm using recordings of respiratory sounds at the mouth in a pediatric population
• Lung sound classification based on Hilbert-Huang transform features and multilayer perceptron network
• Crackle and breathing phase detection in lung sounds with deep bidirectional gated recurrent neural networks
• Deep auscultation: Predicting respiratory anomalies and diseases via recurrent neural networks
• Classification of lung sounds using convolutional neural networks
• Lung sound classification using snapshot ensemble of convolutional neural networks
• Convolutional neural networks based efficient approach for classification of lung diseases
• Deep neural network for respiratory sound classification in wearable devices enabled by patient specific model tuning
• Lung sound recognition algorithm based on VGGish-BiGRU
• Co-tuning for transfer learning
• Stochastic normalization
• Multi-channel lung sound classification with convolutional recurrent neural networks
• Majority voting
• Crackle detection in lung sounds using transfer learning and multi-input convolutional neural networks
• Noise masking recurrent neural network for respiratory sound classification
• RespireNet: A deep neural network for accurately detecting abnormal lung sounds in limited data setting
• Acoustic scene classification for mismatched recording devices using heated-up softmax and spectrum correction
• A review of time-scale modification of music signals
• Vocal tract length perturbation (VTLP) improves speech recognition
• Rethinking ImageNet pre-training
• PyTorch: An imperative style, high-performance deep learning library
• Visualizing data using t-SNE
• Automatic lung health screening using respiratory sounds
• An automated lung sound preprocessing and classification system based on spectral analysis methods
• Automatic classification of large-scale respiratory sound dataset based on convolutional neural network
• LungBRN: A smart digital stethoscope for detecting respiratory disease using BiResNet deep learning algorithm
• LungRN+NL: An improved adventitious lung sound classification using non-local block ResNet neural network with mixup data augmentation
• Adventitious respiratory classification using attentive residual neural networks
• Convolutional neural networks learning from respiratory data
• Robust deep learning framework for predicting respiratory anomalies and diseases
• A novel method for automatic identification of respiratory disease from acoustic recordings
• Classification of lung sounds with CNN model using parallel pooling structure

ACKNOWLEDGMENT

This research was supported by the Vietnamese-Austrian Government scholarship and by the Austrian Science Fund (FWF) under the project number I2706-N31. We acknowledge NVIDIA for providing GPU computing resources. The authors would like to thank their colleague Alexander Fuchs for feedback and fruitful discussions.