title: Dealing with Distribution Mismatch in Semi-supervised Deep Learning for Covid-19 Detection Using Chest X-ray Images: A Novel Approach Using Feature Densities
authors: Calderon-Ramirez, Saul; Yang, Shengxiang; Elizondo, David; Moemeni, Armaghan
date: 2021-08-17

In the context of the global coronavirus pandemic, different deep learning solutions for infected subject detection using chest X-ray images have been proposed. However, deep learning models usually need large labelled datasets to be effective. Semi-supervised deep learning is an attractive alternative, where unlabelled data is leveraged to improve the overall model's accuracy. However, in real-world usage settings, an unlabelled dataset might present a different distribution than the labelled dataset (i.e. the labelled dataset was sampled from a target clinic and the unlabelled dataset from a source clinic). This results in a distribution mismatch between the unlabelled and labelled datasets. In this work, we assess the impact of the distribution mismatch between the labelled and the unlabelled datasets for a semi-supervised model trained with chest X-ray images for COVID-19 detection. Under strong distribution mismatch conditions, we found an accuracy hit of almost 30%, suggesting that the unlabelled dataset distribution has a strong influence on the behaviour of the model. Therefore, we propose a straightforward approach to diminish the impact of such a distribution mismatch. Our proposed method uses a density approximation of the feature space. It is built upon the target dataset to filter out the observations in the source unlabelled dataset that might harm the accuracy of the semi-supervised model. It assumes that a small labelled target dataset is available together with a larger source unlabelled dataset. Our proposed method does not require any model training; it is simple and computationally cheap. We compare our proposed method against two popular state of the art out-of-distribution data detectors, which are also cheap and simple to implement. In our tests, our method yielded accuracy gains of up to 32% when compared to the previous state of the art methods.

The COVID-19 disease is caused by the novel SARS-CoV2 coronavirus, discovered in 2019 [57]. The COVID-19 pandemic has caused thousands of human losses around the world, where even the most developed health systems have not been able to cope with the infection peaks [57]. Health practitioners are struggling with the detection and tracking of infected subjects, as the number of patients in need of medical assistance increases. Therefore, accurately detecting patients infected with the SARS-CoV2 virus is a critical task to control the pandemic. Nevertheless, SARS-CoV2 detection methods like the Real-time Reverse Transcription Polymerase Chain Reaction (RT-PCR) test can be expensive and time-consuming. As an alternative and/or complementary method, medical imaging based approaches can be less expensive and also accurate [17], [20]. Moreover, X-ray based imaging diagnosis can be considered cheaper, and X-ray machines are more widespread than other imaging technologies like computed tomography. This is especially the case in less industrialised countries [3].
However, a limitation of X-ray based diagnosis of COVID-19 is the need for highly trained clinical practitioners such as radiologists, who are scarce in less industrialised countries [3]. The implementation of Computer Aided Diagnosis (CAD) systems for COVID-19 diagnosis can be a solution to mitigate the specialised staff shortage. Deep learning based CAD systems have been extensively explored for different medical imaging applications [7], [16], [1], [11], [66], [15]. More specifically, several deep learning architectures for COVID-19 detection have been proposed recently in the literature [31], [32], [6]. These systems have been developed using publicly available X-ray image datasets with COVID-19 positive [21] and negative cases [9]. Nevertheless, a shortcoming of implementing a deep learning architecture for real-world usage is the need for a large labelled dataset from the specific target clinic or hospital where the system is intended to be used. Labelling images in the medical domain is time-consuming and requires expensive human effort from highly trained clinical practitioners, which makes building an extensive labelled dataset costly. Previous work on COVID-19 detection with deep learning has relied on large and heterogeneous datasets, where around 100-400 COVID-19 positive cases were sampled from the dataset in [21], and larger sets of COVID-19 negative cases were sampled from different sources [36], [30], [22]. Such testing conditions can be considered far from a real-world scenario, where usually only a limited set of labelled observations is available in the target clinic/hospital. Using external datasets for training might harm the overall performance of the model, mainly due to differences in patient features and imaging protocols, which affect the data distribution of the test and training data [58]. Another shortcoming of the aforementioned previous work is the population bias between the positive and negative COVID-19 samples. For example, as reported in [50], negative COVID-19 observations in [36] were sampled from paediatric Chinese patients, while positive COVID-19 cases in [21] correspond to adult patients from different countries. This dataset combination has been extensively used for training Convolutional Neural Network (CNN) based models to detect COVID-19, and leads to a deceptive bias in both the test and training data [50]. To deal with limited labelled datasets, different approaches have been proposed in the literature [19]; in the context of COVID-19 detection, data augmentation and transfer learning [43], [25] have been used. In transfer learning, a source labelled dataset D_l^s is used to pre-train a model, which is then fine-tuned on the target dataset D_l^t. However, as discussed in [69], fine-tuning might not be enough to improve the model's accuracy. The distribution mismatch between D_l^s and D_l^t, due to different patient populations and imaging acquisition protocols, is frequently a reason for poor transfer learning performance. Another approach to deal with scarce labelled data is the usage of Semi-supervised Deep Learning (SSDL). SSDL leverages cheaper and more widely available unlabelled data. Semi-supervised learning for COVID-19 detection has been explored in [9], [10] with positive results, using very small labelled datasets. The authors combined SSDL with common data augmentation and transfer learning approaches.
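As an illustration of the transfer learning baseline mentioned above, the following minimal sketch (an assumed setup for illustration, not the authors' exact configuration) reuses an ImageNet pre-trained backbone and fine-tunes only a new classification head on a small labelled target dataset D_l^t.

```python
# Minimal transfer-learning sketch (assumed setup, not the paper's exact configuration):
# reuse an ImageNet pre-trained DenseNet and train only a new 2-class head on D_l^t.
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes: int = 2, freeze_backbone: bool = True) -> nn.Module:
    model = models.densenet121(pretrained=True)
    if freeze_backbone:
        # Keep the source-domain feature extractor fixed; only the new head is trained.
        for p in model.features.parameters():
            p.requires_grad = False
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

model = build_finetune_model()
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```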
However, to implement deep learning based solutions for extensive real-world usage, testing different model attributes such as robustness and predictive uncertainty is crucial for safe deployment. A thorough review on the importance of measuring model attributes such as robustness in medical applications of Artificial Intelligence (AI) can be found in [47]. In a real-world scenario, the use of unlabelled data sampled from different sources (hospitals or clinics) can be considered. However, the usage of unlabelled datasets with distributions different from the labelled test and training target data might harm the accuracy of the model. This leads to the need for analyzing model robustness to different data distributions in the unlabelled dataset. Therefore, in this work, we study the impact of different unlabelled data sources on an SSDL model. Specifically, the MixMatch algorithm, which previously yielded interesting accuracy gains with very small labelled datasets for COVID-19 detection using X-ray images [10], [9], is used. Moreover, we propose a simple approach to select and build an unlabelled dataset, aiming to improve the overall SSDL model accuracy. Therefore, in this work, we evaluate a setting where the following datasets are available:
1) A labelled dataset D_l^t in the target clinic/hospital. The number of labelled observations n_l^t is very small. The target dataset is sampled from the clinic/hospital where the model is intended to be deployed.
2) A larger unlabelled dataset D_u^s from a different source clinic/hospital, with n_u^s > n_l^t.
Different deep learning applications in medical imaging face distribution mismatch situations between the different datasets used. This might be the case for SSDL when using different unlabelled data sources. We argue that quantifying distribution mismatch with respect to the model behaviour is important for medical imaging applications, as different unlabelled data sources might be considered. Moreover, simple dataset transformation procedures to improve model robustness to the data distribution mismatch between the labelled and unlabelled datasets are also important. This helps to narrow the gap between machine learning research and its real-world usage.
A. Semi-supervised Deep Learning
SSDL aims to deal with small labelled datasets by leveraging unlabelled data. Supervised deep learning networks often require large labelled datasets. This is partially addressed with the usage of data augmentation and transfer learning [62]. However, the usage of cheaper and more widely available unlabelled data can further lower the need for labelled data. With a formal notation, in SSDL both labelled and unlabelled datasets are used. Each labelled observation in X_l = {x_1, ..., x_{n_l}} is mapped to a label in the set Y_l = {y_1, ..., y_{n_l}}. The unlabelled dataset corresponds to a set of observations X_u = {x_1, ..., x_{n_u}}, with S_u = X_u. SSDL architectures can be classified as pre-training based [23], pseudo-label based [24] and regularization based. Within regularization based approaches, consistency based, graph based and generative based [19] regularization techniques can be distinguished. A detailed survey on SSDL can be found in [63], [37]. Concerning regularization based SSDL, a regularization term leveraging the unlabelled data S_u is added to the loss function:
L(w) = L_l(S_l, w) + γ L_u(S_u, w),
with w the model's weight array, and L_l and L_u the labelled and unlabelled loss terms, respectively. The coefficient γ weighs the influence of the unsupervised regularization term.
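A minimal sketch of this regularised objective is given below; the concrete loss terms are illustrative placeholders (cross-entropy for L_l and a consistency term for L_u), not the exact terms of any specific SSDL method.

```python
# Sketch of the SSDL objective L(w) = L_l + gamma * L_u; the specific terms below are
# illustrative placeholders, not a particular published method.
import torch
import torch.nn.functional as F

def ssdl_loss(model, x_l, y_l, x_u, pseudo_y_u, gamma: float = 1.0) -> torch.Tensor:
    loss_l = F.cross_entropy(model(x_l), y_l)        # labelled term L_l
    probs_u = torch.softmax(model(x_u), dim=1)
    loss_u = F.mse_loss(probs_u, pseudo_y_u)         # unlabelled consistency term L_u
    return loss_l + gamma * loss_u                   # gamma weighs the unsupervised term
```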
As previously mentioned, a number of regularization based variations can be found in the literature. The main ones include consistency loss based [59], [58], graph based [65], [42] and generative augmentation based [55], [52] methods. Consistency based methods make the clustered-data/low-density separation assumption. This assumption refers to observations of the same class being clustered together, so that the decision manifold lies in very sparse regions [63]. A violation of this assumption might degrade the performance of the semi-supervised method [63]. In pseudo-label training, pseudo-labels are estimated for the unlabelled data and are used for later model refinement. A straightforward pseudo-label based approach is based on co-training two models [4]. The model is pre-trained with the limited-size labelled dataset. Later, the pseudo-labels are estimated for the unlabelled data using two models trained with different views (features) of the data, and a voting scheme is implemented for estimating the pseudo-labels. MixMatch [8] combines both pseudo-label and consistency based SSDL, along with heavy data augmentation using the MixUp algorithm [67]. According to [8], MixMatch outperforms, accuracy-wise, previous SSDL approaches. Given the recent state of the art performance demonstrated by MixMatch and the good results yielded in [9], [10] for medical imaging applications, we chose it for the solution developed in this work. A detailed description of MixMatch can be found in Section III.
The distribution mismatch between S_u and S_l is also referred to as a violation of the independent and identically distributed (IID) assumption. It might have different degrees and causes, which are listed as follows [34]:
• Prior probability shift: The distribution of the labels in S_l can be different when compared to S_u. In a CAD system, this can be exemplified when the labels of the medical images have different distributions between the two datasets S_l and S_u. A specific case would be the label imbalance of the labelled dataset S_l, as discussed in [10].
• Covariate shift: A different distribution of the features in the input observations might be sampled, leading to a distribution mismatch. In a medical imaging application, this can be related to a difference in the frequencies of the observed features between S_l and S_u.
• Concept drift: It refers to different features being observed in samples with the same label. In the application at hand, this might happen when patients with different variations of the COVID-19 disease are sampled to build S_u with the same pathologies (classes) as in S_l.
• Concept shift: It is associated with a shift in the labels for the same features. In the aforementioned example, it would refer to labelling a medical image with similar features with a different pathology (a bias caused by the image labellers).
In our tested setting, different data sources were used only to gather the unlabelled data S_u. We recreate two of the aforementioned distribution mismatch causes: covariate and prior probability shift. The unlabelled datasets created and tested contain normal (no pathology) chest X-ray images (COVID-19−) from patients of different nationalities. As the labelled dataset S_l includes both classes (COVID-19+ and COVID-19−), a label distribution mismatch also occurs.
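A small illustrative check of the prior probability shift just described (the helper and label names below are hypothetical): the labelled set contains both classes, while the unlabelled source set contains only COVID-19− images, so their label priors differ.

```python
# Hypothetical illustration of a prior probability shift: compare class priors of S_l and S_u.
from collections import Counter

def class_priors(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

s_l_labels = ["covid_pos"] * 10 + ["covid_neg"] * 10   # balanced labelled target set
s_u_labels = ["covid_neg"] * 90                        # unlabelled source set, negatives only
print(class_priors(s_l_labels))  # {'covid_pos': 0.5, 'covid_neg': 0.5}
print(class_priors(s_u_labels))  # {'covid_neg': 1.0} -> label priors do not match
```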
The tested setting in this work simulates the case where different unlabelled data sources might be available (for instance from different hospitals) at the beginning of a pandemic, while a small labelled dataset is available in the target hospital/clinic. The usage of different unlabelled datasets might potentially cause a violation of the aforementioned clustered-data/low-density separation assumption. Using unlabelled datasets with distributions different from the labelled dataset might create spurious sparse regions and/or less clustered groups of observations belonging to the same class. Therefore, in this work we explore data-oriented approaches to deal with potential violations of the clustered-data/low-density separation assumption. Unlabelled data can be considered significantly cheaper than labelled data. Thus, discarding potentially harmful observations with the aim of decreasing the odds of violating the clustered-data/low-density separation assumption is viable and worth exploring. In [48], an extensive evaluation of different distribution mismatch settings and their impact on SSDL is developed. The authors concluded that distribution mismatch in SSDL is an important challenge to be addressed. Recently, different approaches for improving SSDL robustness to the distribution mismatch between S_u and S_l have been proposed. In [46], an Out of Distribution (OOD) masking method is proposed. It consists of weighting the observations likely to be OOD during semi-supervised training. The output of a softmax activation function applied to the raw model output was used as the OOD masking coefficient. This works as an observation-wise weighting during semi-supervised model training. The authors compared their proposed method with state of the art general-purpose SSDL approaches like MixMatch [8]. The test bed consisted of different unlabelled datasets with a varying degree of distribution mismatch. The contamination source consisted of images with different labels and features (completely OOD). Their method improved model robustness against OOD data contamination in S_u, using general purpose datasets such as CIFAR-10 and SVHN. However, other types of distribution mismatch corruption, such as concept drift or covariate shift, were not tested. Another approach to deal with distribution mismatch under OOD contamination (different labels and features) can be found in [18]. The proposed method also implements a weighting coefficient, calculated as the softmax output of a model ensemble. In a similar trend, the work in [26] proposes a weighted approach to deal with OOD observations (with different labels and different features). However, instead of using the softmax output, the observation-wise weight is estimated through an optimization step. Similar to [46], only general purpose datasets (CIFAR-10 and MNIST) were used, with no other variations of distribution mismatch settings. A similar approach and test bed to [26] can be found in [68], where an optimization based approach to weight each observation is implemented, with a test bed focused on OOD contaminated unlabelled datasets. In this work, we analyze the effect of distribution mismatch in SSDL within a real-world application: COVID-19 detection using chest X-ray images. Unlike previous work on SSDL under distribution mismatch, we test a real-world setting in the medical domain and explore its implications within such a context.
As previously mentioned, we analyze the impact of a distribution mismatch caused by covariate and prior probability shift. Different unlabelled data sources within the same domain and feature space are used. We aim to evaluate different approaches to weigh how harmful an unlabelled observation could be for SSDL training, and we test different OOD detection approaches for this purpose. After calculating a harm coefficient for each unlabelled observation, different steps can be implemented to use the unlabelled dataset: for example, filtering out the observations with high harm coefficients, selecting an unlabelled dataset based on its estimated benefit for SSDL, or weighing each unlabelled observation during SSDL training. Moreover, we focus on a data-oriented approach to identify and/or build a good unlabelled dataset for SSDL. We propose a simple and very inexpensive method to evaluate the distribution mismatch between an unlabelled and a labelled dataset, S_u and S_l respectively. Such a method can be thought of as an OOD scoring approach (harm coefficient), which leads us to compare our method to recent OOD detectors used in the context of OOD data filtering to improve the accuracy of an SSDL model. OOD data detection refers to the general problem of detecting observations that are very unlikely given a specific data distribution (usually the training dataset distribution) [28]. The problem of OOD data detection can be thought of as a generalization of the outlier detection problem, as it considers individual and collective outliers [54]. Specific scenarios of OOD data detection can be found in the literature. These include novel data and anomaly detection [49], with several applications like rare event detection [27], [2]. In the classical pattern recognition literature, different approaches to anomaly and OOD data detection are grounded in concepts such as density estimation [44], kernel representations [60], prototyping [44] and robust moment estimation [51]. The recent success of deep learning based approaches for image analysis [64] has motivated the development of OOD detection techniques for deep neural networks. OOD detection methods for deep learning architectures can be categorized into methods based on the Deep Neural Network's (DNN) output, its input, or its learned feature space. DNN output based methods include the softmax based OOD detector proposed in [29]. In such work, OOD detection is framed as a confidence estimation, taking the model's raw output layer values and passing them through a softmax function; the maximum softmax value is used as the confidence. The authors claim that the maximum softmax value of OOD observations differs meaningfully from that of in-distribution observations. However, as reported in [40], non-calibrated models can be overconfident with OOD data. Therefore, in [40] a calibration methodology is introduced, implementing a temperature coefficient. OOD data detection in neural networks is implemented in [40] using input perturbations meant to maximize the softmax based separability. To this end, a gradient descent optimization is used, resulting in a pre-processed image. A temperature coefficient is added to the softmax calculation and is tuned so that the true positive rate for in-distribution data detection reaches 95%, using the previously pre-processed images.
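A minimal sketch of these output-based scores follows: the maximum softmax probability baseline of [29], with an optional temperature coefficient in the spirit of [40]; the input perturbation step of [40] is omitted here for brevity.

```python
# Max-softmax confidence score [29], optionally temperature-scaled as in [40];
# a low score flags a likely OOD observation (input perturbation step omitted).
import torch

@torch.no_grad()
def max_softmax_score(model, x: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    logits = model(x)
    probs = torch.softmax(logits / temperature, dim=1)
    return probs.max(dim=1).values   # higher = more confident / more in-distribution
```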
Another approach for OOD detection based on the model's output is the usage of Monte Carlo Dropout (MCD) based uncertainty estimations. MCD is a popular method for implementing predictive uncertainty estimation [41], [35]. It consists of analyzing the distribution of N predictions obtained for the same input while adding noise to the model (dropout in the context of DNNs). This idea has been ported to the OOD detection problem, where observations with high uncertainty are scored with a high OOD likelihood [33], [53]. Regarding feature space based methods for OOD detection (the feature space being a latent space approximation in DNNs), different approaches can be found in the literature. For example, in [39], the authors computed the Mahalanobis distance in the latent space between the training dataset and the input observation, assuming a Gaussian distribution of the data. Both the mean and covariance are estimated for the in-distribution dataset. For a new observation x, the OOD score is estimated as the Mahalanobis distance to this distribution. The authors also implemented the calibration approach used in [40]. A superior performance of their proposed method in generic OOD detection benchmarks is reported when compared to the methods in [40], [29]. However, no statistical significance tests of the results were performed. Another feature space based approach can be found in [61], known as deterministic uncertainty quantification. Such an approach is primarily intended for uncertainty estimation, but is also tested as an OOD detection technique. It makes use of a centroid calculation for each category in the feature space, to later quantify the distance of a new observation to each centroid. Uncertainty is estimated based on the kernel based distance to the category centroids. The approach is compared against an ensemble of deep neural networks (an output based approach for OOD detection). This is done in a simple OOD detection benchmark, where CIFAR-10 is used as the in-distribution dataset and SVHN as the OOD dataset. The authors reported the area under the Receiver Operating Characteristic (ROC) curve of their approach against other OOD methods, and their approach showed the highest area under the ROC curve. However, no statistical analysis of the results was performed. In [13], the authors developed an extensive evaluation of the influence of the distribution mismatch between unlabelled and labelled datasets. Moreover, they also developed an approach to estimate the accuracy hit of such a distribution mismatch for a state of the art SSDL method. The proposed method estimates the distribution mismatch in the feature space between S_l and S_u, using what the authors referred to as a Deep Dataset Dissimilarity Measure (DeDiM). Euclidean and Manhattan based DeDiMs were tested and compared against density based DeDiMs. All of them were applied within a feature space built with an ImageNet pre-trained network. The authors found a significant advantage of the density based distances. In [70], the authors proposed an OOD detector using the feature space as well. The approach fits different parametric distributions in the feature space of the data. The decision to discriminate between OOD and in-distribution data is made based on the estimated parametric model. Unfortunately, no comparison with other popular OOD methods was presented.
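A minimal sketch of the MCD-style scoring discussed above, assuming the model contains dropout layers: N stochastic forward passes are averaged and the predictive entropy is used as the OOD score (keeping the model in train mode also affects batch normalization statistics, which a careful implementation would handle separately).

```python
# Monte Carlo Dropout sketch: keep dropout active, average N softmax outputs,
# and use predictive entropy as an OOD / uncertainty score.
import torch

def mcd_ood_score(model, x: torch.Tensor, n_samples: int = 20) -> torch.Tensor:
    model.train()  # enable dropout at inference time (simplification: also affects batch norm)
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)
    entropy = -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=1)
    return entropy  # higher entropy -> more uncertain -> more likely OOD
```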
1) Unsupervised Domain Adaptation:
When using an unlabelled dataset S_u with a very different distribution to S_l, a solution is to correct or align a feature extractor, trained with labelled or unlabelled data from the source of the unlabelled dataset S_u, to the distribution of the labelled dataset S_l (the target dataset, usually smaller). This is known as Unsupervised Domain Adaptation (UDA). For instance, in [69] the authors proposed a UDA method to align the feature extractor from a source dataset to a specific target dataset, within the context of COVID-19 detection using chest X-ray images. The feature extractor was originally trained with source data. Later, the feature extractor is aligned by using both labelled and unlabelled data from the target dataset. The feature extractor alignment procedure basically consists of an adversarial training step using the aforementioned datasets. A disadvantage of such a method is that the feature extractor needs to be trained with labelled source data (as usual in supervised learning), hence a large number of labels is needed. Also, the feature extractor alignment process can be considered expensive, as an adversarial loss function needs to be optimized.
In this work, we explore the usage of MixMatch as the SSDL method, and therefore describe it as follows; for more details please refer to [8]. As previously mentioned, MixMatch combines both pseudo-label and consistency regularization based SSDL. In such a context, a pseudo-label y_j is estimated for each unlabelled observation x_j in X_u. It corresponds to the mean model output over K differently transformed versions of the input x_j, using transformations such as flips and rotations [8]. Each pseudo-label y_j is sharpened using a temperature parameter T [8]. Also, a simple data augmentation approach is implemented by linearly combining unlabelled and labelled observations through the MixUp algorithm [67]. The pseudo-labels are used in the MixMatch loss function, which combines supervised and unsupervised loss terms. In this work, the well-known cross-entropy function is used as the supervised loss term. As for the unsupervised loss term, we used the Euclidean distance loss previously implemented in [8]. The Euclidean distance measures the distance between the current model output and the pseudo-label for the unlabelled observations. This loss term is weighed by the unsupervised learning coefficient γ. In this work, we used the MixMatch hyperparameters recommended in [8], K = 2 and T = 0.25. As for the unsupervised coefficient, a value of γ = 200 is used, given our empirical test results.
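A minimal sketch of the pseudo-label guessing and sharpening step just summarised; here `augment` is a placeholder for the flip/rotation transformations, and the MixUp step [67] and the full training loop are omitted.

```python
# Sketch of MixMatch's pseudo-label step [8]: average predictions over K augmentations
# of an unlabelled batch and sharpen the result with temperature T.
import torch

def guess_and_sharpen(model, augment, x_u: torch.Tensor, k: int = 2, t: float = 0.25) -> torch.Tensor:
    with torch.no_grad():
        preds = torch.stack([torch.softmax(model(augment(x_u)), dim=1) for _ in range(k)])
    mean_pred = preds.mean(dim=0)               # average over the K augmented views
    sharpened = mean_pred ** (1.0 / t)          # temperature sharpening
    return sharpened / sharpened.sum(dim=1, keepdim=True)
```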
Interesting results were yielded in [13], [12], where the authors found a strong correlation between feature-density based distances and MixMatch's accuracy. Based upon this, we propose to estimate how harmful an individual unlabelled observation might be for MixMatch's accuracy. We refer to this operator as the SSDL harm coefficient H(x_j^u), where x_j^u ∈ S_u. We aim to implement a simple and computationally inexpensive method to filter OOD data in the unlabelled dataset. This is done in order to decrease the distribution mismatch between S_u and S_l. As mentioned in Section II, using different unlabelled data sources might increase the chance of violating the clustered-data/low-density separation assumption. This is particularly the case given the potential distribution mismatch between the labelled and unlabelled datasets. Therefore, our proposed method aims to discard harmful observations that might create spurious low-density regions in the learned manifold and/or sparser sample clusters for each category. In a real-world scenario for OOD filtering, DNNs are fed with high resolution images, frequently from the same domain (chest X-ray images in our case). This contrasts with the usual settings of the methods discussed in Section II. As previously discussed, benchmarking in the literature has usually been performed with low-resolution images and with relatively easy OOD detection challenges (e.g. distinguishing between CIFAR-10 and MNIST images). We aim to further test real-world distribution mismatch conditions in a medical image analysis application, namely COVID-19 detection using chest X-ray images.
In this work, we propose to use the feature density of the labelled dataset S_l to weigh how harmful it could be to include an unlabelled observation x_j^u in the unlabelled dataset S_u. This is done within the context of training a model with the SSDL algorithm known as MixMatch. This harm coefficient is represented as H(x_j^u). We test two different variations to estimate H(x_j^u). The first one consists of a non-parametric estimation of the feature density through a histogram calculation. The second variation assumes a Gaussian distribution of the feature space, using a Mahalanobis distance. We use a generic feature space built from a pre-trained ImageNet model to keep the computational cost of the proposed method low. For all the tested configurations, we only use the features of the final convolutional layer. Computational resource restrictions for solving a real-world problem in medical imaging make it very expensive to use all the features extracted at the different layers, as done in [39]. The procedure to calculate the harm coefficient with both methods is as follows:
1) For each input observation x_j^l ∈ S_l, with x_j^l ∈ R^n and n the input space dimensionality, we calculate its feature vector h_j^l = f(x_j^l) using the feature extractor f.
2) The feature vector h_j^l ∈ R^{n'} has dimension n', with n' < n. For instance, a feature extractor f using the ImageNet pre-trained Wide-ResNet architecture yields n' = 512 features. For architectures such as densenet, which yield larger feature arrays at the final convolutional layer, we sub-sample the features to n' = 1024 using an average pooling operation. This yields the feature set H_l.
3) For the Feature Histograms (FH) method, we perform the following steps:
a) For each dimension r = 1, ..., n' of the feature space, we compute its normalized histogram over the sample H_l to approximate the density function p_r^l. This yields the set of approximated feature density functions P_l = {p_1^l, ..., p_{n'}^l}.
b) Using the approximated feature densities in P_l, we estimate the SSDL harm coefficient H(x_j^u) for an unlabelled observation x_j^u in the following steps.
c) Calculate the feature vector of each unlabelled observation as h_j^u = f(x_j^u), with h_j^u ∈ R^{n'}.
d) The total likelihood under the density approximation set P_l assumes that each dimension is statistically independent; thus the likelihood is ∏_{r=1}^{n'} p_r^l(h_{j,r}^u).
e) To avoid numerical under-flow, we calculate the negative logarithm of the likelihood and use it as the harm coefficient: H(x_j^u) = −∑_{r=1}^{n'} log p_r^l(h_{j,r}^u).
4) For the Mahalanobis based filtering, we perform the following steps:
a) Calculate the covariance matrix Σ and the sample mean h̄^l from the feature set H_l.
b) Calculate the feature vector of each unlabelled observation as h_j^u = f(x_j^u).
c) Compute the harm coefficient as the Mahalanobis distance: H(x_j^u) = sqrt((h_j^u − h̄^l)^T Σ^{−1} (h_j^u − h̄^l)).
The harm coefficient H(x_j^u) can be used to discard the observations with the highest values, or to weigh them in case an online semi-supervised per-observation weighting is implemented.
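A minimal numpy sketch of both harm coefficient variants described above, assuming the feature matrix H_l of the labelled dataset and the feature vector h_u of an unlabelled observation have already been extracted with a pre-trained network; the bin count and the covariance regularisation term are illustrative choices, not values taken from the paper.

```python
# Sketch of the two proposed harm coefficients: per-dimension feature histograms (FH)
# and the Mahalanobis distance to a Gaussian fit of the labelled feature set H_l.
import numpy as np

def fit_feature_histograms(h_l: np.ndarray, n_bins: int = 15):
    """h_l: (n_l, n') labelled feature matrix -> list of (density, bin_edges) per dimension."""
    return [np.histogram(h_l[:, r], bins=n_bins, density=True) for r in range(h_l.shape[1])]

def fh_harm(h_u: np.ndarray, hists, eps: float = 1e-8) -> float:
    """Negative log-likelihood of h_u under the per-dimension histograms (independence assumed)."""
    score = 0.0
    for r, (density, edges) in enumerate(hists):
        idx = int(np.clip(np.searchsorted(edges, h_u[r]) - 1, 0, len(density) - 1))
        score -= np.log(density[idx] + eps)
    return score

def fit_gaussian(h_l: np.ndarray):
    """Sample mean and (regularised) inverse covariance of the labelled feature set."""
    mean = h_l.mean(axis=0)
    cov = np.cov(h_l, rowvar=False) + 1e-6 * np.eye(h_l.shape[1])  # ridge for stability
    return mean, np.linalg.inv(cov)

def mahalanobis_harm(h_u: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    diff = h_u - mean
    return float(np.sqrt(diff @ cov_inv @ diff))
```

Unlabelled observations can then be ranked by either score, and those with the highest harm coefficients discarded before SSDL training.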
In this work, we test the impact of the distribution mismatch between the labelled target and unlabelled source datasets, D_l^t and D_u^s respectively, on the accuracy of the SSDL MixMatch algorithm. Later, we test the impact of the proposed feature based harm coefficient used to eliminate potentially harmful observations from the unlabelled dataset. This was done to assess the accuracy of the model using the filtered unlabelled dataset D_u^s. This way, we can assess in a controlled setting the impact of the distribution rectification procedure, implemented through a data filtering process. In this work, we explore the sensitivity to the distribution mismatch between S_u and S_l of an SSDL COVID-19 detection system using chest X-ray images. Therefore, we use different data sources of chest X-ray images for both COVID-19+ (positive COVID-19) and COVID-19− (no pathology) observations. For COVID-19+ cases we use the open dataset made available by Dr. Cohen in [21]. This dataset was composed of 105 COVID-19+ images at the time of writing this work. The observations were sampled from different journal websites, like the Italian Society of Medical and Interventional Radiology and radiopaedia.org, and recent publications in the field. We used only COVID-19+ observations, discarding images related to Middle East Respiratory Syndrome (MERS), Acute Respiratory Distress Syndrome (ARDS) and Severe Acute Respiratory Syndrome (SARS). The images present varying resolutions, from 400×400 up to 2500×2500 pixels. As for COVID-19− observations, we used four different data sources. Table I summarizes the COVID-19− data sources, and Figure 1 shows observations from each of the data sources used in this work. The datasets were randomly augmented with flips and rotations. No random crops were used, to avoid discarding important regions of the images.
In this first set of experiments, we evaluate the impact of OOD data with different unlabelled data sources and different degrees of contamination. We simulate the following scenario: a small labelled target dataset D_l^t (with n_l = 20 and n_l = 40 observations) is provided, with a partition of the COVID-19+ observations taken from Dr. Cohen's dataset and the COVID-19− cases taken from the Indiana chest X-ray dataset, described in Table I. A larger set of 142 unlabelled observations is also available, to be used in the harm coefficient estimation methods. This can be thought of as the target dataset with limited labels which, in a real-world application, is accessible from the clinic/hospital where the model is intended to be deployed. For the unlabelled dataset, we use different partitions of COVID-19− cases from the chest X-ray data sources described in Table I. This simulates the usage of different sources of unlabelled datasets D_u^s, taken from different hospitals/clinics. All the unlabelled observations are COVID-19−, to enforce a prior probability shift (label imbalance).
Fig. 1: Row 1, column 1: a COVID-19+ observation from [21]; row 1, column 2: a COVID-19− observation from the Chinese dataset [36]; row 2, column 1: a ChestX-ray8 COVID-19− image [30]; row 2, column 2: an Indiana dataset COVID-19− sample image [22]. The bottom image corresponds to a sample image from the Costa Rica dataset [10]. As can be seen, images from the Costa Rica dataset include a black frame.
As the worst performing unlabelled dataset D_u^s in our preliminary tests was the Costa Rican dataset described in Table I, we used it to create different combinations with the rest of the datasets. All of these are depicted in Table IV. A total of n_u = 90 unlabelled observations was picked from such datasets with different combinations. Using different data sources for the unlabelled dataset can help to assess the impact of a distribution mismatch between S_u and S_l. As for the test dataset, it consists of another partition of the target dataset, which includes the COVID-19+ dataset along with another partition of the Indiana chest X-ray dataset (COVID-19−), both of the same size. This yields a completely balanced test setting. We used a total of n_t = 62 observations, drawn from the same target dataset (31 observations per class). The test data comes from the distribution of the labelled data, with no contamination. This simulates the case where the labelled data comes from the target dataset distribution. Both unlabelled and labelled datasets were standardised, given that the authors in [14] found that normalisation is important in semi-supervised learning.
Test-bed 1 (TB-1) is designed to assess the effect on MixMatch's accuracy of using different unlabelled datasets D_u^s with a target labelled dataset D_l^t. This test-bed recreates different distribution mismatch conditions between D_u^s and D_l^t. The Costa Rican dataset acts as a source of OOD data, as it yielded the lowest accuracy when used as D_u^s for MixMatch among the empirically tested unlabelled data sources. We combine the aforementioned data sources with the Costa Rican dataset, which helps enforce different distribution mismatch settings. In Test-bed 1.1 (TB-1.1), the first sub-experiment defined within TB-1, we measure MixMatch's accuracy using a densenet model, with and without feature extractor fine-tuning. We aim to measure whether there is a significant accuracy gain from fine-tuning the feature extractor during training. Table II shows the results of MixMatch training without feature extractor fine-tuning, while Table III shows the results with it. Additionally, we devised Test-bed 1.2 (TB-1.2), where the baseline MixMatch accuracy results in Tables II and IV are correlated with the cosine DeDiMs between each D_u^s and D_l^t. This is measured as proposed in [14], and represented as d_C(D_u^s, D_l^t). For this experiment, we used an alexnet model's feature extractor, given its low computational cost. We implemented the cosine dataset DeDiM with a batch size of n_b = 40 and 10 batches of random samples. The same batches were used to test the different configurations. Similar to the proposed harm coefficient estimation methods, we used a generic ImageNet pre-trained feature extractor to build the feature representations, as proposed in [14]. The DeDiM results are correlated with MixMatch's accuracy using a Pearson coefficient in Table VI.
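For illustration only, the following rough sketch approximates a cosine DeDiM as the average cosine distance between the mean feature vectors of randomly sampled batches from two datasets; this is a simplification made for the example, and the exact definition used in the experiments is the one given in [14].

```python
# Rough illustrative sketch (not the exact DeDiM definition of [14]): average cosine
# distance between mean feature vectors of random batches drawn from two datasets.
import numpy as np

def cosine_dedim(feats_a: np.ndarray, feats_b: np.ndarray,
                 n_batches: int = 10, batch_size: int = 40, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_batches):
        mu_a = feats_a[rng.choice(len(feats_a), batch_size, replace=False)].mean(axis=0)
        mu_b = feats_b[rng.choice(len(feats_b), batch_size, replace=False)].mean(axis=0)
        cos = mu_a @ mu_b / (np.linalg.norm(mu_a) * np.linalg.norm(mu_b) + 1e-12)
        dists.append(1.0 - cos)
    return float(np.mean(dists))
```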
Finally, Test-bed 2 (TB-2) aims to assess MixMatch's accuracy when implementing the methods proposed in this work to filter OOD observations, against two popular output based OOD filtering methods: the MCD and Softmax based OOD filters. In this test bed, we measure MixMatch's accuracy across the four different filtered datasets, testing both alexnet and densenet models. We also tested the models with n_l = 20 and n_l = 40 labels. The results using the proposed feature histograms and Mahalanobis distance for each generated unlabelled data source D_u^s are depicted in Tables VIII and X, for the alexnet and densenet models, respectively. To filter possible OOD observations, we eliminated the same percentage of observations as the contamination level introduced with the Costa Rican dataset (i.e. if the Chinese dataset was contaminated with 35% of observations from the Costa Rican dataset, we eliminated the 35% of observations with the highest harm coefficient, and so on). We leave the problem of defining the right harm coefficient threshold out of this study. In all test beds, the MixMatch algorithm is tested with densenet and alexnet models, using the parameters recommended in [8], along with an unsupervised regularization term coefficient of 200. As for model training, we use the one-cycle policy implemented in the FastAI library, with a weight decay of 0.001. This way we can measure MixMatch's behaviour with models of different depth and architecture. For each configuration, we trained the model for 10 runs of 50 epochs, using a different random data partition for training and test in each run.
As for the results of TB-1.1, depicted in Table II, we can see a very strong influence of the unlabelled data source D_u^s on the accuracy of the SSDL MixMatch algorithm. Training the model with the Indiana dataset, which also includes COVID-19+ observations, yields the highest accuracy of around 0.89, higher than the supervised model. Using ChestX-ray8 as D_u^s yields an accuracy of 0.825, followed, accuracy-wise, by the usage of the Chinese dataset as D_u^s. Using the Costa Rican dataset as D_u^s yields the lowest accuracy, close to 0.493. Contaminating the ChestX-ray8, Chinese and Indiana datasets (with only COVID-19− observations) with the Costa Rican dataset, the data source with the highest distance to D_l^t, also yields very low MixMatch accuracy. This behaviour is summarized in the Pearson coefficients depicted in Table VI, with a very high linear correlation of around 78% for the tested variations. The correlation between the semi-supervised densenet model's behaviour and the dataset distances remains high even though the distances are computed with a generic ImageNet pre-trained alexnet model. This suggests that the usage of the feature density can bring useful information to preserve or discard an unlabelled observation in D_u^s.
Regarding the results of TB-2, Tables X and VIII show the accuracy yielded by MixMatch when filtering the unlabelled datasets with the proposed FH and Mahalanobis methods, for both tested models (alexnet and densenet, respectively). For both proposed methods, we can see how filtering potentially harmful observations from the unlabelled dataset increases MixMatch's accuracy significantly, when compared to the baseline accuracies in Tables IV and II, for both tested models.
For instance, when using the densenet model with n_l = 40, filtering harmful observations with the Mahalanobis method increases the accuracy obtained with the ChestX-ray8 dataset contaminated at 35% and 65% with the Costa Rica dataset from 0.579 to 0.78 and from 0.5 to 0.79, respectively. This can be seen in Tables II and X. The FH method also yields an important accuracy gain, although in this case it is lower than the gains obtained with the Mahalanobis method. The accuracy of the model trained with D_u^s using the ChestX-ray8 dataset with no contamination is almost restored, as MixMatch originally yielded 0.825. We have to consider that the filtered dataset is always smaller than the original unlabelled dataset; despite this, the accuracy ends up very close. Similarly, for the alexnet model with n_l = 40, the accuracy when using an Indiana unlabelled dataset contaminated with 65% of the Costa Rica dataset is close to 50%, according to Table IV. However, after filtering out harmful unlabelled observations, it ends up close to 71%, using either the FH or the Mahalanobis method. When comparing the accuracy gains of the feature histograms against the Mahalanobis distance based method, we can see a similar behaviour across almost all the tested unlabelled datasets D_u^s. However, for the ChestX-ray8 dataset, the Mahalanobis based method yields statistically significant accuracy gains over the FH approach for the densenet model, as seen in Table X. This suggests that the feature distribution of the labelled dataset D_l^t is well approximated by a Gaussian distribution, given the slightly better results of the Mahalanobis method. The Mahalanobis based method is also faster, as it only needs to compute a covariance matrix, when compared to the histogram based approach, which needs to build a feature histogram and proved to be significantly slower in our tests. As for the tested MCD and Softmax baseline methods, popular in OOD detection and uncertainty estimation, the results depicted in Tables VII and IX, for the alexnet and densenet models, show a very poor performance. The accuracy gains are negligible, and sometimes the accuracy is diminished when compared to the baseline results shown in Tables IV and II. Therefore, the usage of the feature density based methods for filtering potentially harmful unlabelled observations proves to be a significantly better approach. Accuracy gains of up to 25%, with statistical significance in all the tested settings (using a Wilcoxon test with p < 0.1), were obtained when using the feature density approaches over the tested output based ones. This can be seen when comparing the results for the proposed feature density techniques in Tables VIII and X with Tables VII and IX, for both tested architectures, alexnet and densenet, respectively.
In this work, we have analyzed the impact of the distribution mismatch between the labelled and the unlabelled dataset when training an SSDL model using the MixMatch algorithm. The assessed setting used medical imaging data for COVID-19 detection. Assessing the impact of the distribution mismatch between the unlabelled and labelled datasets for medical imaging applications is still an under-reported problem in the literature. In the first test-bed, we assessed the impact of using different unlabelled data sources D_u^s, and quantitatively analyzed the distribution mismatch between them and the labelled dataset using DeDiMs as a metric.
The high linear correlation between the measured DeDiMs and the MixMatch accuracy suggests a strong influence of the feature distribution mismatch between D_u^s and D_l^t. In contexts where a decision must be made about which unlabelled data source D_u^s should be used from a set of possible unlabelled datasets, the DeDiMs might be used as a quantitative prior method. Implementing the tested DeDiMs requires no model training, as a generic ImageNet pre-trained model seems to be good enough to estimate the benefit of using a specific unlabelled dataset D_u^s, according to our results. Data quality metrics for deep learning models, as argued in [45], [5], are an interesting path to develop further, as they might help to narrow the gap between research and real-world implementation of deep learning systems. For instance, building high quality datasets for training a semi-supervised model, or assessing the safety of using a deep learning model beforehand, can benefit from quantitative data quality measures. We argue for the community to include robust data quality metrics in the deployment of deep learning solutions. To increase the robustness of the SSDL model to the distribution mismatch, we tested different approaches to discard potentially harmful unlabelled observations from the unlabelled dataset D_u^s. The tested setting can be considered closer to real-world settings, as images within the same domain were used as OOD data contamination sources. This contrasts with frequent OOD detection benchmarks, where images from very different datasets are used as OOD data sources [70]. Our approach is data-oriented, as it modifies the original dataset in an explicit way by removing potentially harmful unlabelled observations. We tested output based OOD filtering techniques against our proposed feature density based approaches. Our proposed methods, based on feature densities built upon a model pre-trained on ImageNet, showed a large and significant advantage over previous output based OOD filtering methods. In the context of SSDL, some approaches have relied on weighing each unlabelled observation using the output of the model, as in [46]. According to our results, we argue that using the model's output might yield over-confident scores for filtering or weighing unlabelled observations, a behaviour widely known in the OOD detection literature [38]. Even ensemble based approaches like the tested MCD method were not able to filter harmful unlabelled observations, according to our test results. However, both feature density based approaches demonstrated a good performance in detecting harmful unlabelled observations, almost recovering the original accuracy of the non-contaminated datasets. The proposed methods can be deployed to correct and create more effective unlabelled datasets. Moreover, neither of the proposed methods requires any deep learning model training, making them cheap and reducing the carbon footprint of their implementation [56].
References
A brief analysis of U-Net and Mask R-CNN for skin lesion segmentation
Concrete problems in AI safety
The training and practice of radiology in India: current trends. Quantitative Imaging in Medicine and Surgery
An augmented PAC model for semi-supervised learning
Sample-size determination methodologies for machine learning in medical imaging research: A systematic review
Deep learning for screening COVID-19 using chest X-ray images
Armaghan Moemeni, Shengxiang Yang, and Jordina Torrents-Barrena. Quality assessment of dental photostimulable phosphor plates with deep learning
MixMatch: A holistic approach to semi-supervised learning
Jordina Torrents-Barrena, and Miguel A. Molina-Cabello. Dealing with scarce labelled data: Semi-supervised deep learning with MixMatch for COVID-19 detection using chest X-ray images
Correcting data imbalance for semi-supervised COVID-19 detection using X-ray chest images
A real use case of semi-supervised learning for mammogram classification in a local clinic of Costa Rica
More than meets the eye: Semi-supervised learning under non-IID data
MixMOOD: A systematic approach to class distribution mismatch in semi-supervised learning using deep dataset dissimilarity measures
MixMOOD: A systematic approach to class distribution mismatch in semi-supervised learning using deep dataset dissimilarity measures
Improving uncertainty estimation with semi-supervised deep learning for COVID-19 detection using chest X-ray images
Assessing the impact of a preprocessing stage on deep learning architectures for breast tumor multi-class classification with histopathological images
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
Semi-supervised learning under class distribution mismatch
Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis
CT imaging features of 2019 novel coronavirus (2019-nCoV)
COVID-19 image data collection
Preparing a collection of radiology examinations for distribution and retrieval
Unsupervised visual representation learning by context prediction
Tri-net for semi-supervised deep learning
The effectiveness of image augmentation in deep learning networks for detecting COVID-19: A geometric transformation perspective
Safe deep semi-supervised learning for unseen-class unlabeled data
Rare event detection using disentangled representation learning
A baseline for detecting misclassified and out-of-distribution examples in neural networks
A baseline for detecting misclassified and out-of-distribution examples in neural networks
CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison
Deep learning approaches for COVID-19 detection based on chest X-ray images
Deep learning based detection and analysis of COVID-19 on chest X-ray images
Augmenting Monte Carlo dropout classification models with unsupervised learning tasks for detecting and diagnosing out-of-distribution faults
Advances and open problems in federated learning
What uncertainties do we need in Bayesian deep learning for computer vision?
Identifying medical diagnoses and treatable diseases by image-based deep learning
Recent deep semi-supervised learning approaches and related works
Uncertainty estimation for deep neural object detectors in safety-critical applications
A simple unified framework for detecting out-of-distribution samples and adversarial attacks
Enhancing the reliability of out-of-distribution image detection in neural networks
A general framework for uncertainty estimation in deep learning
Smooth neighbors on teacher graphs for semi-supervised learning
Diagnosing COVID-19 pneumonia from X-ray and CT images using deep learning and transfer learning algorithms
Novelty detection: a review - part 1: statistical approaches. Signal Processing
Using cluster analysis to assess the impact of dataset heterogeneity on deep convolutional network accuracy: A first glance
RealMix: Towards realistic semi-supervised deep learning algorithms
ML4H auditing: From paper to practice
Realistic evaluation of deep semi-supervised learning algorithms
Deep transfer learning for multiple class novelty detection
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
Least median of squares regression
Improved techniques for training GANs
Uncertainty-based out-of-distribution detection in deep reinforcement learning
Outlier detection: applications and techniques
Unsupervised and semi-supervised learning with categorical generative adversarial networks
Energy and policy considerations for modern deep learning research
COVID-19: epidemiology, evolution, and cross-disciplinary perspectives
Semi-supervised learning of fetal anatomy from ultrasound
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
Support vector data description
Simple and scalable epistemic uncertainty estimation using a single deep deterministic neural network
The art of data augmentation
A survey on semi-supervised learning
Deep learning: Evolution and expansion
Deep learning via semi-supervised embedding
Enforcing morphological information in fully convolutional networks to improve cell instance segmentation in fluorescence microscopy images
mixup: Beyond empirical risk minimization. arXiv e-prints
Robust semi-supervised learning with out of distribution data
SODA: Detecting COVID-19 in chest X-rays with semi-supervised open set domain adaptation
Deep residual flow for out of distribution detection