key: cord-0567700-sg2kwt5u authors: Calderon-Ramirez, Saul; Yang, Shengxiang; Elizondo, David title: Semi-supervised Deep Learning for Image Classification with Distribution Mismatch: A Survey date: 2022-03-01 journal: nan DOI: nan sha: 790bc516286ac4315068f1ee05b83bf04f7947b8 doc_id: 567700 cord_uid: sg2kwt5u Deep learning methodologies have been employed in several different fields, with an outstanding success in image recognition applications, such as material quality control, medical imaging, autonomous driving, etc. Deep learning models rely on the abundance of labelled observations to train a prospective model. These models are composed of millions of parameters to estimate, increasing the need of more training observations. Frequently it is expensive to gather labelled observations of data, making the usage of deep learning models not ideal, as the model might over-fit data. In a semi-supervised setting, unlabelled data is used to improve the levels of accuracy and generalization of a model with small labelled datasets. Nevertheless, in many situations different unlabelled data sources might be available. This raises the risk of a significant distribution mismatch between the labelled and unlabelled datasets. Such phenomena can cause a considerable performance hit to typical semi-supervised deep learning frameworks, which often assume that both labelled and unlabelled datasets are drawn from similar distributions. Therefore, in this paper we study the latest approaches for semi-supervised deep learning for image recognition. Emphasis is made in semi-supervised deep learning models designed to deal with a distribution mismatch between the labelled and unlabelled datasets. We address open challenges with the aim to encourage the community to tackle them, and overcome the high data demand of traditional deep learning pipelines under real-world usage settings.
Impact statement: This paper is an in-depth review of state-of-the-art semi-supervised deep learning methods, focusing on those dealing with the distribution mismatch setting. Under real-world usage scenarios, a distribution mismatch might occur between the labelled and unlabelled datasets. Recent research has found an important performance degradation in state-of-the-art semi-supervised deep learning (SSDL) methods under this setting. Therefore, state-of-the-art methodologies aim to increase the robustness of semi-supervised deep learning frameworks to this phenomenon. To our knowledge, this is the first work to systematize and study recent approaches to robust SSDL under distribution mismatch scenarios. We believe this work adds value to the literature on this subject, as it identifies the main tendencies surrounding it. We also expect it to encourage the community to draw attention to this emerging subject, which we consider an important challenge to address in order to decrease the lab-to-real-world gap of deep learning methodologies. to biodiversity conservation [7], [10], [12], [14], [17], [33], [62], [69], [97]. Most deep learning architectures rely on the usage of extensively labelled datasets to train models with millions of parameters to estimate [13], [16], [35]. Overfitting is a frequent issue when a deep learning based solution is trained with a small or unrepresentative dataset. This phenomenon often causes poor generalization performance during real-world usage. In spite of this risk, the acquisition of a sufficiently large and representative sample, through rigorous procedures and standards, remains a pending challenge, as argued in [5]. Moreover, procedures to determine whether a dataset is large and/or representative enough are still an open subject in the literature, as discussed in [58]. Labels are often expensive to generate, especially in fields that depend on highly trained professionals, such as radiologists, pathologists, or psychologists [10], [17], [30], [42]. Examples of this include the labelling of histopathological images, necessary for training a deep learning model for use in clinical procedures [17]. Therefore, there is an increasing interest in dealing with scarce labelled data to feed deep learning architectures, stimulated by the success of deep learning based models [63]. Among the most popular and simple approaches to deal with limited labelled observations and diminish model over-fitting is data augmentation. Data augmentation adds artificial observations to the training dataset, using simple transformations of real data samples, namely image rotation, flipping, and artificial noise addition [35]. A description of simple data augmentation procedures for deep learning architectures can be found in [96]. More complex data augmentation techniques make use of generative adversarial networks. Generative models approximate the data distribution, which can be sampled to create new observations, as seen in [32], [78], [102] with different applications. Data augmentation is implemented in popular deep learning frameworks, such as PyTorch and TensorFlow [64]. Transfer learning is also a common approach for dealing with the lack of enough labels. It first trains a model f with an external or source labelled dataset, ideally from a similar domain. The parameters are then fine-tuned with the intended, or target, dataset [88].
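As an illustration of these two strategies (not taken from any of the surveyed works), the following minimal sketch shows a simple label-preserving augmentation pipeline and the fine-tuning of an ImageNet pre-trained backbone with PyTorch/torchvision; the dataset path, class count and hyper-parameters are placeholders.

```python
# A minimal, hypothetical sketch of data augmentation and transfer learning in PyTorch.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Simple label-preserving data augmentation: flips, rotations and mild additive noise.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # artificial noise
])

# Hypothetical small labelled target dataset stored as an image folder.
train_set = datasets.ImageFolder("data/labelled", transform=augment)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Transfer learning: start from ImageNet weights, replace the classification head
# and fine-tune all parameters on the (small) target dataset.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for x, y in loader:  # one epoch of fine-tuning
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```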
Similar to data augmentation, TensorFlow and PyTorch include the weights of widely used deep learning models trained on general purpose datasets such as ImageNet [27], making its usage widespread. Its implementation yields better results when the source and target datasets are more similar. A detailed review on deep transfer learning can be found in [84]. Another alternative to deal with small labelled datasets is Semi-supervised Deep Learning (SSDL), which enables the model to take advantage of unlabelled or even noisily labelled data [47], [94]. As an application example, take the problem of training a face based apparent emotion recognition model. Unlabelled videos and images of human faces are available on the web, and can be fetched with a web crawler. Taking advantage of such unlabelled information might yield improved accuracy and generalization for deep learning architectures. One of the first works in the literature regarding semi-supervised learning is [76], where different methods for using unlabelled data were proposed. More recently, with the increasing development and usage of deep learning architectures, semi-supervised learning methods are attracting more attention. An important number of SSDL frameworks are general enough to allow the usage of popular deep learning architectures in different application domains [63]. Therefore, we argue that it is necessary to review and study the relationship between recent deep learning based semi-supervised techniques, in order to spot gaps and boost research in the field. Some recent semi-supervised learning reviews are already available in the literature. These are detailed in Section I-A. Moreover, we argue that it is important to discuss the open challenges of implementing SSDL in real-world settings, to narrow the lab-application gap. One of the remaining challenges is the frequent distribution mismatch between the labelled and unlabelled data, which can hinder the performance of the SSDL framework. In [101], an extensive review on semi-supervised approaches for machine learning was developed. The authors defined the following semi-supervised approaches: self-training, co-training, and graph based methods. However, deep learning based concepts were not popular at the time of the survey, as auto-encoders and generative adversarial networks were less used, given their high computational cost and consequently impractical usage. Later, a review of semi-supervised learning methods was developed in [66]. In this work, the authors list self-training, co-training, transductive support vector machines, multi-view learning and generative discriminative approaches. Deep learning architectures were still not popular at the time; thus, semi-supervised architectures based on more traditional machine learning methods are reviewed in that work. A brief survey on semi-supervised learning for image analysis and natural language processing was developed in [67]. The study defines the following semi-supervised learning approaches: generative models, self-training, co-training, multi-view learning and graph based models. This review, however, does not focus on deep semi-supervised learning approaches. A more recent survey on semi-supervised learning for medical imaging can be found in [21], with different machine learning based approaches listed. The authors distinguished self-training, graph based, co-training, and manifold regularization approaches for semi-supervised learning.
More medical imaging solutions based on transfer learning than on semi-supervised learning were found by the authors, given its simplicity of implementation. In [63], the authors experimented with some recent SSDL architectures and included a short review. The authors argued that typical testing of semi-supervised techniques is not enough to measure their performance in real-world applications. For instance, common semi-supervised learning benchmarks do not include unlabelled datasets with observations from classes not defined in the labelled data. This is referred to as distractor classes or collective outliers [79]. The authors also highlight the lack of tests around the interaction of semi-supervised learning pipelines with other types of learning, namely transfer learning. More recently, in [92], the authors extensively review different semi-supervised learning frameworks, mostly for deep learning architectures. A detailed concept framework around the key assumptions of most SSDL (the low density/clustering and manifold assumptions) is developed. The taxonomy proposed for semi-supervised methods includes two major categories: inductive and transductive based methods. Inductive methods build a mathematical model or function that can be used for new points in the input space, while transductive methods do not. According to the authors, significantly more semi-supervised inductive methods can be found in the literature. These methods can be further categorized into: unsupervised pre-processing, wrapper based, and intrinsically semi-supervised methods [92]. The authors mentioned the distribution mismatch challenge for semi-supervised learning introduced in [63]. However, their review did not focus on techniques addressing this subject. In [74], a review of semi-, self- and unsupervised learning methods for image classification was developed. The authors focus on the common concepts used in these methods (most of them based on deep learning architectures). Concepts such as pretext or proxy task learning, data augmentation, contrastive optimization, etc., are described as common ideas within the three learning approaches. A useful set of tables describing the different semi-supervised learning approaches along with the common concepts is included in this work. After reviewing the results yielded by recent semi-supervised methods, the authors conclude that few of them include benchmarks closer to the real world (high resolution images, with similar features for each class). Also, real-world settings, such as class imbalance and noisy labels, are often missing. We argue that a detailed survey on SSDL is still missing, as the common short reviews included in SSDL papers usually focus on their closest related works. The most recent semi-supervised learning surveys we have found are outdated and do not focus on deep learning based approaches. We argue that recent SSDL approaches add new perspectives to the semi-supervised learning framework. However, more importantly, to narrow the lab-to-application gap, it is necessary to fully study the state of the art in the efforts to address such challenges. In the context of SSDL, we consider that increasing the robustness of SSDL methods to the distribution mismatch between the labelled and unlabelled datasets is key. Therefore, this review focuses on the distribution mismatch problem between the labelled and the unlabelled datasets. In Section I-B, a review of the main ideas of semi-supervised learning is carried out.
Based on the concepts of both previous sections, in Section II we review the main approaches for SSDL. Later we address the different methods developed so far in the literature regarding SSDL when facing a distribution mismatch between S^(u) and S^(l). Finally, we discuss the pending challenges of SSDL under distribution mismatch conditions in Section IV. In this section we describe the key terminology to analyze SSDL in more detail. We base our terminology on the semi-supervised learning analytical framework developed in [4], which extends the machine learning theoretical framework proposed in [90]. A model f_w is said to be semi-supervised if it is trained using a set of labelled observations S^(l), along with a set of unlabelled observations S^(u) = {x_{n_1}, x_{n_2}, ..., x_{n_u}}, with the total number of observations n = n_l + n_u. Frequently, the number of unlabelled observations n_u is considerably higher than the number of labelled observations. This makes n_u >> n_l, as labels are expensive to obtain in different domains. If the model f_w corresponds to a Deep Neural Network (DNN), we refer to it as SSDL. The deep model f_w is often referred to as the backbone model. In semi-supervised learning, additional information is extracted from an unlabelled dataset S^(u). Therefore, training a deep model can be expressed as f_w = T(S^(l), S^(u), f_w). The estimated hypothesis should classify test data x ∈ S^(t) with a higher accuracy than a model trained using only the labelled data S^(l). Figure 1 plots a semi-supervised setting with observations in d = 2 dimensions. The label can also correspond to an array y_i ∈ R^K, in the case of a 1-of-K encoding (one-hot vector) for classifying observations into K classes, or y_i ∈ R for regression. More specifically, observations of both the labelled and unlabelled dataset belong to the observation space, x_i ∈ X, and labels lie within the label space Y. For instance, observations of binary images of handwritten digits with d pixels would make up an observation space X = {0, 1}^d, and the label set is given as Y = {0, 1, ..., 9}. The concept class C_k corresponds to all the valid combinations of values of the array x_i ∈ R^d for a specific class k. For example, the concept class for the digit 1 models all the possible images of the digit 1; in such a case x_i ∈ C_{k=1}. The concept class C = {C_1, ..., C_K} includes all the possible observations which can be drawn for all the existing classes in a given problem domain. From a data distribution perspective, usually the population density function of the concept class, p_{x~C}(x) = p(x | y = 1, ..., K), and the density for each concept, p_{x~C_k}(x) = p(x | y = k), are unknown. Most semi-supervised methods assume that both the unlabelled data S^(u) and the labelled data S^(l) sample the concept class density, making p_{x~S^(l)}(x) and p_{x~S^(u)}(x) very similar [92]. A labelled and an unlabelled dataset, S^(l) and S^(u) respectively, are said to be identically and independently sampled if the density functions p_{x~S^(u)} and p_{x~S^(l)} are identical and statistically independent. However, in real-world settings different violations of the Independent and Identically Distributed (IID) assumption can be faced. For instance, unlabelled data is likely to contain observations which might not belong to any of the K classes. Potentially, this could lead to a sampled density function different from p_{x~C}(x).
These observations belong to a distractor class dataset, x ∈ D, and are drawn from a theoretical distribution of the distractor class p_{x~D}(x). Figure 1 shows distractor observations drawn from a distractor distribution p_{x~D}(x) in yellow. A subset of unlabelled observations from S^(u), referred to as S^(u)_D, is said to belong to a distractor class if its observations are drawn from a different distribution than the observations that belong to the concept classes. The distractor class is frequently semantically different from the concept classes. Different causes for a violation of the IID assumption for S^(u) and S^(l) might be faced in real-world settings. These are listed as follows, and can be found in different degrees [45]:
• Prior probability shift: The label distribution in the dataset S^(l) might differ when compared to S^(u). A specific case would be a label imbalance in the labelled dataset S^(l) and a balanced unlabelled dataset.
• Covariate shift: A difference in the feature distributions between S^(l) and S^(u), with the same classes sampled in both, leading to a distribution mismatch. In a medical imaging application, for example, this can be related to a difference in the distribution of the sampled features between S^(l) and S^(u), caused by differences in the patient sample.
• Concept shift: This is associated with a shift in the labels of S^(l) with respect to S^(u) for data with the same features. For example, in the medical imaging domain, different practitioners might categorize the same x-ray image into different classes. This is closely related to the problem of noisy labelling [31].
• Unseen classes: The dataset S^(u) contains observations of classes unseen or unrepresented in the dataset S^(l). One or more distractor classes are sampled in the unlabelled dataset. Therefore, a mismatch in the number of labels exists, along with a prior probability shift and a feature distribution mismatch.
Figure 1 illustrates a distribution mismatch setting. The circles correspond to unlabelled data and the squares and diamonds to the labelled dataset. The labelled and unlabelled data for the two classes are clearly imbalanced and sample different feature values. In this case, all the blue unlabelled observations are drawn from the concept classes. However, the yellow unlabelled observations can be considered to have different feature value distributions. Many SSDL methods make use of the clustered-data/low-density separation assumption together with the manifold hypothesis [70]. In this section we study recent semi-supervised deep learning architectures. They are divided into different categories. Such categorization is meant to ease their analysis; however, the categories are not mutually exclusive, as several methods mix concepts from two or more of them. This serves as a background to understand current SSDL approaches to deal with the distribution mismatch between S^(u) and S^(l). A basic approach to leverage the information from an unlabelled dataset S^(u) is to perform, as a first step, an unsupervised pre-training of the classifier f_w. In this document we refer to it as Pre-trained Semi-Supervised deep learning (PT-SSDL). A straightforward way to implement PT-SSDL is to pre-train the encoding section of the model, h^(FE)_{w_FE}(x_i), to optimize a proxy or pretext [89] task δ. The proxy task does not need the specific labels, allowing the usage of unlabelled data.
This proxy loss is minimized during training, and enables the usage of unlabelled data: L_proxy(w) = Σ_{x_i ∈ S^(u)} δ(r_i, f_proxy(x_i)), where the function δ compares the proxy label r_i with the output of the proxy model f_proxy. The proxy task can also be optimized using labelled data. The process of optimizing a proxy task is also known as self-supervision [44]. This can be done in a pre-training step or during training, as seen in the models with unsupervised regularization. A simple approach for this proxy or auxiliary loss is to minimize the unsupervised reconstruction error. This is similar to the usage of a consistency function δ, where the proxy task corresponds to reconstructing the input, making the proxy label r_i = x_i. The usage of an auto-encoder based reconstruction means it is usually necessary to add a decoder path h^(DE)_{w_DE}, which is later discarded at evaluation time. Pre-training can be performed for the whole model, or in a per-layer fashion, as initially explored in [6]. Moreover, pre-training can be easily combined with other semi-supervised techniques, as seen in [50]. In [28], a Convolutional Neural Network (CNN) is pre-trained with image patches from unlabelled data, with the proxy task of predicting the position of a second patch. The approach was tested in object detection benchmarks. In [18] an unsupervised pre-training approach was proposed. It implements a proxy task optimization followed by a clustering step, both using unlabelled data. The proxy task consists of the random rotation of the unlabelled data, and the prediction of its rotation. The proposed method was tested against other unsupervised pre-training methods, using the PASCAL Visual Object Classes 2007 dataset. The proxy or auxiliary task is implemented in different manners in SSDL, as it is not exclusive to pre-training methods. This can be seen in consistency based regularization techniques, later discussed in this work. For instance, in [98] an extensive set of proxy tasks is added as an unsupervised regularization term, and compared with some popular regularized SSDL methods. The authors used the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) for the executed benchmarks. The proposed method showed a slight accuracy gain, with no statistical significance, against two other unsupervised regularization based methods. In Pseudo-label Semi-Supervised deep learning (PLT-SSDL), also known as self-training, self-teaching or bootstrapping, pseudo-labels are estimated for unlabelled data and used for model fine-tuning. A straightforward approach of pseudo-label based training, consisting of co-training two models, can be found in [4]. In co-training, two or more different sets of input dimensions or views are used to train two or more different models. Such views can be just the result of splitting the original input array x_i. For instance, in two-view v_1 and v_2 co-training [4], two labelled datasets S^(l,v1) and S^(l,v2) are used. In an initial iteration i = 1, two models are trained using the labelled datasets, yielding the two view models f_{w^(v1)} and f_{w^(v2)}. This can be considered as a pre-training step. The resulting models can be referred to as an ensemble of models f_{w_i}. As a second step, the disagreement between the two view models is assessed for the unlabelled observations. The final pseudo-label for each observation x_j can be the result of applying a view-wise summarizing operation µ (like averaging or taking the maximum logits), making ỹ_j = µ(f_{w^(v1)}(x_j), f_{w^(v2)}(x_j)). The set of pseudo-labels for the iteration i can be represented as S̃_i = µ(f_{w_i}(S^(u))).
In co-training [4], the observations on which the two models agree are picked by the function Ŝ_i = ϕ(S̃_i), as highly confident observations. The pseudo-labelled data with high confidence are included in the labelled dataset, S^(r)_{i+1} = S^(l) ∪ Ŝ_i, as pseudo-labelled observations. Later the model is re-trained for i = 2, ..., ϑ iterations, repeating the process of pseudo-labelling, filtering the most confident pseudo-labels and re-training the model. In general, we refer to pseudo-labelling as the idea of estimating hard labels ỹ_j for the unlabelled observations. In [29] the Tri-net semi-supervised deep model (Tri-Net) was proposed. Here an ensemble f_w of Deep Convolutional Neural Networks (DCNNs) is trained with k = 1, 2, 3 different top models, using k = 1, 2, 3 different labelled datasets S^(l,k)_i. The output posterior probability is the result of the three models voting. This results in the pseudo-labels for the whole evaluated dataset S̃_i, with i = 1 for the first iteration. The pseudo-label filtering operation ϕ includes into the labelled dataset the observations where at least two of the models agreed. The process is repeated for a fixed number of iterations. Tri-Net can also be combined with any regularized SSDL approach. This combination was tested in [82], and is referred to in this document as Tri-net semi-supervised deep model with a Pi-Model regularization (TriNet+Pi). In [82], a similar ensemble-based pseudo-labelling approach is found. In such work, a mammogram image classifier was implemented, with an ensemble of classifiers that vote for the unlabelled observations. The observations with the highest confidence are added to the dataset, in an iterative fashion. Another recent deep self-training approach can be found in [23], named Speed as a supervisor for semi-supervised Learning Model (SaaSM). In a first step, the pseudo-labels are estimated by measuring the learning speed in epochs, optimizing the estimated labels as a probability density function, S̃_1 = f_{w_1}(S^(u)), with a stochastic gradient descent approach. The estimated labels are used to optimize an unsupervised regularized loss. SaaSM was tested using the Canadian Institute for Advanced Research dataset of 10 classes (CIFAR-10) and the Street View House Numbers dataset (SVHN). It yielded slightly higher accuracy than other consistency regularized methods such as mean teacher, according to the reported results. No statistical significance analysis was done. In Regularized Semi-Supervised deep learning (R-SSDL), or co-regularized learning as defined in [101], the loss function of a deep learning model f_w includes a regularization term using unlabelled data S^(u): L(w) = Σ_{x_i ∈ S^(l)} L_l(w, x_i, y_i) + γ Σ_{x_j ∈ S^(u)} L_u(w, x_j). (2) The unsupervised loss L_u regularizes the model f_w using the unlabelled observations x_j. The unsupervised regularization coefficient γ controls the unsupervised regularization influence during the model training. We consider it an SSDL subcategory, as a large number of approaches have been developed inspired by this idea. In the literature, different approaches for implementing the unsupervised loss function L_u can be found. Sections II-C1, II-C2 and II-C3 group the most common approaches for implementing it. 1) Consistency based regularization: A Consistency Regularized Semi-supervised deep learning (RC-SSDL) loss function measures how robust a model is when classifying unlabelled observations in S^(u) with different transformations applied to the unlabelled data.
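As a concrete illustration of the regularized loss in Equation (2), the following minimal sketch (not taken from any of the surveyed implementations) combines a supervised cross-entropy term with an unsupervised consistency term, using additive Gaussian noise as a placeholder for the label-preserving transformation Ψ_η; all names and hyper-parameters are illustrative.

```python
# A minimal sketch of a consistency-regularized SSDL loss, assuming Gaussian-noise
# perturbations as the label-preserving transformation.
import torch
import torch.nn.functional as F

def perturb(x, sigma=0.1):
    """A simple label-preserving transformation Psi_eta: additive Gaussian noise."""
    return x + sigma * torch.randn_like(x)

def ssdl_loss(model, x_l, y_l, x_u, gamma=1.0):
    # Supervised term L_l on the labelled batch.
    loss_l = F.cross_entropy(model(x_l), y_l)
    # Unsupervised consistency term L_u: the model should produce similar outputs
    # for two perturbed views of the same unlabelled observation.
    p1 = torch.softmax(model(perturb(x_u)), dim=1)
    p2 = torch.softmax(model(perturb(x_u)), dim=1)
    loss_u = F.mse_loss(p1, p2)
    # gamma controls the influence of the unsupervised regularization term.
    return loss_l + gamma * loss_u
```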
Such transformations usually perturb the unlabelled data without changing its semantics and class label (label preserving transformations). For instance, in [4] the consistency assumption χ^(CL) is enforced for two views using the Euclidean distance between the outputs of the two view models, where δ(x_i, f_w) denotes the consistency function. Consistency can also be measured for labelled observations x_j ∈ S^(l). A number of SSDL techniques are based on consistency regularization. Therefore we refer to this category as Consistency based Regularized Semi-Supervised deep learning (CR-SSDL). A simple interpretation of the consistency regularization term is the increase of a model's robustness to noise, by using the data in S^(u). A consistent model output for corrupted observations implies a more robust model, with better generalization. Consistency can be measured between two deep learning models f_w and f_{w'} fed with two different views or random modifications of the same observation, Ψ_η(x_j) and Ψ_{η'}(x_j). For this reason, some authors refer to consistency based SSDL approaches as self-ensemble learning models [55]. The consistency of two or more variations of the model is evaluated, measuring the overall model robustness. Consistency regularized methods are based on the consistency assumption χ^(CL). Thus, such methods can be related to other previously discussed SSDL approaches that also exploit this assumption. For instance, as previously mentioned, the consistency assumption is also implemented in PT-SSDL as the proxy task, where the model is pre-trained to minimize a proxy function. The consistency function can be thought of as a particular case of the aforementioned confidence function for self-training ϕ, using two or more views of the observations, as developed in [4]. However, in this case the different corrupted views of the observation x_j are highly correlated, unlike in the original co-training approach developed in [4]. Nevertheless, recent regularized SSDL models [3], [49], [86] simplify this assumption. They consider as a view of an observation x_j its corruption with random noise η, making up a corrupted view Ψ_η(x_j). The independence assumption [4] between the views of co-training fits better when measuring the consistency between different signal sources, as seen in [56]. In such work, different data sources are used for semi-supervised human activity recognition. In [3] the Pi Model (Pi-M) was proposed. The consistency of the deep model with random noise injected to its weights (commonly referred to as dropout) is evaluated. The weights w' are a corrupted version of the parent model with weights w, making up what the authors refer to as a pseudo-ensemble. The Pi-M model was tested in [86] using the CIFAR-10 and SVHN datasets. Results of Pi-M that intersect with the rest of the methods discussed in this work can be found in Table I. A consistency evaluation of both unlabelled and labelled datasets can be performed, as proposed in [72], in the Mutual Exclusivity-Transformation Model (METM). In such method, an unsupervised loss term for Transformation Supervision (TS) was proposed, which can also be used for unsupervised pre-training. Such loss term can be regarded as a consistency measurement. Furthermore, a Mutual Exclusivity (ME) based loss function is used. It encourages non-overlapping predictions of the model. The ME loss term is included in L_u, with the weighting coefficients λ_1 and λ_2 for each unsupervised loss term. METM was tested with the SVHN and CIFAR-10 datasets.
Results comparable with the rest of the methods reviewed in this work are depicted in Table I. Later, the authors in [49] proposed the Temporal Ensemble Model (TEM), which calculates the consistency of the trained model with the moving weighted average of the predictions from different models along each training epoch τ: ŷ_i^(τ) = ρ ŷ_i^(τ-1) + (1 - ρ) f_{w_τ}(x_i), with ρ the decay parameter and τ the current training epoch. The temporal ensemble compares the temporally averaged prediction against the output of the current model for a noisy observation Ψ_η(x_i). This enforces temporal consistency of the model. Table I shows the results yielded by the TEM method for the CIFAR-10 dataset. Based on this approach, the authors in [85] developed an SSDL approach based on the Kullback-Leibler divergence to measure model consistency. Different transformations Ψ_η are applied to the input observations x_j. These correspond to image flipping, random contrast adjustment, rotation and cropping. The method was evaluated in a real-world scenario with fetal ultrasound images for anatomy classification. An extension of the temporal ensembling idea was presented by the authors of [86], in the popular Mean Teacher Model (MTM). Instead of averaging the predictions of models calculated in past epochs, the authors implemented an exponential weight average: w̄_τ = ρ w̄_{τ-1} + (1 - ρ) w_τ for a training epoch τ, with an exponential weighting coefficient ρ. Such exponentially averaged model with parameters w̄ is referred to by the authors as the teacher model. For comparison purposes, the results yielded by MTM using the CIFAR-10 dataset are depicted in Table I. More recently, the authors in [59] proposed the Virtual Adversarial Training Model (VATM). They implemented a generative adversarial network to inject adversarial perturbations η into the labelled and unlabelled observations. Artificially generated observations are compared to the original unlabelled data. This results in adversarial noise that encourages a more challenging consistency robustness. Furthermore, the authors also added a conditional entropy term, in order to make the model more confident when minimizing it. We refer to this variation as Virtual Adversarial Training with Entropy Minimization (VATM+EM). Both VATM and VATM+EM were tested with the CIFAR-10 dataset, thus we include the comparable results with the rest of the reviewed methods in Table I. Another variation of the consistency function L_u was developed in [19] in what the authors referred to as a memory loss function. We refer to this as the Memory based Model (MeM). This memory loss is based on a memory module, consisting of embeddings m_i = (x̂_i, ŷ_i). Each embedding is composed of the features extracted by the deep learning model, x̂_i = h^(FE)_{w_FE}(x_i), and the corresponding computed probability density function ŷ_i = f_w(x_i) (with ŷ being the logits output), for a given observation x_i. The memory stores one embedding per class, m_k = (x̂_k, ŷ_k), corresponding to the average embedding of all the observations within class k. Previous approaches like the temporal ensemble [49] needed to store the output of past models for each observation. In the memory loss based approach of [19] this is avoided by only storing one average embedding per class. In the second step, the memory loss is computed from the key addressed probability p_i, calculated from the closest embedding to x̂_i, and the model output ŷ_i for such observation.
The factor max(p_i) is the highest value of the probability distribution p_i, and H(p_i) is the entropy of the key addressed output distribution p_i. The factor δ_KL(p_i, ŷ_i) is the Kullback-Leibler distance between the output for the observation x_i and the recovered key address from the memory mapping. Results for the MeM method comparable to the rest of the reviewed methods are shown in Table I for the CIFAR-10 dataset. More recently, an SSDL approach was proposed in [77]. This is referred to as the Transductive Model (TransM). The authors implement a transductive learning approach. This means that the unknown labels y are also treated as variables, thus optimized along with the model parameters w. Therefore, the loss function implements a cross-entropy supervised loss weighted by a confidence coefficient r_i, which indicates the label estimation confidence level for an observation x_i. Such confidence level coefficient makes the model more robust to outliers in both the labelled and unlabelled datasets. The confidence coefficient is calculated using a k-nearest neighbors approach from the labelled data, making use of the observation density assumption χ^(CL). This means that the estimated label is of high confidence if the observation lies in a high density region of the labelled data within the feature space. As DCNNs are meant to be used by the model, the feature space is learned within the training process, making it necessary to recalculate R at each training step τ. As for the unlabelled regularization term, it is composed of a robust feature measurement L_RF and a min-max separation term L_MMF, where λ_RF and λ_MMF weigh their contribution to the unsupervised signal. The first term measures the feature consistency, thus using the output of the learned feature extractor of the model, x̂_i = h^(FE)_{w_FE}(x_i). Regarding the second term, referred to as the min-max separation function, it is meant to maximize the distance between observations of different classes by a minimum margin ρ, and to minimize the distance between observations within the same class. It is implemented using an indicator function δ(y_i, y_j) = 1 when y_i = y_j, and zero otherwise. The first term in L_MMF minimizes the intra-class distance and the second term maximizes the inter-class observation distance. We highlight the theoretical outlier robustness of the method, achieved by implementing the confidence coefficient r_i. This coefficient is able to give lower relevance to unconfident estimations. However, this is yet to be fully proven, as the experiments conducted in [77] did not test the model robustness to single and collective outliers in the unlabelled data. The approach was also combined and tested against other consistency regularization approaches, like the MTM. This is referred to, in this work, as Transductive Model with Mean Teacher (TransM+MTM). The comparable results with the rest of the reviewed approaches are depicted in Table I. An alternative approach for the consistency function was implemented in [89], with a model named by the authors as Self Supervised network Model (SESEMI). The consistency function is fed by what the authors defined as a self-supervised branch. This branch aims to learn simple image transformations or pretext tasks, such as image rotation. The authors claim that their model is easier to calibrate than the MTM, by just using an unsupervised signal weight of λ = 1. The intersected results of SESEMI with the rest of the reviewed methods are detailed in Table I. In [9], the authors proposed a novel SSDL method known as MixMatch.
This method implements a consistency loss which first calculates a soft pseudo-label for each unlabelled observation. Those soft pseudo-labels are the result of averaging the model response to a number of transformations of the input x_j: ỹ_j = (1/T) Σ_{η=1}^{T} f_w(Ψ_η(x_j)). In such equation, T refers to the number of transformations of the input image (i.e., image rotation, cropping, etc.). The specific image transformation is represented by Ψ_η. The authors in [9] recommend using T = 2. Later, the obtained soft pseudo-label is sharpened, in order to decrease its entropy and the under-confidence of the pseudo-label. For this, a temperature parameter ρ is used within the softmax of the output ỹ_j: s(ỹ, ρ)_i = ỹ_i^{1/ρ} / Σ_j ỹ_j^{1/ρ}. The dataset S̃_u = (X_u, Ỹ) contains the sharpened soft pseudo-labels, where Ỹ = {ỹ_1, ỹ_2, ..., ỹ_{n_u}}. The authors of MixMatch found that data augmentation is very important to improve its performance. Taking this into account, the authors implemented the MixUp methodology to augment both the labelled and unlabelled datasets [99]. This is represented as follows: (S'_l, S'_u) = Ψ_MixUp(S_l, S̃_u, α). MixUp generates new observations through a linear interpolation between different combinations of both the labelled and unlabelled data. The labels for the new observations are also linearly interpolated, using both the labels and the pseudo-labels (for the unlabelled data). Mathematically, MixUp takes two pseudo-labelled or labelled data pairs, (x_a, y_a) and (x_b, y_b), and generates the augmented datasets S'_l and S'_u. These augmented datasets are used by MixMatch to train a neural network with parameters w through the minimization of the following loss function: L(w) = Σ_{x_i ∈ S'_l} L_l(w, x_i, y_i) + γ r(t) Σ_{x_j ∈ S'_u} L_u(w, x_j, ỹ_j). The labelled loss term L_l can be implemented with a cross-entropy function, as recommended in [9]: L_l(w, x_i, y_i) = H_CE(y_i, f_w(x_i)). Regarding the unlabelled loss term, a Euclidean distance was tested by the authors in [9]: L_u(w, x_j, ỹ_j) = ||ỹ_j - f_w(x_j)||. In the MixMatch loss function, the coefficient r(t) is implemented as a ramp-up function which augments the weight of the unlabelled loss term as t increases. The parameter γ controls the overall influence of the unlabelled loss term. In Table I, the results yielded in [9] are depicted for the CIFAR-10 dataset. In [8] an extension of the MixMatch algorithm was developed, referred to as ReMixMatch. Two main modifications were proposed: a distribution alignment procedure and a more extensive use of data augmentation. Distribution alignment consists of the normalization of each prediction using both the running average prediction of each class (over a set of previous model epochs) and the marginal label distribution of the labelled dataset. This way, soft pseudo-label estimation accounts for the label distribution and previous label predictions, enforcing soft pseudo-label consistency with both distributions. The extension of the previous simple data augmentation step implemented in the original MixMatch algorithm (where flips and crops were used) consists of two methods. They are referred to as augmentation anchoring and CTAugment by the authors. The empirical evidence gathered by the authors when implementing stronger data augmenting transformations (i.e. gamma and brightness modifications, etc.) in MixMatch showed a performance deterioration. This is caused by the larger variation in the model output for each type of strong transformation, making the pseudo-label less meaningful.
To circumvent this, the authors proposed an augmentation anchoring approach. It uses, for the T strong transformations, the same pseudo-labels estimated with a weak transformation. Such strong transformations are calculated through a modification of the AutoAugment algorithm. AutoAugment originally uses reinforcement learning to find the best resulting augmentation policy (set of transformations used) for the specific target problem [24]. To simplify its implementation for small labelled datasets, the authors in [8] proposed a modification referred to as CTAugment. It estimates the likelihood of generating a correctly classified image, in order to generate images that are unlikely to result in wrong predictions. The performance reported in the executed benchmarks of ReMixMatch showed an accuracy gain ranging from 1% to 6% when compared to the original MixMatch algorithm. No statistical significance tests were reported. Results comparable to other methods reviewed in this work for the CIFAR-10 dataset are shown in Table I. More recently, an SSDL method referred to as FixMatch was proposed in [80]. The authors propose it as a simplified SSDL method compared to other techniques. FixMatch is based upon pseudo-labelling and consistency regularized SSDL. The loss function uses a cross-entropy labelled loss term, along with weak augmentations for the labelled data. For the unlabelled loss term, the cross-entropy is also used, but for strongly augmented unlabelled observations with their corresponding pseudo-labels. The pseudo-label is calculated using weak transformations, taking the class with the maximum logit of the model output. Therefore no model output sharpening is done, unlike MixMatch. Strong augmentations are tested using both Random Augmentation (RA) [25] and CTAugmentation (CTA) [8]. For benchmarking FixMatch, the authors used the CIFAR-10 (40, 250, 4000 labels), the Canadian Institute For Advanced Research dataset with 100 classes (CIFAR-100) (400, 2500 and 10000 labels), the SVHN (40, 250, 1000 labels) and the Self-Taught Learning 10 classes (STL-10) (1000 labels) datasets. For all the methods, variations of the Wide-ResNet CNN backbone were used. The average accuracy for each test configuration was similar to the results yielded by ReMixMatch, with no statistical significance tests performed. Comparable results yielded by FixMatch are depicted in Table I. 2) Adversarial augmentation based regularization: Recent advances in deep generative networks for learning data distributions have encouraged their usage in SSDL architectures [36]. Regularized techniques usually employ basic data augmentation pipelines in order to evaluate the consistency term L_u. However, generative neural networks can be used to learn the distribution of labelled and unlabelled data and generate entirely new observations. These are categorized as Generative adversarial Network based Consistency Regularized Semi-supervised deep learning (GaNC-SSDL). Learning a good approximation of the data distribution Pr_{x~C}(x) allows the artificial generation of new observations. The observations can be added to the unlabelled dataset S^(u), or the very same adversarial training might lead to a refined set of model parameters. In [81], the generative network architecture was extended for SSDL, by implementing a discriminator function f^(d)_{w_d} able to estimate not only whether an observation x_i belongs to one of the classes to discriminate from, but also to which specific class it belongs.
The model was named by the authors as Categorical Generative Adversarial Network (CAT-GAN), given the capacity of the discriminator to perform ordinary 1-of-K classification. Therefore, f^(d)_{w_d} is able to estimate the density function of an unlabelled observation x_i ∈ S^(u). The discriminator model implements a semi-supervised loss function, where H_CE is the cross-entropy term for the labelled observations. The unsupervised discriminator term L^(d)_u was designed to maximize the certainty for unlabelled observations and to minimize it for artificial observations. The authors also included a term for encouraging imbalance correction for the K classes. The proposed method in [81] was tested using the CIFAR-10 and Modified National Institute of Standards and Technology (MNIST) datasets for SSDL. It was compared only against the Pi-M, with marginally better average results and no statistical significance analysis of the results. However, the CAT-GAN served as a foundation for later work on using generative deep models for SSDL. A breakthrough improvement in training the f^(d)_{w_d} and f^(g)_{w_g} models was achieved in [73], aiming to overcome the difficulty of training complementary loss functions with a stochastic gradient descent algorithm. This problem is known as the Nash equilibrium dilemma. The authors yielded such improvement through a feature matching loss for the generator f^(g)_{w_g}, which seeks to make the generated observations match the statistical moments of a real sample from the training data S. The enhanced trainability of the Feature Matching Generative Adversarial Network (FM-GAN) was tested in a semi-supervised learning setting. The semi-supervised loss function implements an unsupervised term L^(d)_u(w_d, x_i) which aims to maximize the discriminator success rate in discriminating both unlabelled real observations x_i ∈ S^(u) and artificially generated ones. The discriminator model f^(d)_{w_d}(x_i) outputs the probability of the observation x_i belonging to one of the real classes. Also in this work, the authors showed how achieving a good semi-supervised learning accuracy (thus a good discriminator) often yields a poor generative performance. The authors suggested that a bad generator better describes Out of Distribution (OOD) data, improving the overall model robustness. Intersected results with the rest of the reviewed SSDL methods are depicted in Table I. In [26], the authors further explored the inverse relationship between generative and semi-supervised discriminative performance with the Bad Generative Adversarial Network (Bad-GAN). The experiments showed how a bad generator f^(g)_{w_g} yields artificial observations that act as distractor observations, improving the overall model robustness. In [52] a comparison between the triple generative network proposed in [22] and the bad generator [26] was done. No conclusive results were reached, leading the authors to suggest a combination of the approaches to leverage accuracy. For comparison purposes with related work, results with the CIFAR-10 are described in Table I. Later, in [68], the authors proposed a co-training adversarial regularization approach for the Co-trained Generative Adversarial Network (Co-GAN), making use of the consistency assumption for two different models f^(d1)_{w_d1} and f^(d2)_{w_d2}. The term L_cot measures the expected consistency of the two models using two views of the same observation, through the Jensen-Shannon divergence. In L_dif(z), each model artificially generates observations to deceive the other. This stimulates a view difference between the models, preventing them from collapsing into each other.
The coefficients λ_cot and λ_dif weigh the contribution of each term. Therefore, for each view a generator is trained, and the discriminator models f^(d1)_{w_d1} and f^(d2)_{w_d2} play the detective role. The proposed method outperformed MTM, TEM and the Bad-GAN according to [68]. Experiments were performed with more than two observation views, generalizing the model for a multi-view layout. The best performing model implemented 8 views, henceforth referred to in this document as the Co-trained Generative Adversarial Network with 8 views (Co-8-GAN). We include results of the benchmarks done in [68] with the SVHN and CIFAR-10 datasets. The authors did not report any statistical significance analysis of the provided results. The Triple Generative Adversarial Network (Triple-GAN) [22] addressed the aforementioned inverse relationship between generative and semi-supervised classification performance, by training three different models, detailed as follows. First, a classifier f^(c)_{w_c}(x_i) which learns the data distribution Pr_{x~C} and outputs pseudo-labels for artificially generated observations. Secondly, a class-conditional generator f^(g)_{w_g} able to generate observations for each individual class. Thirdly, a discriminator f^(d)_{w_d}, which rejects observations outside the labelled classes. The architecture uses pseudo-labelling, since the discriminator uses the pseudo-labels of the classifier f^(c)_{w_c}(x_i). Nevertheless, a consistency regularization was implemented in the classifier loss L^(c). The results using CIFAR-10 with the settings also tested in the rest of the reviewed works are depicted in Table I. 3) Graph based regularization: Graph based Regularized Semi-Supervised Deep Learning (GR-SSDL) is based on previous graph based regularization techniques [34]. The core idea of GR-SSDL is to preserve the mutual distances between observations in the dataset S (both labelled and unlabelled) in a new feature space. An embedding is built through a mapping function x̂_i = h^(FE)_{w_FE}(x_i) which reduces the input dimensionality from d to ď. The mutual distance of the observations in the original input space x_i ∈ R^d, represented in the matrix W ∈ R^{n×n} with W_{i,j} = δ(x_i, x_j), is meant to be preserved in the new feature space x̂_i ∈ R^ď. The multidimensional scaling algorithm developed in [48] is one of the first approaches to preserve the mutual distance of the embeddings of the observations. More recently, in [55], a graph-based regularization was implemented in the Smooth Neighbors on Teacher Graphs Model (SNTGM). The model aims to smooth the consistency of the classifier along the observations in a cluster, and not only along the observations artificially created by the previous consistency-based regularization techniques. The proposed approach implements both a consistency based regularization L_c with weight λ_1 and a guided embedding loss L_e with coefficient λ_2, making the unsupervised term L_u = λ_1 L_c + λ_2 L_e over the observations in S^(u). L_c measures the prediction consistency, using previous approaches from consistency based regularized techniques. The term L_e implements the observation embedding, with a γ margin-restricted distance. To build the neighbourhood matrix W, the authors in [55] used label information instead of computing the distance between the observations. Regarding unlabelled observations in S^(u), the authors estimated the output of the teacher to be ŷ_i = f_w(Ψ_η(x_i)). Thus, the neighbourhood matrix is given by W_{i,j} = 1 if observations x_i and x_j share the same (estimated) label, and W_{i,j} = 0 otherwise. The loss term L_e encourages similar representations for observations within the same class and a higher difference between representations of different classes.
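The following minimal sketch illustrates such a margin-based embedding term; it is an interpretation under the assumption stated above (W_ij = 1 for shared labels or pseudo-labels, 0 otherwise), and the variable names do not follow the original implementation.

```python
# A minimal, hypothetical sketch of a graph-guided embedding loss L_e.
import torch

def embedding_loss(features, labels, margin=1.0):
    # features: (n, d) output of the feature extractor h_w(x_i)
    # labels:   (n,) hard labels or teacher pseudo-labels
    dist = torch.cdist(features, features)                           # pairwise distances
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()      # neighbourhood matrix W
    # Pull together embeddings of the same class; push apart embeddings of
    # different classes up to the margin gamma.
    pull = same * dist.pow(2)
    push = (1.0 - same) * torch.clamp(margin - dist, min=0.0).pow(2)
    return (pull + push).mean()
```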
The algorithm was combined and tested with the Pi-M and VATM consistency functions, henceforth Smooth Neighbors on Teacher Graphs Model with Pi-Model regularization (SNTGM+Pi-M) and Smooth Neighbors on Teacher Graphs Model with Virtual Adversarial Training (SNTGM+VATM), respectively. In [63] and [15], extensive evaluations of the distribution mismatch setting were developed. The authors agreed upon its decisive impact on the performance of SSDL methods and the consequent importance of increasing their robustness to this phenomenon. SSDL methods designed to deal with the distribution mismatch between S^(u) and S^(l) often use ideas and concepts from OOD detection techniques. Most SSDL methods that are robust to distribution mismatch calculate a weight or coefficient, referred to as the function H(x^(u)_j) in this article, to score how likely the unlabelled observation x^(u)_j is to be OOD. The score can be used either to completely discard x^(u)_j from the unlabelled training dataset (referred to as hard thresholding in this work) or to weigh it (soft thresholding). Thresholding the unlabelled dataset can take place as a data pre-processing step or in an online fashion during training. Therefore, we first review modern approaches for OOD detection using deep learning in Section III-A. Later we address state of the art SSDL methods that are robust to distribution mismatch in Section III-B. OOD data detection is a classic challenge faced in machine learning applications. It corresponds to the detection of data observations which are far from the training dataset distribution [38]. Individual and collective outlier detection are particular problems of OOD detection [79]. Other particular OOD detection settings have been tackled in the literature, such as novel data and anomaly detection [65] and infrequent event detection [1], [37]. Well studied and known concepts related to OOD detection have been developed within the pattern recognition community. Some of them are kernel representations [87], density estimation [57], robust moment estimation [71] and prototyping [57]. The more recent developments in the burgeoning field of deep learning for image analysis tasks have boosted the interest in developing OOD detection methods for deep learning architectures. According to our literature survey, we found that OOD detection methods for deep learning architectures can be classified into the following categories: DNN output based and DNN feature space based. In the next subsections we proceed to describe the most popular methods within each category. 1) DNN output based: In [39] the authors proposed a simple method known as Out of DIstribution detector for Neural networks (ODIN) to score input observations according to their OOD probability. The proposed method implements a confidence score based upon the DNN model's output, which is transformed using a softmax layer. The maximum softmax value of all the units is associated with the model's confidence. The authors argued that this score is able to distinguish in-distribution from OOD data. More recently, in [53], the authors argued that using the softmax output of a DNN model to estimate the OOD probability can often be a misleading measure in non-calibrated models. Therefore, the authors in [53] proposed a DNN calibration method. This method implements a temperature coefficient which aims to improve the model's output discriminatory power between OOD and In-Distribution (IOD) data.
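A minimal sketch of this family of output-based scores is shown below: it computes the temperature-scaled maximum softmax value as a confidence score, in the spirit of the softmax baseline and the temperature scaling discussed above; the input-perturbation step used in the full ODIN procedure is omitted, and the temperature value is only illustrative.

```python
# A minimal sketch of an output-based OOD confidence score with temperature scaling.
import torch

def max_softmax_score(model, x, temperature=1000.0):
    with torch.no_grad():
        logits = model(x) / temperature        # temperature-calibrated logits
        probs = torch.softmax(logits, dim=1)
    # Higher maximum softmax value -> the observation is more likely in-distribution.
    return probs.max(dim=1).values
```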
In [53] the authors tested ODIN against the softmax based OOD score proposed in [39], with significantly better results obtained by ODIN. An alternative approach for OOD detection using the DNN's output is the popular approach known as Monte Carlo Dropout (MCD) [46], [54]. This approach uses the distribution of N model forward passes (input evaluations), using the same input observation with mild transformations (noise, flips, etc.) or injecting noise into the model using parameter dropout. The output distribution is used to calculate distribution moments (usually the variance) or other scalar distribution descriptors such as the entropy. This idea has been implemented in OOD detection settings, as OOD observations might score higher entropy or variance values [43], [75]. 2) DNN feature space based: Recently, as an alternative approach, different methods use the feature or latent space for OOD detection. In [51] the authors propose the usage of the Mahalanobis distance in the feature space between the training dataset and the input observation. Therefore the covariance matrix and a mean observation are calculated from the training data (in-distribution data). By using the Mahalanobis distance, the proposed method assumes a Gaussian distribution of the training data. Also, the proposed method was tested in combination with the ODIN calibration method previously discussed. The authors reported a superior performance of their method over the softmax based score proposed in [39] and ODIN [39], [53]. However, no statistical significance analysis of the results was carried out. In [91] another feature space based method was proposed, referred to as Deterministic Uncertainty Quantification (DUQ) by the authors. The proposed method was tested for both uncertainty estimation and OOD detection. It consists of calculating a centroid for each one of the classes within the training dataset (IOD dataset). Later, for each new observation where either uncertainty estimation or OOD detection is intended, the method calculates the distance to each centroid. The shortest distance is used as either the uncertainty or the OOD score. DUQ performance for OOD detection was compared against a variation of the MCD approach, with an ensemble of networks for OOD detection. The authors claimed a better performance of DUQ for OOD detection; however, no statistical analysis of the results was done. The benchmark consisted of using CIFAR-10 as an IOD dataset and SVHN as an OOD dataset. Therefore, as usual in OOD detection benchmarks, the unseen classes setting for the IID assumption violation was tested. In the literature, there are two commonly studied causes for the violation of the IID assumption. The first one is the prior probability shift (different distribution of the labels) between S^(u) and S^(l). Novel methods proposed to deal with this challenge are described in this section. The other cause of IID assumption violation is the unseen class setting, which has been more widely studied. State of the art methods are also discussed in this section. 1) Unseen classes as a cause for the distribution mismatch: Most of the SSDL methods designed to deal with distribution mismatch have been tested using a labelled dataset with different classes (usually fewer) from the unlabelled dataset. For example, in this setting, SVHN is used for S^(l), and for S^(u) a percentage of the sample is drawn from the CIFAR-10 dataset, and the rest from the SVHN dataset.
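A minimal sketch of how such a contaminated benchmark can be assembled is shown below: the labelled set is drawn from SVHN, and the unlabelled set mixes SVHN (in-distribution) with a chosen fraction of CIFAR-10 (out-of-distribution) observations; the sizes and contamination ratio are illustrative, not taken from any of the surveyed protocols.

```python
# A minimal, hypothetical sketch of an SSDL benchmark with OOD contamination.
import torch
from torchvision import datasets, transforms

to_tensor = transforms.Compose([transforms.Resize(32), transforms.ToTensor()])
svhn = datasets.SVHN("data", split="train", download=True, transform=to_tensor)
cifar = datasets.CIFAR10("data", train=True, download=True, transform=to_tensor)

n_labelled, n_unlabelled, contamination = 1000, 20000, 0.5
n_ood = int(contamination * n_unlabelled)

# Labelled dataset S^(l): in-distribution observations only.
labelled = torch.utils.data.Subset(svhn, range(n_labelled))
# Unlabelled dataset S^(u): a mix of in-distribution and OOD observations.
unlabelled = torch.utils.data.ConcatDataset([
    torch.utils.data.Subset(svhn, range(n_labelled, n_labelled + n_unlabelled - n_ood)),
    torch.utils.data.Subset(cifar, range(n_ood)),   # OOD contamination source
])
```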
In this context, the CIFAR-10 dataset is often referred to as the OOD data contamination source. Benchmarks with varying degrees of data contamination for SSDL with distribution mismatch can be found in the literature. In this section we describe the most recent approaches in the literature for SSDL under distribution mismatch with unseen classes.

In [61] an SSDL method for dealing with distribution mismatch was developed. The authors refer to this method as RealMix. It was proposed as an extension of the MixMatch SSDL method. Therefore, it uses consistency based regularization with augmented observations, and the MixUp data augmentation method implemented in MixMatch. For distribution mismatch robustness, RealMix uses the softmax of the model's output as a confidence value to score each unlabelled observation. During training, the unlabelled observations are masked out in the loss function using this confidence score: the φ percent of unlabelled observations with the lowest confidence scores are discarded at each training epoch. To test their method, the authors deployed a benchmark based upon CIFAR-10 with a disjoint set of classes for $S^{(l)}$ and $S^{(u)}$. The reported results showed a slight accuracy gain of the proposed method over other SSDL approaches not designed for distribution mismatch robustness. A fixed number of labelled observations and fixed CNN backbones were used, and no statistical significance tests over the results were done. RealMix can be categorized as a DNN output based OOD scoring method. The thresholding is done repeatedly during training, using binary or hard thresholding (keep or discard). The testing can be considered limited, as the OOD contamination source causes a hard distribution mismatch.

More recently, the method known as Uncertainty Aware Self-Distillation (UASD) was proposed in [20] for SSDL distribution mismatch robustness. UASD uses an unsupervised regularization loss. For each unlabelled observation, a pseudo-label is estimated as the average label from an ensemble of models, where the ensemble is composed of past models yielded in previous training epochs. Similar to RealMix, UASD uses the output of a DNN model to score each unlabelled observation. However, to increase the robustness of this confidence score, UASD uses the ensemble of predictions from past models to estimate the model's confidence over its prediction for each unlabelled observation. The maximum logit of the ensemble prediction is used as the confidence score. Therefore, we can categorize the UASD method as a DNN output based approach. Also, in a similar fashion to RealMix, the estimated scores are used for hard-thresholding the unlabelled observations. Following a similar trend to RealMix, the authors of the UASD method evaluated their approach using the CIFAR-10 dataset: $S^{(l)}$ includes 6 animal classes, and $S^{(u)}$ samples the other 4 classes of CIFAR-10, with a varying degree of class distribution mismatch. Only five runs were performed to approximate the error-rate distribution, and no statistical analysis of the results was done. Neither a varying number of labelled observations nor different DNN backbones were tested. UASD was compared with SSDL methods not designed for distribution mismatch robustness. From the reported results, UASD yielded an accuracy gain of up to 6 percent over previous SSDL methods when facing heavy distribution mismatch settings.
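Both RealMix and UASD rely on the same basic mechanism: score each unlabelled observation with the model's output and drop the least confident fraction at each epoch. A minimal sketch of this hard-thresholding step is shown below; the discard fraction is a placeholder, and the confidence definition (maximum softmax here) stands in for the per-method scores described above.

```python
import torch
import torch.nn.functional as F

def hard_threshold_unlabelled(model: torch.nn.Module,
                              x_unlabelled: torch.Tensor,
                              discard_fraction: float = 0.3) -> torch.Tensor:
    """Keep only the most confident unlabelled observations for the current epoch.

    Confidence is the maximum softmax output of the model (a DNN output based
    score); the `discard_fraction` lowest-scoring observations are dropped.
    """
    with torch.no_grad():
        confidence = F.softmax(model(x_unlabelled), dim=1).max(dim=1).values
    n_keep = int((1.0 - discard_fraction) * len(x_unlabelled))
    keep_idx = confidence.argsort(descending=True)[:n_keep]
    return x_unlabelled[keep_idx]
```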
In [20], another SSDL approach to deal with distribution mismatch was introduced. The authors refer to their proposed approach as Deep Safe Semi-Supervised Learning (D3SL). It implements an unsupervised regularization through the mean squared error between the prediction for an unlabelled observation and the prediction for a noisy modification of it. An observation-wise weight for each unlabelled observation is implemented, similar to RealMix and UASD. However, the weights for the entire unlabelled dataset are calculated using an error gradient optimization approach: both the model's parameters and the observation-wise weights are estimated in two nested optimization steps. Therefore, we can categorize this method as a gradient optimized scoring of the unlabelled observations. The weights are continuous, non-binary values, so we can refer to this method as a softly-thresholded one. According to the authors, this increases training time by up to 3×. The testing benchmark uses the CIFAR-10 and MNIST datasets. For both of them, 6 classes are used to sample $S^{(l)}$, and the remaining classes for $S^{(u)}$. Only a Wide ResNet-28-10 CNN backbone was used, with a fixed number of labels. A varying degree of OOD contamination was tested. The proposed D3SL method was compared with generic SSDL methods, thus ignoring previous SSDL methods robust to distribution mismatch. From the reported results, an average accuracy gain of around 2% was yielded by the proposed method under the heaviest OOD data contamination settings, with no statistical significance reported. Only five runs were done to report such averaged error-rates.

A gradient optimization based method similar to D3SL can be found in [95]. The proposed method is referred to as a Multi-Task Curriculum Framework (MTCF) by the authors. Similar to previous methods, MTCF defines an OOD score for the unlabelled observations, as an extension of the MixMatch algorithm [9]. Such scores are alternately optimized together with the DNN parameters, as seen in D3SL. However, the optimization problem is arguably simpler than in D3SL, as the OOD scores are not optimized directly in a gradient descent fashion; instead, the DNN output is used as the OOD score. The usage of a loss function that includes the OOD scores enforces a new condition on the optimization of the DNN parameters. This is referred to as a curriculum multi-task learning framework by the authors of [95]. The proposed method was tested in what the authors defined as an open-set semi-supervised learning setting (Open-Set-SSLS), where different OOD data contamination sources were used. Regarding the specific benchmarking settings, the authors only tested a Wide ResNet DNN backbone, comparing a baseline MixMatch method to their proposed approach. No comparison with other SSDL methods was performed. The authors used two IOD datasets, CIFAR-10 and SVHN, and four different OOD datasets: Uniform noise, Gaussian noise, Tiny ImageNet (TIN) and the Large-scale Scene Understanding (LSUN) dataset. The average of the last 10 checkpoints of the model training, using the same partitions, was reported (no different partitions were tested), and a fixed OOD data contamination degree was tested. The reported accuracy gains went from 1% to 10%. The usage of the same data partitions inhibited an appropriate statistical analysis of the results.
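The methods above share a soft-weighting pattern: each unlabelled observation contributes to the unsupervised consistency term in proportion to a continuous weight. The sketch below illustrates this pattern with a simple mean-squared consistency loss and Gaussian input noise; the noise model and the origin of the weights are illustrative assumptions and differ from the exact formulations of D3SL or MTCF.

```python
import torch
import torch.nn.functional as F

def weighted_consistency_loss(model: torch.nn.Module,
                              x_unlabelled: torch.Tensor,
                              weights: torch.Tensor) -> torch.Tensor:
    """Soft-thresholded unsupervised term: mean-squared consistency between the
    predictions for an unlabelled observation and a noisy version of it, scaled
    by a continuous per-observation weight (e.g., an OOD score)."""
    noisy = x_unlabelled + 0.05 * torch.randn_like(x_unlabelled)   # simple perturbation
    p_clean = F.softmax(model(x_unlabelled), dim=1)
    p_noisy = F.softmax(model(noisy), dim=1)
    per_obs = ((p_clean - p_noisy) ** 2).mean(dim=1)               # consistency per observation
    return (weights * per_obs).mean()
```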
Following the same trend, the authors of [100] proposed a gradient optimization based method to calculate observation-wise weights for the data in $S^{(u)}$. Two different gradient approximation methods were tested: Implicit Differentiation (IF) and Meta Approximation (MetaA). The authors argue that finding a weight for each unlabelled observation in a large sample $S^{(u)}$ is an intractable problem; both tested methods therefore aim to reduce the computational cost of optimizing such weights. Moreover, to further reduce the number of weights to find, the method performs a clustering in the feature space and assigns one weight per cluster. Another interesting finding reported by the authors is the impact of OOD data on batch normalization. Even if the OOD data lies far from the decision boundary, a degradation of performance is likely when batch normalization is carried out; if no batch normalization is performed, OOD data far from the decision boundary might not significantly harm performance. Therefore, the estimated weights are also used to perform a weighted mini-batch normalization of the data. Regarding the benchmarking of the proposed method, the authors used the CIFAR-10 and FashionMNIST datasets with different degrees of OOD contamination, where the OOD data was sampled from a set of classes excluded from the IOD dataset. A WRN-28-2 (WideResNet) backbone was used. No statistical analysis of the results with the same number of partitions across the tested methods was performed. The average accuracy gains show a positive margin for the proposed method, ranging from 5% to 20%.

Another approach for SSDL robust to distribution mismatch was proposed in [93], referred to by the authors as Augmented Distribution Alignment (ADA). Similar to MixMatch, ADA uses MixUp [99] for data augmentation. The method includes an unsupervised regularization term which measures the distribution divergence between the $S^{(u)}$ and $S^{(l)}$ datasets in the feature space. In order to diminish the empirical distribution mismatch between $S^{(u)}$ and $S^{(l)}$, the distribution distance between both datasets is minimized, building a feature extractor that aims for a latent space where both feature densities are aligned. This is done through adversarial loss optimization. We can categorize this method as a feature space based method. As for the reported benchmarks, the authors did not test different degrees of OOD data contamination, and only compared their method to generic SSDL methods not designed to handle distribution mismatch. No statistical significance tests were done to measure the confidence of their accuracy gains. On average, the proposed method seems to improve the error-rate by 0.5% to 2% when compared to other generic SSDL methods. These results do not ensure a practical accuracy gain, as no statistical analysis was performed. With respect to the baseline model with no distribution alignment, an accuracy gain of around 5% was reported, again with no statistical significance analysis. The authors of [11] also used the feature space to score unlabelled observations; their method was tested in the specific application setting of COVID-19 detection using chest X-ray images.

Recently, in [40], an SSDL approach to distribution mismatch robustness was developed. The method consists of two training steps. The first, or warm-up, step performs a self-training phase where a pretext task is optimized using the DNN backbone. This is implemented as the prediction of the degrees of rotation applied to each image in a rotationally augmented dataset, which includes observations from both data samples $S^{(u)}$ and $S^{(l)}$ (along with the OOD observations).
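A minimal sketch of such a rotation-prediction pretext task is shown below; the four-way rotation set, the separate rotation head and the cross-entropy loss are our assumptions for illustration, not necessarily the exact design choices of [40].

```python
import torch
import torch.nn.functional as F

def rotation_pretext_loss(backbone: torch.nn.Module,
                          rotation_head: torch.nn.Module,
                          images: torch.Tensor) -> torch.Tensor:
    """Self-supervised warm-up: predict which of four rotations was applied.

    `backbone` maps images (B, C, H, W) to features and `rotation_head` maps
    features to 4 logits (0, 90, 180, 270 degrees). Labelled, unlabelled and
    OOD images can all be fed here, since no class labels are required.
    """
    rotated, targets = [], []
    for k in range(4):                                    # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        targets.append(torch.full((images.size(0),), k, dtype=torch.long))
    logits = rotation_head(backbone(torch.cat(rotated)))
    return F.cross_entropy(logits, torch.cat(targets))
```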
In the second step, the model is trained using a consistency based regularization approach for the unlabelled data; this consistency regularization also uses the rotation consistency loss. In this step, an OOD filtering method is implemented, referred to by the authors as a cross-modal mechanism. It consists of the prediction of a pseudo-label, defined as the softmax of the DNN output. This pseudo-label, along with its feature embedding, is fed to what the authors refer to as a matching head. Such a matching head consists of a multi-layer perceptron trained to estimate whether the pseudo-label correctly matches its embedding. The matching head is trained with the labelled data, using different matching combinations of the labels and the observations. As for the testing benchmark, the authors used CIFAR-10, Animals-10 and CIFAR-ID-50 as IOD datasets. For OOD data sources, images of Gaussian and Uniform noise, along with the TIN and LSUN datasets, were used. The average accuracy reported for all the tested methods corresponds to the last 20 model copies yielded during training; therefore, no different training partitions were tested, preventing an adequate statistical analysis of the results. The average results were not substantially better than those of other generic SSDL methods such as FixMatch, with an accuracy gain of around 0.5% to 3%. No figures about the computational cost of training the additional matching head or the warm-up stage were provided. The authors claim that in their method OOD data is re-used. However, other methods like UASD also avoid permanently discarding OOD data, as the observation-wise scores are dynamically recalculated every epoch. From our point of view, a more appropriate description of the novelty of their method would be the complete usage of OOD data in a pre-training stage.

2) Prior probability shift: Data imbalance for supervised approaches has been widely tackled in the literature. Different approaches have been proposed, ranging from data transformations (over-sampling, data augmentation, etc.) to model architecture focused approaches (e.g., modification of the loss function) [2], [60], [83]. Nevertheless, the literature on label imbalance, or a mismatch in label balance between the labelled and unlabelled datasets, is scarcer. This setting can be interpreted as a particularisation of the distribution mismatch problem described in [63]: a distribution mismatch between $S^{(l)}$ and $S^{(u)}$ might arise when the label or class membership distribution of the observations in both datasets meaningfully differs. In [41], an assessment of how this distribution mismatch impacts an SSDL model is carried out. The cause of the distribution mismatch between the labelled and unlabelled datasets was the difference in label imbalance between them. An accuracy decrease of between 2% and 10% was measured when the SSDL model faced such a data setting. The authors proposed a simple method to recover from this performance degradation. The method consists of assigning a specific weight to each unlabelled observation in the loss term. To choose the weight, the output unit of the model with the highest score at the current epoch is used as a label prediction. In that work, the mean teacher model [86] was tested as the SSDL approach. The authors reported superior performance of the SSDL model when using the proposed method.
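Under the assumption (ours, for illustration) that the weighting compares the pseudo-label distribution of the unlabelled data against the class frequencies of the labelled set, a per-observation weight of this kind could be sketched as follows; the exact rule used in [41] may differ.

```python
import torch

def pseudo_label_class_weights(logits_unlabelled: torch.Tensor,
                               labelled_class_freq: torch.Tensor) -> torch.Tensor:
    """Weight each unlabelled observation according to its predicted class.

    The highest-scoring output unit is used as a label prediction. Observations
    predicted to belong to classes that are over-represented among the
    pseudo-labels (relative to the labelled set, given as normalized class
    frequencies) are down-weighted; under-represented classes are up-weighted.
    """
    pseudo = logits_unlabelled.argmax(dim=1)              # predicted class per observation
    counts = torch.bincount(pseudo, minlength=labelled_class_freq.numel()).float()
    pseudo_freq = counts / counts.sum().clamp(min=1.0)
    ratio = labelled_class_freq / pseudo_freq.clamp(min=1e-8)
    return ratio[pseudo]                                   # one weight per unlabelled observation
```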
An extension to the work in [41] is found in [16], where the more recent MixMatch SSDL method is modified to improve the robustness of the model to heavy imbalance conditions in the labelled dataset. The approach was extensively tested in the specific application of COVID-19 detection using chest X-ray images.

Among the most important challenges faced by SSDL under practical usage situations is the distribution mismatch between the labelled and unlabelled datasets. However, according to our state-of-the-art review, there is significant work pending, mostly related to the implementation of standard benchmarks for novel methods. The benchmarks found so far in the literature show a significant bias towards the unseen-class distribution mismatch setting. No testing of other distribution mismatch causes, such as covariate shift, was found in the literature. Real-world usage settings might include covariate shift and prior probability distribution shift, both of which violate the frequently used IID assumption. Therefore, we urge the community to focus on different distribution mismatch causes.

Studying and developing methods for dealing with distribution mismatch settings shifts the focus towards data-oriented methods (i.e., data transformation, filtering and augmentation) instead of the more popular model-oriented methods. Recently, the renowned researcher Andrew Ng has drawn attention towards data-oriented methods. In his view, not enough effort has been devoted by the community to studying and developing data-oriented methods to face real-world usage settings. We agree with Andrew Ng's opinion, and add that, besides establishing and testing a set of standard benchmarks where different distribution mismatch settings are evaluated, experimental reproducibility must be enforced. Recent technological advances allow sharing not only the code and datasets used, but also the testing environments, through virtualization and container technology. Finally, we argue that the deep learning research community must be mindful not to compare only the average accuracies of the different state-of-the-art methods. Statistical analysis tools must be used to test whether the performance difference between one method and another is reproducible and statistically meaningful. Therefore, we suggest sharing the full distribution of the results, and not only their means and standard deviations, in order to enable further statistical analysis.

REFERENCES

[1] Concrete problems in AI safety.
[2] Deep over-sampling framework for classifying imbalanced data.
[3] Learning with pseudo-ensembles.
[4] An augmented PAC model for semi-supervised learning.
[5] Sample-size determination methodologies for machine learning in medical imaging research: A systematic review.
[6] Greedy layer-wise training of deep networks.
[7] Quality assessment of dental photostimulable phosphor plates with deep learning.
[8] ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring.
[9] MixMatch: A holistic approach to semi-supervised learning.
[10] Assessing the impact of the deceived non local means filter as a preprocessing stage in a convolutional neural network based approach for age estimation using digital hand X-ray images.
[11] Dealing with scarce labelled data: Semi-supervised deep learning with MixMatch for COVID-19 detection using chest X-ray images.
[12] A first glance into reversing senescence on herbarium sample images through conditional generative adversarial networks.
[13] Improving uncertainty estimations for mammogram classification using semi-supervised learning.
[14] A real use case of semi-supervised learning for mammogram classification in a local clinic of Costa Rica.
[15] MixMOOD: A systematic approach to class distribution mismatch in semi-supervised learning using deep dataset dissimilarity measures.
[16] Correcting data imbalance for semi-supervised COVID-19 detection using X-ray chest images.
[17] Assessing the impact of a preprocessing stage on deep learning architectures for breast tumor multi-class classification with histopathological images.
[18] Unsupervised pre-training of image features on non-curated data.
[19] Semi-supervised deep learning with memory.
[20] Semi-supervised learning under class distribution mismatch.
[21] Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis.
[22] Triple generative adversarial nets.
[23] SaaS: Speed as a supervisor for semi-supervised learning.
[24] Learning augmentation policies from data.
[25] RandAugment: Practical automated data augmentation with a reduced search space.
[26] Good semi-supervised learning that requires a bad GAN.
[27] ImageNet: A large-scale hierarchical image database.
[28] Unsupervised visual representation learning by context prediction.
[29] Tri-net for semi-supervised deep learning.
[30] Machine learning approaches for clinical psychology and psychiatry.
[31] Classification in the presence of label noise: a survey.
[32] Synthetic data augmentation using GAN for improved liver lesion classification.
[33] Convolutional neural networks for segmenting xylem vessels in stained cross-sectional images.
[34] Dissimilarity in graph-based semi-supervised classification.
[35] Deep learning.
[36] Generative adversarial nets.
[37] Rare event detection using disentangled representation learning.
[38] A baseline for detecting misclassified and out-of-distribution examples in neural networks.
[39] A baseline for detecting misclassified and out-of-distribution examples in neural networks.
[40] Trash to treasure: Harvesting OOD data with cross-modal matching for open-set semi-supervised learning.
[41] Class-imbalanced semi-supervised learning.
[42] Paediatric bone age assessment using deep convolutional neural networks.
[43] Augmenting Monte Carlo dropout classification models with unsupervised learning tasks for detecting and diagnosing out-of-distribution faults.
[44] Self-supervised visual feature learning with deep neural networks: A survey.
[45] Advances and open problems in federated learning.
[46] What uncertainties do we need in Bayesian deep learning for computer vision?
[47] Recycling: Semi-supervised learning with noisy labels in deep neural networks.
[48] Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis.
[49] Temporal ensembling for semi-supervised learning.
[50] Deep neural network self-training based on unsupervised learning and dropout.
[51] A simple unified framework for detecting out-of-distribution samples and adversarial attacks.
[52] Semi-supervised learning based on generative adversarial network: a comparison between good GAN and bad GAN approach.
[53] Enhancing the reliability of out-of-distribution image detection in neural networks.
[54] A general framework for uncertainty estimation in deep learning.
[55] Smooth neighbors on teacher graphs for semi-supervised learning.
[56] Bi-view semi-supervised learning based semantic human activity recognition using accelerometers.
[57] Novelty detection: a review, part 1: statistical approaches. Signal Processing.
[58] Using cluster analysis to assess the impact of dataset heterogeneity on deep convolutional network accuracy: A first glance.
[59] Virtual adversarial training: a regularization method for supervised and semi-supervised learning.
[60] Generative adversarial minority oversampling.
[61] RealMix: Towards realistic semi-supervised deep learning algorithms.
[62] ML4H auditing: From paper to practice.
[63] Realistic evaluation of deep semi-supervised learning algorithms.
[64] Automatic differentiation in PyTorch.
[65] Deep transfer learning for multiple class novelty detection.
[66] A survey of semi-supervised learning methods.
[67] A survey on semi-supervised learning techniques.
[68] Deep co-training for semi-supervised image recognition.
[69] Dealing with scarce labelled data: Semi-supervised deep learning with MixMatch for COVID-19 detection using chest X-ray images.
[70] The manifold tangent classifier.
[71] Least median of squares regression.
[72] Regularization with stochastic transformations and perturbations for deep semi-supervised learning.
[73] Improved techniques for training GANs.
[74] A survey on semi-, self- and unsupervised learning for image classification.
[75] Uncertainty-based out-of-distribution detection in deep reinforcement learning.
[76] The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon.
[77] Transductive semi-supervised deep learning using min-max features.
[78] Medical image synthesis for data augmentation and anonymization using generative adversarial networks.
[79] Outlier detection: applications and techniques.
[80] FixMatch: Simplifying semi-supervised learning with consistency and confidence.
[81] Unsupervised and semi-supervised learning with categorical generative adversarial networks.
[82] Enhancing deep convolutional neural network scheme for breast cancer diagnosis with unlabeled data.
[83] AdaBoost-CNN: an adaptive boosting algorithm for convolutional neural networks to classify multi-class imbalanced datasets using transfer learning.
[84] A survey on deep transfer learning.
[85] Semi-supervised learning of fetal anatomy from ultrasound.
[86] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.
[87] Support vector data description.
[88] Transfer learning.
[89] Semi-supervised learning with self-supervised networks.
[90] A theory of the learnable.
[91] Simple and scalable epistemic uncertainty estimation using a single deep deterministic neural network.
[92] A survey on semi-supervised learning.
[93] Semi-supervised learning by augmented distribution alignment.
[94] Learning from massive noisy labeled data for image classification.
[95] Multi-task curriculum framework for open-set semi-supervised learning.
[96] Deep learning in remote sensing scene classification: a data augmentation enhanced convolutional neural network framework. GIScience & Remote Sensing.
[97] Enforcing morphological information in fully convolutional networks to improve cell instance segmentation in fluorescence microscopy images.
[98] S4L: Self-supervised semi-supervised learning.
[99] mixup: Beyond empirical risk minimization.
[100] Robust semi-supervised learning with out-of-distribution data.
[101] Semi-supervised learning literature survey.
[102] Emotion classification with data augmentation using generative adversarial networks.