A Survey on Epistemic (Model) Uncertainty in Supervised Learning: Recent Advances and Applications
Xinlei Zhou, Han Liu, Farhad Pourpanah, Tieyong Zeng, Xizhao Wang

Quantifying the uncertainty of supervised learning models plays an important role in making more reliable predictions. Epistemic uncertainty, which is usually due to insufficient knowledge about the model, can be reduced by collecting more data or refining the learning models. Over the last few years, scholars have proposed many epistemic uncertainty handling techniques, which can be roughly grouped into two categories, i.e., Bayesian and ensemble. This paper provides a comprehensive review of epistemic uncertainty learning techniques in supervised learning over the last five years. As such, we first decompose the epistemic uncertainty into bias and variance terms. Then, a hierarchical categorization of epistemic uncertainty learning techniques along with their representative models is introduced. In addition, several applications such as computer vision (CV) and natural language processing (NLP) are presented, followed by a discussion on research gaps and possible future research directions. Supervised learning, as a broad branch of machine learning, refers to the task of learning a mapping function that associates high-dimensional input samples with their corresponding target vectors using labeled data [1, 2, 3, 4]. Supervised models have been successfully used in a variety of real-world applications, e.g., medical and fault diagnosis [5, 6, 7, 8], object detection [9, 10], text processing [11, 12, 13], speech recognition [14], image segmentation [15, 16, 17], and image enhancement [18, 19]. Indeed, supervised learning is a process of predicting unknown data based on partial samples that cannot accurately represent the distribution of the whole data set. In such an experience-driven process, the model is not provably correct but only hypothetical, and therefore uncertain; the same holds for the predictions produced by the model [20]. In addition, the challenges of big data, such as skyrocketing feature dimensions and categories, missing data, unbalanced data distributions, and huge solution spaces, aggravate the uncertainty of the learning process, which seriously affects the performance of supervised learning algorithms [21]. Moreover, supervised learning approaches are unable to distinguish in-domain from out-of-domain samples [22], cannot provide reliable uncertainty approximations [23], and lack expressiveness during inference [24]; therefore, their deployment in high-risk and safety-critical applications remains limited. To alleviate these issues, it is vital to present uncertainty estimates in a way that allows uncertain predictions to be discarded or passed to experts [25]. In supervised learning, traditional uncertainty assessment is usually based on a single probability distribution. Nowadays, the widely accepted approach is to quantify uncertainty separately by distinguishing two different sources, i.e., aleatoric uncertainty and epistemic uncertainty [20]. Aleatoric (data) uncertainty reflects an inherent property of the data, such as noise. It is usually caused by irreducible error in the data measurement and observation process, and cannot be reduced even by collecting more data.
Kendall and Gal [26] further divided aleatoric uncertainty into homoscedastic and heteroscedastic uncertainty. The former stays constant across input samples of the same task, while the latter depends on differences among the input data, for example, some inputs containing more noise than others. In contrast, epistemic (model) uncertainty refers to a state in which the model's cognition is restricted, owing to the upper limit of the model's fitting ability, the optimization strategy, the parameters, or the lack of knowledge. It can be reduced by gathering more data or refining models. On the other hand, the generalization error is a standard metric for quantifying the effectiveness of decisions made by supervised learning models, and several studies [27, 28, 29] have proved that the generalization error manifests the predictive uncertainty; this line of work also pursues the theoretical exploration, quantification and formal expression of their relationship. The generalization error can be decomposed into three terms, i.e., noise, bias, and variance [30, 31]. Suppose $y_o = f(x) + \epsilon$ represents the observed value for a given $x \in \mathbb{R}^d$, which is corrupted by noise $\epsilon \sim \mathcal{N}(0, \sigma_1^2)$. Thus, $y_o \sim \mathcal{N}(f(x), \sigma_1^2)$, where $f(x)$ represents the original target. Besides, $y_p \sim \mathcal{N}(\hat{y}, \sigma_2^2)$ denotes the distribution of the prediction $\hat{f}(x)$ centered on the mean value $\hat{y} = \mathbb{E}[\hat{f}(x)]$. The noise term ($\sigma_1^2$) arises from the data and is irreducible; it corresponds to the aleatoric (data) uncertainty. In contrast, the (squared) bias term ($[\mathbb{E}[\hat{f}(x)] - f(x)]^2$) reveals the gap between the estimated value and the true value. It reflects the degree of cognitive limitation caused by the setting of model properties such as parameters, strategies, or learning algorithms. The variance term ($\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]$) is related to the sensitivity of the model to the training samples. Thus, we argue that the bias term together with the variance represents the epistemic (model) uncertainty. Fig. 1 shows these three terms and explains predictive uncertainty from the perspective of the generalization error decomposition. The generalization error of a model can be reduced through the correlation analysis of bias and variance, i.e., epistemic uncertainty. Therefore, analyzing the epistemic uncertainty, using the established relationship between uncertainty and error terms (as shown in Fig. 1), can help to select an appropriate uncertainty quantification method and improve the model performance.
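To make the decomposition concrete, the following sketch (an illustration added for this review, with a made-up target function and noise level) estimates the bias and variance terms empirically by retraining the same model on many noisy realizations of a synthetic training set:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def f(x):                       # true target f(x)
    return np.sin(3 * x)

sigma1 = 0.3                    # noise std; sigma1**2 is the aleatoric term
x_train = rng.uniform(-1, 1, 200)
x_test = np.linspace(-1, 1, 100)

# Retrain the same model on many noisy training sets and collect predictions.
preds = []
for _ in range(200):
    y_train = f(x_train) + rng.normal(0, sigma1, x_train.size)
    model = DecisionTreeRegressor(max_depth=4).fit(x_train[:, None], y_train)
    preds.append(model.predict(x_test[:, None]))
preds = np.array(preds)                      # shape: (runs, test points)

mean_pred = preds.mean(axis=0)               # E[f_hat(x)]
bias_sq = (mean_pred - f(x_test)) ** 2       # [E[f_hat(x)] - f(x)]^2
variance = preds.var(axis=0)                 # E[(f_hat(x) - E[f_hat(x)])^2]

# Expected squared error decomposes into noise + bias^2 + variance.
print(f"noise={sigma1**2:.3f}  bias^2={bias_sq.mean():.3f}  "
      f"variance={variance.mean():.3f}")
```

Increasing `max_depth` typically shrinks the bias term while inflating the variance term, which is exactly the trade-off that epistemic uncertainty analysis seeks to expose.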
Existing survey papers: There exist several overviews of uncertainty learning techniques in machine learning, written from different perspectives and with different emphases (see Table 1). In 2016, Wang and He [21] discussed the challenging issues in analyzing big data and emphasized the importance of modeling uncertainty in improving the performance of learning models. Subsequently, Hariri et al. [33] surveyed uncertainty learning techniques in big data. They briefly introduced the classical uncertainty measuring techniques in machine learning and categorized them into probability theory, Shannon's entropy, fuzzy set theory, and rough set theory. Recently, Hüllermeier and Waegeman [20] emphasized the significance of identifying aleatoric and epistemic uncertainty separately in machine learning. With the popularity of deep learning techniques, most recent review papers have focused on techniques that are effective for neural network frameworks. For example, Kabir et al. [32] provided a review of uncertainty quantification techniques in neural networks from the perspective of prediction intervals. Wang et al. [34] described Bayesian deep learning as a unified framework that combines deep learning techniques with a paradigm of excellent uncertainty handling capability, i.e., probabilistic graphical methods. Jospin et al. [35] organized a handbook, spanning basic statistical concepts to principles, learning strategies, and specific algorithms, for researchers interested in Bayesian neural networks. They categorized Bayesian methods into variational inference (VI), Markov chain Monte Carlo (MCMC), and Bayes by Backprop. They also discussed approximation techniques in terms of stochastic gradient descent (SGD) dynamics and Monte Carlo dropout (MCD). Recently, Abdar et al. [36] gave an extensive review of uncertainty quantification methods in deep learning along with their applications. Specifically, they categorized the uncertainty quantification techniques into Bayesian approximation and ensemble learning and discussed the representative models of each category. In addition, they provided open challenges and future research directions associated with uncertainty quantification. Gawlikowski et al. [37] first identified five sources of uncertainty in deep learning models: variability in real-world situations, errors and noise in measurement tools, errors caused by unknown data samples, errors in the model structure, and errors in the training procedure. They then categorized uncertainty learning techniques into single deterministic methods, Bayesian methods, ensemble methods, and test-time augmentation methods, and reviewed re-calibration techniques in deep learning. Each of these surveys provides a comprehensive review of uncertainty learning from a particular point of view, but none of them includes a detailed review of epistemic uncertainty techniques in terms of the bias-variance decomposition. Contributions of the paper: Different from existing surveys, we emphasize the importance of uncertainty analysis in supervised learning models from the perspective of generalization error decomposition. Specifically, the focus is on tracing the epistemic uncertainty according to the decomposed terms, i.e., bias and variance. In this paper, we outline the two most representative families of techniques in supervised learning, i.e., Bayesian and ensemble methods, over the last five years and discuss the properties of each category according to the bias-variance decomposition. In this context, we have collected a total of 138 publications, in which:
• Approximately 70% of the selected articles were published in the last five years, i.e., after 2016, while most of the remaining 30% are highly cited classic articles.
• More than 80% of the selected articles are indexed in Q1 journals, top conferences, and highly cited books or theses.
Standing upon these high-quality articles, this work aims to guide researchers interested in tracing the limitations of models from the perspective of uncertainty decomposition, quantification, analysis, and applications.
To sum up, the main contributions of this survey include:
• a review of the epistemic (model) uncertainty learning techniques in supervised models over the last five years;
• a discussion on epistemic uncertainty learning from the perspective of the generalization error, i.e., the bias and variance decomposition;
• a hierarchical categorization of the epistemic uncertainty learning methods along with their representative models and real-world applications;
• an elucidation of the main research gaps and suggestions for future research directions.
Organization of the paper: As shown in Fig. 2, this review consists of four sections. Section 2 first provides a review of epistemic uncertainty learning methods, which are categorized into Bayesian approximation and ensemble learning. Then, several widely used Bayesian approximation techniques, such as variational inference (VI), Monte Carlo dropout (MCD), Markov chain Monte Carlo (MCMC), and the Laplace approximation (LA), are discussed in detail. Finally, ensemble learning is introduced in terms of its concept, its relationship to epistemic uncertainty, and related ensemble methods. Section 3 discusses the importance of quantifying uncertainty in supervised learning approaches for several real-world applications such as computer vision and natural language processing. Section 4 presents the research gaps and trends for future research as well as concluding remarks. This section gives a review of the epistemic uncertainty learning techniques in supervised learning. Specifically, we focus on the epistemic uncertainty quantification methods in terms of the decomposed terms of the generalization error, i.e., bias and variance. According to Fig. 3, the epistemic uncertainty learning methods are grouped into Bayesian and ensemble methods, as follows:
• Bayesian methods formulate epistemic uncertainty as a probability distribution over the model parameters. These methods mainly quantify the variance (as shown in Fig. 1) in order to reduce the generalization error of the learning model. In this context, most studies explore neural network-based models and Bayesian methods to estimate the epistemic uncertainty caused by the model parameters. Section 2.1 provides a comprehensive review of these techniques.
• Ensemble methods train multiple models to produce multiple predictions and then combine these predictions to reach the final output. These methods mainly quantify the variance of the outputs of the base models as the epistemic uncertainty, in order to control the complementarity among the base models and improve the ensemble performance. In this context, the reduction of the generalization error can be achieved by reducing the bias or the variance of the ensemble output, depending on the essence of the specific ensemble method. These methods are reviewed in Section 2.2.
In the following subsections, we provide a detailed review and discuss the main properties of each category. Assume a training data set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, ..., C\}$ indicate the i-th input and its corresponding class, respectively, and C denotes the number of classes. The aim is to learn a function $y = f_\theta(x)$ with parameters θ that produces the desired output. Bayesian modeling aims to capture the epistemic uncertainty by putting distributions over the network weights instead of using deterministic network weights, which is known as marginalisation.
For a given test sample $x^*$, the distribution over a prediction $y^*$ can be written as [38]:
$$p(y^*|x^*, D) = \int p(y^*|x^*, \theta)\, p(\theta|D)\, d\theta,$$
where $p(\theta|D)$, known as the posterior distribution on the model parameters, represents the uncertainty on the model parameters given a training data set D. Let $\theta_t$ indicate the parameters of the t-th sample drawn from the distribution $p(\theta|D)$; the epistemic uncertainty of the model can then be quantified via the variance term (as shown in Fig. 1):
$$\sigma^2_{epistemic} = \frac{1}{T}\sum_{t=1}^{T}\Big(f_{\theta_t}(x^*) - \frac{1}{T}\sum_{t'=1}^{T} f_{\theta_{t'}}(x^*)\Big)^2,$$
where T is the total number of samples drawn; the sampling procedures will be explained in the following subsections. Bayesian methods are easy to formulate but difficult to perform inference with, because they require estimating the posterior distribution $p(\theta|D)$; as a result, the marginal probability cannot be computed analytically [26]. In order to obtain the posterior distribution, the Bayes theorem [39] is applied to a given data set D over θ, as follows:
$$p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{p(D)},$$
where $p(D|\theta)$ represents the likelihood that the data samples in D are realizations of the distribution predicted by a model with parameters θ, $p(\theta)$ is the prior distribution on the model parameters, and $p(D)$ is the model evidence. Scholars have proposed many approximation techniques to estimate the posterior distribution. In this survey, we discuss several of them, including variational inference (VI), Monte Carlo dropout (MCD), Markov chain Monte Carlo (MCMC) and the Laplace approximation. A detailed review of each category is provided in the following subsections. VI [40] has been successfully applied as an approximation technique in many applications of neural networks. It uses a pre-specified distribution $q(\theta)$ to infer the posterior distribution $p(\theta|x, y)$. In other words, VI aims to make $q(\theta)$ close to the true posterior by optimizing a set of parameters. To achieve this, the Kullback-Leibler (KL) divergence [41] is defined as:
$$\mathrm{KL}\big(q(\theta)\,\|\,p(\theta|x, y)\big) = \int q(\theta)\, \log \frac{q(\theta)}{p(\theta|x, y)}\, d\theta.$$
The KL divergence cannot be minimized directly, because it involves the intractable posterior $p(\theta|x, y)$. Instead, the evidence lower bound (ELBO) can be optimized. For a given prior distribution over the model parameters, the ELBO can be written as:
$$\mathcal{L}(q) = \int q(\theta)\, \log p(y|x, \theta)\, d\theta - \mathrm{KL}\big(q(\theta)\,\|\,p(\theta)\big),$$
and for the KL divergence
$$\log p(y|x) = \mathcal{L}(q) + \mathrm{KL}\big(q(\theta)\,\|\,p(\theta|x, y)\big)$$
holds, so maximizing the ELBO is equivalent to minimizing the KL divergence. Graves et al. [42] introduced a stochastic variational method to reduce the difficulty of inferring analytical solutions in the original VI; they used numerical integration to approximate the expected values. Bayes by Backprop (BBB) [43] extends the stochastic variational method [42] to non-Gaussian priors. Specifically, BBB uses an unbiased estimate of the gradients to learn a distribution over the network's weights. Kingma et al. [44, 45] reduced the variance of the stochastic gradients by introducing a reparameterization strategy, which can approximate posterior inference in models with continuous latent variables. Rezende and Mohamed [46] used normalizing flows to construct distributions for approximation; this technique applies a sequence of invertible transformations to transfer a simple density into a more complex one. Zeng et al. [47] estimated the epistemic uncertainty in an active learning framework using Bayesian convolutional neural networks (CNNs) with Gaussian approximate VI. They showed that using a few Bayesian layers close to the output layer of a CNN can estimate a similar level of uncertainty as the original Bayesian CNN. Zhang et al. [48] showed that natural gradient descent [49] with adaptive weight noise can be fitted as a variational posterior to maximize the ELBO.
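As a concrete illustration of the VI recipe above, the following sketch (a minimal Bayes-by-Backprop-style example written for this review, assuming a standard normal prior and a factorized Gaussian posterior on a single linear layer, not any cited implementation) optimizes the negative ELBO and reads out the epistemic uncertainty as the variance over posterior samples:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Mean-field Gaussian linear layer trained with the reparameterization trick."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.rho = nn.Parameter(torch.full((n_out, n_in), -3.0))  # sigma = softplus(rho)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)       # w ~ q(theta)
        # Closed-form KL(q(theta) || N(0, 1)) for a factorized Gaussian posterior
        self.kl = (-torch.log(sigma) + (sigma**2 + self.mu**2) / 2 - 0.5).sum()
        return x @ w.t()

# Toy regression data: y = 2x + noise
x = torch.linspace(-1, 1, 128)[:, None]
y = 2 * x + 0.1 * torch.randn_like(x)

layer = BayesLinear(1, 1)
opt = torch.optim.Adam(layer.parameters(), lr=0.05)
for _ in range(500):
    opt.zero_grad()
    nll = F.mse_loss(layer(x), y, reduction="sum")  # Gaussian likelihood up to scale
    loss = nll + layer.kl                           # negative ELBO
    loss.backward()
    opt.step()

# Epistemic uncertainty: variance of predictions over T posterior samples
with torch.no_grad():
    preds = torch.stack([layer(x) for _ in range(100)])
print(preds.mean(dim=0)[:3], preds.var(dim=0).mean())
```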
Later, Osawa et al. [50] trained deep networks using a natural-gradient VI method, namely variational online Gauss-Newton (VOGN) [51], and obtained results similar to those of the Adam optimizer by using strategies such as batch normalization and data augmentation. The stochastic low-rank approximate natural-gradient (SLANG) method [52] is a variant of VI that uses a diagonal-plus-low-rank structure to compute the Gaussian approximations. In addition, SLANG uses the back-propagated gradients of the network log-likelihood to build a covariance, which enables faster estimation than mean-field methods. Heo et al. [53] proposed uncertainty-aware attention, which uses VI with dropout sampling, to compute the epistemic uncertainty together with the heteroscedastic uncertainty in predicting time-series data. Monte Carlo (MC) sampling [54] is another approximation technique that has been widely used for estimating the posterior distribution, but it is computationally expensive and slow. Gal devised MC dropout (MCD) [55, 38] to alleviate these issues. MCD integrates dropout [56], which is an effective technique for tackling overfitting in deep models, as a regularization term to estimate the prediction uncertainty. During learning, dropout randomly (with a certain probability p) drops some model units to avoid excessive co-adaptation. MCD uses the mean of N models $f_{\theta_1}, ..., f_{\theta_N}$, parametrized by $\theta_1, ..., \theta_N$, to approximate the outcome based on the posterior estimation of the weights, as follows [55]:
$$\mathbb{E}[y^*] \approx \frac{1}{N}\sum_{n=1}^{N} f_{\theta_n}(x^*).$$
Later, Gal et al. [57] introduced a new variant of dropout, called concrete dropout, which uses gradient methods instead of grid search to tune the dropout probability; this leads to calibrated uncertainty estimates in large models. Mukhoti and Gal [58] integrated MCD and concrete dropout as inference techniques into the DeepLab-v3+ [59] architecture for semantic segmentation. In addition, they introduced a new metric, namely mutual information, to estimate epistemic uncertainty by computing the mutual information between the predictive distribution and the posterior over the model weights. The single-shot MCD [60] analytically approximates the expected value and variance of MCD for each layer of a fully connected network; this model requires less computational time than MCD. Study [61] conducted an empirical investigation of how epistemic uncertainty estimated with MCD is affected when the observation conditions change. Since the proposal of MCD, many scholars have applied it to estimate epistemic uncertainty. For example, Abdar et al. [62, 63] applied MCD to tackle uncertainty in skin cancer image classification. Studies [64] and [6] integrated MCD into CNNs to estimate the epistemic uncertainty for segmentation and lesion detection in medical images. Loquercio et al. [65] computed the epistemic uncertainty in robotics by combining Bayesian belief networks with MCD. Bertoni et al. [66] estimated the epistemic uncertainty for monocular 3D pedestrian localization using MCD. In [67], MCD is used to estimate the epistemic uncertainty in an encoder-decoder framework with long short-term memory (LSTM) for time-series forecasting and anomaly detection on Uber data. Xiao and Wang [11] utilized MCD to estimate the epistemic uncertainty in natural language processing tasks.
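The core MCD recipe is compact enough to state in a few lines. The sketch below (an illustrative example written for this review, using a hypothetical two-layer network) keeps dropout active at test time, runs T stochastic forward passes, and treats the sample mean and variance of the outputs as the prediction and its epistemic uncertainty:

```python
import torch
import torch.nn as nn

# A hypothetical regression network with a dropout layer.
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Dropout(p=0.2),            # kept stochastic at test time for MC dropout
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, T=100):
    model.train()                 # train mode keeps dropout active during inference
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(T)])
    return samples.mean(dim=0), samples.var(dim=0)  # prediction, epistemic variance

x_test = torch.linspace(-3, 3, 50)[:, None]
mean, var = mc_dropout_predict(model, x_test)
print(mean.shape, var.mean())     # higher variance signals higher epistemic uncertainty
```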
MCMC is another popular family of techniques for approximating inference and representing epistemic uncertainty. It first samples from an arbitrary distribution and then performs stochastic transitions governed by the current state and the desired distribution, e.g., the true posterior. In other words, MCMC generates samples in an iterative, Markov chain fashion. A Markov chain is a distribution over random variables that undergoes transitions from one state to another in the state space. At each iteration, the model selects samples based on pre-specified rules, and this process is iterated T times; finally, the desired distribution is approximated using the generated samples. The aim is to sample, for a set of independent observations $x \in D$, from the posterior distribution over θ [68]:
$$p(\theta|D) \propto \exp\big(-U(\theta)\big),$$
where U is the potential energy function defined by:
$$U(\theta) = -\sum_{x \in D} \log p(x|\theta) - \log p(\theta).$$
Hamiltonian (hybrid) MC (HMC) [54, 69] was the first method to apply MCMC sampling to Bayesian neural networks. It explores the state space based on the Metropolis-Hastings framework instead of a random-walk strategy to sample from θ. To this end, it introduces a set of auxiliary momentum variables, denoted by r, from a Hamiltonian system. In order to sample from $p(\theta|D)$, HMC generates samples from a joint distribution of $(\theta, r)$ as follows:
$$p(\theta, r) \propto \exp\Big(-U(\theta) - \frac{1}{2}\, r^\top M^{-1} r\Big),$$
where M, a mass matrix often set to the identity matrix, together with r defines a kinetic energy term. The Hamiltonian function is given by:
$$H(\theta, r) = U(\theta) + \frac{1}{2}\, r^\top M^{-1} r.$$
HMC simulates the Hamiltonian dynamics to generate samples, as follows:
$$d\theta = M^{-1} r\, dt, \qquad dr = -\nabla U(\theta)\, dt.$$
Despite the success of HMC, it requires processing all data samples at each iteration, which is computationally expensive, especially for large data sets. To alleviate this issue, many algorithms have adopted a mini-batch strategy. In this regard, Welling and Teh [70] proposed the stochastic gradient Langevin dynamics (SGLD) method, which combines SGD with first-order Langevin dynamics. Later, Chen et al. [68] proved that using second-order Langevin dynamics can explore the space of solutions and provide good generalization; in addition, they added friction to the momentum update of stochastic gradient HMC and evaluated the impact of the noisy gradient. Teye et al. [71] showed that training deep models with batch normalization is equivalent to estimating the inference in Bayesian networks. Chandra et al. [72] proposed Bayesian graph deep learning techniques that use MCMC sampling with Langevin gradients. Mandt et al. [73] used SGD with a constant learning rate (constant SGD) to simulate a Markov chain with a stationary distribution and showed that constant SGD can approximate posterior inference. Cyclical stochastic gradient MCMC (SG-MCMC) [74] uses a cyclical stepsize schedule to better approximate the posterior distribution. However, a mini-batch strategy, which employs a small set of samples at each iteration, adds noise to the network and increases its uncertainty. To alleviate this, Luo et al. [75] used Nosé-Hoover thermostats [76] to deal with the generated noise; the resulting method is called thermostat-assisted continuously tempered HMC. Maddox et al. [77] introduced stochastic weight averaging Gaussian (SWA-Gaussian) to represent uncertainty and calibrate deep models. Specifically, SWA-Gaussian uses SWA [78] to compute the mean of the SGD iterates with a high constant learning rate in order to improve generalization in deep models. In addition, the Gaussian posterior approximation over the model weights is formed by using the SWA mean and computing a low-rank plus diagonal approximation to the covariance of the iterates.
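In their simplest form, the mini-batch Langevin methods above reduce to an SGD step plus injected Gaussian noise. The sketch below (an illustration written for this review, sampling a standard normal target whose potential energy is $U(\theta) = \theta^2/2$) shows the basic SGLD update; on real data, `grad_U` would be a mini-batch estimate of the gradient of the potential energy:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(theta):
    # Gradient of U(theta) = theta**2 / 2, i.e., a standard normal target.
    # With a large data set, this would be estimated from a mini-batch.
    return theta

theta, eps, samples = 0.0, 0.01, []
for t in range(20000):
    # Langevin update: half a gradient step plus noise scaled to the step size
    theta = theta - 0.5 * eps * grad_U(theta) + rng.normal(0.0, np.sqrt(eps))
    if t > 2000:                 # discard burn-in iterations
        samples.append(theta)

# Sample mean and variance should approach 0 and 1 for the standard normal target.
print(np.mean(samples), np.var(samples))
```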
Garg and Awate [5] proposed a perfect/exact MCMC scheme for generic Markov random fields to compute the uncertainty in multi-label segmentation. Specifically, they combined two schemes, namely coupling from the past [79] and bounding chains [80], to produce perfectly sampled label images. Hernández et al. [81] combined dropout and HMC to improve the predictive uncertainty in classification problems. Akkoyun et al. [82] applied MCMC within a Bayesian framework to predict maximum aneurysm diameter. In addition, Cai et al. [83] developed proximal MCMC techniques to estimate uncertainty in radio interferometric imaging. As another powerful approximation method, the Laplace approximation tackles the problem of representing a complex posterior over the parameters of a neural network by assuming it to be a Gaussian distribution [84]. Different from variational approximation methods, the Laplace approximation is a local approximation technique that focuses on the behavior around the mode of the posterior distribution. As described in [85], the expectation µ of the Gaussian distribution $q(\theta)$ is the extreme point $\theta^*$ of the posterior distribution $p(\theta|D)$. Thus, µ is determined by the first derivative of $p(D, \theta)$, which satisfies $p(\theta|D) \propto p(D, \theta)$, while the covariance matrix Σ is obtained from the second-order Taylor expansion of $\ln p(D, \theta)$ centered on $\theta^*$:
$$\ln p(D, \theta) \approx \ln p(D, \theta^*) - \frac{1}{2}(\theta - \theta^*)^\top H\, (\theta - \theta^*),$$
where H is the Hessian matrix defined as:
$$H = -\nabla^2_\theta \ln p(D, \theta)\big|_{\theta = \theta^*}.$$
Then, the posterior $p(\theta|D)$ is approximated as a Gaussian $q(\theta)$ with covariance matrix $\Sigma = H^{-1}$:
$$q(\theta) = \mathcal{N}(\theta^*, H^{-1}).$$
Unfortunately, it is infeasible to compute the Hessian matrix for deep neural networks with a significant number of parameters. By comparison, constructing a diagonal curvature approximation for a neural network is more tractable and efficient. Kirkpatrick et al. [86] used diagonal Laplace approximations to enhance the capability of deep neural networks for sequential learning tasks by preserving the weights important for previous tasks. Subsequently, Ritter et al. [87] first pointed out the limitation of the diagonal approximation when some weights exhibit high covariance, then demonstrated the effectiveness of Kronecker factorization for acquiring the covariance in the Laplace approximation and successfully applied it to online learning scenarios [88]. More recently, Lee et al. [89] developed a sparsification technique using a low-rank approximation to demonstrate the effectiveness of scaling the Laplace approximation to large data sets (e.g., ImageNet) and architectures. Schillings et al. [90] discussed the convergence of the Laplace approximation in the Hellinger distance. Margossian et al. [91] derived an adjoint method to accelerate Monte Carlo computation with an embedded Laplace approximation in order to marginalize out weights. Daxberger et al. [92] obtained posteriors by performing inference over a small subset of model weights and outlined the procedure for scaling the linearized Laplace approximation to large neural network models within the framework of subnetwork inference. A newer idea, L2M [93], estimates uncertainty by extending the Laplace approximation with the raw second-moment gradient estimates maintained by adaptive optimizers.
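The three-step recipe (find the mode, take the Hessian of the negative log joint at the mode, and read off a Gaussian) can be illustrated on a one-dimensional toy posterior. The sketch below (written for this review, assuming a Gaussian likelihood with a Gaussian prior so that the approximation is in fact exact) uses a grid search for the mode and finite differences for the Hessian:

```python
import numpy as np

# Toy log joint ln p(D, theta): Gaussian likelihood with a zero-mean Gaussian prior.
def log_joint(theta, data, prior_var=4.0, noise_var=1.0):
    return (-0.5 * np.sum((data - theta) ** 2) / noise_var
            - 0.5 * theta ** 2 / prior_var)

data = np.array([1.2, 0.8, 1.5, 1.1])

# Step 1: find the mode theta* (a coarse grid search keeps the example simple).
grid = np.linspace(-5.0, 5.0, 10001)
vals = np.array([log_joint(t, data) for t in grid])
theta_star = grid[vals.argmax()]

# Step 2: Hessian of -ln p(D, theta) at the mode, via central finite differences.
h = 1e-4
H = -(log_joint(theta_star + h, data) - 2 * log_joint(theta_star, data)
      + log_joint(theta_star - h, data)) / h**2

# Step 3: Gaussian approximation q(theta) = N(theta*, H^{-1}).
print(f"Laplace posterior: mean={theta_star:.3f}, variance={1.0 / H:.3f}")
```

For a deep network, Step 2 is exactly where the reviewed methods differ: diagonal [86], Kronecker-factored [87], low-rank [89], and subnetwork [92] approximations are all ways of making H tractable.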
Ensemble learning generally aims at training multiple models whose outputs are combined to make a final prediction. In a regression context, simple averaging is a commonly used way of combining multiple models $f_i: X \to Y$, $i = 1, \ldots, M$, into an ensemble $f_{ens}: X \to Y$, as illustrated below:
$$f_{ens}(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x).$$
In a classification context, majority voting is a commonly used rule for fusing multiple classifiers to output a final class $c = 1, 2, \ldots, k$, as illustrated below:
$$f_{ens}(x) = \arg\max_{c \in \{1, \ldots, k\}} \sum_{i=1}^{M} I\big(f_i(x) = c\big).$$
In general, it often happens that the outputs of the multiple models in an ensemble differ, and the variance of the outputs is considered an indicator of the epistemic uncertainty in a prediction [20, 94, 95]. On the other hand, the epistemic uncertainty is also referred to as ensemble ambiguity (diversity), which is viewed as a key factor of successful ensemble learning [96, 97]. In this section, we introduce the general concept of ensemble diversity and analyze the importance of diversity in terms of improving the ensemble performance. Moreover, we provide a review of existing methods of diversity quantification and creation. Ensemble diversity is generally related to the generalization error of an ensemble. In a regression context, the generalization error can be decomposed through two well-known schemes, namely, ambiguity decomposition [98] and bias-variance decomposition [99]. In terms of the ambiguity decomposition, given a weighted averaging ensemble $f_{ens} = \sum_{i=1}^{M} w_i f_i$ (illustrated in Eq. (18)), the ensemble error $(f_{ens} - y)^2$ can be decomposed into two terms, i.e., the weighted average error of the base models minus the ambiguity (diversity) term:
$$(f_{ens} - y)^2 = \sum_{i=1}^{M} w_i (f_i - y)^2 - \sum_{i=1}^{M} w_i (f_i - f_{ens})^2, \qquad (19)$$
where $w_i$ is the weight of each base model $f_i$ with the constraints $0 \le w_i \le 1$ and $\sum_{i=1}^{M} w_i = 1$, i.e., $f_{ens}$ is essentially a convex combination of the M base models. According to Eq. (19), it is straightforward to derive that the ensemble error is guaranteed to be less than or equal to the (weighted) average error of the base models, i.e., the higher the diversity created among the base models, the larger the error reduction that can be achieved. However, increasing the diversity may also increase the average error of the base models [100], so it is necessary to strike a reasonable trade-off between the diversity and the average error. As discussed in [100], the ambiguity decomposition does not take into account possible changes of the training data distribution or the initialized weight distribution. However, it is essential to measure effectively the expected error on unseen data given a specific distribution of training data or initialized weights. From this point of view, the bias-variance decomposition scheme is considered a useful tool for analyzing the generalization error of an ensemble, given that this scheme exactly takes the above-mentioned changes of distributions into account. The general formulation of the bias-variance decomposition is shown in Fig. 1 in Section 1. Based on this formulation, three concepts have been defined in [100], namely, the averaged bias, the averaged variance and the averaged co-variance of the M base models, as illustrated below:
$$\overline{bias} = \frac{1}{M}\sum_{i=1}^{M}\big(\mathbb{E}[f_i] - y\big), \qquad (20)$$
$$\overline{var} = \frac{1}{M}\sum_{i=1}^{M}\mathbb{E}\big[(f_i - \mathbb{E}[f_i])^2\big], \qquad (21)$$
$$\overline{covar} = \frac{1}{M(M-1)}\sum_{i=1}^{M}\sum_{j \ne i}\mathbb{E}\big[(f_i - \mathbb{E}[f_i])(f_j - \mathbb{E}[f_j])\big]. \qquad (22)$$
According to Eqs. (20)-(22), we can obtain the bias-variance-co-variance decomposition of the mean square error of an ensemble $f_{ens}$, as shown below:
$$\mathbb{E}\big[(f_{ens} - y)^2\big] = \overline{bias}^2 + \frac{1}{M}\,\overline{var} + \Big(1 - \frac{1}{M}\Big)\overline{covar}. \qquad (23)$$
Eq. (23) indicates that the mean square error of an ensemble $f_{ens}$ generally depends on the correlation between the base models, where the correlation is quantified through the third term (the averaged co-variance of the base models). Therefore, it is desirable to decrease the co-variance without affecting the bias and variance [100].
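The variance-of-outputs view above translates directly into code. The sketch below (an illustration written for this review, using a bagged tree ensemble on a made-up regression task) queries each base model separately and reports the spread of their outputs, which grows outside the training range, where knowledge is lacking:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (300, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=300)

ens = BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                       n_estimators=50, random_state=0).fit(x, y)

# Query each base model; the disagreement among them is the epistemic signal.
x_test = np.linspace(-2, 2, 9)[:, None]          # extends beyond the training range
preds = np.stack([m.predict(x_test) for m in ens.estimators_])

f_ens = preds.mean(axis=0)                       # simple-averaging ensemble output
epistemic = preds.var(axis=0)                    # variance of the base model outputs
for xt, m, v in zip(x_test[:, 0], f_ens, epistemic):
    print(f"x={xt:+.2f}  prediction={m:+.3f}  variance={v:.3f}")
```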
Moreover, the connection between the ambiguity decomposition and the bias-variance-co-variance decomposition was disclosed in [100]. In particular, according to Eq. (19), the ensemble error $(f_{ens} - y)^2$ can be decomposed into the average error of the base models minus the ambiguity term. Assuming that the base models are equally weighted for simplicity, the right-hand side of Eq. (19) can be substituted into the left-hand side of Eq. (23) to obtain a new formulation, shown as Eq. (24). Based on Eq. (24), the formulations in Eqs. (25)-(26) can be obtained after some derivations [100]. The above two error decomposition schemes were designed for regression problems and cannot be directly applied to classification tasks. In terms of the ambiguity decomposition, Eq. (19), derived in a regression setting, can be transformed into Eq. (27) [102] to suit a classification task, assuming that the M base classifiers are fused by combining the class probability values estimated by these classifiers through the product rule [101]. In particular, the KL divergence $D_{KL}(y\,\|\,f_{ens})$ of the ensemble $f_{ens}$ from the target distribution y of class probabilities is defined as the ensemble error, which can be decomposed into two terms, namely, the average KL divergence of the base classifiers from the target distribution and an ambiguity term reflecting the divergence of the base classifiers from the ensemble. According to Eq. (27), the KL divergence $D_{KL}(y\,\|\,f_{ens})$ of the ensemble $f_{ens}$ from the target distribution y is guaranteed to be less than or equal to the average KL divergence of the base classifiers from the target distribution. However, the combination of multiple classifiers can be achieved in various ways, e.g., majority voting and averaging of class probability distributions, which indicates that Eq. (27) does not provide a general formulation of ensemble diversity for classification tasks. In terms of the bias-variance decomposition, as discussed in [101], Eq. (23) was derived only for regression problems and similar results cannot be obtained for classification tasks. Therefore, Eq. (23) cannot be used as a general formulation of ensemble diversity either. Methods of diversity quantification in a classification context will be introduced in Section 2.2.2. Moreover, there is not yet a formally accepted definition of the diversity term [101, 103, 96, 97], so existing methods of diversity creation were designed heuristically using different definitions; these methods will be reviewed in Section 2.2.3. In a classification context, if the ensemble prediction is made by averaging the class probabilities estimated by M classifiers, then the ensemble diversity can be measured based on Tumer and Ghosh's framework [104, 105]. In particular, suppose that each class c has a true posterior probability $P_d(c|x)$ and another posterior probability $P_{e_i}(c|x)$ estimated by a classifier $f_i$, given a one-dimensional feature vector x. In this context, the classification error can be decomposed into the Bayes error and the added error, where the Bayes error is irreducible and the added error $\eta_{e_i}(c|x)$ results from the incorrect estimation of the class posterior probability, as illustrated below:
$$P_{e_i}(c|x) = P_d(c|x) + \eta_{e_i}(c|x). \qquad (28)$$
If it is assumed that the errors of posterior probability estimation on two classes a and b are independent and identically distributed random variables [105] with zero mean and variance $\sigma^2_{\eta_i}$, then the expected added error of classifier $f_i$ in distinguishing the two classes can be defined as shown below:
$$E^i_{add} = \frac{\sigma^2_{\eta_i}}{P_d'(b|x) - P_d'(a|x)}, \qquad (29)$$
where $P_d'(a|x)$ and $P_d'(b|x)$ are the derivatives of the true posterior probabilities of classes a and b. In the case of combining the posterior probabilities estimated by M classifiers, the expected added error can be measured as illustrated below [105]:
$$E^{ens}_{add} = E_{add}\left(\frac{1 + \delta(M - 1)}{M}\right). \qquad (30)$$
In Eq. (30), δ is a correlation coefficient measuring the correlation among the estimation errors made by the M base classifiers for each class, and it is thus a way of quantifying the ensemble diversity. If the estimation errors of the M classifiers are independent, i.e., δ = 0, then the expected added error of the ensemble would be $\frac{1}{M}$ of the added error of each of the M base classifiers (which are assumed to have the same estimation error). However, if the estimation errors of the M base classifiers are perfectly correlated, i.e., δ = 1, then the expected added error of the ensemble would be the same as the added error of each base classifier. Moreover, if the estimation errors of the M base classifiers are negatively correlated, i.e., δ < 0, then the expected added error of the ensemble can be reduced even further compared with the error reduction achieved in the case of δ = 0 [96]. In addition to the average rule of fusion, i.e., averaging the class probabilities estimated by multiple classifiers, majority voting is also a popular rule for combining classifiers. Since the outputs of the M classifiers in a majority-vote ensemble are not numeric, the correlation coefficient δ used in Eq. (30) cannot be applied directly. Instead, some researchers have tried to define the classification error diversity qualitatively. For example, a scheme has been suggested in [106] to classify error diversity into four levels, as follows:
• Level 1: At most one of the base classifiers in the ensemble makes an incorrect classification for each instance.
• Level 2: The majority of the base classifiers in the ensemble make a correct classification for each instance.
• Level 3: At least one of the base classifiers in the ensemble makes a correct classification for each instance.
• Level 4: All of the base classifiers in the ensemble make an incorrect classification for some instances.
Furthermore, a decomposition of the majority vote error $E_{maj}$ into the average error $E_{avg}$ of the base classifiers, good diversity and bad diversity was introduced in [102], where good diversity has a positive impact on the error reduction and bad diversity has a negative impact. In this context, Level 1 and Level 2 diversity would be classified as good, whereas Level 3 and Level 4 diversity would be classified as bad. In the setting of binary classification, i.e., $y \in \{+1, -1\}$, the majority vote error decomposition is shown below:
$$E_{maj} = E_{avg} - \mathbb{E}_x\Big[y(x)\bar{f}(x)\,\frac{1}{M}\sum_{i=1}^{M} Dis_i(x)\Big], \qquad (31)$$
where the disagreement $Dis_i$ between a base classifier $f_i$ and the ensemble $\bar{f}$ is measured using:
$$Dis_i(x) = I\big(f_i(x) \ne \bar{f}(x)\big). \qquad (32)$$
In Eq. (31), the sign of $y(x)\bar{f}(x)$ essentially reflects whether the ensemble classification is correct or not, i.e., $y(x)\bar{f}(x) = +1$ represents a correct classification and $y(x)\bar{f}(x) = -1$ an incorrect one. Therefore, Eq. (31) can be rewritten as below:
$$E_{maj} = E_{avg} - \frac{1}{M}\sum_{i=1}^{M}\mathbb{E}_{x:\, y\bar{f}=+1}\big[Dis_i(x)\big] + \frac{1}{M}\sum_{i=1}^{M}\mathbb{E}_{x:\, y\bar{f}=-1}\big[Dis_i(x)\big], \qquad (33)$$
where the second term (the disagreement on correctly classified samples) denotes the good diversity and the third term (the disagreement on misclassified samples) denotes the bad diversity. Representative methods of diversity quantification have been analysed in [96, 101], where they are put into two main categories, namely, pairwise measures and non-pairwise measures. In particular, pairwise measures are generally designed by involving calculations based on a contingency table (shown in Table 2) for a pair of classifiers $f_i$ and $f_j$, whereas non-pairwise measures generally involve counting the number $l(x)$ of classifiers that correctly classify sample x and calculating the relevant probability $P(l(x))$.
The popularly used pairwise measures include the Q-statistic Q, the correlation coefficient ρ, the disagreement measure dis and the double-fault measure DF. Representative non-pairwise measures of diversity include the entropy ENT, the Kohavi-Wolpert variance KW, the measurement of inter-rater agreement IRA, the difficulty measure DM, the generalized diversity GD and the coincident failure diversity CFD. More details of these diversity measures are summarised in Table 3. In terms of the Q-statistic, the value of $Q_{i,j}$ ranges over $[-1, 1]$ and is expected to be 0 for two statistically independent classifiers $f_i$ and $f_j$. When the two classifiers tend to classify the same instances correctly, the value of $Q_{i,j}$ is positive [96]; in contrast, if the two classifiers make incorrect classifications on different instances, the value of $Q_{i,j}$ is rendered negative [96]. Therefore, it can be concluded that the lower the value of $Q_{i,j}$ for a pair of classifiers $f_i$ and $f_j$, the higher the diversity created between $f_i$ and $f_j$. The value of the correlation coefficient $\rho_{i,j}$ also ranges over $[-1, 1]$. According to the formulations of $Q_{i,j}$ and $\rho_{i,j}$, it is straightforward to verify that $Q_{i,j}$ and $\rho_{i,j}$ always have the same sign and that $|\rho_{i,j}| \le |Q_{i,j}|$ [96, 101]. According to the formulations of the disagreement measure $dis_{i,j}$ and the double-fault measure $DF_{i,j}$, both range over $[0, 1]$. However, $dis_{i,j}$ and $DF_{i,j}$ measure diversity at different levels (according to the scheme suggested in [106] for classifying diversity into four levels): $dis_{i,j}$ is the percentage of samples that are classified incorrectly by only one of the two base classifiers $f_i$ and $f_j$, whereas $DF_{i,j}$ is the proportion of samples that are misclassified by both classifiers. When the diversity among multiple classifiers needs to be measured, the above-introduced pairwise measures can be used by averaging the values obtained for all pairs of classifiers. For example, the Q-statistic can be used for diversity measurement by averaging the values of Q obtained for all classifier pairs, as shown below:
$$Q_{avg} = \frac{2}{M(M-1)}\sum_{i=1}^{M-1}\sum_{j=i+1}^{M} Q_{i,j}. \qquad (34)$$
All the non-pairwise measures range over $[0, 1]$. In the formulations of ENT, KW and IRA, $|DS|$ represents the number of samples in a data set DS and $l(x)$ denotes the number of classifiers that correctly classify sample x. In addition, $\bar{p}$, used in the formulation of IRA, denotes the average accuracy of the base classifiers. In terms of the entropy ENT, the minimum value is obtained when all the M classifiers provide the same outputs, and the maximum value is obtained when $\lfloor M/2 \rfloor$ classifiers consistently classify sample x as one class and the other $M - \lfloor M/2 \rfloor$ classifiers consistently output another class for x. As emphasized in [96], the Kohavi-Wolpert variance KW is related to the averaged disagreement measure $dis_{avg}$ by $KW = \frac{M-1}{2M}\, dis_{avg}$. Moreover, the inter-rater agreement IRA is related to both KW and $dis_{avg}$ [96], i.e., $IRA = 1 - \frac{dis_{avg}}{2\bar{p}(1-\bar{p})}$. In terms of the difficulty measure DM, it is generally expected that each instance is difficult for some classifiers but easy for the other classifiers, in order to encourage ensemble diversity [96, 101]. The minimum of DM is obtained when each instance is easy for the majority of the base classifiers, and the maximum of DM is obtained when each instance is either easy or difficult for all the base classifiers.
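Several of these pairwise measures follow directly from the 2×2 contingency table of two classifiers' correctness. The sketch below (an illustration written for this review, with randomly generated correctness vectors standing in for real classifier outputs) computes the Q-statistic, the correlation coefficient, the disagreement and the double-fault measures:

```python
import numpy as np

def pairwise_diversity(correct_i, correct_j):
    """Pairwise diversity measures from the 2x2 contingency table of two classifiers.

    correct_i, correct_j: boolean arrays, True where the classifier is correct."""
    a = np.sum(correct_i & correct_j)       # both correct
    b = np.sum(correct_i & ~correct_j)      # only f_i correct
    c = np.sum(~correct_i & correct_j)      # only f_j correct
    d = np.sum(~correct_i & ~correct_j)     # both wrong
    n = a + b + c + d
    Q = (a * d - b * c) / (a * d + b * c)                    # Q-statistic
    rho = (a * d - b * c) / np.sqrt(
        (a + b) * (c + d) * (a + c) * (b + d))               # correlation coefficient
    dis = (b + c) / n                                        # disagreement measure
    DF = d / n                                               # double-fault measure
    return Q, rho, dis, DF

rng = np.random.default_rng(0)
ci = rng.random(1000) < 0.8                 # classifier f_i: ~80% accuracy
cj = rng.random(1000) < 0.8                 # classifier f_j: independent errors
Q, rho, dis, DF = pairwise_diversity(ci, cj)
print(f"Q={Q:.3f}  rho={rho:.3f}  dis={dis:.3f}  DF={DF:.3f}")
```

For independent classifiers, the Q-statistic comes out close to zero, and one can verify numerically that $|\rho_{i,j}| \le |Q_{i,j}|$ always holds.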
The generalized diversity GD is designed based on the argument that, to maximize diversity, the incorrect output of one classifier should always be accompanied by the correct output of another classifier [107]. Furthermore, the coincident failure diversity CFD is defined as a modification of GD [107], which expects that each instance can be classified correctly by at least some of the base classifiers. When each instance is classified incorrectly by at most one base classifier, the maximum of CFD is reached, i.e., the highest level of diversity according to the scheme suggested in [106] for identifying the level of diversity. More recently, Yin et al. presented a formulation of diversity learning in [108] (Eq. (35)), in which the empirical loss is minimized jointly with a diversity regularization term, where β is used as a control parameter for the diversity regularization and w represents the model parameters. A formulation of sparsity learning was also presented in [108] for the purpose of ensemble pruning. Based on the work presented in [108], more studies have since been conducted using diversity as a regularization term [97, 109, 110, 111, 112, 113, 114], such that the ensemble accuracy and diversity can be optimized simultaneously. In particular, Cavalcanti et al. [97] proposed to combine different pairwise measures of diversity for ensemble pruning, where a genetic algorithm is used to optimize the combined diversity to obtain several candidate ensembles, which are then evaluated on validation data to select the final ensemble. Ahmed et al. [109] empirically investigated whether combining the ensemble accuracy with several popular diversity measures yields a better evaluation function than using the accuracy alone in the setting of ensemble pruning. Dai et al. pointed out in [110] that accuracy and diversity are closely related to each other and should be considered simultaneously for ensemble pruning; accordingly, they proposed three new measures for ensemble pruning, namely, Simultaneous Diversity & Accuracy, Diversity-Focused-Two and Accuracy-Reinforcement. Dvornik et al. [111] proposed to encourage ensemble diversity by making the classifiers consistently output the highest probability for the ground-truth class label while ranking the other classes inconsistently, i.e., letting different classes obtain the second-highest probability (or other lower-ranked probabilities) across classifiers. Zhang et al. [112] proposed to construct a diversified ensemble layer for combining multiple neural networks as individual modules, where the cross-entropy loss of each individual module and the diversity among different modules are optimized simultaneously. Bian et al. [113] formulated the relationship between the diversity and the ensemble performance in the context of margin-based generalization theory and proposed two diversity-driven pruning methods to exploit the formulated relationship, leading to enhanced diversity and a reduction of the ensemble size without much loss of performance. Wu et al. [114] revised the representative diversity measures and introduced focal-model-based measures of diversity to further improve the correlation between the diversity and the prediction accuracy. Overall, all of the above-reviewed works indicate that effective selection and combination of diversity measures is essential for simultaneously optimizing the ensemble accuracy and diversity. In general, ensemble diversity can be created using various types of methods.
In this section, we provide a detailed review of diversity creation methods that fall into the category of data input manipulation. We also briefly introduce methods in the following categories: data output manipulation, manipulation of model architectures, differentiation of starting points in the hypothesis space, and diversification of learning strategies. In the setting of data input manipulation, popularly used methods include Bagging [115], Random Subspace [116] and Boosting [117]. Bagging involves training M independent classifiers on M different sample sets drawn by random sampling with replacement from the original training set over M iterations. In this setting, the diversity is created heuristically by diversifying the training samples, which contributes to variance reduction in the context of the bias-variance trade-off [118]. Moreover, two variants of Bagging, namely, Dagging [119, 120] and Wagging [118], were developed by setting different ways of diversifying the training samples. Random Subspace can be viewed as a way of creating diversity through feature sampling: it involves training M independent classifiers on M different feature subsets produced by randomly sampling features from the full feature set without replacement. Therefore, the Random Subspace method aims at creating diversity heuristically by diversifying the features. A variant of Random Subspace, referred to as 'Attribute Bagging' [121], was designed to require a suitable subspace size to be set as a hyper-parameter for drawing the M feature subsets. However, in the setting of Random Subspace, all the base classifiers are learned from the entire training set, without diversity creation through manipulation of the training samples. In order to better enhance the diversity among base classifiers through the combination of Bagging and Random Subspace [122], the Random Forest method [123] was developed and has been used as a powerful decision tree ensemble approach [124]. In contrast to Bagging and Random Subspace, which create diversity heuristically and lead to independent classifiers, the Boosting approach aims at creating diversity explicitly by training each classifier to correct the errors made by the classifier trained at the previous iteration, i.e., training negatively correlated base classifiers. In particular, two popular Boosting methods are 'Adaptive Boosting' (AdaBoost) [125] and 'Gradient Boosting' [126]. The former creates diversity by re-weighting the samples at each iteration i of learning a base classifier $f_i$. In other words, at the end of each iteration i, the weight of each correctly classified sample is decreased and the weight of each misclassified sample is increased, so the learning of classifier $f_{i+1}$ will focus more on the misclassified samples. The re-weighting of each sample e is operated as shown below:
$$\omega_e^{i+1} = \frac{\omega_e^i\, \exp\big(-\alpha_i\, y_e\, f_i(x_e)\big)}{Z_i}. \qquad (36)$$
In Eq. (36), $\omega_e^i$ is the weight of sample e updated at iteration i, $Z_i$ is a normalization factor (Eq. (37)), and $\alpha_i$ is the weight of classifier $f_i$ (Eq. (38)), measured based on the classification error rate $\epsilon_i$ (Eq. (39)); $x_e$ represents the feature vector of sample e and $y_e$ represents the ground-truth label of sample e:
$$Z_i = \sum_{e} \omega_e^i\, \exp\big(-\alpha_i\, y_e\, f_i(x_e)\big), \qquad (37)$$
$$\alpha_i = \frac{1}{2}\ln\frac{1 - \epsilon_i}{\epsilon_i}, \qquad (38)$$
$$\epsilon_i = \sum_{e} \omega_e^i\, I\big(f_i(x_e) \ne y_e\big). \qquad (39)$$
In Eq. (39), $I(\cdot)$ is an indicator function comparing the ground-truth label $y_e$ and the output label $f_i(x_e)$ for each sample e.
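The re-weighting loop of Eqs. (36)-(39) can be written compactly. The sketch below (an illustration written for this review, using decision stumps on made-up binary data; it follows the binary AdaBoost update rather than any specific implementation from the cited works) makes the explicit diversity creation visible, as each stump is fitted under a weight distribution that emphasizes the previous stumps' mistakes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)          # labels in {+1, -1}

M, n = 10, len(y)
w = np.full(n, 1.0 / n)                             # uniform initial sample weights
models, alphas = [], []
for i in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y))                   # weighted error rate (Eq. 39)
    alpha = 0.5 * np.log((1 - err) / err)           # classifier weight (Eq. 38)
    w = w * np.exp(-alpha * y * pred)               # re-weighting (Eq. 36): up-weights
    w = w / w.sum()                                 # misclassified samples; Z_i (Eq. 37)
    models.append(stump)
    alphas.append(alpha)

# Weighted-vote ensemble prediction: sign of the alpha-weighted sum of stump outputs
F = sum(a * m.predict(X) for a, m in zip(alphas, models))
print("training accuracy:", np.mean(np.sign(F) == y))
```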
According to Eqs. (36)-(39), it can be derived that classifier $f_i$ trained at iteration i must correctly classify as many as possible of the samples misclassified by the ensemble $\bar{f}_{i-1}$ of classifiers trained at the previous iterations, in order to reduce the error rate $\epsilon_i$. Also, because classifier $f_i$ pays less attention to the samples that were classified correctly by the ensemble $\bar{f}_{i-1}$, some of those samples may be misclassified by $f_i$. Therefore, the AdaBoost method is considered able to create diversity among classifiers explicitly. In contrast to AdaBoost, the Gradient Boosting method trains a classifier at iteration i to fit the negative gradient (residual) estimated at iteration i−1, based on the error of the ensemble $\bar{f}_{i-1}$ of the classifiers trained at the previous iterations. Moreover, any differentiable loss function can be used in the setting of Gradient Boosting, which overcomes the limitation of the AdaBoost method in terms of the selection of loss functions. Based on the principle of the Gradient Boosting method, a popular decision tree ensemble method, referred to as Gradient Boosting Decision Trees (GBDT) [127], has been developed and used in many application areas. In a regression context, as illustrated in Eq. (40), GBDT essentially trains a decision tree $f_i$ at each iteration i by optimizing $\theta_i$ to fit the residual $r_{i-1}$ resulting from the tree ensemble $\bar{f}_{i-1}$ obtained at iteration i−1:
$$\theta_i = \arg\min_{\theta}\sum_{e}\big(r_{i-1}(x_e) - f_i(x_e;\theta)\big)^2, \qquad r_{i-1}(x_e) = y_e - \bar{f}_{i-1}(x_e), \qquad (40)$$
i.e., the better the decision tree $f_i$ fits the residual $r_{i-1}$, the larger the error reduction that can be achieved. As pointed out in [101], the error reduction is achieved primarily through bias reduction, although variance reduction can also be achieved. Based on the Bagging, Random Subspace and Boosting approaches, a variety of decision tree ensemble methods have been developed by introducing specific ways of diversity creation. In particular, Dynamic Random Forest [124] (a variant of Random Forest) involves training M decision trees on M training sets with different weight distributions over the samples, where the weight $\omega_e^i$ of each sample e at each iteration i is set heuristically to be equal to the proportion of the base classifiers obtained so far that correctly classify sample e. Rotation Forest [128] involves using Principal Component Analysis (PCA) [127] over M iterations to draw M transformed feature sets, such that M diverse decision trees are trained. Furthermore, Rotation Random Forest [129] was developed as a variant of Rotation Forest, which involves using PCA or Linear Discriminant Analysis (LDA) [127] to transform each feature subset selected randomly for generating each specific node of a decision tree. Extremely Randomized Trees (Extra-Trees) [130] involves not only randomly selecting a feature subset for generating each specific node of a decision tree but also randomly selecting a numeric value as the cut-point at the tree node if the split attribute is continuous. Random Feature Weights for decision tree ensemble construction [131] was designed to assign each feature a random weight (in the range [0, 1]) for training a decision tree at each iteration i. In this setting, M different decision trees are trained using M feature sets with different weight distributions over the features.
Forest by Penalizing Attributes (Forest PA) [132] was designed to assign each attribute Attr a weight heuristically at each iteration i, based on the level of the tree (trained at iteration i−1) at which the node corresponding to attribute Attr is located. In addition to the above-introduced methods, another popular strategy of data input manipulation is to create diversity by training multiple classifiers on different sets of features extracted in different ways [133, 134]. In the era of deep learning, a new type of decision tree ensemble, referred to as 'Deep Forest' [103, 135], has become more popular. The pilot study was reported in [103], which introduced the gcForest method for producing a cascade of decision forests, i.e., creating an ensemble of ensembles. In particular, the major idea of gcForest is to train a model that involves multiple layers and multiple decision forests (an ensemble of decision tree ensembles) in each layer, where the feature space is dynamically changed every time a new layer is added. In other words, new features, which are generated as outputs of each layer $lr_i$, are used as inputs for the next layer $lr_{i+1}$, while all the original features are kept for each layer. In this context, the ensembles of decision forests in different layers are produced using different sets of features, so the ensembles of decision forests trained in different layers are considered to involve diversity created through the diversification of feature sets. Based on gcForest, several variants have since been developed by setting different strategies of feature space update, such as multi-layered GBDT [136], Deep Extra-Trees [137], Deep Multigrained Cascade Forest [138], Densely Connected Deep Random Forest [139], Rotation-based Deep Forest [140], and Siamese Deep Forest [141]. In terms of data output manipulation, a popular way is to transform a multi-class classification problem into a number of binary classification problems through binary decomposition. Popular decomposition strategies include one-vs-one (OVO) [142], one-vs-rest (OVR) [143], and many-vs-many (MVM) [142]. In the setting of binary decomposition, an ensemble of binary classifiers is created and the error-correcting output codes (ECOC) strategy [144] is commonly used for fusing the outputs of the binary classifiers to finally classify a new sample. ECOC has also shown its effectiveness in improving the diversity between binary classifiers in the setting of end-to-end neural network training [145]. More recently, N-nary decomposition was proposed in [146] as a generalization of binary decomposition. In addition, two other representative ways of output manipulation for diversity enhancement, referred to as 'Output Flipping' and 'Output Smearing', were proposed and examined in [147]. Beyond data manipulation, some other approaches can also be taken in practice for diversity creation, including manipulation of model architectures, differentiation of starting points in the hypothesis space and diversification of learning strategies. The manipulation of model architectures can be applied to decision tree learning, leading to a tree ensemble that contains both binary and multi-way trees, e.g., combining a binary tree trained by CART and two multi-way trees trained by ID3 and C4.5 [148].
Also, in the setting of neural network learning, different types of networks can be produced to form an ensemble by manipulating the network architectures [112]. Differentiation of starting points in the hypothesis space can be applied to neural network learning through random initialization of the weights [100] over multiple iterations, for training complementary models. In the setting of decision tree learning, starting points in the hypothesis space can be differentiated by selecting different attributes for the root nodes of the trees [149]. In addition, diversification of learning strategies can be achieved by training heterogeneous classifiers using different learning algorithms [103, 135], or by using different hyper-parameter settings of the same learning algorithm, e.g., combinations of various loss functions [150]. This section discusses the importance of estimating epistemic uncertainty in several popular applications, including computer vision and natural language processing (NLP). In the following subsections, we first review the applications of epistemic uncertainty learning in computer vision, and then explain how epistemic uncertainty learning has been applied to NLP. In computer vision, uncertainty is taken into account in a variety of applications such as image classification [151, 152], segmentation [83, 153], camera relocalization [154], object detection [155, 156, 157], and image/video retrieval (restoration) [158, 159], in the settings of Bayesian and ensemble learning. Image classification and segmentation are among the most popular applications of deep learning models. The former categorizes all objects in an image into a single class, while the latter aims to assign a label to each pixel of an image, where pixels with the same label share specific properties. Both classification and segmentation have been widely used for medical image analysis. Although the state-of-the-art supervised learning models can produce precise predictions, they are uncertain about the quality of their predictions. Since diseases differ in size and shape and can be located anywhere in the patient's body, it is vital to address uncertainties and make predictions interpretable and reliable. Kwon et al. [153] proposed an uncertainty estimation method using Bayesian neural networks for stroke lesion segmentation; this method exploits the relationship between the variance and the mean of a multi-modal random variable. Abdar et al. [151] integrated an ensemble MCD into a multi-model learning framework, which receives chest X-ray (CXR) and computed tomography (CT) images as inputs, to estimate uncertainty in identifying COVID-19 cases. Study [160] proposed an uncertainty-aware framework for grading diabetic retinopathy, which builds a Gaussian sampling approach based on multiple-instance learning strategies to infer the grade of images. Object detection is another popular application of supervised learning models, extensively used in autonomous cars. Any mistake in their predictions may cause catastrophic damage or even fatalities; therefore, it is vital to estimate the reliability of their predictions. In this regard, prediction surface uncertainty (PURE) [156] was proposed to estimate the predictive uncertainty. This model formulates the object detection task as a regression problem, locating objects in a 2D camera-view image, and uses MCD to estimate the uncertainty of the model.
Study [161] proposed an uncertainty-aware model for detecting both salient and camouflaged objects; specifically, the contradicting attributes of the two tasks were modeled with a similarity measure technique, and an adversarial learning model was proposed to compute the network's confidence score.

As the basis of downstream image classification and segmentation tasks, image restoration is the inverse of the image degradation process. Specifically, it takes an image degraded by the imaging device or by external interference and restores a high-quality approximation of the original, pre-degradation image. In image restoration tasks, the degraded images are samples with a high level of aleatoric uncertainty. Study [158] estimated the uncertainty resulting from undersampled source data: it enhanced the quality of reconstructed images through a dedicated network branch that models the inherent aleatoric uncertainty arising from noisy data, while the epistemic uncertainty served to estimate the reliability of the restored images, meeting the requirements of safety-critical fields such as magnetic resonance (MR) image reconstruction. As Begoli et al. [162] concluded, understanding the structure of a prediction system and defensibly quantifying its uncertainty is significantly beneficial for medical AI applications. Tanno et al. [163] combined a 3D subpixel-CNN framework with Bayesian image quality transfer (IQT) [164] to solve diffusion MRI reconstruction problems. They described intrinsic uncertainty as the irreducible variance of mapping low-resolution (LR) to high-resolution (HR) images, and defined the degree of ambiguity in the model parameters as parameter uncertainty, captured by variational dropout. Subsequently, Schlemper et al. [165] introduced MCD into reconstruction networks, demonstrating the competitive performance of Bayesian methods in quantifying epistemic uncertainty, especially on test samples that fall outside the training data distribution, where they are superior to over-parameterized deterministic networks. Epistemic uncertainty learning techniques have been applied to other image restoration tasks such as denoising [166, 167] and deraining [168]. For instance, Cheng et al. [166] presented an MCMC-based stochastic gradient Langevin dynamics (SGLD) framework that approximates the posterior distribution to improve performance in image denoising. Serra et al. [167] proposed a fast variational inference framework for solving sparse-representation-related problems in image processing and successfully applied it to denoising.

In addition, several studies have addressed epistemic uncertainty in analyzing video streams. Huang et al. [169] exploited the similarity of consecutive frames in videos, i.e., their temporal coherence. They proposed the region-based temporal aggregation (RTA) framework, which dramatically speeds up MC dropout in video segmentation by computing a moving average of the predictions over consecutive frames to simulate the sampling procedure (see the sketch below). Study [170] presented the first learning-based solution for bronchoscopic localization, using VI to estimate uncertainty when conducting video-CT registration.
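The sketch below illustrates the temporal-aggregation idea under simplifying assumptions: one stochastic (dropout-enabled) prediction per frame replaces the many dropout samples per frame, and an exponential moving average over the frame stream supplies the mean and variance that MC dropout would otherwise estimate by repeated passes on a single frame. The actual RTA additionally aligns corresponding regions between frames, which is omitted here.

```python
import numpy as np

def rta_style_uncertainty(frame_probs, alpha=0.8):
    """Aggregate per-frame stochastic predictions with an exponential moving
    average; the running variance over time stands in for the variance that
    MC dropout would estimate by multiple passes on one frame."""
    mean, sq_mean = None, None
    for p in frame_probs:                      # p: (H, W, C) probability map
        mean = p.copy() if mean is None else alpha * mean + (1 - alpha) * p
        sq_mean = p**2 if sq_mean is None else alpha * sq_mean + (1 - alpha) * p**2
    var = np.maximum(sq_mean - mean**2, 0.0)   # clip tiny negative values
    return mean, var

# toy stream of 10 frames of per-pixel class probabilities (H=W=64, C=3)
frames = [np.random.dirichlet([1.0, 1.0, 1.0], size=(64, 64)) for _ in range(10)]
mean, var = rta_style_uncertainty(frames)
print(mean.shape, var.shape)  # (64, 64, 3) (64, 64, 3)
```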
In natural language processing tasks, various metrics of uncertainty quantification have been studied [11, 171, 172, 12, 13] in the context of either Bayesian deep learning or ensemble learning. In the setting of Bayesian deep learning, Xiao and Wang [11] proposed novel methods for quantifying epistemic and aleatoric uncertainties in sentiment analysis, named entity recognition and language modeling tasks; their experimental results show that learning to quantify uncertainty is not only necessary for measuring prediction confidence but also useful for improving model performance. Dong et al. [171] outlined three major causes of uncertainty and designed various metrics for quantifying these factors and for estimating confidence scores that indicate the likelihood that a model's predictions are correct. The experimental results reported in [171] show that the proposed confidence model outperforms methods that rely on confidence scores based on posterior probability, and that the interpretation of uncertainty is also improved in comparison with simply using attention scores. Wang et al. [12] proposed to quantify the epistemic uncertainty for measuring the prediction confidence of a neural machine translation model; their experimental results indicate that the performance of machine translation can be improved significantly through uncertainty-based estimation of prediction confidence.

In the setting of ensemble learning, Shen et al. [172] investigated applying Gaussian processes and random forests to measuring the uncertainty in document quality predictions. The experimental results reported in [172] indicate that both Gaussian processes and random forests can be used effectively to predict the quality of Wikipedia articles, alongside an estimate of the uncertainty reflecting the inconsistent outputs of the various models. He et al. [13] proposed to calibrate the confidence of the winning score in order to generate accurate uncertainty scores. In particular, a model consisting of three parts, namely "mix-up", "self-ensembling" and "distinctiveness score", is proposed in the setting of deep neural networks to reduce the impact of the overconfidence of the winning score while also taking other types of uncertainty into account. The experimental results reported in [13] indicate that accurate uncertainty scores can be obtained using the proposed model, and that text classification performance can be improved by assigning uncertain predictions to domain experts.

In this survey, we provided a hierarchical categorization of epistemic (model) uncertainty learning methods, i.e., Bayesian and ensemble methods. Bayesian methods formulate epistemic uncertainty as a posterior distribution over the weight parameters. Since this posterior is generally intractable, inference cannot be performed analytically and must be approximated. In this regard, we discussed four widely used approximation techniques: variational inference (VI), Monte Carlo dropout (MCD), Markov chain Monte Carlo (MCMC), and the Laplace approximation. Each of these techniques has its advantages and disadvantages [36]. MCD techniques are easy to implement and do not require changing the training process; however, they are not reliable for out-of-distribution samples and require multiple forward passes at inference time. VI techniques benefit from stochastic optimization methods and are suitable for big data sets, but they are computationally complex. MCMC techniques can approximate the exact posterior, but they are very slow and can fail to converge. Although the Laplace approximation has a simple procedure, it often performs poorly because it ignores the global properties of the true posterior.
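The two families can be compared through the standard entropy-based decomposition of predictive uncertainty, in which the stochastic forward passes may come either from Bayesian sampling (e.g., MCD) or from ensemble members. The sketch below implements this common decomposition (predictive entropy = expected entropy + mutual information, the latter being the epistemic part); the randomly generated probabilities merely stand in for real model outputs.

```python
import numpy as np

def uncertainty_decomposition(probs):
    """probs: array (S, N, C) of class probabilities from S stochastic passes
    (MC dropout samples or ensemble members) on N inputs with C classes.
    Returns total, aleatoric and epistemic uncertainty per input."""
    eps = 1e-12
    mean_p = probs.mean(axis=0)                                          # (N, C)
    total = -(mean_p * np.log(mean_p + eps)).sum(axis=1)                 # predictive entropy
    aleatoric = -(probs * np.log(probs + eps)).sum(axis=2).mean(axis=0)  # expected entropy
    epistemic = total - aleatoric                                        # mutual information
    return total, aleatoric, epistemic

# 20 stochastic passes over 5 inputs with 3 classes, drawn from a Dirichlet
probs = np.random.dirichlet([1.0, 1.0, 1.0], size=(20, 5))  # shape (20, 5, 3)
total, alea, epi = uncertainty_decomposition(probs)
print(total.shape, alea.shape, epi.shape)  # (5,) (5,) (5,)
```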
In contrast, ensemble methods formulate epistemic uncertainty as the variance of the outputs of the base models. Epistemic uncertainty is also referred to as ensemble diversity, which is considered a key factor in successful ensemble learning. In particular, existing works have illustrated mathematically how ensemble diversity impacts generalization performance in the context of the bias-variance decomposition. Many studies on diversity quantification and creation have provided useful guidance on how to construct a high-quality ensemble that improves generalization performance. However, there is still no formally accepted definition of the term 'diversity' [96, 101, 135]: different measures of diversity have been designed from different views of diversity, and ensemble diversity has usually been created in heuristic ways [103]. Moreover, many studies have aimed to optimize ensemble accuracy and diversity simultaneously, and some have introduced new diversity metrics to heuristically strengthen the relationship between the two. Nevertheless, this relationship needs to be explored further in depth to clarify how the simultaneous optimization of ensemble accuracy and diversity can be achieved more effectively.

Despite considerable progress in handling epistemic uncertainty in supervised learning models, several challenging issues remain to be addressed. We identified the following research gaps that need further investigation:

• Methodology: Although supervised learning approaches have been widely applied to computer vision and NLP problems, most existing studies fail to quantify uncertainty in practice. They usually use ideal (standard) data sets and inject uniform random noise to evaluate performance, which is unrealistic for real-world problems. In practice, the performance of learning from data sets is affected by uncertain distributions; therefore, it is crucial to develop robust techniques for learning under uncertainty. Moreover, in NLP, uncertainty about the contexts of words is naturally present in text due to insufficient data, yet very few studies on epistemic uncertainty have been conducted in this respect, which indicates the need for further studies of uncertainty in text processing. In addition, most studies estimate uncertainty in supervised learning models, while little attention has been paid to other learning strategies such as semi-supervised learning [173], multi-modal learning [174], reinforcement learning [175], active learning [176], transfer learning [177], and graph learning [178]. From the perspective of algorithm optimization, choosing suitable epistemic uncertainty quantification methods according to the specific task and the characteristics of the algorithm, and generating an optimized learning strategy based on the quantified uncertainty, is worth further exploration. This may be an effective way to improve the performance of deep neural networks with different structural characteristics, as well as of other classic learning algorithms.
For example, several excellent evolutionary computation algorithms, such as particle swarm optimization [179, 180, 181], have received extensive attention from researchers in the post-deep-learning era, yet there is little research on quantifying the uncertainty of such algorithms. We believe that epistemic uncertainty learning techniques can be used to improve the stability of their optimization process.

• Lack of data sets: There are not yet any benchmark databases designed specifically for model uncertainty quantification; the data sets used in the reviewed studies come from the CV and NLP domains. Nonetheless, such benchmarks are one of the foundations for studying epistemic uncertainty quantification. Analyzing epistemic uncertainty on the basis of the bias-variance decomposition may be a breakthrough towards constructing a benchmark data set that fairly reflects the effectiveness of epistemic uncertainty quantification techniques. We will explore this further in our future work.

• Lack of a standard evaluation protocol: Existing uncertainty learning techniques are evaluated on measurable proxies, such as performance on out-of-distribution detection. However, the details of such evaluation strategies vary across studies, which leads to unfair comparisons among techniques. Therefore, it is vital to have a standard protocol for directly evaluating the effectiveness of uncertainty quantification techniques. Theoretical exploration of the quality of uncertainty estimates, based on the bias-variance decomposition discussed in this work, is also a promising future research direction.

• Availability of data and code: Releasing data and code helps researchers reproduce results, enhance performance, and conduct fair comparisons. However, the majority of studies do not make the relevant code and data available.

Enabling supervised learning models to quantify their uncertainty is vital for many real-world applications, such as safety-related problems. This survey first explained the importance of addressing epistemic uncertainty in supervised learning models and discussed it in terms of bias and variance. Then, we reviewed epistemic uncertainty learning techniques in supervised learning over the last five years, providing a hierarchical categorization of these techniques and introducing the representative models of each category along with their applications. Specifically, we discussed two widely used families of epistemic uncertainty learning techniques, i.e., Bayesian approximation and ensemble learning. In addition, several research gaps have been pointed out as potential future research directions. We hope this survey promotes the concept of epistemic uncertainty learning.

This work was supported in part by the National Natural Science Foundation of China (Grants 61976141, 62176160 and 61732011), in part by the National Key R&D Program of China (Grant 2021YFE0203700), in part by the Natural Science Foundation of Shenzhen (University Stability Support Program no. 20200804193857002), and in part by the Interdisciplinary Innovation Team of Shenzhen University.
[1] Recent advances in deep learning.
[2] A review of generalized zero-shot learning methods.
[3] Intuitionistic fuzzy twin support vector machines.
[4] Dual VAEGAN: A generative model for generalized zero-shot learning.
[5] Perfect MCMC sampling in Bayesian MRFs for uncertainty estimation in segmentation.
[6] Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation.
[7] Fuzzy measure with regularization for gene selection and cancer prediction.
[8] Anomaly detection and condition monitoring of UAV motors and propellers.
[9] Bounding box regression with uncertainty for accurate object detection.
[10] Uncertainty estimation in one-stage object detection.
[11] Quantifying uncertainties in natural language processing tasks.
[12] Improving back-translation with uncertainty-based confidence estimation.
[13] Towards more accurate uncertainty estimation in text classification.
[14] Detecting adversarial examples for speech recognition via uncertainty quantification.
[15] SegNet: A deep convolutional encoder-decoder architecture for image segmentation.
[16] Deep-reinforcement-learning-based image segmentation for quantitative analysis of gold immunochromatographic strip.
[17] An improved particle filter with a novel hybrid proposal distribution for quantitative analysis of gold immunochromatographic strips.
[18] Soft-edge assisted network for single image super-resolution.
[19] Melt pool segmentation for additive manufacturing: A generative adversarial network approach.
[20] Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods.
[21] Learning from uncertainty for big data: Future analytical challenges and strategies.
[22] Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift.
[23] Test-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks.
[24] Model uncertainty in deep whole-brain segmentation for structure-wise quality control.
[25] A study on the uncertainty of convolutional layers in deep neural networks.
[26] What uncertainties do we need in Bayesian deep learning for computer vision?
[27] A study on relationship between generalization abilities and fuzziness of base classifiers in ensemble learning.
[28] Discovering the relationship between generalization and uncertainty by incorporating complexity of classification.
[29] An analysis on the relationship between uncertainty and misclassification rate of classifiers.
[30] Bias-variance decomposition of absolute errors for diagnosing regression models of continuous data.
[31] The elements of statistical learning.
[32] Neural network-based uncertainty quantification: A survey of methodologies and applications.
[33] Uncertainty in big data analytics: Survey, opportunities, and challenges.
[34] A survey on Bayesian deep learning.
[35] Hands-on Bayesian neural networks: A tutorial for deep learning users.
[36] A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion.
[37] A survey of uncertainty in deep neural networks.
[38] Uncertainty in deep learning.
[39] Pattern recognition and machine learning.
[40] Keeping the neural networks simple by minimizing the description length of the weights.
[41] On information and sufficiency.
[42] Practical variational inference for neural networks.
[43] Proceedings of the International Conference on Machine Learning.
[44] Auto-encoding variational Bayes.
[45] Variational dropout and the local reparameterization trick.
[46] Variational inference with normalizing flows.
[47] The relevance of Bayesian layer positioning to model uncertainty in deep Bayesian active learning.
[48] Proceedings of the International Conference on Machine Learning.
[49] Neural learning in structured parameter spaces: Natural Riemannian gradient.
[50] Proceedings of the International Conference on Neural Information Processing Systems.
[51] Fast and scalable Bayesian deep learning by weight-perturbation in Adam.
[52] SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient.
[53] Uncertainty-aware attention for reliable interpretation and prediction.
[54] Bayesian learning via stochastic dynamics.
[55] Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.
[56] Dropout: A simple way to prevent neural networks from overfitting.
[57] Concrete dropout. Advances in Neural Information Processing Systems.
[58] Evaluating Bayesian deep learning methods for semantic segmentation.
[59] Encoder-decoder with atrous separable convolution for semantic image segmentation.
[60] Single shot MC dropout approximation.
[61] Empirical study of MC-dropout in various astronomical observing conditions.
[62] Uncertainty quantification in skin cancer classification using three-way decision-based Bayesian deep learning.
[63] BARF: A new direct and cross-based binary residual feature fusion with uncertainty-aware module for medical image classification.
[64] Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks.
[65] A general framework for uncertainty estimation in deep learning.
[66] MonoLoco: Monocular 3D pedestrian localization and uncertainty estimation.
[67] Deep and confident prediction for time series at Uber.
[68] Stochastic gradient Hamiltonian Monte Carlo.
[69] Hybrid Monte Carlo.
[70] Bayesian learning via stochastic gradient Langevin dynamics.
[71] Bayesian uncertainty estimation for batch normalized deep networks.
[72] Bayesian graph convolutional neural networks via tempered MCMC.
[73] Stochastic gradient descent as approximate Bayesian inference.
[74] Cyclical stochastic gradient MCMC for Bayesian deep learning.
[75] Thermostat-assisted continuously-tempered Hamiltonian Monte Carlo for Bayesian learning.
[76] Canonical dynamics: Equilibrium phase-space distributions.
[77] A simple baseline for Bayesian uncertainty in deep learning.
[78] Averaging weights leads to wider optima and better generalization.
[79] Exact sampling with coupled Markov chains and applications to statistical mechanics.
[80] Perfect sampling using bounding chains.
[81] Improving predictive uncertainty estimation using dropout-Hamiltonian Monte Carlo.
[82] Predicting abdominal aortic aneurysm growth using patient-oriented growth models with two-step Bayesian inference.
[83] Uncertainty quantification for radio interferometric imaging I: Proximal MCMC methods.
[84] A practical Bayesian framework for backpropagation networks.
[85] Bayesian estimation of stochastic volatility models by integrated nested Laplace approximation method. Master's thesis.
[86] Overcoming catastrophic forgetting in neural networks.
[87] A scalable Laplace approximation for neural networks.
[88] Online structured Laplace approximations for overcoming catastrophic forgetting.
[89] Estimating model uncertainty of neural networks in sparse information form.
[90] On the convergence of the Laplace approximation and noise-level-robustness of Laplace-based Monte Carlo methods for Bayesian inverse problems.
[91] Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond.
[92] Proceedings of the International Conference on Machine Learning.
[93] L2M: Practical posterior Laplace approximation with optimization-driven second moment estimation.
[94] Simple and scalable predictive uncertainty estimation using deep ensembles.
[95] Proceedings of the International Conference on Machine Learning Workshop.
[96] Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy.
[97] Combining diversity measures for ensemble pruning.
[98] Neural network ensembles, cross validation and active learning.
[99] Neural networks and the bias/variance dilemma.
[100] Diversity creation methods: A survey and categorisation.
[101] Ensemble Methods: Foundations and Algorithms.
[102] Good and bad diversity in majority vote ensembles.
[103] Deep forest: Towards an alternative to deep neural networks.
[104] Error correlation and error reduction in ensemble classifiers.
[105] Analysis of decision boundaries in linearly combined neural classifiers.
[106] Combining diverse neural nets.
[107] Software diversity: Practical statistics for its measurement and exploitation.
[108] Convex ensemble learning with sparsity and diversity.
[109] Using diversity for classifier ensemble pruning: An empirical investigation.
[110] Considering diversity and accuracy simultaneously for ensemble pruning.
[111] Diversity with cooperation: Ensemble methods for few-shot classification.
[112] The diversified ensemble neural network.
[113] When does diversity help generalization in classification ensembles?
[114] Boosting ensemble accuracy by revisiting ensemble diversity metrics.
[115] Bagging predictors.
[116] The random subspace method for constructing decision forests.
[117] Boosting a weak learning algorithm by majority.
[118] An empirical comparison of voting classification algorithms: Bagging, boosting, and variants.
[119] Proceedings of the International Conference on Machine Learning.
[120] Model combination in the multiple-data-batches scenario.
[121] Attribute bagging: Improving accuracy of classifier ensembles by using random feature subsets.
[122] Random forests: From early developments to recent advancements.
[123] Random forests.
[124] Dynamic random forests.
[125] Proceedings of the International Conference on Machine Learning.
[126] Greedy function approximation: A gradient boosting machine.
[127] The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
[128] Rotation forest: A new classifier ensemble method.
[129] Random forests with ensemble of feature spaces.
[130] Extremely randomized trees.
[131] Random feature weights for decision tree ensemble construction.
[132] Forest PA: Constructing a decision forest by penalizing attributes used in previous trees.
[133] Affect recognition from face and body: Early fusion vs. late fusion.
[134] Multimodal machine learning: A survey and taxonomy.
[135] Deep forest.
[136] Multi-layered gradient boosting decision trees.
[137] Proceedings of the International Conference on Neural Information Processing.
[138] Deep multigrained cascade forest for hyperspectral image classification.
[139] Densely connected deep random forest for hyperspectral imagery classification.
[140] Rotation-based deep forest for hyperspectral imagery classification.
[141] A Siamese deep forest. Knowledge-Based Systems 139.
[142] Ordinal regression methods: Survey and experimental study.
[143] A survey of multiple classifier systems as hybrid systems.
[144] Solving multiclass learning problems via error-correcting output codes.
[145] Error-correcting output codes with ensemble diversity for robust learning in neural networks.
[146] N-ary decomposition for multiclass classification.
[147] Randomizing outputs to increase prediction accuracy.
[148] An efficient rule-based classification of diabetes using ID3, C4.5, & CART ensembles.
[149] Forest CERN: A new decision forest building technique.
[150] Combination of loss functions for deep text classification.
[151] UncertaintyFuseNet: Robust uncertainty-aware hierarchical feature fusion with ensemble Monte Carlo dropout for COVID-19 detection.
[152] MCUa: Multi-level context and uncertainty aware dynamic deep ensemble for breast cancer histology image classification.
[153] Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation.
[154] Modelling uncertainty in deep learning for camera relocalization.
[155] MetaDetect: Uncertainty quantification and prediction quality estimates for object detection.
[156] Prediction surface uncertainty quantification in object detection models for autonomous driving.
[157] Generating robust real-time object detector with uncertainty via virtual adversarial training.
[158] Reducing uncertainty in undersampled MRI reconstruction with active acquisition.
[159] Structured uncertainty prediction networks.
[160] DR-GRADUATE: Uncertainty-aware deep learning-based diabetic retinopathy grading in eye fundus images.
[161] Uncertainty-aware joint salient object and camouflaged object detection.
[162] The need for uncertainty quantification in machine-assisted medical decision making.
[163] Bayesian image quality transfer with CNNs: Exploring uncertainty in dMRI super-resolution.
[164] Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention.
[165] Bayesian deep learning for accelerated MR image reconstruction.
[166] Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[167] Bayesian K-SVD using fast variational inference.
[168] Robust representation learning with feedback for single image deraining.
[169] Efficient uncertainty estimation for semantic segmentation in videos.
[170] Generative localization with uncertainty estimation through video-CT data for bronchoscopic biopsy.
[171] Confidence modelling for neural semantic parsing.
[172] Modelling uncertainty in collaborative document quality assessment.
[173] A semisupervised learning model based on fuzzy min-max neural networks for data classification.
[174] Target detection in clutter/interference regions based on deep feature fusion for HFSWR.
[175] An improved fuzzy ARTMAP and Q-learning agent model for pattern classification.
[176] Incorporating diversity and informativeness in multiple-instance active learning.
[177] Transferring case knowledge to adaptation knowledge: An approach for case-base maintenance.
[178] A new method for knowledge and information management domain ontology graph model.
[179] A competitive mechanism integrated multi-objective whale optimization algorithm with differential evolution.
[180] A novel randomised particle swarm optimizer.
[181] A novel sigmoid-function-based adaptive weighted particle swarm optimizer.