title: Estimation with Uncertainty via Conditional Generative Adversarial Networks
authors: Lee, Minhyeok; Seok, Junhee
date: 2020-07-01

Conventional predictive Artificial Neural Networks (ANNs) commonly employ deterministic weight matrices; therefore, their prediction is a point estimate. Such a deterministic nature of ANNs limits their use for medical diagnosis, law problems, and portfolio management, in which discovering not only the prediction but also the uncertainty of the prediction is essentially required. To address this problem, we propose a predictive probabilistic neural network model, which corresponds to a different manner of using the generator in a conditional Generative Adversarial Network (cGAN), a model that has routinely been used for conditional sample generation. By reversing the input and output of an ordinary cGAN, the model can successfully be used as a predictive model; moreover, the model is robust against noise since adversarial training is employed. In addition, to measure the uncertainty of predictions, we introduce the entropy and the relative entropy for regression problems and classification problems, respectively. The proposed framework is applied to stock market data and an image classification task. As a result, the proposed framework shows superior estimation performance, especially on noisy data; moreover, it is demonstrated that the proposed framework can properly estimate the uncertainty of predictions.

Conventional predictive Artificial Neural Network (ANN) models commonly operate in a feedforward framework with deterministic weight matrices as the network weight parameters [1-3]. Specifically, the estimation of ANNs is conducted through matrix operations between given samples and the trained network parameters with non-linear activation functions. While outstanding progress has been made in ANNs in recent years [4, 5], conventional predictive ANN models have an obvious limitation: their estimation corresponds to a point estimate. Such a limitation restricts the use of ANNs for medical diagnosis, law problems, and portfolio management, where the risk of the predictions is also essential in practice. Conventional ANN models produce the same form of predictions even when the predictions are very uncertain, and such uncertain prediction results cannot be distinguished from confident and regular predictions. In short, conventional ANN models cannot say 'I don't know', which corresponds to a type of overfitting. For instance, the models attempt to make a confident prediction for an outlier or even complete noise data, for which predictions are meaningless and impossible. In such a framework, it is not clear how certain the models are of their predictions.

To handle this problem, a probabilistic approach to ANNs, called Bayesian Neural Networks (BNNs), has been introduced [6-8]. In BNNs, the values of the network weight parameters are not fixed; instead, they are obtained by a sampling process from certain distributions. Therefore, the prediction of a BNN for a given sample differs in each operation. Integrated with the Monte Carlo method, in which many different predictions are made for a given sample, the prediction of a BNN also forms a distribution whose variance can represent the risk and uncertainty of the prediction.
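To make the Monte Carlo idea above concrete, the sketch below keeps dropout active at inference time, one practical approximation of a BNN in the Bayesian deep learning literature; the layer sizes, dropout rate, and number of samples are illustrative assumptions, not the architectures used in this paper.

```python
import torch
import torch.nn as nn

# Minimal sketch: Monte Carlo prediction with dropout kept active at test time,
# one practical approximation of a Bayesian neural network. Sizes are illustrative.
class MCDropoutRegressor(nn.Module):
    def __init__(self, in_dim=30, hidden=64, out_dim=1, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def mc_predict(model, x, k=100):
    """Return the mean and spread of k stochastic forward passes for a batch x."""
    model.train()  # keep dropout active so each pass samples a different sub-network
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(k)], dim=0)  # (k, batch, out_dim)
    return samples.mean(dim=0), samples.std(dim=0)  # point estimate and uncertainty proxy

x = torch.randn(8, 30)                     # e.g., 30 past daily returns per sample
mean, std = mc_predict(MCDropoutRegressor(), x)
```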
However, the training of BNNs is not straightforward, since the Monte Carlo method is employed for the training process as well [6]. In the training process, the network parameters are sampled from posterior distributions of the network parameters, and gradients are calculated and backpropagated for the sampled parameters. Such randomness intrinsic to the training process hinders fast training and convergence of BNNs; for the same reason, training deep BNNs is also difficult.

As another probabilistic neural network model that can produce distributions as its outputs, Generative Adversarial Networks (GANs) have shown superior performance for sample generation [9-13]. Generally, GANs learn the sample distribution of a certain dataset in order to produce synthetic but realistic samples from input noise, by mapping features intrinsic to the dataset onto the input noise space. While typical GAN models produce random samples, conditional variants of GANs (cGANs) have been introduced to generate desired samples, and have shown fine results in producing samples from conditional inputs [14-17].

This paper addresses the following question, which arises from this characteristic of cGANs: since cGANs can learn the probability distribution of samples, i.e., $\Pr(X \mid Y, Z)$, is it possible to learn the probability distribution of labels, i.e., $\Pr(Y \mid X, Z)$, by reversing the inputs and outputs of the cGAN? If it is possible, we can utilize cGANs as a predictive probabilistic neural network model with a similar function to BNNs. Moreover, such a model can avoid the problems of BNNs, since deep architectures can be employed for GANs and their training is relatively simple compared to BNNs. However, this issue has not yet been studied extensively.

In this paper, we propose an adversarial learning framework for utilizing the generator in cGANs as a predictive deep learning model with uncertainty. Since the outputs of the proposed model form a distribution, the uncertainty of predictions can be represented by the variance of the distribution. Furthermore, to measure and quantify the uncertainty of estimations, we introduce the entropy and the relative entropy for regression problems and classification problems, respectively.

Let $X \in \mathbb{R}^p$ be a sample and $Y(X) \in \mathbb{R}^q$ be the labels for the sample. A predictive ANN model estimates the labels for a given sample as follows:

$$\hat{Y}_{M,\vartheta}(X) = M(X; \vartheta), \qquad (1)$$

where $M: \mathbb{R}^p \to \mathbb{R}^q$ is an ordinary ANN structure and $\vartheta$ denotes a set of weight parameters. Throughout the paper, we use the notation $M(\cdot)$ when an ANN structure is used; the same notation is also used, where explicitly stated, to indicate a general method that does not use ANNs. However, such predictions correspond to point estimates, and it is not clear how certain the model is of its predictions. For instance, given complete noise $X_N \sim \mathcal{N}(0, I_p)$, it is even possible to predict a label for the noise, i.e., $\hat{Y}_{M,\vartheta}(X_N)$; obviously, such a point estimate is a wrong answer for noise, and one of the correct answers would be 'I don't know', which cannot be represented by conventional ANNs. In this paper, we aim to solve this problem by conducting the prediction in a probabilistic manner:

$$\widehat{\Pr}(Y_j \mid X) = M_P(X; \vartheta_P), \qquad (2)$$

where $M_P: \mathbb{R}^p \to \mathcal{D}$ is a probabilistic ANN model, $Y_j$ denotes a variable for the possible values of the $j$-th element of $Y(X)$, and $j \in \{1, \dots, q\}$.
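One possible instantiation of the probabilistic model $M_P$ in (2) is a network that outputs the parameters of a distribution for each target element; the Gaussian choice below is an assumption for illustration only, and the paper itself does not restrict the form of the output distribution.

```python
import torch
import torch.nn as nn

# Sketch of a probabilistic ANN in the sense of (2): for each target element Y_j,
# the network outputs the mean and log-variance of a Gaussian, so its output is a
# distribution rather than a point. The Gaussian form and sizes are assumptions.
class GaussianOutputANN(nn.Module):
    def __init__(self, p=30, q=1, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(p, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, q)
        self.logvar_head = nn.Linear(hidden, q)

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.logvar_head(h)

model = GaussianOutputANN()
x = torch.randn(8, 30)
mean, logvar = model(x)
std = torch.exp(0.5 * logvar)   # per-element uncertainty of the prediction
```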
From such probability distributions of predictions, not only the point estimate but also the uncertainty of the prediction can be calculated as follows:

$$\big(\hat{Y}_{j,M_C}, \eta_j\big) = M_C\big(\widehat{\Pr}(Y_j \mid X); \theta_C\big), \qquad (3)$$

where $M_C: \mathcal{D} \to \mathbb{R}^2$ is a method for point estimation using the estimated distributions; therefore, $\hat{Y}_{M_C}$ can be the mean or median of the estimated distributions, $\theta_C$ is a set of parameters for $M_C$, and $\eta$ denotes a value describing the uncertainty of the point estimate.

To construct a probabilistic neural network model as in (2), BNNs using stochastic weights have been introduced [6, 18-20]. In BNNs, each weight has a probability distribution, and a value of the weight is sampled from that distribution each time the model makes an inference. Then, by using the Monte Carlo method over the distribution, the prediction of a BNN can take the form of a distribution:

$$\hat{Y}_{M_B}(i) = M_B\big(X; \vartheta_B(i)\big), \quad \vartheta_B(i) \sim P_{\vartheta_B}, \quad i = 1, \dots, k, \qquad (4)$$

$$\widehat{\Pr}(Y \mid X) = M_D\big(\hat{Y}_{M_B}(1), \dots, \hat{Y}_{M_B}(k); \theta_D\big), \qquad (5)$$

where $M_B: \mathbb{R}^p \to \mathbb{R}^q$ is a BNN architecture, $i$ denotes an index for the weight sampling, $k$ denotes the number of samplings, $\vartheta_B(i) \sim P_{\vartheta_B}$ denotes the $i$-th sampled weights from the weight distributions, and $M_D: \mathbb{R}^k \to \mathcal{D}$ and $\theta_D$ are a method for probability density estimation and a set of parameters for the method, respectively.

GANs are probabilistic neural network models, in common with BNNs; however, in contrast to BNNs, GANs use deterministic weight parameters and instead employ a stochastic noise vector as their input for representing latent features in a dataset [21-25]. The training of GANs is performed in an adversarial manner in which a discriminator and a generator play a game to distinguish and deceive each other. Generally, GANs are used to learn sample distributions; therefore, by varying the noise vector with the Monte Carlo method, the output of the generator is a synthetic sample:

$$\tilde{X}(i) = M_G\big(Z(i); \vartheta_G\big), \qquad (6)$$

where $M_G$ is a generator in GANs, $Z(i) \sim P_Z$ is a noise vector, and $\tilde{X}$ is a generated synthetic sample. However, since it is not clear which noise variable is related to which feature, producing desired samples is challenging with ordinary GANs. To address this problem, conditional variants of GANs have been studied. Conditional GAN (cGAN), one of the most popular of the conditional variants, is extensively used to generate conditional samples [14, 26-28]. cGAN uses labels as another input for the generator:

$$\tilde{X}(i) = M_{cG}\big(Z(i), Y; \vartheta_{cG}\big), \qquad (7)$$

where $M_{cG}$ is a generator in cGANs, and $Y$ is a desired condition, which has basically the same form as $Y(X)$.

In this paper, we propose a new framework that uses the generator in cGAN as a predictive model, whereas the existing cGAN is routinely employed for sample generation. By simply reversing the output and the conditional input in cGAN, the model can successfully be used as a probabilistic predictive neural network model that has the same function as BNNs:

$$\hat{Y}_{M_{cG}}(i) = M_{cG}\big(Z(i), M_F(X; \vartheta_F); \vartheta_{cG}\big), \qquad (8)$$

where $M_F: \mathbb{R}^p \to \mathbb{R}^u$ is a feature network that extracts $u$-dimensional features from samples; therefore, $\hat{Y}_{M_{cG}}$ corresponds to one of the prediction results using cGAN. We can use the sample $X$ directly, instead of the feature network, if the dimension of the sample space, i.e., $p$, is low. Such a modification simply exchanges the input and output in (7). By sampling different noise vectors $Z(i)$, a probability distribution of predictions can be obtained in a similar manner to BNNs, as described in (5):

$$\widehat{\Pr}(Y \mid X) = M_D\big(\hat{Y}_{M_{cG}}(1), \dots, \hat{Y}_{M_{cG}}(k); \theta_D\big). \qquad (9)$$

In the training of GAN structures, the generator is trained in an adversarial manner to deceive the discriminator; therefore, a discriminator must be set up in order to train $M_{cG}$.
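A minimal sketch of the prediction procedure in (8) and (9) is given below, assuming small multilayer perceptrons for the feature network $M_F$ and the generator $M_{cG}$; the dimensions, architectures, and the use of the mean and standard deviation as summaries are illustrative assumptions, and the actual networks used in the paper are given in its Appendix.

```python
import torch
import torch.nn as nn

# Sketch of the reversed cGAN used as a predictor (eqs. (8)-(9)).
# M_F extracts u-dimensional features, M_cG maps (noise Z, features) to a prediction;
# sampling many Z gives a distribution of predictions. Architectures are placeholders.
p, u, q, z_dim = 30, 32, 1, 16

feature_net = nn.Sequential(nn.Linear(p, 64), nn.ReLU(), nn.Linear(64, u))        # M_F
generator = nn.Sequential(nn.Linear(z_dim + u, 64), nn.ReLU(), nn.Linear(64, q))  # M_cG

def predict_distribution(x, k=200):
    """Monte Carlo over the noise input: returns k prediction samples per input row."""
    feats = feature_net(x)                                # (batch, u)
    feats = feats.unsqueeze(0).expand(k, -1, -1)          # (k, batch, u)
    z = torch.randn(k, x.shape[0], z_dim)                 # Z(i) ~ P_Z
    y_samples = generator(torch.cat([z, feats], dim=-1))  # (k, batch, q)
    return y_samples

x = torch.randn(4, p)
samples = predict_distribution(x)
point_estimate = samples.mean(dim=0)   # e.g., the mean (the paper uses the mode)
spread = samples.std(dim=0)            # a simple dispersion measure
```

During training, such generated predictions are what the discriminator, described next, must distinguish from the true labels.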
In this paper, the projection discriminator [14], which shows superior performance compared to simple concatenation, is employed for the training of the generator. The architecture of the projection discriminator is as follows:

$$M_{Dis}(X, Y; \vartheta_{Dis}) = W_o\, M_\varphi(Y; \vartheta_\varphi) + M_F(X; \vartheta_F)^\top M_\varphi(Y; \vartheta_\varphi), \qquad (10)$$

where $M_{Dis}: \{\mathbb{R}^p, \mathbb{R}^q\} \to \mathbb{R}$ denotes the projection discriminator, $W_o \in \mathbb{R}^{1 \times u}$ is a weight matrix for the output layer of the discriminator, $\vartheta_{Dis} = \{W_o\} \cup \vartheta_\varphi$, and $M_\varphi: \mathbb{R}^q \to \mathbb{R}^u$ is an ANN structure with an output dimension of $u$. To solve a classification problem with the proposed framework, where the input of the discriminator $Y(X)$ is a one-hot vector, $M_\varphi$ can be replaced by a matrix $V \in \mathbb{R}^{u \times q}$, i.e., $M_\varphi(Y; \vartheta_\varphi) = VY$. Therefore, for classification problems, (10) can be simplified as:

$$M_{Dis}(X, Y; \vartheta_{Dis}) = W_o\, VY + M_F(X; \vartheta_F)^\top VY. \qquad (11)$$

Hence, throughout the paper, we use (10) and (11) for regression problems and classification problems, respectively.

We introduce entropy metrics to measure and quantify the uncertainty of predictions from cGANs. While the variance of the estimated distribution can be used to represent the uncertainty if the distribution follows a normal distribution, we employ the entropy in this paper, since we cannot reject the possibility that the distribution does not follow a normal distribution. For regression problems, the regular entropy of the estimated distribution in (9) is employed, which can be calculated as follows:

$$H(\hat{Y}_j) = -\sum_k \Pr(Y_{j,k}) \log \Pr(Y_{j,k}), \qquad (12)$$

where $\Pr(Y_{j,k})$ is the probability of the $k$-th element of $Y_j$. Notice that $j = 1, \dots, q$, so the entropy is calculated for each target variable.

On the other hand, for classification problems, the relative entropy, also known as the Kullback-Leibler divergence, is used instead of the ordinary entropy, since the difference in distributions between the predicted class and the other classes can represent the certainty of the point prediction. If a prediction is certain, the distribution of the predicted class has low variance, and its distance from the other distributions would be large; such variance and distance can be comprehensively represented by the relative entropy. Let $l_X$ be the index of the predicted class given a sample $X$; for example, $l_X = \arg\max_j \sum_i \hat{Y}_{j, M_{cG}, \vartheta_{cG}}\big(Z(i), M_F(X; \vartheta_F)\big)$ if we use the average as the point estimation of each class. Since the relative entropy is asymmetric, we employ a symmetric variant of the relative entropy:

$$\tilde{D}_{KL} = \sum_{j \neq l_X} \Big( D_{KL}\big(\hat{Y}_{l_X} \,\|\, \hat{Y}_j\big) + D_{KL}\big(\hat{Y}_j \,\|\, \hat{Y}_{l_X}\big) \Big). \qquad (13)$$

Employing these entropies, we can measure the uncertainty of predictions as $\eta_j = H(\hat{Y}_j)$ for regression problems and $\eta = -\tilde{D}_{KL}$ for classification problems. Notice that $-\tilde{D}_{KL}$ is used to describe the uncertainty since the ordinary relative entropy represents the difference in class distributions; therefore, the negative relative entropy indicates the similarity between the predicted class distribution and the other class distributions, which corresponds to the uncertainty. In contrast, the ordinary entropy measures a type of variance of distributions; thus, a high value of the entropy indicates high uncertainty of predictions.

For classification problems, however, ordinary ANN models have a sort of uncertainty measure whose function is similar to the proposed uncertainty measure. The softmax function is commonly employed for the last layer of ANN classifiers, and the outputs of the function provide a kind of probability for each class. Therefore, $-\log \big[\hat{Y}_{M,\vartheta}(X)\big]_{l_X}$, the cross-entropy loss for the softmax function, can indicate the uncertainty of a prediction. However, there exists overfitting in ordinary ANN models, as described in the previous section; thus, this measure becomes imprecise.
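A sketch of how the two uncertainty measures could be computed from Monte Carlo prediction samples is given below. The histogram binning and the summation of the symmetric relative entropy over the non-predicted classes are assumptions made for illustration; the paper does not fix these implementation details here.

```python
import numpy as np

# Sketch of the two uncertainty measures. Binning and aggregation choices are assumptions.
def entropy_uncertainty(samples, bins=30):
    """Regression: entropy of the empirical distribution of prediction samples."""
    hist, _ = np.histogram(samples, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def classification_uncertainty(class_score_samples, bins=30):
    """Classification: negative symmetric KL between the predicted class's score
    distribution and those of the other classes (summed over the other classes)."""
    k, num_classes = class_score_samples.shape
    l_x = class_score_samples.mean(axis=0).argmax()   # predicted class index
    lo, hi = class_score_samples.min(), class_score_samples.max()
    hists = [np.histogram(class_score_samples[:, j], bins=bins, range=(lo, hi))[0] / k
             for j in range(num_classes)]
    d = sum(kl(hists[l_x], hists[j]) + kl(hists[j], hists[l_x])
            for j in range(num_classes) if j != l_x)
    return -d   # very negative: confident prediction; near zero: uncertain prediction

samples = np.random.randn(200, 10)                 # e.g., 200 MC prediction vectors, 10 classes
print(classification_uncertainty(samples))
print(entropy_uncertainty(np.random.randn(200)))   # regression example
```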
We will compare the performance of the proposed uncertainty measure and this existing method in Section 4.

In this section, we compare the proposed framework with related prior works. The key differences are summarized and illustrated in Figure 1.

Comparison to ordinary cGANs. The conventional generator in cGAN is used for synthetic sample generation [14]. An ordinary cGAN learns the conditional sample distribution, i.e., $\Pr(X \mid Y, Z)$, and then produces synthetic samples, integrated with the Monte Carlo method over the noise vector $Z$. In contrast, we use cGAN as a prediction model, where the model learns the target distribution, i.e., $\Pr(Y)$, and performs predictions in the form of distributions, i.e., $\Pr(Y \mid X, Z)$, with a stochastic input $Z$, given a sample $X$ as the conditional input of the cGAN. In short, the neural network architecture of the cGAN is basically identical in both cases, while the proposed framework corresponds to reversing the input and output in the typical use of cGAN.

Comparison to ANNs and BNNs. ANNs are limited in expressing the uncertainty of predictions, as described in the previous sections. The output of an ANN is a point estimate, whereas the output of the proposed framework is a distribution that can represent the uncertainty of predictions, which addresses this limitation. Likewise, a BNN also produces predictions in the form of a distribution, which is similar to the proposed framework; however, a BNN uses stochastic weights to do so, which generally hinders convergence in training and the construction of deep neural network architectures. In contrast, the proposed framework uses deterministic weights and stochastic inputs instead. In addition, the training process is also different: an adversarial training manner with a discriminator is employed for the proposed framework, which can avoid the overfitting resulting from the high complexity of neural network architectures.

Stock market prediction is a representative problem in which both the prediction of returns and the uncertainty of that prediction are required in practice. In modern portfolio theory [29-32], both the expected returns and the risks of a portfolio must be calculated for the selection of a portfolio; the prediction of the expected returns and the risks exactly corresponds to estimation with uncertainty, the aim of the proposed framework. We apply the proposed framework to NASDAQ-100 Future Index data. The model is trained with the returns of the past 30 days as the input and the 5-day return as the target. For the training set, the close price index from January 2001 to December 2015 is used; the data from January 2016 to April 2019 are employed for the test set. For the stability of the data, the price data are preprocessed into returns; the preprocessing of the price data is described in the Appendix. The hinge loss and the Wasserstein distance are employed for stable training of the cGAN, following recent studies of cGANs [14, 33]. For probabilistic models that produce distributions as their output, the mode is employed for the point estimation. The ANN and cGAN-UC (cGAN for prediction with uncertainty, i.e., the proposed framework) are trained for 2,000 epochs; in contrast, the BNN models are trained for 5,000 epochs, but the 5-layer BNN fails to converge within the training process, which demonstrates the difficulty of training the stochastic weights of BNNs, as described previously.
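A hedged sketch of the data preparation described above is given below, assuming simple arithmetic returns; the exact preprocessing used in the paper is provided in its Appendix.

```python
import numpy as np

# Sketch of the dataset construction: past-30-day returns as input, 5-day forward
# return as target. Arithmetic returns are assumed here for illustration only.
def make_dataset(close, window=30, horizon=5):
    returns = close[1:] / close[:-1] - 1.0                 # daily returns
    X, y = [], []
    for t in range(window, len(returns) - horizon):
        X.append(returns[t - window:t])                    # past 30 daily returns
        y.append(close[t + horizon] / close[t] - 1.0)      # 5-day forward return
    return np.array(X), np.array(y)

close = np.cumprod(1 + 0.01 * np.random.randn(1000)) * 1000.0  # synthetic price series
X, y = make_dataset(close)
print(X.shape, y.shape)   # (n_samples, 30), (n_samples,)
```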
The architectures of the models used in this experiment are provided in the Appendix. The models are comprehensively evaluated by the prediction performance on returns and by whether the estimated uncertainty is actually correlated with the prediction errors, which indicates whether the risk of predictions can properly be measured.

Table 1 shows the correlation coefficients on the test set. The deterministic models show similar prediction performance, whereas the cGAN-UCs show superior performance compared to the deterministic models as well as the BNNs. The 5-layer cGAN-UC demonstrates the best prediction performance, while the uncertainty estimation is more precise in the 7-layer model. We conjecture that such a performance gain results from the adversarial learning process of GAN, which can assist in learning the true distribution of noisy data and in avoiding the overfitting inherent in complex neural network architectures; this should be studied further in future work. Moreover, it is demonstrated that the estimated uncertainty and the prediction errors are correlated, which signifies that predictions with high uncertainty have a high possibility of being wrong. The predictions of returns by both the deterministic models and the probabilistic models are not very successful, with correlation coefficients below 0.1, due to the chaotic nature of the stock market; however, the proposed framework demonstrates that the uncertainty of the predictions can be measured. Such a result indicates the possibility of utilizing the proposed uncertainty measure for the risk management of portfolios that use stock market predictions.

Figure 2 is an example of using the estimated uncertainty for portfolio management. When only the predictions of returns are given, the simplest strategy is to take a long position if the prediction is positive and a short position otherwise. The portfolio performance of this simplest strategy, using only the prediction results of cGAN-UC, is shown in dark blue in Figure 2. We can further enhance the performance by using the estimated uncertainty (risk) along with the predictions. The other strategies employ the uncertainty by introducing a neutral position if the uncertainty is above a certain threshold; such an approach signifies that we do not take a risk on uncertain predictions. The orange strategy in the figure considers predictions with uncertainty > 2.0 as invalid predictions; thus, we take a neutral position in such cases. By using this strategy, the final return increases by 17.1 percentage points and, simultaneously, the standard deviation of the portfolio decreases, compared to the conventional strategy. Furthermore, interestingly, the standard deviation of the strategy that takes positions only for predictions with uncertainty < 1.7 (gray) decreases considerably, while the final returns of these strategies are similar to that of the conventional strategy. Such a possibility of performance enhancement is further strong evidence that the estimated uncertainty is effective.

In this section, the proposed framework is applied to an image classification task with the CIFAR-10 dataset. Unlike regression tasks, for classification tasks, deterministic models can estimate the uncertainty of predictions by using the probability of the point estimate as the certainty of the prediction, since a softmax function is generally used for the last layer of the network. For instance, if the prediction of a deterministic model is 'Car' with a 98% probability, the uncertainty can be estimated as 2%.
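A minimal sketch of this conventional certainty measure for a deterministic classifier is given below; the linear layer stands in for the classifier's final layer and is not the DenseNet used in the experiment.

```python
import torch
import torch.nn as nn

# Sketch of the conventional certainty measure of a deterministic classifier:
# the softmax probability of the predicted class. The classifier is a placeholder.
logits = nn.Linear(128, 10)(torch.randn(4, 128))   # stand-in for final-layer outputs
probs = torch.softmax(logits, dim=1)
certainty, predicted = probs.max(dim=1)
uncertainty = 1.0 - certainty                      # e.g., 'Car' at 98% -> uncertainty 2%
nll = -torch.log(certainty)                        # equivalently, the predicted-class cross-entropy loss
```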
In this manner, for the uncertainty measure of the deterministic models, the log of the probability of the point estimate, i.e., the cross-entropy classification loss, is employed as the conventional method. For the comparison between a deterministic model and cGAN-UC, a densely connected convolutional network (DenseNet), which generally shows fine performance for image classification tasks [34, 35], is used as the feature network of cGAN-UC as well as for the deterministic model. The architectures of cGAN-UC and the DenseNet used in this experiment are provided in the Appendix.

As a result, the estimation accuracy of the two models is marginally different: the test set prediction accuracy of cGAN-UC and the ordinary DenseNet is 94.4% and 94.1%, respectively, which corresponds to a 0.3 percentage point improvement from using cGAN-UC. Moreover, it is demonstrated that there exists a significant performance difference in the uncertainty measures of the models. In short, the conventional method of estimating the uncertainty does not perform well in general. Specifically, there is a sort of overfitting in the deterministic model, where the model has high certainty for wrong answers. For instance, on the test set of CIFAR-10, the median softmax output of DenseNet for the wrong answers is 0.968 (96.8%), which means that the deterministic classifier is overconfident in estimations that are actually wrong.

We compare the distributions of the proposed uncertainty measure and the conventional uncertainty measure with respect to correct and wrong predictions. Then, a t-test is performed to measure the difference between the two distributions, which can indicate the correlation between the estimated uncertainty and the actual prediction results. As a result, the t-statistics of the proposed uncertainty measure and the classification loss are 30.0 and 16.3, respectively, which signifies the superior performance of the proposed measure in estimating the uncertainty of predictions (Figure 3). We also perform a qualitative analysis of the most certain predictions and the most uncertain predictions of cGAN-UC (Figure 4), taking the top three certain/uncertain predictions for the analysis. As a result, while the image samples of the certain predictions are obvious samples, those of the uncertain predictions correspond to a sort of outlier, i.e., an automobile in the air with a door, a toy airplane on the grass, and a deer that is hard to recognize; also, the airplane is incorrectly classified as 'Deer'. Overall, these results indicate that the proposed framework not only shows superior prediction performance but also can properly measure the uncertainty of the predictions.

In this section, we further evaluate the proposed framework with noisy image data, since we conjecture that the proposed framework is robust against noise because the adversarial learning process is employed for the training of cGAN-UC. In the adversarial learning process, noisy outputs produced from ordinary samples are rejected by the discriminator; thereby, cGAN-UC learns from the rejections, which makes cGAN-UC robust against noise. The proposed framework is evaluated with noisy CIFAR-10 image data, obtained by mixing each image with random noise according to a noise parameter $a$, where $X_N(a) \in \mathbb{R}^p$ denotes a noisy sample. In short, in each experiment we take a different $a$, and then evaluate and compare the classification accuracy of the models. The neural network architecture and the other conditions are the same as those of the previous experiment.
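A sketch of the noisy-input evaluation is given below; since the corruption equation is not reproduced above, a convex mixture of the image with Gaussian noise controlled by $a$ is assumed here for illustration only.

```python
import torch

# Sketch of the noisy-input evaluation. The exact corruption used in the paper is not
# reproduced here; a convex mix with Gaussian noise controlled by a is assumed.
def add_noise(x, a):
    noise = torch.randn_like(x)
    return (1.0 - a) * x + a * noise   # X_N(a): a = 0 keeps the image, larger a adds more noise

images = torch.rand(16, 3, 32, 32)     # stand-in for a CIFAR-10 batch scaled to [0, 1]
for a in (0.0, 0.1, 0.2):
    noisy = add_noise(images, a)
    # evaluate the classification accuracy of cGAN-UC and DenseNet on `noisy` here
```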
Figure 5 (A) shows the prediction results for the noisy data. As expected, not only for the original image data but also for the noisy data, cGAN-UC outperforms the ordinary DenseNet; moreover, the performance difference is more significant for the noisy data. Specifically, with $X_N(0.2)$, the performance difference is 10.0 percentage points, where the accuracy of cGAN-UC and DenseNet is 48.1% and 38.1%, respectively. In addition, the median of the proposed uncertainty measure is compared with respect to $a$. As shown in Figure 5 (B), the uncertainty increases as the proportion of noise increases. Such a result also strongly supports the claim that the proposed uncertainty measure actually represents the uncertainty of predictions.

We propose a predictive probabilistic neural network framework, which corresponds to a different manner of using the generator in cGAN, a model initially introduced for sample generation. While cGAN has commonly been employed for conditional sample generation, the extensive experiments in this paper demonstrate that the model can also be used as a predictive model. In addition, we introduce uncertainty measures for the prediction results of the proposed framework: the uncertainty of a prediction is calculated by the entropy and the relative entropy for regression problems and classification problems, respectively. The proposed framework was evaluated with stock market data and an image classification task. As a result, the proposed framework demonstrates superior prediction performance and successfully estimates the uncertainty of predictions. Moreover, interestingly, the proposed framework shows robust performance on noisy data, compared to the deterministic model. We conjecture that these results are due to the adversarial learning process of the proposed framework, where noisy outcomes are rejected by the discriminator and the generator then learns from the failures. Since the performance gain from using the proposed framework is significant for noisy data, such properties should be investigated further in future work. In additional experiments, the proposed framework is evaluated with complete noise, interpolations, and another dataset, and shows superior performance as well; the full results of these additional experiments are provided in the Appendix.

We expect that the proposed framework can be a significant breakthrough in predictive neural network modeling, since the proposed framework can estimate the uncertainty of predictions, which conventional neural networks can hardly do, and has the potential to produce superior performance on prediction tasks. Also, due to recent developments in GAN training, the proposed framework has advantages over BNNs in adopting deep architectures and in training convergence, which have been constant issues in using probabilistic neural networks.
References

[1] Squeeze-and-excitation networks
[2] Learning transferable architectures for scalable image recognition
[3] Exploring randomly wired neural networks for image recognition
[4] Deep generative modeling for single-cell transcriptomics
[5] Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation
[6] Weight uncertainty in neural network
[7] Dropout as a Bayesian approximation: Representing model uncertainty in deep learning
[8] What uncertainties do we need in Bayesian deep learning for computer vision
[9] Large scale GAN training for high fidelity natural image synthesis
[10] Progressive growing of GANs for improved quality, stability, and variation
[11] Spectral normalization for generative adversarial networks
[12] Score-guided generative adversarial networks
[13] Improved recurrent generative adversarial networks with regularization techniques and a controllable framework
[14] cGANs with projection discriminator
[15] Self-attention generative adversarial networks
[16] Controllable generative adversarial network
[17] Is generator conditioning causally related to GAN performance?
[18] Learning structured weight uncertainty in Bayesian neural networks
[19] Learning weight uncertainty with stochastic gradient MCMC for shape classification
[20] Bayesian uncertainty estimation for batch normalized deep networks
[21] Generative adversarial networks: An overview
[22] Multi-agent diverse generative adversarial networks
[23] AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks
[24] Statistical parametric speech synthesis incorporating generative adversarial networks
[25] Generative adversarial nets
[26] Conditional generative adversarial nets
[27] Face aging with conditional generative adversarial networks
[28] Auto-painter: Cartoon image generation from sketch by using conditional Wasserstein generative adversarial networks
[29] Multi-asset portfolio optimization and out-of-sample performance: An evaluation of Black-Litterman, mean-variance, and naïve diversification approaches
[30] Optimal mean-variance portfolio selection
[31] Mean-variance portfolio selection with the ordered weighted average
[32] Portfolio management via two-stage deep learning with a joint cost
[33] Regularization methods for generative adversarial networks: An overview of recent studies
[34] CondenseNet: An efficient DenseNet using learned group convolutions
[35] Densely connected convolutional networks