title: Design of Loss Functions for Solving Inverse Problems Using Deep Learning
authors: Rivera, Jon Ander; Pardo, David; Alberdi, Elisabete
date: 2020-05-22
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50420-5_12

Solving inverse problems is a crucial task in several applications that strongly affect our daily lives, including multiple engineering fields, military operations, and energy production. There exist different methods for solving inverse problems, including gradient-based methods, statistics-based methods, and Deep Learning (DL) methods. In this work, we focus on the latter. Specifically, we study the design of proper loss functions for dealing with inverse problems using DL. To do this, we introduce a simple benchmark problem with a known analytical solution. We then propose multiple loss functions and compare their performance when applied to our benchmark example problem. In addition, we analyze how to improve the approximation of the forward function by: (a) considering a Hermite-type interpolation loss function, and (b) reducing the number of samples for the forward training in the Encoder-Decoder method. Results indicate that a correct design of the loss function is crucial to obtain accurate inversion results.

Solving inverse problems [17] is of paramount importance to our society. It is essential in, among others, most areas of engineering (see, e.g., [3, 5]), health (see, e.g., [1]), military operations (see, e.g., [4]), and energy production (see, e.g., [11]). In multiple applications, it is necessary to perform this inversion in real time. This is the case, for example, of geosteering operations for enhanced hydrocarbon extraction [2, 10].

Traditional methods for solving inverse problems include gradient-based methods [13, 14] and statistics-based methods (e.g., Bayesian methods [16]). The main limitation of these kinds of methods is that they lack an explicit construction of the pseudo-inverse operator. Instead, they only evaluate the inverse function for a given set of measurements. Thus, for each set of measurements, we need to perform a new inversion process, which may be time consuming.

Deep Learning (DL) seems to be a proper alternative to overcome the aforementioned problem. With DL methods, we explicitly build the pseudo-inverse operator rather than only evaluating it. Recently, interest in performing inversion using DL techniques has grown exponentially (see, e.g., [9, 15, 18, 19]). However, the design of these methods is still somewhat ad hoc, and it is often difficult to find a comprehensive road map for constructing robust Deep Neural Networks (DNNs) for solving inverse problems.

One major problem when designing DNNs is error control. Several factors may lead to deficient results, including: a poor loss function design, an inadequate architecture, lack of convergence of the optimizer employed for training, and an unsatisfactory database selection. Moreover, it is sometimes elusive to identify the specific cause of poor results. Even more, it is often difficult to assess the quality of the results and, in particular, to determine whether they can be improved.

In this work, we take a simple but enlightening approach to elucidate and design certain components of a DL algorithm when solving inverse problems. Our approach consists of selecting a simple inverse benchmark example with known analytical solution.
By doing so, we are able to evaluate and quantify the effect of different DL design considerations on the inversion results. Specifically, we focus on analyzing a proper selection of the loss function and how it affects the results. While more complex problems may face additional difficulties, those observed with the considered simple example are common to all inverse problems.

The remainder of this article is organized as follows. Section 2 describes our simple model inverse benchmark problem. Section 3 introduces several possible loss functions. Section 4 shows numerical results. Finally, Sect. 5 summarizes the main findings.

We consider a benchmark problem with known analytical solution. Let $F$ be the forward function and $F^\dagger$ the pseudo-inverse operator. We want our benchmark problem to have more than one solution, since this is one of the typical features exhibited by inverse problems. For that, we need $F$ to be non-injective. We select the non-injective function $y = F(x) = x^2$, whose pseudo-inverse has two possible solutions: $x = F^\dagger(y) = \pm\sqrt{y}$ (see Fig. 1). The objective is to design a NN that approximates one of the solutions of the inverse problem.

We consider the domain $\Omega = [-33, 33]$. In it, we select a set of 1000 equidistant points. The corresponding dataset of input-output pairs is computed analytically. In some cases, we perform a change of coordinates in our output dataset. Let us denote by $R$ the linear mapping that sends the outputs of the original dataset into the interval $[0, 1]$. Instead of approximating the function $F$, our NN will approximate the function $F_R$ given by

$$F_R = R \circ F. \quad (1)$$

In the cases where we perform no rescaling, we select $R = I$, where $I$ is the identity mapping.

We consider different loss functions. The objective here is to discern between adequate and poor loss functions for solving the proposed inverse benchmark problem. We denote by $F_\varphi$ and $F^\dagger_\theta$ the NN approximations of the forward function and the pseudo-inverse operator, respectively. The weights $\varphi$ and $\theta$ are the parameters to be trained (optimized) in the NN. Each value within the set of weights is a real number. In a NN, we try to find the weights $\varphi^*$ and $\theta^*$ that minimize a given loss function $\mathcal{L}$. We express our problem mathematically as

$$(\varphi^*, \theta^*) = \arg\min_{\varphi, \theta} \mathcal{L}(\varphi, \theta). \quad (2)$$

We first consider the traditional loss function, which measures the misfit in the inverted space over the training dataset:

$$\mathcal{L}_1(\theta) = \big\| F^\dagger_{\theta, R}\big(R(F(x))\big) - x \big\|. \quad (3)$$

Theorem 1. The minimization problem (2) with the loss function given by Eq. (3) has an analytical solution for our benchmark problem in both the $\ell_1$ norm and the $\ell_2$ norm. Writing $y_i = R(F(x_i))$, these solutions are such that

$$F^\dagger_{\theta^*, R}(y_i) \in \big[-|x_i|,\, |x_i|\big] \quad \text{for all } i \quad (\ell_1 \text{ norm}), \quad (4)$$

$$F^\dagger_{\theta^*, R}(y_i) = 0 \quad \text{for all } i \quad (\ell_2 \text{ norm}). \quad (5)$$

Proof. We first focus on the norm $\|\cdot\|_1$. We minimize the loss function

$$\mathcal{L}_1(\theta) = \sum_{i \in I} \big| F^\dagger_{\theta, R}(y_i) - x_i \big|, \quad (6)$$

where $I = \{1, \ldots, N\}$ denotes the training dataset. Since the dataset is symmetric, each sample $x_i \neq 0$ appears together with its mirror $-x_i$, and both share the same measurement $y_i$, to which the network assigns a single value $\tilde{x}_i = F^\dagger_{\theta, R}(y_i)$. We can therefore express each pair of addends of (6) as follows:

$$\big| \tilde{x}_i - x_i \big| + \big| \tilde{x}_i + x_i \big| = \begin{cases} 2\,|x_i|, & \text{if } \tilde{x}_i \in \big[-|x_i|,\, |x_i|\big], \\ 2\,|\tilde{x}_i|, & \text{otherwise}. \end{cases} \quad (7)$$

Taking the derivative of Eq. (6) with respect to $\tilde{x}_i$, we see in view of Eq. (7) that the loss function attains its minimum at every point $\tilde{x}_i \in \big[-|x_i|,\, |x_i|\big]$.

In the case of the norm $\|\cdot\|_2$, we minimize

$$\mathcal{L}_1(\theta) = \sum_{i \in I} \big( F^\dagger_{\theta, R}(y_i) - x_i \big)^2. \quad (8)$$

Again, grouping the two samples $\pm x_i$ that share the same measurement $y_i$, we can express each pair of addends of Eq. (8) as

$$\big( \tilde{x}_i - x_i \big)^2 + \big( \tilde{x}_i + x_i \big)^2 = 2\,\tilde{x}_i^2 + 2\,x_i^2. \quad (9)$$

Taking the derivative of Eq. (8) with respect to $\tilde{x}_i$ and setting it equal to zero, we obtain

$$4\,\tilde{x}_i = 0. \quad (10)$$

Thus, the function is minimized when the approximated value is $\tilde{x}_i = 0$.

Observation: The problem of Theorem 1 has infinitely many solutions in the $\ell_1$ norm. In the $\ell_2$ norm, the solution is unique; however, it differs from the two desired exact inverse solutions. As seen with the previous loss function, it is inadequate to look at the misfit in the inverted space.
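To make the setup of Sect. 2 and the statement of Theorem 1 concrete, the following minimal NumPy sketch (not taken from the paper; the helper name rescale and the sample value 7.0 are illustrative choices) builds the benchmark dataset, applies the rescaling R, and checks numerically that, for a pair of samples ±x sharing the same measurement y, the traditional l2 misfit in the inverted space is minimized by 0 rather than by either exact branch ±√y.

```python
# Minimal sketch (not the authors' code) of the benchmark setup of Sect. 2
# and of the l2 behaviour stated in Theorem 1.
import numpy as np

# Benchmark dataset: 1000 equidistant samples of F(x) = x^2 on [-33, 33].
x = np.linspace(-33.0, 33.0, 1000)
y = x**2

# Linear rescaling R mapping the outputs onto [0, 1], so that F_R = R o F.
def rescale(y):
    return (y - y.min()) / (y.max() - y.min())

y_r = rescale(y)

# Theorem 1 (l2 case): for a fixed measurement y, the traditional loss
# contributes (a - x)^2 + (a + x)^2 for the two samples +-x that share y,
# which is minimized by a = 0 rather than by either exact branch +-sqrt(y).
a_grid = np.linspace(-10.0, 10.0, 2001)
x_pair = 7.0                              # any sample paired with its mirror -7.0
loss = (a_grid - x_pair)**2 + (a_grid + x_pair)**2
print(a_grid[np.argmin(loss)])            # ~0.0, matching the l2 solution above
```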
Rather, it is desirable to search for an inverse solution such that, after applying the forward operator, we recover our original input. Thus, we consider the following modified loss function, where $F_{R_1}$ corresponds to the analytic forward function:

$$\mathcal{L}_2(\theta) = \big\| F_{R_1}\big( F^\dagger_{\theta, R_2}(y) \big) - y \big\|. \quad (11)$$

Unfortunately, the computation of $F_{R_1}$ required in $\mathcal{L}_2$ involves either (a) implementing $F_{R_1}$ on a GPU, which may be challenging in more complex examples, or (b) calling $F_{R_1}$ as a CPU function multiple times during the training process. Both options may considerably slow down the training process, up to the point of making it impractical.

To overcome the computational problems associated with Eq. (11), we introduce an additional NN, named $F_{R_1, \varphi}$, to approximate the forward function. Then, we propose the following loss function:

$$\mathcal{L}_3(\varphi, \theta) = \big\| F_{R_1, \varphi}(x) - R_1(F(x)) \big\| + \big\| F_{R_1, \varphi}\big( F^\dagger_{\theta, R_2}(y) \big) - y \big\|. \quad (12)$$

Two NNs of this type that are trained simultaneously are often referred to as an Encoder-Decoder [6, 12].

Alternatively, we can train both NNs separately. By doing so, we diminish the training cost. At the same time, it allows us to separate the analysis of both NNs, which may simplify the detection of specific errors in one of the networks. Our loss functions are:

$$\mathcal{L}_{4.1}(\varphi) = \big\| F_{R_1, \varphi}(x) - R_1(F(x)) \big\| \quad (13)$$

and

$$\mathcal{L}_{4.2}(\theta) = \big\| F_{R_1, \varphi^*}\big( F^\dagger_{\theta, R_2}(y) \big) - y \big\|. \quad (14)$$

We first train $F_{R_1, \varphi}$ using $\mathcal{L}_{4.1}$. Once $F_{R_1, \varphi}$ is fixed (with weights $\varphi^*$), we train $F^\dagger_{\theta, R_2}$ using $\mathcal{L}_{4.2}$.

We consider two different NNs. The one approximating the forward function has 5 fully connected layers [8] with the ReLU activation function [7]. The one approximating the inverse operator has 11 fully connected layers with the ReLU activation function. The ReLU activation function is defined as

$$\mathrm{ReLU}(x) = \max(0, x). \quad (15)$$

These NN architectures are "overkilling" for approximating the simple benchmark problem studied in this work. Moreover, we also obtained results for different NN architectures, leading to identical conclusions that we omit here for brevity.

We first produce two models with loss function $\mathcal{L}_2$, using norms $\ell_1$ and $\ell_2$, respectively; in both cases, we recover one of the exact inverse solutions. However, as mentioned in Sect. 3, this loss function entails essential limitations when considering complex problems. Figure 4 shows the results for norm $\ell_1$ and Fig. 5 for norm $\ell_2$. We again recover excellent results, without the limitations of loss function $\mathcal{L}_2$. Coincidentally, different norms recover different solution branches of the inverse problem. Note that in this problem, it is possible to prove that the probability of recovering either of the solution branches is identical.

We now consider the two-step loss function and focus only on the forward function approximation given by Eq. (13). This is frequently the most time-consuming part when solving an inverse problem with NNs. In this section, we analyze different strategies to work with a reduced dataset, which entails a dramatic reduction of the computational cost. We consider a dataset of three input-output pairs $(x, y) = \{(-33, 1089), (1, 1), (33, 1089)\}$. Figure 8 shows the results for norms $\ell_1$ and $\ell_2$. Training data points are accurately approximated. Other points are poorly approximated.

To improve the approximation, we introduce another term in the loss function. We force the NN to approximate the derivatives at each training point. This new loss is

$$\mathcal{L}(\varphi) = \big\| F_{R_1, \varphi}(x) - R_1(F(x)) \big\| + \Big\| \frac{F_{R_1, \varphi}(x + \varepsilon) - F_{R_1, \varphi}(x)}{\varepsilon} - \big( R_1 \circ F \big)'(x) \Big\|. \quad (16)$$

From a numerical point of view, the term that approximates the first derivatives can be very useful. If we think of $x$ as a parameter of a Partial Differential Equation (PDE), we can efficiently evaluate derivatives via the adjoint problem. Figure 9 shows the results when we use norms $\ell_1$ and $\ell_2$ for the training. For this benchmark problem, we select $\varepsilon = 1$. Thus, to approximate derivatives, we evaluate the NN at the points $x + 1$.
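As an illustration of how such a derivative-augmented loss can be implemented, the following PyTorch sketch is offered; it is not the authors' implementation, and the network width, optimizer, learning rate, and iteration count are illustrative assumptions. The derivative term uses the forward finite difference with step ε described above, together with the reduced three-point dataset and no rescaling (R = I).

```python
# Minimal PyTorch sketch (not the authors' code) of a Hermite-type loss in the
# spirit of Eq. (16): a data-misfit term plus a term forcing the finite-difference
# derivative of the network to match the known derivative of the target.
import torch
import torch.nn as nn

class ForwardNet(nn.Module):
    """Small fully connected NN with ReLU activations approximating F(x) = x^2."""
    def __init__(self, width=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1),
        )
    def forward(self, x):
        return self.net(x)

def hermite_loss(model, x, y, dy_dx, eps=1.0, p=2):
    """Data misfit + finite-difference derivative misfit, in the l1 or l2 norm."""
    data_term = torch.norm(model(x) - y, p=p)
    fd_derivative = (model(x + eps) - model(x)) / eps
    der_term = torch.norm(fd_derivative - dy_dx, p=p)
    return data_term + der_term

# Reduced dataset of three training points, as in the experiments above.
x = torch.tensor([[-33.0], [1.0], [33.0]])
y = x**2                      # F(x) = x^2 (no rescaling, R = I)
dy_dx = 2.0 * x               # exact derivative F'(x) = 2x

model = ForwardNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(1000):         # illustrative number of iterations
    optimizer.zero_grad()
    loss = hermite_loss(model, x, y, dy_dx, eps=1.0, p=2)
    loss.backward()
    optimizer.step()
```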
We observe that points near the training points are better approximated via Hermite interpolation, as expected. However, the entire approximation still lacks accuracy and exhibits undesired artifacts due to an insufficient number of training points. Thus, while the use of Hermite interpolation may be highly beneficial, especially in the context of certain PDE problems or when the derivatives are easily accessible, there is still a need for a sufficiently dense database of sampling points. Figure 10 shows the evolution of the terms composing the loss function.

Fig. 10. Evolution of the loss value when we train the NN that approximates $F_{I, \varphi}$ using Eq. (16) as the loss. "Loss F" corresponds to the first term of Eq. (16), "Loss DER" to the second term of Eq. (16), and "Total Loss" to the total value of Eq. (16).

We now consider an Encoder-Decoder loss function, as described in Eq. (12). The objective is to minimize the number of samples employed to approximate the forward function, since producing such a database is often the most time-consuming part in a large class of inverse problems governed by PDEs. We employ a dataset of three input-output pairs $\{(-33, 1089), (1, 1), (33, 1089)\}$ for the first term of Eq. (12) and a dataset of 1000 values of $y$ obtained with an equidistant distribution on the interval $[0, 1089]$ for the second term of Eq. (12).

Figure 11 shows the results of the NNs trained with norm $\ell_1$. Results are disappointing. The forward function is far from the blue line (the real forward function), especially near zero. The forward function leaves excessive freedom for the training of the inverse function. This allows the inverse function to be poorly approximated (with respect to the real inverse function). In order to improve the results, we train the NNs by adding to Eq. (12) a regularization term that promotes smoothness of $F_{R, \varphi}$, leading to the loss function $\mathcal{L}_{3.1}$ of Eq. (17). We evaluate this regularization term over a dataset of 1000 samples obtained with an equidistant distribution on the interval $[-33, 33]$, and we select $\varepsilon = 1$.

Figure 12 shows the results of the NN. Now, the forward function is better approximated around zero. Unfortunately, the approximation is still inaccurate, indicating the need for additional sampling points. Figure 13 shows the evolution of the terms composing the loss function. The loss values associated with the first and the second terms are minimized. The loss corresponding to the regularization term remains the largest one.

Fig. 13. Evolution of the loss value for the Encoder-Decoder method trained with loss function $\mathcal{L}_{3.1}$ and norm $\ell_1$. "Loss F" corresponds to the first term of Eq. (17), "Loss FI" to the second term of Eq. (17), "Loss REG" to the third term of Eq. (17), and "Total Loss" to the total value of Eq. (17).
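Before summarizing, the following PyTorch sketch illustrates how the Encoder-Decoder training described above could be assembled; it is not the authors' implementation. The two-term loss mirrors Eq. (12), while the smoothness regularizer shown is only one plausible choice (a squared finite-difference penalty with step ε and an illustrative weight lam), since the exact form of the third term of Eq. (17) is not reproduced here; network sizes, the optimizer, and the iteration count are likewise assumptions.

```python
# Minimal PyTorch sketch (not the authors' code) of an Encoder-Decoder loss in
# the spirit of Eq. (12), augmented with an assumed smoothness regularizer.
import torch
import torch.nn as nn

def mlp(sizes):
    """Fully connected network with ReLU activations between layers."""
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(fan_in, fan_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])   # drop the trailing ReLU

forward_net = mlp([1, 20, 20, 20, 1])    # approximates the forward function (encoder)
inverse_net = mlp([1, 20, 20, 20, 1])    # approximates the pseudo-inverse (decoder)

# First term of Eq. (12): three input-output pairs for the forward network.
x_fwd = torch.tensor([[-33.0], [1.0], [33.0]])
y_fwd = x_fwd**2
# Second term of Eq. (12): 1000 equidistant measurements y on [0, 1089].
y_cycle = torch.linspace(0.0, 1089.0, 1000).unsqueeze(1)
# Points used by the (assumed) smoothness regularizer, with eps = 1.
x_reg, eps = torch.linspace(-33.0, 33.0, 1000).unsqueeze(1), 1.0

def encoder_decoder_loss(lam=1.0):
    data_term = torch.norm(forward_net(x_fwd) - y_fwd, p=1)
    cycle_term = torch.norm(forward_net(inverse_net(y_cycle)) - y_cycle, p=1)
    # Assumed regularizer: penalize the finite-difference slope of the forward NN.
    smooth_term = torch.mean(((forward_net(x_reg + eps) - forward_net(x_reg)) / eps)**2)
    return data_term + cycle_term + lam * smooth_term

optimizer = torch.optim.Adam(
    list(forward_net.parameters()) + list(inverse_net.parameters()), lr=1e-3)
for _ in range(2000):                    # illustrative number of iterations
    optimizer.zero_grad()
    encoder_decoder_loss().backward()
    optimizer.step()
```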
We analyze different loss functions for solving inverse problems. We demonstrate via a simple numerical benchmark problem that some traditional loss functions are inadequate. Moreover, we propose the use of an Encoder-Decoder loss function, which can also be divided into two loss functions with a one-way coupling. This enables us to decompose the original DL problem into two simpler problems. In addition, we propose adding a Hermite-type interpolation term to the loss function when needed. This may be especially useful in problems governed by PDEs, where the derivative is easily accessible via the adjoint operator. Results indicate that Hermite interpolation provides enhanced accuracy at the training points and in their surroundings. However, we still need a sufficient density of points in our database to obtain acceptable results. Finally, we evaluate the performance of the Encoder-Decoder loss function with a reduced number of samples for the forward function approximation. We observe that the forward function leaves excessive freedom for the training of the inverse function. To partially alleviate that problem, we incorporate a regularization term. The corresponding results improve, but they still show the need for additional training samples.

References

[1] Wave propagation inverse problems in medicine and environmental health
[2] Geosteering and/or reservoir characterization: the prowess of new generation LWD tools
[3] Inverse problems in elasticity
[4] Spherical wave near-field imaging and radar cross-section measurement
[5] Evolutionary methods in inverse problems of engineering mechanics
[6] Learning phrase representations using RNN encoder-decoder for statistical machine translation
[7] Analysis of function of rectified linear unit used in deep learning
[8] Densely connected convolutional networks
[9] Using a physics-driven deep neural network to solve inverse problems for LWD azimuthal resistivity measurements
[10] New directional electromagnetic tool for proactive geosteering and accurate formation evaluation while drilling
[11] Inverting methods for thermal reservoir evaluation of enhanced geothermal system
[12] Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections
[13] Solution of implicitly formulated inverse heat transfer problems with hybrid methods
[14] Solution of inverse problems in elasticity imaging using the adjoint method
[15] Deep learning electromagnetic inversion with convolutional neural networks
[16] Inverse problems: a Bayesian perspective
[17] Inverse Problem Theory and Methods for Model Parameter Estimation
[18] Schlumberger: Borehole resistivity measurement modeling using machine-learning techniques
[19] A fast inversion of induction logging data in anisotropic formation based on deep learning