Learning Integrodifferential Models for Image Denoising
Tobias Alt, Joachim Weickert
2020-10-21

We introduce an integrodifferential extension of the edge-enhancing anisotropic diffusion model for image denoising. By accumulating weighted structural information on multiple scales, our model is the first to create anisotropy through multiscale integration. It follows the philosophy of combining the advantages of model-based and data-driven approaches within compact, insightful, and mathematically well-founded models with improved performance. We explore the trained results for scale-adaptive weighting and contrast parameters to obtain an explicit modelling by smooth functions. This leads to a transparent model with only three parameters, without significantly decreasing its denoising performance. Experiments demonstrate that it outperforms its diffusion-based predecessors. We show that both multiscale information and anisotropy are crucial for its success.

For three decades, models based on nonlinear partial differential equations (PDEs) have been contributing to the progress in image analysis; see e.g. [1, 2, 3, 4] and the references therein. PDE models are compact, transparent, and involve only very few parameters. Moreover, their strict mathematical foundations allow one to establish valuable theoretical guarantees. Well-posedness, for instance, also includes stability w.r.t. perturbations of the input data. More recently, deep learning approaches have been dominating the field [5]. Their huge number of hidden model parameters allows a strong adaptation to the training data, which is the reason for their excellent performance. On the other hand, these approaches still lack transparency and mathematical foundations. They often suffer from well-posedness problems such as a high sensitivity to adversarial attacks [6].

In our paper, we are aiming at the best of two worlds: We benefit from the compactness and mathematical foundations of PDE-inspired modelling, while we improve its performance by learning a small set of parameters. In contrast to most other approaches, however, we compress the model further by understanding the functional dependence of these parameters. In other words: We drive learning-based modelling to the extreme by aiming at the most compact and insightful model. While this philosophy is fairly general, it is best explained by studying an exemplary application. In this paper, we focus on image denoising with edge-enhancing anisotropic diffusion (EED) [7].

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 741215, ERC Advanced Grant INCOVID).

We propose a multiscale extension of EED based on a transparent integrodifferential formulation. Our integrodifferential anisotropic diffusion (IAD) model adapts to image structures by combining edge information from multiple scales and steering the diffusion process accordingly. Given a set of images corrupted by Gaussian noise, we learn scale-adaptive parameters of the diffusion process by optimising the denoising performance. Afterwards, we explore the trained parameters and reduce them by explicitly modelling the parameter dynamics over the scales and the noise level. We show that this is possible without decreasing the denoising performance significantly.
The resulting IAD model outperforms both its integrodifferential isotropic diffusion (IID) counterpart and EED, while requiring only one additional parameter. This shows that both anisotropy and multiscale information are vital for the success of the model.

While PDEs are omnipresent in image analysis, models that explicitly involve the more general integrodifferential equations are surprisingly rare. They are considered occasionally for image decompositions [8], nonlocal generalisations of PDE evolutions [9, 10], and in connection with fractional calculus for linear image processing [11]. These works are only very mildly related to our paper. Since our IAD model sums up contributions from multiple scales, it has some conceptual similarities to wavelet shrinkage. Didas and Weickert [12] relate wavelet shrinkage to nonlinear diffusion, but only for the isotropic case. As each scale is treated independently, anisotropy cannot be achieved. Welk et al. [13] present an anisotropic diffusion method based on wavelet shrinkage, but only on the finest scale. We combine the best of these two ideas into a novel model: Our IAD model combines structural image information over multiple scales to create an anisotropic diffusion process.

Integrodifferential models for nonlinear diffusion predominantly involve models with Gaussian-smoothed derivatives. Most of these models [14, 7, 15, 16] have been proposed as regularisations of the Perona-Malik filter [1]. For enhancing coherent structures, one also considers a smoothed structure tensor [17] that captures directional information to steer the diffusion process accordingly [2]. However, all of these models only consider Gaussian smoothing on a fixed scale and do not incorporate an integration over multiple scales.

Large-scale trainable models involving PDEs have recently become very successful. Chen and Pock [18] train flux functions and derivative filters of a diffusion-inspired model to obtain exceptional denoising results. Other authors only train nonlinearities of models [19, 20, 21] or learn PDEs directly with symbolic approaches [22, 23, 24]. Instead of focusing on denoising performance, we want to obtain insights into the scale behaviour of our model. In contrast to [18], we have stability guarantees in the Euclidean norm and require very few parameters that allow insights into their functional dependency. This helps us to obtain a model which is transparent and outperforms its diffusion-based predecessors.

In Section 2, we review nonlinear diffusion. This serves as a basis for our novel IAD model that is introduced in Section 3. In several experiments in Section 4, we reduce its parameters and compare it to other diffusion filters. Finally, we present our conclusions and give an outlook on future work in Section 5.

The nonlinear diffusion model of Perona and Malik [1] computes filtered versions u(x, y, t) of a degraded image f(x, y) on a rectangular image domain Ω as solutions of the partial differential equation

∂t u = div( g(|∇u|²) ∇u ),    (1)

where ∇ = (∂x, ∂y)ᵀ denotes the spatial gradient operator and |·| the Euclidean norm. The process is initialised with u(x, y, 0) = f(x, y), and on the domain boundary ∂Ω one uses reflecting boundary conditions. Larger diffusion times t yield stronger diffused representations. To enable a diffusion process that respects edges, Perona and Malik use |∇u|² as an edge indicator and employ a scalar-valued diffusivity function that is decreasing in |∇u|², e.g. the exponential diffusivity

gλ(|∇u|²) = exp(−|∇u|² / λ²)    (2)

with a contrast parameter λ > 0. For large values of |∇u|², the diffusivity becomes very small, such that the edge is preserved. However, while a scalar-valued diffusivity function can respect the location of an edge, it is unable to represent its direction. Therefore, we regard the Perona-Malik model as isotropic in the physical sense of diffusion.
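To make the isotropic diffusion process above concrete, the following sketch implements the diffusivity (2) and one explicit step of the Perona-Malik evolution in Python with NumPy. The unit grid size, the boundary handling by edge replication, and the averaging of diffusivities at half-way grid points are illustrative choices rather than the exact discretisation used in the paper.

```python
import numpy as np

def pm_diffusivity(grad_sq, lam):
    """Exponential Perona-Malik diffusivity g_lambda(|grad u|^2) = exp(-|grad u|^2 / lambda^2)."""
    return np.exp(-grad_sq / lam**2)

def pm_step(u, lam, tau):
    """One explicit step of isotropic nonlinear diffusion (Perona-Malik)."""
    up = np.pad(u, 1, mode="edge")               # reflecting boundaries, approximated by edge replication
    ux = 0.5 * (up[1:-1, 2:] - up[1:-1, :-2])    # central difference in x
    uy = 0.5 * (up[2:, 1:-1] - up[:-2, 1:-1])    # central difference in y
    g = pm_diffusivity(ux**2 + uy**2, lam)       # scalar diffusivity at every pixel
    gp = np.pad(g, 1, mode="edge")
    # divergence of g * grad(u), with diffusivities averaged at half-way grid points
    flux = (0.5 * (gp[1:-1, 2:] + g) * (up[1:-1, 2:] - u)
            + 0.5 * (gp[1:-1, :-2] + g) * (up[1:-1, :-2] - u)
            + 0.5 * (gp[2:, 1:-1] + g) * (up[2:, 1:-1] - u)
            + 0.5 * (gp[:-2, 1:-1] + g) * (up[:-2, 1:-1] - u))
    return u + tau * flux
```

Since this diffusivity is bounded by 1, a time step size τ ≤ 0.25 keeps this particular five-point scheme within a maximum-minimum principle.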
At noisy edges, the Perona-Malik model reduces the diffusivity and preserves the edge without denoising it. The edge-enhancing anisotropic diffusion (EED) model [7] avoids this drawback of the Perona-Malik model by replacing its scalar-valued diffusivity by a matrix-valued diffusion tensor. This positive semidefinite 2 × 2 matrix is designed in such a way that smoothing along edges is encouraged, while smoothing across them is inhibited. Its normalised eigenvectors v1 and v2 are chosen parallel and perpendicular to a Gaussian-smoothed image gradient ∇σu, where σ denotes the standard deviation of the Gaussian. The corresponding eigenvalues λ1, λ2 are modelled as

λ1 = gλ(|∇σu|²),    λ2 = 1.

This uniquely defines the diffusion tensor as

Dλ(∇σu) = λ1 v1 v1ᵀ + λ2 v2 v2ᵀ.

Note that anisotropy results from the fact that the eigenvector v1 ∥ ∇σu of the diffusion tensor in general does not coincide with the image gradient ∇u in the product Dλ(∇σu) ∇u. If they coincide, we obtain Dλ(∇σu) ∇u = gλ(|∇σu|²) ∇u, which shows that the process becomes isotropic with a scalar-valued diffusivity. This is also the case for σ → 0.

To prepare ourselves for our novel integrodifferential model, we reformulate the EED model with the structure tensor [17]

J = ∇σu ∇σuᵀ.

By plugging in, we see that the matrix J has eigenvectors v1 ∥ ∇σu and v2 ⊥ ∇σu with eigenvalues ν1 = |∇σu|² and ν2 = 0. Expressing the diffusivity function gλ from (2) in terms of a power series, we can generalise it to matrix-valued arguments such as J. Then gλ(J) is a matrix as well. It has the same eigenvectors v1, v2 as J, and its eigenvalues satisfy λ1 = gλ(ν1) = gλ(|∇σu|²) and λ2 = gλ(ν2) = gλ(0) = 1. Thus, we have Dλ(∇σu) = gλ(J) and can write the EED evolution as

∂t u = div( gλ(J) ∇u ).

It closely resembles the Perona-Malik equation (1). Indeed, by replacing the matrix-valued tensor product J = ∇σu ∇σuᵀ by the scalar-valued inner product ∇σuᵀ ∇σu = |∇σu|² and taking σ → 0, we end up with the Perona-Malik model. Thus, there is an elegant connection between isotropic and anisotropic models.

The choice of the scale σ is crucial for the denoising performance of EED. A single scale may be too coarse for fine-scale details, while at the same time not capturing large-scale structures. We will see that by accumulating information over multiple scales within the structure tensor, we obtain a more useful representation of image information, which in turn leads to improved denoising results.

To account for multiscale information within the structure tensor, we introduce the following integrodifferential anisotropic diffusion (IAD) model:

∂t u = div( gλ(Jγ) ∇u ),

where the multiscale structure tensor

Jγ = ∫₀^∞ ∇σ,γu ∇σ,γuᵀ dσ

accumulates structural information over all smoothing scales σ. This integration generates anisotropy on each scale, since the eigenvectors of Jγ usually are not parallel to ∇σ,γu. To adapt the contribution of each scale, we introduce an additional weight parameter γ(σ) in the smoothed gradient, i.e. ∇σ,γu = ∇(γ(σ) Kσ * u). Here, Kσ is a Gaussian of standard deviation σ. Its normalisation is subsumed by γ(σ). In contrast to other models employing an outer smoothing with a fixed weighting [25, 26], the IAD model allows us to attenuate scales if they do not offer useful information. This weighting, along with the multiscale structure tensor, is the key to the success of the IAD model. The diffusion tensor gλ(Jγ) inherits its eigenvectors from Jγ, and its eigenvalues are given by gλ(µ1) and gλ(µ2), where µ1 and µ2 are the eigenvalues of Jγ. Similar diffusion tensors are considered e.g. in [25, 27]. Moreover, we make the contrast parameter λ(σ) of the diffusivity scale-adaptive. This individually steers the balance between smoothing and edge enhancement on each scale.
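As an illustration of how the multiscale structure tensor Jγ and the diffusion tensor gλ(Jγ) can be assembled, here is a small Python/NumPy sketch. It replaces the integral over σ by a finite sum over discrete scales (anticipating the discretisation described below), uses scipy.ndimage.gaussian_filter for the convolution Kσ * u, and applies a single exponential diffusivity to the eigenvalues of Jγ; the scale-adaptive contrast parameter λ(σ) is omitted for simplicity, and all function names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multiscale_structure_tensor(u, sigmas, gammas):
    """Discrete multiscale structure tensor J_gamma:
    a weighted sum over scales of outer products of smoothed image gradients."""
    J = np.zeros(u.shape + (2, 2))
    for sigma, gamma in zip(sigmas, gammas):
        us = gamma * gaussian_filter(u, sigma)   # weighted Gaussian smoothing on scale sigma
        uy, ux = np.gradient(us)                 # smoothed gradient (row and column direction)
        J[..., 0, 0] += ux * ux
        J[..., 0, 1] += ux * uy
        J[..., 1, 0] += ux * uy
        J[..., 1, 1] += uy * uy
    return J

def diffusion_tensor(J, lam):
    """Matrix-valued diffusivity g_lambda(J_gamma): keep the eigenvectors of J_gamma
    and apply the exponential diffusivity to its eigenvalues."""
    mu, V = np.linalg.eigh(J)                    # per-pixel eigenvalues and eigenvectors
    g = np.exp(-mu / lam**2)                     # diffusivity applied to each eigenvalue
    return np.einsum("...ij,...j,...kj->...ik", V, g, V)
```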
An isotropic counterpart of the IAD model, which we call integrodifferential isotropic diffusion (IID), arises directly by switching the order of transposition within the structure tensor. In this case, the accumulated quantity ∫₀^∞ ∇σ,γuᵀ ∇σ,γu dσ = ∫₀^∞ |∇σ,γu|² dσ is no structure tensor any more, but a multiscale gradient magnitude. In our experiments, we will use the IID model for comparisons.

In a practical setting, we need a discrete version of the continuous IAD model. To this end, we employ an explicit finite difference scheme: We discretise the temporal derivative by a forward difference with time step size τ, and the right-hand side by the nonnegativity discretisation from [2]. Additionally, we select a set of N discrete scales σ1, . . . , σN according to an exponential sampling. This yields discrete weights γi = γ(σi) and contrast parameters λi = λ(σi), which we obtain by training. We iterate the method for a small number of K explicit steps. This yields an approximation of the continuous model for a diffusion time of T = Kτ. Iterating the IAD model allows us to capture a nonlinear evolution. As usual, our explicit scheme must satisfy a time step size restriction that is governed by the spectral radius of the iteration matrix; see e.g. [28] for more details. Then the following two important properties are easy to establish:

• The discrete IAD model is stable in the Euclidean norm, i.e. the Euclidean norm of the evolving image is nonincreasing in each step.
• Since the algorithm consists of a concatenation of continuous function evaluations, it constitutes a well-posed discrete evolution. In particular, this implies that the output depends continuously on the input data.

To train the 2N parameters γi and λi of the discrete model, we consider a training set of 200 grey value images of size 256×256 with grey value range [0, 255]. We crop them from the BSDS500 dataset [29] and corrupt them with additive Gaussian noise of standard deviation s. The resulting grey values are not cut off if they exceed the original grey value range, in order to preserve the Gaussian statistics of the noise. A disjoint test set of 100 images is generated accordingly. We train the model with a sufficient number of explicit steps by minimising the average mean square error over the training set. Moreover, we always choose τ in such a way that the scheme is stable and allows for a suitable diffusion time. We found that K = 10 steps are already sufficient for noise levels up to s = 60.
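To illustrate the discrete evolution described above, the following sketch combines exponential scale sampling with K explicit steps, reusing multiscale_structure_tensor and diffusion_tensor from the previous sketch. The sampling bounds, the time step size, and the plain central differences standing in for the nonnegativity discretisation from [2] are assumptions made only for this illustration.

```python
import numpy as np

def exponential_scales(sigma_min=0.5, sigma_max=8.0, N=8):
    """N smoothing scales sampled exponentially between sigma_min and sigma_max (bounds are illustrative)."""
    return np.geomspace(sigma_min, sigma_max, N)

def iad_step(u, sigmas, gammas, lam, tau):
    """One explicit step u <- u + tau * div(g(J_gamma) * grad u),
    with simple central differences in place of the nonnegativity discretisation of [2]."""
    D = diffusion_tensor(multiscale_structure_tensor(u, sigmas, gammas), lam)
    uy, ux = np.gradient(u)
    jx = D[..., 0, 0] * ux + D[..., 0, 1] * uy   # first component of the flux D * grad u
    jy = D[..., 1, 0] * ux + D[..., 1, 1] * uy   # second component of the flux
    div = np.gradient(jx, axis=1) + np.gradient(jy, axis=0)
    return u + tau * div

def iad_denoise(f, sigmas, gammas, lam, tau=0.2, K=10):
    """K explicit steps approximate the continuous model at diffusion time T = K * tau."""
    u = f.astype(float)
    for _ in range(K):
        u = iad_step(u, sigmas, gammas, lam, tau)
    return u
```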
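The data generation and the training objective described above admit an equally compact sketch; the helper names and the random number generator are illustrative, and, as in the text, no clipping is applied after adding the noise.

```python
import numpy as np

rng = np.random.default_rng()

def corrupt(image, s):
    """Additive Gaussian noise of standard deviation s; grey values are deliberately
    not cut off at [0, 255], which preserves the Gaussian statistics of the noise."""
    return image + rng.normal(0.0, s, image.shape)

def average_mse(denoised_images, clean_images):
    """Average mean square error over the training set (the training loss)."""
    return np.mean([np.mean((d - c)**2) for d, c in zip(denoised_images, clean_images)])
```

Minimising this loss over the discrete weights γi and contrast parameters λi yields the trained model; the choice of optimiser is not fixed by this sketch.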
To gain insights into the behaviour of the IAD model, we develop smooth relations for the discrete trained parameters in terms of the scale σ and the noise standard deviation s. To this end, we train the full model for K = 10 explicit steps and N = 8 discrete scales for varying noise standard deviations s ∈ {10, 20, . . . , 60}. To ensure that the parameters γi and λi vary smoothly over the scales, we add a small regularisation to the optimisation loss. It penalises variations of γ and λ over the scales in the squared Euclidean norm. We present the resulting parameter dynamics over the scales in Fig. 1 for two noise levels each.

For the weight parameters γi, we find that the importance of the structural information decreases with the scale. Note that we normalise the weight parameters such that γ1 = 1, as a global rescaling can be subsumed within the time step size τ. The contribution of smoothing scales with σ > 5 is almost non-existent. This is in accordance with results for scale-adaptive wavelet shrinkage [21], where useful shrinkage is only performed on the finer scales. Furthermore, we observe that larger noise levels s lead to a higher weighting of larger scales: As the images are corrupted more strongly, larger scales are needed to extract useful structural information. Based on Fig. 1, we approximate the parameter dynamics of the weight function γ(σ, s) by a smooth decay over the scales with a learned scalar α > 0 that steers the decay.

Similarly, we also see that the nonlinear response of the diffusivity function gλ decreases with larger scales, however not as rapidly as the weights γ. As noise is attenuated on coarser scales and structural information dominates, the balance between structure preservation and noise removal is shifted accordingly by decreasing λ. Outliers for large scales can be ignored, as γ → 0 dampens the influence of λ. When increasing the noise level, the contrast parameters scale linearly, a relation which has also been observed for wavelet shrinkage [30]. Therefore, we model the contrast parameters λ(σ, s) by a smooth function that is linear in s, with learned scalars β, λ0 > 0, where β determines the decay of the curve over the scales and λ0 its value at σ = 0. Through these insights, we have effectively reduced the parameter set to only three trainable parameters α, β, λ0, one more than for EED.

In an ablation study, we now show that we hardly sacrifice any performance for the sake of parameter reduction. In a first step, we train the full model for K = 10 explicit steps, N = 8 scales, and noise levels s ∈ {10, 20, . . . , 60} without any smoothness regularisation. With two parameters per scale and noise level, this amounts to 96 parameters. Afterwards, we train the parameters α, β, and λ0 of the reduced model for the same configuration jointly for all noise levels. We obtain α = 1.64, β = 2.46, and λ0 = 1.47. On average, the fully trained model with 96 parameters performs only 0.07 dB better than the reduced model with three parameters in terms of peak signal-to-noise ratio (PSNR). This marginal performance difference shows that the parameter relations which we have modelled represent the true parameter dynamics well.

We compare the reduced model to the isotropic Perona-Malik model (PM) [1] and to EED [7]. Additionally, we consider IID as the isotropic variant of the IAD model. This four-way comparison is designed in an ablative way to show that both multiscale modelling and anisotropy are crucial for the success of the IAD model: Multiscale information is not considered for EED and PM, while anisotropy is not used for PM and IID. We explicitly compare only to other diffusion-based methods, as it is our intention to transparently improve those compact and insightful models rather than to produce state-of-the-art results. To ensure a fair comparison, all models are trained with the exponential Perona-Malik diffusivity. For PM and EED, we optimise the parameters for each noise level individually. For IID and IAD, we use the reduced parameters over the full noise range.
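The comparisons that follow are reported as average PSNR over the test set. For grey values in [0, 255], this standard quantity can be computed as below; it is included only for completeness.

```python
import numpy as np

def psnr(denoised, clean, max_val=255.0):
    """Peak signal-to-noise ratio in decibels for grey values in [0, max_val]."""
    mse = np.mean((denoised - clean)**2)
    return 10.0 * np.log10(max_val**2 / mse)

def average_psnr(denoised_images, clean_images):
    """Average PSNR over a test set, as reported in Table 1."""
    return np.mean([psnr(d, c) for d, c in zip(denoised_images, clean_images)])
```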
Fig. 2 shows representative denoising results on the cameraman image for s = 50. For the Perona-Malik model, the optimal contrast parameter is too small to remove extreme noise outliers, but too large to preserve edges. EED yields sharp edges in the foreground. However, due to the single smoothing scale, it is not able to denoise the background efficiently, where a much larger smoothing scale would be appropriate. The IID and IAD models do not suffer from this effect, as the multiscale information can identify structures on all scales. However, IID suffers from remaining noise around edges. This is an intrinsic drawback of the isotropic model, which detects only the location of an edge, but not its orientation. Finally, IAD combines edge enhancement with efficient denoising in homogeneous regions.

Lastly, we compare the performance of the four models on different noise levels. Table 1 shows the average PSNR on the test set. Surprisingly, the IID model already beats EED for Gaussian noise with standard deviation s ≥ 30. With its additional anisotropy, the IAD model consistently outperforms its competitors. With higher amounts of noise, IID and IAD widen the gap to their PDE-based predecessors, indicating that multiscale information becomes increasingly important with rising noise levels.

We have shown that one can elegantly introduce multiscale information within an anisotropic diffusion process by considering integrodifferential models. Learning was involved to uncover the optimal scale dynamics of the resulting model. Apart from their evident merits of improving anisotropic diffusion methods, the findings in our paper are of a more fundamental nature in two respects: Firstly, we have seen that integrodifferential equations are hitherto hardly explored, but very promising, extensions of differential equations. They are not only more general, but also allow new possibilities for data adaptation, e.g. by creating anisotropy through multiscale integration. Secondly, our paper illustrates the potential of learning with maximal model reduction. Such approaches can benefit from the best of two worlds: the transparency and mathematical foundations of model-based techniques, and the performance improvements of data-driven, learning-based strategies. Showing these advantages in a more general context is part of our future research. We are exploring applications beyond denoising, and we will also study integrodifferential extensions of other important PDE models.
References

[1] Scale space and edge detection using anisotropic diffusion
[2] Anisotropic Diffusion in Image Processing
[3] Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations
[4] Scale Space and Variational Methods in Computer Vision
[5] Deep Learning
[6] Explaining and harnessing adversarial examples
[7] Theoretical foundations of anisotropic diffusion in image processing
[8] Integro-differential equations based on (BV, L¹) image decomposition
[9] Nonlocal operators with applications to image processing
[10] A linear scale-space theory for continuous nonlocal evolutions, in: Scale Space and Variational Methods in Computer Vision
[11] Image processing by means of a linear integro-differential equation
[12] Integrodifferential equations for continuous multiscale wavelet shrinkage
[13] From tensor-driven diffusion to anisotropic wavelet shrinkage
[14] Image selective smoothing and edge detection by nonlinear diffusion
[15] A general framework for geometry-driven evolution equations
[16] Relations between regularization and diffusion filtering
[17] A fast operator for detection and precise location of distinct points, corners and centres of circular features
[18] Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration
[19] Shrinkage fields for effective image restoration
[20] Learning stable nonlinear cross-diffusion models for image restoration
[21] Learning a generic adaptive wavelet shrinkage function for denoising
[22] Learning partial differential equations via data discovery and sparse optimization
[23] PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network
[24] Universal differential equations for scientific machine learning
[25] Diffusion and regularization of vector- and matrix-valued images
[26] Vector-valued image interpolation by an anisotropic diffusion-projection PDE, in: Scale Space and Variational Methods in Computer Vision
[27] Justifying tensor-driven diffusion from structure-adaptive statistics of natural images
[28] Properties of higher order nonlinear diffusion filtering
[29] Contour detection and hierarchical image segmentation
[30] A discriminative approach for wavelet denoising