Translating Diffusion, Wavelets, and Regularisation into Residual Networks

Tobias Alt, Joachim Weickert, Pascal Peter

2020-02-07

Convolutional neural networks (CNNs) often perform well, but their stability is poorly understood. To address this problem, we consider the simple prototypical problem of signal denoising, where classical approaches such as nonlinear diffusion, wavelet-based methods and regularisation offer provable stability guarantees. To transfer such guarantees to CNNs, we interpret numerical approximations of these classical methods as a specific residual network (ResNet) architecture. This leads to a dictionary which allows us to translate diffusivities, shrinkage functions, and regularisers into activation functions, and enables a direct communication between the four research communities. On the CNN side, it not only inspires new families of nonmonotone activation functions, but also introduces intrinsically stable architectures for an arbitrary number of layers.

In view of the undeniable success of deep learning approaches in all areas of data science (LeCun et al., 1998; 2015; Schmidhuber, 2015; Goodfellow et al., 2016), there is a strong need to put them on a solid ground by establishing their mathematical foundations. An analytic way towards this ambitious goal is to express successful CNN architectures in terms of well-founded mathematical concepts. However, deep learning offers a plethora of design possibilities, and modern architectures may involve hundreds of layers and millions of parameters. Thus, this way of reducing a highly complex system to a simple and transparent mathematical model is not only very burdensome, but also bears the danger of losing performance-critical CNN features along the way. An alternative, synthetic way uses well-established models that offer deep mathematical insights to build simple components of neural architectures which inherit these qualities. While this constructive road seems less stony, it has been explored surprisingly little.

In this paper, we follow the road less taken and drive its simplicity to the extreme. Thus, at this point it is not our intention to design highly sophisticated architectures that produce state-of-the-art results in benchmarks. We do not even present any experiments at all. Our goal is to gain theoretical insights that can be useful to suggest neural architectures that are simpler, more compact, involve fewer parameters, and benefit from provable stability guarantees. We establish a comprehensive framework that for the first time allows us to translate diffusion methods, wavelet approaches, and variational techniques simultaneously into a specific CNN architecture. The reason for choosing three denoising techniques lies in the intrinsic stability of denoising: Noise is a perturbation of the input data, which is not supposed to change the denoised output substantially. To maximise transparency and notational simplicity, we restrict ourselves to the 1D setting and choose particularly simple representatives in each class: Perona-Malik diffusion, Haar wavelet shrinkage, smooth first order variational models, and a single ResNet block. We show that discrete formulations of all three denoising approaches can be expressed as a specific ResNet block. It inherits its stability directly from the three denoising algorithms.
Thus, a ResNet consisting only of these blocks is stable for any number of layers. Whereas typical CNNs learn convolution weights and fix the nonlinear activation function to a simple design, we proceed in the opposite way: We fix the convolution kernels and study various nonlinear activation functions that are inspired by the diffusivities, shrinkage functions, and variational regularisers. For researchers from the diffusion, wavelet or variational communities, this introduces a dictionary that allows them to translate their methods directly into CNN architectures. Deep learning researchers will find hitherto unexplored nonmonotone activation functions and new motivations for existing ones.

We see our paper in the tradition of a number of recent contributions to the mathematical foundations of deep learning. For a detailed survey that covers results until 2017, we refer to (Vidal et al., 2017). The following review focuses on works that are relevant for our paper and does not discuss other interesting aspects such as expressiveness (Poggio et al., 2017; Rolnick & Tegmark, 2018; Gribonval et al., 2019) and information theoretic interpretations (Shwartz-Ziv & Tishby, 2017) of CNNs.

The seminal work of Bruna & Mallat (2013) employs wavelet operations to come up with scattering networks that perform well on classification problems. This has been the starting point for a number of wavelet-inspired CNNs; see e.g. (Wiatowski & Bölcskei, 2017; Fujieda et al., 2018; Williams & Li, 2018) and the references therein. Usually they exploit the spectral information or the multiscale nature of wavelets. Our work, however, focuses on iterated shift-invariant wavelet shrinkage on a single scale and utilises its connection to diffusion processes (Mrázek et al., 2005).

Gaining insights into CNNs is possible by studying their energy landscapes (Nguyen & Hein, 2017; Chaudhari et al., 2018; Li et al., 2018; Draxler et al., 2018), their optimality conditions (Haeffele & Vidal, 2017), and by interpreting them as regularising architectures (Kukačka et al., 2017; Ulyanov et al., 2018; Dittmer et al., 2019; Kobler et al., 2020). They can also be connected to incremental proximal gradient methods (Kobler et al., 2017; Bibi et al., 2019), which are suited to nonsmooth optimisation problems. In the present paper we advocate an interpretation in terms of iterated smooth energy minimisation problems. We do not require proximal steps and can exploit direct connections to diffusion filters (Scherzer & Weickert, 2000; Radmoser et al., 2000).

A prominent avenue to establish mathematical foundations of CNNs is through their analysis in terms of stability. This can be achieved by studying their invertibility properties (Behrmann et al., 2018; Chang et al., 2018), by exploiting sparse coding concepts (Romano et al., 2019), and by interpreting deep learning as a parameter identification or optimal control problem for ordinary differential equations (Haber & Ruthotto, 2017; Thorpe & van Gennip, 2018; Zhang & Schaeffer, 2019). CNNs can also be connected to flows of diffeomorphisms (Rousseau et al., 2019) and to parabolic or hyperbolic PDEs (Li & Shi, 2017; Arridge & Hauptmann, 2019; Smets et al., 2020), where it is possible to transfer L² stability results (Ruthotto & Haber, 2019). Our work focuses on diffusion PDEs. They allow us to establish stricter stability notions such as L∞ stability and sign stability. We argue for shifting the focus of CNN models towards more sophisticated and also nonmonotone activation functions.
The CNN literature offers only a few examples of training activation functions (Kligvasser et al., 2018) or designing them in a well-founded and flexible way (Unser, 2019). Nonmonotone activation functions have been suggested already before the advent of deep learning (De Felice et al., 1993; Meilijson & Ruppin, 1994), but fell into oblivion afterwards. We revitalise this idea by providing a natural justification from the theory of diffusion filtering.

Our paper is structured as follows. We review general formulations of nonlinear diffusion, wavelet shrinkage, variational regularisation, and residual networks in Section 2. In Section 3, we present numerical approximations for three instances of the classical models. We interpret them in terms of a specific residual network architecture, for which we derive explicit stability guarantees. This leads to a dictionary for translating the nonlinearities of the three classical methods to activation functions, which is presented in Section 4. For a selection of the most popular nonlinearities, we derive their counterparts and discuss the results in detail. Finally, we summarise our conclusions in Section 5.

This section sketches nonlinear diffusion, wavelet shrinkage, variational regularisation, and residual networks in a general fashion. For each model, we highlight and discuss its central design choice. To ensure a consistent notation, all models in this section produce an output signal u from an input signal f. We define continuous one-dimensional signals u, f as mappings from a signal domain Ω = [a, b] to a codomain [c, d]. We employ reflecting boundary conditions on the signal domain boundaries a and b. The discrete signals u, f ∈ R^N are obtained by sampling the continuous functions at N equidistant positions with grid size h.

In nonlinear diffusion (Perona & Malik, 1990), filtered versions u(x, t) of an initial signal f(x) are computed as solutions of the nonlinear diffusion equation

∂_t u = ∂_x ( g(∂_x u) ∂_x u )   (1)

with initial condition u(x, 0) = f(x) and diffusion time t. The evolution creates gradually simplified versions of f. The central design choice lies in the nonnegative and bounded diffusivity g(r), which controls the amount of smoothing depending on the local structure of the evolving signal. Choosing the constant diffusivity g(r) = 1 (Iijima, 1962) leads to a homogeneous diffusion process that smoothes the signal equally at all locations. A more sophisticated diffusivity such as the exponential Perona-Malik diffusivity g(r) = exp(−r² / (2λ²)) (Perona & Malik, 1990) inhibits smoothing around discontinuities where |∂_x u| is larger than the contrast parameter λ. This allows discontinuity-preserving smoothing.

Classical discrete wavelet shrinkage (Donoho & Johnstone, 1994) manipulates a discrete signal f within a wavelet basis. With the following three-step framework, one obtains a filtered signal u:

1. Analysis: One transforms the input signal f to wavelet and scaling coefficients by a transformation W.
2. Shrinkage: A scalar shrinkage function S(r) with a threshold parameter θ is applied component-wise to the wavelet coefficients. The scaling coefficients remain unchanged.
3. Synthesis: One applies a back-transformation W̃ to the manipulated coefficients to obtain the result u.

Besides the choice of the wavelet basis, the result is strongly influenced by the shrinkage function S(r).
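To make this three-step framework concrete, the following is a minimal NumPy sketch of a single-level Haar analysis, shrinkage, and synthesis pass. It is an illustration under simplifying assumptions: the function names are ours, the signal length is assumed to be even, the standard soft shrinkage formula serves as the example nonlinearity S, and the shift-invariant variant discussed below is omitted.

```python
import numpy as np

def soft_shrink(d, theta):
    # Standard soft shrinkage: pull every coefficient towards zero by theta.
    return np.sign(d) * np.maximum(np.abs(d) - theta, 0.0)

def haar_shrinkage(f, theta):
    """Single-level Haar wavelet shrinkage: analysis, shrinkage, synthesis."""
    f = np.asarray(f, dtype=float)
    assert f.size % 2 == 0, "sketch assumes an even signal length"
    # 1. Analysis: orthonormal Haar transform on non-overlapping pairs.
    s = (f[0::2] + f[1::2]) / np.sqrt(2.0)   # scaling coefficients
    d = (f[0::2] - f[1::2]) / np.sqrt(2.0)   # wavelet coefficients
    # 2. Shrinkage: only the wavelet coefficients are modified.
    d = soft_shrink(d, theta)
    # 3. Synthesis: invert the Haar transform.
    u = np.empty_like(f)
    u[0::2] = (s + d) / np.sqrt(2.0)
    u[1::2] = (s - d) / np.sqrt(2.0)
    return u

u = haar_shrinkage([1.0, 1.1, 0.9, 5.0], theta=0.2)  # small differences are flattened
```

Cycle spinning, introduced next, would average the result of such a routine over all cyclic shifts of f.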
The hard shrinkage function (Mallat, 1999) eliminates all coefficients with a magnitude smaller than the threshold parameter, while the soft shrinkage function (Donoho, 1995) additionally modifies the remaining coefficients equally. The classical wavelet transformation is not shift-invariant: Transforming a shifted input signal changes the resulting set of coefficients. To address this, cycle spinning was proposed in (Coifman & Donoho, 1995), which averages the results of wavelet shrinkage over all possible shifts of the input signal. This shift-invariant wavelet transformation will serve as a basis for connecting wavelet shrinkage to nonlinear diffusion.

Variational regularisation (Whittaker, 1923; Tikhonov, 1963) pursues the goal of finding a function u(x) that minimises an energy functional. A general formulation of such a functional is given by

E(u) = D(u, f) + α R(u),

where a data term D(u, f) drives u towards some input data f, and a regularisation term R(u) enforces smoothness conditions on u. We can control the balance between both terms by the regularisation parameter α > 0. A solution minimising the energy is found via the Euler-Lagrange equations (Gelfand & Fomin, 2000). These are PDEs that describe necessary conditions for a minimiser. A simple yet effective choice for the regularisation term is a first order regularisation of the form R(u) = Ψ(∂_x u) with a regulariser Ψ(r). The nonlinear function Ψ penalises variations in ∂_x u to enforce a certain smoothness condition on u. Choosing e.g. Ψ(r) = r² yields Whittaker-Tikhonov regularisation (Whittaker, 1923; Tikhonov, 1963).

Residual networks (He et al., 2016) are a popular CNN architecture as they are easy to train, even for a high number of network layers. They consist of chained residual blocks. A residual block is made up of two convolutional layers with biases and nonlinear activation functions after each layer. Each block computes the output signal u from an input signal f by

u = σ₂( f + W₂ σ₁( W₁ f + b₁ ) + b₂ ),

with discrete convolution matrices W₁, W₂, activation functions σ₁, σ₂, and bias vectors b₁, b₂. The main difference to general feed-forward CNNs lies in the skip connection, which adds the original input signal f to the output of the second convolution before the outer activation is applied. This helps with the vanishing gradient problem and substantially improves training performance. The crucial difference between residual networks and the three previous approaches is the design focus: The three classical methods consider complex nonlinear modelling functions, while CNNs mainly focus on learning convolution weights and use simple activation functions. We will see that by permitting more general activation functions, we can relate all four methods within a unifying framework.

Now we are in a position to present numerical approximations for the three classical models that allow us to interpret them in terms of a residual network architecture. In practice, the continuous diffusion process is discretised and iterated to approximate the continuous solution u(x, T) for a stopping time T. With the help of the flux function Φ(r) = g(r) r, we rewrite the diffusion equation (1) as

∂_t u = ∂_x ( Φ(∂_x u) ).

For this equation, we perform a standard discretisation in the spatial and the temporal domain. This yields an explicit scheme which can be iterated. Starting with an initial signal u^0 = f, the evolving signal u^k at a time step k is used to compute u^{k+1} at the next step by

u_i^{k+1} = u_i^k + (τ/h) ( Φ( (u_{i+1}^k − u_i^k)/h ) − Φ( (u_i^k − u_{i−1}^k)/h ) ).   (6)

Here the temporal derivative is discretised by a forward difference with time step size τ.
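Before turning to the spatial discretisation, it is instructive to write down the residual block from above in code form, since the explicit scheme will be mapped onto exactly this structure in the next step. The following NumPy sketch is schematic only: the function names are ours, zero padding stands in for the reflecting boundary conditions used elsewhere, and ReLU serves as a placeholder for the simple default activation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(f, w1, b1, w2, b2, sigma1=relu, sigma2=relu):
    """One 1D residual block: u = sigma2( f + W2 * sigma1( W1 * f + b1 ) + b2 ).

    w1, w2 are 1D convolution kernels; the skip connection adds the input f
    back in before the outer activation is applied."""
    inner = sigma1(np.convolve(f, w1, mode="same") + b1)
    return sigma2(f + np.convolve(inner, w2, mode="same") + b2)

f = np.array([0.0, 0.2, 1.0, 0.8, 0.1])
u = residual_block(f, w1=np.array([1.0, -1.0]), b1=0.0,
                      w2=np.array([0.5, 0.5]), b2=0.0)
```

In a standard ResNet, w1, w2, b1, b2 would be learned and sigma1, sigma2 fixed to a simple function such as ReLU; the diffusion block derived next inverts this design choice.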
We apply a forward difference to implement the inner spatial derivative operator and a backward difference for the outer spatial derivative operator. Both can be realised with a simple convolution. To obtain a scheme which is stable in the L∞ norm, one can show that the time step size must fulfil

τ ≤ h² / (2 g_max),   (7)

where g_max is the maximum value that the diffusivity g(r) = Φ(r)/r can attain (Weickert, 1998). This guarantees a maximum-minimum principle, stating that the values of the filtered signal u^k do not lie outside the range of the original signal f. To achieve a substantial filter effect, one often needs a diffusion time T that exceeds the limit in (7). Then one concatenates m explicit steps such that mτ = T and τ satisfies (7).

In order to translate diffusion into residual networks, we rewrite the explicit scheme (6) in matrix-vector form:

u^{k+1} = u^k + τ D_h^− Φ( D_h^+ u^k ),

where D_h^+ and D_h^− are convolution matrices denoting forward and backward difference operators with grid size h, respectively. In this notation, the resemblance to a residual block becomes apparent: One identifies W₁ = D_h^+, σ₁ = Φ, W₂ = τ D_h^−, and σ₂ = Id, and the bias vectors b₁, b₂ are set to 0. We see that the convolutions implement forward and backward difference operators. Crucially, the inner activation function σ₁ corresponds to the flux function Φ. The effect of the skip connection in the residual block also becomes clear now: It is the central ingredient to realise a time discretisation. We call a block of this form a diffusion block.

A consequence of this equivalence is that a residual network chaining m diffusion blocks with activation function Φ(r) and time step size τ approximates a nonlinear diffusion process with stopping time T = mτ and diffusivity g(r) = Φ(r)/r. These insights enable us to translate a diffusivity g directly into an activation function Φ:

Φ(r) = g(r) r.   (10)

While this translation from a diffusion step to a residual block appears simple, it will serve as the Rosetta stone in our dictionary: It also allows us to connect wavelet methods and variational regularisation to residual networks, since both paradigms can be related to diffusion (Mrázek et al., 2005; Scherzer & Weickert, 2000). To keep our paper self-contained, let us now sketch these correspondences.

To explore the connection between wavelet shrinkage and nonlinear diffusion, Mrázek et al. (2005) consider shift-invariant Haar wavelet shrinkage on the finest scale. The missing multiscale structure is compensated by iterating this shrinkage. They show that one step with shrinkage function S(r) is equivalent to an explicit diffusion step with diffusivity g(r), grid size h = 1, and time step size τ if the shrinkage function and the flux function satisfy the coupling relation (11). The L∞ stability condition (7) from the diffusion case translates into a condition on the shrinkage function, −r ≤ S(r) ≤ r for r > 0, which is less restrictive than the typical design principle 0 ≤ S(r) ≤ r for r > 0. Mrázek et al. show that the latter one leads to a sign stable process in the sense of (Schoenberg, 1930), i.e. the resulting signal shows no more sign changes than the input signal. This is a stronger stability notion than L∞ stability. It limits the time step size to τ ≤ h² / (4 g_max). This is half the bound of (7).

Scherzer & Weickert (2000) consider an energy functional with a quadratic data term and a regulariser Ψ(r):

E(u) = ∫_Ω ( (u − f)² + α Ψ(∂_x u) ) dx.

The corresponding Euler-Lagrange equation for a minimiser u of the functional reads

(u − f) / α = ∂_x ( Ψ′(∂_x u) / 2 ).

This can be regarded as a fully implicit time discretisation for a nonlinear diffusion process with stopping time T = α and diffusivity

g(r) = Ψ′(r) / (2r).   (15)
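As a concrete illustration of the diffusion block introduced above, the following NumPy sketch implements one explicit step u^{k+1} = u^k + τ D_h^− Φ(D_h^+ u^k) with the Perona-Malik flux as inner activation. It is a sketch under stated assumptions: the function names and parameter values are ours, the difference operators are written out directly instead of as convolution matrices, and the reflecting boundary conditions are realised by zero flux across the two outer interfaces.

```python
import numpy as np

def pm_flux(r, lam=1.0):
    # Inner activation sigma1 = Phi: Perona-Malik flux Phi(r) = r * exp(-r^2 / (2 lam^2)).
    return r * np.exp(-r**2 / (2.0 * lam**2))

def diffusion_block(u, tau, h=1.0, lam=1.0):
    """One explicit diffusion step, i.e. a residual block with fixed difference
    'convolutions' (W1 = D_h^+, W2 = tau * D_h^-), zero biases, inner activation Phi,
    and identity outer activation."""
    fwd = (u[1:] - u[:-1]) / h                     # forward differences D_h^+ u
    flux = pm_flux(fwd, lam)                       # sigma1 = Phi
    flux = np.concatenate(([0.0], flux, [0.0]))    # reflecting boundaries: zero boundary flux
    return u + (tau / h) * (flux[1:] - flux[:-1])  # skip connection + backward difference

# Chaining m blocks approximates diffusion up to time T = m * tau. For the Perona-Malik
# diffusivity g_max = 1, so with h = 1 the bound (7) allows tau <= 0.5, and the stricter
# sign stability bound allows tau <= 0.25 (illustrative parameter choices).
u = np.array([0.0, 0.1, 0.0, 1.0, 0.9, 1.0])
for _ in range(20):
    u = diffusion_block(u, tau=0.25)
```

Fixing the kernels to difference operators and choosing the inner activation as a flux function is precisely what turns the generic residual block into a provably stable denoising step.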
This regularisation process can also be approximated by m explicit diffusion steps of type (6) with time step size τ = α/m (Radmoser et al., 2000), where m is chosen such that the stability condition (7) holds.

The connections established so far imply direct stability guarantees for networks consisting of diffusion blocks.

Theorem 2. A residual network chaining any number of diffusion blocks with time step size τ, grid size h, and antisymmetric activation function Φ(r) with finite Lipschitz constant L is stable in the L∞ norm if

τ ≤ h² / (2L).   (16)

It is also sign stable if the bound is chosen half as large.

Since Φ(r) = g(r) r and g is a symmetric diffusivity with bound g_max, it follows that L = g_max. Thus, (16) is the network analogue of the stability condition (7) for an explicit diffusion step. In the same way, the diffusion block inherits its sign stability from the sign stability condition of wavelet shrinkage. Stability of the full network follows by induction. Note that our results in terms of L∞ or sign stability are stricter stability notions than the L² stability in (Ruthotto & Haber, 2019), where more general convolution filters are investigated: An L² stable network can still produce overshoots which violate L∞ stability. Contrary to (Ruthotto & Haber, 2019), our stability result does not require activation functions to be monotone. We will see that widely used diffusivities and shrinkage functions naturally lead to nonmonotone activation functions.

Table 2. Function plots for selected diffusivities, regularisers, shrinkage functions, and activation functions (columns: diffusivity g(r), regulariser Ψ(r), shrinkage function S(r), activation function Φ(r)). The names of known functions are written above the graphs. Bold font indicates the best known function for each row. Axes and parameters are individually scaled for optimal qualitative inspection.

Table 3. Formulas for the function plots in Table 2 (columns: diffusivity g(r), regulariser Ψ(r), shrinkage function S(r), activation function Φ(r)). The names of known functions are written above the equations. Bold font indicates the best known function for each row.

Exploiting Equations (10), (11), and (15), we are now in a position to present a general dictionary which can be used to translate arbitrary diffusivities, wavelet shrinkage functions, and variational regularisers into activation functions. This dictionary is displayed in Table 1. On the one hand, our dictionary provides a blueprint for researchers acquainted with diffusion, wavelet shrinkage or regularisation to build a residual network for a desired model while preserving important theoretical properties. This can help them to develop rapid prototypes of the corresponding filters without the need to pay attention to implementational details. Moreover, parallelisation for GPUs is readily available. Last but not least, these methods can be gradually refined by learning. On the other hand, CNN researchers can benefit as well. The dictionary shows how to restrict CNN architectures or parts thereof to models which are well-motivated, provably stable, and can benefit from the rich research results for diffusion, wavelet shrinkage and regularisation. Lastly, it can inspire CNN practitioners to use more sophisticated activation functions, in particular nonmonotone ones.

Let us now apply our general dictionary to prominent diffusivities, shrinkage functions, and regularisers in order to identify their activation functions.
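The diffusivity and regulariser entries of the dictionary can be written down directly; the following sketch applies the translations Φ(r) = g(r) r and, via g(r) = Ψ′(r)/(2r), Φ(r) = Ψ′(r)/2. The concrete parametrisations of the Charbonnier and Perona-Malik diffusivities below are the standard ones from the cited works and should be read as assumptions, since the exact formulas of Table 3 are not reproduced here; all function names are ours.

```python
import numpy as np

def activation_from_diffusivity(g):
    # Eq. (10): Phi(r) = g(r) * r
    return lambda r: g(r) * r

def activation_from_regulariser_derivative(dpsi):
    # From Eq. (15), g(r) = Psi'(r) / (2r), hence Phi(r) = g(r) * r = Psi'(r) / 2.
    return lambda r: dpsi(r) / 2.0

# Example diffusivities (lam denotes the contrast parameter):
def constant(r):
    return np.ones_like(np.asarray(r, dtype=float))   # Iijima / Whittaker-Tikhonov

def charbonnier(r, lam=1.0):
    return 1.0 / np.sqrt(1.0 + r**2 / lam**2)          # Charbonnier et al.

def perona_malik(r, lam=1.0):
    return np.exp(-r**2 / (2.0 * lam**2))              # exponential Perona-Malik

phi_identity = activation_from_diffusivity(constant)      # identity activation
phi_charb    = activation_from_diffusivity(charbonnier)   # monotone activation
phi_pm       = activation_from_diffusivity(perona_malik)  # nonmonotone activation

# Regulariser route: Whittaker-Tikhonov, Psi(r) = r^2, so Psi'(r) = 2r and Phi(r) = r.
phi_wt = activation_from_regulariser_derivative(lambda r: 2.0 * r)
```

The shrinkage column of the dictionary follows analogously from the coupling relation (11).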
We visualise these functions in Table 2 and display their mathematical formulas in Table 3. For our examples, we choose a grid size of h = 1. As we have g_max = 1 for all cases considered, we set τ = α = 1/4. This fulfils the sign stability condition (16).

We generally observe that all resulting activation functions Φ(r) = g(r) r are antisymmetric, since g(r) = g(−r). This is very natural in the diffusion case, where the argument of the flux function is the signal derivative ∂_x u. It reflects a desired invariance axiom of denoising: Signal negation and filtering are commutative.

The methods in Tables 2 and 3 fall into two classes: The first class comprises diffusion filters with constant (Iijima, 1962) and Charbonnier diffusivities (Charbonnier et al., 1994), as well as soft wavelet shrinkage (Donoho, 1995), which involves a Huber regulariser (Huber, 1973) and a truncated total variation (TV) diffusivity (Rudin et al., 1992; Andreu et al., 2001). These methods have strictly convex regularisers, and their shrinkage functions do not approximate the identity function for r → ±∞. Most importantly, their activation functions are monotonically increasing. This is compatible with the standard scenario in deep learning where the ReLU activation function dominates (Nair & Hinton, 2010). On the diffusion side, the corresponding increasing flux functions act in a contrast-reducing way. Strictly convex regularisers have unique minimisers, and popular minimisation algorithms such as gradient descent converge globally.

The second class is much more exciting. Its representatives are given by Perona-Malik diffusion (Perona & Malik, 1990) and two wavelet shrinkage methods: garrote shrinkage (Gao, 1998) and hard shrinkage (Mallat, 1999). Garrote shrinkage corresponds to the truncated balanced forward-backward (BFB) diffusivity of (Keeling & Stollberger, 2002), while hard shrinkage has a truncated quadratic regulariser which is used in the weak string model of (Geman & Geman, 1984). Approaches of the second class have nonconvex regularisers, which may lead to multiple energy minimisers. Their shrinkage functions converge to the identity function for r → ±∞. The flux function of the diffusion filter is nonmonotone. While this was considered somewhat problematic for continuous diffusion PDEs, it has been shown that their discretisations are well-posed (Weickert & Benhamouda, 1997), in spite of the fact that they may act in a contrast-enhancing way. Since the activation function is equivalent to the flux function, it is also nonmonotone. This is very unusual for CNN architectures. Although there were a few very early proposals in the neural network literature arguing that such networks offer a larger storage capacity (De Felice et al., 1993) and have some optimality properties (Meilijson & Ruppin, 1994), they had no impact on modern CNNs. Our results motivate their ideas from a different perspective. Since it is well-known that nonconvex variational methods can outperform convex ones, it appears promising to incorporate nonmonotone activations into CNNs in spite of some challenges that have to be mastered.
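A quick numerical comparison illustrates the difference between the two classes; the Charbonnier and Perona-Malik parametrisations below are the standard ones, and the parameter values are illustrative choices of ours.

```python
import numpy as np

lam = 1.0

def charbonnier_flux(r):
    return r / np.sqrt(1.0 + r**2 / lam**2)    # class 1: monotonically increasing

def perona_malik_flux(r):
    return r * np.exp(-r**2 / (2.0 * lam**2))  # class 2: nonmonotone

r = np.array([0.5, 1.0, 2.0, 4.0])
print(np.round(charbonnier_flux(r), 3))   # [0.447 0.707 0.894 0.97 ] -- keeps increasing
print(np.round(perona_malik_flux(r), 3))  # [0.441 0.607 0.271 0.001] -- peaks at r = lam, then decays
```

The decaying branch of the Perona-Malik flux is what allows the corresponding filter to act in a contrast-enhancing way, and it is exactly the behaviour that a nonmonotone activation function would contribute to a CNN.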
We have seen that CNNs and classical methods have much more in common than most people would expect: Focusing on three classical denoising approaches in a 1D setting and on a ResNet architecture with simple convolutions, we have established a dictionary that allows us to translate diffusivities, shrinkage functions, and regularisers into activation functions. This not only yields strict stability results for specific ResNets with an arbitrary number of layers, but also suggests investing more effort in the design of activation functions. In particular, nonmonotone activation functions warrant more attention.

Needless to say, our restrictions to 1D, to a single scale, and to denoising methods have been introduced mainly for didactic reasons. We see our work as an entry ticket that can also guide other researchers with expertise in PDEs, wavelets, and variational approaches into the CNN universe. As a result, we envision numerous generalisations, including extensions to higher dimensions and manifold-valued data. Of course, additional key features of CNNs also deserve to be analysed, for instance pooling operations. Our long-term vision is that this line of research will help to bridge the performance gap as well as the theory gap between model-driven and data-driven approaches, for their mutual benefit.

References

Minimizing total variation flow
Networks for nonlinear diffusion problems in imaging
Analysis of invariance and robustness via invertibility of ReLU-networks
Deep layers as stochastic solvers
Invariant scattering convolution networks
Reversible architectures for arbitrarily deep residual neural networks
Two deterministic half-quadratic regularization algorithms for computed imaging
Deep relaxation: partial differential equations for optimizing deep neural networks
Translation invariant denoising
Dynamics of neural networks with nonmonotone activation function
Regularization by architecture: A deep prior approach for inverse problems
De-noising by soft thresholding
Ideal spatial adaptation by wavelet shrinkage
Towards efficient intra prediction based on image inpainting methods
Wavelet convolutional neural networks
Wavelet shrinkage denoising using the nonnegative garrote
Calculus of Variations
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
Deep Learning
Approximation spaces of deep neural networks
Stable architectures for deep neural networks
Global optimality in neural network training
Deep residual learning for image recognition
Robust regression: Asymptotics, conjectures and Monte Carlo
Basic theory on normalization of pattern (in case of typical one-dimensional pattern)
Nonlinear anisotropic diffusion filters for wide range edge sharpening
Learning a spatial activation function for efficient image restoration
Variational networks: Connecting variational methods and deep learning
Total deep variation for linear inverse problems
Regularization for deep learning: A taxonomy
Gradient-based learning applied to document recognition. Proceedings of the IEEE
Deep learning. Nature
Visualizing the loss landscape of neural nets
Deep residual learning and PDEs on manifolds
A Wavelet Tour of Signal Processing
Optimal signalling in attractor neural networks
Diffusion-inspired shrinkage functions and stability results for wavelet denoising
Rectified linear units improve restricted Boltzmann machines
The loss surface of deep and wide neural networks
Scale space and edge detection using anisotropic diffusion
Why and when can deep - but not shallow - networks avoid the curse of dimensionality: A review
Scale-space properties of nonstationary iterative regularization methods
The power of deeper networks for expressing natural functions
Adversarial noise attacks of deep learning architectures: Stability analysis via sparse-modeled signals
Residual networks as flows of diffeomorphisms
Nonlinear total variation based noise removal algorithms
Deep neural networks motivated by partial differential equations
Relations between regularization and diffusion filtering
Deep learning in neural networks: An overview
Über variationsvermindernde lineare Transformationen
Opening the black box of deep neural networks via information
Deep limits of residual neural networks
Solution of incorrectly formulated problems and the regularization method
Deep image prior
A representer theorem for deep neural networks
Mathematics of deep learning
Anisotropic Diffusion in Image Processing
A semidiscrete nonlinear scale-space theory and its relation to the Perona-Malik paradox
A new method of graduation
A mathematical theory of deep convolutional neural networks for feature extraction
Wavelet pooling for convolutional neural networks
Forward stability of ResNet and its variants

Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 741215, ERC Advanced Grant INCOVID). We thank Michael Ertel for checking our mathematical formulas.