key: cord-0054943-v4ic78cu
authors: Huot, Mathieu; Staton, Sam; Vákár, Matthijs
title: Correctness of Automatic Differentiation via Diffeologies and Categorical Gluing
date: 2020-04-17
journal: Foundations of Software Science and Computation Structures
DOI: 10.1007/978-3-030-45231-5_17
sha: 4e7e9e55490823fd5172f828b03eed056a545166
doc_id: 54943
cord_uid: v4ic78cu

We present semantic correctness proofs of Automatic Differentiation (AD). We consider a forward-mode AD method on a higher order language with algebraic data types, and we characterise it as the unique structure preserving macro given a choice of derivatives for basic operations. We describe a rich semantics for differentiable programming, based on diffeological spaces. We show that it interprets our language, and we phrase what it means for the AD method to be correct with respect to this semantics. We show that our characterisation of AD gives rise to an elegant semantic proof of its correctness based on a gluing construction on diffeological spaces. We explain how this is, in essence, a logical relations argument. Finally, we sketch how the analysis extends to other AD methods by considering a continuation-based method.

Automatic differentiation (AD), loosely speaking, is the process of taking a program describing a function, and building the derivative of that function by applying the chain rule across the program code. As gradients play a central role in many aspects of machine learning, so too do automatic differentiation systems such as TensorFlow [1] or Stan [6]. Differentiation has a well-developed mathematical theory in terms of differential geometry. The aim of this paper is to formalize this connection between differential geometry and the syntactic operations of AD. In this way we achieve two things: (1) a compositional, denotational understanding of differentiable programming and AD; (2) an explanation of the correctness of AD. This intuitive correspondence (summarized in Fig. 1) is in fact rather complicated. In this paper we focus on resolving the following problem: higher order functions play a key role in programming, and yet they have no counterpart in traditional differential geometry. Moreover, we resolve this problem while retaining the compositionality of denotational semantics.

Higher order functions and differentiation. A major application of higher order functions is to support disciplined code reuse. The need for code reuse is particularly acute in machine learning. For example, a multi-layer neural network might be built of millions of near-identical neurons, as follows:

  neuron_n : (real^n * (real^n * real)) → real
  neuron_n def = λ⟨x, w, b⟩. ς(w · x + b)
  layer_n : ((τ_1 * P) → τ_2) → ((τ_1 * P^n) → τ_2^n)
  comp : ((τ_1 * P) → τ_2) → ((τ_2 * Q) → τ_3) → ((τ_1 * (P * Q)) → τ_3)

  comp (layer_m neuron_k) (comp (layer_n neuron_m) neuron_n) : (real^k * P) → real    (1)

with k inputs and two hidden layers. Here P ≅ real^p with p = m(k+1) + n(m+1) + n + 1. This program (1) describes a smooth (infinitely differentiable) function. The goal of automatic differentiation is to find its derivative. If we β-reduce all the λ's, we end up with a very long function expression just built from the sigmoid function and linear algebra. We can then find a program for calculating its derivative by applying the chain rule. However, automatic differentiation can also be expressed without first β-reducing, in a compositional way, by explaining how higher order functions like (layer) and (comp) propagate derivatives. This paper is a semantic analysis of this compositional approach.
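To make the shape of this higher order program concrete, the following is a small Haskell sketch of neuron, layer and comp (our own illustration; the representation of vectors and parameters as plain lists and pairs of Doubles, and all names in the snippet, are assumptions made for this example only, not part of the idealized language studied below).

```haskell
-- Sketch of the combinators in (1): a neuron, a layer that broadcasts one
-- input to n parameterised units, and comp, which composes parameterised
-- functions while pairing up their parameters.

type Vec = [Double]                       -- a vector in R^n, as a plain list

sigmoid :: Double -> Double
sigmoid x = 1 / (1 + exp (-x))

dot :: Vec -> Vec -> Double
dot xs ys = sum (zipWith (*) xs ys)

-- neuron : (real^n * (real^n * real)) -> real
neuron :: (Vec, (Vec, Double)) -> Double
neuron (x, (w, b)) = sigmoid (dot w x + b)

-- layer : ((tau1 * P) -> tau2) -> ((tau1 * P^n) -> tau2^n)
layer :: ((a, p) -> b) -> (a, [p]) -> [b]
layer f (x, ps) = [f (x, p) | p <- ps]

-- comp : ((tau1 * P) -> tau2) -> ((tau2 * Q) -> tau3) -> ((tau1 * (P * Q)) -> tau3)
comp :: ((a, p) -> b) -> ((b, q) -> c) -> (a, (p, q)) -> c
comp f g (x, (wf, wg)) = g (f (x, wf), wg)

-- A network with two hidden layers, assembled purely from the pieces above.
network :: (Vec, ([(Vec, Double)], ([(Vec, Double)], (Vec, Double)))) -> Double
network = comp (layer neuron) (comp (layer neuron) neuron)
```

The point to note is that network is assembled entirely from reusable, higher order pieces; this is the style of program whose derivative we want to compute compositionally.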
The general idea of denotational semantics is to interpret types as spaces and programs as functions between the spaces. In this paper, we propose to use diffeological spaces and smooth functions [32, 16] to this end. These satisfy the following three desiderata:

- R is a space, and the smooth functions R → R are exactly the functions that are infinitely differentiable;
- the set of smooth functions X → Y between spaces again forms a space, so we can interpret function types;
- the disjoint union of a sequence of spaces again forms a space, so we can interpret variant types and inductive types.

We emphasise that the most standard formulation of differential geometry, using manifolds, does not support spaces of functions. Diffeological spaces seem to us the simplest notion of space that satisfies these conditions, but there are other candidates [3, 33]. A diffeological space is in particular a set X equipped with a chosen set of curves C_X ⊆ X^R, and a smooth map f : X → Y must be such that if γ ∈ C_X then γ; f ∈ C_Y. This is reminiscent of the method of logical relations.

From smoothness to automatic derivatives at higher types. Our denotational semantics in diffeological spaces guarantees that all definable functions are smooth. But we need more than just to know that a definable function happens to have a mathematical derivative: we need to be able to find that derivative. In this paper we focus on a simple, forward mode automatic differentiation method, which is a macro translation on syntax (called − → D in §2). We are able to show that it is correct, using our denotational semantics. Here there is one subtle point that is central to our development. Although differential geometry provides established derivatives for first order functions (such as neuron above), there is no canonical notion of derivative for higher order functions (such as layer and comp) in the theory of diffeological spaces (e.g. [7]). We propose a new way to resolve this, by interpreting types as triples (X, X', S) where, intuitively, X is a space of inhabitants of the type, X' is a space serving as a chosen bundle of tangents over X, and S ⊆ X^R × X'^R is a binary relation between curves, informally relating curves in X with their tangent curves in X'. This new model gives a denotational semantics for automatic differentiation. In §3 we boil this new approach down to a straightforward and elementary logical relations argument for the correctness of automatic differentiation. The approach is explained in detail in §5.

Related work and context. AD has a long history and has many implementations. AD was perhaps first phrased in a functional setting in [26], and there are now a number of teams working on AD in the functional setting (e.g. [34, 31, 12]), some providing efficient implementations. Although that work does not involve formal semantics, it is inspired by intuitions from differential geometry and category theory. This paper adds to a very recent body of work on verified automatic differentiation. Much of this is concurrent with and independent from the work in this article. In the first order setting, there are recent accounts based on denotational semantics in manifolds [13] and based on synthetic differential geometry [9], as well as work making a categorical abstraction [8] and work connecting operational semantics with denotational semantics [2, 28]. Recently there has also been significant progress at higher types. The work of Brunel et al.
gives formal correctness proofs for reverse-mode derivatives on computation graphs [5]. The work of Barthe et al. [4] provides a general discussion of some new syntactic logical relations arguments, including one very similar to our syntactic proof of Theorem 1. We understand that the authors of [9] are working on higher types. The differential λ-calculus [11] is related to AD, and explicit connections are made in [22, 23]. One difference is that the differential λ-calculus allows addition of terms at all types, and hence vector space models are suitable, but this fits awkwardly with the variant and inductive types that we consider here. Finally we emphasise that we have chosen the neural network (1) as our running example mainly for its simplicity. There are many other examples of AD outside the neural networks literature: AD is useful whenever derivatives need to be calculated on high dimensional spaces. This includes optimization problems more generally, where the derivative is passed to a gradient descent method (e.g. [30, 18, 29, 19, 10, 21]). Other applications of AD are in advanced integration methods, since derivatives play a role in Hamiltonian Monte Carlo [25, 14] and variational inference [20].

We have provided a semantic analysis of automatic differentiation. Our syntactic starting point is a well-known forward-mode AD macro on a typed higher order language (e.g. [31, 34]). We recall this in §2 for function types, and in §4 we extend it to inductive types and variants. The main contributions of this paper are as follows.

- We give a denotational semantics for the language in diffeological spaces, showing that every definable expression is smooth (§3).
- We show correctness of the AD macro by a logical relations argument (Th. 1).
- We give a categorical analysis of this correctness argument with two parts: canonicity of the macro in terms of syntactic categories, and a new notion of glued space that abstracts the logical relation (§5).
- We then use this analysis to state and prove a correctness argument at all first order types (Th. 2).
- We show that our method is not specific to one particular AD macro, by also considering a continuation-based AD method (§6).

Rudiments of differentiation and dual numbers. Recall that the derivative of a function f : R → R, if it exists, is a function ∇f : R → R; ∇f(x_0) is the gradient of f at x_0. To find ∇f in a compositional way, two generalizations are reasonable:

- we need both f and ∇f when calculating ∇(f; g) of a composition f; g, using the chain rule, so we are really interested in the pair (f, ∇f) : R → R × R;
- in building f we will need to consider functions of multiple arguments, such as + : R^2 → R, and these functions should propagate derivatives.

Thus we are more generally interested in transforming a function g : R^n → R into a function h : (R × R)^n → R × R in such a way that for any f_1, . . . , f_n : R → R,

  ((f_1, ∇f_1), . . . , (f_n, ∇f_n)); h = (((f_1, . . . , f_n); g), ∇((f_1, . . . , f_n); g)).    (2)

An intuition for h is often given in terms of dual numbers. The transformed function operates on pairs of numbers, (x, x'), and it is common to think of such a pair as x + x'ε for an 'infinitesimal' ε. But while this is a helpful intuition, the formalization of infinitesimals can be intricate, and the development in this paper is focussed on the elementary formulation in (2). The reader may also notice that h encodes all the partial derivatives of g.
For example, if g : R^2 → R, then with f_1(x) def = x and f_2(x) def = x_2, by applying (2) and evaluating at x_1 we obtain h(x_1, 1, x_2, 0) = (g(x_1, x_2), ∂g(x, x_2)/∂x (x_1)), and similarly h(x_1, 0, x_2, 1) = (g(x_1, x_2), ∂g(x_1, x)/∂x (x_2)). And conversely, if g is differentiable in each argument, then a unique h satisfying (2) can be found by taking linear combinations of partial derivatives:

  h(x_1, x_1', . . . , x_n, x_n') = (g(x_1, . . . , x_n), x_1' · ∂g/∂x_1(x_1, . . . , x_n) + · · · + x_n' · ∂g/∂x_n(x_1, . . . , x_n)).

In summary, the idea of differentiation with dual numbers is to transform a differentiable function g : R^n → R to a function h : R^2n → R^2 which captures g and all its partial derivatives. We packaged this up in (2) as a sort-of invariant which is useful for building derivatives of compound functions R → R in a compositional way. The idea of forward mode automatic differentiation is to perform this transformation at the source code level.

A simple language of smooth functions. We consider a standard higher order typed language with a first order type real of real numbers. The types (τ, σ) and terms (t, s) are as follows:

  τ, σ, ρ ::= real | τ_1 * . . . * τ_n | τ → σ        (real numbers, products, functions)
  t, s ::= x | c | t + s | t * s | ς(t) | ⟨t_1, . . . , t_n⟩ | case t of ⟨x_1, . . . , x_n⟩ → s | λx.t | t s        (variables, constants, operations, tupling/pattern matching, function abstraction/app.)

The typing rules are in Figure 3. We have included a minimal set of operations for the sake of illustration, but it is not difficult to add further operations. We add some simple syntactic sugar t − u def = t + (−1) * u. We intend ς to stand for the sigmoid function, ς(x) def = 1/(1 + e^{−x}). We further include syntactic sugar let x = t in s for (λx.s) t and λ⟨x_1, . . . , x_n⟩.t for λx.case x of ⟨x_1, . . . , x_n⟩ → t.

Syntactic automatic differentiation: a functorial macro. The aim of forward mode AD is to find the dual numbers representation of a function by syntactic manipulations. For our simple language, we implement this as the following inductively defined macro − → D on both types and terms (see also [34, 31]):

  − → D (real) def = real * real        − → D (τ_1 * . . . * τ_n) def = − → D (τ_1) * . . . * − → D (τ_n)        − → D (τ → σ) def = − → D (τ) → − → D (σ)

On terms, − → D acts homomorphically on variables, abstraction, application, tupling and pattern matching, and it replaces every operation by its dual numbers version, for instance

  − → D (c) def = ⟨c, 0⟩        − → D (t * s) def = case − → D (t) of ⟨x, x'⟩ → case − → D (s) of ⟨y, y'⟩ → ⟨x * y, x * y' + x' * y⟩.

This turns − → D into a well-typed, functorial macro in the following sense: if Γ ⊢ t : τ then − → D (Γ) ⊢ − → D (t) : − → D (τ), where − → D (x_1 : τ_1, . . . , x_n : τ_n) def = x_1 : − → D (τ_1), . . . , x_n : − → D (τ_n), and − → D commutes with substitution.

Example 1 (Inner products). Let us write τ^n for the n-fold product (τ * . . . * τ). Then, given Γ ⊢ t, s : real^n we can define their inner product Γ ⊢ t · s : real as the sum of the componentwise products.

Example 2 (Neural networks). In our introduction (1), we provided a program in our language to build a neural network out of expressions neuron, layer, comp; this program makes use of the inner product of Ex. 1. We can similarly calculate − → D of such deep neural nets by mechanically applying the macro.

Consider for a moment the first order fragment of the language in §2, with only one type, real, and no λ's or pairs. This has a simple semantics in the category of cartesian spaces and smooth maps. Indeed, a term x_1, . . . , x_n : real ⊢ t : real has a natural reading as a function ⟦t⟧ : R^n → R by interpreting our operation symbols by the well-known operations on R^n → R with the corresponding name. In fact, the functions that are definable in this first order fragment are smooth, which means that they are continuous, differentiable, and their derivatives are continuous, differentiable, and so on. Let us write CartSp for this category of cartesian spaces (R^n for some n) and smooth functions. The category CartSp has cartesian products, and so we can also interpret product types, tupling and pattern matching, giving us a useful syntax for constructing functions into and out of products of R. For example, the interpretation of neuron in (1) becomes ⟦neuron⟧(x, (w, b)) = ς(w ·_n x + b), where ·_n, + and ς are the usual inner product, addition and the sigmoid function on R, respectively.
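The dual numbers representation that − → D targets can also be pictured at runtime with a small Haskell sketch (again our own illustration; the names Dual, constD, addD, mulD, sigmoidD and deriv are assumptions for this example). Each real is paired with a tangent, constants carry tangent 0, and the operations +, * and ς propagate tangents by the sum, product and chain rules, as in (2).

```haskell
-- A minimal dual-numbers sketch of forward-mode differentiation:
-- a real is paired with its tangent, and each primitive operation
-- also propagates tangents.

data Dual = Dual { primal :: Double, tangent :: Double }
  deriving Show

constD :: Double -> Dual            -- constants carry tangent 0
constD c = Dual c 0

addD :: Dual -> Dual -> Dual        -- sum rule
addD (Dual x x') (Dual y y') = Dual (x + y) (x' + y')

mulD :: Dual -> Dual -> Dual        -- product rule
mulD (Dual x x') (Dual y y') = Dual (x * y) (x * y' + x' * y)

sigmoidD :: Dual -> Dual            -- chain rule, with sigma' = sigma * (1 - sigma)
sigmoidD (Dual x x') = let s = 1 / (1 + exp (-x)) in Dual s (x' * s * (1 - s))

-- Derivative of a unary function built from these primitives,
-- obtained by seeding the input with tangent 1.
deriv :: (Dual -> Dual) -> Double -> Double
deriv f x = tangent (f (Dual x 1))

-- Example: d/dx sigmoid(x * x + 3) at x = 2
example :: Double
example = deriv (\x -> sigmoidD (addD (mulD x x) (constD 3))) 2
```

Seeding the input with tangent 1 and reading off the tangent of the output computes the derivative; the macro − → D achieves the same effect at the level of source code, and the semantics in CartSp just described is what such syntactic derivatives will be compared against.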
Inside this category, we can straightforwardly study the first order language without λ's, and automatic differentiation. In fact, we can prove the following by plain induction on the syntax: However, as is well known, the category CartSp does not support function spaces. To see this, notice that we have polynomial terms x n y n : real → real for each d, and so if we could interpret (real → real) as a Euclidean space R p then, by interpreting these polynomial expressions, we would be able to find continuous injections R d → R p for every d, which is topologically impossible for any p, for example as a consequence of the Borsuk-Ulam theorem (see [15] , Appx. A). This means that we cannot interpret the functions (layer) and (comp) from (1) in CartSp, as they are higher order functions, even though they are very useful and innocent building blocks for differential programming! Clearly, we could define neural nets such as (1) directly as smooth functions without any higher order subcomponents, though that would quickly become cumbersome for deep networks. A problematic consequence of the lack of a semantics for higher order differential programs is that we have no obvious way of establishing compositional semantic correctness of − → D for the given implementation of (1). Diffeological spaces. This motivates us to turn to a more general notion of differential geometry for our semantics, based on diffeological spaces [16] . The key idea will be that a higher order function is called smooth if it sends smooth functions to smooth functions, meaning that we can never use it to build first order functions that are not smooth. For example, (comp) in (1) has this property. We call a function f : X → Y between diffeological spaces smooth if, for all plots p ∈ P U X , we have that p; f ∈ P U Y . We write Diff (X, Y ) for the set of smooth maps from X to Y . Smooth functions compose, and so we have a category Diff of diffeological spaces and smooth functions. A diffeological space is thus a set equipped with structure. Many constructions of sets carry over straightforwardly to diffeological spaces. Each open subset U of R n can be given the structure of a diffeological space by taking all the smooth functions V → U as P V U . It is easily seen that smooth functions from V → U in the traditional sense coincide with smooth functions in the sense of diffeological spaces. Thus diffeological spaces have a profound relationship with ordinary calculus. In categorical terms, this gives a full embedding of CartSp in Diff . Example 4 (Product diffeologies). Given a family (X i ) i∈I of diffeological spaces, we can equip the product i∈I X i of sets with the product diffeology in which U -plots are precisely the functions of the form (p i ) i∈I for p i ∈ P U Xi . This gives us the categorical product in Diff . We can equip the set Diff (X, Y ) of smooth functions between diffeological spaces with the functional diffeology in which Uplots consist of functions f : This specifies the categorical function object in Diff . We can now give a denotational semantics to our language from § 2. We interpret each type τ as a set τ equipped with the relevant diffeology, by induction on the structure of types: Notice that a term x 1 : real, . . . , x n : real t : real is interpreted as a smooth function t : R n → R, even if t involves higher order functions (like (1)). Moreover the macro differentiation (For instance, if n = 2, then − → Proof. We prove this by logical relations. 
Although the following proof is elementary, we found it by using the categorical methods in § 5. For each type τ , we define a binary relation S τ between curves in τ and curves in , by induction on τ : g 1 (x) ), x →f 2 (x)(g 2 (x))) ∈ S σ }. Then, we establish the following 'fundamental lemma': If x 1 :τ 1 , ..., x n :τ n t : σ and, for all 1≤i≤n, y 1 ...y m : real In this section, we show that the definition of forward AD and the semantics generalize if we extend the language of §2 with variants and inductive types. As an example of inductive types, we consider lists. This specific choice is only for expository purposes and the whole development works at the level of generality of arbitrary algebraic data types generated as initial algebras of (polynomial) type constructors formed by finite products and variants. Similarly, our choice of operations is for expository purposes. More generally, assume given a family of operations (Op n ) n∈N indexed by their arity n. Further assume that each op ∈ Op n has type real n → real. We then ask for a certain closure of these operations under differentiation, that is we define . . . , x n ) is some chosen term in the language, involving free variables from x 1 , . . . , x n , which we think of as implementing the partial derivative of op with respect to its i-th argument. For constructing the semantics, every op must be interpreted by some smooth function, and, to establish correctness, the semantics of ∂ i op(x 1 , . . . , x n ) must be the semantic i-th partial derivative of the semantics of op(x 1 , . . . , x n ) . Language. We additionally consider the following types and terms: τ, σ, ρ ::= We extend the type system according to: We can then extend − → D to our new types and terms by To demonstrate the practical use of expressive type systems for differential programming, we consider the following two examples. Usually, we run a neural network on a large data set, the size of which might be determined at runtime. To evaluate a neural network on multiple inputs, in practice, one often sums the outcomes. This can be coded in our extended language as follows. Suppose that we have a network f : (real n * P ) → real that operates on single input vectors. We can construct one that operates on lists of inputs as follows: g def = λ l, w .fold (x 1 , x 2 ).f x 1 , w + x 2 over l from 0 : (list(real n ) * P ) → real Example 7 (Missing data). In practically every application of statistics and machine learning, we face the problem of missing data: for some observations, only partial information is available. In an expressive typed programming language like we consider, we can model missing data conveniently using the data type maybe(τ ) = {Nothing ( ) Just τ }. In the context of a neural network, one might use it as follows. First, define some helper functions Given a neural network f : (list(real k ) * P ) → real, we can build a new one that operates on on a data set for which some covariates (features) are missing, by passing in default values to replace the missing covariates: λ l, m, w .f map (fromMaybe k m) l, w : (list((maybe(real)) k ) * (real k * P )) → real Then, given a data set l with missing covariates, we can perform automatic differentiation on this network to optimize, simultaneously, the ordinary network parameters w and the default values for missing covariates m. In § 3 we gave a denotational semantics for the simple language in diffeological spaces. This extends to the language in this section, as follows. 
As before, each type τ is interpreted as a diffeological space ⟦τ⟧, which is a set equipped with a family of plots; for instance, a variant type is interpreted as the disjoint union ⊔_j ⟦τ_j⟧ of the interpretations of its components. The constructors and destructors for variants and lists are interpreted as in the usual set theoretic semantics. It is routine to show inductively that these interpretations are smooth. Thus every term Γ ⊢ t : τ in the extended language is interpreted as a smooth function ⟦t⟧ : ⟦Γ⟧ → ⟦τ⟧ between diffeological spaces. (In this section we focused on a language with lists, but other inductive types are easily interpreted in the category of diffeological spaces in much the same way; the categorically minded reader may regard this as a consequence of Diff being a concrete Grothendieck quasitopos, e.g. [3].)

This section has three parts. First, we give a categorical account of the functoriality of AD (Ex. 8). Then we introduce our gluing construction, and relate it to the correctness of AD (diagram (3)). Finally, we state and prove a correctness theorem for all first order types by considering a category of manifolds (Th. 2).

Syntactic categories. Our language induces a syntactic category as follows. Let Syn be the category whose objects are types, and where a morphism τ → σ is a term in context x : τ ⊢ t : σ modulo the βη-laws (Fig. 4). Composition is by substitution. For simplicity, we do not impose arithmetic identities such as x + y = y + x in Syn. As is standard, this category has the following universal property.

Lemma 2 (e.g. [27]). For every bicartesian closed category C with list objects, and every object F(real) ∈ C and morphisms F(c) ∈ C(1, F(real)), F(+), F(*) ∈ C(F(real) × F(real), F(real)), F(ς) ∈ C(F(real), F(real)) in C, there is a unique functor F : Syn → C respecting the interpretation and preserving the bicartesian closed structure as well as list objects.

Proof (notes). The functor F : Syn → C is a canonical denotational semantics for the language, interpreting types as objects of C and terms as morphisms.

Gluing is a method for building new categorical models which has been used for many purposes, including logical relations and realizability [24]. Our logical relations argument in the proof of Th. 1 can be understood in this setting. In this subsection, for the categorically minded, we explain this, and in doing so we quickly recover a correctness result for the more general language in §4 and for arbitrary first order types.

Fig. 4. Standard βη-laws (e.g. [27]) for products, functions, variants and lists.

We define a category Gl_U whose objects are triples (X, X', S) where X and X' are diffeological spaces and S ⊆ P_U^X × P_U^X' is a relation between their U-plots, and whose morphisms (X, X', S) → (Y, Y', T) are pairs of smooth maps f : X → Y, f' : X' → Y' such that if (g, g') ∈ S then (g; f, g'; f') ∈ T. The idea is that this is a semantic domain where we can simultaneously interpret the language and its automatic derivatives.

Proposition 1. The category Gl_U is bicartesian closed, has list objects, and the projection functor proj : Gl_U → Diff × Diff preserves this structure.

Proof (notes). The category Gl_U is a full subcategory of the comma category id_Set ↓ Diff(U, −) × Diff(U, −). The result thus follows by the general theory of categorical gluing (e.g. [17, Lemma 15]).

We give a semantics ⟦−⟧ = (⟦−⟧_0, ⟦−⟧_1, S_−) for the language in Gl_R, interpreting types τ as objects (⟦τ⟧_0, ⟦τ⟧_1, S_τ), and terms as morphisms.
We let real 0 def = R and real 1 def = R 2 , with the relation S real We interpret the constants c as pairs c 0 def = c and c 1 def = (c, 0), and we interpret +, ×, ς in the standard way (meaning, like − ) in − 0 , but according to the derivatives in − 1 , for instance, * 1 : At this point one checks that these interpretations are indeed morphisms in Gl R . This amounts to checking that these interpretations are dual numbers representations in the sense of (2). The remaining constructions of the language are interpreted using the categorical structure of Gl R , following Lem. 2. Notice that the diagram below commutes. One can check this by hand or note that it follows from the initiality of Syn (Lem. 2): all the functors preserve all the structure. Syn We thus arrive at a restatement of the correctness theorem (Th. 1), which holds even for the extended language with variants and lists, because for any x 1 ... Correctness at all first order types, via manifolds. We now generalize Theorem 1 to hold at all first order types, not just the reals. To do this, we need to define the derivative of a smooth map between the interpretations of first order types. We do this by recalling the well known theory of manifolds and tangent bundles. For our purposes, a smooth manifold M is a second-countable Hausdorff topological space together with a smooth atlas: an open cover U together with homeomorphisms φ U : U → R n(U ) U ∈U (called charts) such that φ −1 U ; φ V is smooth on its domain of definition for all U, V ∈ U. A function f : M → N between manifolds is smooth if φ −1 U ; f ; ψ V is smooth for all charts φ U and ψ V of M and N , respectively. Let us write Man for this category. Our manifolds are slightly unusual because different charts in an atlas may have different finite dimension n(U ). Thus we consider manifolds with dimensions that are potentially unbounded, albeit locally finite. This does not affect the theory of differential geometry as far as we need it here. Each open subset of R n can be regarded as a manifold. This lets us regard the category of manifolds Man as a full subcategory of the category of diffeological spaces. We consider a manifold (X, {φ U } U ) as a diffeological space with the same carrier set X and where the plots P U X are the smooth functions in Man(U, X). A function X → Y is smooth in the sense of manifolds if and only if it is smooth in the sense of diffeological spaces [16] . For the categorically minded reader, this means that we have a full embedding of Man into Diff . Moreover, the natural interpretation of the first order fragment of our language in Man coincides with that in Diff . That is, the embedding of Man into Diff preserves finite products and countable coproducts (hence initial algebras of polynomial endofunctors). Proof (notes). This is proved by induction on the structure of types. In fact, one may show that every such τ is isomorphic to a manifold of the form n i=1 R di where the bound n is either finite or ∞, but this isomorphism is typically not an identity function. The constraint to first order types is necessary because, e.g. the space real → real is not a manifold, because of a Borsuk-Ulam argument (see [15] , Appx. A). We recall that the derivative of any morphism f : M → N of manifolds is given as follows. As is standard, we can understand the tangent bundle of a composite space in terms of that of its parts. We define a canonical isomorphism φ . For the other types, we use Lemma 3. We can now phrase correctness at all first order types. 
− → D (full)). For any ground τ , any first order context Γ and any term Γ t : τ , the syntactic translation − → D coincides with the tangent bundle functor, modulo these canonical isomorphisms: This is shown by induction on the structure of types. We conclude the theorem from diagram (3), by putting these two observations together. We now illustrate the flexibility of our framework by briefly describing an alternative syntactic translation ← − D ρ . This alternative translation uses aspects of continuation passing style, inspired by recent developments in reverse mode AD [34, 5] . In brief, ← − D ρ works by ← − D ρ (real) = (real * (real → ρ)). Thus instead of using a pair of a number and its tangent, we use a pair of a number and a continuation. The answer type ρ = real k needs to have the structure of a vector space, and the continuations that we consider will turn out to be linear maps. Because we work in continuation passing style, the chain rule is applied contravariantly. If the reader is familiar with reverse-mode AD algorithms, they may think of the dimension k as the number of memory cells used to store the result. Computing the whole gradient of a term x 1 : real, ..., x k : real t : real at once is then achieved by running ← − D k (t) on a k-tuple of basis vectors for real k . We define the continuation-based AD macro ← − D k on types and terms as the unique structure preserving functor Syn → Syn with ← − D k (real) = (real * (real → real k )) and The idea is that ← − D k (t) is a higher order function that simultaneously computes t (the forward pass) and defines as a continuation the reverse pass which computes the gradient. In order to actually run the algorithm, we need two auxiliary definitions lamR k D k preserve all type formers, we can extend these definitions to all first order types τ : z : . We can think of lamR k τ (z) as encoding k tangent vectors z : − → D k (τ ) as a closure, so it is suitable for running ← − D k (t) on, and evR k τ (z) as actually evaluating the reverse pass defined by z : ← − D k (τ ) and returning the result as k tangent vectors. The idea is that given some x : τ t : σ between first order types τ, σ, we run our continuation-based AD by running evR k σ ( . The correctness proof closely follows that for forward AD. In particular, one defines a binary logical relation real r,k = (R, ) respect this logical relation. It follows that this relation extends to a functor − r,k : Syn → Gl R k such that id × ← − D k factors over − r,k , implying the correctness of the continuation-based AD by the following lemma. For all first order types τ (i.e. types not involving function types), we have that evR k τ (lamR k τ (t)) = t . Proof (notes). This follows by an induction on the structure of τ . The idea is that lamR k τ embeds reals into function spaces as linear maps, which is undone by evR k τ by evaluating the linear maps at 1. To phrase correctness, in this setting, however, we need a few definitions. v 1 ) , . . . , T x (f )(v k ))). As T k preserves countable coproducts and finite products (like T ), it follows that the isomorphisms φ for first order types τ . This leads to the following correctness statement for continuation-based AD. For any ground τ , any first order context Γ and any term Γ t : τ , syntactic translation t → evR k τ ( / ... 
]) coincides with the tangent bundle functor, modulo these canonical isomorphisms: For example, when τ = real and Γ = x, y : real, we can run our continuationbased AD to compute the gradient of a program x, y : real t : real at values x = V, y = W by evaluating Summary. We have shown that diffeological spaces provide a denotational semantics for a higher order language with variants and inductive types ( §3,4). We have used this to show correctness of a simple AD translation (Thm. 1, Thm. 2). But the method is not tied to this specific translation, as we illustrated in Section 6. The structure of our elementary correctness argument for Theorem 1 is a typical logical relations proof. As explained in Section 5, this can equivalently be understood as a denotational semantics in a new kind of space obtained by categorical gluing. Overall, then, there are two logical relations at play. One is in diffeological spaces, which ensures that all definable functions are smooth. The other is in the correctness proof (equivalently in the categorical gluing), which explicitly tracks the derivative of each function, and tracks the syntactic AD even at higher types. Connection to the state of the art in AD implementation. As is common in denotational semantics research, we have here focused on an idealized language and simple translations to illustrate the main aspects of the method. There are a number of points where our approach is simplistic compared to the advanced current practice, as we now explain. In our examples we have treated n-vectors as tuples of length n. This style of programming does not scale to large n. A better solution would be to use array types, following [31] . Our categorical semantics and correctness proofs straightforwardly extend to cover them, in a similar way to our treatment of lists. Efficient forward-mode AD. For AD to be useful, it must be fast. The syntactic translation − → D that we use is the basis of an efficient AD library [31] . However, numerous optimizations are needed, ranging from algebraic manipulations, to partial evaluations, to the use of an optimizing C compiler. A topic for future work would be to validate some of these manipulations using our semantics. The resulting implementation is performant in experiments [31] . Efficient reverse-mode AD. Our sketch of continuation-based AD is primarily intended to emphasise that our denotational approach is not tied to any specific translation − → D . Nonetheless, it is worth noting that this algorithm shares similarities with advanced reverse-mode implementations: (1) it calculates derivatives in a (contravariant) "reverse pass" in which derivatives of operations are evaluated in the reverse order compared to their order in calculating the function value; (2) it can be used to calculate the full gradient of a function R n → R in a single reverse pass (while n passes of fwd AD would be necessary). However, it lacks important optimizations and the continuation scales with the size of the input n where it should scale with the size of the output. This adds an important overhead, as pointed out in [26] . Speed being the main attraction of reverse-mode AD, its implementations tend to rely on mutable state, control operators and/or staging [26, 6, 34, 5] , which we have not considered here. Other language features. The idealized languages that we considered so far do not touch on several useful language constructs. 
For example: the use of functions that are partial (such as division) or partly smooth (such as ReLU); phenomena such as iteration and recursion; and probabilities. There are suggestions that the denotational approach using diffeological spaces can be adapted to these features using standard categorical methods. We leave this for future work.

References

[1] Tensorflow: A system for large-scale machine learning
[2] A simple differentiable programming language
[3] Convenient categories of smooth spaces
[4] On the versatility of open logical relations: Continuity, automatic differentiation, and a containment theorem
[5] Backpropagation in the simply typed lambda-calculus with linear negation
[6] The Stan math library: Reverse-mode automatic differentiation in C++
[7] Tangent spaces and tangent bundles for diffeological spaces
[8] Reverse derivative categories
[9] Towards formalizing and extending differential programming using tangent categories
[10] Adaptive subgradient methods for online learning and stochastic optimization
[11] The differential lambda-calculus
[12] The simple essence of automatic differentiation
[13] Backprop as functor: A compositional perspective on supervised learning
[14] The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo
[15] Correctness of automatic differentiation via diffeologies and categorical gluing. Full version
[17] Quasitoposes, quasiadhesive categories and Artin glueing
[18] Stochastic estimation of the maximum of a regression function
[19] Adam: A method for stochastic optimization
[20] Automatic differentiation variational inference
[21] On the limited memory BFGS method for large scale optimization
[22] A differential-form pullback programming language for higher-order reverse-mode automatic differentiation
[23] A simply typed λ-calculus of forward automatic differentiation
[24] Notes on sconing and relators
[25] MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo
[26] Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator
[27] Categorical logic
[28] Some principles of differential programming languages
[29] On the momentum term in gradient descent learning algorithms
[30] A stochastic approximation method. The Annals of Mathematical Statistics
[31] Efficient differentiable programming in a functional array-processing language
[32] Differential geometrical methods in mathematical physics
[33] Comparative smootheology
[34] Demystifying differentiable programming: Shift/reset the penultimate backpropagator

Acknowledgements. We have benefited from discussing this work with many people, including B. Pearlmutter, O. Kammar, C. Mak, L. Ong, G. Plotkin, A. Shaikhha, J. Sigal, and others. Our work is supported by the Royal Society and by a Facebook Research Award. In the course of this work, MV has also been employed at Oxford (EPSRC Project EP/M023974/1) and at Columbia in the Stan development team. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 895827.