Vákár, Matthijs. Reverse AD at Higher Types: Pure, Principled and Denotationally Correct. Programming Languages and Systems (2021-03-23). DOI: 10.1007/978-3-030-72019-3_22

We show how to define forward- and reverse-mode automatic differentiation source-code transformations on a standard higher-order functional language. The transformations generate purely functional code, and they are principled in the sense that their definition arises from a categorical universal property. We give a semantic proof of correctness of the transformations. In their most elegant formulation, the transformations generate code with linear types. However, we demonstrate how the transformations can be implemented in a standard functional language without sacrificing correctness. To do so, we make use of abstract data types to represent the required linear types, e.g. through the use of a basic module system.

Automatic differentiation (AD) is a technique for transforming code that implements a function f into code that computes f's derivative, essentially by using the chain rule for derivatives. Due to its efficiency and numerical stability, AD is the technique of choice whenever we need to compute derivatives of functions that are implemented as programs, particularly in high-dimensional settings. Optimization and Monte-Carlo integration algorithms, such as gradient descent and Hamiltonian Monte-Carlo methods, rely crucially on the calculation of derivatives. These algorithms are used in virtually every machine learning and computational statistics application, and the calculation of derivatives is usually the computational bottleneck. These applications explain the recent surge of interest in AD, which has resulted in the proliferation of popular AD systems such as TensorFlow [1], PyTorch [30], and Stan Math [9].
AD, roughly speaking, comes in two modes: forward mode and reverse mode. When differentiating a function R^n → R^m, forward mode tends to be more efficient if m ≫ n, while reverse mode generally is more performant if n ≫ m. As most applications reduce to optimization or Monte-Carlo integration of an objective function R^n → R with n very large (today, in the order of 10^4 to 10^7), reverse-mode AD is in many ways the more interesting algorithm. However, reverse AD is also more complicated to understand and implement than forward AD. Forward AD can be implemented as a structure-preserving program transformation, even on languages with complex features [32]. As such, it admits an elegant proof of correctness [20]. By contrast, reverse AD is only well-understood as a source-code transformation (also called define-then-run style AD) on limited programming languages. Typically, its implementations on more expressive languages that have features such as higher-order functions make use of define-by-run approaches. These approaches first build a computation graph during runtime, effectively evaluating the program until a straight-line first-order program is left, and then they evaluate this new program [30, 9]. Such approaches have the severe downside that the differentiated code cannot benefit from existing optimizing compiler architectures. As such, these AD libraries need to be implemented using carefully, manually optimized code that, for example, does not contain any common subexpressions. This implementation process is precarious and labour-intensive. Further, some whole-program optimizations that a compiler would detect go entirely unused in such systems. Similarly, correctness proofs of reverse AD have taken a define-by-run approach and have relied on non-standard operational semantics, using forms of symbolic execution [2, 28, 8].
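To make the asymmetry between the two modes concrete, the following Python sketch (our illustration, not the paper's transformation; the example function and names are hypothetical) differentiates f(x1, x2, x3) = sigmoid(x1 * x2 + x3) both ways. A full gradient costs n forward passes but only one reverse pass.

```python
import math

# f : R^3 -> R, a composition of primitives: f(x) = sigmoid(x1 * x2 + x3)
def f(x):
    return 1.0 / (1.0 + math.exp(-(x[0] * x[1] + x[2])))

# Forward mode (Jacobian-vector product): push one tangent v through the chain rule.
def jvp(x, v):
    u = x[0] * x[1] + x[2]
    du = x[1] * v[0] + x[0] * v[1] + v[2]   # derivative of the inner expression
    s = 1.0 / (1.0 + math.exp(-u))
    return s * (1.0 - s) * du               # derivative of sigmoid at u, applied to du

# Reverse mode (vector-Jacobian product): pull one cotangent w back through the
# transposed derivatives, in the opposite order.
def vjp(x, w):
    u = x[0] * x[1] + x[2]
    s = 1.0 / (1.0 + math.exp(-u))
    wu = s * (1.0 - s) * w                  # transposed derivative of sigmoid
    return [x[1] * wu, x[0] * wu, wu]       # transposed derivative of the inner expression

x = [1.0, 2.0, 0.5]
# Full gradient: three forward passes versus a single reverse pass.
grad_fwd = [jvp(x, e) for e in ([1, 0, 0], [0, 1, 0], [0, 0, 1])]
grad_rev = vjp(x, 1.0)
assert all(abs(a - b) < 1e-12 for a, b in zip(grad_fwd, grad_rev))
```

With n in the range 10^4 to 10^7 inputs, repeating the forward pass per input direction is infeasible, which is why reverse mode dominates in practice.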
Most work that treats reverse AD as a source-code transformation does so by making use of complex transformations which introduce mutable state and/or non-local control flow [31, 38]. As a result, we are not sure whether and why such techniques are correct. Another approach has been to compile high-level languages to a low-level imperative representation first, and then to perform AD at that level [22], using mutation and jumps. This approach has the downside that we might lose important opportunities for compiler optimizations, such as map-fusion and embarrassingly parallel maps, which we can exploit if we perform define-then-run AD on a high-level representation. A notable exception to these define-by-run and non-functional approaches to AD is [16], which presents an elegant, purely functional, define-then-run version of reverse AD. Unfortunately, their techniques are limited to first-order programs over tuples of real numbers. This paper extends the work of [16] to apply to higher-order programs over (primitive) arrays of reals:

- It defines purely functional define-then-run reverse-mode AD on a higher-order language.
- It shows how the resulting, mysterious-looking program transformation arises from a universal property if we phrase the problem in a suitable categorical language. Consequently, the transformations automatically respect equational reasoning principles.
- It explains, from this categorical setting, precisely in what sense reverse AD is the "mirror image" of forward AD.
- It presents an elegant proof of semantic correctness of the AD transformations, based on a semantic logical relations argument, demonstrating that the transformations calculate the derivatives of the program in the usual mathematical sense.
- It shows that the AD definitions and correctness proof are extensible to higher-order primitives such as a map-operation over our primitive arrays.
- It discusses how our techniques are readily implementable in standard functional languages to give purely functional, principled, semantically correct, define-then-run reverse-mode AD.

Consider a simple programming language. Types are statically sized arrays real^n for some n, and programs are obtained from a collection of (unary) primitive operations x : real^n ⊢ op(x) : real^m (intended to implement differentiable functions like linear algebra operations and sigmoid functions) by sequencing. We can implement both forward-mode AD −→D and reverse-mode AD ←−D on this language as source-code translations to the larger language of a simply typed λ-calculus over the ground types real^n that includes at least the same operations. Forward (resp. reverse) AD translates a type τ to a pair of types (−→D(τ)₁, −→D(τ)₂) (resp. (←−D(τ)₁, ←−D(τ)₂)), whose first component holds the primal values and whose second component holds the tangents (resp. cotangents) used to compute the derivatives. We define the translations by induction on the syntax, where we assume that we have chosen suitable terms x : real^n ⊢ (Dop)(x) : real^n → real^m and x : real^n ⊢ (Dop)ᵗ(x) : real^m → real^n to represent the (multivariate) derivative and transposed (multivariate) derivative, respectively, of the primitive operation op : real^n → real^m. For example, in the case of multiplication x : real² ⊢ op(x) = (∗)(x) : real, we can choose D(∗)(x) = λy : real². swap(x) • y and (D(∗))ᵗ(x) = λy : real. y · swap(x), where swap is a unary operation on real² that swaps both components, (•) is a binary inner product operation on real², and (·) is a binary scalar product operation for rescaling a vector in real² by a real number. To illustrate the difference between −→D and ←−D, consider the program t = op₂(op₁(x)) performing two operations in sequence. Then, the derivative components are

−→D(t)₂ = λv. (Dop₂)(op₁(x)) ((Dop₁)(x) v)   and   ←−D(t)₂ = λw. (Dop₁)ᵗ(x) ((Dop₂)ᵗ(op₁(x)) w).

In general, forward AD composes the derivatives of the operations in the order of computation, while reverse AD composes the transposed derivatives in the opposite order. While this AD technique works on the limited first-order language we described, it is far from satisfying. Notably, it has the following two shortcomings: 1.
it does not tell us how to perform AD on programs that involve tuples or operations of multiple arguments; 2. it does not tell us how to perform AD on higher-order programs, that is, programs involving λ-abstractions and applications. The key contributions of this paper are its extension of this transformation (see §7) to apply to a full simply typed λ-calculus (of §3), and its proof that this transformation is correct (see §8).

Shortcoming 1 seems easy to address, at first sight. Indeed, as the (co)tangent vectors to a product of spaces are simply tuples of (co)tangent vectors, one would expect to define, for a product type τ ∗ σ, −→D(τ ∗ σ)ᵢ := −→D(τ)ᵢ ∗ −→D(σ)ᵢ and ←−D(τ ∗ σ)ᵢ := ←−D(τ)ᵢ ∗ ←−D(σ)ᵢ (for i = 1, 2). Indeed, this technique straightforwardly applies to forward-mode AD. For reverse-mode AD, however, tuples already present challenges: we would like to use the expected definitions, but they require terms 0 : τ and t + s : τ for any two t, s : τ, for each type τ. These requirements capture the well-known issue of fanout translating to addition in reverse AD, caused by the contravariance of its second component [31]. Such 0 and + could indeed be defined by induction on the structure of types, using 0 and + at real^n. However, more problematically, ⟨−, −⟩, fst − and snd − represent explicit uses of the structural rules of contraction and weakening at types τ, which, in a λ-calculus, can also be used implicitly in the typing context Γ. Thus, we should also make these implicit uses explicit to account for their presence in the code. Then, we can appropriately translate them into their "mirror image": we map the contraction-weakening comonoids to the monoid structures (+, 0).

Insight 1. In functional define-then-run reverse AD, we need to make use of explicit structural rules and "mirror them", which we can do by first translating our language into combinators. This translation allows us to avoid the usual practice (e.g. [38]) of accumulating adjoints at run-time with mutable state: instead, we detect all adjoints to accumulate at compile-time.
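A minimal sketch of these ideas (our illustration, with hypothetical names): programs translate to pairs of a primal value and a linear map; forward AD composes the derivatives in execution order, reverse AD composes the transposed derivatives in the opposite order; and duplication ("fanout") mirrors to addition of cotangents.

```python
def fwd_compose(d1, d2):
    # d_i(x) = (op_i(x), Dop_i(x)), the second component linear in the tangent
    def d(x):
        y, dy = d1(x)
        z, dz = d2(y)
        return z, lambda v: dz(dy(v))    # derivatives compose in order
    return d

def rev_compose(d1, d2):
    # d_i(x) = (op_i(x), (Dop_i)^t(x)), the second component linear in the cotangent
    def d(x):
        y, dty = d1(x)
        z, dtz = d2(y)
        return z, lambda w: dty(dtz(w))  # transposed derivatives compose in reverse
    return d

square_f = lambda x: (x * x, lambda v: 2 * x * v)   # forward-mode pair for squaring
square_r = lambda x: (x * x, lambda w: 2 * x * w)   # reverse-mode pair (same map on R)

y, dz = fwd_compose(square_f, square_f)(3.0)        # x^4 at x = 3
assert (y, dz(1.0)) == (81.0, 108.0)                # d/dx x^4 = 4 x^3
y, dt = rev_compose(square_r, square_r)(3.0)
assert dt(1.0) == 108.0 and dt(0.0) == 0.0          # linear: sends 0 to 0

# Fanout mirrors to addition: dup(x) = <x, x> has transposed derivative
# (w1, w2) |-> w1 + w2, which is why 0 and + are needed at cotangent types.
dup_r = lambda x: ((x, x), lambda w: w[0] + w[1])
assert dup_r(5.0)[1]((2.0, 3.0)) == 5.0
```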
Put differently: we define AD on the syntactic category Syn, with types τ as objects and (α)βη-equivalence classes of programs x : τ ⊢ t : σ as morphisms τ → σ. Yet the question remains: why should this translation for tuples be correct? What is even less clear is how to address shortcoming 2. What should the spaces of tangents −→D(τ → σ)₂ and adjoints ←−D(τ → σ)₂ look like? This is not something we are taught in Calculus 1.01. Instead, we again employ category theory:

Insight 2. Follow where the categorical structure of the syntax leads you, as doing so produces principled definitions that are easy to prove correct.

With the aim of categorical compositionality in mind, we note that our translations compose according to a sort of "syntactic chain rule", which says that −→D(t; s) = −→D(t); −→D(s) and ←−D(t; s) = ←−D(t); ←−D(s). By the following trick, these equations are functoriality laws. Given a Cartesian closed category (C, 1, ×, ⇒), define two categories whose objects are pairs (A₁, A₂) of C-objects, one for forward and one for reverse AD. Both have identities id_(A₁,A₂) := (id_{A₁}, Λ(π₂)), where we write Λ for categorical currying and π₂ for the second projection. Composition in each is defined using the Cartesian closed structure, where we work in the internal language of C. Then, we have defined two functors −→D and ←−D out of Syn₁, where we write Syn₁ for the syntactic category of our restrictive first-order language, and we write Syn for that of the full λ-calculus. We would like to extend these to functors defined on all of Syn. The category used for forward AD turns out to be a category with finite products. Thus, we can easily extend −→D to apply to an extension of Syn₁ with tuples by extending the functor in the unique structure-preserving way. However, neither category obviously supports function types. (The reason turns out to be that not all functions are linear in the sense of respecting 0 and +.) Therefore, the categorical structure does not give us guidance on how to extend our translation to all of Syn.

Insight 3. Linear types can help. By using a more fine-grained type system, we can capture the linearity of the derivative.
As a result, we can phrase AD on our full language simply as the unique structure-preserving functor that extends the uncontroversial definitions given so far. To implement this insight, we extend our λ-calculus to a language LSyn with limited linear types (in §4): linear function types ⊸ and a kind of multiplicative conjunction !(−) ⊗ (−), in the sense of the enriched effect calculus [14]. The algebraic effect giving rise to these linear types, in this instance, is that of the theory of commutative monoids. As we have seen, such monoids are intimately related to reverse AD. Consequently, we demand that every f with a linear function type τ ⊸ σ is indeed linear, in the sense that f 0 = 0 and f (t + s) = (f t) + (f s). For the categorically inclined reader: that is, we enrich LSyn over the category of commutative monoids. Now, we can give more precise types to our derivatives, as we know they are linear functions: for x : τ ⊢ t : σ, the derivative component of −→D(t) is a linear function −→D(τ)₂ ⊸ −→D(σ)₂, and that of ←−D(t) a linear function ←−D(σ)₂ ⊸ ←−D(τ)₂. Therefore, given any model L of our linear type theory, we generalise our previous construction to the categories Σ_C L and Σ_C L^op of §6. In particular, the definitions of the AD transformations at function types are then forced on us by the theory. With these definitions in place, we turn to the correctness of the source-code transformations.

To phrase correctness, we first need to construct a suitable denotational semantics with an uncontroversial notion of semantic differentiation. A technical challenge arises, as the usual calculus setting of Euclidean spaces (or manifolds) and smooth functions cannot interpret higher-order functions. To solve this problem, we work with a conservative extension of this standard calculus setting (see §5): the category Diff of diffeological spaces. We model our types as diffeological spaces, and programs as smooth functions. By keeping track of a commutative monoid structure on these spaces, we are also able to interpret the required linear types.
We write Diff_CM for this "linear" category of commutative diffeological monoids and smooth monoid homomorphisms. By the universal properties of the syntax, we obtain canonical, structure-preserving functors ⟦−⟧ : LSyn → Diff_CM and ⟦−⟧ : Syn → Diff once we fix interpretations R^n of real^n and well-typed interpretations ⟦op⟧ for each operation op. These functors define a semantics for our language. Having constructed the semantics, we can turn to the correctness proof (of §8). Because calculus does not provide an unambiguous notion of derivative at function spaces, we cannot prove that the AD transformations correctly implement mathematical derivatives by plain induction on the syntax. Instead, we use a logical relations argument over the semantics, which we phrase categorically:

Insight 5. Once we show that the derivatives of primitive operations op are correctly implemented, correctness of derivatives of other programs follows from a standard logical relations construction over the semantics that relates a curve to its (co)tangent curve. By the chain rule, all programs respect the logical relations.

To show correctness of forward AD, we construct a category SScone of predicates over the semantics. SScone is a standard category of logical relations, or subscone, and it is widely known to inherit the Cartesian closed structure of the semantics. Once we show that all operations op respect these predicates, it follows that our denotational semantics lifts to give a unique structure-preserving functor Syn → SScone, such that the left diagram below commutes (by the universal property of Syn). Consequently, we can work with predicates relating each curve to its tangent curve, where we write Df(x)(v) for the multivariate calculus derivative of f at a point x, evaluated at a tangent vector v. By an application of the chain rule for differentiation, we see that every op respects this predicate, as long as ⟦Dop⟧ = D⟦op⟧. The commuting of our diagram then virtually establishes the correctness of forward AD.
The only remaining step in the argument is to note that any tangent vector at ⟦τ⟧ ≅ R^N, for first-order τ, can be represented by a curve R → ⟦τ⟧. For reverse AD, the same construction works: we instead choose predicates relating each curve to its cotangent curve, built from the transposed derivatives (Df(x))ᵗ, where we write Aᵗ for the matrix transpose of A. We obtain our main theorem, which crucially holds even for t that involve higher-order subprograms.

Theorem (Correctness of AD, Thm. 1). For any typed term x : τ ⊢ t : σ in Syn between first-order types τ, σ, the interpretations of −→D(t) and ←−D(t) compute the pair of ⟦t⟧ together with its derivative and its transposed derivative, respectively.

Next, we address the practicality of our method (in §9). The code transformations we employ are not too daunting to implement. It is well-known how to mechanically translate λ-calculus and functional languages into a (categorical) combinatory form [12]. However, the implementation of the required linear types presents a challenge. Indeed, types like !(−) ⊗ (−) and (−) ⊸ (−) are absent from languages such as Haskell and O'Caml. Luckily, in this instance, we can implement them using abstract data types by using a (basic) module system. We phrase the correctness proof of the AD transformations in elementary terms, such that it holds in the applied setting where we use abstract types to implement linear types. We show that our correctness results are meaningful, as they make use of a denotational semantics that is adequate with respect to the standard operational semantics. Finally, to stress the applicability of our method, we show that it extends to higher-order (primitive) operations, such as map.

As a source language for our AD translations, we can begin with a standard, simply typed λ-calculus which has ground types real^n of statically sized arrays of n real numbers, for all n ∈ N, and sets Op^m_{n₁,…,nₖ} of primitive operations op, for all k, m, n₁, …, nₖ ∈ N. These operations will be interpreted as smooth functions (R^{n₁} × ⋯ × R^{nₖ}) → R^m.
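For such operations, an implementation must supply derivative and transposed-derivative terms Dop and (Dop)ᵗ, as in §2. A hedged sketch for a few operations of the kind discussed below (our encoding; arguments are packed into a single Python list):

```python
import math

# sum in Op^1_n: a linear operation, so it is its own derivative.
def sum_op(x):      return [math.fsum(x)]
def d_sum(x, v):    return [math.fsum(v)]          # Dsum(x) v = sum(v)
def dt_sum(x, w):   return [w[0]] * len(x)         # (Dsum)^t(x) w broadcasts w

# sigmoid in Op^1_1.
def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x[0]))
    return [s]
def d_sigmoid(x, v):
    s = sigmoid(x)[0]
    return [s * (1.0 - s) * v[0]]
dt_sigmoid = d_sigmoid   # a 1 x 1 Jacobian is its own transpose

# elementwise product (*) in Op^n_{n,n}, uncurried into one list of length 2n.
def mul(xy):
    n = len(xy) // 2
    return [a * b for a, b in zip(xy[:n], xy[n:])]
def dt_mul(xy, w):
    # transposed derivative: cotangent of (x, y) is (y * w, x * w)
    n = len(xy) // 2
    x, y = xy[:n], xy[n:]
    return [b * c for b, c in zip(y, w)] + [a * c for a, c in zip(x, w)]

assert dt_sum([1.0, 2.0, 3.0], [1.0]) == [1.0, 1.0, 1.0]
assert dt_mul([2.0, 3.0, 5.0, 7.0], [1.0, 1.0]) == [5.0, 7.0, 2.0, 3.0]
```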
Examples to keep in mind for op include:

- constants c ∈ Op^n (with zero arguments) for each c ∈ R^n, for which we slightly abuse notation and write c() as c;
- elementwise addition and product (+), (∗) ∈ Op^n_{n,n} and a matrix-vector product operation in Op^n_{n·m,m};
- operations for summing all the elements in an array: sum ∈ Op^1_n;
- some non-linear functions like the sigmoid function ς ∈ Op^1_1.

We intentionally present operations in a schematic way, as primitive operations tend to form a collection that is added to in a by-need fashion as an AD library develops. The precise operations needed will depend on the applications, but, in statistics and machine learning applications, Op tends to include a mix of multi-dimensional linear algebra operations and mostly one-dimensional non-linear functions. A typical library for use in machine learning would work with multi-dimensional arrays (sometimes called "tensors"). We focus here on one-dimensional arrays, as the issues of how precisely to represent the arrays are orthogonal to the concerns of our development. The types τ, σ, ρ and terms t, s, r of our AD source language are as follows:

τ, σ, ρ ::= real^n | 1 | τ ∗ σ | τ → σ        (real arrays, nullary and binary products, function types)
t, s, r ::= x | op(t₁, …, tₖ) | ⟨⟩ | ⟨t, s⟩ | fst t | snd t | λx. t | t s

The typing rules are in Fig. 1, where we write Dom(op) := real^{n₁} ∗ ⋯ ∗ real^{nₖ} for an operation op ∈ Op^m_{n₁,…,nₖ}. We employ the usual syntactic sugar let x = t in s := (λx. s) t and write real for real¹. As Fig. 2 displays, we consider the terms of our language up to the standard βη-theory. We could consider further equations for our operations, but we do not, as we will not need them. This standard λ-calculus is widely known to be equivalent to the free Cartesian closed category Syn generated by the objects real^n and the morphisms op. Syn effectively represents programs as (categorical) combinators, also known as "point-free style" in the functional programming community.
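The combinator view of Syn can be made concrete by interpreting the Cartesian closed combinators in the CCC of sets and functions; a minimal sketch (our illustration, with hypothetical operation names):

```python
# The CCC combinator interface: identities, diagrammatic composition, pairing,
# projections, currying and evaluation, interpreted in sets and functions.
identity = lambda x: x
def compose(t, s):   return lambda x: s(t(x))   # diagrammatic order: t; s
def pair(t, s):      return lambda x: (t(x), s(x))
fst = lambda p: p[0]
snd = lambda p: p[1]
def curry(t):        return lambda x: lambda y: t((x, y))
ev = lambda p: p[0](p[1])                       # ev : (sigma -> rho) * sigma -> rho

# Point-free form of  \x. add(neg(fst x), snd x)  for sample operations neg, add:
neg = lambda x: -x
add = lambda p: p[0] + p[1]
term = compose(pair(compose(fst, neg), snd), add)
assert term((3.0, 10.0)) == 7.0

# Currying and evaluation satisfy the expected beta-law:
assert ev((curry(add)(3.0), 4.0)) == 7.0
```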
Indeed, there are well-studied mechanical translations from the λ-calculus to the free Cartesian closed category (and back) [26, 13]. The translation from Syn to the λ-calculus is self-evident, while the translation in the opposite direction is straightforward after we first convert our λ-terms to de Bruijn indexed form. Concretely:

- Syn has the types τ, σ, ρ as objects;
- Syn has morphisms t ∈ Syn(τ, σ), which are in 1-1 correspondence with terms x : τ ⊢ t : σ up to βη-equivalence (which includes α-equivalence); explicitly, they can be represented by
  • identities: id_τ ∈ Syn(τ, τ) (cf. variables up to α-equivalence);
  • composition: t; s ∈ Syn(τ, ρ) for any t ∈ Syn(τ, σ) and s ∈ Syn(σ, ρ) (corresponding to the capture-avoiding substitution s[t/y] if we represent x : τ ⊢ t : σ and y : σ ⊢ s : ρ);
  • terminal morphisms: ⟨⟩_τ ∈ Syn(τ, 1);
  • product pairing: ⟨t, s⟩ ∈ Syn(τ, σ ∗ ρ) for any t ∈ Syn(τ, σ) and s ∈ Syn(τ, ρ);
  • product projections: fst_{τ,σ} ∈ Syn(τ ∗ σ, τ) and snd_{τ,σ} ∈ Syn(τ ∗ σ, σ);
  • function evaluation: ev_{τ,σ} ∈ Syn((τ → σ) ∗ τ, σ);
  • currying: Λ_{τ,σ,ρ}(t) ∈ Syn(τ, σ → ρ) for any t ∈ Syn(τ ∗ σ, ρ);
  • operations: op ∈ Syn(real^{n₁} ∗ ⋯ ∗ real^{nₖ}, real^m) for any op ∈ Op^m_{n₁,…,nₖ};
- all subject to the usual equations of a Cartesian closed category [26].

1 and ∗ give finite products in Syn, while → gives categorical exponentials. Syn has the following universal property: for any Cartesian closed category (C, 1, ×, ⇒), we obtain a unique Cartesian closed functor F : Syn → C once we choose objects F(real^n) of C as well as, for each op ∈ Op^m_{n₁,…,nₖ}, well-typed morphisms F(op).

As a target language for our AD source-code transformations, we consider a language that extends the language of §3 with limited linear types.
We could opt to work with a full linear logic, as in [6] or [4]. Instead, however, we will only include the bare minimum of linear type formers that we actually need to phrase the AD transformations. The resulting language is closely related to, but more minimal than, the Enriched Effect Calculus of [14]. We limit our language in this way because we want to stress that the resulting code transformations can easily be implemented in existing functional languages such as Haskell or O'Caml. As we discuss in §9, the idea will be to make use of a module system to implement the required linear types as abstract data types. In our idealised target language, we consider linear types (aka computation types) τ, σ, ρ, in addition to the Cartesian types (aka value types) τ, σ, ρ that we have considered so far. We think of Cartesian types as denoting spaces and linear types as denoting spaces equipped with an algebraic structure. As we are interested in studying differentiation, the relevant space structure in this instance is a geometric structure that suffices to define differentiability. Meanwhile, the relevant algebraic structure on linear types turns out to be that of a commutative monoid, as this algebraic structure is needed to phrase automatic differentiation algorithms. Indeed, we will use the linear types to denote spaces of (co)tangent vectors to the spaces of primals denoted by Cartesian types. These spaces of (co)tangents form a commutative monoid under addition. Concretely, we extend the types and terms of our language with linear types (among them linear function types τ ⊸ σ and the multiplicative conjunction !τ ⊗ σ) together with the corresponding introduction and elimination forms. In addition to the judgement Γ ⊢ t : τ, which we encountered in §3, we now consider an additional judgement Γ; x : τ ⊢ t : σ. While we think of the former as denoting a (structure-preserving) function between spaces, we think of the latter as a (structure-preserving) function from the space which Γ denotes to the space of (structure-preserving) monoid homomorphisms from the denotation of τ to that of σ.
In this instance, "structure-preserving" will mean differentiable. Fig. 3 displays the typing rules of our language. We consider the terms of this language up to the βη+-equational theory of Fig. 4. It includes βη-rules as well as commutative monoid and homomorphism laws.

Category theory. We assume familiarity with categories, functors, natural transformations, and their theory of (co)limits and adjunctions. We write:

- unary, binary, and I-ary products as 1, X₁ × X₂, and ∏_{i∈I} Xᵢ, writing πᵢ for the projections and (), (x₁, x₂), and (xᵢ)_{i∈I} for the tupling maps;
- unary, binary, and I-ary coproducts as 0, X₁ + X₂, and ∑_{i∈I} Xᵢ, writing ιᵢ for the injections and [], [x₁, x₂], and [xᵢ]_{i∈I} for the cotupling maps;
- exponentials as Y ⇒ X, writing Λ and ev for currying and evaluation.

Monoids. We assume familiarity with the category CMon of commutative monoids X = (|X|, 0_X, +_X), such as R^n := (R^n, 0, +), their Cartesian product X × Y, tensor product X ⊗ Y, and the free commutative monoid !S on a set S (we write δ for the inclusion S → |!S|).

The language of §3 has a canonical interpretation in any Cartesian closed category (C, 1, ×, ⇒), once we fix C-objects ⟦real^n⟧ to interpret real^n and C-morphisms ⟦op⟧ ∈ C(⟦Dom(op)⟧, ⟦real^m⟧) to interpret op ∈ Op^m_{n₁,…,nₖ}. We interpret types τ and contexts Γ as C-objects ⟦τ⟧ and ⟦Γ⟧, and we interpret terms Γ ⊢ t : τ as morphisms ⟦t⟧ ∈ C(⟦Γ⟧, ⟦τ⟧). This is an instance of the universal property of Syn mentioned in §3. We discuss how to extend ⟦−⟧ to apply to the full target language of §4. We call a locally indexed category L satisfying all the required conditions (having biproducts and supporting ⇒-, !(−) ⊗ (−)-, and Cartesian ⊸-types) a categorical model of the language of §4. In particular, any biadditive model of intuitionistic linear logic [29, 17] is such a categorical model.
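The free commutative monoid !S on a set S, mentioned above, can be represented concretely by finitely supported multisets. A hedged sketch (our choice of representation, using Python's Counter):

```python
from collections import Counter

# The free commutative monoid !S: finitely supported multisets over S,
# with the inclusion delta : S -> |!S| sending s to the singleton multiset.
def delta(s):
    return Counter({s: 1})

zero = Counter()            # the unit 0 of the monoid

def madd(a, b):
    return a + b            # Counter addition is the commutative monoid +

# Commutativity and unit laws hold by construction:
assert madd(delta('x'), madd(delta('x'), delta('y'))) == Counter({'x': 2, 'y': 1})
assert madd(zero, delta('x')) == delta('x')
assert madd(delta('x'), delta('y')) == madd(delta('y'), delta('x'))
```

The universal property says that any function S → |M| into a commutative monoid M extends uniquely to a monoid homomorphism !S → M, by summing over the multiset.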
If we choose ⟦real^n⟧ ∈ ob L to interpret real^n and compatible L-morphisms ⟦lop⟧ ∈ L(⟦Dom(lop)⟧)(⟦LDom(lop)⟧, ⟦real^m⟧) for each lop ∈ LOp^m_{n₁,…,nₖ;n′₁,…,n′ₗ}, then we can interpret linear types τ as objects ⟦τ⟧ of L. We can interpret τ ⊸ σ as the C-object ⟦τ ⊸ σ⟧ := ⟦τ⟧ ⊸ ⟦σ⟧. Finally, we can interpret terms Γ ⊢ t : τ as morphisms ⟦t⟧ ∈ C(⟦Γ⟧, ⟦τ⟧) and terms Γ; x : τ ⊢ t : σ as morphisms ⟦t⟧ ∈ L(⟦Γ⟧)(⟦τ⟧, ⟦σ⟧). Observe that we interpret 0 and + using the biproduct structure of L. All other type formers are interpreted as one expects based on their notation, using introduction and elimination rules for the required structural isomorphisms.

Diffeological spaces. Throughout this paper, we have an instance of the abstract semantics of our languages in mind, as we intend to interpret real^n as the usual Euclidean space R^n and to interpret each program x₁ : real^{n₁}, …, xₖ : real^{nₖ} ⊢ t : real^m as a smooth (C^∞-) function R^{n₁} × ⋯ × R^{nₖ} → R^m. A challenge is that the usual settings for multivariate calculus and differential geometry do not form Cartesian closed categories, obstructing the interpretation of higher types (see [20, Appx. A]). A solution, recently employed by [20], is to work with diffeological spaces [33, 21], which generalise the usual notions of differentiability from Euclidean spaces and smooth manifolds to apply to higher types (as well as a range of other types, such as sum and inductive types). We will also follow this route and use such spaces to construct our concrete semantics. Other valid options for a concrete semantics exist: convenient vector spaces [19, 7], Frölicher spaces [18], or synthetic differential geometry [25], to name a few. We choose to work with diffeological spaces mostly because they seem to us to provide the simplest way to define and analyse the semantics of a rich class of language features.
Diffeological spaces formalise the intuition that a higher-order function is smooth if it sends smooth functions to smooth functions, meaning that we can never use it to build non-smooth first-order functions. This intuition is reminiscent of a logical relation, and it is realised by directly axiomatising smooth maps into the space, rather than treating smoothness as a derived property. A diffeological space X = (|X|, P_X) consists of a set |X| together with, for each n ∈ N and each open subset U of R^n, a set P^U_X of functions U → |X| called plots, such that

- (constant) all constant functions are plots;
- (precomposition) if f : V → U is a smooth function and p ∈ P^U_X, then f; p ∈ P^V_X;
- (gluing) if (Uᵢ)_{i∈I} is an open cover of U and p : U → |X| restricts to a plot on each Uᵢ, then p is a plot.

We think of plots as the maps that are axiomatically deemed "smooth". We call a function f : X → Y between diffeological spaces smooth if, for all plots p ∈ P^U_X, we have that p; f ∈ P^U_Y. We write Diff(X, Y) for the set of smooth maps from X to Y. Smooth functions compose, and so we have a category Diff of diffeological spaces and smooth functions. We give some examples of such spaces.

Example 1 (Manifold diffeology). Given any open subset X of a Euclidean space R^n (or, more generally, a smooth manifold X), we can take the set of smooth (C^∞) functions U → X in the traditional sense as P^U_X. Given another such space X′, Diff(X, X′) then coincides precisely with the set of smooth functions X → X′ in the traditional sense of calculus and differential geometry. Put differently, the categories CartSp of Euclidean spaces and Man of smooth manifolds with smooth functions form full subcategories of Diff.

Example 2 (Product diffeology). Given diffeological spaces (Xᵢ)_{i∈I}, we can equip ∏_{i∈I} |Xᵢ| with the product diffeology, whose plots are exactly the functions that are plots in each component.

Example 3 (Functional diffeology). We can equip the set Diff(X, Y) with the functional diffeology, whose plots are the functions p : U → Diff(X, Y) such that (u, x) ↦ p(u)(x) is a smooth map U × X → Y.

Examples 2 and 3 give us the categorical product and exponential objects, respectively, in Diff. The embeddings of CartSp and Man into Diff preserve products (and coproducts). We work with the concrete semantics, where we fix C = Diff as the target for interpreting Cartesian types and their terms.
That is, by choosing the interpretation ⟦real^n⟧ := R^n, and by interpreting each op ∈ Op^m_{n₁,…,nₖ} as the smooth function ⟦op⟧ : R^{n₁} × ⋯ × R^{nₖ} → R^m that it is intended to represent, we obtain a unique interpretation ⟦−⟧ : CSyn → Diff. To interpret linear types and their terms, we need a semantic setting L that is both compatible with Diff and enriched over the category of commutative monoids. We choose to work with commutative diffeological monoids, that is, commutative monoids internal to the category Diff. A diffeological monoid X = (|X|, P_X, 0_X, +_X) consists of a diffeological space (|X|, P_X) with a monoid structure (0_X ∈ |X|, (+_X) : |X| × |X| → |X|), such that +_X is smooth. We call a diffeological monoid commutative if the underlying monoid structure on |X| is commutative. We write Diff_CM for the category whose objects are commutative diffeological monoids and whose morphisms are smooth monoid homomorphisms. Given that Diff_CM is CMon-enriched, finite products are biproducts.

Example 4. The real numbers form a commutative diffeological monoid R by combining the standard diffeology with the usual commutative monoid structure (0, +). Similarly, N ∈ Diff_CM by equipping N with (0, +) and the discrete diffeology, in which plots are locally constant functions.

Example 5. We form the (categorical) product in Diff_CM of (Xᵢ)_{i∈I} by equipping ∏_{i∈I} |Xᵢ| with the product diffeology and the product monoid structure.

Example 6. For a diffeological space X, we can equip the free commutative monoid !|X| with a suitable diffeology P_{!X} to obtain a commutative diffeological monoid !X.

Example 7. Given commutative diffeological monoids X and Y, we can equip the tensor product monoid |X| ⊗ |Y| with a suitable diffeology to obtain a commutative diffeological monoid X ⊗ Y. In this paper, we only use the combined operation !X ⊗ Y (read: (!X) ⊗ Y).

Example 8.
Given commutative diffeological monoids X and Y, we can define a commutative diffeological monoid X ⊸ Y with underlying set Diff_CM(X, Y), the pointwise monoid structure, and the diffeology inherited from the space of smooth functions. In this paper, we will primarily be interested in X ⊸ Y as a diffeological space, and we will mostly disregard its monoid structure for now.

Example 9. Given a diffeological space X and a commutative diffeological monoid Y, we can define a commutative diffeological monoid structure X ⇒ Y on X ⇒ (|Y|, P_Y) by using the pointwise monoid structure.

! is a left adjoint to the obvious forgetful functor Diff_CM → Diff, while !(X × Y) ≅ !X ⊗ !Y and !1 ≅ N. Seeing that (⊗, N, ⊸) defines a symmetric monoidal closed structure on Diff_CM, cognoscenti will recognise that (Diff, 1, ×, ⇒) together with (Diff_CM, N, ⊗, ⊸) and the adjunction gives a model of intuitionistic linear logic [29]. In fact, seeing that Diff_CM is CMon-enriched, the model is biadditive [17]. However, we do not need such a rich type system. For us, the following suffices. Define Diff_CM(X), for X ∈ ob Diff, to have the objects of Diff_CM and homsets Diff_CM(X)(Y, Z) := Diff(X, Y ⊸ Z). Identities are given by x ↦ (y ↦ y), composition of f and g by x ↦ (f(x); g(x)), and change of base along f ∈ Diff(X′, X) by Diff_CM(f)(g) := f ;_Diff g. Diff_CM(−) defines a locally indexed category. By taking C = Diff and L(−) = Diff_CM(−), we obtain a concrete instance of our abstract semantics, with the required natural isomorphisms.

The prime motivating examples of morphisms in this category are derivatives. Recall that the derivative at x, Df(x), and the transposed derivative at x, (Df)ᵗ(x), of a smooth function f : R^n → R^m are the unique linear functions Df(x) : R^n → R^m and (Df)ᵗ(x) : R^m → R^n satisfying Df(x)(v) = lim_{h→0} (f(x + h·v) − f(x))/h and ⟨(Df)ᵗ(x)(w), v⟩ = ⟨w, Df(x)(v)⟩. Then x ↦ Df(x) and x ↦ (Df)ᵗ(x) give maps in Diff_CM(R^n)(R^n, R^m) and Diff_CM(R^n)(R^m, R^n), respectively. Indeed, derivatives Df(x) of f at x are linear functions, as are transposed derivatives (Df)ᵗ(x). Both depend smoothly on x in case f is C^∞-smooth. Note that the derivatives are not merely linear in the sense of preserving 0 and +. They are also multiplicative in the sense that (Df)(x)(c · v) = c · (Df)(x)(v).
We could have captured this property by working with vector spaces internal to Diff. However, we will not need this property to phrase or establish correctness of AD. Therefore, we restrict our attention to the more straightforward structure of commutative monoids. Defining ⟦real^n⟧ def= R^n and interpreting each lop ∈ LOp as the smooth function ⟦lop⟧ it is intended to represent, we obtain a canonical interpretation of our target language in Diff_CM. In this section, we show that any categorical model L : C^op → Cat of our target language gives rise to two Cartesian closed categories Σ_C L and Σ_C L^op (which we already used in §2). We believe these observations of Cartesian closure are novel. Surprisingly, they are highly relevant for obtaining a principled understanding of AD on a higher-order language: the former for forward AD, and the latter for reverse AD. Applying these constructions to the syntactic category LSyn : CSyn^op → Cat of our language, we produce a canonical definition of the AD macros, as the canonical interpretation of the λ-calculus in the Cartesian closed categories Σ_CSyn LSyn and Σ_CSyn LSyn^op. In addition, when we apply this construction to the denotational semantics Diff_CM : Diff^op → Cat and invoke a categorical logical relations technique, known as subsconing, we find an elegant correctness proof of the source-code transformations. The abstract construction delineated in this section is in many ways the theoretical crux of this paper. Recall that for any strictly indexed category, i.e. a (strict) functor L : C^op → Cat, we can consider its total category (or Grothendieck construction) Σ_C L, which is a fibred category over C (see [23, sections A1.1.7, B1.3.1]). We can view it as a Σ-type of categories, which generalizes the Cartesian product. Concretely, its objects are pairs (A_1, A_2) of objects A_1 of C and A_2 of L(A_1). Its morphisms (A_1, A_2) → (B_1, B_2) are pairs (f_1, f_2) of a morphism f_1 : A_1 → B_1 in C and a morphism f_2 : A_2 → L(f_1)(B_2) in L(A_1), and composition is (f_1, f_2); (g_1, g_2) def= (f_1; g_1, f_2; L(f_1)(g_2)).
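The composition law above, instantiated to smooth maps paired with their derivatives, is exactly the chain rule. The following TypeScript sketch (the names FwdMap, compFwd, idFwd, sq, and dbl are ours, not the paper's) specialises the base and fibres to scalars: a morphism is a primal map f_1 together with a fibre component f_2 assigning to each point a linear map.

```typescript
// Sketch of morphisms in the total category Sigma_C L, instantiated to
// scalar smooth maps: a primal map in the base C paired with a fibre
// component assigning to each point a linear map (the derivative there).

interface FwdMap {
  primal: (x: number) => number;               // f1 : A1 -> B1 in C
  deriv: (x: number) => (v: number) => number; // f2 : for each x, a linear map
}

// Composition (f1, f2); (g1, g2) = (f1; g1, f2; L(f1)(g2)): reindexing g2
// along f1 means taking g's derivative at the point f1(x) -- the chain rule.
function compFwd(f: FwdMap, g: FwdMap): FwdMap {
  return {
    primal: (x) => g.primal(f.primal(x)),
    deriv: (x) => (v) => g.deriv(f.primal(x))(f.deriv(x)(v)),
  };
}

// Identity morphism (id_{A1}, id_{A2})
const idFwd: FwdMap = { primal: (x) => x, deriv: (_x) => (v) => v };

// Example morphisms: squaring and doubling
const sq: FwdMap = { primal: (x) => x * x, deriv: (x) => (v) => 2 * x * v };
const dbl: FwdMap = { primal: (x) => 2 * x, deriv: (_x) => (v) => 2 * v };
```

Composing sq with dbl yields the primal map x ↦ 2x² with derivative v ↦ 4x·v at x: the chain rule falls out of f_2; L(f_1)(g_2) with no further stipulation.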
Further, given a strictly indexed category L : C^op → Cat, we can consider its fibrewise dual category L^op : C^op → Cat, which is defined as the composite of L : C^op → Cat with the duality functor (−)^op : Cat → Cat. Thus, we can apply the same construction to L^op to obtain a category Σ_C L^op.
6.2 Structure of Σ_C L and Σ_C L^op for Locally Indexed Categories
§6.1 applies, in particular, to the locally indexed categories of §5. In this case, we will analyze the categorical structure of Σ_C L and Σ_C L^op. For reference, we first give a concrete description. Σ_C L is the following category: objects are pairs (A_1, A_2) of an object A_1 of C and an object A_2 of L; morphisms (A_1, A_2) → (B_1, B_2) are pairs (f_1, f_2) of f_1 ∈ C(A_1, B_1) and f_2 ∈ L(A_1)(A_2, B_2); composition (f_1, f_2); (g_1, g_2) is given by (f_1; g_1, f_2; L(f_1)(g_2)) and identities id_{(A_1, A_2)} are (id_{A_1}, id_{A_2}). Σ_C L^op is the following category: objects are the same pairs (A_1, A_2); morphisms (A_1, A_2) → (B_1, B_2) are pairs (f_1, f_2) of f_1 ∈ C(A_1, B_1) and f_2 ∈ L(A_1)(B_2, A_2); composition (f_1, f_2); (g_1, g_2) is given by (f_1; g_1, L(f_1)(g_2); f_2) and identities id_{(A_1, A_2)} are (id_{A_1}, id_{A_2}). We examine the categorical structure present in Σ_C L and Σ_C L^op for categorical models L in the sense of §5 (i.e., in case L has biproducts and supports ⊸-, !(−) ⊗ (−)-, and Cartesian ⇒-types). We believe this is a novel observation. We will make heavy use of it to define our AD algorithms and to prove them correct. We observe that we need L to have biproducts (equivalently: to be CMon-enriched) in order to show Cartesian closure. Further, we need linear ⊸-types and Cartesian ⇒-types to construct exponentials. Both categories have terminal object (1, 1) and binary products (A_1, A_2) × (B_1, B_2) = (A_1 × B_1, A_2 × B_2). Proof. We have the evident (natural) bijections. Observe that we need the biproduct structure of L to construct finite products in Σ_C L^op. Further, we need Cartesian ⇒-types and !(−) ⊗ (−)-types, but not biproducts, to construct exponentials in Σ_C L^op. As an example of the linear primitives used by the AD macros, the transposed derivative of elementwise multiplication is D op^t(x; y) = ⟨(snd x) * y, (fst x) * y⟩, where we use (linear) elementwise multiplication (*) ∈ LOp^n_{n;n}. We represent derivatives as linear functions. This representation allows for efficient Jacobian-vector/adjoint product implementations, which avoid first calculating a full Jacobian and then taking a product.
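This representational point can be made concrete (TypeScript, with our own names emul, jvpEmul, vjpEmul): for elementwise multiplication on R^n, the Jacobian-vector and vector-Jacobian (adjoint) products each cost O(n) per application, and the n × n diagonal Jacobian is never materialised. Modulo renaming (here the primal pair is (x, y) and the incoming cotangent is w), vjpEmul is the transposed derivative ⟨(snd x) * y, (fst x) * y⟩ from the text.

```typescript
// Sketch: the derivative of elementwise multiplication on R^n as
// Jacobian-vector and vector-Jacobian products, never building the Jacobian.

const emul = (x: number[], y: number[]): number[] => x.map((xi, i) => xi * y[i]);
const vadd = (x: number[], y: number[]): number[] => x.map((xi, i) => xi + y[i]);

// Jacobian-vector product: D emul(x, y)(dx, dy) = y * dx + x * dy, elementwise.
// O(n) time and space per application.
function jvpEmul(
  [x, y]: [number[], number[]],
  [dx, dy]: [number[], number[]]
): number[] {
  return vadd(emul(y, dx), emul(x, dy));
}

// Vector-Jacobian (adjoint) product: transports a cotangent w backwards as
// (y * w, x * w), matching the transposed derivative given in the text.
function vjpEmul([x, y]: [number[], number[]], w: number[]): [number[], number[]] {
  return [emul(y, w), emul(x, w)];
}
```

Materialising the Jacobian of an elementwise operation on R^n would instead cost O(n²) space before any product is taken, which is prohibitive for the large n typical of machine-learning workloads.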
Such implementations are known to be important to achieve performant AD systems. For the AD transformations to be correct, it is important that these derivatives of language primitives are implemented correctly, in the sense that ⟦D op⟧ and ⟦D op^t⟧ equal the derivative and transposed derivative of ⟦op⟧. In practice, AD library developers tend to assume the subtle task of correctly implementing such derivatives D op(x; y) and D op^t(x; y), and thereby extending the AD macros, whenever a new primitive operation op is added to the library.
In this section, we show that the source-code transformations described in §7 correctly implement mathematical derivatives. We make correctness precise as the statement that, for programs x : τ ⊢ t : σ between first-order types τ and σ, i.e. types not containing any function type constructors, the forward AD transformation of t denotes the pairing of ⟦t⟧ with its derivative, and the reverse AD transformation of t denotes the pairing of ⟦t⟧ with its transposed derivative, where ⟦−⟧ is the semantics of §5. The proof mainly consists of logical relations arguments over the semantics in Σ_Diff Diff_CM and Σ_Diff Diff_CM^op. This logical relations proof can be phrased in elementary terms, but the resulting argument is technical and would be hard to discover. Instead, we prefer to phrase it in terms of a categorical subsconing construction, a more abstract and elegant perspective on logical relations. We discovered the proof by taking this categorical perspective, and, while we have verified the elementary argument (see [36, Appx. D]), we would not otherwise have come up with it.
Subsconing. Logical relations arguments provide a powerful proof technique for demonstrating properties of typed programs. The arguments proceed by induction on the structure of types. Here, we briefly review the basics of categorical logical relations arguments, or subsconing constructions. We restrict to the level of generality that we need here, but we would like to point out that the theory applies much more generally. Consider a Cartesian closed category (C, 1, ×, ⇒).
Suppose that we are given a functor F : C → Set to the category Set of sets and functions which preserves finite products in the sense that F(1) ≅ 1 and F(C × C') ≅ F(C) × F(C'). Then, we can form the subscone of F, or category of logical relations over F, which is Cartesian closed, with a faithful Cartesian closed functor π_1 to C which forgets about the predicates [24]:
- objects are pairs (C, P) of an object C of C and a predicate P ⊆ F(C);
- morphisms (C, P) → (C', P') are C-morphisms f : C → C' which respect the predicates in the sense that F(f)(P) ⊆ P';
- identities and composition are as in C;
- (1, F(1)) is the terminal object, products are given by (C, P) × (C', P') = (C × C', P × P') (using F(C × C') ≅ F(C) × F(C')), and exponentials by (C, P) ⇒ (C', P') = (C ⇒ C', {f ∈ F(C ⇒ C') | F(ev)(f, x) ∈ P' for all x ∈ P}).
In typical applications, C can be the syntactic category of a language (like Syn), the codomain of a denotational semantics ⟦−⟧ (like Diff), or a product of the above, if we want to consider n-ary logical relations. Typically, F tends to be a hom-functor (which always preserves products), like C(1, −) or C(C_0, −), for some important object C_0. When applied to the syntactic category Syn and F = Syn(1, −), the formulae for products and exponentials in the subscone clearly reproduce the usual recipes in traditional, syntactic logical relations arguments. As such, subsconing generalises standard logical relations methods. We will apply the subsconing construction above to suitable hom-functors out of the product categories Diff × Σ_Diff Diff_CM and Diff × Σ_Diff Diff_CM^op, where we note that Diff, Σ_Diff Diff_CM, and Σ_Diff Diff_CM^op are Cartesian closed (given the arguments of §5 and §6) and that the product of Cartesian closed categories is again Cartesian closed. Let us write ⟦−⟧_f and ⟦−⟧_r for the resulting interpretations of the source language in the two subscones. We write P^f_τ and P^r_τ, respectively, for the relations π_2 ⟦τ⟧_f and π_2 ⟦τ⟧_r. We interpret real^n by equipping it with the relation pairing a curve with its (transposed) derivative, spelled out below, where we write Df for the semantic derivative of f (see §5). We need to verify, respectively, that (⟦op⟧, ⟦D op⟧) and (⟦op⟧, ⟦D op^t⟧) respect the logical relations P^f and P^r.
This respecting of relations follows immediately from the chain rule for multivariate differentiation, as long as we have implemented our derivatives correctly for the basic operations op. Writing real^{n1,..,nk} def= real^{n1} * .. * real^{nk} and R^{n1,..,nk} def= R^{n1} × .. × R^{nk}, we compute
⟦real^{n1,..,nk}⟧_f = ((R^{n1,..,nk}, (R^{n1,..,nk}, R^{n1,..,nk})), {(f, (g, h)) | f = g, h = Df})
⟦real^{n1,..,nk}⟧_r = ((R^{n1,..,nk}, (R^{n1,..,nk}, R^{n1,..,nk})), {(f, (g, h)) | f = g, h = (Df)^t}).
Popular functional languages, such as Haskell and OCaml, do not natively support linear types. As such, the transformations described in this paper may seem hard to implement. However, as we summarize in this section (and detail in [36, Appx. C]), we can easily implement the limited linear types needed for the transformations as abstract data types by using merely a basic module system. Specifically, we consider, as an alternative, applied target language for our transformations, the extension of the source language of §3 with the terms and types of Fig. 5. We can define a faithful translation (−)† from our linear target language of §4 to this language: define (!τ ⊗ σ)† def= Tens(τ†, σ†), (τ ⊸ σ)† def= LFun(τ†, σ†), (real^n)† def= real^n and extend (−)† structurally recursively, letting it preserve all other type formers. We then translate (x_1 : τ_1, . . . , x_n : τ_n; y : σ ⊢ t : ρ)† def= x_1 : τ_1†, . . . , x_n : τ_n† ⊢ t† : (σ ⊸ ρ)† and (x_1 : τ_1, . . . , x_n : τ_n ⊢ t : σ)† def= x_1 : τ_1†, . . . , x_n : τ_n† ⊢ t† : σ†. We believe an interested reader can fill in the details. This exhibits the linear target language as a sublanguage of the applied target language. The applied target language merely collapses the distinction between linear and Cartesian types, and it adds constructs such as lapp(t, s) for practical usability and to ensure that our adequacy result below is meaningful. We can implement the API of Fig.
5 as a module that defines the abstract types LFun(τ, σ), under the hood implemented as a plain function type τ → σ, and Tens(τ, σ), which is implemented as a list of pairs List(τ * σ). Then, the required terms of Fig. 5 can be implemented using the standard idioms [], t :: s, and fold op over x in t from acc = init for empty lists, cons-ing, and folding. Our denotational semantics extends to this applied target language and is adequate with respect to the operational semantics induced by the suggested implementation. Further, our correctness proofs of the induced source-code translations also transfer to this applied setting, and they can be usefully phrased as manual, extensible logical relations proofs. As an application, we can extend our source language with higher-order primitives, like map ∈ Syn((real → real) * real^n, real^n), to "map" functions over the black-box arrays real^n. Then, our proofs extend to show that their correct forward and reverse derivatives can be implemented using the standard functional programming idioms zip and zipWith. Here, we can operate directly on the internal representations of LFun(τ, σ) and Tens(τ, σ), as the definitions of derivatives of primitives live inside our module.
Related work. This work is closely related to [20], which introduced a similar semantic correctness proof for a version of forward-mode AD, using a subsconing construction. A major difference is that this paper also phrases and proves correctness of reverse-mode AD on a λ-calculus and relates reverse-mode to forward-mode AD. Using a syntactic logical relations proof instead, [5] also proves correctness of forward-mode AD. Again, it does not address reverse AD. [11] proposes a similar construction to that of §6, and it relates it to the differential λ-calculus. That paper develops sophisticated axiomatics for semantic reverse differentiation.
However, it neither relates the semantics to a source-code transformation, nor discusses differentiation of higher-order functions. Our construction of differentiation with a (biadditive) linear target language might remind the reader of differential linear logic [15]. In differential linear logic, (forward) differentiation is a first-class operation in a (biadditive) linear language. By contrast, in our treatment, differentiation is a meta-operation. Importantly, [16] describes and implements what are essentially our source-code transformations, though they were restricted to first-order functions and scalars. [37] sketches an extension of the reverse-mode transformation to higher-order functions in essentially the same way as proposed in this paper. It does not motivate or derive the algorithm or show its correctness. Nevertheless, this short paper discusses important practical considerations for implementing the algorithm, and it discusses a dependently typed variant of the algorithm. Next, there are various lines of work relating to correctness of reverse-mode AD that we consider less similar to our work. For example, [28] defines and proves correct a formulation of reverse-mode AD on a higher-order language that depends on a non-standard operational semantics, essentially a form of symbolic execution. [2] does something similar for reverse-mode AD on a first-order language extended with conditionals and iteration. [8] defines an AD algorithm in a simply typed λ-calculus with linear negation (essentially, the continuation-based AD of [20]) and proves it correct using operational techniques. Further, they show that this algorithm corresponds to reverse-mode AD under a non-standard operational semantics (with the "linear factoring rule"). These formulations of reverse-mode AD all depend on non-standard run-times and fall into the category of "define-by-run" formulations of reverse-mode AD.
Meanwhile, we are concerned with "define-then-run" formulations: source-code transformations producing differentiated code at compile time, which can then be optimized during compilation with existing compiler tool-chains. Finally, there is a long history of work on reverse-mode AD, though almost none of it applies the technique to higher-order functions. A notable exception is [31], which gives an impressive source-code transformation implementation of reverse AD in Scheme. While very efficient, this implementation crucially uses mutation. Moreover, the transformation is complex, and correctness is not considered. More recently, [38] describes a much simpler implementation of a reverse AD code transformation, again very performant. However, the transformation is quite different from the one considered in this paper, as it relies on a combination of delimited continuations and mutable state. Correctness is not considered, perhaps because of the semantic complexities introduced by impurity. Our work adds to the existing literature by presenting (to our knowledge) the first principled and pure define-then-run reverse AD algorithm for a higher-order language, by arguing its practical applicability, and by proving semantic correctness of the algorithm. We plan to build a practical, verified AD library based on the methods introduced in this paper. This will involve calculating the derivative of many first- and higher-order primitives according to our method. Next, we aim to extend our method to other expressive language features. We conjecture that the method extends to source languages with variant and inductive types, as long as one makes the target language a linear dependent type theory [10, 34]. Indeed, the dimension of (co)tangent spaces to a disjoint union of spaces depends on the choice of base point. The required colimits to interpret such types in Σ_C L and Σ_C L^op should exist by standard results about arrow and container categories [3].
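The derivative calculation for a higher-order primitive like map (§9) can be sketched as follows. The TypeScript types and names below are ours and deliberately simplified: we represent a differentiated scalar function by its primal value and pointwise derivative, and we ignore the cotangent contribution to the function argument itself, which the full construction collects in a Tens(real, real) value.

```typescript
// Hedged sketch of forward and reverse derivatives of map over black-box
// arrays, using the Section 9 representations: LFun as a plain function,
// Tens as a list of pairs. Simplified: the function argument's own
// cotangent (a Tens(real, real) in the full construction) is omitted.

type LFun<A, B> = (a: A) => B;   // abstract in the module; a function inside
type Tens<A, B> = Array<[A, B]>; // abstract in the module; pairs inside

// A differentiated scalar function: primal value and derivative at a point.
type DFun = (x: number) => [number, number];

// Forward derivative of map: primals plus a JVP acting elementwise on tangents.
function fwdMap(fd: DFun, xs: number[]): [number[], LFun<number[], number[]>] {
  const pairs = xs.map(fd);
  const ys = pairs.map(([y, _d]) => y);
  const ds = pairs.map(([_y, d]) => d);
  return [ys, (dxs) => dxs.map((dx, i) => ds[i] * dx)]; // zipWith (*)
}

// Reverse derivative of map: the Jacobian of an elementwise map is diagonal,
// hence its own transpose, so the VJP on cotangents is again elementwise
// multiplication by the pointwise derivatives.
function revMap(fd: DFun, xs: number[]): [number[], LFun<number[], number[]>] {
  const [ys, jvp] = fwdMap(fd, xs);
  return [ys, jvp];
}
```

In the real module, fwdMap and revMap would live next to the definitions of LFun and Tens, so that they can use the internal representations directly, as described in §9.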
We are hopeful that the method can also be made to apply to source languages with general recursion, by calculating the derivative of fixpoint combinators similarly to our calculation for map. The correctness proof will then rely on a domain-theoretic generalisation of our techniques [35].
References
- TensorFlow: A system for large-scale machine learning
- A simple differentiable programming language
- Categories of containers
- Dual intuitionistic linear logic
- On the versatility of open logical relations: Continuity, automatic differentiation, and a containment theorem
- A mixed linear and non-linear logic: Proofs, terms and models
- A convenient differential category. Cahiers de topologie et géométrie différentielle catégoriques
- Backpropagation in the simply typed lambda-calculus with linear negation
- The Stan math library: Reverse-mode automatic differentiation in C++
- A linear logical framework
- Reverse derivative categories
- Typed categorical combinatory logic
- Enriching an effect calculus with linear types
- An introduction to differential linear logic: proof-nets, models and antiderivatives
- The simple essence of automatic differentiation
- Differential structure in models of multiplicative biadditive intuitionistic linear logic
- Smooth structures
- Linear spaces and differentiation theory
- Correctness of automatic differentiation via diffeologies and categorical gluing
- Don't unroll adjoint: differentiating SSA-form programs
- Sketches of an elephant: A topos theory compendium
- Quasitoposes, quasiadhesive categories and Artin glueing
- Synthetic differential geometry
- Introduction to higher-order categorical logic
- Call-by-push-value: A functional/imperative synthesis
- A differential-form pullback programming language for higher-order reverse-mode automatic differentiation
- Categorical semantics of linear logic. Panoramas et synthèses 27
- Automatic differentiation in PyTorch
- Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator
- Efficient differentiable programming in a functional array-processing language
- Differential geometrical methods in mathematical physics
- A categorical semantics for linear logical frameworks
- Denotational correctness of forward-mode automatic differentiation for iteration and recursion
- Reverse AD at higher types: Pure, principled and denotationally correct (full version)
- The differentiable curry
- Demystifying differentiable programming: Shift/reset the penultimate backpropagator
Derivatives of tuple-valued functions are computed component-wise. (In fact, the corresponding facts hold more generally for any first-order type, as an iterated product of real^n.) Suppose that (f, (g, h)) ∈ P^f_{real^{n1,..,nk}}, i.e. g = f and h = Df. Then, using the chain rule in the last step, we have that (f, (g, h)); (⟦op⟧, ⟦D op⟧) again lies in P^f. Similarly, if (f, (g, h)) ∈ P^r_{real^{n1,..,nk}}, then by the chain rule and linear algebra, (f, (g, h)); (⟦op⟧, ⟦D op^t⟧) lies in P^r. Consequently, we obtain our Cartesian closed functors ⟦−⟧_f and ⟦−⟧_r. As a consequence, the two evident squares, relating the AD macros on Syn to the interpretations ⟦−⟧_f and ⟦−⟧_r, commute. Indeed, going around the squares in both directions defines Cartesian closed functors that agree on their action on real^n and all operations op. So, by the universal property of Syn, they must coincide. In particular, the pairing of a term's semantics with the semantics of its macro derivative respects the relations P^f and P^r. Most of the work is now in place to show correctness of AD. We finish the proof below. To ease notation, we work with terms in a context with a single type. Doing so is not a restriction, as our language has products, and the theorem holds for arbitrary terms between first-order types. For programs x : τ ⊢ t : σ between first-order types τ and σ, the macro-generated forward and reverse derivatives denote the derivative and transposed derivative of ⟦t⟧, where we write D and (−)^t for the usual calculus derivative and matrix transpose.
To show the forward statement at a point x along a tangent vector v, we choose a smooth curve γ : R → ⟦τ⟧ such that γ(0) = x and Dγ(0)(1) = v, and we use that t respects the logical relation P^f. The reverse statement follows analogously, using that t respects P^r.
The images or other third-party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.