key: cord-0058309-zsx3g0tk
title: Characterizing the Role of a Single Coupling Layer in Affine Normalizing Flows
authors: Draxler, Felix; Schwarz, Jonathan; Schnörr, Christoph; Köthe, Ullrich
date: 2021-03-17
journal: Pattern Recognition
DOI: 10.1007/978-3-030-71278-5_1

Deep Affine Normalizing Flows are efficient and powerful models for high-dimensional density estimation and sample generation. Yet little is known about how they succeed in approximating complex distributions, given the seemingly limited expressiveness of individual affine layers. In this work, we take a first step towards a theoretical understanding by analyzing the behaviour of a single affine coupling layer under maximum likelihood loss. We show that such a layer estimates and normalizes conditional moments of the data distribution, and we derive a tight lower bound on the loss depending on the orthogonal transformation of the data before the affine coupling. This bound can be used to identify the optimal orthogonal transform, yielding a layer-wise training algorithm for deep affine flows. Toy examples confirm our findings and stimulate further research by highlighting the remaining gap between layer-wise and end-to-end training of deep affine flows.

Electronic supplementary material: The online version of this chapter (10.1007/978-3-030-71278-5_1) contains supplementary material, which is available to authorized users.

Affine Normalizing Flows such as RealNVP [4] are widespread and successful tools for density estimation. They have seen recent success in generative modeling [3, 4, 9], solving inverse problems [1], lossless compression [6], out-of-distribution detection [12], better understanding adversarial examples [7], and sampling from Boltzmann distributions [13]. These flows approximate arbitrary data distributions μ(x) by learning an invertible mapping T(x) such that given samples are mapped to normally distributed latent codes z := T(x). In other words, they reshape the data density μ into a normal distribution.

While being simple to implement and fast to evaluate, affine flows appear not very expressive at first glance. They consist of invertible layers called coupling blocks; each block leaves half of the dimensions untouched and subjects the other half to just parameterized translations and scalings. Explaining the gap between this seemingly limited expressivity and the models' practical success remains an unsolved challenge.

Taking the problem apart, a single layer consists of a rotation and an affine nonlinearity. It is often argued, somewhat hand-wavingly, that the deep model's expressivity comes from the rotations between the couplings, which allow different dimensions to influence one another [4]. In this work, we open a rigorous branch of explanation by characterizing the normalizing flow generated by a single affine layer. More precisely, we contribute:

- A single affine layer under maximum likelihood (ML) loss learns first- and second-order moments of the conditional distribution of the changed (active) dimensions given the unchanged (passive) dimensions (Sect. 3.2).
- From this insight, we derive a tight lower bound on how much the affine nonlinearity can reduce the loss for a given rotation (Sect. 3.3). This is visualized in Fig. 1, where the bound is evaluated for different rotations of the data.
- We formulate a layer-wise training algorithm that determines rotations using the lower bound and nonlinearities using gradient descent in turn (Sect. 3.4).
- We show that such a single affine layer under ML loss makes the active dimensions independent of the passive dimensions if the data are generated by a certain rule (Sect. 3.5).

Finally, we show empirically in Sect. 4 that, while improving the training of shallow flows, the above findings do not yet explain the success of deep affine flows, and we hope they stimulate further research.

Fig. 1. An affine coupling layer pushes the input density towards a standard normal. Its success depends on the rotation of the input (top row). We derive a lower bound for the error that is actually attained empirically (center row, blue and orange curves). The solution with the lowest error is clearly closest to the standard normal (bottom row, left).

The connection between affine transformations and the first two moments of a distribution is well known in the Optimal Transport literature. When the function space of an Optimal Transport (OT) problem with quadratic ground cost is restricted to affine maps, the best possible transport matches the mean and covariance of the involved distributions [17]. In the case of conditional distributions, affine maps become conditional affine maps [16]. We show that such maps have the same minimizer under maximum likelihood loss (KL divergence) as under OT costs.

It has been argued before that a single coupling or autoregressive block [14] can capture the moments of conditional distributions. This is one of the motivations for the SOS flow [8], based on a classical result on degree-3 polynomials by [5]. However, these works do not make the connection explicit. We give a direct correspondence between the function learnt by an affine coupling and the first two moments of the distribution to be approximated.

Rotations in affine flows are typically chosen at random at initialization and left fixed during training [3, 4]. Others have trained them via some parameterization, such as a series of Householder reflections [15]. The stream of work most closely related to ours explores layer-wise training, which allows an informed choice of the rotation based on the current estimate of the latent normal distribution. Most of these works propose to choose the least Gaussian dimensions as the active subspace [2, 11]. We argue that this is inapplicable to affine flows due to their limited expressivity when the passive dimensions are not informative. To the best of our knowledge, our approach is the first to take the specific structure of the coupling layer into account and to derive a tight lower bound on the loss as a function of the rotation.

Normalizing flows approximate a data distribution μ, available through samples x ∈ R^D ∼ μ, by learning an invertible function T(x) such that the latent codes z := T(x) follow an isotropic normal distribution z ∈ R^D ∼ N(0, 1). When such a function is found, the data distribution μ(x) can be approximated using the change-of-variables formula:

μ(x) ≈ ((T^{-1})_♯ N)(x) = N(T(x); 0, 1) · |det J(x)|,

where J = ∇T(x) is the Jacobian of the invertible function and "·_♯" is the push-forward operator. New samples x ∼ μ can easily be generated by drawing z from the latent Gaussian and transporting it backward through the invertible function:

x = T^{-1}(z), z ∼ N(0, 1).

Affine Normalizing Flows are a particularly efficient way to parameterize such an invertible function T: they are simple to implement and fast to evaluate in both directions T(x) and T^{-1}(z), along with the Jacobian determinant det J [1]. Like most normalizing flow models, they consist of a composition of several invertible layers T(x) = (T_L ∘ · · · ∘ T_1)(x).
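To make the change-of-variables relation and backward sampling concrete, here is a minimal NumPy sketch. The diagonal map T below is a hypothetical stand-in for a trained flow, chosen only because its inverse and Jacobian determinant are trivial; it is not the paper's model.

```python
# Minimal sketch of the change-of-variables formula, assuming a hypothetical
# invertible map T(x) = (x - b) / s with elementwise scale s > 0.
import numpy as np

rng = np.random.default_rng(0)
D = 2
s = np.array([0.5, 2.0])   # elementwise scales (assumed, not learned here)
b = np.array([1.0, -1.0])  # elementwise shifts

def T(x):       # forward map: data -> latent
    return (x - b) / s

def T_inv(z):   # inverse map: latent -> data
    return s * z + b

def log_det_jacobian(x):
    # J = diag(1/s), so log|det J| = -sum(log s), independent of x here
    return -np.sum(np.log(s)) * np.ones(x.shape[0])

def log_density(x):
    # change of variables: log mu(x) = log N(T(x); 0, I) + log|det J(x)|
    z = T(x)
    log_normal = -0.5 * np.sum(z**2, axis=-1) - 0.5 * D * np.log(2 * np.pi)
    return log_normal + log_det_jacobian(x)

# sampling: draw z ~ N(0, I) and push it backward through T
z = rng.standard_normal((5, D))
x = T_inv(z)
print(log_density(x))
```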
The layers are called coupling blocks and modify the distribution sequentially. We recursively define the push-forward of the first l blocks as

μ_l := (T_l)_♯ μ_{l-1}, with μ_0 := μ.

Each block T_l, l = 1, . . . , L, contains a rotation Q_l ∈ SO(D) and a nonlinear transformation τ_l:

T_l(x) = τ_l(Q_l x).

The nonlinear transformation τ_l is given by:

τ_l(y) = τ_l(p, a) = (p, a ⊙ exp(s_l(p)) + t_l(p)).

Here, y = Q_l x_{l-1} ∼ (Q_l)_♯ μ_{l-1} is the rotated input to the nonlinearity (dropping the index l on y for simplicity), and ⊙ is element-wise multiplication. An affine nonlinearity first splits its input into passive and active dimensions p ∈ R^{D_P} and a ∈ R^{D_A}. The passive subspace is copied without modification to the output of the coupling. The active subspace is scaled and shifted as a function of the passive subspace, where s_l and t_l : R^{D_P} → R^{D_A} are represented by a single generic feed-forward neural network [9] and need not be invertible themselves. The affine coupling design makes inversion trivial by transposing Q_l and rearranging terms in τ_l.

Normalizing flows, and affine flows in particular, are typically trained using the maximum likelihood (ML) loss [3]. It is equivalent to the Kullback-Leibler (KL) divergence between the push-forward of the data distribution μ and the latent normal distribution [10]:

L = D_KL(T_♯ μ || N).

The two differ only by terms independent of the trained model (the typically unknown entropy H[μ] and the normalization of the normal distribution).

It is unknown whether affine normalizing flows can push arbitrarily complex distributions to a normal distribution [14]. In the remainder of the section, we shed light on this by considering an affine flow that consists of just a single coupling as defined in Eq. (5). Since we only consider one layer, we drop the layer index l for the remainder of the section. In Sect. 4, we will discuss how these insights on isolated affine layers transfer to deep flows.

We first derive the exact form of the ML loss in Eq. (6) for an isolated affine coupling with a fixed rotation Q as in Eq. (4). The Jacobian of this coupling has a very simple structure: it is a triangular matrix whose diagonal elements are J_ii = 1 if i is a passive dimension and J_ii = exp(s_i(p)) if i is active. Its determinant is the product of the diagonal elements, so that det J(x) > 0 and log det J(x) = Σ_{i=1}^{D_A} s_i(p). The ML loss thus reads:

L = E_{(p,a) = Qx, x ∼ μ} [ ½‖p‖² + ½‖a ⊙ exp(s(p)) + t(p)‖² − Σ_{i=1}^{D_A} s_i(p) ] + const.

We now derive the minimizer of this loss.

Lemma 1. Write m_i(p) := E[a_i | p] and σ_i(p) := Std[a_i | p] for the conditional mean and standard deviation of the active dimensions. Then the ML loss above is minimized by

s_i(p) = −log σ_i(p), t_i(p) = −m_i(p) / σ_i(p).

We derive this by optimizing for s(p), t(p) in Eq. (8) for each value of p separately; the full proof can be found in Appendix A.1. Inserting the optimal s and t yields the active part of the globally optimal affine nonlinearity:

τ(a)_i = (a_i − m_i(p)) / σ_i(p),

i.e. it normalizes a for each p by shifting the mean of μ(a|p) to zero and rescaling the individual standard deviations to one.

Example 1. Consider a distribution where the first variable p is uniformly distributed on the interval [−2, 2]. The distribution of the second variable a is normal, but its mean m(p) and standard deviation σ(p) vary with p. We call this distribution the "W density"; it is shown in Fig. 2a. We now train a single affine nonlinearity τ by minimizing the ML loss, setting Q = 1 (the identity). As hyperparameters, we choose a subnet for s, t with one hidden layer and a width of 256, a learning rate of 10^{−1}, a learning rate decay with factor 0.9 every 100 epochs, and a weight decay of 0. We train for 4096 epochs with 4096 i.i.d. samples from μ each, using the Adam optimizer. We then solve s, t in Lemma 1 for the estimated mean m̂(p) and standard deviation σ̂(p) as predicted by the learnt ŝ and t̂.
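The closed-form minimizer of Lemma 1 can be checked numerically without any training. The sketch below uses hypothetical placeholder functions m(p) and σ(p) (the exact W density is not reproduced here) and applies the optimal coupling s(p) = −log σ(p), t(p) = −m(p)/σ(p) directly, so the pushed-forward active variable should have zero conditional mean and unit conditional variance.

```python
# Sketch of Lemma 1: the optimal single affine coupling (with Q = identity)
# normalizes the conditional moments of mu(a|p). The conditional mean m(p) and
# standard deviation sigma(p) below are illustrative placeholders, not the
# paper's exact W density.
import numpy as np

rng = np.random.default_rng(0)

def m(p):      # hypothetical conditional mean
    return np.abs(p) - 1.0

def sigma(p):  # hypothetical conditional standard deviation
    return 0.2 + 0.1 * np.abs(p)

# sample (p, a) with a | p ~ N(m(p), sigma(p)^2)
N = 100_000
p = rng.uniform(-2.0, 2.0, N)
a = m(p) + sigma(p) * rng.standard_normal(N)

# optimal affine nonlinearity from Lemma 1 (here using the true moments):
#   s(p) = -log sigma(p),  t(p) = -m(p) / sigma(p)
s_opt = -np.log(sigma(p))
t_opt = -m(p) / sigma(p)
a_out = a * np.exp(s_opt) + t_opt      # equals (a - m(p)) / sigma(p)

# the output should have (conditional) zero mean and unit variance
print("mean:", a_out.mean(), "std:", a_out.std())   # ~0 and ~1
```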
Upon convergence of the trained model in Example 1, the estimates m̂(p) and σ̂(p) closely follow their true counterparts m(p) and σ(p), as shown in Fig. 2b.

Example 2. This example modifies the previous one to illustrate that the learnt conditional density τ_♯μ(a|p) is not necessarily Gaussian at the minimum of the loss. The W density from above is transformed to the "WU density" by replacing the conditional normal distribution with a conditional uniform distribution with the same conditional mean m(p) and standard deviation σ(p) as before. One might wrongly believe that the KL divergence favours building a distribution that is marginally normal while ignoring the conditionals, i.e. τ_♯μ(a) = N. Lemma 1 predicts the correct result, namely the following uniform push-forward density depicted in Fig. 2d:

(τ_♯ μ)(a|p) = U([−√3, √3]).

Note how τ_♯μ(a|p) does not depend on p, which we later generalize in Lemma 2.

Knowing that a single affine layer learns the mean and standard deviation of μ(a_i|p) for each p, we can insert this minimizer into the KL divergence. This yields a tight lower bound on the loss after training. Moreover, it allows us to compute a tight upper bound on the loss improvement achieved by the layer, which we denote Δ ≥ 0. This loss reduction can be approximated using samples, without training.

Theorem 1. Consider a data distribution μ and a single affine coupling layer T with a fixed rotation Q. Like in Eq. (5), call (a, p) = Qx the rotated version of x ∼ μ. Then the KL divergence attains the following minimal value:

min_{s,t} D_KL(T_♯ μ || N) = D_KL(μ || N) − Δ,

where the loss improvement by the optimal affine coupling as in Lemma 1 is:

Δ = E_{p ∼ μ(p)} [ Σ_{i=1}^{D_A} ( ½ m_i(p)² + ½ (σ_i(p)² − 1) − log σ_i(p) ) ].

To prove this, insert the minimizer s, t from Lemma 1 into Eq. (8), then evaluate Δ = D_KL(μ||N) − D_KL(T_♯μ||N) to obtain the statement. The detailed proof can be found in Appendix A.2.

The loss reduction by a single affine layer thus depends solely on the first two moments of the distribution of the active dimensions conditioned on the passive subspace; higher-order moments are ignored by this coupling design. Together with Lemma 1, this paints the following picture of an affine coupling layer: it fits a Gaussian distribution to each conditional μ(a_i|p) and normalizes this Gaussian's moments. The gap in entropy between the fitted Gaussian and the true conditional distribution cannot be reduced by the affine transformation. This makes up the remaining KL divergence in Eq. (18).

We now make explicit that a single affine layer can achieve zero loss on the active subspace if and only if the conditional distribution is Gaussian with diagonal covariance (Corollary 1): in that case, a single affine block can reduce the KL divergence on the active subspace to zero. The proof can be found in Appendix A.3.

Example 3. We revisit Examples 1 and 2 and confirm that the minimal loss achieved by a single affine coupling layer on the W-shaped densities matches the predicted lower bound; this is the case for both densities. Figure 3 shows the contribution of the conditional part of the KL divergence, D_KL((T_♯μ)(a|p) || N), as a function of p. For the W density, the conditional μ(a|p) is normally distributed; this is the situation of Corollary 1, and the remaining conditional KL divergence is zero. The remaining loss for the WU density is the negentropy of a uniform distribution with unit variance.

The rotation Q of the isolated coupling layer determines the splitting into active and passive dimensions and the axes of the active dimensions (the rotation within the passive subspace only rotates the input to s, t and is irrelevant). The bounds in Theorem 1 depend heavily on these choices and thus on the chosen rotation Q.
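Theorem 1 states that the achievable improvement Δ can be estimated from samples alone. The following sketch assumes a one-dimensional passive coordinate and uses the simple binning discretization described just below (grouping samples by their passive coordinate); the per-bin contribution ½m² + ½(σ² − 1) − log σ is the reconstructed form given above, and the toy data and names are illustrative.

```python
# Sketch: estimate the loss improvement Delta of Theorem 1 from samples by
# binning on a one-dimensional passive coordinate p (suitable for low D_P).
import numpy as np

def delta_from_samples(p, a, n_bins=32):
    """p: (N,) passive coordinate, a: (N,) or (N, D_A) active coordinates."""
    a = np.atleast_2d(a.T).T                          # ensure shape (N, D_A)
    edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, n_bins - 1)
    delta, total = 0.0, len(p)
    for b in range(n_bins):
        a_b = a[idx == b]
        if len(a_b) < 2:
            continue
        m = a_b.mean(axis=0)                          # conditional mean per dim
        s = a_b.std(axis=0) + 1e-12                   # conditional std per dim
        per_dim = 0.5 * m**2 + 0.5 * (s**2 - 1.0) - np.log(s)
        delta += (len(a_b) / total) * per_dim.sum()   # weight by bin mass ~ mu(p)
    return delta

# toy check: for a | p ~ N(p, 0.25) with p ~ U(-2, 2), Delta should be positive
rng = np.random.default_rng(0)
p = rng.uniform(-2, 2, 50_000)
a = p + 0.5 * rng.standard_normal(50_000)
print(delta_from_samples(p, a))
```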
This makes it natural to consider the loss improvement as a function Δ(Q) of the rotation and to choose the rotation that maximizes it:

Q* = argmax_{Q ∈ SO(D)} Δ(Q).

We propose to approximate this maximization by evaluating the loss improvement for a finite set of candidate rotations in Algorithm 1, "Optimal Affine Subspace (OAS)". Note that Step 5 requires approximating Δ from samples. In the regime of low D_P, one can discretize this by binning samples by their passive coordinate p and computing mean and variance empirically for each bin. We leave the general solution of Eq. (23) for future work.

Example 4. We choose δ = 0.95 and σ = √(1 − δ²) = 0.3122... so that the mean is zero and the standard deviation along the first axis is one. We now evaluate the loss improvement Δ(θ) in Eq. (20) as a function of the angle θ with which we rotate the above distribution. Analytically, this can be done pointwise for a given p and then integrated numerically. This will not be possible for applications where only samples are available; as a proof of concept, we employ the previously mentioned binning approach. It groups N samples from μ by their p value into B bins, and we compute the conditional moments m_{a|p_b} and σ_{a|p_b} from the samples in each bin b = 1, . . . , B. Figure 4 shows the upper bound as a function of the rotation angle, as obtained from the two approaches. Here, we used B = 32 bins and a maximum of N = 2^13 = 8192 samples. Around N ≈ 256 samples are sufficient for a good agreement between the analytic and the empirical bound on the loss improvement and the corresponding angle at the maximum. Note that for obtaining a good density estimate with a single coupling, it is crucial to identify the right rotation: if we naively or by chance decide for θ = 90°, the distribution is left unchanged.

An important step towards pushing a multivariate distribution to a normal distribution is making the dimensions independent of one another. Then the residual to a global latent normal distribution can be solved with one sufficiently expressive 1D flow per dimension, pushing each distribution independently to a normal distribution. The corresponding result (Lemma 2) identifies the data sets for which a single affine layer can make the active and passive dimensions independent, and it shows what our theory can explain about deep affine flows: it is easy to see that D − 1 coupling blocks with D_A = 1, D_P = D − 1 can make all variables independent if the data set can be written in the form x_i = f(x_{≠i}) + ξ_i g(x_{≠i}) for noise variables ξ_i independent of the remaining dimensions. Then only the aforementioned independent 1D flows are necessary for a push-forward to the normal distribution.

Example 5. Consider again the W-shaped densities from Examples 1 and 2. After optimizing the single affine layer, the two variables p, a are independent (compare Fig. 2c, d).

Do the above single-layer results explain the expressivity of deep affine flows? To answer this question, we construct a deep flow layer by layer using the Optimal Affine Subspace algorithm (Algorithm 1). Each layer l added to the flow is trained to minimize the residuum between the current push-forward μ_{l−1} and the latent N. The corresponding rotation Q_l is chosen by maximizing Δ(Q_l), and the nonlinearity τ_l is trained by gradient descent, see Algorithm 2:

1: Initialize T^(0) = id.
2: repeat
3:   Compute Q_l via OAS (Algorithm 1), using samples from (T^(l−1))_♯ μ.
4:   Train τ_l on samples y = Q_l · T^(l−1)(x) for x ∼ μ.
5:   Set T_l = τ_l ∘ Q_l.
6:   Compose T^(l) = T_l ∘ T^(l−1).
7: until convergence, e.g. a loss or improvement threshold, or a maximum number of layers.
8: return the final transport T^(L).
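The following sketch combines Algorithm 1 (OAS over a grid of candidate 2D rotation angles) with the greedy layer-wise loop of Algorithm 2. For brevity it replaces the gradient-descent training of τ_l with the closed-form bin-wise whitening of Lemma 1, which is a simplification of the paper's procedure; all function names and hyperparameters are illustrative.

```python
# Sketch of Algorithms 1 and 2 for D = 2: pick the rotation angle with the
# largest estimated loss improvement (OAS), then apply a closed-form optimal
# coupling per bin instead of training s, t by gradient descent.
import numpy as np

def conditional_moments(p, a, edges):
    """Bin-wise mean/std of a given p; empty bins default to m=0, s=1."""
    n_bins = len(edges) - 1
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, n_bins - 1)
    m, s = np.zeros(n_bins), np.ones(n_bins)
    for b in range(n_bins):
        a_b = a[idx == b]
        if len(a_b) > 1:
            m[b], s[b] = a_b.mean(), a_b.std() + 1e-12
    return idx, m, s

def estimate_delta(x, theta, n_bins=32):
    """Loss improvement for rotating 2D data by theta; column 0 is passive."""
    c, s_ = np.cos(theta), np.sin(theta)
    Q = np.array([[c, -s_], [s_, c]])
    y = x @ Q.T
    p, a = y[:, 0], y[:, 1]
    edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))
    idx, m, s = conditional_moments(p, a, edges)
    w = np.bincount(idx, minlength=n_bins) / len(p)   # bin mass ~ mu(p)
    return float(np.sum(w * (0.5 * m**2 + 0.5 * (s**2 - 1.0) - np.log(s)))), Q

def apply_optimal_coupling(x, Q, n_bins=32):
    """Rotate by Q and whiten a | p bin-wise (closed-form Lemma 1 surrogate)."""
    y = x @ Q.T
    p, a = y[:, 0], y[:, 1]
    edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))
    idx, m, s = conditional_moments(p, a, edges)
    return np.column_stack([p, (a - m[idx]) / s[idx]])

def layerwise_flow(x, n_layers=6, n_angles=64):
    """Greedy layer-wise construction (Algorithm 2 skeleton)."""
    for _ in range(n_layers):
        # Algorithm 1 (OAS): evaluate Delta on a grid of candidate rotations
        candidates = [estimate_delta(x, th) for th in np.linspace(0, np.pi, n_angles)]
        _, Q = max(candidates, key=lambda c: c[0])
        x = apply_optimal_coupling(x, Q)              # stand-in for training tau_l
    return x

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(20_000, 2))              # toy data as in Example 6
z = layerwise_flow(x)
print("final mean:", z.mean(axis=0), "\ncov:\n", np.cov(z.T))
```

On such toy data the printed mean and covariance should move towards 0 and the identity as layers are added, mirroring the layer-wise behaviour studied next.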
Can this ansatz reach the quality of end-to-end affine flows? An analytic answer is beyond the scope of this work, so we consider toy examples.

Example 6. We consider a uniform 2D distribution μ = U([−1, 1]²). Figure 5 compares the flow learnt layer-wise using Algorithm 2 to flows learnt layer-wise and end-to-end, but with fixed random rotations. Our proposed layer-wise algorithm performs on par with end-to-end training despite optimizing only the respective last layer in each iteration, and it beats layer-wise training with random subspaces.

Example 7. We now provide more examples on a set of toy distributions. As before, we train layer-wise using OAS and randomly selected rotations, and end-to-end. Additionally, we train a mixed variant of OAS and end-to-end: new layers are still added one by one, but Algorithm 2 is modified such that iteration l optimizes all layers 1 through l in an end-to-end fashion. We call this training "progressive", as layers are progressively activated and never turned off again. We obtain the following results: optimal rotations always outperform random rotations in layer-wise training. With only a few layers, they also outperform end-to-end training, but are eventually overtaken as the network depth increases. Progressive training remains competitive even for deep networks.

Fig. 5. Affine flow trained layer-wise "LW", using optimal affine subspaces "OAS" (top) and random subspaces "RND" (middle). After a lucky start, the random subspaces do not yield a good split, and the flow approaches the latent normal distribution significantly more slowly. End-to-end training "E2E" (bottom) chooses a substantially different mapping, yielding a similar quality to layer-wise training with optimal subspaces.

Fig. 6. The following rows depict different training methods: layer-wise "LW" (rows 2 and 3), progressive "PROG" (rows 4 and 5) and end-to-end "E2E" (last row). Rotations are "OAS" when determined by Algorithm 1 (rows 2 and 4) or randomly selected "RND" (rows 3, 5 and 6).

Figure 6 shows the density estimates after twelve layers. At this point, none of the methods shows a significant improvement from adding further layers. Hyperparameters were optimized for each training configuration to obtain a fair comparison. Densities obtained by layer-wise training exhibit significant spurious structure for both optimal and random rotations, with an advantage for optimally chosen subspaces.

In this work, we showed that an isolated affine coupling learns the first two moments of the conditional data distribution μ(a|p). Using this result, we derived a tight upper bound on the loss reduction that can be achieved by such a layer, and we used it to choose the best rotation for the coupling. We regard our results as a first step towards a better understanding of deep affine flows. We provided sufficient conditions under which a data set can be solved exactly by layer-wise trained affine couplings followed by a single layer of D independent 1D flows. Our results can be viewed in analogy to the classification layer at the end of a multi-layer classification network: the results from Sect. 3 apply directly to the last coupling in a deep normalizing flow. This raises a key question for future work: how do the first L − 1 layers prepare the distribution μ_{L−1} such that the final layer can perfectly push the data to a Gaussian?
[1] Analyzing inverse problems with invertible neural networks
[2] Greedy inference with layers of lazy maps
[3] NICE: non-linear independent components estimation
[4] Density estimation using Real NVP
[5] A method for simulating non-normal distributions
[6] Integer discrete flows and lossless compression
[7] Excessive invariance causes adversarial vulnerability
[8] Sum-of-squares polynomial flow
[9] Glow: generative flow with invertible 1x1 convolutions
[10] Sampling via measure transport: an introduction
[11] Large-scale optimal transport map estimation using projection pursuit
[12] Detecting out-of-distribution inputs to deep generative models using a test for typicality
[13] Boltzmann generators: sampling equilibrium states of many-body systems with deep learning
[14] Normalizing flows for probabilistic modeling and inference
[15] Invert to learn to invert
[16] Conditional expectation estimation through attributable components
[17] Data-driven optimal transport

Acknowledgements. Furthermore, we thank our colleagues Lynton Ardizzone, Jakob Kruse, Jens Müller, and Peter Sorrenson for their help, support and fruitful discussions.