Preconditioning Jacobian Systems by Superimposing Diagonal Blocks
M. Ali Rostami; H. Martin Bücker
Computational Science - ICCS 2020, 2020-06-15. DOI: 10.1007/978-3-030-50417-5_8

Preconditioning constitutes an important building block for the solution of large sparse systems of linear equations. If the coefficient matrix is the Jacobian of some mathematical function given in the form of a computer program, automatic differentiation enables the efficient and accurate evaluation of Jacobian-vector products and transposed Jacobian-vector products in a matrix-free fashion. Standard preconditioning techniques, however, typically require access to individual nonzero elements of the coefficient matrix. These operations are computationally expensive in a matrix-free approach where the coefficient matrix is not explicitly assembled. We propose a novel preconditioning technique that is designed to be used in combination with automatic differentiation. A key element of this technique is the formulation and solution of a graph coloring problem that encodes the rules of partial Jacobian computation, which determines only a proper subset of the nonzero elements of the Jacobian matrix. The feasibility of this semi-matrix-free approach is demonstrated on a set of numerical experiments using the automatic differentiation tool ADiMat.

Large sparse systems of linear equations are critical to computational methods in science, technology, and society. A key characteristic of iterative methods for the solution of such systems is that they can be implemented in a matrix-free fashion [12]. That is, given an N-dimensional right-hand side vector b and an N × N nonsingular coefficient matrix J, these methods aim to solve systems of the form

    J y = b,                                                        (1)

by making use of J solely in the form of matrix-vector products, Jz, or transposed matrix-vector products, J^T z, where the symbol z denotes some given N-dimensional vector. Therefore, there is no need to assemble the coefficient matrix in some sparse data storage format. We consider a rather typical situation in computational science where the coefficient matrix J is the Jacobian of some mathematical function given in the form of a computer program. Jacobian-vector products as well as transposed Jacobian-vector products can be efficiently and accurately computed by automatic differentiation (AD) [6, 11] without explicitly setting up the Jacobian matrix. Thus, the major computational kernels of iterative methods match the functionality that is provided by AD. In practice, iterative methods involve preconditioning techniques [1, 12] that transform (1) into an equivalent system of the form

    M^{-1} J y = M^{-1} b,                                          (2)

whose solution y is the same as the solution of (1). Here, the N × N nonsingular matrix M is the preconditioner that is to be constructed such that M is somehow close to J, i.e., M ≈ J. Preconditioning techniques typically need access to individual nonzero elements of the coefficient matrix [1, 12]. However, in a matrix-free approach, such accesses to individual Jacobian entries are computationally expensive, not only in automatic differentiation but also in numerical differentiation.
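To make the matrix-free setting concrete, the following MATLAB sketch solves (1) with GMRES using only Jacobian-vector products. It is a minimal illustration rather than part of the method described below; the helper name forward_mode_jacobian_vector_product is a hypothetical placeholder for an AD-generated routine (ADiMat provides such functionality), and F, x0, and b are assumed to be given.

% Minimal sketch of a matrix-free solve of (1): GMRES only needs a routine
% that returns J*z for a given vector z; the Jacobian J is never assembled.
jvp = @(z) forward_mode_jacobian_vector_product(F, x0, z);  % hypothetical AD-based J*z
restart = 20; tol = 1e-13; maxit = 200;
[y, flag, relres] = gmres(jvp, b, restart, tol, maxit);     % MATLAB's gmres accepts a function handle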
To bridge the gap between preconditioned iterative methods and AD, we propose a novel approach that is based on superimposing two diagonal block schemes. The first scheme consists of nonoverlapping diagonal blocks of size r that represent a sparsification operation. These blocks are used to define the required nonzero elements [3] of a partial Jacobian computation [9]. The required nonzeros are then determined by AD, employing the solution of a suitably defined graph coloring problem [5] that colors a subset of the vertices and encodes the rules of partial Jacobian computation. The second scheme consists of nonoverlapping diagonal blocks of size d that define a simple preconditioner. A standard preconditioning approach is taken that applies an ILU decomposition separately to each diagonal block. Here, we deliberately choose d ≥ r, which makes it possible to incorporate a maximal number of nonrequired nonzero elements outside the r × r diagonal blocks of the Jacobian that are obtained as by-products of the partial Jacobian computation.

The structure of this article is as follows. In Sect. 2, the overall approach is sketched; it is built around a problem arising from scientific computing that involves the computation of a subset of the nonzero elements of the Jacobian matrix by AD. This partial Jacobian computation problem is then modeled by a suitably defined graph coloring problem in Sect. 3. In Sect. 4, implementation details of the approach are given. Numerical experiments are reported in Sect. 5, and concluding remarks are presented in Sect. 6.

The novel preconditioning approach is inspired by the semi-matrix-free preconditioning technique introduced in [3]. The approach in [3] for the solution of (2) is summarized as follows:

- Carry out Jacobian-vector products Jz or transposed Jacobian-vector products J^T z by applying AD with a seed matrix that is identical to the vector z.
- Choose a block size r and form the sparsified Jacobian ρ_r(J). Here, the sparsification ρ_r(J) consists of the nonzero elements of the r × r diagonal blocks of J. Assemble ρ_r(J) via AD and store it explicitly.
- Construct a preconditioner M from ρ_r(J) by performing an ILU(0) decomposition [12] on each block of ρ_r(J). That is, no fill-in elements are allowed during the decomposition.

The nonzero elements of J that are selected by the sparsification ρ_r(J) are called required nonzero elements. The symbols used for the block size r and the sparsification ρ_r(J) indicate that these quantities define the required elements. We also denote the nonzero pattern of the required elements by the set R. As usual for sparsity patterns, we use the binary matrix and the set of the positions of the nonzero elements interchangeably. That is, symbols like R denoting sparsity patterns are either matrices or sets, depending on the context.

The novel approach borrows the first and the second item of the previous list and replaces the third item by a different preconditioning scheme. The new idea is that AD does not only compute the required elements R, but also certain additional information at no extra computational cost. However, only part of this additional information is immediately useful for preconditioning. This useful information is called the by-product and is denoted by the set B. The overall approach is detailed in the remaining part of this section. Like the previous approach in [3], the new approach is based on computing only a proper subset of the nonzero elements of the Jacobian J, which is referred to as partial Jacobian computation [5, 7-10].
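To illustrate the sparsification operator ρ_r(·), the following MATLAB sketch extracts the r × r diagonal blocks of an explicitly stored sparse matrix; the function name rho_blocks is chosen for illustration only. In the method itself, the operator is applied to the sparsity pattern P and to the recovered matrix rather than to J, which is never assembled.

% Minimal sketch of the sparsification rho_r: keep only the entries inside the
% r-by-r diagonal blocks; the last block shrinks if the order N is not a
% multiple of r.
function Jr = rho_blocks(J, r)
  N = size(J, 1);
  mask = sparse(N, N);
  for k = 1:r:N
    idx = k:min(k+r-1, N);     % index range of the current diagonal block
    mask(idx, idx) = 1;        % mark all positions inside this block
  end
  Jr = J .* mask;              % zero out everything outside the diagonal blocks
end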
We summarize partial Jacobian computation by considering Fig. 1, taken from [3]. Suppose that we are interested in computing the nonzeros of J on all 2 × 2 diagonal blocks, but are not interested in the remaining nonzeros. In this example, all nonzeros on the diagonal blocks of size r = 2 are the required nonzeros, which are denoted by black disks in the sparsity pattern of the Jacobian depicted on the left of this figure. All remaining nonzeros of J are called nonrequired elements, represented by black circles. The relative computational cost associated with the forward mode of AD computing the matrix-matrix product J · S is given by the number of columns of the seed matrix S; see [6, 11]. We stress that AD does not assemble the matrix J, but computes the product J · S for a given S directly. The symbol cp(J) := J · S represents this so-called compressed Jacobian matrix.

The exploitation of sparsity has a long tradition in AD; see the survey [5]. The main idea behind sparsity exploitation is to form groups of columns of J. This grouping is indicated by colors in the middle of Fig. 1. If J is an N × N matrix, all (zero and nonzero) elements of J are computed by setting the seed matrix to the identity of order N. The relative computational cost of this approach is then the number of columns of the identity, given by N. However, by exploiting the grouping of columns, it is possible to find a seed matrix with fewer than N columns. In the middle of Fig. 1, there are three colors representing three groups of columns. Each group of columns in J corresponds to a single column in the compressed Jacobian matrix depicted on the right. More precisely, a column of cp(J) with a certain color c is the linear combination of those columns of J that belong to the group of columns with the color c. Equivalently, there is a binary seed matrix S whose number of columns corresponds to the number of colors such that all required nonzero elements R of J also appear in cp(J). In the semi-matrix-free approach [3], given the sparsity pattern of J and the set of required elements R, the problem of assembling the required nonzero elements with a minimal relative computational cost is stated as follows.

Problem 1. Let J be a sparse N × N Jacobian matrix with known sparsity pattern and let ρ_r(J) denote its sparsification using r × r blocks on the diagonal of J. Find a binary N × p seed matrix S with a minimal number of columns, p, such that all nonzero entries of ρ_r(J) also appear in the compressed matrix cp(J) := J · S.

The compressed Jacobian cp(J) contains by definition all required elements of J. However, as the example in Fig. 1 shows, it also contains additional nonzero elements. These additional nonzero elements decompose into two different classes. There are nonzero elements of cp(J) that are nonrequired elements of J. In the example, the three nonzeros at the positions (5, 1), (6, 1), and (6, 2) belong to this class. The other class of nonzero elements of cp(J) consists of those nonzeros that are linear combinations of nonzero entries of J. For instance, the nonzero at position (3, 3) in cp(J) is the sum of J(3, 5) and J(3, 6).
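The construction of the compressed Jacobian from a column grouping can be made concrete by a small MATLAB sketch: given a hypothetical vector color that assigns a group index to every column of J, the binary seed matrix S and the compressed Jacobian cp(J) = J · S are obtained as follows. In the semi-matrix-free setting, the product J · S is of course evaluated by AD rather than by an explicit multiplication.

% Minimal sketch: turn a column grouping into a binary seed matrix S and form
% the compressed Jacobian cp(J) = J*S. The 1-by-N vector color holds the group
% (color) index of every column, with values in 1..p.
p = max(color);                    % number of groups, i.e., number of colors
N = numel(color);
S = sparse(1:N, color, 1, N, p);   % S(j, color(j)) = 1, all other entries are 0
cpJ = J * S;                       % with AD, this product is computed without assembling J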
The overall idea of the novel approach is to incorporate into the preconditioning not only the required elements of J, but also a certain subset of the nonzero elements of cp(J) that are nonrequired elements of J. To this end, another sparsification operator ρ_d(·) is introduced that extracts from cp(J) the nonzero elements of the d × d diagonal blocks of J that are not required. The set of by-products B is then defined as those nonzero elements of the compressed Jacobian cp(J) that are nonzeros within these d × d blocks of J and that are not contained in the set of required elements R. In other words, the by-products B are obtained from the compressed Jacobian cp(J) by removing all entries that are linear combinations of nonzeros of J and by additionally removing all (required and nonrequired) nonzeros of J that are outside the d × d diagonal blocks. The preconditioner M that approximates J is then constructed by assembling the nonzeros R ∪ B in a matrix denoted as rc(J) and using an ILU decomposition on the d × d diagonal blocks. The symbols used for the block size d and the sparsification operator ρ_d(·) indicate that these quantities are used to carry out a decomposition on each block. We remark that the sparsification operators ρ_r(·) and ρ_d(·), which extract the diagonal blocks, reduce the size of the bottom right block accordingly if the order of the matrix is not a multiple of the block size. For instance, returning to the example in Fig. 1 with r = 2 and assuming that d = 5, the operator ρ_d(·) leads to a top left 5 × 5 block and a bottom right 1 × 1 block. The set of by-products B then consists of the single nonzero entry J(5, 3), which is stored in cp(J) at position (5, 1).

In summary, a high-level description of the new preconditioning approach that uses two diagonal block schemes of size r and of size d is given as follows:

- Carry out Jacobian-vector products Jz or transposed Jacobian-vector products J^T z using AD.
- Choose a block size r, solve Problem 1, and compute cp(J) using AD.
- Choose a block size d and assemble the required elements R as well as the by-products B from cp(J) using the sparsification operator ρ_d(·). Store R ∪ B explicitly in a matrix rc(J).
- Construct a preconditioner M from R ∪ B by performing an ILU decomposition on each diagonal d × d block of rc(J).

The only other work that is related to our approach is the preconditioning technique introduced in [4], which is also based on partial matrix computation, but differs in that it formulates balancing problems.

The purpose of the following section is to reformulate the combinatorial problem from scientific computing given by Problem 1 in terms of an equivalent graph coloring problem. Recall from the previous section that the exploitation of sparsity is a well-studied topic in derivative computations [5]. Interpreting these scientific computing problems in the language of graph theory not only gives us better insight into the abstract problem structure, but also offers an intimate connection to the rich history of research in graph theory that can lead to efficient algorithms for the solution of the resulting problems. In this section, we consider the graph problem corresponding to the scientific computing problem that was introduced in the previous section. In the spirit of [3], we define a combinatorial model that handles the decomposition of the nonzero elements of J into two sets called required and nonrequired elements. The following new definition introduces the concept of structurally ρ_r-orthogonal columns: two columns of J are structurally ρ_r-orthogonal if there is no row that contains a required nonzero in one of the two columns and any nonzero in the other. Next, we define the ρ_r-column intersection graph G_ρr, whose vertices represent the columns of J and which will be used to reformulate Problem 1 arising from scientific computing. That is, the edge set E_ρr is constructed in such a way that columns represented by two vertices v_i and v_j need to be assigned to different column groups if and only if they are not structurally ρ_r-orthogonal. Using this graph model, Problem 1 from scientific computing is transformed into the following equivalent graph theoretical problem.

Problem 2 (Minimum Block Coloring). Find a coloring of the ρ_r-column intersection graph G_ρr with a minimal number of colors.
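Minimum graph coloring is NP-hard in general, so heuristics are used in practice. The following MATLAB sketch shows a simple greedy heuristic for Problem 2; it is only an illustration and is not claimed to reproduce the partial coloring algorithm of [3]. P and R are assumed to be sparse 0/1 matrices holding the Jacobian pattern and the pattern of required elements, respectively.

% Greedy heuristic sketch for Problem 2. Two columns j and k conflict if some
% row contains a required nonzero in one of them and any nonzero in the other;
% conflicting columns must receive different colors.
N = size(P, 2);
conflict = (R' * P + P' * R) > 0;     % conflict(j,k) is true iff columns j and k conflict
conflict(1:N+1:end) = 0;              % a column never conflicts with itself
color = zeros(1, N);
for j = 1:N
  used = color(conflict(j, :) & color > 0);   % colors of already colored conflicting columns
  c = 1;
  while any(used == c)                        % pick the smallest color not yet taken
    c = c + 1;
  end
  color(j) = c;
end
p = max(color);                               % number of colors, i.e., columns of the seed matrix

From color, the seed matrix S is then obtained exactly as in the previous sketch.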
The solution of this graph coloring problem corresponds to a seed matrix S which is then used to compute the compressed Jacobian cp(J) = J · S using AD. Recall from the previous section that the required elements of J are contained in cp(J). However, we already pointed out that some additional useful information B is also contained in cp(J). In the following section, we discuss how to recover these by-products B from cp(J) and how to use them for preconditioning.

Given the sparsity pattern P of the Jacobian matrix J, the following pseudocode summarizes the new preconditioning approach:

1: R = ρ_r(P)
2: S = partial_coloring(P, R)
3: Compute cp(J) = J · S by AD
4: rc(J) = ρ_d(partial_recover(P, S, cp(J), R))
5: Construct M as the ILU decomposition of rc(J)
6: Solve the preconditioned linear system (2)

In this pseudocode, we first compute the required elements R using the sparsification operator ρ_r(·). The required elements R are taken as an input to solve Problem 2 using a partial graph coloring algorithm [3]. The solution of this graph coloring problem corresponds to a seed matrix S that is used by the AD tool ADiMat [2, 14] to compute the compressed Jacobian cp(J). Then, we need a function partial_recover() to recover the nonzero elements R ∪ B of J from the compressed Jacobian cp(J). The preconditioner M is constructed by a blockwise ILU decomposition of rc(J), and the preconditioned system is solved by Jacobian-vector products using ADiMat.

To introduce the function partial_recover(), it is convenient to consider the standard approach of recovering the nonzeros of a Jacobian from its compressed version [6]. The standard approach assumes that all nonzeros of a sparse Jacobian are to be determined, whereas in a partial Jacobian approach we are interested in only a subset of the nonzeros. In the standard approach, a MATLAB-like procedure recover() reconstructs the Jacobian matrix J from the pattern P of the sparse Jacobian, the seed matrix S, and the compressed Jacobian cp(J) = J · S. It reconstructs every row i of J step by step. In each step, it first computes the indices I of the nonzeros of the row i of J. Then, it considers a reduced seed matrix S_I = S(I, :). Here, S_I is a matrix containing those rows of S that correspond to the nonzeros of J in the row i. Suppose that there is a nonzero element of J in position (i, k). We then need the column index of the entry 1 in the row k of the reduced seed matrix. With this column index, the corresponding nonzero is extracted from cp(J). Because of MATLAB's implementation of find(), the row indices returned by find() and collected in a vector row have to be sorted in increasing order, for instance via [rs, perm] = sort(row).
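The following MATLAB sketch illustrates this standard recovery. It is a simplified illustration rather than a verbatim reproduction of recover(); it assumes a full Jacobian computation, i.e., a seed matrix in which every column of J belongs to exactly one group and no two columns of the same group share a nonzero row.

% Sketch of the standard recovery of a sparse Jacobian J from its compressed
% form cpJ = J*S, given the sparsity pattern P and the binary seed matrix S.
function J = recover_sketch(P, S, cpJ)
  N = size(P, 1);
  J = sparse(N, N);
  for i = 1:N
    I = find(P(i, :));         % column indices of the nonzeros in row i of J
    for k = I
      c = find(S(k, :));       % the (unique) color of column k
      J(i, k) = cpJ(i, c);     % the nonzero appears unmodified in the compressed matrix
    end
  end
end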
For partial Jacobian computation, where only a subset of the nonzeros is determined, we need to extend the previous procedure so that the Jacobian matrix is recovered using the seed matrix computed by the partial coloring. The corresponding new procedure partial_recover() computes the Jacobian J from its compressed version cp(J) in a partial Jacobian computation. Compared to the previous procedure recover(), this procedure needs the pattern of the required elements R as an additional input. The first steps, up to step 9, of this procedure are similar to the previous procedure recover(). The new procedure, however, needs to take into account the nonrequired elements. So, it looks for the columns of S_I which have more than one nonzero (in steps 10 and 11), since these are the columns in which the addition of two nonrequired elements can happen. Then, it goes through all of those columns, if any, and checks whether any nonrequired elements have been added:

12: if ~isempty(positions) then
13:   for p = 1 : length(positions) do
14:     r_i = find(S(:, positions(p)))
15:     if sum(NR(i, r_i)) ~= 1 then
16:       J(i, r_i) = 0

If such an addition happens in a column r_i, we put a zero in the corresponding entry J(i, r_i) in the recovered Jacobian. The variable NR in this algorithm contains the positions of the nonrequired elements in P. After recovering the Jacobian matrix J via the procedure partial_recover(), we need to make sure that only those elements remain that are inside the diagonal blocks of size d. That is, we need to compute the by-products B using the sparsification operator ρ_d(·), which, in the current implementation, is carried out outside of the procedure partial_recover(); see also the sparsification operator ρ_d(·) in the algorithm sketched at the beginning of this section.

Here, we employ the semi-matrix-free approach for the solution of a system of linear equations of the form (2). This system arises in the solution of an optimal boundary control problem for radiative transfer. Throughout the following experiments, the resulting coefficient matrix has the order N = 1,944 and contains 49,856 nonzero elements. Its nonzero pattern is depicted on the left of Fig. 2. In the middle of this figure, the pattern of the sparsification ρ_r(J) is depicted for a block size of r = 100. The resulting 20 blocks are visible and are highlighted using a gray background. Notice that the last block is considerably smaller than the remaining blocks. To illustrate the preconditioning approach, the pattern of rc(J) is also plotted for a block size of d = 500 on the right of this figure. The present experiments are carried out in MATLAB R2019b. All derivative computations are carried out by ADiMat. The right-hand side b of the linear system is chosen as the sum of all columns of J such that the exact solution y to (2) is given by the vector containing ones in all positions. The linear system is solved using the Generalized Minimal RESidual method (GMRES) [13] with a restart parameter of 20. We always take y_0 = 0 as the initial guess. For the unpreconditioned system, the iteration is stopped in the nth step if ||b − Jy_n||_2 / ||b||_2 ≤ ε. For the preconditioned system, convergence is obtained if ||M^{-1}(b − Jy_n)||_2 / ||M^{-1}b||_2 ≤ ε. The tolerance ε = 10^{-13} is chosen for both cases. All tests are carried out on an Intel Core i7-8550U CPU with a clock rate of 1.80 GHz and 16 GB RAM.
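The following MATLAB sketch indicates how the blockwise ILU(0) preconditioner and the preconditioned GMRES solve could be assembled; it is an illustration under the assumptions that rc(J) is available as a sparse matrix rcJ, that every d × d diagonal block admits an ILU(0) factorization, and that jvp is a matrix-free Jacobian-vector handle as in the first sketch. It is not claimed to be the implementation used in the experiments.

% Blockwise ILU(0) of rcJ, the stored matrix holding R and B, and its use as a
% block-diagonal preconditioner M = L*U in GMRES.
N = size(rcJ, 1); d = 500;
Lcell = {}; Ucell = {};
setup.type = 'nofill';                        % ILU(0): no fill-in is allowed
for k = 1:d:N
  idx = k:min(k+d-1, N);                      % current d-by-d diagonal block
  [Lk, Uk] = ilu(rcJ(idx, idx), setup);       % incomplete factorization of this block
  Lcell{end+1} = Lk; Ucell{end+1} = Uk;
end
L = blkdiag(Lcell{:}); U = blkdiag(Ucell{:}); % block-diagonal factors of the preconditioner
[y, flag, relres] = gmres(jvp, b, 20, 1e-13, 200, L, U);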
In Fig. 3, the convergence behavior of GMRES is plotted versus the number of matrix-vector products. The convergence is monitored by the residual vector of the nth iteration, defined by r_n = b − Jy_n. More precisely, we show the norm of the residual scaled by the initial residual norm ||r_0||_2. We do not report the convergence versus the number of iterations for two reasons. Firstly, the number of matrix-vector products is known to be a better indication of the computing time than the number of iterations [12]; secondly, the number of matrix-vector products directly corresponds to the number of colors and thus makes it easy to relate the convergence to the cost of computing cp(J), which is needed once to set up the preconditioner. This aspect is crucial in applications such as Newton-like methods for nonlinear systems, where a sequence of linear systems with the same Jacobian sparsity pattern arises and the cost of solving a single coloring problem is amortized over solving multiple linear systems. On the other hand, the number of matrix-vector products is only an approximation of the computing time, in particular for GMRES without restarts, where the number of operations carried out in an iteration increases linearly with the iteration number.

In the first set of experiments, where the block size for the sparsification operator ρ_d(·) is fixed to d = 500, the computing time needed for the preconditioned iteration to converge is always smaller than for the unpreconditioned method if the time for the partial coloring and for computing cp(J) is neglected. Taking this time into account, so that the complete process of setting up the preconditioner is included, the preconditioned method is faster than the unpreconditioned method for all experiments where r > 10. The unpreconditioned method exhibits the slowest convergence, using the largest number of matrix-vector products to converge to the desired tolerance. This figure also contains six additional graphs, obtained by varying the block size (r = 4, r = 20, and r = 100) and by employing two different preconditioning approaches. The approach advocated in this article is based on the blockwise ILU(0) decomposition of rc(J), the matrix that contains the required elements as well as the by-products. This approach is denoted by R ∪ B. For the sake of comparison, we also investigate another approach that is identical to the previously mentioned approach with a single exception. Rather than using the information R ∪ B to construct the blockwise ILU(0) preconditioner, the diagonal blocks involve only the information R. That is, the by-products B are discarded from the preconditioning process. This latter approach is similar to our previous work reported in [3]. However, in [3], we do not use two different block schemes with different block sizes.

For the three block sizes r = 4, r = 20, and r = 100, the two preconditioning techniques based on R ∪ B and on R both converge faster than the unpreconditioned method. This statement is true for GMRES as well as for other Krylov solvers that we tested but whose results are omitted due to lack of space. It is also interesting that the convergence is improved by increasing the block size from r = 4 via r = 20 up to r = 100. Furthermore, keeping the block size r fixed, the convergence of the approach R ∪ B tends to be faster than that of the approach using only R. This observation is valid for the two block sizes r = 4 and r = 20. For large block sizes, however, it is unlikely that there will be a large set of by-products B. So, the differences in the convergence behavior between an approach using R ∪ B and an approach using R tend to be small.

To better understand the preconditioning approach, we now focus on the number of nonzero elements when increasing the block size r. Figure 4 illustrates the number of required elements, |R|, using black bars, as well as the number of by-products, |B|, using dark gray bars. The vertical axis (ordinate) is scaled to the number of nonzeros in J, given by 49,856. That is, the light gray bars denote the number of nonzero elements of J that are not taken into account when the preconditioner is constructed.
For a block size of d = 500, this diagram shows that the number of required elements increases only mildly when increasing the block size r up to moderate values. However, when increasing r significantly, there is also a corresponding increase in the number of required elements; see the block sizes at the right of this figure. The sum |R| + |B| is rather constant when increasing the block size r. So, the approach tends to be relevant in particular for small block sizes r, where the number of by-products is comparatively large. In this situation, the information available in B is particularly attractive since it comes from the partial Jacobian computation without any extra computational cost.

Next, we consider the number of colors needed for the solution of the partial graph coloring problem that is formally specified by Problem 2. This number of colors is depicted in Fig. 5. Here, the block size r is varied over the same range as in Fig. 4. Since the number of colors is an estimate of the relative computational cost of computing the compressed Jacobian cp(J) = J · S using AD, a slight increase in the number of colors can be harmful. This figure illustrates that the number of colors increases with the block size. Once more, this is an indication that the preconditioning approach is particularly relevant for small block sizes. Also, for small block sizes the storage requirement tends to be lower than for larger block sizes, which is consistent with the overall setting, in which storing the Jacobian explicitly in a sparse data structure is assumed to exceed the available storage capacity.

Finally, we analyze the number of nonzeros and the number of colors not only for a varying block size r, but also when varying the block size d. In Fig. 6, the results are depicted for the three block sizes d = 300, d = 400, and d = 500. For each value of d, this set of experiments involves those block sizes r that are divisors of d. The legend contains the union of all divisors of the three block sizes d. In the layout of this figure, the number of nonzeros is indicated by a bar, associated with the left ordinate. The number of colors is denoted by a disk whose value is specified on the right ordinate. As in Fig. 4, black bars here denote the number of required nonzeros. In contrast, the by-products are now given by bars whose colors correspond to the values of r. The results show a similar behavior for all three values of d. The number of required nonzeros increases with increasing r. Compared to the number of required nonzeros, the number of by-products is of a similar magnitude, except when r approaches d. (By construction, there cannot be any by-products for r = d.) The number of by-products tends to be reasonably large, even for small values of r. At the same time, the number of colors is small for small block sizes r. In other words, small block sizes r are attractive because (i) they deliver additional information (represented by nonzero elements) that is useful for preconditioning without any extra computational cost and, at the same time, (ii) they lead to a low relative computational cost associated with AD (represented by colors).

While matrix-free iterative methods and (transposed) Jacobian-vector products computed by automatic differentiation match well, there is, today, still a gap between preconditioning and automatic differentiation.
The reason is that, in a matrix-free approach, accesses to individual nonzero entries of the Jacobian coefficient matrix, which are needed by standard preconditioning techniques, are computationally expensive. This statement holds not only for automatic differentiation but also for numerical differentiation. The major new contribution of this article is a semi-matrix-free preconditioning approach that uses two separate diagonal block schemes partitioning the coefficient matrix into smaller submatrices. In both schemes, the diagonal blocks do not overlap. The first scheme employs blocks that define the required nonzero elements of a partial Jacobian computation. This scheme is relevant for minimizing the relative computational cost of the partial Jacobian computation. The resulting minimization problem is equivalent to a partial graph coloring problem. The second scheme is based on blocks whose sizes are larger than those of the first scheme. The blocks of this second scheme define the positions from which by-products of the partial Jacobian computation are extracted. Together with the required nonzero elements, these by-products are used to construct a preconditioner that applies ILU decompositions to each of these blocks. Numerical experiments using the automatic differentiation tool ADiMat are reported, demonstrating the feasibility of the new preconditioning technique. There is room for further investigations aiming at bridging the gap between preconditioning and automatic differentiation. For instance, it is interesting to study more advanced preconditioning techniques and to analyze to what extent they are capable of exploiting the information available in the by-products of the partial Jacobian computation.

References
[1] Preconditioning techniques for large linear systems: a survey
[2] Combining source transformation and operator overloading techniques to compute derivatives for MATLAB programs
[3] Enabling implicit time integration for compressible flows by partial coloring: a case study of a semi-matrix-free preconditioning technique
[4] Matrix-free preconditioning using partial matrix estimation
[5] What color is your Jacobian? Graph coloring for computing derivatives
[6] Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. No. 105 in Other Titles in Applied Mathematics
[7] Partielle Berechnung von Jacobi-Matrizen mittels Graphfärbung
[8] Graphfärbung zur Berechnung benötigter Matrixelemente
[9] Full and partial Jacobian computation via graph coloring: Algorithms and applications. Dissertation
[10] Partial Jacobian computation in the domain-specific program transformation system ADiCape
[11] Automatic Differentiation: Techniques and Applications
[12] Iterative Methods for Sparse Linear Systems
[13] GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems
[14] A new user interface for ADiMat: toward accurate and efficient derivatives of Matlab programs with ease of use