key: cord-0047218-tpbfyo7a
authors: van der Hoeven, Joris; Monagan, Michael
title: Implementing the Tangent Graeffe Root Finding Method
date: 2020-06-06
journal: Mathematical Software - ICMS 2020
DOI: 10.1007/978-3-030-52200-1_48
sha: 635d0c304669f60f992d23a1037c41e903790562
doc_id: 47218
cord_uid: tpbfyo7a

The tangent Graeffe method has been developed for the efficient computation of single roots of polynomials over finite fields with multiplicative groups of smooth order. It is a key ingredient of sparse interpolation using geometric progressions, in the case when blackbox evaluations are comparatively cheap. In this paper, we improve the complexity of the method by a constant factor and we report on a new implementation of the method and a first parallel implementation.

Consider a polynomial function $f : K^n \to K$ over a field $K$ given through a black box capable of evaluating $f$ at points in $K^n$. The problem of sparse interpolation is to recover the representation of $f \in K[x_1, \ldots, x_n]$ in its usual form, as a linear combination

$$f = \sum_{1 \le i \le t} c_i x^{e_i} \qquad (1)$$

of monomials $x^{e_i} = x_1^{e_{i,1}} \cdots x_n^{e_{i,n}}$. One popular approach to sparse interpolation is to evaluate $f$ at points in a geometric progression. This approach goes back to work of Prony in the eighteenth century [15] and became well known after Ben-Or and Tiwari's seminal paper [2]. It has been widely used in computer algebra, both in theory and in practice; see [16] for a nice survey.

Note: This paper received funding from NSERC (Canada) and "Agence de l'innovation de défense" (France). Note: This document has been written using GNU TeXmacs [13].

More precisely, if a bound $T$ for the number of terms $t$ is known, then we first evaluate $f$ at $2T - 1$ pairwise distinct points $\alpha^0, \alpha^1, \ldots, \alpha^{2T-2}$, where $\alpha = (\alpha_1, \ldots, \alpha_n) \in K^n$ and $\alpha^k := (\alpha_1^k, \ldots, \alpha_n^k)$ for all $k \in \mathbb{N}$. The generating function of the evaluations at $\alpha^k$ satisfies the identity

$$\sum_{k \in \mathbb{N}} f(\alpha^k) z^k = \sum_{1 \le i \le t} \frac{c_i}{1 - \alpha^{e_i} z} = \frac{N(z)}{\Lambda(z)},$$

where $\Lambda = (1 - \alpha^{e_1} z) \cdots (1 - \alpha^{e_t} z)$ and $N \in K[z]$ is of degree $< t$. The rational function $N/\Lambda$ can be recovered from $f(\alpha^0), f(\alpha^1), \ldots, f(\alpha^{2T-2})$ using fast Padé approximation [4]. For well chosen points $\alpha$, it is often possible to recover the exponents $e_i$ from the values $\alpha^{e_i} \in K$. If the exponents $e_i$ are known, then the coefficients $c_i$ can also be recovered using fast structured linear algebra [5]. This leaves us with the question of how to compute the roots $\alpha^{-e_i}$ of $\Lambda$ in an efficient way.

For practical applications in computer algebra, we usually have $K = \mathbb{Q}$, in which case it is most efficient to use a multi-modular strategy and reduce to coefficients in a finite field $K = \mathbb{F}_p$, where $p$ is a prime number that we are free to choose. It is well known that polynomial arithmetic over $\mathbb{F}_p$ can be implemented most efficiently using FFTs when the order $p - 1$ of the multiplicative group is smooth. In practice, this prompts us to choose $p$ of the form $s 2^l + 1$ for some small $s$ and such that $p$ fits into a machine word.
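Since the generating function of the evaluations equals $N/\Lambda$ with $\deg N < t$, the values $f(\alpha^k)$ satisfy a linear recurrence whose coefficients are those of $\Lambda$. The following minimal Python sketch checks this in the univariate case; the prime 97, the base point 5 and all function names are illustrative assumptions, not part of the paper's implementation.

```python
# Toy check over F_97 that evaluations of a sparse polynomial along a
# geometric progression obey the linear recurrence encoded by Lambda.
P_MOD = 97      # illustrative prime (assumption)
ALPHA = 5       # a primitive root modulo 97

def evaluate(terms, x, p=P_MOD):
    """Evaluate f = sum c * x^e for (c, e) in terms, modulo p."""
    return sum(c * pow(x, e, p) for c, e in terms) % p

def lambda_poly(terms, alpha=ALPHA, p=P_MOD):
    """Coefficients (lowest degree first) of prod (1 - alpha^e z)."""
    lam = [1]
    for _, e in terms:
        root = pow(alpha, e, p)
        # Multiply lam by (1 - root * z).
        lam = [(a - root * b) % p for a, b in zip(lam + [0], [0] + lam)]
    return lam

def recurrence_holds(terms, alpha=ALPHA, p=P_MOD):
    """True if sum_j Lambda_j f(alpha^(k-j)) = 0 for t <= k < 2t."""
    t = len(terms)
    lam = lambda_poly(terms, alpha, p)
    values = [evaluate(terms, pow(alpha, k, p), p) for k in range(2 * t)]
    return all(
        sum(lam[j] * values[k - j] for j in range(t + 1)) % p == 0
        for k in range(t, 2 * t)
    )

# f = 5 x^3 + 11 x^7 + 2 x^20, given as (coefficient, exponent) pairs.
print(recurrence_holds([(5, 3), (11, 7), (2, 20)]))
```

In an actual interpolation one would go the other way and solve for $\Lambda$ from the evaluations (for instance by Padé approximation, as in the text); the sketch only verifies the relation that makes this possible.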

The traditional way to compute roots of polynomials over finite fields is using Cantor and Zassenhaus' method [6] . In [10, 11] , alternative algorithms were proposed for our case of interest when p−1 is smooth. The fastest algorithm was based on the tangent Graeffe transform and it gains a factor log t with respect to Cantor-Zassenhaus' method. The aim of the present paper is to report on a parallel implementation of this new algorithm and on a few improvements that allow for a further constant speed-up.

In Sect. 2, we recall the Graeffe transform and the heuristic root finding method based on the tangent Graeffe transform from [10] . In Sect. 3, we present the main new theoretical improvements, which all rely on optimizations in the FFT-model for fast polynomial arithmetic. Our contributions are twofold. In the FFT-model, one backward transform out of four can be saved for Graeffe transforms of order two (see Sect. 3.2). When composing a large number of Graeffe transforms of order two, FFT caching can be used to gain another factor of 3/2 (see Sect. 3.3). In the longer preprint version of the paper [12] , we also show how to generalize our methods to Graeffe transforms of general orders and how to use it in combination with the truncated Fourier transform. Section 4 is devoted to our new sequential and parallel implementations of the algorithm in C and Cilk C. Our sequential implementation confirms the gain of a new factor of two when using the new optimizations. So far, we have achieved a parallel speed-up by a factor of 4.6 on an 8-core machine. Our implementation is freely available at http://www.cecm.sfu.ca/CAG/code/TangentGraeffe.

The traditional Graeffe transform of a monic polynomial $P \in K[z]$ of degree $d$ is the unique monic polynomial $G(P) \in K[z]$ of degree $d$ such that

$$G(P)(z^2) = P(z) P(-z). \qquad (2)$$

If $P$ splits over $K$ into linear factors $P = (z - \beta_1) \cdots (z - \beta_d)$, then one has

$$G(P) = (z - \beta_1^2) \cdots (z - \beta_d^2).$$

More generally, given $r \ge 2$, we define the Graeffe transform of order $r$ to be the unique monic polynomial $G_r(P) \in K[z]$ of degree $d$ such that $G_r(P)(z) = (-1)^{rd} \operatorname{Res}_u(P(u), u^r - z)$.

If $r, s \ge 2$, then we have

$$G_{rs} = G_r \circ G_s = G_s \circ G_r. \qquad (3)$$
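As a small illustration, the Graeffe transform of order two can be computed from the even-index coefficients of $P(z)P(-z)$. The Python sketch below works over $\mathbb{F}_{97}$ with schoolbook products; the helper names are hypothetical, and the factor $(-1)^d$ is an explicit normalization assumed here so that the result is monic also for odd $d$.

```python
def poly_mul(a, b, p):
    """Schoolbook product of coefficient lists (lowest degree first) mod p."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def from_roots(roots, p):
    """Monic polynomial prod (z - b) for b in roots."""
    out = [1]
    for b in roots:
        out = poly_mul(out, [(-b) % p, 1], p)
    return out

def graeffe(P, p):
    """Monic polynomial whose roots are the squares of the roots of P."""
    d = len(P) - 1
    P_neg = [c if i % 2 == 0 else (-c) % p for i, c in enumerate(P)]  # P(-z)
    prod = poly_mul(P, P_neg, p)          # only even powers of z survive
    sign = 1 if d % 2 == 0 else -1        # assumed normalization (-1)^d
    return [(sign * c) % p for c in prod[0::2]]

p = 97
P = from_roots([2, 3, 5], p)
print(graeffe(P, p) == from_roots([4, 9, 25], p))
# Composing two transforms of order two gives the transform of order four.
print(graeffe(graeffe(P, p), p) == from_roots([16, 81, 625 % p], p))
```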

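Let $\varepsilon$ be a formal indeterminate with $\varepsilon^2 = 0$. Let $P = (z - \alpha_1) \cdots (z - \alpha_d) \in K[z]$, where $\alpha_1, \ldots, \alpha_d \in K$ are pairwise distinct, and consider the tangent deformation $\tilde{P}(z) := P(z + \varepsilon) = P(z) + P'(z) \varepsilon$, whose roots are $\alpha_k - \varepsilon$.

The definitions from the previous subsection readily extend to coefficients in $K[\varepsilon]$ instead of $K$. Given $r \ge 2$, we call $G_r(\tilde{P})$ the tangent Graeffe transform of $P$ of order $r$. We have

$$G_r(\tilde{P}) = (z - (\alpha_1 - \varepsilon)^r) \cdots (z - (\alpha_d - \varepsilon)^r), \qquad (\alpha_k - \varepsilon)^r = \alpha_k^r - r \alpha_k^{r-1} \varepsilon.$$

Now assume that we have an efficient way to determine the roots $\alpha_1^r, \ldots, \alpha_d^r$ of $G_r(P)$. For some polynomial $T \in K[z]$, we may decompose $G_r(\tilde{P}) = G_r(P) + T \varepsilon$. For any root $\alpha_k^r$ of $G_r(P)$, we then have

$$G_r(\tilde{P})(\alpha_k^r - r \alpha_k^{r-1} \varepsilon) = G_r(P)(\alpha_k^r) + \left(T(\alpha_k^r) - G_r(P)'(\alpha_k^r) \, r \alpha_k^{r-1}\right) \varepsilon = 0.$$

Whenever $\alpha_k^r$ happens to be a single root of $G_r(P)$, it follows that

$$r \alpha_k^{r-1} = \frac{T(\alpha_k^r)}{G_r(P)'(\alpha_k^r)}.$$

If $\alpha_k^r \ne 0$, this finally allows us to recover $\alpha_k$ as

$$\alpha_k = r \, \frac{\alpha_k^r}{r \alpha_k^{r-1}}.$$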
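The recovery formula can be illustrated with explicit arithmetic on tangent numbers $a + b\varepsilon$, $\varepsilon^2 = 0$. In the Python sketch below (pairs `(a, b)` over $\mathbb{F}_{97}$; the names and the prime are assumptions chosen for illustration), the $r$-th power of $\alpha - \varepsilon$ yields both $\alpha^r$ and $-r\alpha^{r-1}$, from which $\alpha$ is recovered as $r\alpha^r / (r\alpha^{r-1})$.

```python
P_MOD = 97  # illustrative prime (assumption)

def tangent_mul(x, y, p=P_MOD):
    """Product of a + b*eps and c + d*eps with eps^2 = 0, modulo p."""
    return (x[0] * y[0] % p, (x[0] * y[1] + x[1] * y[0]) % p)

def tangent_pow(x, r, p=P_MOD):
    """r-th power of a tangent number by repeated multiplication."""
    out = (1, 0)
    for _ in range(r):
        out = tangent_mul(out, x, p)
    return out

def recover(alpha, r, p=P_MOD):
    """Recover alpha from (alpha - eps)^r = alpha^r - r*alpha^(r-1)*eps."""
    beta, slope = tangent_pow((alpha % p, p - 1), r, p)   # slope = -r*alpha^(r-1)
    return r * beta * pow((-slope) % p, -1, p) % p        # needs alpha != 0

print(all(recover(a, 8) == a for a in range(1, P_MOD)))
```

The sketch requires Python 3.8 or later for the modular inverse `pow(x, -1, p)`.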

Assume now that $K = \mathbb{F}_p$ is a finite field, where $p$ is a prime number of the form $p = \sigma 2^m + 1$ for some small $\sigma$. Assume also that $\omega \in \mathbb{F}_p$ is a primitive element of order $p - 1$ for the multiplicative group of $\mathbb{F}_p$.

Let $P = (z - \alpha_1) \cdots (z - \alpha_d) \in \mathbb{F}_p[z]$ be as in the previous subsection. The tangent Graeffe method can be used to efficiently compute those roots $\alpha_k$ of $P$ for which $\alpha_k^r$ is a single root of $G_r(P)$. In order to guarantee that there are a sufficient number of such roots, we first replace $P(z)$ by $P(z + \tau)$ for a random shift $\tau \in \mathbb{F}_p$, and rely on the heuristic assumption from [10].

For a random shift $\tau \in \mathbb{F}_p$ and any $r \le (p - 1)/(4d)$, the assumption ensures with probability at least $1/2$ that $G_r(P(z + \tau))$ has at least $d/3$ single roots. Now take $r$ to be the largest power of two such that $r \le (p - 1)/(4d)$ and let $s = (p - 1)/r$. By construction, note that $s = O(d)$. The roots $\alpha_1^r, \ldots, \alpha_d^r$ of $G_r(P)$ are all $s$-th roots of unity in the set $\{1, \omega^r, \ldots, \omega^{(s-1)r}\}$. We may thus determine them by evaluating $G_r(P)$ at $\omega^{ir}$ for $i = 0, \ldots, s - 1$. Since $s = O(d)$, this can be done efficiently using a discrete Fourier transform. Combined with the tangent Graeffe method from the previous subsection, this leads to the following probabilistic algorithm for root finding:
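The choice of $r$ and $s$ can be made concrete with a few lines of Python. This is only a sketch: the function name is hypothetical, and the extra divisibility test is an assumption added so that $s = (p-1)/r$ is always an integer.

```python
def choose_order(p, d):
    """Largest power of two r <= (p - 1) / (4 d) dividing p - 1, and s = (p - 1) / r."""
    r = 1
    while 2 * r <= (p - 1) // (4 * d) and (p - 1) % (2 * r) == 0:
        r *= 2
    return r, (p - 1) // r

# The 63-bit prime used in the experiments, with a polynomial of degree 2^20.
p = 3 * 29 * 2**56 + 1
r, s = choose_order(p, 2**20)
print(r == 2**40, s == 87 * 2**16, s / 2**20)
```

For this choice one gets $4d \le s < 8d$, which is the sense in which $s = O(d)$.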

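Algorithm 1
Input: $P \in \mathbb{F}_p[z]$ of degree $d$ and only order one factors, $p = \sigma 2^m + 1$
Output: the set $\{\alpha_1, \ldots, \alpha_d\}$ of roots of $P$
1. If $d = 0$ then return $\emptyset$
2. Let $r = 2^N \in 2^{\mathbb{N}}$ be largest such that $r \le (p - 1)/(4d)$ and let $s := (p - 1)/r$
3. Pick $\tau \in \mathbb{F}_p$ at random and compute $P^* := P(z + \tau)$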

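Writing $\tilde{P} = A + B \varepsilon$ with $A, B \in \mathbb{F}_p[z]$, each tangent Graeffe transform of order two in step 5 can be computed using $G(\tilde{P})(z^2) = A(z) A(-z) + (A(z) B(-z) + B(z) A(-z)) \varepsilon$, which requires three polynomial multiplications in $\mathbb{F}_p[z]$ of degree $d$. In total, step 5 thus performs $O(\log(p/s))$ such multiplications. We discuss how to perform step 5 efficiently in the FFT model in Sect. 3.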
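To make the overall method concrete, here is a minimal Python sketch of a tangent Graeffe root finder, assembled from the steps that the text refers to (random Taylor shift, repeated tangent Graeffe transforms of order two, evaluation at the powers of $\omega^r$, root recovery, deflation). It is not the authors' C code: it uses quadratic-time schoolbook arithmetic instead of FFTs, the step layout and all names are assumptions, and the sign $(-1)^d$ is applied explicitly to keep the transforms monic.

```python
import random

def mul(a, b, p):
    """Schoolbook product of coefficient lists (lowest degree first) mod p."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def ev(a, x, p):
    """Horner evaluation of the polynomial a at x, modulo p."""
    acc = 0
    for c in reversed(a):
        acc = (acc * x + c) % p
    return acc

def flip(a, p):
    """Coefficients of a(-z)."""
    return [c if i % 2 == 0 else (-c) % p for i, c in enumerate(a)]

def tangent_graeffe_step(A, B, p):
    """Order-two transform of A + B*eps using three polynomial products."""
    sign = -1 if (len(A) - 1) % 2 else 1   # (-1)^d keeps the A-part monic
    AA = mul(A, flip(A, p), p)
    AB = mul(A, flip(B, p), p)
    BA = mul(B, flip(A, p), p)
    new_A = [(sign * c) % p for c in AA[0::2]]
    new_B = [(sign * (x + y)) % p for x, y in zip(AB[0::2], BA[0::2])]
    return new_A, new_B

def deflate(P, a, p):
    """Quotient of P by (z - a), assuming P(a) = 0 (synthetic division)."""
    q, acc = [], 0
    for c in reversed(P[1:]):
        acc = (acc * a + c) % p
        q.append(acc)
    return q[::-1]

def find_roots(P, p, omega, rng=random):
    """Roots of a monic P with d distinct roots in F_p; omega generates F_p^*."""
    roots = set()
    while len(P) > 1:
        d = len(P) - 1
        r = 1                                   # largest suitable power of two
        while 2 * r <= (p - 1) // (4 * d) and (p - 1) % (2 * r) == 0:
            r *= 2
        s = (p - 1) // r
        tau = rng.randrange(p)
        shifted = [0]                           # P(z + tau) by Horner's rule
        for c in reversed(P):
            shifted = mul(shifted, [tau, 1], p)
            shifted[0] = (shifted[0] + c) % p
        A = shifted[:d + 1]
        B = [i * c % p for i, c in enumerate(A)][1:] + [0]   # derivative of A
        order = 1
        while order < r:                        # log2(r) tangent transforms
            A, B = tangent_graeffe_step(A, B, p)
            order *= 2
        dA = [i * c % p for i, c in enumerate(A)][1:]
        found = {tau} if ev(P, tau, p) == 0 else set()
        step, beta = pow(omega, r, p), 1
        for _ in range(s):                      # beta runs over powers of omega^r
            slope = ev(dA, beta, p)
            if ev(A, beta, p) == 0 and slope != 0:           # single root of A
                b_inv = pow(ev(B, beta, p), -1, p)
                found.add((r * beta * slope * b_inv + tau) % p)
            beta = beta * step % p
        for a in found:                         # divide out the roots found
            P = deflate(P, a, p)
        roots |= found
    return roots

p, omega = 97, 5                                # 97 = 3 * 2^5 + 1, 5 generates F_97^*
P = [1]
for a in (3, 10, 20, 55, 96):
    P = mul(P, [(-a) % p, 1], p)
print(sorted(find_roots(P, p, omega, random.Random(1))))
```

The outer loop plays the role of the recursive call on the deflated polynomial: each pass finds the roots whose $r$-th powers are single roots, divides them out, and repeats with a fresh shift.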

For practical implementations, one may vary the threshold $r \le (p - 1)/(4d)$ for $r$ and the resulting threshold $s \ge 4d$ for $s$. For larger values of $s$, the computations of the DFTs in step 6 get more expensive, but the proportion of single roots goes up, so more roots are determined at each iteration. From an asymptotic complexity perspective, it would be best to take $s \asymp d \sqrt{\log p}$. In practice, we actually preferred to take the lower threshold $s \ge 2d$, because the constant factor of our implementation of step 6 (based on Bluestein's algorithm [3]) is significant with respect to our highly optimized implementation of the tangent Graeffe method. A second reason we prefer this lower threshold is that the total space used by the algorithm is linear in $s$. In the future, it would be interesting to further speed up step 6 by investing more time in the implementation of high performance DFTs of general orders $s$.

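Assume $n \in \mathbb{N}$ is invertible in $K$ and let $\omega \in K$ be a primitive $n$-th root of unity. Consider a polynomial $A = a_0 + a_1 z + \cdots + a_{n-1} z^{n-1} \in K[z]$. Then the discrete Fourier transform (DFT) of order $n$ of the sequence $(a_i)_{0 \le i < n}$ is defined by

$$\mathrm{DFT}_\omega((a_i)_{0 \le i < n}) := (\hat{a}_k)_{0 \le k < n}, \qquad \hat{a}_k := A(\omega^k).$$

We will write $F_K(n)$ for the cost of one discrete Fourier transform in terms of the number of operations in $K$ and assume that $n = o(F_K(n))$. For any $i \in \{0, \ldots, n - 1\}$, we have

$$\mathrm{DFT}_{\omega^{-1}}((\hat{a}_k)_{0 \le k < n})_i = \sum_{0 \le k < n} \hat{a}_k \omega^{-ik} = \sum_{0 \le j < n} a_j \sum_{0 \le k < n} \omega^{(j - i)k} = n a_i.$$

If $n$ is invertible in $K$, then it follows that $\mathrm{DFT}_\omega^{-1} = n^{-1} \mathrm{DFT}_{\omega^{-1}}$. The costs of direct and inverse transforms therefore coincide up to $O(n)$ additional operations.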
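The definition and the inversion formula can be checked directly with a naive quadratic-time transform. The following Python sketch over $\mathbb{F}_{97}$ is only meant to fix notation; the function names and the choice of prime are assumptions.

```python
def dft(a, omega, p):
    """Naive DFT: entry k is A(omega^k) = sum_i a[i] * omega^(i*k) mod p."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * k, p) for i in range(n)) % p
            for k in range(n)]

def inverse_dft(a_hat, omega, p):
    """Inverse transform, computed as n^(-1) times the DFT at omega^(-1)."""
    n = len(a_hat)
    n_inv = pow(n, -1, p)
    return [n_inv * x % p for x in dft(a_hat, pow(omega, -1, p), p)]

p, n = 97, 8
omega = pow(5, (p - 1) // n, p)      # 5 generates F_97^*, so omega has order 8
a = [1, 2, 3, 4, 5, 6, 7, 8]
print(inverse_dft(dft(a, omega, p), omega, p) == a)
```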

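If $n = n_1 n_2$ is composite, $0 \le k_1 < n_1$, and $0 \le k_2 < n_2$, then it is well known [7] that

$$\hat{a}_{k_1 n_2 + k_2} = \sum_{0 \le i_1 < n_1} \omega^{i_1 k_2} \left[ \sum_{0 \le i_2 < n_2} a_{i_2 n_1 + i_1} (\omega^{n_1})^{i_2 k_2} \right] (\omega^{n_2})^{i_1 k_1}. \qquad (5)$$

This means that a DFT of length $n$ reduces to $n_1$ transforms of length $n_2$ plus $n_2$ transforms of length $n_1$ plus $n$ multiplications in $K$:

$$F_K(n_1 n_2) \le n_1 F_K(n_2) + n_2 F_K(n_1) + O(n).$$

In particular, if $r = O(1)$, then $F_K(rn) \sim r F_K(n)$. It is sometimes convenient to apply DFTs directly to polynomials as well; for this reason, we also define $\mathrm{DFT}_\omega(A) := (\hat{a}_k)_{0 \le k < n}$. Given two polynomials $A, B \in K[z]$ with $\deg(AB) < n$, we may then compute the product $AB$ using

$$AB = \mathrm{DFT}_\omega^{-1}(\mathrm{DFT}_\omega(A) \, \mathrm{DFT}_\omega(B)).$$

In particular, if $M_K(n)$ denotes the cost of multiplying two polynomials of degree $< n$, then we obtain $M_K(n) \sim 3 F_K(2n) \sim 6 F_K(n)$.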
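For $n_1 = 2$, the decomposition of a DFT into smaller DFTs gives the classical radix-2 FFT, and the product formula then costs three transforms. The Python sketch below illustrates both over $\mathbb{F}_{97}$; it is an assumption-laden toy (recursive, power-of-two lengths only, hypothetical names), not the optimized FFT of the paper.

```python
def fft(a, omega, p):
    """Recursive radix-2 FFT of a (length a power of two) at the root omega."""
    n = len(a)
    if n == 1:
        return a[:]
    omega2 = omega * omega % p
    even = fft(a[0::2], omega2, p)      # transform of even-index coefficients
    odd = fft(a[1::2], omega2, p)       # transform of odd-index coefficients
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % p              # twiddle factor omega^k
        out[k] = (even[k] + t) % p
        out[k + n // 2] = (even[k] - t) % p   # uses omega^(n/2) = -1
        w = w * omega % p
    return out

def fft_multiply(A, B, n, omega, p):
    """Product AB (padded to length n) via two forward and one inverse FFT."""
    fa = fft(A + [0] * (n - len(A)), omega, p)
    fb = fft(B + [0] * (n - len(B)), omega, p)
    fc = [x * y % p for x, y in zip(fa, fb)]
    n_inv = pow(n, -1, p)
    return [n_inv * c % p for c in fft(fc, pow(omega, -1, p), p)]

p, n = 97, 8
omega = pow(5, (p - 1) // n, p)          # primitive 8th root of unity in F_97
print(fft_multiply([1, 2, 3], [4, 5, 6, 7], n, omega, p))
```

The degrees must satisfy $\deg(AB) < n$; otherwise the result is the product modulo $z^n - 1$.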

In Algorithm 1, we note that step 6 comes down to the computation of three DFTs of length $s$. Since $r$ is a power of two, this length is of the form $s = \sigma 2^k$ for some $k \in \mathbb{N}$. In view of (5), we may therefore reduce step 6 to $3\sigma$ DFTs of length $2^k$ plus $3 \cdot 2^k$ DFTs of length $\sigma$. If $\sigma$ is very small, then we may use a naive implementation for DFTs of length $\sigma$. In general, one may use Bluestein's algorithm [3] to reduce the computation of a DFT of length $\sigma$ to the computation of a product in $K[z]/(z^\sigma - 1)$, which can in turn be computed using FFT-multiplication and three DFTs of length a larger power of two.
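The idea behind Bluestein's reduction can be sketched in Python as follows. This variant is an assumption made for illustration: it uses the identity $ik = \binom{i+k}{2} - \binom{i}{2} - \binom{k}{2}$, which avoids square roots of $\omega$ in a finite field, and turns the DFT into one ordinary (non-cyclic) polynomial product. The schoolbook product stands in for an FFT-based multiplication, and all names are hypothetical.

```python
def poly_mul(a, b, p):
    """Schoolbook product (stand-in for an FFT-based multiplication)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def naive_dft(a, omega, p):
    """Reference transform: entry k is sum_i a[i] * omega^(i*k) mod p."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * k, p) for i in range(n)) % p
            for k in range(n)]

def bluestein_dft(a, omega, p):
    """DFT of arbitrary length n, reduced to a single polynomial product."""
    n = len(a)
    tri = [k * (k - 1) // 2 for k in range(2 * n - 1)]        # binomial(k, 2)
    inv = pow(omega, -1, p)
    u = [a[i] * pow(inv, tri[i], p) % p for i in range(n)]    # a_i * omega^(-C(i,2))
    v = [pow(omega, t, p) for t in tri]                       # omega^(C(j,2))
    conv = poly_mul(u[::-1], v, p)      # conv[n-1+k] = sum_i u[i] * v[i+k]
    return [conv[n - 1 + k] * pow(inv, tri[k], p) % p for k in range(n)]

p = 97
for n in (3, 6):
    omega = pow(5, (p - 1) // n, p)     # element of order n in F_97
    a = list(range(1, n + 1))
    print(bluestein_dft(a, omega, p) == naive_dft(a, omega, p))
```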

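Let $K$ be a field with a primitive $(2n)$-th root of unity $\omega$. Let $P \in K[z]$ be a polynomial of degree $d = \deg P < n$. Then the relation (2) yields

$$G(P)(z^2) = \mathrm{DFT}_\omega^{-1}(\mathrm{DFT}_\omega(P(z)) \, \mathrm{DFT}_\omega(P(-z))). \qquad (6)$$

For any $k \in \{0, \ldots, 2n - 1\}$, we further note that

$$\mathrm{DFT}_\omega(P(-z))_k = P(-\omega^k) = P(\omega^{(k+n) \operatorname{rem} 2n}) = \mathrm{DFT}_\omega(P(z))_{(k+n) \operatorname{rem} 2n}, \qquad (7)$$

so $\mathrm{DFT}_\omega(P(-z))$ can be obtained from $\mathrm{DFT}_\omega(P)$ using $n$ transpositions of elements in $K$. Concerning the inverse transform, we also note that

$$\mathrm{DFT}_\omega(G(P)(z^2))_k = G(P)(\omega^{2k}) = \mathrm{DFT}_{\omega^2}(G(P))_k$$

for $k = 0, \ldots, n - 1$. Plugging this into (6), we conclude that

$$G(P) = \mathrm{DFT}_{\omega^2}^{-1}\left( (\mathrm{DFT}_\omega(P)_k \, \mathrm{DFT}_\omega(P)_{k+n})_{0 \le k < n} \right).$$

This leads to the following algorithm for the computation of $G(P)$:

Algorithm 2
Input: $P \in K[z]$ with $\deg P < n$ and a primitive $(2n)$-th root of unity $\omega \in K$
Output: $G(P)$
1. Compute $(\hat{P}_k)_{0 \le k < 2n} := \mathrm{DFT}_\omega(P)$
2. For $k = 0, \ldots, n - 1$, compute $\hat{G}_k := \hat{P}_k \hat{P}_{k+n}$
3. Return $\mathrm{DFT}_{\omega^2}^{-1}((\hat{G}_k)_{0 \le k < n})$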

Proposition 1. Let $\omega \in K$ be a primitive $2n$-th root of unity in $K$ and assume that 2 is invertible in $K$. Given a monic polynomial $P \in K[z]$ with $\deg P < n$, we can compute $G(P)$ in time $G_{2,K}(n) \sim 3 F_K(n)$.

Proof. We have already explained the correctness of Algorithm 2. Step 1 requires one forward DFT of length $2n$ and cost $F_K(2n) = 2 F_K(n) + O(n)$. Step 2 can be done in $O(n)$. Step 3 requires one inverse DFT of length $n$ and cost $F_K(n) + O(n)$. The total cost of Algorithm 2 is therefore $3 F_K(n) + O(n) \sim 3 F_K(n)$.
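A direct transcription of this scheme (one forward DFT of length $2n$, $n$ pointwise products, one inverse DFT of length $n$) might look as follows in Python. The transforms are naive quadratic-time ones, so the sketch only illustrates the data flow, not the cost; the names are hypothetical. As written, it returns the polynomial $G$ with $G(z^2) = P(z)P(-z)$, whose leading coefficient is $(-1)^d$.

```python
def dft(a, omega, p):
    """Naive O(n^2) DFT: entry k is sum_i a[i] * omega^(i*k) mod p."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * k, p) for i in range(n)) % p
            for k in range(n)]

def graeffe_via_dft(P, n, omega, p):
    """G with G(z^2) = P(z) P(-z); needs deg P < n and omega of order 2n."""
    P_hat = dft(P + [0] * (2 * n - len(P)), omega, p)          # one DFT of length 2n
    G_hat = [P_hat[k] * P_hat[k + n] % p for k in range(n)]    # n products in K
    omega2_inv = pow(omega * omega % p, -1, p)
    n_inv = pow(n, -1, p)
    return [n_inv * c % p for c in dft(G_hat, omega2_inv, p)]  # inverse DFT, length n

def even_part_of_product(P, p):
    """Reference: even-index coefficients of P(z) P(-z), schoolbook product."""
    Q = [c if i % 2 == 0 else -c for i, c in enumerate(P)]
    prod = [0] * (2 * len(P) - 1)
    for i, x in enumerate(P):
        for j, y in enumerate(Q):
            prod[i + j] = (prod[i + j] + x * y) % p
    return prod[0::2]

p, n = 97, 4
omega = pow(5, (p - 1) // (2 * n), p)    # primitive 8th root of unity in F_97
P = [67, 31, 87, 1]                      # (z - 2)(z - 3)(z - 5) modulo 97
print(graeffe_via_dft(P, n, omega, p) == even_part_of_product(P, p))
```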

In terms of the complexity of multiplication, we obtain $G_{2,K}(n) \sim (1/2) M_K(n)$. This gives a 33.3% improvement over the previously best known bound $G_{2,K}(n) \sim (2/3) M_K(n)$ that was used in [10]. Note that the best known algorithm for squaring polynomials of degree $< n$ has cost $\sim (2/3) M_K(n)$. It would be interesting to know whether squares can also be computed in time $\sim (1/2) M_K(n)$.

In view of (3), Graeffe transforms of power of two orders $2^m$ can be computed using

$$G_{2^m}(P) = (G \circ \overset{m \times}{\cdots} \circ G)(P). \qquad (8)$$

Now assume that we computed the first Graeffe transform $G(P)$ using Algorithm 2 and that we wish to apply a second Graeffe transform to the result. Then we note that

$$\mathrm{DFT}_\omega(G(P))_{2k} = G(P)(\omega^{2k}) = \mathrm{DFT}_{\omega^2}(G(P))_k$$

is already known for $k = 0, \ldots, n - 1$. We can use this to accelerate step 1 of the second application of Algorithm 2. Indeed, in view of (5) for $n_1 = 2$ and $n_2 = n$,

$$\mathrm{DFT}_\omega(G(P))_{2k+1} = \mathrm{DFT}_{\omega^2}\left( (\omega^i G(P)_i)_{0 \le i < n} \right)_k$$

for $k = 0, \ldots, n - 1$. In order to exploit this idea in a recursive fashion, it is useful to modify Algorithm 2 so as to include $\mathrm{DFT}_{\omega^2}(P)$ in the input and $\mathrm{DFT}_{\omega^2}(G(P))$ in the output. This leads to the following algorithm:

Algorithm 3
Input: $P \in K[z]$ with $\deg P < n$, a primitive $(2n)$-th root of unity $\omega \in K$, and $(Q_k)_{0 \le k < n} = \mathrm{DFT}_{\omega^2}(P)$
Output: $G(P)$ and $\mathrm{DFT}_{\omega^2}(G(P))$
1. Set $(\hat{P}_{2k})_{0 \le k < n} := (Q_k)_{0 \le k < n}$
2. Set $(\hat{P}_{2k+1})_{0 \le k < n} := \mathrm{DFT}_{\omega^2}((\omega^i P_i)_{0 \le i < n})$
3. For $k = 0, \ldots, n - 1$, compute $\hat{G}_k := \hat{P}_k \hat{P}_{k+n}$
4. Return $\mathrm{DFT}_{\omega^2}^{-1}((\hat{G}_k)_{0 \le k < n})$ and $(\hat{G}_k)_{0 \le k < n}$

Proposition 2. Let $\omega \in K$ be a primitive $2n$-th root of unity in $K$ and assume that 2 is invertible in $K$. Given a monic polynomial $P \in K[z]$ with $\deg P < n$ and $m \ge 1$, we can compute $G_{2^m}(P)$ in time $G_{2^m,K}(n) \sim (2m + 1) F_K(n)$.

Proof. It suffices to compute $\mathrm{DFT}_{\omega^2}(P)$ and then to apply Algorithm 3 recursively, $m$ times. Every application of Algorithm 3 now takes $2 F_K(n) + O(n) \sim 2 F_K(n)$ operations in $K$, whence the claimed complexity bound.

Remark 5. In [10], Graeffe transforms of order $2^m$ were directly computed using the formula (8), using $\sim 4m F_K(n)$ operations in $K$, which is twice as slow as the new algorithm.
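The caching idea can be sketched in Python as follows: each Graeffe step reuses the transform returned by the previous step as the even-index half of the next length-$2n$ transform, so that only one forward and one inverse DFT of length $n$ are needed per step. The transforms are naive and the names are hypothetical. The example uses a polynomial of even degree, for which $P(z)P(-z)$ already gives a monic result; for odd degree a sign $(-1)^d$ would appear.

```python
def dft(a, omega, p):
    """Naive O(n^2) DFT: entry k is sum_i a[i] * omega^(i*k) mod p."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * k, p) for i in range(n)) % p
            for k in range(n)]

def graeffe_cached(P, Q, n, omega, p):
    """One Graeffe step given Q = DFT_{omega^2}(P); returns (G, DFT_{omega^2}(G))."""
    omega2 = omega * omega % p
    twisted = [pow(omega, i, p) * c % p for i, c in enumerate(P)]
    P_hat = [0] * (2 * n)
    P_hat[0::2] = Q                          # values P(omega^(2k)), already known
    P_hat[1::2] = dft(twisted, omega2, p)    # values P(omega^(2k+1))
    G_hat = [P_hat[k] * P_hat[k + n] % p for k in range(n)]
    n_inv = pow(n, -1, p)
    G = [n_inv * c % p for c in dft(G_hat, pow(omega2, -1, p), p)]
    return G, G_hat

def graeffe_power(P, m, n, omega, p):
    """Graeffe transform of order 2^m using 2m + 1 DFTs of length n."""
    P = P + [0] * (n - len(P))
    Q = dft(P, omega * omega % p, p)         # the single initial transform
    for _ in range(m):
        P, Q = graeffe_cached(P, Q, n, omega, p)
    return P

def from_roots(roots, p):
    """Monic polynomial prod (z - b), coefficients lowest degree first."""
    out = [1]
    for b in roots:
        out = [(x - b * y) % p for x, y in zip([0] + out, out + [0])]
    return out

p, n = 97, 8
omega = pow(5, (p - 1) // (2 * n), p)        # primitive 16th root of unity in F_97
roots = [2, 3, 5, 7]
expected = from_roots([pow(b, 8, p) for b in roots], p) + [0] * 3
print(graeffe_power(from_roots(roots, p), 3, n, omega, p) == expected)
```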

We have implemented the tangent Graeffe root finding algorithm (Algorithm 1) in C with the optimizations presented in Sect. 3. Our C implementation supports primes of size up to 63 bits. In what follows all complexities count arithmetic operations in F p .

In Tables 1 and 2 the input polynomial $P(z)$ of degree $d$ is constructed by choosing $d$ distinct values $\alpha_i \in \mathbb{F}_p$ for $1 \le i \le d$ at random and creating $P(z) = \prod_{i=1}^{d} (z - \alpha_i)$. We will use $p = 3 \times 29 \times 2^{56} + 1$, a smooth 63 bit prime.
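The stated properties of this modulus (63 bits, $p - 1$ smooth) are easy to check. The Python sketch below also includes a standard Miller-Rabin test with the first twelve primes as bases, which is known to be deterministic for integers of this size; the function names are hypothetical.

```python
def is_prime(n):
    """Miller-Rabin test with fixed small-prime bases (exact for n < 2^64)."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for q in bases:
        if n % q == 0:
            return n == q
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False        # a witnesses that n is composite
    return True

def odd_part_and_two_power(m):
    """Write m = sigma * 2^k with sigma odd; return (sigma, k)."""
    k = 0
    while m % 2 == 0:
        m //= 2
        k += 1
    return m, k

p = 3 * 29 * 2**56 + 1
print(p.bit_length(), odd_part_and_two_power(p - 1), is_prime(p))
```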

One goal we have is to determine how much faster the Tangent Graeffe (TG) root finding algorithm is in practice when compared with the Cantor-Zassenhaus (CZ) algorithm, which is implemented in many computer algebra systems. In Table 1 we present timings comparing our sequential implementation of the TG algorithm with Magma's implementation of the CZ algorithm. The timings in Table 1 are sequential timings obtained on a Linux server with an Intel Xeon E5-2660 CPU with 8 cores. In Table 1 the time in column "first" is for the first application of the TG algorithm (steps 1-9 of Algorithm 1), which obtains about 69% of the roots. The time in column "total" is the total time for the TG algorithm. Columns step 5, step 6, and step 9 report the time spent in steps 5, 6, and 9 in Algorithm 1 and do not count time in the recursive call in step 10.

The Magma timings are for Magma's Factorization command. The timings for Magma version V2.25-3 suggest that Magma's CZ implementation involves a subalgorithm with quadratic asymptotic complexity. Indeed it turns out that the author of the code implemented all of the sub-quadratic polynomial arithmetic correctly, as demonstrated by the second set of timings for Magma in column V2.25-5, but inserted the d linear factors found into a list using linear insertion! Allan Steel of the Magma group identified and fixed the offending subroutine for Magma version V2.25-5. The timings show that TG is faster than CZ by a factor of 76.6 (=8.43/0.11) to 146.3 (=2809/19.2).

We also wanted to attempt a parallel implementation. To do this we used the MIT Cilk C compiler from [8]. Cilk provides a simple fork-join model of parallelism. Unlike the CZ algorithm, TG has no gcd computations that are hard to parallelize. We present some initial parallel timing data in Table 2. The timings in parentheses are parallel timings for 8 cores.

Table 2. Real times in seconds for 1 core (8 cores) and $p = 3 \cdot 29 \cdot 2^{56} + 1$, for our parallel tangent Graeffe implementation in Cilk C. Columns: Total, First, Step 5, Step 6, Step 9.

To implement the Taylor shift $P(z + \tau)$ in step 3, we used the $O(M(d))$ method from [1, Lemma 3]. For step 5 we use Algorithm 3. It has complexity $O(M(d) \log(p/s))$. To evaluate $A(z)$, $A'(z)$ and $B(z)$ in step 6 in $O(M(s))$ we used the Bluestein transformation [3]. In step 9, to compute the product $Q(z) = \prod_{\alpha \in S} (z - \alpha)$ for $t = |S|$ roots, we used the $O(M(t) \log t)$ product tree multiplication algorithm [9]. The division in step 10 is done in $O(M(d))$ with fast division.
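The product tree used for step 9 can be sketched in a few lines of Python. The recursion splits the set of roots in two halves and multiplies the two sub-products; with a fast multiplication this gives the $O(M(t) \log t)$ bound, whereas the schoolbook product used here (an assumption made to keep the sketch self-contained) does not.

```python
def poly_mul(a, b, p):
    """Schoolbook product (stand-in for an FFT-based multiplication)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def product_tree(roots, p):
    """Coefficients (lowest degree first) of prod (z - a) for a in roots."""
    if not roots:
        return [1]
    if len(roots) == 1:
        return [(-roots[0]) % p, 1]
    mid = len(roots) // 2
    left = product_tree(roots[:mid], p)     # the two halves are independent,
    right = product_tree(roots[mid:], p)    # which is what allows parallelism
    return poly_mul(left, right, p)

print(product_tree([2, 3, 5], 97))          # (z - 2)(z - 3)(z - 5) modulo 97
```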

The sequential timings in Tables 1 and 2 show that steps 5, 6 and 9 account for about 90% of the total time. We parallelized these three steps as follows.

For step 5, the two forward and two inverse FFTs are done in parallel. We also parallelized our radix 2 FFT by parallelizing recursive calls for size $n \ge 2^{17}$ and the main loop in blocks of size $m \ge 2^{18}$ as done in [14]. For step 6 there are three applications of Bluestein to compute $A(\omega^{ir})$, $A'(\omega^{ir})$ and $B(\omega^{ir})$. We parallelized these (thereby doubling the overall space used by our implementation). The main computation in the Bluestein transformation is a polynomial multiplication of two polynomials of degree $s$. The two forward FFTs are done in parallel and the FFTs themselves are parallelized as for step 5. For the product in step 9 we parallelize the two recursive calls in the tree multiplication for large sizes and again, the FFTs are parallelized as for step 5.

To improve parallel speedup we also parallelized the polynomial multiplication in step 3 and the computation of the roots in step 8. Although step 8 is $O(|S|)$, it is relatively expensive because of two inverse computations in $\mathbb{F}_p$. Because we have not parallelized about 5% of the computation, the maximum parallel speedup we can obtain is a factor of $1/(0.05 + 0.95/8) = 5.9$. The best overall parallel speedup we obtained is a factor of 4.6 = 1465/307.7 for $d = 2^{25} - 1$.
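The bound on the achievable speedup is an instance of Amdahl's law. A short Python sketch (the function name is hypothetical) reproduces the figure for a 5% sequential fraction on 8 cores and shows how the bound would change with more cores.

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on the speedup when a fixed fraction of the work is sequential."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in (2, 4, 8, 16):
    print(cores, round(amdahl_speedup(0.05, cores), 2))
```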

[1] Evaluating polynomials on a fixed set of points

[2] A deterministic algorithm for sparse multivariate polynomial interpolation

[3] A linear filtering approach to the computation of discrete Fourier transform

[4] Fast solution of Toeplitz systems of equations and computation of Padé approximants

[5] Solving systems of non-linear polynomial equations faster

[6] A new algorithm for factoring polynomials over finite fields

[7] An algorithm for the machine calculation of complex Fourier series

[8] The implementation of the Cilk-5 multithreaded language

[9] Modern Computer Algebra

[10] Randomized root finding over finite fields using tangent Graeffe transforms

[11] Deterministic root finding over finite fields using Graeffe transforms

[12] Implementing the tangent Graeffe root finding method

[14] A parallel implementation for polynomial multiplication modulo a prime

[15] Essai expérimental et analytique sur les lois de la dilatabilité des fluides élastiques et sur celles de la force expansive de la vapeur de l'eau et de la vapeur de l'alkool, à différentes températures. J. de l'École Polytechnique Floréal et Prairial, an III

[16] What can (and can't) we do with sparse polynomials?

[17] A new polynomial factorization and its implementation