key: cord-0058554-deco81gu
authors: Takahashi, Daisuke
title: Fast Multiple Montgomery Multiplications Using Intel AVX-512IFMA Instructions
date: 2020-08-26
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58814-4_52
sha: b31cdc2f41ea0953e2842fd832fd72577703be2a
doc_id: 58554
cord_uid: deco81gu

In this paper, we propose a fast implementation of multiple Montgomery multiplications using Intel AVX-512IFMA (Integer Fused Multiply-Add) instructions. The proposed implementation is based on a modified Montgomery multiplication algorithm. For Montgomery multiplication operands with 52 bits or fewer, the proposed implementation using Intel AVX-512IFMA instructions is up to approximately 12.22 and 4.30 times faster than the implementations using Intel 64 and Intel AVX-512F (Foundation) instructions on an Intel Core i3-8121U processor, respectively.

Modular multiplication, widely used in fields such as computer algebra and cryptography, includes modulo operations, which are slow because they involve an integer division process. Montgomery multiplication [1] avoids this drawback. Modern processors support single-instruction multiple-data (SIMD) vector instructions, and the Intel Advanced Vector Extensions 512 (Intel AVX-512) [2] is a 512-bit vector instruction set. Montgomery multiplication algorithms that use vector instructions have been proposed [3, 4], as has fast modular squaring with Intel AVX-512IFMA (Integer Fused Multiply-Add) [5]. Vector instructions have also been used to compute multiple Montgomery multiplications in parallel [4]. For multiple Montgomery multiplications, implementations based on Intel Streaming SIMD Extensions 2 (Intel SSE2) instructions [6] and on the Cell processor [7] have been proposed.

Montgomery multiplication is usually performed on integers of several hundred bits or more; however, this paper considers the case of 52 bits or fewer. Multiple Montgomery multiplications with such numbers of bits are used in the modular fast Fourier transform (FFT) algorithm [8]. To the best of our knowledge, there are no existing implementations of multiple Montgomery multiplications using Intel AVX-512IFMA instructions. Here, we propose such an implementation and evaluate its performance.

The remainder of this paper is organized as follows. Section 2 describes the vectorization of multiple Montgomery multiplications. Section 3 presents the proposed implementation of multiple Montgomery multiplications using Intel AVX-512IFMA instructions. Section 4 presents the performance results. Finally, Sect. 5 gives the concluding remarks.

Montgomery multiplication [1] is shown in Algorithm 1. Figure 1 shows the Montgomery multiplication of 63-bit integers, which corresponds to β = 2^63 in Algorithm 1. This program uses the __uint128_t extension for 128-bit unsigned integer arithmetic supported by the GCC, Clang, and Intel C compilers. Because this program computes a single Montgomery multiplication, vectorization, which requires multiple Montgomery multiplications to be performed simultaneously, cannot be applied.

Fig. 1. Montgomery multiplication of 63-bit integers:

    #include <stdint.h>

    uint64_t mulmod63(uint64_t a, uint64_t b, uint64_t N, uint64_t mu)
    /* Compute mulmod63 = (a * b * 2^-63) mod N. We need mu = -N^-1 mod 2^63. */
    {
        __uint128_t t;
        uint64_t c, q;

        t = (__uint128_t)a * b;
        q = ((uint64_t)t * mu) & 0x7FFFFFFFFFFFFFFFULL; /* q = (t mod 2^63) * mu mod 2^63 */
        t += (__uint128_t)q * N;                        /* t + q * N is divisible by 2^63 */
        c = (uint64_t)(t >> 63);
        if (c >= N) c -= N;
        return c;
    }
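The constant μ = -N^-1 mod β required by Fig. 1 (and by the vectorized kernels in the following sections) is assumed to be precomputed. As a point of reference, one standard way to obtain it for an odd modulus N and β = 2^63 is the Newton-Hensel iteration sketched below; the helper name is illustrative and does not appear in the original figures.

    #include <stdint.h>

    /* Illustrative helper: compute mu = -N^-1 mod 2^63 for an odd modulus N. */
    uint64_t neg_inv_mod_2to63(uint64_t N)
    {
        uint64_t x = N;              /* N * N == 1 (mod 8), so x inverts N to 3 bits   */
        for (int i = 0; i < 5; i++)  /* each step doubles the number of correct bits    */
            x *= 2 - N * x;          /* Newton-Hensel step, arithmetic is taken mod 2^64 */
        return (0 - x) & 0x7FFFFFFFFFFFFFFFULL;  /* reduce -N^-1 modulo 2^63 */
    }

The same iteration yields μ for β = 2^52 or β = 2^31 by changing the final mask to 2^52 - 1 or 2^31 - 1, respectively.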
The Intel 64 instruction set supports the mulq and mulx instructions, which perform 64-bit × 64-bit → 128-bit unsigned integer multiplication. In contrast, the Intel AVX-512F (Foundation) instruction set supports only the vpmuludq instruction, which performs 32-bit × 32-bit → 64-bit unsigned integer multiplication. Furthermore, a statement that contains __uint128_t variables cannot be automatically vectorized by the Intel C compiler. The radix-β interleaved Montgomery multiplication algorithm [1, 4], shown in Algorithm 2, can be used to vectorize multiple Montgomery multiplications.

Algorithm 2. Radix-β interleaved Montgomery multiplication algorithm [1, 4].

In radix-2^32 interleaved Montgomery multiplication, there is some overflow in 64-bit unsigned integer addition. A vectorized multiple Montgomery squaring operation of 62-bit integers with β = 2^31 and n = 2 has been proposed to avoid this overflow [9]. We modified the vectorized multiple Montgomery squaring operation to achieve vectorized multiple Montgomery multiplications.

For multiple Montgomery multiplications, performance is degraded by the conditional subtraction C − N when C ≥ N on lines 6 and 7 of Algorithm 2. The conditional subtraction can be replaced by the minimum operation min(C, C − N) for unsigned integer values C and N with wrap-around two's complement arithmetic [9]. The Intel AVX-512F instruction set supports the vpminuq instruction for the 64-bit unsigned integer minimum operation.

Figure 2 shows the vectorized multiple Montgomery multiplications of 62-bit integers, which correspond to β = 2^31 and n = 2 in Algorithm 2. In Fig. 2, #pragma ivdep instructs the compiler to ignore assumed vector dependencies, and #pragma vector aligned instructs the compiler to use aligned data movement instructions for all array references during vectorization. The vectorized multiple Montgomery multiplications can be further optimized using the Intel AVX-512 intrinsics [10], as shown in Fig. 3. In this program, the Intel AVX-512 intrinsics _mm512_mul_epu32() and _mm512_min_epu64() correspond to the vpmuludq and vpminuq instructions, respectively. This program assumes that the vector length VLEN is divisible by 8. If VLEN is not divisible by 8, a remainder loop needs to be executed.
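As a sketch of the kind of loop Fig. 2 describes, consider the following plain-C kernel with β = 2^31 and n = 2, in which the conditional subtraction is expressed as an unsigned minimum. The routine name, the arrays a and b, and the constant VLEN are assumptions (N, mu, and c follow the intrinsic version in Fig. 3), mu[i] = -N[i]^-1 mod 2^31, and all arrays are assumed to be 64-byte aligned so that #pragma vector aligned is valid.

    #include <stdint.h>

    #define VLEN 1024  /* batch size; assumed to be a multiple of 8 */

    /* Sketch of a vectorizable loop for multiple 62-bit Montgomery
       multiplications with beta = 2^31 and n = 2 (names are illustrative). */
    void mulmod62_batch(const uint64_t *a, const uint64_t *b, const uint64_t *N,
                        const uint64_t *mu, uint64_t *c)
    {
        int64_t i;
    #pragma ivdep
    #pragma vector aligned
        for (i = 0; i < VLEN; i++) {
            uint64_t a0 = a[i] & 0x7FFFFFFF, a1 = a[i] >> 31;
            uint64_t b0 = b[i] & 0x7FFFFFFF, b1 = b[i] >> 31;
            uint64_t N0 = N[i] & 0x7FFFFFFF, N1 = N[i] >> 31;
            uint64_t t0 = a0 * b0, t1 = a0 * b1, t2 = a1 * b0, t3 = a1 * b1;
            uint64_t q, s;

            /* first reduction step: clear the low 31 bits */
            q  = ((t0 & 0x7FFFFFFF) * mu[i]) & 0x7FFFFFFF;
            t0 = ((t0 + q * N0) >> 31) + (t1 + q * N1);
            t2 += t0 & 0x7FFFFFFF;
            t3 += t0 >> 31;

            /* second reduction step: clear the next 31 bits */
            q  = ((t2 & 0x7FFFFFFF) * mu[i]) & 0x7FFFFFFF;
            t2 = ((t2 + q * N0) >> 31) + (t3 + q * N1);

            /* conditional subtraction as min(C, C - N) with wrap-around */
            s = t2 - N[i];
            c[i] = s < t2 ? s : t2;
        }
    }

With #pragma ivdep and #pragma vector aligned, the Intel C compiler is expected to vectorize this loop using vpmuludq and vpminuq, which the intrinsic version in Fig. 3 expresses explicitly.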
Fig. 3. Vectorized multiple Montgomery multiplications of 62-bit integers using Intel AVX-512 intrinsics (main loop):

    for (i = 0; i < VLEN; i += 8) {
      a0 = _mm512_and_epi64(_mm512_load_epi64(&a[i]), _mm512_set1_epi64(0x7FFFFFFF));
      a1 = _mm512_srli_epi64(_mm512_load_epi64(&a[i]), 31);
      b0 = _mm512_and_epi64(_mm512_load_epi64(&b[i]), _mm512_set1_epi64(0x7FFFFFFF));
      b1 = _mm512_srli_epi64(_mm512_load_epi64(&b[i]), 31);
      N0 = _mm512_and_epi64(_mm512_load_epi64(&N[i]), _mm512_set1_epi64(0x7FFFFFFF));
      N1 = _mm512_srli_epi64(_mm512_load_epi64(&N[i]), 31);
      t0 = _mm512_mul_epu32(a0, b0);
      t1 = _mm512_mul_epu32(a0, b1);
      t2 = _mm512_mul_epu32(a1, b0);
      t3 = _mm512_mul_epu32(a1, b1);
      q = _mm512_and_epi64(_mm512_mul_epu32(t0, _mm512_load_epi64(&mu[i])),
                           _mm512_set1_epi64(0x7FFFFFFF));
      t0 = _mm512_add_epi64(_mm512_srli_epi64(_mm512_add_epi64(t0, _mm512_mul_epu32(q, N0)), 31),
                            _mm512_add_epi64(t1, _mm512_mul_epu32(q, N1)));
      t2 = _mm512_add_epi64(t2, _mm512_and_epi64(t0, _mm512_set1_epi64(0x7FFFFFFF)));
      t3 = _mm512_add_epi64(t3, _mm512_srli_epi64(t0, 31));
      q = _mm512_and_epi64(_mm512_mul_epu32(t2, _mm512_load_epi64(&mu[i])),
                           _mm512_set1_epi64(0x7FFFFFFF));
      t2 = _mm512_add_epi64(_mm512_srli_epi64(_mm512_add_epi64(t2, _mm512_mul_epu32(q, N0)), 31),
                            _mm512_add_epi64(t3, _mm512_mul_epu32(q, N1)));
      _mm512_store_epi64(&c[i], _mm512_min_epu64(t2, _mm512_sub_epi64(t2, _mm512_load_epi64(&N[i]))));
    }

Intel AVX-512IFMA instructions [2] are supported by the Cannon Lake and Ice Lake microarchitectures, and will be supported by the Tiger Lake microarchitecture. The Intel AVX-512IFMA instruction set supports the vpmadd52luq and vpmadd52huq instructions, which multiply 52-bit unsigned integers and produce the low and high halves, respectively, of a 104-bit intermediate result. These halves are added to 64-bit accumulators.

Algorithm 3 shows a modified Montgomery multiplication algorithm.

Algorithm 3. Modified Montgomery multiplication. Input: A, B, N such that 0 ≤ A, B < N, β > N, gcd(β, N) = 1.

C + qN on line 3 of Algorithm 1 is divisible by β. That is, the lower-half bits of C + qN are 0, and thus do not need to be computed. In this case, if q on line 2 of Algorithm 1 is 0, the carry added to the sum of the upper-half bits of C and the upper-half bits of qN is 0; otherwise, it is 1. We note that the lower-half bits of qN need not be calculated in Algorithm 3.

If the Montgomery multiplication operand is 52 bits or fewer, β = 2^52 can be set in Algorithm 3 by using the Intel AVX-512IFMA instructions. Line 1 of Algorithm 3 calculates the remainder of dividing AB by β, and thus only the low 52 bits of AB need to be calculated. We can use the vpmadd52luq instruction for this calculation. Line 2 of Algorithm 3 calculates the remainder of dividing μC by β, and thus only the low 52 bits of μC need to be calculated; the vpmadd52luq instruction can again be used for this calculation. On line 3 of Algorithm 3, the high 52 bits of AB and qN can be calculated using the vpmadd52huq instruction. The conditional increment C + 1 when q ≠ 0 on lines 4 and 5 of Algorithm 3 can be replaced by the minimum operation and the addition C + min(q, 1) for unsigned integer values q and C. The minimum operation min(q, 1) is performed using the vpminuq instruction. The conditional subtraction on lines 6 and 7 of Algorithm 3 is also performed using the vpminuq instruction, as described in Sect. 2.

Figure 4 shows the vectorized multiple Montgomery multiplications of 52-bit integers using Intel AVX-512 intrinsics. In this program, the Intel AVX-512 intrinsics _mm512_madd52lo_epu64(), _mm512_madd52hi_epu64(), and _mm512_min_epu64() correspond to the vpmadd52luq, vpmadd52huq, and vpminuq instructions, respectively. This program assumes that the vector length VLEN is divisible by 8. If VLEN is not divisible by 8, a remainder loop needs to be executed.
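A minimal sketch of such a kernel, using the intrinsics named above, is given below. The routine name and the arrays a and b are assumptions (N, mu, and c follow Fig. 3), the arrays are assumed to be 64-byte aligned, and mu[i] = -N[i]^-1 mod 2^52.

    #include <stdint.h>
    #include <immintrin.h>

    #define VLEN 1024  /* batch size; assumed to be a multiple of 8 */

    /* Sketch of multiple 52-bit Montgomery multiplications with beta = 2^52
       (Algorithm 3) using Intel AVX-512IFMA intrinsics; names are illustrative. */
    void mulmod52_batch(const uint64_t *a, const uint64_t *b, const uint64_t *N,
                        const uint64_t *mu, uint64_t *c)
    {
        const __m512i zero = _mm512_setzero_si512();
        const __m512i one = _mm512_set1_epi64(1);
        int64_t i;

        for (i = 0; i < VLEN; i += 8) {
            __m512i A = _mm512_load_epi64(&a[i]);
            __m512i B = _mm512_load_epi64(&b[i]);
            __m512i Nv = _mm512_load_epi64(&N[i]);
            __m512i Mu = _mm512_load_epi64(&mu[i]);
            /* line 1: C0 = AB mod 2^52 (low 52 bits of the product) */
            __m512i C0 = _mm512_madd52lo_epu64(zero, A, B);
            /* line 2: q = mu * C0 mod 2^52 */
            __m512i q = _mm512_madd52lo_epu64(zero, Mu, C0);
            /* line 3: C = high 52 bits of AB plus high 52 bits of qN */
            __m512i C = _mm512_madd52hi_epu64(zero, A, B);
            C = _mm512_madd52hi_epu64(C, q, Nv);
            /* lines 4-5: add the carry (1 if q != 0) as min(q, 1) */
            C = _mm512_add_epi64(C, _mm512_min_epu64(q, one));
            /* lines 6-7: conditional subtraction as min(C, C - N) */
            C = _mm512_min_epu64(C, _mm512_sub_epi64(C, Nv));
            _mm512_store_epi64(&c[i], C);
        }
    }

Per eight operands, the whole reduction then takes only a handful of IFMA, addition, and minimum instructions, which is consistent with the instruction-count comparison against the Intel AVX-512F kernel discussed in the next section.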
For performance evaluation, a comparison was performed among multiple 52-bit Montgomery multiplications using Intel AVX-512IFMA instructions, multiple 62-bit Montgomery multiplications using Intel AVX-512F instructions, and multiple 63-bit Montgomery multiplications using Intel 64 instructions. Since a processor based on the Ice Lake microarchitecture was not available at the time of writing this paper, the performance was measured on an Intel Core i3-8121U processor, which is the only processor based on the Cannon Lake microarchitecture. In the computation of the Montgomery multiplication C = ABβ^-1 mod N, N is an odd random number in the range 1 to 2^52 − 1, and A and B are random numbers in the range 0 ≤ A, B < N. The modular multiplicative inverses μ in Algorithms 1, 2, and 3 were prepared in advance.

The batch size of Montgomery multiplications was varied from 64 to 1024. Each batch of Montgomery multiplications was executed one million times, and the number of Montgomery multiplications per second (Mulmod × 10^9/s) was calculated from the average elapsed time. The specifications of the test platform are shown in Table 1. The Intel Core i3-8121U processor has two cores; however, we evaluated the performance on a single core and a single thread to focus on vectorization. The Intel C compiler (version 19.0.5.281) was used, invoked as icc -O3 -xCANNONLAKE. The compiler option -O3 enables the most aggressive optimizations for maximum speed, and -xCANNONLAKE generates instructions for the Cannon Lake microarchitecture.

Table 2 and Fig. 5 show the performance of multiple Montgomery multiplications using Intel 64, Intel AVX-512F, and Intel AVX-512IFMA instructions. When the batch size was 128, the proposed implementation using Intel AVX-512IFMA instructions was approximately 12.22 and 4.30 times faster than the implementations using Intel 64 and Intel AVX-512F instructions, respectively, on the Intel Core i3-8121U processor. The reason for this is that the implementation using Intel 64 instructions computes one Montgomery multiplication at a time, whereas the implementation using Intel AVX-512IFMA instructions computes eight simultaneously. Furthermore, the implementation using Intel AVX-512IFMA instructions requires fewer instructions than the implementation using Intel AVX-512F instructions.

In this paper, we proposed a fast implementation of multiple Montgomery multiplications using Intel AVX-512IFMA instructions. The proposed implementation is based on a modified Montgomery multiplication algorithm. For Montgomery multiplication operands with 52 bits or fewer, the proposed implementation using Intel AVX-512IFMA instructions outperforms the implementations using Intel 64 and Intel AVX-512F instructions on the Intel Core i3-8121U processor. Our future work is to demonstrate the effectiveness of the proposed implementation on other processors that support the Intel AVX-512IFMA instructions.

Acknowledgments. This research was partially supported by JSPS KAKENHI Grant Number JP19K11989.

References
1. Modular multiplication without trial division
2. Intel Corporation: Intel 64 and IA-32 architectures software developer's manual
3. Software implementation of modular exponentiation, using advanced vector instructions architectures
4. Montgomery multiplication using vector instructions
5. Fast modular squaring with AVX512IFMA
6. Parallel cryptographic arithmetic using a redundant Montgomery representation
7. High-performance modular multiplication on the Cell processor
8. Spiral-generated modular FFT algorithms
9. Computation of the 100 quadrillionth hexadecimal digit of π on a cluster of Intel Xeon Phi processors
10. Intel Corporation: Intel C++ compiler 19.0 developer guide and reference