Pushing the Limit of Vectorized Polynomial Multiplication for NTRU Prime

Vincent Hwang

Paper 2023/604

Pushing the Limit of Vectorized Polynomial Multiplication for NTRU Prime

Vincent Hwang, Max Planck Institute for Security and Privacy

Abstract

We conduct a systematic examination of vector arithmetic for polynomial multiplications in software. Vector instruction sets and extensions typically specify a fixed number of registers, each holding a power-of-two number of bits, and support a wide variety of vector arithmetic on registers. Programmers then try to align mathematical computations with the vector arithmetic supported by the designated instruction set or extension. We delve into the intricacies of this process for polynomial multiplications. In particular, we introduce “vectorization- friendliness” and “permutation-friendliness”, and review “Toeplitz matrix- vector product” to systematically identify suitable mappings from homo- morphisms to vectorized implementations. To illustrate how the formalization works, we detail the vectorization of polynomial multiplication in $\mathbb{Z}_{4591}[x]/\left\langle x^{761} − x − 1 \right\rangle$ used in the parameter set sntrup761 of the NTRU Prime key encapsulation mechanism. For practical evaluation, we implement vectorized polynomial multipliers for the ring $\mathbb{Z}_{4591}[x]/\left\langle \Phi_{17} (x^{96}) \right\rangle$ with AVX2 and Neon. We benchmark our AVX2 implementation on Haswell and Skylake and our Neon implementation on Cortex-A72 and the “Firestorm” core of Apple M1 Pro. Our AVX2-optimized implementation is 1.99−2.16 times faster than the state-of-the-art AVX2-optimized implementation by [Bernstein, Brum- ley, Chen, and Tuveri, USENIX Security 2022] on Haswell and Skylake, and our Neon-optimized implementation is 1.29−1.36 times faster than the state-of-the-art Neon-optimized implementation by [Hwang, Liu, and Yang, ACNS 2024] on Cortex-A72 and Apple M1 Pro. For the overall scheme with AVX2, we reduce the batch key generation cycles (amortized with batch size 32) by 7.9%−12.0%, encapsulation cycles by 7.1%−10.3%, and decapsulation cycles by 10.7%−13.3% on Haswell and Skylake. For the overall performance with Neon, we reduce the encapsulation cycles by 3.0%−6.6% and decapsulation cycles by 12.8%−15.1% on Cortex-A72 and Apple M1 Pro.

Note: Update the paper to the final version.

Metadata

Available format(s): PDF
Category: Implementation
Publication info: Published elsewhere. Australasian Conference on Information Security and Privacy 2024
Keywords: Vectorization Polynomial Multiplication Fast Fourier Transform NTRU Prime
Contact author(s): vincentvbh7 @ gmail com
History: 2024-04-18: last of 8 revisions; 2023-04-27: received; See all versions
Short URL: https://ia.cr/2023/604
License: CC0

BibTeX

@misc{cryptoeprint:2023/604,
      author = {Vincent Hwang},
      title = {Pushing the Limit of Vectorized Polynomial Multiplication for NTRU Prime},
      howpublished = {Cryptology ePrint Archive, Paper 2023/604},
      year = {2023},
      note = {\url{https://eprint.iacr.org/2023/604}},
      url = {https://eprint.iacr.org/2023/604}
}