Paper 2022/1303
Fast and Clean: Auditable high-performance assembly via constraint solving
Abstract
Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice. In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture. We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.
Metadata
- Available format(s)
- Category
- Implementation
- Publication info
- Published by the IACR in TCHES 2024
- Keywords
- SuperoptimizationConstraint SolvingPost-Quantum CryptographyKyberDilithiumX25519HeliumNeonFFTNTT
- Contact author(s)
-
amin abdulrahman @ mpi-sp org
beckphan @ amazon co uk
matthias @ kannwischer eu
fabien klein @ arm com - History
- 2023-10-06: last of 5 revisions
- 2022-09-30: received
- See all versions
- Short URL
- https://ia.cr/2022/1303
- License
-
CC BY
BibTeX
@misc{cryptoeprint:2022/1303, author = {Amin Abdulrahman and Hanno Becker and Matthias J. Kannwischer and Fabien Klein}, title = {Fast and Clean: Auditable high-performance assembly via constraint solving}, howpublished = {Cryptology {ePrint} Archive, Paper 2022/1303}, year = {2022}, url = {https://eprint.iacr.org/2022/1303} }