Paper 2022/1303

Fast and Clean: Auditable high-performance assembly via constraint solving

Amin Abdulrahman, Ruhr University Bochum, Max Planck Institute for Security and Privacy
Hanno Becker, Amazon Web Services
Matthias J. Kannwischer, Institute of Information Science, Academia Sinica
Fabien Klein, Arm Limited

Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice. In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture. We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.
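To illustrate what "finding an instruction schedule with respect to a model of the target microarchitecture" can mean, the following is a minimal sketch and not SLOTHY's actual interface or model. It assumes a hypothetical five-instruction program, a single-issue in-order core, and made-up latencies. An exhaustive search over permutations stands in for the constraint solver, and only read-after-write dependencies are tracked, since every toy instruction writes a distinct register.

```python
from itertools import permutations

# Hypothetical toy program: (name, destination, sources, latency in cycles).
# Names, registers and latencies are illustrative assumptions only.
PROGRAM = [
    ("load_a", "r0", (), 3),
    ("mul_a", "r1", ("r0",), 2),
    ("load_b", "r2", (), 3),
    ("mul_b", "r3", ("r2",), 2),
    ("add", "r4", ("r1", "r3"), 1),
]


def dependencies(program):
    """Map each instruction index to the indices that produce its sources."""
    last_writer = {}
    deps = {}
    for i, (_, dst, srcs, _) in enumerate(program):
        deps[i] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dst] = i
    return deps


def cycles(order, program, deps):
    """Cycle count of `order` on a single-issue in-order toy core.

    An instruction issues once all its inputs are ready, at most one
    instruction per cycle. Returns None if `order` places an instruction
    before one of its producers.
    """
    ready = {}   # instruction index -> cycle at which its result is available
    cycle = 0    # earliest cycle at which the next instruction may issue
    for i in order:
        if any(d not in ready for d in deps[i]):
            return None
        issue = max([cycle] + [ready[d] for d in deps[i]])
        ready[i] = issue + program[i][3]
        cycle = issue + 1
    return max(ready.values())


def best_schedule(program):
    """Brute-force search for a dependency-respecting order with minimal cycles."""
    deps = dependencies(program)
    best = None
    for order in permutations(range(len(program))):
        cost = cycles(order, program, deps)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, order)
    return best


if __name__ == "__main__":
    deps = dependencies(PROGRAM)
    print("readable order:", cycles(tuple(range(len(PROGRAM))), PROGRAM, deps))
    cost, order = best_schedule(PROGRAM)
    print("best order:", cost, [PROGRAM[i][0] for i in order])
```

Under these assumptions the readable order takes 10 cycles, while interleaving the two independent load/multiply chains brings it down to 7. Exhaustive search is only feasible for a handful of instructions, which is why a real tool would hand the same kind of constraints to a solver instead.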

Keywords: Superoptimization, Constraint Solving, Post-Quantum Cryptography, Kyber, Dilithium, X25519, Helium, Neon, FFT, NTT
Contact author(s)
amin abdulrahman @ mpi-sp org
beckphan @ amazon co uk
matthias @ kannwischer eu
fabien klein @ arm com
2023-03-26: last of 4 revisions
2022-09-30: received
Creative Commons Attribution


@misc{cryptoeprint:2022/1303,
      author = {Amin Abdulrahman and Hanno Becker and Matthias J. Kannwischer and Fabien Klein},
      title = {Fast and Clean: Auditable high-performance assembly via constraint solving},
      howpublished = {Cryptology ePrint Archive, Paper 2022/1303},
      year = {2022}
}