Paper 2024/095

ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches

Tian Zhou, University of Science and Technology of China
Fangyu Zheng, University of Chinese Academy of Sciences
Guang Fan, Ant Group
Lipeng Wan, University of Chinese Academy of Sciences
Wenxu Tang, University of Science and Technology of China
Yixuan Song, Ant Group
Yi Bian, University of Chinese Academy of Sciences
Jingqiang Lin, University of Science and Technology of China
Abstract

The remarkable performance of AI accelerators offers promising opportunities for accelerating cryptographic algorithms, particularly lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain rudimentary and overlook the intricate internal mechanisms of these devices, so a significant portion of their computational resources goes underutilized. In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. First, we propose two approaches that efficiently break down Kyber's NTT into iterative matrix multiplications, reducing costs by approximately 75% compared to state-of-the-art scanning-based methods. Second, by reverse-engineering the internal mechanisms of Tensor Cores, we precisely manipulate their internal resources using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building on our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core based work, achieving speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Although our full Kyber implementation is throughput-oriented, its execution latency remains acceptable: for instance, when achieving peak throughput for Kyber-1024 on an RTX 3080, the execution latency ranges from 1.02 to 5.68 milliseconds.
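To make the idea of expressing Kyber's NTT as matrix multiplication concrete, here is a minimal Python sketch under stated assumptions. It uses the standard Kyber parameters q = 3329, n = 256 and ζ = 17, and the matrix form of Kyber's incomplete NTT given in FIPS 203: output coefficients 2i and 2i+1 are the even and odd halves of f evaluated at γ_i = ζ^(2·BitRev7(i)+1). The sketch is only the naive single-matrix baseline, checked against the layer-by-layer butterfly NTT. It does not reproduce the paper's iteration-based decompositions, Tensor Core data types, or assembly-level code, and the helper names (build_ntt_matrix, matmul_mod, polyvec_ntt_matmul) are hypothetical.

```python
Q = 3329   # Kyber modulus
ZETA = 17  # primitive 256th root of unity mod Q (standard Kyber choice)


def bitrev7(x):
    """Reverse the 7-bit binary representation of x (0 <= x < 128)."""
    return int(format(x, "07b")[::-1], 2)


def ntt_butterfly(f):
    """Reference layer-by-layer Cooley-Tukey NTT (7 layers, stops at degree-1 pieces)."""
    f = list(f)
    k = 1
    length = 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = pow(ZETA, bitrev7(k), Q)
            k += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q  # uses the old f[j]
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


def build_ntt_matrix():
    """Hypothetical helper: 128 x 128 matrix with M[i][j] = gamma_i^j mod Q."""
    return [[pow(ZETA, (2 * bitrev7(i) + 1) * j, Q) for j in range(128)]
            for i in range(128)]


def matmul_mod(A, B, q=Q):
    """Plain (rows x inner) @ (inner x cols) product, entries reduced mod q."""
    cols = len(B[0])
    return [[sum(a * B[t][c] for t, a in enumerate(row)) % q for c in range(cols)]
            for row in A]


def polyvec_ntt_matmul(polys, M):
    """NTT of several polynomials with ONE matrix product (hypothetical helper)."""
    # Column 2*idx holds the even coefficients of polys[idx], column 2*idx+1 the odd ones,
    # so B is 128 x 2k for k polynomials.
    B = [[p[2 * j + h] for p in polys for h in (0, 1)] for j in range(128)]
    C = matmul_mod(M, B)  # 128 x 2k
    out = []
    for idx in range(len(polys)):
        r = [0] * 256
        for i in range(128):
            r[2 * i] = C[i][2 * idx]          # even half evaluated at gamma_i
            r[2 * i + 1] = C[i][2 * idx + 1]  # odd half evaluated at gamma_i
        out.append(r)
    return out


if __name__ == "__main__":
    M = build_ntt_matrix()
    vec = [[(3 * i + 1) % Q for i in range(256)], [(i * i + 5) % Q for i in range(256)]]
    assert polyvec_ntt_matmul(vec, M) == [ntt_butterfly(p) for p in vec]
    print("matrix-form NTT matches the butterfly NTT")
```

Stacking the even and odd halves of k polynomials as 2k columns turns a polyvec NTT into a single 128 × 128 by 128 × 2k product, the dense, batched shape that matrix units are built for. The price is O(n²) multiplications per polynomial instead of the butterfly's O(n log n), which is why splitting the transform into smaller, iterated matrix products is attractive.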

Metadata
Category
Implementation
Publication info
Published by the IACR in TCHES 2024
Keywords
Lattice-based Cryptography, GPUs, Tensor Core, Kyber
Contact author(s)
weekdayzt @ mail ustc edu cn
zhengfangyu @ ucas ac cn
fanguang fg @ antgroup com
szxwlp @ foxmail com
wenxutang @ mail ustc edu cn
songyixuan syx @ antgroup com
bianyi18 @ mails ucas ac cn
linjq @ ustc edu cn
History
2024-01-22: approved
2024-01-22: received
Short URL
https://ia.cr/2024/095
License
Creative Commons Attribution
CC BY

BibTeX

@misc{cryptoeprint:2024/095,
      author = {Tian Zhou and Fangyu Zheng and Guang Fan and Lipeng Wan and Wenxu Tang and Yixuan Song and Yi Bian and Jingqiang Lin},
      title = {ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches},
      howpublished = {Cryptology ePrint Archive, Paper 2024/095},
      year = {2024},
      note = {\url{https://eprint.iacr.org/2024/095}},
      url = {https://eprint.iacr.org/2024/095}
}