ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches

Tian Zhou; Fangyu Zheng; Guang Fan; Lipeng Wan; Wenxu Tang; Yixuan Song; Yi Bian; Jingqiang Lin

Paper 2024/095

Tian Zhou, University of Science and Technology of China
Fangyu Zheng, University of Chinese Academy of Sciences
Guang Fan, Ant Group
Lipeng Wan, University of Chinese Academy of Sciences
Wenxu Tang, University of Science and Technology of China
Yixuan Song, Ant Group
Yi Bian, University of Chinese Academy of Sciences
Jingqiang Lin, University of Science and Technology of China

Abstract

The remarkable performance of AI accelerators offers promising opportunities for accelerating cryptographic algorithms, particularly lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain at a rudimentary level of implementation and overlook the intricate internal mechanisms of these devices. Consequently, a significant share of their computational resources is underutilized. In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. First, we propose two approaches that efficiently break down Kyber's NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in cost compared to the state-of-the-art scanning-based methods. Second, by reverse-engineering the internal mechanisms of Tensor Cores, we precisely manipulate their internal resources using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core based work, achieving speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Although our full Kyber implementation is throughput-oriented, its execution latency remains acceptable: for instance, it ranges from 1.02 to 5.68 milliseconds for Kyber-1024 on R3080 at peak throughput.
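The idea of expressing Kyber's NTT as a matrix product can be illustrated with a minimal Python sketch. This is not the paper's iteration-based decomposition, nor its Tensor Core code: it only shows the plain one-shot form, in which the 128 even and 128 odd coefficients are each multiplied by the same 128x128 matrix of powers of the root of unity, and checks the result against a standard butterfly NTT. The modulus 3329 and root 17 are Kyber's standard parameters; the function names are hypothetical and chosen for illustration only.

```python
Q = 3329   # Kyber modulus
ZETA = 17  # primitive 256th root of unity modulo Q
N = 256    # polynomial degree


def bitrev7(i):
    """Reverse the 7-bit binary representation of i."""
    return int(format(i, "07b")[::-1], 2)


def ntt_matrix():
    """Build W with W[i][j] = ZETA^((2*bitrev7(i)+1)*j) mod Q (128x128)."""
    return [[pow(ZETA, (2 * bitrev7(i) + 1) * j, Q) for j in range(128)]
            for i in range(128)]


def ntt_by_matmul(f, W):
    """Kyber NTT as one (128x128) x (128x2) matrix product modulo Q.

    Column 0 of the right-hand operand holds the even coefficients of f,
    column 1 the odd ones; row i of the result is f mod (X^2 - root_i).
    """
    out = [0] * N
    for i in range(128):
        out[2 * i] = sum(W[i][j] * f[2 * j] for j in range(128)) % Q
        out[2 * i + 1] = sum(W[i][j] * f[2 * j + 1] for j in range(128)) % Q
    return out


def ntt_reference(f):
    """Seven-layer butterfly NTT, used here only to check the matrix form."""
    f = list(f)
    k = 1
    length = 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            zeta = pow(ZETA, bitrev7(k), Q)
            k += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


if __name__ == "__main__":
    poly = [(7 * i * i + 3) % Q for i in range(N)]
    assert ntt_by_matmul(poly, ntt_matrix()) == ntt_reference(poly)
    print("matrix-form NTT matches the butterfly NTT")
```

The one-shot product costs 128 x 128 multiplications per column, far more than the butterfly network, which is why a practical accelerator mapping would split the transform into several smaller matrix multiplications, as the abstract describes.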

Metadata

Available format(s): PDF
Category: Implementation
Publication info: Published by the IACR in TCHES 2024
Keywords: Lattice-based Cryptography, GPUs, Tensor Core, Kyber
Contact author(s): weekdayzt @ mail ustc edu cn
zhengfangyu @ ucas ac cn
fanguang fg @ antgroup com
szxwlp @ foxmail com
wenxutang @ mail ustc edu cn
songyixuan syx @ antgroup com
bianyi18 @ mails ucas ac cn
linjq @ ustc edu cn
History: 2024-01-22: approved; 2024-01-22: received
Short URL: https://ia.cr/2024/095
License: CC BY

BibTeX

@misc{cryptoeprint:2024/095,
      author = {Tian Zhou and Fangyu Zheng and Guang Fan and Lipeng Wan and Wenxu Tang and Yixuan Song and Yi Bian and Jingqiang Lin},
      title = {{ConvKyber}: Unleashing the Power of {AI} Accelerators for Faster Kyber with Novel Iteration-based Approaches},
      howpublished = {Cryptology {ePrint} Archive, Paper 2024/095},
      year = {2024},
      url = {https://eprint.iacr.org/2024/095}
}