HI-CKKS: Is High-Throughput Neglected? Reimagining CKKS Efficiency with Parallelism

Fuyuan Chen; Jiankuo Dong; Xiaoyu Hu; Zhenjiang Dong; Wangchen Dai

Paper 2024/1976

Fuyuan Chen, Nanjing University of Posts and Telecommunications

Jiankuo Dong, Nanjing University of Posts and Telecommunications

Xiaoyu Hu, Nanjing University of Posts and Telecommunications

Zhenjiang Dong, Nanjing University of Posts and Telecommunications

Wangchen Dai, Zhejiang Lab

Abstract

The proliferation of data outsourcing and cloud services has heightened privacy risks. CKKS, one of the most prominent homomorphic encryption schemes, allows computation on encrypted data and thus serves as a critical privacy safeguard. However, performance remains a central bottleneck that hinders widespread adoption, and existing optimization efforts often prioritize latency reduction over throughput. This paper presents HI-CKKS, a throughput-oriented High-performance Implementation of CKKS homomorphic encryption that addresses these challenges. HI-CKKS introduces a batch-supporting asynchronous execution scheme that mitigates the frequent data interactions and high waiting delays between hosts and servers in service-oriented scenarios. We analyze the fundamental (I)NTT primitive, which is critical in CKKS, and develop a hierarchical, hybrid high-throughput implementation. It includes efficient instruction-set implementations of the arithmetic modules, unified kernel fusion, and hybrid memory optimization strategies that significantly improve memory access efficiency and (I)NTT performance. Additionally, we propose a multi-dimensional parallel homomorphic multiplication scheme aimed at maximizing throughput and enhancing the performance of (I)NTT and homomorphic multiplication. Finally, we deploy our implementation on the RTX 4090 and conduct a thorough throughput evaluation of HI-CKKS, which lets us pinpoint the most effective parallel parameter settings. Compared to the CPU implementation, our system achieves throughput increases for NTT, INTT, and HMult, and its throughput still improves significantly on the latest GPU-based works.
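For readers unfamiliar with the (I)NTT primitive named in the abstract, the following is a minimal CPU-side sketch of why it matters for homomorphic multiplication: it turns polynomial multiplication modulo X^N + 1 into a pointwise product. It is not the HI-CKKS GPU implementation. The toy parameters `Q = 257` and `N = 8`, and all function names (`find_psi`, `cyclic_ntt`, `ntt`, `intt`, `poly_mul`, `batch_poly_mul`, `schoolbook_mul`), are assumptions chosen for illustration; a real CKKS deployment would use far larger ring degrees and several RNS primes.

```python
# Toy parameters (illustrative assumptions, not taken from the paper).
Q = 257  # prime modulus with Q ≡ 1 (mod 2N)
N = 8    # power-of-two ring degree


def find_psi(n, q):
    """Search for a primitive 2n-th root of unity mod q (psi^n == -1)."""
    for g in range(2, q):
        psi = pow(g, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:
            return psi
    raise ValueError("no primitive 2n-th root of unity found")


def cyclic_ntt(a, omega, q):
    """Recursive radix-2 NTT; omega must be a primitive len(a)-th root."""
    n = len(a)
    if n == 1:
        return a[:]
    omega_sq = omega * omega % q
    even = cyclic_ntt(a[0::2], omega_sq, q)
    odd = cyclic_ntt(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q           # butterfly, upper half
        out[k + n // 2] = (even[k] - t) % q  # butterfly, lower half
        w = w * omega % q
    return out


def ntt(a, psi, q):
    """Negacyclic NTT: twist by powers of psi, then a cyclic NTT."""
    twisted = [x * pow(psi, j, q) % q for j, x in enumerate(a)]
    return cyclic_ntt(twisted, psi * psi % q, q)


def intt(a_hat, psi, q):
    """Inverse negacyclic NTT: inverse cyclic NTT, scale, untwist."""
    psi_inv = pow(psi, -1, q)
    n_inv = pow(len(a_hat), -1, q)
    a = cyclic_ntt(a_hat, psi_inv * psi_inv % q, q)
    return [x * n_inv * pow(psi_inv, j, q) % q for j, x in enumerate(a)]


def poly_mul(a, b, psi, q):
    """Multiply modulo (X^n + 1, q) via a pointwise product in NTT form."""
    a_hat, b_hat = ntt(a, psi, q), ntt(b, psi, q)
    return intt([x * y % q for x, y in zip(a_hat, b_hat)], psi, q)


def batch_poly_mul(pairs, psi, q):
    """Each pair is independent, so a batch can be processed in parallel."""
    return [poly_mul(a, b, psi, q) for a, b in pairs]


def schoolbook_mul(a, b, q):
    """O(n^2) reference: X^n wraps around to -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c
```

The batch helper runs sequentially here; it only marks the independence between polynomials that a throughput-oriented design could exploit, for example by assigning each polynomial to its own group of GPU threads.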

Metadata

Available format(s): PDF
Category: Implementation
Publication info: Preprint.
Keywords: CKKS, Homomorphic Multiplication, Number Theoretic Transform (NTT), Parallel Processing, GPU
Contact author(s): 2022040501 @ njupt edu cn
djiankuo @ foxmail com
1535575390 @ qq com
History: 2025-06-07: revised; 2024-12-06: received
Short URL: https://ia.cr/2024/1976
License: CC BY-NC-SA

BibTeX

@misc{cryptoeprint:2024/1976,
      author = {Fuyuan Chen and Jiankuo Dong and Xiaoyu Hu and Zhenjiang Dong and Wangchen Dai},
      title = {{HI}-{CKKS}: Is High-Throughput Neglected? Reimagining {CKKS} Efficiency with Parallelism},
      howpublished = {Cryptology {ePrint} Archive, Paper 2024/1976},
      year = {2024},
      url = {https://eprint.iacr.org/2024/1976}
}