Paper 2024/1976

HI-CKKS: Is High-Throughput Neglected? Reimagining CKKS Efficiency with Parallelism

Fuyuan Chen, Nanjing University of Posts and Telecommunications
Jiankuo Dong, Nanjing University of Posts and Telecommunications
Xiaoyu Hu, Nanjing University of Posts and Telecommunications
Zhenjiang Dong, Nanjing University of Posts and Telecommunications
Wangchen Dai, Zhejiang Lab
Jingqiang Lin, University of Science and Technology of China
Fu Xiao, Nanjing University of Posts and Telecommunications
Abstract

The proliferation of data outsourcing and cloud services has heightened privacy vulnerabilities. CKKS, one of the most prominent homomorphic encryption schemes, allows computation on encrypted data and thus serves as a critical privacy safeguard. However, performance remains a central bottleneck that hinders widespread adoption, and existing optimization efforts often prioritize latency reduction over throughput. This paper presents HI-CKKS, a throughput-oriented High-performance Implementation of CKKS homomorphic encryption, to address these challenges. HI-CKKS introduces a batch-supporting asynchronous execution scheme that mitigates the frequent data interactions and high waiting delays between hosts and servers in service-oriented scenarios. We analyze the fundamental (I)NTT primitive, which is critical in CKKS, and develop a hierarchical, hybrid high-throughput implementation of it. This implementation combines efficient instruction-set-level arithmetic modules, unified kernel fusion, and hybrid memory optimization strategies, which together significantly improve memory access efficiency and (I)NTT performance. Additionally, we propose a multi-dimensional parallel homomorphic multiplication scheme aimed at maximizing throughput and enhancing the performance of (I)NTT and homomorphic multiplication. Finally, we deploy our implementation on the RTX 4090 and conduct a thorough throughput evaluation of HI-CKKS, which enables us to pinpoint the most effective parallel parameter settings. Compared to the CPU implementation, our system achieves throughput increases of $175.08\times$, $191.27\times$, and $679.57\times$ for NTT, INTT, and HMult, respectively. Compared to the latest GPU-based works, it still improves throughput by factors ranging from $1.54\times$ to $693.17\times$.
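For readers unfamiliar with the (I)NTT primitive named in the abstract, the following is a minimal Python sketch of a negacyclic NTT over a toy ring and of how it turns polynomial multiplication into a pointwise product. It is not the HI-CKKS implementation: the parameters (`P = 17`, `N = 8`, `PSI = 3`) and all function names are assumptions chosen for illustration, the transform is a plain recursive radix-2 version, and the batch loop only hints at the independent work items that a GPU implementation could process in parallel.

```python
# Toy parameters (assumptions for illustration, far smaller than real CKKS).
P = 17    # prime modulus with P ≡ 1 (mod 2N)
N = 8     # ring degree, a power of two; the ring is Z_P[x] / (x^N + 1)
PSI = 3   # primitive 2N-th root of unity mod P (3 has order 16 mod 17)


def ntt(a, root):
    """Recursive radix-2 cyclic NTT of `a` with respect to `root`."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], root * root % P)
    odd = ntt(a[1::2], root * root % P)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P           # butterfly: upper half
        out[k + n // 2] = (even[k] - t) % P  # butterfly: lower half
        w = w * root % P
    return out


def forward(a):
    """Negacyclic NTT: twist by powers of PSI, then a cyclic NTT."""
    twisted = [x * pow(PSI, i, P) % P for i, x in enumerate(a)]
    return ntt(twisted, PSI * PSI % P)


def inverse(values):
    """Inverse negacyclic NTT: cyclic inverse, scale by 1/N, untwist."""
    inv_omega = pow(PSI * PSI % P, P - 2, P)  # modular inverse via Fermat
    inv_n = pow(N, P - 2, P)
    inv_psi = pow(PSI, P - 2, P)
    a = ntt(values, inv_omega)
    return [x * inv_n % P * pow(inv_psi, i, P) % P for i, x in enumerate(a)]


def negacyclic_mul(a, b):
    """Multiply two polynomials mod (x^N + 1, P) via pointwise products."""
    fa, fb = forward(a), forward(b)
    return inverse([x * y % P for x, y in zip(fa, fb)])


def schoolbook_mul(a, b):
    """Reference O(N^2) negacyclic multiplication used to check the NTT."""
    out = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                out[k] = (out[k] + a[i] * b[j]) % P
            else:  # x^N = -1, so wrapped terms change sign
                out[k - N] = (out[k - N] - a[i] * b[j]) % P
    return out


def batch_mul(pairs):
    """Process a batch of independent multiplications one after another.

    Each pair is independent, so a throughput-oriented GPU design could
    handle them concurrently; this sequential loop is only a stand-in.
    """
    return [negacyclic_mul(a, b) for a, b in pairs]
```

Every item in `batch_mul` is independent of the others, which is the property a throughput-oriented design can exploit; the sketch makes no claim about how HI-CKKS schedules or fuses such work.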

Metadata
Available format(s)
PDF
Category
Implementation
Publication info
Preprint.
Keywords
CKKS, Homomorphic Multiplication, Number Theoretic Transform (NTT), Parallel Processing, GPU
Contact author(s)
2022040501 @ njupt edu cn
djiankuo @ foxmail com
1535575390 @ qq com
History
2024-12-12: approved
2024-12-06: received
Short URL
https://ia.cr/2024/1976
License
Creative Commons Attribution-NonCommercial-ShareAlike
CC BY-NC-SA

BibTeX

@misc{cryptoeprint:2024/1976,
      author = {Fuyuan Chen and Jiankuo Dong and Xiaoyu Hu and Zhenjiang Dong and Wangchen Dai and Jingqiang Lin and Fu Xiao},
      title = {{HI}-{CKKS}: Is High-Throughput Neglected? Reimagining {CKKS} Efficiency with Parallelism},
      howpublished = {Cryptology {ePrint} Archive, Paper 2024/1976},
      year = {2024},
      url = {https://eprint.iacr.org/2024/1976}
}