Paper 2023/465

RPU: The Ring Processing Unit

Deepraj Soni, New York University
Negar Neda, New York University
Naifeng Zhang, Carnegie Mellon University
Benedict Reynwar, USC Information Sciences Institute
Homer Gamil, New York University Abu Dhabi
Benjamin Heyman, New York University
Mohammed Nabeel Thari Moopan, New York University Abu Dhabi
Ahmad Al Badawi, Duality Technologies
Yuriy Polyakov, Duality Technologies
Kellie Canida, USC Information Sciences Institute
Massoud Pedram, University of Southern California
Michail Maniatakos, New York University Abu Dhabi
David Bruce Cousins, Duality Technologies
Franz Franchetti, Carnegie Mellon University
Matthew French, USC Information Sciences Institute
Andrew Schmidt, USC Information Sciences Institute
Brandon Reagen, New York University

Ring-Learning-with-Errors (RLWE) has emerged as the foundation of many important techniques for improving security and privacy, including homomorphic encryption and post-quantum cryptography. While promising, these techniques have seen limited use because of the extreme overheads they incur on general-purpose machines. In this paper, we present a novel vector Instruction Set Architecture (ISA) and microarchitecture for accelerating the ring-based computations of RLWE. The ISA, named B512, is developed to meet the needs of ring processing workloads while balancing high performance with general-purpose programming support. Having an ISA rather than fixed hardware facilitates continued software improvement post-fabrication and support for evolving workloads. We then propose the ring processing unit (RPU), a high-performance, modular implementation of B512. The RPU has native large-word modular arithmetic support, capabilities for very wide parallel processing, and a large-capacity, high-bandwidth scratchpad to meet the needs of ring processing. We address the challenges of programming the RPU using a newly developed SPIRAL backend. A configurable simulator is built to characterize design tradeoffs and quantify performance. The best-performing design was implemented in RTL and used to validate simulator performance. In addition to our characterization, we show that an RPU using 20.5 mm² of GF 12nm can provide a speedup of 1485x over a CPU running a 64k, 128-bit NTT, a core RLWE workload.
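For readers unfamiliar with the NTT workload named in the abstract, the following is a minimal Python sketch of NTT-based multiplication in the ring Z_q[x]/(x^n + 1), the kind of ring computation RLWE schemes rely on. It is an illustration only, not the paper's B512 code or the SPIRAL-generated kernels: the toy parameters (q = 257, n = 8), the function names, and the recursive radix-2 formulation are all assumptions chosen for readability, whereas the paper's benchmark uses a 64k-point transform with 128-bit moduli.

```python
# Toy parameters (assumptions for illustration; far smaller than the paper's 64k, 128-bit case).
Q = 257                                # small prime with Q - 1 divisible by 2 * N
N = 8                                  # ring dimension, a power of two
PSI = pow(3, (Q - 1) // (2 * N), Q)    # primitive 2N-th root of unity (3 generates Z_257^*)
OMEGA = PSI * PSI % Q                  # primitive N-th root of unity


def ntt(a, root):
    """Recursive radix-2 Cooley-Tukey NTT of `a` modulo Q; `root` has order len(a)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], root * root % Q)
    odd = ntt(a[1::2], root * root % Q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % Q             # modular multiply: the butterfly's twiddle product
        out[k] = (even[k] + t) % Q
        out[k + n // 2] = (even[k] - t) % Q
        w = w * root % Q
    return out


def intt(a, root):
    """Inverse NTT: forward transform with the inverse root, scaled by 1/len(a)."""
    n_inv = pow(len(a), -1, Q)
    return [x * n_inv % Q for x in ntt(a, pow(root, -1, Q))]


def negacyclic_mul(a, b):
    """Multiply a and b in Z_Q[x]/(x^N + 1) using a psi-twisted cyclic NTT."""
    ta = [x * pow(PSI, i, Q) % Q for i, x in enumerate(a)]
    tb = [x * pow(PSI, i, Q) % Q for i, x in enumerate(b)]
    fc = [x * y % Q for x, y in zip(ntt(ta, OMEGA), ntt(tb, OMEGA))]
    tc = intt(fc, OMEGA)
    psi_inv = pow(PSI, -1, Q)
    return [x * pow(psi_inv, i, Q) % Q for i, x in enumerate(tc)]


def schoolbook_mul(a, b):
    """Quadratic-time reference: x^N wraps around to -1."""
    c = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            k = i + j
            if k < N:
                c[k] = (c[k] + x * y) % Q
            else:
                c[k - N] = (c[k - N] - x * y) % Q
    return c
```

The inner loop is dominated by modular multiplications and additions applied uniformly across many coefficients, which is the pattern that wide vector hardware with native modular arithmetic is suited to. `pow(x, -1, Q)` requires Python 3.8 or later.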

Publication info
Published elsewhere. 2023 IEEE International Symposium on Performance Analysis of Systems and Software
Keywords
RPU, Hardware accelerator, Ring Processing, NTT, Cryptography, Fully Homomorphic Encryption, FHE hardware
Contact author(s)
dss545 @ nyu edu
negar @ nyu edu
naifengz @ cmu edu
breynwar @ isi edu
og532 @ nyu edu
bch5868 @ nyu edu
mtn2 @ nyu edu
aalbadawi @ dualitytech com
ypolyakov @ dualitytech com
kcanida @ isi edu
pedram @ usc edu
mihalis maniatakos @ nyu edu
dcousins @ dualitytech com
franzf @ ece cmu edu
mfrench @ isi edu
aschmidt @ isi edu
bjr5 @ nyu edu
2023-03-31: approved
2023-03-30: received
Creative Commons Attribution


@misc{cryptoeprint:2023/465,
      author = {Deepraj Soni and Negar Neda and Naifeng Zhang and Benedict Reynwar and Homer Gamil and Benjamin Heyman and Mohammed Nabeel Thari Moopan and Ahmad Al Badawi and Yuriy Polyakov and Kellie Canida and Massoud Pedram and Michail Maniatakos and David Bruce Cousins and Franz Franchetti and Matthew French and Andrew Schmidt and Brandon Reagen},
      title = {RPU: The Ring Processing Unit},
      howpublished = {Cryptology ePrint Archive, Paper 2023/465},
      year = {2023},
      note = {\url{}},
      url = {}
}