Number-Theoretic Transform Architecture for Fully Homomorphic Encryption from Hypercube Topology

Jingwei Hu\textsuperscript{1}, Yuhong Fang\textsuperscript{2} and Wangchen Dai\textsuperscript{2}

\textsuperscript{1} Nanyang Technological University, Singapore, davidhu@ntu.edu.sg
\textsuperscript{2} Zhejiang Lab, China, w.dai@my.cityu.edu.hk

Abstract. This paper introduces a high-performance and scalable hardware architecture designed for the Number-Theoretic Transform (NTT), a fundamental component extensively utilized in lattice-based encryption and fully homomorphic encryption schemes. The underlying rationale behind this research is to harness the advantages of the hypercube topology. This topology serves to significantly diminish the volume of data exchanges required during each iteration of the NTT, reducing it to a complexity of $\Omega(\log N)$. Concurrently, it enables the parallelization of $N$ processing elements. This reduction in data exchange operations is of paramount importance. It not only facilitates the establishment of interconnections among the $N$ processing elements but also lays the foundation for the development of a high-performance NTT design. This is particularly valuable when dealing with large values of $N$.

Keywords: Fully Homomorphic Encryption, Number Theoretic Transform, FPGA Implementation
Contents

1 FHEW-like Fully Homomorphic Encryption Scheme 3
  1.1 LWE and RLWE .................................................. 3
  1.2 Key Switch and Mod Switch ................................... 4
  1.3 Sample Extraction .............................................. 4
  1.4 RGSW cryptosystem ............................................ 4
  1.5 Blind Rotation .................................................. 5
  1.6 Bootstrapping a NAND gate .................................... 6

2 Number-theoretic Transform Architecture 7
  2.1 Higher level description for NTT with merged twiddle factors .... 7
  2.2 $log_2 d$-Dimensional Hypercube Multiprocessors .................. 11
  2.3 A Useful Equivalent Notation: |PID|Local $M$ ...................... 12
  2.4 First attempt: parallel in-place FFTs without inter-processor permutations 13
  2.5 Second attempt: Parallel NTTs with Inter-processor permutations .... 14
  2.6 Butterfly Processor ............................................. 19
  2.7 Microbench Implementations ..................................... 22
  2.8 System Integration on Xilinx MPSOC platform ..................... 24
Introduction

The contributions of this paper include:

- Pioneering Hypercube Topology in NTT Designs: This research introduces the innovative concept of applying hypercube topology to NTT designs. It successfully addresses the challenging issue of managing a substantial volume of data exchange within high-performance NTT designs.

- Prototyping a Hypercube-Based NTT Hardware: The study provides a practical implementation of NTT hardware based on the hypercube topology. Importantly, it allows users the flexibility to configure the degree of parallelization as per their requirements. The paper also offers theoretical estimations of the timing performance, which are subsequently validated through concrete implementation results.

1 FHEW-like Fully Homomorphic Encryption Scheme

TFHE \cite{CGGI20} and FHEW \cite{DM15} are examples of the third generation of fully homomorphic encryption (FHE) schemes. They offer advantages in terms of bootstrapping performance, which is a crucial operation in FHE schemes. The state-of-the-art advancements in TFHE/FHEW indicate that any discrete function \( f \) over a small domain \( \mathbb{Z}_t \) can be bootstrapped. This unique capability is often referred to as functional or programmable bootstrapping. It allows for performing computations on encrypted data without decrypting it. In this work, we stay focused on one particular function called NAND gate which is proposed in the original FHEW paper as:

\[
\text{NAND}(x, y) = \begin{cases} 
0 & \text{if } x = y \\
1 & \text{otherwise}
\end{cases}
\]

where \( x, y \in \{0, 1\} \).

1.1 LWE and RLWE

TFHE ciphertext is essentially an LWE ciphertext. The LWE ciphertext is defined as follows:

\[
(a, b) = (a_0, a_1, \ldots, a_n) + q \cdot m + e \in \mathbb{Z}_q^n \times \mathbb{Z}_q
\]

where \( s \) is a secret key, \( m \in \mathbb{Z}_t \) is the message, \( e \) is an error term extracted from a discrete Gaussian distribution with small variance \( e \sim \mathcal{N}(0, \sigma^2) \). The parameters \( n, q, s \) characterize an LWE ciphertext. We use the notation \( LWE_{\mathbb{Z}_q}(m) \) (or \( LWE_{\mathbb{Z}_q}(m) \)) when the context is clear) to denote an LWE encryption of the message \( m \).

In the TFHE bootstrapping algorithm, another format of ciphertext, called RLWE ciphertext, is used. An RLWE ciphertext can be viewed as an LWE ciphertext defined over polynomial ring \( R_\mathbb{Q} = \mathbb{Z}_\mathbb{Q}[X]/(X^N + 1) \):

\[
(a(X), b(X)) = (a(X), a(X) \cdot z(X) + Q \cdot m(X) + e(X)) \in R_\mathbb{Q} \times R_\mathbb{Q}
\]

where \( z \) is the secret key, \( m \) is the message, \( e = \sum_i e_i X^i \) is the error polynomial where each coefficient is extracted from a discrete Gaussian distribution with small variance \( e_i \sim \mathcal{N}(0, \sigma^2) \). Likewise, we use \( RLWE_{\mathbb{Z}_\mathbb{Q}}(m) \) (or \( RLWE_{\mathbb{Z}_\mathbb{Q}}(m) \), \( RLWE_{\mathbb{Z}_\mathbb{Q}}(m) \)) when the context is clear) to denote an RLWE encryption of the message \( m(X) \).
1.2 Key Switch and Mod Switch

Key switching and modulus switching are two important homomorphic operations in TFHE. We skip the algorithmic detail for these two primitives but outline the functionality abstractly. **KeySwitch** takes as input an LWE encryption of message \( m \) under secret key \( z \) and outputs another LWE ciphertext encrypting the same message \( m \) but under a different secret key \( s \):

\[
\text{LWE}_{N,Q}^z(m) \xrightarrow{\text{KeySwitch}} \text{LWE}_{N,Q}^s(m)
\]

**KeySwitch** utilizes another publicly known key called key switching key \( \text{KSK} \) to perform this operation and **KeySwitch** introduces extra noise to the input LWE ciphertext. In practice, we may use another form of key switching, called LWE-to-RLWE keyswitch which homomorphically converts an LWE encryption to an RLWE encryption without changing the encrypted message:

\[
\text{LWE}_{N,Q}^z(m) \xrightarrow{\text{LWE-to-RLWE KeySwitch}} \text{RLWE}_{N,Q}^z(m)
\]

**ModSwitch** takes input as an LWE encryption of message \( m \) defined over \( \mathbb{Z}^N \times \mathbb{Z}_Q \) and outputs another LWE encryption of the same message defined over \( \mathbb{Z}^N \times \mathbb{Z}_q \) with \( Q > q \):

\[
\text{LWE}_{N,Q}^s(m) \xrightarrow{\text{ModSwitch}} \text{LWE}_{N,q}^s(m)
\]

A unique feature of **ModSwitch** is that it reduces the noise within the LWE ciphertext. TFHE bootstrapping utilizes this feature to homomorphically evaluate a discrete function \( f \) while reducing the noise in the ciphertext.

1.3 Sample Extraction

Sample extraction is another classical technique that homomorphically extracts a term in an encrypted polynomial. More precisely, the input for sample extraction is an RLWE ciphertext \( \text{RLWE}_{N,Q}^z(m(X)) \) which encrypts a polynomial \( m(X) = \sum_{i=0}^{N-1} m_i X^i \) and sample extraction extracts the \( i \)-th term of \( m(X) \) as an LWE encryption \( \text{LWE}_{N,Q}^z(m_i) \) denoted as:

\[
\text{LWE}_{N,Q}^z(m_i) \leftarrow \text{SampleExt}(\text{RLWE}_{N,Q}^z(m(X), i))
\]

Note that the secret key \( z = \sum_{i=0}^{N-1} z_i X^i \) is changed accordingly to its vector form \( z = (z_0, \cdots, z_{N-1}) \). Sometimes, we abuse the notation above and use \( \text{LWE}_{N,Q}^z(m_0) \leftarrow \text{SampleExt}(\text{RLWE}_{N,Q}^z(m(X))) \) to denote the extraction of the constant term. The computational overhead for **SampleExt** is trivial and the noise does not grow.

1.4 RGSW cryptosystem

In TFHE/FHEW schemes, a new encryption scheme called RGSW [GSW13] is used as the basis for homomorphic multiplication. We use the notation \( \text{RGSW}_{N,Q}^z(m(X)) \) to denote an RGSW encryption of a polynomial \( m(X) \in R_Q = \mathbb{Z}_Q[X]/(X^N + 1) \) under the secret key \( z \).

Before detailing RGSW, an extended form of RLWE encryption with respect to some basis \( B_g \) should be introduced, i.e.,

\[
\text{RLWE}'(m) = (\text{RLWE}(m), \cdots, \text{RLWE}(B^d m), \cdots, \text{RLWE}(B^{d_g-1} m))
\]

where \( d_g = \lceil \log_{B_g}(Q) \rceil \).

The noise growth for the scalar multiplication \( d \cdot \text{RLWE}'(m) \) defined in \( \text{RLWE}'(\cdot) \) is well controlled since:

\[
d \cdot \text{RLWE}'(m) = \sum_{i=0}^{d_g-1} d_i \cdot \text{RLWE}(B^i m) = \text{RLWE}(\sum_i d_i B^i m) = \text{RLWE}(d \cdot m)
\]
where \( d = \sum d_iB^i \).

RGSW encryption is further constructed based on \( RLWE'(\cdot) \) as:

\[
RGSW_z(m) = (RLWE'_z(-z \cdot m), RLWE'_z(m))
\]

In other terms, RGSW not only encrypts the message \( m \) but also encrypts \( z \cdot m \) which encloses the secret key \( z \).

The RGSW homomorphic multiplication between an RLWE ciphertext \( RLWE(m_0) \overset{def}{=} (a, b) \) and another RGSW ciphertext \( RGSW(m_1) \overset{def}{=} RLWE \times RGSW \rightarrow RLWE \) proceeds as follows:

\[
RLWE(m_0) \odot RGSW(m_1) = (a, b) \odot (RLWE'_(-z \cdot m_1), RLWE'(m_1))
\]

\[
= a \cdot RLWE'_(-z \cdot m_1) + b \cdot RLWE'(m_1)
\]

\[
= RLWE((b - as) \cdot m_1)
\]

\[
= RLWE\left(\frac{Q}{t} m_0 m_1 + e_0 m_1\right)
\]

since \( e_0 m_1 \) is a small component and can be treated as noise and thus we have \( RLWE(\frac{Q}{t} m_0 m_1 + e_0 m_1) = RLWE(\frac{Q}{t} m_0 m_1) \). Also note that the inputs are asymmetric where one input is an RLWE encryption while the other is an RGSW encryption. This type of homomorphic multiplication is also known as external product. Each external product takes \( 4d_qN \log N \) \( \mathbb{F}_q \) multiplications where \( d_q \approx 6 \) is a small constant used to describe RSGW encryption if number-theoretic transform (NTT) is applied.

### 1.5 Blind Rotation

Blind Rotation essentially homomorphically performs an inner product of a public known vector \( a \overset{def}{=} (a_0, \cdots, a_{n-1}) \) and the private key vector \( s \overset{def}{=} (s_0, \cdots, s_{n-1}) \) over the exponent of \( X \) by extensive use of RGSW external product. In order to control the noise growth during the homomorphic computation, a redundant encryption of the private key \( s \), i.e., \( Z_{i,j,v} = RGSW(X^{vB_i^j} s_i) \) for all \( i, j, v \) is used. Algorithm 1 formally describes the primitive BlindRotate.

```
Input: bootstrapping key \( brK \), \( Z_{i,j,v} = RGSW(X^{vB_i^j} s_i) \) for all \( i, j, v \), RLWE ciphertext \( ct = RLWE(X^b) \) and a public known vector \( a \).

Output: an RLWE encryption of \( X^{b-(a \cdot s)} \).

1 for \( i \leftarrow 0 \) to \( n - 1 \) do
  2 Compute \( a_i \leftarrow q - a_i \)
  3 for \( j \leftarrow 0 \) to \( \log_B q - 1 \) do
    4 Compute \( a_{i,j} \leftarrow [a_i/B_j] \)
  5 Compute \( ACC \leftarrow ACC \odot Z_{i,j,a_{i,j}} \)
6 return \( ACC \)
```

Algorithm 1: the primitive blind rotation \( \text{BlindRotate}(brK, ct, a) \)

#### Proposition 1

Given an RLWE encryption of a monomial \( X^b \), and a public known vector \( a \) an an pre-computed encryption of the secret keys, i.e., \( Z_{i,j,v} = RGSW(X^{vB_i^j} s_i) \) for all \( i, j, v \). Algorithm 1 correctly computes an RLWE encryption of \( X^{b-(a \cdot s)} \).

**Proof.** The algorithm proceeds in a nested loop with loop-\( i \) as the outer loop and loop-\( j \) as the inner loop. Each iteration of loop-\( i \) first flattens \( a_i \) to \( \{a_{i,j}\} \) for all \( j \) and then
computes in the inner loop-\(j RLWE(X^{b_{j-1}}\sum_{k=0}^{i-1} a_{k}s_k)\) from \(RLWE(X^{b_{j-1}}\sum_{k=0}^{i-1} a_{k}s_k)\) since \(-a_{s_i} = \sum_{j} a_{j}B_{i}j\). By induction, ACC returns \(RLWE(X^{b_{i-1}}\sum_{k=0}^{i-1} a_{k}s_k)\) at the end of the algorithm.

\[\square\]

### 1.6 Bootstrapping a NAND gate

A full description of homomorphic NAND gate operation is presented in Alg. 2. acc\(_{BR}.f = RLWE_{a,i}^{N,Q}(\text{rotP} \cdot X^{m})\) where \(m = b_N - (a_N, s) = \frac{2}{3}m + e\), \(ct_{Q} = RLWE_{a,i}^{N,Q}(f(m))\), and \(ct_{q} = RLWE_{a,i}^{N,Q}(f(m))\). HDB is relatively fast since it performs the most time-consuming operation BlindRotate, whose computational complexity is \(O(d_qN^2\log N)\), only once. Here, \(d_q\) is a small constant used in the TFHE system parameter description. For 80-bit security, one HDB function evaluation takes less than 1 second by a typical software implementation on a desktop PC.

| Input: | bootstrapping key brK, LWE ciphertexts |
| | \(ct_{i} = RLWE_{a,i}^{N,Q}(m_i) = (a, b)\) for \(i \in \{0, 1\}\), LWE-to-LWE key-switching key ksK. |
| Output: | an LWE encryption of NAND\((m_0, m_1)\), i.e., |
| | \(ct_{q} = RLWE_{a,i}^{N,Q}(\text{NAND}(m_0, m_1))\). |
| 1 Compute \(ct \leftarrow ct_0 + ct_1\) |
| 2 Set \(\text{rotP} = f(0) \left(\sum_{j=0}^{N} X^j + \sum_{j=N - \frac{N}{2} + 2}^{N - 1} X^j\right) - f(1) \left(\sum_{j=\frac{N}{2} + 1}^{N} X^j\right)\) where |
| | \(f(0) = f(1) = \frac{2}{3}\) |
| 3 Set \(ct_N \leftarrow \text{ModSwitch}(ct_{q}, 2N) = (a_N, b_N)\) |
| 4 Compute \(acc_{f} \leftarrow \text{rotP} \cdot X^{b_N}\) |
| 5 Run \(acc_{BR.f} \leftarrow \text{BlindRotate}(brK, acc_{f}, ct_N)\) |
| 6 Run \(ct_{Q} \leftarrow \text{SampleExt}(acc_{BR.f}, 0)\) |
| 7 Run \(ct_{Q, ksK} \leftarrow \text{KeySwitch}(ct_{Q}, ksK)\) |
| 8 Run \(ct_{q} \leftarrow \text{ModSwitch}(ct_{Q}, q)\) |
| 9 return \(ct_{q} + (0, \frac{3}{4})\) |

**Algorithm 2:** Half domain bootstrapping Bootstrap\((brK, ct, f(\cdot), ksK)\)

**Proposition 2.** Given an LWE encryption of a binary message \(m_0\), and an LWE encryption of another message \(m_1\), Algorithm 2 correctly computes an LWE encryption of NAND\((m_0, m_1)\).

**Proof.** In line 1, the homomorphic addition returns \(ct = RLWE_{a,i}^{N,q}(m_0 + m_1)\). In line 2, the rotation polynomial \(\text{rotP}(x)\) is configured according to the anti-cyclic function \(f(x)\) for \(x \in \{0, 1, 2, 3\}\) s.t. \(f(0) = f(1) = \frac{2}{3}\) and \(f(2) = f(3) = -\frac{2}{3}\). In line 3, the primitive ModSwitch is invoked to scale the ciphertext modulus \(q\) down to \(2N\) making \(ct_N = RLWE_{a,i}^{2N,q}(m_0 + m_1)\). Line 4 essentially creates a noiseless and trivial RLWE encryption of \(\text{rotP}(X) \cdot X^{b_N}\). In line 5, the primitive BlindRotate uses the bootstrapping key to compute \(acc_{BR.f} = RLWE_{a,i}^{N,q}(\text{rotP} \cdot X^{m})\) where \(m = b_N - (a_N, s) = \frac{2N}{3}(m_0 + m_1) + e \mod 2N\). In line 6, the constant term in \(\text{rotP} \cdot X^{m}\) is extracted as a new LWE encryption over a relatively larger ciphertext modulus, i.e., \(RLWE_{a,i}^{N,q}(f(m_0 + m_1))\). Line 7 switches the secret key \(z\) back to the original secret key \(s\) such that \(ct_{Q, ksK} = RLWE_{a,i}^{N,q}(f(m_0 + m_1))\). Line 8 performs again the primitive ModSwitch to switch back to the original ciphertext modulus \(q\) such that \(ct_{q} = RLWE_{a,i}^{N,q}(f(m_0 + m_1))\). In line 9, the final result \(RLWE_{a,i}^{N,q}(f(m_0 + m_1) + q/8)\) is returned. It is easy to verify that \(RLWE_{a,i}^{N,q}(f(m_0 + m_1) + q/8) = (a, \{a, s\} + e)\) which encrypts ‘0’ if \(m_0 = m_1 = 1\), and \(RLWE_{a,i}^{N,q}(f(m_0 + m_1) + q/8) = (a, \{a, s\} + e)\) which encrypts ‘1’ otherwise. \(\square\)
2 Number-theoretic Transform Architecture

This section describes NTT hardware at the bottom level. Firstly, The distinguishing feature is that a series of new twiddle factor LUTs is constructed and used: an independent twiddle factor LUT, denoted as \( \{ w_i[i] \}_{i=0}^{log_2 N-1} \), is prepared for the \( i \)-th round of butterfly computation (\( logN \) rounds in total) as described in Alg. 4. Note that differing from the standard FFT which uses twiddle factor \( \omega_N^i \) for \( i \in [N] \), the NTT used in the ring \( \mathcal{R}_q \) uses the modified twiddle factor \( \omega_N^i \cdot \omega_2^j \) for \( i \in [N], j = 2^0, 2^1, \ldots, 2^{logN-1} \).

2.1 Higher level description for NTT with merged twiddle factors

In this subsection, we discuss the NTT algorithm with merged twiddle factors. No pre-processing or post-processing is required in this variant of NTT algorithm. At an abstract level, the structure of this NTT algorithm is identical to that of the classic FFT algorithm.

The formal description of this NTT variant is shown in Alg. 3 which is also referred to as Cooley-Tukey (CT) butterfly or decimation in time (DIT) in the open literature. It is essentially identical to the classic FFT algorithm except the twiddle factor array \( w_i[i] \). It has \( logN \) iterations (loop-i) at outermost, where each iteration computes one layer of butterfly computations. The \( i(i = 0, \ldots, logN - 1) \)-th layer of butterfly computation always has \( \frac{N}{2} \) butterflies. These butterflies are bundled into 2\( i \) groups (recorded by the variable \( \text{NumberofGroups} \)) and each groups has \( \frac{N}{2} \) pairs of butterflies (recorded by the variable \( \text{PairsInGroup} \)). The key feature is that at a particular iteration (say the \( i \)-th iteration), the butterflies in a particular group (say the \( k \)-th group) share the same twiddle factor \( w_i[k] \). The variable \( \text{Distance} \) is used to locate precisely two inputs of a particular pair of butterfly in loop-\( j \), i.e., \( a[j] \) and \( a[j + \text{Distance}] \). The variables \( \text{JFirst} \) and \( \text{JLast} \) indicate the starting and the ending position of the array \( a[i] \), respectively, used in the \( k \)-th group of the \( i \)-th iteration.

A visualization of Alg. 3 is depicted in Fig. 1a when \( N = 8 \). The inputs are \( a[0], a[1], \ldots, a[7] \) where \( a[i] \) represents the \( i \)-th coefficient \( a_i \) in the polynomial \( a(X) = \sum_{i=0}^{N-1} a_i X^i \). The NTT computation has 3 layers of butterflies: In the first layer (\( i = 0 \) for loop-\( i \) in Alg. 3), only one butterfly group (associated with twiddle factor \( \omega_{16}^i \)) exists; in the second layer, two butterfly groups (associated with twiddle factor \( \omega_{16}^i \) and \( \omega_{16}^{i+1} \)) exist; in the third layer, four butterfly groups (associated with twiddle factors \( \omega_{16}^i, \omega_{16}^{i+1}, \omega_{16}^{i+2}, \) and \( \omega_{16}^{i+3} \), respectively) exist. It is worth noting that the outputs from the NTT network is in bit-reversed order as \( A[0], A[4], A[2], A[6], A[1], A[5], A[3], A[7] \).

Next, we detail how to construct the twiddle factor LUT. Recall the NTT with pre-processing can be written together as a summation of \( N \) terms:

\[
A_i = \sum_{j=0}^{N-1} a_j \omega_{2N}^{2ij} \omega_N^j \mod q, i \in [0, N-1]
\]  

(1)

Next, by splitting the summation above (Equation 1) into even and odd groups according to the index \( i \) of \( A_i \), we obtain

\[
A_i = \sum_{j=0}^{\frac{N}{2}-1} a_{2j} \omega_N^{2ij} \omega_{2N}^{2j} + \sum_{j=0}^{\frac{N}{2}-1} a_{2j+1} \omega_N^{j(2j+1)} \omega_{2N}^{2j+1} \mod q
\]

\[
= \sum_{j=0}^{\frac{N}{2}-1} a_{2j} \omega_N^{ij} \omega_N^j + \omega_N^{ij} \omega_{2N} \sum_{j=0}^{\frac{N}{2}-1} a_{2j+1} \omega_N^{ij} \omega_N^j \mod q
\]  

(2)

Now express \( A_i \)'s in Equation 2 into the first half \( A_i \) and the second half \( A_{i+\frac{N}{2}} \) as
Algorithm 3: Higher level description of NTT, a.k.a. DIT\textsubscript{NN}→RN

\textbf{Input:} a polynomial ring \(R_q\), and NTT points \(N\)

\textbf{Output:} Twiddle factors \(\{w_i[\cdot]\}_{i=0,\ldots,\log_2 N-1}\) used in Algorithm 3

1. FirstPart \(\leftarrow 0\) where \([0\cdots 0]_2 = BinRepr(0)\)
2. SecondPart \(\leftarrow 2^{N-1}\) where \([1\cdots 0]_2 = BinRepr(2^{N-1})\)
3. for \(i \leftarrow 0\) to \(\log_2 N - 1\) do
4. \quad for \(j \leftarrow 0\) to \(N - 1\) do
5. \quad \quad [j_{\log_2 N-1,\ldots,j_0}]_2 \leftarrow BinRepr(j)
6. \quad \quad FirstPart \(\leftarrow \sum_{k=0}^{i} j_{\log_2 N-i+1+k} \cdot 2^{\log_2 N-1-k}\)
7. \quad \quad w_i[j] \leftarrow φ_{FirstPart} \cdot φ_{SecondPart}\)
8. \quad SecondPart \(\leftarrow SecondPart/2\)
9. \textbf{return} \(\{w_i[\cdot]\}_{i=0,\ldots,\log_2 N-1}\)

Algorithm 4: Construction of Twiddle Factor LUTs

\begin{align*}
A_i &= \sum_{j=0}^{\frac{N}{2}-1} a_{2j} \omega_N^{ij} + \omega_N^{ij} \omega_{2N} \sum_{j=0}^{\frac{N}{2}-1} a_{2j+1} \omega_N^{ij} \omega_N^{ij} \mod q \text{ for } i \in [0, \frac{N}{2} - 1] \\
A_{i + \frac{N}{2}} &= \sum_{j=0}^{\frac{N}{2}-1} a_{2j} \omega_N^{ij} \omega_N^{ij} - \omega_N^{ij} \omega_{2N} \sum_{j=0}^{\frac{N}{2}-1} a_{2j+1} \omega_N^{ij} \omega_N^{ij} \mod q
\end{align*}

(3)

Assume \(N = 2^n\), let \(Y_i^{(n-1)}\) and \(Z_i^{(n-1)}\) be solutions to the two half-sized subproblems (NTT of size of \(\frac{N}{2} = 2^{n-1}\) for the even terms \(\{a_{2j}\}_{j \in \left[\frac{N}{2}\right]}\) and the odd terms \(\{a_{2j+1}\}_{j \in \left[\frac{N}{2}\right]}\))
Table 1: Merged Twiddle Factor used in $N$-point NTT. The exponent is expressed in binary form.

<table>
<thead>
<tr>
<th>NTT iteration $i$</th>
<th>twiddle factor associated with $a[j]$ and $a[j + \text{distance}]$ where $j = [j_{n-1} j_{n-2} \cdots j_1 j_0]_2$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$i = 0$</td>
<td>$\omega_N^{00 \cdots 00}, \omega_{2N}^{01 \cdots 00}$</td>
</tr>
<tr>
<td>$i = 0$</td>
<td>$\omega_N^{00 \cdots 00}, \omega_{2N}^{01 \cdots 00}$</td>
</tr>
<tr>
<td>$i = n - 2$</td>
<td>$\omega_N^{j_2 j_1 0 \cdots 00}, \omega_{2N}^{00 \cdots 10}$</td>
</tr>
<tr>
<td>$i = n - 1$</td>
<td>$\omega_N^{j_1 j_2 \cdots j_{n-2} j_{n-1}}, \omega_{2N}^{00 \cdots 01}$</td>
</tr>
</tbody>
</table>

defined by

$$Y_i^{(n-1)} = \sum_{j=0}^{\frac{N}{2} - 1} a_{2j} \omega_N^{i j} \omega_N^{j} \mod q \text{ for } i \in [0, \frac{N}{2} - 1]$$

$$Z_i^{(n-1)} = \sum_{j=0}^{\frac{N}{2} - 1} a_{2j+1} \omega_N^{i j} \omega_N^{j} \mod q$$

To unify the notation systems, we define $Y_i^{(n)}$ and $Z_i^{(n)}$ as the first half part and the second half part, respectively, of $A_i$, i.e., $Y_i^{(n)} \equiv A_i$ for $i \in [0, \frac{N}{2} - 1]$ and $Y_i^{(n)} \equiv \omega_{2N} A_i$ for $i \in [\frac{N}{2}, N - 1]$. Therefore, the Equation 3 is rewritten in a more compact form:

$$Y_i^{(n)} = Y_i^{(n-1)} + \omega_N^i \omega_{2N} Z_i^{(n-1)} = A_i \mod q \text{ for } i \in [0, \frac{N}{2} - 1]$$

$$Z_i^{(n)} = Y_i^{(n-1)} - \omega_N^i \omega_{2N} Z_i^{(n-1)} = A_i \mod q$$

The key observation for the equation above is that $Y_i^{(n)}$ and $Z_i^{(n)}$ has a recursive structure: for example, $Y_i^{(n)}$ and $Z_i^{(n)}$ are computed from a butterfly computation of $Y_i^{(n-1)}$ and $Z_i^{(n-1)}$, $Y_i^{(n-1)}$ and $Z_i^{(n-1)}$ are computed from a butterfly computation of $Y_i^{(n-2)}$ and $Z_i^{(n-2)}$, and so on so forth. Note that in the $k$-th iteration of such recursion (i.e., $Y_i^{(k)}$ and $Z_i^{(k)}$), and let $K = 2^k$), the twiddle factor always has the form $\omega^i_k \omega_{2K}$. As we have known from the standard FFT, the index $i$ in $\omega^i_k$ appears in bit-reversed order, therefore, we generalize the modified twiddle factor in our case as shown in Table 1.

As shown in Table 1, the updated twiddle factor is composed of two multiplicative factors which is called the first part and the second part in this paper. The first part of the merged twiddle factor is identical to the standard NTT. We keep the same notations here and do not repeat the proof. It has the form $\omega_N^{00 \cdots 00}, \omega_{2N}^{01 \cdots 00}, \cdots, \omega_N^{j_1 j_2 \cdots j_{n-2} j_{n-1}}$, for $i = 0, 1, \cdots, n - 1$, respectively. The second part of the merged twiddle factor is $\omega_{2N}^i$. However, the value of $N'$ depends on the recursive structure of butterfly, for the $i$-th layer, noted as $N' = N / 2^{n-1-i}$ where $N = 2^n$. In other words, the second part equals to $\omega_{2N}^i$ for $i = 0, 1, \cdots, n - 1$. Further to unify the second part is rewritten in binary form as $\omega_{2N}^{10 \cdots 00}, \omega_{2N}^{01 \cdots 00}, \cdots, \omega_{2N}^{00 \cdots 01}$ for $i = 0, 1, \cdots, n - 1$. 


Alg. 4 formally describes how to construct the twiddle factors for each round of butterfly computation based on our first-part-second-part concept mentioned above. The key variable **Firstpart** is updated in line 6 within the inner j-loop to maintain the desired binary form **Firstpart** = \([j_{\log_2 N-1-i-\cdots-j_{\log_2 N-1}},0,\cdots,0]_2\). The other key variable **Secondpart** is updated in line 8 at the end of the outer i-loop. The first impression on Alg. 4 might be that the size of twiddle factor LUTs is about \(O(N\log N)\): It has \(\log N\) rounds and each round consumes \(N/2\) twiddle factor for the \(N/2\) pairs of input points. However, we deploy a simplified twiddle factor LUT with only \(N-1\) elements for our actual hardware design. The key observation for reducing the size of twiddle factor LUT \(\{w_i[\cdot]\}\) is that many entries in \(w_i[\cdot]\) are duplicates and thus redundant. In particular, \(w_0[\cdot]\) has only one distinct element \(w_0[0]\), \(w_1[\cdot]\) has two distinct elements \(w_1[0]\) and \(w_1[N/2]\), \(w_2[\cdot]\) has four distinct elements \(w_2[0]\), \(w_2[N/4]\), \(w_2[2N/4]\), and \(w_2[3N/4]\) and so on so forth. Therefore, the total valid entries used in \(\{w_i[\cdot]\}\) after eliminating duplicates equal to

\[
\sum_{i=0}^{\log_2 N-1} 2^i = N - 1 = O(N)
\]
Figure 2: Hypercubes of Dimension $\log_2 d = 2$ and $\log_2 d = 3$

**Algorithm 5: Construction of the $\log_2 d$-dimensional hypercube**

<table>
<thead>
<tr>
<th>Input: $d$ node processors</th>
</tr>
</thead>
<tbody>
<tr>
<td>Output: $d$ node processors with connections</td>
</tr>
<tr>
<td>1 Denote the processor IDs as ${P_0, P_1, \cdots, P_{d-1}}$ where $d$ is a power-of-2</td>
</tr>
<tr>
<td>2 for $i \leftarrow 0$ to $d - 1$ do</td>
</tr>
<tr>
<td>3 $[i_{\log_2 d-1}, \cdots, i_0]_2 \leftarrow \text{BinRepr}(i)$</td>
</tr>
<tr>
<td>4 for $j \leftarrow 0$ to $\log_2 d - 1$ do</td>
</tr>
<tr>
<td>5 flip the bit $i_j$ in $[i_{\log_2 d-1}, \cdots, i_j, \cdots, i_0]<em>2$ to make $[\tilde{i}</em>{\log_2 d-1}, \cdots, \tilde{i}_j, \cdots, \tilde{i}_0]_2$</td>
</tr>
<tr>
<td>6 if $[\tilde{i}_{\log_2 d-1}, \cdots, \tilde{i}_j, \cdots, \tilde{i}_0]<em>2 &gt; [i</em>{\log_2 d-1}, \cdots, i_j, \cdots, i_0]_2$ then</td>
</tr>
<tr>
<td>7 connect $P_{[i_{\log_2 d-1}, \cdots, i_j, \cdots, i_0]<em>2}$ and $P</em>{[\tilde{i}_{\log_2 d-1}, \cdots, \tilde{i}_j, \cdots, \tilde{i}_0]_2}$</td>
</tr>
<tr>
<td>8 return $(P_0, \cdots, P_{d-1})$</td>
</tr>
</tbody>
</table>

**Algorithm 6: Subcube-doubling communication in $\log_2 d$-dimensional hypercube**

<table>
<thead>
<tr>
<th>Input: $\log_2 d$-dimensional hypercubes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Output: communication pattern</td>
</tr>
<tr>
<td>1 Denote the processor IDs as ${P_0, P_1, \cdots, P_{d-1}}$</td>
</tr>
<tr>
<td>2 for $k \leftarrow 0$ to $\log_2 d - 1$ do</td>
</tr>
<tr>
<td>3 /* $d/2$ pairs of processors exchange data in step-$k$/</td>
</tr>
<tr>
<td>4 exchange data between processors $P_{[i_{\log_2 d-1}, \cdots, i_{\log_2 d-1-k}, \cdots, 0]<em>2}$ and $P</em>{[i_{\log_2 d-1}, \cdots, i_{\log_2 d-1-k}, \cdots, 0]<em>2}$ which differ at $i</em>{\log_2 d-1-k}$</td>
</tr>
<tr>
<td>5 return $c(x)$</td>
</tr>
</tbody>
</table>

2.2 $\log_2 d$-Dimensional Hypercube Multiprocessors

In this subsection, we introduce the hypercube topology which fits the parallelized version of NTT algorithm. The hypercube is also the basis for hardware architecture proposed in this work.

Before detailing the parallel NTT algorithm and hardware, the computing model used in this paper must be clarified. There are $d$ identical node processors organized in a hypercube of dimension $\log_2 d$. Each node processor includes one butterfly unit and some
Table 2: Subcube-doubling communication in 3-dimensional hypercube

<table>
<thead>
<tr>
<th>steps</th>
<th>connections</th>
</tr>
</thead>
<tbody>
<tr>
<td>Step-(0)</td>
<td><img src="image1" alt="Diagram" /></td>
</tr>
<tr>
<td>Step-(1)</td>
<td><img src="image2" alt="Diagram" /></td>
</tr>
<tr>
<td>Step-(2)</td>
<td><img src="image3" alt="Diagram" /></td>
</tr>
</tbody>
</table>

storage ($N/d$ NTT points). Roughly speaking, this $\log_2 d$-dimensional hypercube structure should increase the speed of sequential NTT algorithm by $d$ times. Fig. 2a illustrates the hypercube architecture of dimension 2 where 4 node processors (i.e., $P_0, P_1, P_2, P_3$ labeled as it binary form '00', '01', '10', and '11') are implemented. The node processors are sparsely connected with each other where any one of them are connected to the other 2 processors. For example, $P_0$ is only connected to $P_1$ and $P_2$, $P_1$ is only connected to $P_0$ and $P_3$. Fig. 2b illustrates the hypercube architecture of dimension 3 where 8 node processors (i.e., $P_0, P_1, \cdots, P_7$ labeled as it binary form '000', '001', ..., and '111') are implemented. The node processors are sparsely connected with each other where any one of them are connected to the other 3 processors. For example, $P_0$ is only connected to $P_1$, $P_2$ and $P_4$, $P_1$ is only connected to $P_0$, $P_3$ and $P_5$.

Alg. 5 formally describes how to construct the $\log_2 d$-dimensional hypercube by sparsely connecting $d$ node processors. The key idea here is that for each processor $P_i$, rewrite the index $i$ in binary form as $[i_{\log_2 d-1}, \cdots, i_0]_2$, and connects those processors $P_j$ whose index $j = [j_{\log_2 d-1}, \cdots, j_0]_2$ differs only 1 bit compared with $i$. In particular, each node processor connects only to $\log_2 d$ other node processors in this $\log_2 d$-dimensional hypercube topology. The if condition in the for-loop in Alg. 5 helps rule out the possibility of connecting the same pair of nodes repeatedly. When the computation continues in the hypercube, the intermediate data generated in each round of computations typically requires exchange between node processors. This type of data exchange is referred to as ‘subcube-doubling’ communication in the literature. There are in total $\log_2 d$ rounds of exchange during the communication as described in Alg. 6: In step-$k$, each node processor $P_i$ with index $i = [i_{\log_2 d-1}, \cdots, i_0]_2$ exchanges data with $P_j$ whose index $j$ differs at the $\log_2 d - 1 - k$-th bit.

An illustration instance with $d = 8$ for subcube-doubling algorithm (Alg. 6) is given in Table 2. $\log_2 d = 3$ communication steps are required in this example: In step-(0), processor $P_{[i_2, i_1, i_0]}$ connects processor $P_{[\overline{i_2}, i_1, \overline{i_0}]}$ which differs at $i_2$, and there are $d/2 = 4$ such pairs of connections, i.e., $P_0 - P_4, P_1 - P_5, P_2 - P_6, P_3 - P_7$; In step-(1), processor $P_{[i_2, i_1, i_0]}$ connects processor $P_{[\overline{i_2}, i_1, \overline{i_0}]}$ which differs at $i_1$; Finally, in step-(2), processor $P_{[i_2, i_1, i_0]}$ connects processor $P_{[\overline{i_2}, i_1, \overline{i_0}]}$ which differs at $i_0$.

### 2.3 A Useful Equivalent Notation: $|$PID$|$Local $M$

Assume that $N$ points are stored in the global array $a[\cdot] = \{a_{N-1}, \cdots, 0\}$ or simplified as $a[\cdot] = \{a_i\}_{i=0}^{N-1}$, and the elements in the array are assigned evenly to $d$ node
processors for storage and processing. Then the array address based notation uses a logN-bit integer \( i = i_0 \cdots i_{\log N - 1} \):

\[
i_{\log N - 1} \cdots i_{k+1} | i_k \cdots i_{-k - \log d + 1} | i_{-k - \log d}
\]

to indicate that consecutive \( \log d \) bits \( i_k \cdots i_{-k - \log d + 1} \) are chosen to specify the data-to-processor allocation.

In general, since any \( \log d \) bits can be used to form the processor ID number, it is easier to concatenate the bits representing the processor ID into one group denoted by ‘PID’, and refers to the remaining \( \log N - \log d \) bits, which are concatenated to form the local array address, as ‘Local \( M \)’. This paper uses the following equivalent notation, where the leading \( d \) bits are always used to identify the processor ID number.

\[
|\text{PID}|\text{Local} \; M = \left| i_k \cdots i_{-k - \log d + 1} \right| i_{N - 1} \cdots i_{k+2} | i_{k+1} | i_{-k - \log d} \cdots i_{1} | i_{0}
\]

Table 3 shows the details about the data allocation for hypercube processor array after a naturally ordered input series of \( N = 32 \) elements are divided among \( d = 4 \) processors using one particular cyclic block mapping \( i_4 i_3 i_2 i_1 i_0 \). For instance, to locate \( a_{10} = a_{26} \), one writes down \( m = 26 = 11010_2 = i_4 i_3 i_2 i_1 i_0 \), from which one knows that \( a_{26} \) is stored in \( a[7] \), \( r = i_4 i_3 i_2 i_1 i_0 = 11010_2 = 26 \), meaning that \( a[26] = a_{26} \) (the element \( a_{26} \) is located in \( a[26] \)) is allocated by processor \( P_{i_4 i_3 i_2 i_1} = P_0 \).

Table 3: Local data in processor \( P_{i_4 i_3 i_2 i_1} \) expressed in terms of global array element \( a[m] \), \( m = i_4 i_3 i_2 i_1 i_0 \) for the notation \( i_4 i_3 i_2 i_1 i_0 \)

<table>
<thead>
<tr>
<th></th>
<th>P_{i_4 i_3}</th>
<th>P_{i_3 i_2}</th>
<th>P_{i_2 i_1}</th>
<th>P_{i_1}</th>
</tr>
</thead>
<tbody>
<tr>
<td>( i_4 ) ( i_3 ) ( i_2 ) ( i_1 )</td>
<td>( a_0 )</td>
<td>( a_1 ) ( a_2 )</td>
<td>( a_3 )</td>
<td>( a_4 )</td>
</tr>
<tr>
<td>( i_4 ) ( i_3 ) ( i_2 ) ( i_1 )</td>
<td>( a_5 )</td>
<td>( a_6 ) ( a_7 )</td>
<td>( a_8 )</td>
<td>( a_9 )</td>
</tr>
<tr>
<td>( i_4 ) ( i_3 ) ( i_2 ) ( i_1 )</td>
<td>( a_{10} ) ( a_{11} )</td>
<td>( a_{12} )</td>
<td>( a_{13} )</td>
<td>( a_{14} )</td>
</tr>
<tr>
<td>( i_4 ) ( i_3 ) ( i_2 ) ( i_1 )</td>
<td>( a_{15} ) ( a_{16} )</td>
<td>( a_{17} ) ( a_{18} )</td>
<td>( a_{19} )</td>
<td>( a_{20} )</td>
</tr>
<tr>
<td>( i_4 ) ( i_3 ) ( i_2 ) ( i_1 )</td>
<td>( a_{21} ) ( a_{22} )</td>
<td>( a_{23} )</td>
<td>( a_{24} )</td>
<td>( a_{25} )</td>
</tr>
<tr>
<td>( i_4 ) ( i_3 ) ( i_2 ) ( i_1 )</td>
<td>( a_{26} )</td>
<td>( a_{27} )</td>
<td>( a_{28} )</td>
<td>( a_{29} )</td>
</tr>
<tr>
<td>( i_4 ) ( i_3 ) ( i_2 ) ( i_1 )</td>
<td>( a_{30} )</td>
<td>( a_{31} )</td>
<td>( a_{32} )</td>
<td>( a_{33} )</td>
</tr>
</tbody>
</table>

On the other hand, when the input elements are stored in \( a \) in bit-reversed order, i.e., \( a[r] = a_m \) where \( m = i_{n-1} i_{n-2} \cdots i_0 \), and \( r = i_0 \cdots i_{n-2} i_{n-1} \), then the equivalent notation is as follows:

\[
|\text{PID}|\text{Local} \; M = \left| i_{k+2} \cdots i_k \right| i_{N - 1} \cdots i_{k+1} | i_{-k - \log d} \cdots i_{1} | i_{0}
\]

Table 4 shows the details about the data allocation for hypercube processor array after an inverse ordered input series of \( N = 32 \) elements are divided among \( d = 4 \) processors using one particular cyclic block mapping \( i_0 i_1 | i_2 i_3 i_4 \). For instance, to locate \( a_{11} = a_{26} \), one writes down \( m = 26 = 11010_2 = i_4 i_3 i_2 i_1 i_0 \), from which one knows that \( a_{26} \) is stored in \( a[11] \), \( r = i_0 i_1 | i_2 i_3 i_4 = 01 | 0112 = 11 \), meaning that \( a[11] = a_{26} \) (the element \( a_{26} \) is located in \( a[11] \)) is allocated by processor \( P_{i_0 i_1} = P_0 \).

### 2.4 First attempt: parallel in-place FFTs without inter-processor permutations

Consider the \( DIT_{NN \rightarrow RN} \) algorithm (Alg. 3) and use the cyclic block mapping introduced in the last subsection. For \( N = 32, d = 4 \), the computation is depicted below:
First, it indicates the input source of external data: the incoming data from another and Fig. 3d

Table 4: Local data in processor $P_{i_k}$ (bit reversed) expressed in terms of global array
element $a[r]$ = $r_{m, r} = i q_j i_j i_j i_j i_j, m = i q_j i_j i_j i_j$ for the notation $i q_j i_j i_j i_j$

<table>
<thead>
<tr>
<th>PID</th>
<th>Local $M$</th>
<th>$P_{i_0} = P_{i_0}$</th>
<th>$P_{i_1} = P_{i_0}$</th>
<th>$P_{i_2} = P_{i_1}$</th>
<th>$P_{i_0}$ = $P_{i_0}$</th>
<th>$P_{i_1}$ = $P_{i_0}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0000</td>
<td>0000</td>
<td>a[0]</td>
<td>0000</td>
<td>a[8]</td>
<td>1000</td>
<td>a[16]</td>
</tr>
</tbody>
</table>

The initial map indicates that the processor $P_k$ initially holds the elements $a[8k], \ldots, a[8k+7]$ for $k = 0, 1, 2, 3$. The shorthand notation previously used for sequential NTT is augmented by two additional symbols. The double-headed arrow $\leftrightarrow$ indicates that $\frac{N}{d}$ data elements must be exchanged between processors in advance of butterfly computation. In our example, the NTT takes 5 rounds where the first two rounds require data exchange among processors and the last three rounds do not require data exchange. The symbol $i_k$ identifies two things:

- First, it indicates the input source of external data: the incoming data from another processor are the elements whose addresses differ from a processor’s own data in bit $i_k$.
- Second, it indicates that all pairs of processors whose binary ID number differ in bit $i_k$ send each other a copy of their own data.

The required data communications before the first stage of butterfly computation (step-0) are explicitly depicted in Fig. 3a and Fig. 3b: $P_0$ swaps data with $P_2$ such that $a[i]$ pairs with $a[i+16]$ to perform the required butterfly computation in the same processor for $i = 0, \ldots, 7$, and $P_1$ swaps data with $P_3$ such that $a[i]$ pairs with $a[i+16]$ to perform the required butterfly computation in the same processor for $i = 8, \ldots, 15$; the required data communications before the second stage of butterfly computation (step-1) are depicted in Fig. 3c and Fig. 3d: $P_0$ swaps data with $P_1$ such that $a[i]$ pairs with $a[i+8]$ to perform the required butterfly computation in the same processor for $i = 0, \ldots, 7$, and $P_2$ swaps data with $P_3$ such that $a[i]$ pairs with $a[i+8]$ to perform the required butterfly computation in the same processor for $i = 8, \ldots, 23$. The last three steps (step-2,3,4) do not require data swaps since all elements needed for butterfly computation are already within the processor: for example, in step-2, $a[0]$ pairs $a[4], a[1]$ pairs $a[5], a[2]$ pairs $a[6], a[3]$ pairs $a[7]$, which are all located within processor $P_0$.

Remarks The parallel in-place NTT without inter-processor permutations approach employs data exchange between a pair of processors. That is, one processor’s initial complement of data may swap with that of another processor. With use of this type of data exchange, $N/d$ butterfly computations are performed in parallel at the cost of a number of $N/d$ data swaps per processor.

2.5 Second attempt: Parallel NTTs with Inter-processor permutations

In this subsection, we discuss the class of parallel NTTs which employ inter-processor data permutations. Similar to the one presented in the previous subsection which evenly
Butterfly Stage

(a) In round-0, Data sent and received by processors $P_0$ and $P_2$

(b) In round-0, Data sent and received by processors $P_1$ and $P_3$

(c) In round-1, Data sent and received by processors $P_0$ and $P_1$

(d) In round-1, Data sent and received by processors $P_2$ and $P_3$

Figure 3: An illustrative example for parallelizing in-place $\text{NTT}(N = 32, d = 4)$ without inter-processor permutations

distributes all butterfly computations among the processors, the new method also reduces the message length from $\frac{N}{d}$ elements to $\frac{1}{2} \frac{N}{d}$ in each of the $\log_2 d + 1$ concurrent message exchanges.

The complete algorithm We use the shorthand notation we have developed with symbols $\triangle$ and $\blacktriangledown$, the complete parallel algorithm corresponding to $DIT_{NR}$ NTT is represented below for the $N = 32$ example.

| $i4i3|i2i1i0$ | $i2i4i0$ | $i2i4i0$ | $i7i4i2i1i0$ | $i7i4i2i1i0$ | $i7i4i2i1i0$ |
|---------------|----------|----------|----------------|----------------|----------------|
| Initial Map    | $\rightarrow$ | $\rightarrow$ | $\rightarrow$ | $\rightarrow$ | $\rightarrow$ |

To provide complete information for this example, in the initial map (before performing the first stage butterfly computation), the input data are distributed as the element $a_{i2i4i2i1i0}$ can be found in $A_{i2i4i2i1i0}$ in processor $P_{i4i3}$. For example, $a[19] = a_{19}$ is shown to be initially in $A[3]$ in $P_2$ and $A[14] = a_{14}$ in $A[6]$ in $P_1$ in Fig. 4a since $19 = 100112$ and $14 = 011102$.

To prepare data for each processor in the first round of butterfly computation where $P_0$ connects $P_2$ and $P_1$ connects $P_3$ due to the hypercube structure, $P_0$ swaps the second half of his local array with the first half of $P_2$’s local array, and $P_1$ swaps the second half
(a) In round-0, DIT_NR butterfly computation with data migration between processors $P_{00}$ and $P_{02}$, and $P_{10}$ and $P_{12}$, respectively.

(b) In round-1, DIT_NR butterfly computation with data migration between processors $P_{00}$ and $P_{11}$, and $P_{10}$ and $P_{13}$, respectively.

(c) In round-2/3/4, DIT_NR butterfly computation with data migration between processors $P_{00}$ and $P_{20}$, and $P_{11}$ and $P_{31}$, respectively.

Figure 4: An illustrative example for parallelizing in-place NTT($N = 32, d = 4$) with inter-processor permutations of his local array with the first half of $P_i$'s local array as depicted in Fig. 4a. The symbols $\triangle$ are used to locate the exact position of the element $a_i$, after such data swap: the bit $i_k$, which has just been permuted from PID to Local $M$, and the bit $i_\triangle$, which has just
been permuted from Local $M$ to the PID. In our case, the notation $|i_2i_3|_j|_k_i_1 |_0$ is used which means bit $i_4$ in the PID and bit $i_2$ in the local $M$ switch their positions in the shorthand notation (denoted by the symbols $\triangle$) making the memory mapping changed to $i_2i_3[i_4i_1i_0]$, which means that the data in $a[i_4i_3i_2i_1i_0]$ can now be found in $A[i_4i_1i_0]$ in $P_{i_2i_3}$. For example, $a[19] = a_{19}$ is relocated to $A[7]$ in $P_0$ ($A[7]$ means the 7-th element for $P_0$’s local array) after the inter-processor permutation shown in Fig. 4a since 10|0112 is changed to 00|1112; $a[14] = a_{14}$ is relocated to $A[2]$ in $P_3$ after the inter-processor permutation shown in Fig. 4a since 01|1102 is changed to 11|0012. After the memory swap, all data are located correctly in the corresponding processors to perform the first round of butterfly computation. The symbol $\blacktriangleleft$ is used to indicate the pairs of elements for the butterfly computation in each processor, i.e., $|i_2i_3|_j|_k_i_1 |_0$ means $A[i_2i_30i_1i_0]$ should pair with $A[i_2i_31i_1i_0]$ to complete elementary butterfly unit computation for $P_{i_2i_3}$. Also, the index $i_k$ that the symbol $\blacktriangleleft$ points to is changed to $\tau_k$ for showing that this particular round of butterfly computation is completed. A quick observation is that there are $i-1$ indices changed to $\tau$ in the notation for the $i$-th round of computation: for example, $|i_2i_3|_j|_k_i_1 |_0$ represents the first round where no $\tau$ indices exist; $|i_2\tau_k|_j|_k_i_1 |_0$ represents the second round where one $\tau$ index ($i_k$ changed to $\tau_4$) exists and etc.

For the second round of butterfly computation where $P_0$ connects $P_1$ and $P_1$ connects $P_3$, $P_0$ swaps the second half of his array with the first half of $P_1$’s local array, and $P_2$ swaps the second half of his local array with the first half of $P_3$’s local array as dipicted Fig. 4b. A similar notation $|i_2\tau_4|_j|_k_i_1 |_0$ is used to denote the memory swap: bit $\tau_4$ in the PID and bit $i_3$ in the local $M$ switch their positions in the shorthand notation making the memory mapping changed from previous $i_2i_3[i_4i_1i_0]$ to $i_2\tau_4[i_4i_1i_0]$. For example, $a_{19}$ which is previously stored in $A[7]$ from $P_0$ is now changed to $A[3]$ from $P_3$ since 19 = 10|0112 is rearranged to 01|1112. Moreover, $A[i_2\tau_40i_4i_1i_0]$ pairs with $A[i_2i_41i_1i_0]$ to complete elementary butterfly unit computation for $P_{i_2i_4}$ for all $i_2, i_4, i_1, i_0$.

For the third round of butterfly computation where $P_0$ connects $P_2$ and $P_1$ connects $P_3$, $P_0$ swaps the second half of his local array with the first half of $P_2$’s local array, and $P_2$ swaps the second half of $P_1$’s local array with the first half of $P_3$’s local array as dipicted Fig. 4c. This swapping pattern is captured in the notation $|\tau_3\tau_4|_j|_k_i_1 |_0$ indicating the previous $i_2i_4[i_3i_1i_0]$ is changed to $i_3i_4[i_2i_1i_0]$ due to the $\triangle$ annotations. For example, $a_{19}$ which is previously stored in $A[3]$ from $P_1$ is still preserved in $A[3]$, $P_3$ since 19 = 10|0112 is rearranged to 01|1112. The $\blacktriangleleft$ notation indicates that $A[i_3i_40i_1i_0]$ pairs with $A[i_3i_41i_1i_0]$ to complete elementary butterfly unit computation for $P_{i_3i_4}$ for all $i_3, i_4, i_1, i_0$.

There are no inter-processor data swapping for the last two rounds, i.e., the fourth and the fifth round of butterfly computation. However, the $\blacktriangleleft$ notation helps distinguish which two data elements should pair to complete the elementary butterfly unit computation inside the processor: in the fourth round, $A[i_3i_4i_20i_0]$ pairs with $A[i_3i_41i_2i_1i_0]$ for $P_{i_3i_4}$, and in the fifth round, $A[i_3i_4i_2i_10]$ pairs with $A[i_3i_4i_2i_1i_1]$ for $P_{i_3i_4}$. The computations in the fourth and fifth round are merged to Fig. 4c.

After all five rounds of butterfly computation are completed, the NTT results are stored in the array $a[\cdot]$ but the position is rearranged: the output data element $a_{i_4i_3i_2i_1i_0}$, which overwrites the data in $a[i_4i_3i_2i_1i_0]$, is finally contained in $A[i_4i_3i_2i_1i_0]$ in $P_{i_4i_3i_2i_1i_0}$. Such arrangement for the data mapping for the output elements is observed as following:

- The in-place butterfly computation in the DIT$_{NR}$ algorithm ensures $a[i_4i_3i_2i_1i_0] = a_{5(2)} = a_{i_4i_3i_2i_1i_0} = A_{i_0i_1i_2i_3i_4}$ where $A_{i_0i_1i_2i_3i_4} = \sum_j a_j (\omega_2^{i_0i_1i_2i_3i_4} \cdot \omega_{2N}^{i_1})$.

- The final mapping $|\tau_3\tau_4|_j|_k_i_1 |_0$ indicates that the final content in $a[i_4i_3i_2i_1i_0]$ is now
located in \( a[i_2i_1i_0] \) in processor \( P_{i_3i_4} \) (rather than the initially assigned processor \( P_{i_4i_3} \)).


**Remarks on the correctness of the notation** Because \( i_k \) was in PID session before the switch, \( i_k = 1 \) in one processor, and \( i_k = 0 \) in the other processor. On the other hand, because \( i_l \) was in Local \( M \) session before the switch, \( i_l = 0 \) for half of the data, and \( i_l = 1 \) for another half of the data. Consequently, the value of \( i_k \), the PID bit, is equal to \( i_l \), the local \( M \) bit, for half of the data elements in each processor, and the notation which represents the switch of these two bits identifies both the PID of the other processor as well as the data to be sent out or received. To depict exactly what happens, the data exchange between two processors and the butterfly computation represented by \( i_2i_3i_4i_1i_0 \) is shown in its entirety in Fig. 4a and 4b.

---

**Input:** a polynomial ring \( R_q \), and NTT points \( N \), input \( a = (a[0], \ldots, a[N-1]) \)

**Output:** \( \text{NTT}(a) = A = (A[0], \ldots, A[N-1]) \)

1. Initialize the hypercube connections between \( d \) processors as described in Alg. 5.
2. Initialize the merged twiddle factor look-up table \( \{w_i\} \) as described in Alg. 4.
3. /*arrange the data array \( a[i] \) in natural order*/
4. Initialize the data \( a[i_{\log_2 N} - \log_2 d - 1 \cdots i_{i_0}] \) in \( P_{i_{\log_2 N} - 1 \cdots i_{\log_2 N} - \log_2 d} \) with
   \( a[i_{\log_2 N} - 1 \cdots i_{i_0}] \) for all \( i_{\log_2 N} - 1, \ldots, i_0 \)
5. /*perform the first \( \log_2 d \) + 1 round of computations where inter-processor data swapping is required*/
6. for \( j \leftarrow 0 \) to \( \log_2 d \) do
   7.     if \( j \neq \log_2 d \) then
       8.         exchange the first half of data in \( P_{i_{\log_2 N} - 1 \cdots i_{\log_2 N - 1} - j \cdots i_{\log_2 N} - \log_2 d} \) w.r.t. \( i_{\log_2 N - 1} = 0 \) with the second half of data in
       9.         \( P_{i_{\log_2 N} - 1 \cdots i_{\log_2 N - 1} - j \cdots i_{\log_2 N} - \log_2 d} \) w.r.t. \( i_{\log_2 N - 1} = 1 \)
   10. else
       11.     exchange the first half of data in \( P_{i_{\log_2 N} - 1 \cdots i_{\log_2 N - 1} - j \cdots i_{\log_2 N} - \log_2 d} \) w.r.t. \( i_{\log_2 N - 1} = 0 \) with the second half of data in
       12.     \( P_{i_{\log_2 N} - 1 \cdots i_{\log_2 N - j} \cdots i_{\log_2 N} - \log_2 d} \) w.r.t. \( i_{\log_2 N - 1} = 1 \)
   13. perform within each processor \( P_{i_{\log_2 N} - 1 \cdots i_{\log_2 N - j} \cdots i_{\log_2 N} - \log_2 d} \) the \( \frac{N}{2^d} \) butterfly computations (round-\( j \) butterfly)
   14. /*perform the first \( \log_2 N - \log_2 d \) + 1 round of computations where inter-processor data swapping is not required*/
   15. for \( j \leftarrow \log_2 d + 1 \) to \( \log_2 N - 1 \) do
       16.     perform within each processor \( P_{i_{\log_2 N} - 1 \cdots i_{\log_2 N - j} \cdots i_{\log_2 N} - \log_2 d} \) the \( \frac{N}{2^d} \) butterfly computations (round-\( j \) butterfly)
17. return the data in all \( d \) processors as \( A \)

**Algorithm 7:** Parallel Hypercube NTT

**Twiddle Factor LUT Distribution** Let us discuss in details on the distribution of the twiddle factor LUT within each butterfly processor here. In the first \( \log_2 d + 1 \) rounds of butterfly computations, memory swapping occurs and each butterfly processor utilizes only 1 twiddle factor; in the next \( \log_2 N - \log_2 d - 1 \) rounds, no memory swapping occurs and the number of twiddle factors utilized in each butterfly processor increases exponentially.
Therefore, the total number of twiddle factors (the depth of twiddle factor LUT) in each processor is:

\[
\sum_{i=1}^{\log d + 1} 1 + \sum_{i=1}^{\log N - \log d - 1} 2^i = \frac{N}{d} + \log d - 1
\]

A concrete example for the twiddle factor LUT distribution can also be found in Fig. 4. In the first 3 rounds, each processor uses only 1 twiddle factor, i.e., \(\Phi_{10000}\) where \(\Phi\) denotes the 2N-th primitive root of unity \(\omega_{2N}\). In the 4-th round, each processor uses 2 twiddle factors, i.e., \(\Phi_{741000}\) for \(\tau_4 \in \{0, 1\}\). In the 5-th round, each processor uses 4 twiddle factors, i.e., \(\Phi_{74374100}\) for \(\tau_4 \in \{0, 1\}, \tau_3 \in \{0, 1\}\). Therefore, each processor stores \(1 + 2 + 4 = 7\) twiddle factors for the proposed hypercube NTT architecture.

### 2.6 Butterfly Processor

**Design Overview** To perform the butterfly computations and related memory access in each processor efficiently as illustrated in Fig. 4, a butterfly processor architecture is proposed. Fig. 5 depicts the internal structure of the butterfly processor. Two memory blocks are instantiated: one dual-port RAM for the \(\frac{N d}{d}\) points, namely \(a_{\frac{N d}{d} + i} - a_{\frac{N d}{d} + i + \frac{d}{2} - 1}\) for \(i \in [d]\), and one single-port ROM for the \(\frac{N d}{d} + \log d - 1\) precomputed twiddle factors. At first, two points which forms the pair for the elementary butterfly computation unit, e.g., \(a_i\) and \(a_j\) are simultaneously extracted on \(\text{memory}_{\text{dout}}\) from the dual-port RAM. Then \(a_i\) and \(a_j\) are fed to the input ports \(\text{butterfly}_{\text{din}1}\) and \(\text{butterfly}_{\text{din}2}\) of the butterfly structure. This butterfly structure consists of one Barret modular multiplier (apply Alg. 8), one modular adder (apply Alg. 9), and one modular subtractor (apply...
Alg. 10). After the butterfly computation is completed, the results \(a_i + a_j \cdot w\) and \(a_i - a_j \cdot w\) appear at the output ports \texttt{butterfly\_dout1} and \texttt{butterfly\_dout2}. Finally, the two results are simultaneously written back to the RAM through the ports \texttt{memory\_dina} and \texttt{memory\_dinb}. It is worth mentioning that the butterfly processor is fully pipelined such that a pair of valid data \texttt{butterfly\_dout1} and \texttt{butterfly\_dout2} is written back to the RAM every clock cycle, which maintains a relatively high throughput of butterfly computation. This characteristic is crucial for high speed implementation of FHE scheme since the parameter \(N\) (the number of NTT points) is typically set to be large (typical value is around 1k) for maintaining the hardness of the (Ring-) LWE problem.

### Algorithm 8: Barret-Reduction based Modular Multiplication

**Input:** two integers \(a\) and \(b\) over \(\mathbb{Z}_q\)

**Output:** \(a \cdot b \in \mathbb{Z}_q\)

1. Precompute an integer \(k = \lceil \log_2 q \rceil\)
2. Precompute an integer \(r = \lfloor \frac{4^k}{q} \rfloor\)
3. Calculate \(x = a \cdot b\)
4. Calculate \(t = x - \lfloor \frac{q}{2^k} \rfloor \cdot q\)
5. if \(t < q\) then
   6. return \(t\)
   7. else
   8. return \(t - q\)

### Algorithm 9: Modular Addition

**Input:** two integers \(a\) and \(b\) over \(\mathbb{Z}_q\)

**Output:** \(a + b \in \mathbb{Z}_q\)

1. Calculate \(t = a + b\)
2. if \(t < q\) then
   3. return \(t\)
   4. else
   5. return \(t - q\)

### Algorithm 10: Modular Subtraction

**Input:** two integers \(a\) and \(b\) over \(\mathbb{Z}_q\)

**Output:** \(a - b \in \mathbb{Z}_q\)

1. Calculate \(t = a - b\)
2. if \(t \geq 0\) then
   3. return \(t\)
   4. else
   5. return \(t + q\)

### Timing analysis

Let one unit denote the delay of one clock cycle, \(T_{\text{mul}}\) denote the delay of standard integer multiplication, \(T_{\text{modmul}}\) denote the delay of Barret reduction based modular multiplication algorithm, and \(T_{\text{modadd}}(T_{\text{modsub}})\) denote the delay of modular addition(subtraction) algorithm. The delay of one butterfly computation is calculated as

\[
T_{\text{butterfly}} = T_{\text{swap}} + T_{\text{mul}} + T_{\text{modmul}} + T_{\text{modadd}}
\]

Note that the proposed butterfly processor is fully pipelined and therefore it takes \(T_{\text{butterfly}} + \frac{N}{2^2} - 1\) to process \(\frac{N}{2^2}\) butterfly computations.
Fully pipelined computation The key point for fully pipelined butterfly computation is to streamline the generation of memory address, i.e., memory_addr and memory_addrb in Fig. 5. Note that the NTT butterfly address generation pattern is rather complicated: it varies distinctly in different butterfly computation round. It is desirable to implement some other simpler patterns and later combine these simple patterns to create the address generation. In our design, we use five registers, cntb, roundi, dist, cnt, and base to assist the generation of memory_addr and memory_addrb in every clock cycle:

- cntb: base counter register, used to generate the basic logic pattern, i.e., a square wave signal with period of $\frac{2^d}{N}$ cycles
- roundi: butterfly round register, used to indicate the current round of butterfly computation
- dist: distance register, used to record the distance between memory_addr and memory_addrb s.t. $\text{memory_addrb} = \text{memory_addr} + \text{dist}$
- cnt: counter register, used to indicate the incremental offset value for generating memory_addr
- base: the (basis) starting address for memory_addr in each round of butterfly calculation

Moreover, we use two pre-computed arrays blk and dist to help generate the correct values in the five registers mentioned above. blk is related to the variable NumOfGroups in Alg. 3, and indicates the number of butterfly blocks in every round of butterfly calculation and has $\log_2 N$ elements; dist is related to the variable Distance in Alg. 3, and indicates the distance between memory_addr and memory_addrb in every round of butterfly calculation and has $\log_2 N$ elements. The construction blk goes like this: The first $\log d$ elements are always 1; starting from the $\log d + 1$-th element down to the last one, i.e. the last $\log N - \log d$ elements formulate a geometric sequence with initial value 1 and common ratio 2. The construction dist goes like this: The first $\log d$ elements are always $\frac{N}{2^d}$; Then the last $\log N - \log d$ elements formulate a geometric sequence with initial value $\frac{N}{2^d}$ and common ratio $\frac{d}{2}$. For example, if $N = 32, d = 4$, then blk = \{1,1,1,2,4\} and dist = \{4,4,2,1,2\}.

The generation of memory_addr and memory_addrb in Fig. 5 is formally described in Alg. 11. The generated addresses basically map to the memory location of two butterfly inputs (butterfly_din1 and butterfly_din2 shown in Fig. 5). A more concrete example for when $N = 32, d = 4$ is depicted in Fig. 6. Every register including cntb, roundi, dist, cnt, and base has 5 phases each of which corresponds to one of the $\log N = 5$ rounds of butterfly computation. Each phase costs 4 clock cycles. For example, cntb updates as 0,1,2,3 in every phase; whereas roundi updates as i in phase-i(i=0,1,2,3,4). We also assume the calculation of memory address (step6-step7 in Alg. 11) takes one clock cycle delay and thus the result appearing in memory_addr and memory_addrb is delayed by one clock cycle as shown in Fig. 6. The sequence of memory_addr and memory_addrb can be interpreted as follows: In the first clock cycle of phase-0, memory_addr outputs 0 and memory_addrb output 4 (extracting a[0] and a[4] from the local memory a[i] within the node processor); in the second clock cycle, memory_addr outputs 1 and memory_addrb outputs 5, and so on so forth. Finally, in the first clock cycle of phase-4, memory_addr outputs 0 and memory_addrb outputs 1; in the second clock cycle, memory_addr outputs 2 and memory_addrb outputs 3, and so on so forth.

Based on the memory address generation pattern described in Fig. 6, we can finally introduce the complete memory address control logic (See Fig. 7) used in the proposed
butterfly processor. Again, all registers are represented in 5 phases where each phase costs 3 clock cycles. The register current_state indicates one of the three current status in each phase as follows:

- ADDR_RD: In this state, butterfly processor reads the corresponding butterfly inputs (butterfly_din1 and butterfly_din2 in Fig. 5) from memory in a pipelined fashion.
- IDLE: This state is optional, and is used only if N is relatively small. For more details, refer to the next section.
- ADDR_WR: In this state, butterfly processor writes back the computed results (butterfly_dout1 and butterfly_dout2 in Fig. 5) to memory in a pipelined fashion.

Note that the entire butterfly computation takes \( \log N = 5 \) iterations. If \( N \) is relatively small and \( d \) is relatively large, the state register transits by ADDR_RD \( \rightarrow \) IDLE \( \rightarrow \) ADDR_WR in each iteration; otherwise, the state register transits by ADDR_RD \( \rightarrow \) ADDR_WR. A more detailed analysis on the delay of the state IDLE for prescribed parameters \( N, d \) is given in the next subsection.

In state IDLE, the address is invalid since the purpose of IDLE is to wait for the correct results from the butterfly computing module and thus does not need the address signal to interact with memory. The address pattern used in state ADDR_RD is identical to that used in ADDR_WR: our butterfly processor is fully pipelined and, therefore, whenever it reads some data from some specific address in state ADDR_RD, it must write back to the same location later in state ADDR_WR.

Input: the number of NTT points \( N \) and the number of butterfly processors \( d \)

Output: memory address memory_addra and memory_addrb for butterfly computation

```
1 Precompute blk and dist
2 for roundi \( \leftarrow 0 \) to \( \log N - 1 \) do
3   dist \( \leftarrow \) dist[roundi]
4 for blk \( \leftarrow 0 \) to \( \text{blk}[i] - 1 \) do
5   base \( \leftarrow \frac{N}{\text{blk}[i]} \cdot \text{blk}
6   for cnt \( \leftarrow 0 \) to \( \frac{N}{\text{blk}[i]} - 1 \) do
7     memory_addra \( \leftarrow \) base + cnt
8     memory_addrb \( \leftarrow \) base + cnt + dist
```

### Algorithm 11: Memory address generation for butterfly computation

#### 2.7 Microbench Implementations

**Timing analysis** The main states we used are ADDR_RD and ADDR_WR which are used for memory read and memory write, respectively. If the delay of butterfly computation is longer than that of ADDR_RD, then an auxiliary state called IDLE is inserted in between because the node processor cannot write valid data back to memory until the butterfly unit outputs the NTT results. Precisely speaking, if \( \frac{N}{2d} < T_{\text{ADDR, RD}} + 1 < T_{\text{butterfly}} \), then IDLE with delay \( T_{\text{IDLE}} = T_{\text{butterfly}} - \frac{N}{2d} + T_{\text{ADDR, RD}} - 1 \) is required. The delay for the state ADDR_RD and the state ADDR_WR are \( \frac{N}{2d} \) respectively, i.e., \( T_{\text{ADDR, RD}} = T_{\text{ADDR, WR}} = \frac{N}{2d} \). In summary, the total delay for the hypercube NTT with \( d \) processors is:

\[
\log N \cdot \left( \frac{N}{d} + \max(T_{\text{IDLE}}, 0) \right)
\]
Table 5: Performance of the configurable hypercube NTT hardware for FHEW-like FHE schemes on Xilinx Artix-7 FPGA

<table>
<thead>
<tr>
<th>Instance</th>
<th># of processors</th>
<th>freq</th>
<th>cycle</th>
<th>CLB/LUT/Reg</th>
<th>memory</th>
<th>DSPs</th>
</tr>
</thead>
<tbody>
<tr>
<td>N = 1024, q ≈ 2^{32}</td>
<td>2</td>
<td>100</td>
<td>5120</td>
<td>451/2309/1756</td>
<td>3</td>
<td>30</td>
</tr>
<tr>
<td></td>
<td>4</td>
<td>100</td>
<td>2560</td>
<td>760/3581/2940</td>
<td>6</td>
<td>60</td>
</tr>
<tr>
<td></td>
<td>8</td>
<td>100</td>
<td>1280</td>
<td>1402/4238/5480</td>
<td>12</td>
<td>120</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>100</td>
<td>640</td>
<td>2174/1069/1059</td>
<td>24</td>
<td>240</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>100</td>
<td>430</td>
<td>3037/19250/18738</td>
<td>48</td>
<td>480</td>
</tr>
<tr>
<td></td>
<td>64</td>
<td>80</td>
<td>350</td>
<td>7835/39009/37060</td>
<td>96</td>
<td>960</td>
</tr>
</tbody>
</table>

In our concrete experiment, $T_{ADDR_{RB}}$ set to 1 and $T_{butterfly}$ set to 27. Therefore the total delay for the hypercube NTT is further simplified to $\log N \cdot \left( \frac{N}{2} + \max(27 - \frac{N}{2}, 0) \right)$.

**Experimental data** The proposed design is implemented on Xilinx Zynq UltraScale+ ZCU106 evaluation board using Vivado 2018.1. The number of NTT points is set to 1024, a typical value used in FHE schemes. The number of NTT processors is configured to 2,4,8,16,32, and 64 to fully demonstrate the scalability of our hypercube NTT design. It is worth mentioning that our implementation follows the parameterized design approach, i.e., our NTT hardware can be customized and auto-generated on the fly from a script file by inputting core parameters of hypercube NTT, for example, $N$ and $d$. The experimental results are collected in Table 5. As the parameter $d$ increases, the clock frequency is rather stable around 100 MHz, which indicates the hypercube memory swapping strategy is successful to maintain a good critical path delay. If the number of processors is smaller than 32, the cycle delay equals to $\log N \cdot \frac{N}{2}$ and thus the increase of $d$ reduces significantly the cycle delay: for example, doubling $d$ suggests cycle delay reduced by half. As the number of processors gets even bigger ($\geq 32$), the IDLE states are inserted and the cycle delay increases.

Figure 6: Illustrative timing diagram for memory address generation in line with Alg. 11($N = 32, d = 4$)

Figure 7: Top-level timing diagram for hypercube NTT in 5 rounds($N = 32, d = 4$)
delay equals to \( \log N \cdot (\frac{N}{2^d} + 27) \) for which the performance boost by increasing \( d \) is rather marginal. For this case, we can optimize the performance of butterfly computation (more concretely, modular reduction) to further improve it. However, we do not push the limit on this direction which is not the focus in this paper.

2.8 System Integration on Xilinx MPSOC platform

We also conduct an FPGA implementation experiment for the entire FHE bootstrapping where the NTT module is implemented in PL logic and the FHEW software (written in C++) runs on ARM PS logic. The target hardware platform is Xilinx Versal VMK180 development board. Figure 8 illustrates how a software-hardware co-design for a NTT-accelerated FHEW is built. The integration between FHEW software and NTT hardware is achieved via Xilinx’s XDMA core. As a DMA, the core can be used for high performance block data movement between the PCIe address space and the AXI address space using the provided character driver [Inc23]. In this architecture, the NTT hardware is divided into computational and communication segments. The NTT computation module is composed of several NTT computation nodes that execute distinct NTT calculations. The NTT communication module is responsible for data exchange with the software.

In the specific process of software-hardware collaboration, in order to execute a single NTT hardware computation, the host needs to transmit 1,024 data segments for calculation to the FPGA, exemplified by an NTT configuration of \( N = 1024 \) points and \( q < 2^{32} \). The host sets up buffer space in system memory and creates descriptors that the DMA engine use to move the data. When the hardware DMA core receives the data, it stores the data into the DDR memory and informs the NTT computing module that the data is ready. The data conversion module subsequently reads the DDR data, converts it to the format required by the NTT computing module, and forwards it for processing. At this time, the NTT computation module starts to calculate. The calculation speed varies with different numbers of processors. After the computation is completed, the output data will be written back to DDR, and the processing is completed through the interrupt reporting host. Consequently, the DMA engine relocates the data from the DDR back to the host, completing an NTT calculation process.

**Performance** The performance of the NTT-accelerated FHEW solution is presented in Table 6. We configured the FHEW hardware with two different NTT setups: one featuring two NTT processor cores and the other with four NTT processor cores. As discussed earlier, the hardware NTT is divided into computation time which is consumed...
Table 6: Performance of FHEW-like FHE schemes with and without NTT hardware module on Xilinx Artix-7 FPGA

<table>
<thead>
<tr>
<th>Instance</th>
<th>Computing Time</th>
<th>Freq</th>
<th>LUT/Reg</th>
<th>Memory</th>
<th>DSPs</th>
</tr>
</thead>
<tbody>
<tr>
<td>Pure Software</td>
<td>Overall</td>
<td>1.3s</td>
<td>n.a.</td>
<td>n.a.</td>
<td>n.a.</td>
</tr>
<tr>
<td></td>
<td>Excluding NTT</td>
<td>0.24s</td>
<td>2.5 GHz</td>
<td>n.a.</td>
<td>n.a.</td>
</tr>
<tr>
<td></td>
<td>NTT (9484 times)</td>
<td>1.06s</td>
<td>n.a.</td>
<td>n.a.</td>
<td>n.a.</td>
</tr>
<tr>
<td>NTT-accelerated FHEW</td>
<td>Overall</td>
<td>1.3s</td>
<td>n.a.</td>
<td>n.a.</td>
<td>3</td>
</tr>
<tr>
<td>(NTT of 2 processors)</td>
<td>FHEW Software</td>
<td>0.25s</td>
<td>2.5 GHz</td>
<td>n.a.</td>
<td>n.a.</td>
</tr>
<tr>
<td></td>
<td>NTT computation</td>
<td>0.24</td>
<td>250 MHz</td>
<td>2377/1879</td>
<td>3</td>
</tr>
<tr>
<td></td>
<td>NTT communication</td>
<td>0.82</td>
<td>2.5 GHz</td>
<td>123090/135107</td>
<td>142</td>
</tr>
<tr>
<td>NTT-accelerated FHEW</td>
<td>Overall</td>
<td>1.2s</td>
<td>n.a.</td>
<td>n.a.</td>
<td>3</td>
</tr>
<tr>
<td>(NTT of 4 processors)</td>
<td>FHEW Software</td>
<td>0.25s</td>
<td>2.5 GHz</td>
<td>n.a.</td>
<td>n.a.</td>
</tr>
<tr>
<td></td>
<td>NTT computation</td>
<td>0.14s</td>
<td>250 MHz</td>
<td>3702/3191</td>
<td>6</td>
</tr>
<tr>
<td></td>
<td>NTT communication</td>
<td>0.81s</td>
<td>2.5 GHz</td>
<td>123090/135107</td>
<td>142</td>
</tr>
</tbody>
</table>

for performing the proposed parallel hypercube NTT algorithm and communication time which is consumed for data exchanging between the NTT hardware module and the FHEW software. While there is a noticeable speed improvement regarding the NTT computation run time compared to the pure software solution, the advantage is merely incremental: For example, the overall run time of 2 NTT processor accelerated FHEW is almost identical to that of the pure FHEW software, and the overall run time of 2 NTT processor accelerated is about 7.7% faster. It is apparent that the majority amount of time is spent in NTT communication, involving the transfer of encrypted data between the NTT hardware and the FHEW software. In summary, the primary performance bottleneck for FHEW software-hardware integration design stems from the extensive data movement between the NTT hardware and the FHEW software. Therefore, optimizing NTT communication to minimize data transfer is essential for significantly enhancing FHEW performance.

References


