Exploring Energy Efficient Quantum-resistant Signal Processing Using Array Processors

Hamid Nejatollahi\textsuperscript{1}, Sina Shahhosseini\textsuperscript{1}, Rosario Cammarota\textsuperscript{2} and Nikil Dutt\textsuperscript{1}

\textsuperscript{1} University of California Irvine, Irvine, USA
\textsuperscript{2} Intel AI, Privacy and Security Research, USA
\{hnejatol, sshahhos, dutt\}@ics.uci.edu, rosario.cammarota@intel.com

Abstract. Quantum computers threaten to compromise public-key cryptography schemes such as DSA and ECDSA in polynomial time, which poses an imminent threat to secure signal processing. The cryptography community has responded with the development and standardization of post-quantum cryptography (PQC) algorithms, a class of public-key algorithms based on hard problems that no known quantum algorithms can solve in polynomial time. Ring learning with error (RLWE) lattice-based cryptographic (LBC) protocols are one of the most promising families of PQC schemes in terms of efficiency and versatility. Two common methods to compute polynomial multiplication, the most compute-intensive routine in the RLWE schemes, are convolutions and Number Theoretic Transform (NTT).

In this work, we explore the energy efficiency of polynomial multiplier using systolic architecture for the first time. As an early exploration, we design two high-throughput systolic array polynomial multipliers, including \textit{NTT-based} and \textit{convolution-based}, and compare them to our low-cost sequential (non-systolic) NTT-based multiplier. Our sequential NTT-based multiplier achieves more than 3x speedup over the state-of-the-art FPGA implementation of the polynomial multiplier in the NewHope-Simple key exchange mechanism on a low-cost Artix7 FPGA. When synthesized on a Zynq UltraScale+ FPGA, the \textit{NTT-based} systolic and \textit{convolution-based} systolic designs achieve on average 1.7x and 7.5x speedup over our sequential NTT-based multiplier respectively, which can lead to generating over 2x more signatures per second by CRYSTALS-Dilithium, a PQC digital signature scheme. These explorations will help designers select the right PQC implementations for making future signal processing applications quantum-resistant.

Keywords: Public Key Cryptography, Lattice-based Cryptography, Acceleration, Number Theoretic Transform, Systolic Array

1 Introduction

Industry, academia, and government are working together to realize quantum computer that can achieve unprecedented levels of performance in specific application domains, such as biology. With the power of quantum computers, solving problems such as the discrete logarithm problem promises to nullify the effectiveness of
Exploring Energy Efficient Quantum-resistant Signal Processing Using Array Processors

current public-key cryptography, as illustrated by Shor’s algorithm to factorize large integers in polynomial time [1]. Breaking public-key cryptography creates the need to secure signal processing using newer Quantum-resistant cryptography algorithms.

Fortunately, post-quantum cryptography (PQC) is a vibrant area of research devoted to studying alternative schemes for public-key cryptographic protocols, capable of withstanding quantum cryptanalysis attacks, and executable on classical computers [2]. The relevance of the problem is demonstrated by the evaluation and standardization process of PQC algorithms. The most relevant evaluation effort was started with the submission of potential candidates to the National Institute of Standards and Technology (NIST) in 2017 and continues with the evaluation of the candidates until the creation of recommendations for adoption by standards bodies.

Lattice-based cryptography (LBC) schemes are the most promising family of quantum-resistant schemes due to their versatility and superior performance. In both the first and second rounds of the NIST PQC competition, about half of the candidates belong to the LBC family. In this work, we adopt the systolic architecture [3] to accelerate polynomial multiplier which is the heart of a subset of (LBC) algorithms (i.e., ideal LBC). The compute-intensive polynomial multiplier kernels can be performed using the Schoolbook algorithm in $O(N^2)$ time complexity [4]. The Discrete Fourier Transform (DFT) and its fast variant, Fast Fourier Transform (FFT), are the two other candidates to multiply two polynomials. The former has a time complexity of $O(N^2)$, which involves matrix-vector multiplication in the time domain, while the latter has a complexity of $O(N \log N)$. Both schemes are critically dependent on acceleration to achieve satisfactory performance. However, increased performance comes at the expense of energy overheads.

Towards this end, we allow signal processing designers to evaluate tradeoffs between performance and energy of quantum-resistant cryptographic schemes using systolic array architectures [3]. Thus, energy-efficient quantum-resistant signal processing schemes can be developed for a wide range of applications, from resource-constrained internet-of-things (IoT) settings to performance-driven real-time signal processing scenarios.

The contributions of this work are summarized as the following:

- Analyze, for the first time, energy consumption of various systolic array implementations of polynomial multiplier for accelerating quantum-resistant signal processing.

- Replace the NTT-based algorithm as the standard in LBC with a convolution-based multiplier that leads to 7x improvement in the execution time of the polynomial multiplier.

- Reach an order of magnitude speedup in computing polynomial multiplier over the state-of-the-art FPGA implementation of the polynomial multiplier of NewHope-Simple key exchange mechanism. [5].
2 Background and Related Work

A lattice \( L \subset \mathbb{R}^n \) is the set of all integer linear combinations of basis vectors \( b_1, \ldots, b_n \in \mathbb{R}^n \), i.e., \( L = \{ \sum a_i b_i : a_i \in \mathbb{Z} \} \). LBC exploits the hardness of two problems: Short Integer Solution (SIS) and Learning With Errors (LWE) [6]. Cryptosystems based on the LWE problem, the most common one, have their foundation in the difficulty of finding the secret key \( sk \), given \( (A, pk) \), where \( pk = A \ast sk + e \mod q \), given the public key \( pk \), an error vector \( e \) with Gaussian distribution, and a matrix \( A \) of constants in \( \mathbb{Z}^q_{r \times n} \) chosen randomly from a uniform distribution. Because LWE requires large keys (e.g., 11 KB for Frodo [7]), it can be impractical on devices with limited on-chip memory. To overcome this limitation, Lyubashevsky et. al [8] introduced Ring-LWE (RLWE), a derivation of LWE in which \( A \) is implicitly defined as a vector in the ring \( \mathbb{R} \equiv \mathbb{Z}_q [x]/(x^n + 1) \).

Arithmetic operations for a Ring-LWE-based scheme are performed over a ring of polynomials. Let \( n \), a power of two, and \( p \) be the degree of the lattice and a prime number (\( p = 1 \mod 2n \)), respectively. \( Z_p \) denotes the ring of integers modulo \( p \), and \( x^n + 1 \) is an irreducible degree \( n \) polynomial. The quotient ring \( \mathbb{R}_p \) contains all polynomials with degree less than \( n \) in \( Z_p \), that defines \( R_p = Z_p / [x^n + 1] \) in which coefficients of polynomials are in \([0, p)\). Degrees of the polynomials in RLWE-based schemes vary between 256 [9] and 1024 [10].

In the following we describe two common methods to compute polynomial multiplier.

2.1 Convolution-based multiplier

The easiest way to multiply two polynomials is to use the convolution (Schoolbook [11]) with the time complexity of \( O(n^2) \) as shown in Algorithm 1. A convolution-based multiplier can be seen as a discrete feed-forward finite impulse response (FIR) over the polynomials in \( \mathbb{R} \equiv \mathbb{Z}_q [x]/(x^n + 1) \). Due to the inefficiency of the convolution-based multiplier, there is no notable work that uses it for the acceleration; however, by using its systolic architecture, we can achieve considerable performance gains.

2.2 NTT-based multiplier

Polynomial multiplier is usually implemented by using the Number Theoretic Transform (NTT), which drops the time complexity of the polynomial multiplier from \( O(n^2) \) to \( O(n \cdot \log n) \). Polynomials \( a = a(n-1) \cdot x^{n-1} + \ldots + a(0) \) and \( b = b(n-1) \cdot x^{n-1} + \ldots + b(0) \) are transformed into their NTT representations \( A = A(n-1) \cdot x^{n-1} + \ldots + A(0) \) and \( B = B(n-1) \cdot x^{n-1} + \ldots + B(0) \), and their multiplication can be computed coefficient-wise as \( C = \sum_{i=0}^{n-1} A(i) \cdot B(i) \cdot x^i \). The result \( c = a \ast b \) is obtained after the computation of the inverse number theoretic transform (NTT\(^{-1}\)) of \( C \). Algorithm 2 describes the NTT-based polynomial multiplier. Figure 1 illustrates basic blocks of the NTT-based polynomial multiplier. Standard algorithms to perform number theoretic transform are: Cooley-Tukey (CT) [12], which produces the result in the bit-reverse order by receiving the input in the correct order; and Gentleman-Sande (GS) [13], which receives the input
Exploring Energy Efficient Quantum-resistant Signal Processing Using Array Processors

Algorithm 1 Convolution (Schoolbook)-based Polynomial Multiplier

1: **Initialization:** Let \( a = \{a_0, a_1, a_2, ..., a_{n-1}\} \) and \( b = \{b_0, b_1, b_2, ..., b_{n-1}\} \in \mathbb{Z}_q[x]/<f(x)> \) be two polynomials with the length of \( n \), where \( f(x) = x^n + 1 \) is an irreducible polynomial with \( n \) a power of 2, and \( q \equiv 1 \mod 2n \) is a large prime number.

2: \( c \leftarrow 0 \)

3: for \( i = 0 \) to \( n - 1 \) do

4: for \( j = 0 \) to \( j - 1 \) do

5: \( \text{sign} \leftarrow (-1)^{(i+j)/n} \)

6: \( \text{index} \leftarrow (i + j) \mod n \)

7: \( \text{coeff} \leftarrow a_i b_i \mod q \)

8: \( c_{\text{index}} \leftarrow \text{integer}(c_{\text{index}} + \text{sign} \times \text{coeff}) \mod q \)

9: end for

10: end for

11: Return \( c \)

Algorithm 2 NTT-based Polynomial Multiplier

1: **Initialization:** Let \( a = \{a_0, a_1, a_2, ..., a_{n-1}\} \) and \( b = \{b_0, b_1, b_2, ..., b_{n-1}\} \in \mathbb{Z}_q[x]/<f(x)> \) be two polynomials with length of \( n \), where \( f(x) = x^n + 1 \) is an irreducible polynomial with \( n \) a power of 2, and \( q \equiv 1 \mod 2n \) is a large prime number. \( w \) is the \( n \)-th root of unity and \( \phi \) is the \( 2n \)-th root of unity \((\phi^2 = w \mod q)\); \( w^{-1} \) and \( \phi^{-1} \) are the inverse of \( w \mod q \) and \( \phi \mod q \), respectively.

2: **Precompute:** \( \{w^i, w^{-i}, \phi^i, \phi^{-i}\} \) for \( i \in [0, n - 1] \)

3: for \( i = 0 \) to \( n - 1 \) do

4: \( a_i \leftarrow a_i \phi^i \)

5: \( b_i \leftarrow b_i \phi^i \)

6: end for

7: \( A \leftarrow \text{NTT}_w^n(\bar{a}) \)

8: \( B \leftarrow \text{NTT}_w^n(\bar{b}) \)

9: \( \hat{C} = A \hat{B} \)

10: \( \hat{c} \leftarrow i\text{NTT}_w^n(\hat{C}) \)

11: for \( i = 0 \) to \( n - 1 \) do

12: \( c_i \leftarrow \hat{c}_i \phi^{-i} \)

13: end for

14: Return \( C \)

in the reverse order and produces the output in the correct order. Similar to [5], we use the Gentleman-Sande method to compute both NTT and NTT\(^{-1}\), which needs bit-reverse calculation. As previously mentioned, NTT designs are known as parallel NTT due to the parallel use of butterfly processing elements. Pipeline FFT processor [14] is a high throughput design of FFT that can be implemented using single-path delay feedback. Authors in [15] adopt Pipeline FFT to design systolic array polynomial multiplier to develop a high throughput RLWE cryptoprocessor. In this work, we focus on the parallel NTT designs and leave the Pipeline FFT for
future work.

2.3 Previous works

Previous efforts on the acceleration of the NTT mostly focus on the area and performance of the polynomial multiplier for LBC and leave the energy unexplored [16] [17]. Some efforts have evaluated the energy as well as area and performance of NTT accelerators [18][19][20]. The only notable work that uses systolic arrays to accelerate polynomial multiplier reports only performance and area metrics for $n = 256$ and $n = 512$; however, we report energy as the primary goal as well as the performance and area for $n \in \{128, 256, 512, 1024\}$.

3 Systolic Array polynomial multiplier

We use the concept of systolic array architecture to design polynomial multipliers.

3.1 Systolic arrays

Today’s systems are intensely designed to move data for computation. Data movement is highly expensive in terms of energy consumption and latency compared to computation. Consequently, movement of data is the key bottleneck in computing systems as applications become more data-intensive. To address the bottleneck, we need an alternative architecture – such as a systolic array – to process data with less data movement. A systolic array consists of a set processing elements (PE), each capable of performing simple operations. Each PE is connected to its nearest PEs and performs operations on data that flows between them. In other words, data flows from the memory cells, passes through PEs which operate on it, before returning to the memory cells. This relieves the repeated memory access problem for general-purpose computing systems, which in turn helps to reduce latency.

We design and implement two systolic array polynomial multipliers including NTT-based and convolution-based systolic array and compare them to the sequential NTT-based multiplier.

3.2 Implementation of systolic array polynomial multiplier

3.2.1 NTT-based polynomial multiplier

The conventional hardware implementation for NTT-based polynomial multiplier uses a processing element with only one butterfly block sequentially (Figure 1-a); we use sequential NTT-based multiplier ($\text{Seq}_\text{NTT}$) as the baseline in our experiments; $\text{Seq}_\text{NTT}$, our slowest design, provides 3x speedup to compute forward and inverse number theoretic transforms compared to the implementation of [5] on a low-cost Artix7 Xilinx FPGA.

NTT-based polynomial multiplier can be implemented using a systolic array ($\text{SA}_\text{NTT}$ for the rest of the paper). Figure 1-b shows a $\log n$ array of processing elements, each executing $n/2$ butterfly operations on the all elements of the input polynomial. In other words, we cascade all $\log n$ stages of the NTT-method and
connect them using FIFO buffers. According to the Gentleman-Sande method, extracting parallelism between stages of the NTT is non-trivial; we can improve the performance of NTT by fusing multiple stages through the dataflow optimization of a high-level synthesis (HLS) tool.

![High level diagram for polynomial multiplication using NTT units](image1)

![NTT unit in the sequential NTT-based (Seq_NTT) polynomial multiplier](image2)

![NTT unit in the systolic array NTT-based (SA_NTT) polynomial multiplier](image3)

Figure 1: (a) Polynomial multiplication using NTT units (b) NTT unit based on only one PE which processes inputs sequentially in \(\frac{n}{2}\log n\) iterations. In order to perform polynomial multiplication NTT unit is executed three times (c) NTT-based systolic array polynomial multiplier encompass \(\log n\) PE blocks each performs butterfly operation in \(n/2\) iterations.
3.2.2 Convolution-based polynomial multiplier

The time complexity of the convolution-based polynomial multiplier can be decreased to $O(n)$ if we adjust the systolic architecture to use $n$-cascaded multiply-accumulators (MACs). As shown in Figure 2, each processing element in the convolution-based polynomial multiplier (CONV) performs modular reduction (MR) after each multiplication and addition. The significant performance improvement comes at the large area and energy overhead of performing $n$ MACs. We use our optimized versions of Montgomery [21] and Barrett [22] reduction after each multiplication and addition, respectively, by means of only shifts and additions.

Table 1: High-level synthesis results on Zynq UltraScale++ for three different designs of polynomial multipliers. To multiply two polynomials using NTT-based multipliers, we need to execute NTT block three times each includes bit-reversal, forward NTT, and point-wised multiplication of the input vector with the precomputed twiddle factors. Although the degree of the polynomials in most of the RLWE-based schemes ranges from 256 to 1024, we also provide the results for other degrees to show the trends.

<table>
<thead>
<tr>
<th>Design</th>
<th>N</th>
<th>Cycles</th>
<th>Latency (us)</th>
<th>Energy (uJ)</th>
<th>BRAM</th>
<th>CLB</th>
<th>DSP</th>
<th>FF</th>
<th>LUT</th>
<th>Freq. (MHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Seq_NTT</td>
<td>64</td>
<td>2112</td>
<td>7.47</td>
<td>0.16</td>
<td>0</td>
<td>221</td>
<td>10</td>
<td>1120</td>
<td>937</td>
<td>282.48</td>
</tr>
<tr>
<td></td>
<td>128</td>
<td>4776</td>
<td>16.79</td>
<td>0.62</td>
<td>1</td>
<td>364</td>
<td>10</td>
<td>2085</td>
<td>1.520</td>
<td>284.33</td>
</tr>
<tr>
<td></td>
<td>256</td>
<td>10728</td>
<td>36.37</td>
<td>1.27</td>
<td>2</td>
<td>667</td>
<td>10</td>
<td>4281</td>
<td>2999</td>
<td>284.89</td>
</tr>
<tr>
<td></td>
<td>512</td>
<td>23736</td>
<td>83.93</td>
<td>4.36</td>
<td>2</td>
<td>1220</td>
<td>10</td>
<td>8374</td>
<td>5278</td>
<td>282.80</td>
</tr>
<tr>
<td></td>
<td>1024</td>
<td>52152</td>
<td>169.28</td>
<td>13.37</td>
<td>2</td>
<td>2551</td>
<td>10</td>
<td>17091</td>
<td>10888</td>
<td>308.07</td>
</tr>
<tr>
<td>SA_NTT</td>
<td>64</td>
<td>897</td>
<td>4.86</td>
<td>0.49</td>
<td>14</td>
<td>366</td>
<td>38</td>
<td>1577</td>
<td>1760</td>
<td>184.29</td>
</tr>
<tr>
<td></td>
<td>128</td>
<td>1878</td>
<td>10.19</td>
<td>0.95</td>
<td>17</td>
<td>426</td>
<td>43</td>
<td>1906</td>
<td>2063</td>
<td>184.16</td>
</tr>
<tr>
<td></td>
<td>256</td>
<td>4008</td>
<td>21.56</td>
<td>2.15</td>
<td>24</td>
<td>547</td>
<td>48</td>
<td>2230</td>
<td>2372</td>
<td>185.83</td>
</tr>
<tr>
<td></td>
<td>512</td>
<td>8634</td>
<td>47.63</td>
<td>5.28</td>
<td>27</td>
<td>605</td>
<td>53</td>
<td>2610</td>
<td>2772</td>
<td>181.25</td>
</tr>
<tr>
<td></td>
<td>1024</td>
<td>18636</td>
<td>101.84</td>
<td>12.52</td>
<td>29</td>
<td>732</td>
<td>58</td>
<td>3097</td>
<td>3140</td>
<td>182.98</td>
</tr>
<tr>
<td>CONV</td>
<td>64</td>
<td>271</td>
<td>1.31</td>
<td>0.85</td>
<td>1</td>
<td>2226</td>
<td>322</td>
<td>8746</td>
<td>10223</td>
<td>265.88</td>
</tr>
<tr>
<td></td>
<td>128</td>
<td>533</td>
<td>2.65</td>
<td>3.39</td>
<td>2</td>
<td>4398</td>
<td>642</td>
<td>19098</td>
<td>21001</td>
<td>200.68</td>
</tr>
<tr>
<td></td>
<td>256</td>
<td>1058</td>
<td>5.32</td>
<td>8.33</td>
<td>2</td>
<td>8580</td>
<td>1282</td>
<td>38108</td>
<td>47302</td>
<td>198.57</td>
</tr>
<tr>
<td></td>
<td>512</td>
<td>2107</td>
<td>10.75</td>
<td>64.03</td>
<td>2</td>
<td>19301</td>
<td>2562</td>
<td>135585</td>
<td>161924</td>
<td>195.84</td>
</tr>
<tr>
<td></td>
<td>1024</td>
<td>3720</td>
<td>17.40</td>
<td>257.23</td>
<td>2</td>
<td>49714</td>
<td>4282</td>
<td>310247</td>
<td>322866</td>
<td>195.22</td>
</tr>
</tbody>
</table>
4 Evaluation

To perform the synthesis, we use Vivado high level synthesis (HLS) 2018.2 and select Artix7 and Zynq UltraScale+ as the target FPGA devices for from resource-constrained internet-of-things (IoT) settings and performance-driven real-time signal processing scenarios, respectively. We extract the numbers of reported resources (BRAM, CLB, DSP, FF, and LUT), maximum achieved frequency, and energy from the post-implementation process of the HLS tool. The latency of the polynomial multiplier is the number of execution cycles at the maximum achieved frequency.

For the polynomial-size of \( N=1024 \), \texttt{Seq\_NTT} achieves 3x speedup to compute forward and inverse number theoretic transform compared to the implementation of [5] on a low-cost Artix7 FPGA. Table 1 shows the synthesis results of \texttt{Seq\_NTT} as the smallest design along with the two systolic array polynomial multipliers, \texttt{SA\_NTT} (NTT-based) and \texttt{CONV} (convolution-based on Zynq UltraScale++). \texttt{Seq\_NTT} achieves the highest maximum frequency than \texttt{SA\_NTT} and \texttt{CONV} because of its sequential architecture.

According to Figure 3, with the increase in the degree of the polynomial from 512 to 1024, the rise in the energy consumption of \texttt{CONV} is exponential due to the increase in the number of resources required to satisfy the timing constraints.

Figure 4 shows the latency and energy consumption, extracted from Table 1, of the designs normalized to the \texttt{Seq\_NTT}. For \( N=1024 \), e.g., NewHope scheme, \texttt{SA\_NTT} reaches 1.67x speedup with 7% decrease in energy compared to the \texttt{Seq\_NTT}; for \( N=256 \) and \( N=512 \), 1.7x improvement in latency is achieved with 69% and 21% increase in the energy, respectively. We suggest using systolic array NTT-based polynomial multiplier for resource-constrained devices to achieve plausible performance with low energy consumption to verify the digital signatures and/or perform encrypting and decryption. The convolution-based systolic array multiplier

![Figure 3](image-url)

Figure 3: Increase in the latency and energy of polynomial multiplier by the increase in the size of the polynomials.
is suitable for high-performance servers that can tolerate higher energy consumption to generate 2x more signatures per second by CRYP\-TALS – Dilithium [23], a PQC digital signature scheme with a 7x speedup in the computing NTT. Forward and inverse NTT consumes around 65% of cycles in CRYP\-TALS – Dilithium to generate a signature.

5 Conclusion and future work

The advent of quantum computing threatens to render ineffective classical cryptographic schemes to secure signal processing application. Emerging quantum resistant cryptographic schemes show promise, but are hampered by the computational overhead of key critical kernels such as polynomial multiplier. This work explores, for the first time, the energy efficiency of array processors for implementing polynomial multipliers with degrees up to 1024.

We design and synthesize an NTT-based and a convolution-based systolic array polynomial multipliers and compare their performance, area, and energy with the sequential NTT-based counterpart. Our serial NTT-based design achieves 3x speedup on a low-cost Artix7 FPGA compare to the hardware implementation of NewHope-Simple. NTT-based systolic array on average is 1.7x faster than sequential NTT-based polynomial multiplier with less than 30% increase in the energy.

Convolution-based systolic array on average is 7.5x faster than serial NTT-based polynomial multiplier with 13.5x increase in the energy overhead; thus, convolution-based systolic array polynomial multipliers are suitable for high performance servers that can tolerate higher energy consumption.

Our future work will evaluate the energy efficiency of the Pipeline FFT processor to perform polynomial multiplication. Additionally, we will make the systolic architecture of the CONV more area- and energy-efficient for securing quantum resistance of future signal processing applications.

Figure 4: Comparison of the different polynomial multiplier designs, normalized to NTT, for polynomials with degree 256, 512 and 1024.
References


