Efficient isochronous fixed-weight sampling with applications to NTRU

Décio Luiz Gazzoni Filho\textsuperscript{1,2}, Tomás S. R. Silva\textsuperscript{3} and Julio López\textsuperscript{1}

\textsuperscript{1} Universidade Estadual de Campinas (UNICAMP), Instituto de Computação, Campinas, Brazil
\textsuperscript{2} State University of Londrina, Department of Electrical Engineering, Londrina, Brazil
\textsuperscript{3} Universidade Estadual de Campinas (UNICAMP), Instituto de Matemática, Estatística e Computação Científica, Campinas, Brazil

Abstract. We present a solution to the open problem of designing an efficient, unbiased and timing attack-resistant shuffling algorithm for NTRU fixed-weight sampling. Although it can be implemented without timing leakages of secret data in any architecture, we illustrate with ARMv7-M and ARMv8-A implementations; for the latter, we take advantage of architectural features such as NEON and conditional instructions, which are representative of features available on architectures targeting similar systems, such as Intel. Our proposed algorithm improves asymptotically upon the current approach, which is based on constant-time sorting networks ($O(n)$ versus $O(n \log^2 n)$), and an implementation of the new algorithm is also faster in practice, by a factor of up to 6.91 (591\%) on ARMv8-A cores and 12.58 (1158\%) on the Cortex-M4; it also requires fewer uniform random bits. This translates into performance improvements for NTRU encapsulation, compared to state-of-the-art implementations, of up to 50\% on ARMv8-A cores and 71\% on the Cortex-M4, and small improvements to key generation (up to 2.7\% on ARMv8-A cores and 6.1\% on the Cortex-M4), with negligible impact on code size and a slight improvement in RAM usage for the Cortex-M4.

Keywords: Post-quantum cryptography · NTRU · Sampling · ARM

1 Introduction

In the late 1990s, the rise of quantum algorithms for database search and factorization [Gro96, Sho97] posed a threat to public-key cryptosystems based on integer factorization and/or discrete logarithms. Even though quantum computers capable of efficiently performing such computations do not exist yet, growing concern within the community led to seeking alternative cryptographic primitives capable of resisting attacks from quantum algorithms. Thus, Post-Quantum Cryptography (PQC) arises as an attempt to counter these attacks by developing new public-key cryptographic algorithms built on problems known to be resistant to quantum attacks, such as lattice-based problems.

One of the oldest lattice-based cryptosystems is NTRU, first presented in the rump session of CRYPTO ’96 [HPS]. It remains relevant, as shown by advancing to the third round of the NIST PQC standardization process [CDH+20, Nat17], and its standardization in other forums [IEE09, ASC10]. A performance bottleneck of NTRU is fixed-weight sampling of polynomials, i.e. those with a prescribed number of randomly permuted $-1, 0$ and $1$ coefficients, employed in key generation and encapsulation. Unless carefully optimized, this contributes a large runtime cost, particularly to encapsulation.
Shuffling algorithms appear perfectly suited to solve the problem of fixed-weight sampling; however, there is no known efficient algorithm that is resistant to timing attacks for this problem. Instead, constant-time sorting is used to generate random permutations, as mandated by the NTRU submission to the NIST PQC contest [CDH+20]. We propose a new, timing attack-resistant shuffling algorithm to replace the sorting-based approach, with improved asymptotic running time and large performance improvements in actual implementations, especially for embedded architectures.

Prior to our work, the main proposal to avoid the cost of fixed-weight sampling for NTRU is NTRU-HRSS [HRSS17]. Their technique was later merged into the NTRU proposal for NIST’s PQC standardization process [CDH+20]. Due to larger key and ciphertext sizes, it was adopted for only one out of the four suggested parameter sets.

There exist many shuffling algorithms, such as Fisher-Yates [FY38, Dur64, Knu97], Rao-Sandelius [Rao61, San62] and MERGE SHUFFLE [BBHL18]. Algorithms in the coin tossing model, aimed at minimizing the consumption of random bits, are reviewed in [BBHT17]. However, none of these are designed to resist side-channel attacks. Indeed, [Dan19] remarks that Fisher-Yates is the most straightforward implementation of fixed-weight sampling, but cautions that “implementing Fisher-Yates in such a way that there is no side channel is difficult.” They opt for a constant-time sorting network, as proposed by Bernstein for use with the McEliece [BCS13] and NTRUPrime [BCLvV18] cryptosystems.

A constant-time version of Fisher-Yates has been proposed by Sendrier [Sen21] with running time $O(n^2)$ for the BIKE cryptosystem; as BIKE samples $O(\sqrt{n})$ out of $n$ elements, its performance behaves acceptably as $O((\sqrt{n})^2) = O(n)$. In NTRU, a full shuffle is required, and Sendrier’s algorithm cannot compete with $O(n \log_2 n)$ for the fastest practical sorting networks.

**Our contributions.** In §3, we solve the open problem of designing an unbiased shuffling algorithm resistant to timing attacks for the NTRU fixed-weight sampling problem. It is a drop-in replacement for NTRU’s current sampling-by-sorting approach, improving the running time from $O(n \log^2 n)$ for the best practical sorting networks to $O(n)$, without impacting security; we discuss implementation aspects in §4. We show in §5 that an implementation of our proposed approach is considerably faster for the fixed-weight sampling step: up to $6.91 \times (591\%)$ on ARMv8-A cores and $12.58 \times (1158\%)$ on the Cortex-M4. This translates into considerable improvements for the KEM encapsulation operation (up to 50% on ARMv8-A cores and 71% on the Cortex-M4) and smaller improvements for key generation (up to 2.7% and 6.1% on the same respective platforms), with little effect on code size, and small gains in memory usage, for embedded architectures. We illustrate how to implement its main operations efficiently in the ARMv8-A and ARMv7-M architectures, as well as generic operations suitable for any architecture, and discuss possible implementations for Intel architectures. Our implementation is available under an open source license in https://github.com/...²

2 Preliminaries

2.1 NTRU random sampling

NTRU is a post-quantum public-key cryptosystem whose security relies on the difficulty of finding short vectors in high-dimensional lattices [Ajt96, MG02]. It is based on a polynomial ring over a finite field, and some of its parameters are random ternary polynomials, i.e., with coefficients in $\{-1, 0, +1\}^2$. A subset of these are additionally restricted to being fixed-weight, i.e. having a prescribed number of $-1$, 0 and $+1$ coefficients. The NTRU¹

¹Source code will be made available following the publication of the paper.
²Given that these are ternary polynomials, the coefficient 2 may be used interchangeably with $-1$. 

specification [CDH+20] defines $T(d)$ as the set of ternary “polynomials that have exactly $d/2$ coefficients equal to $+1$ and $d/2$ coefficients equal to $-1$”.

The straightforward approach to sample from $T(d)$ is to fix a representative of $T(d)$ (e.g., $-1$ for the first $d/2$ coefficients, $+1$ for the next $d/2$ coefficients, and $0$ for the remaining ones), and randomly permute its coefficients using a shuffling algorithm. However, known shuffling algorithms are not side-channel attack resistant [Dan19]. The usual solution to this problem, mandated by the NTRU specification [CDH+20], is based on sorting. Briefly, an array of key-value pairs are created, using uniformly random samples as keys, and values are coefficients from the chosen fixed representative of $T(d)$. Sorting the random keys induces a random permutation of the coefficients. This approach is illustrated in Algorithm 1.

Algorithm 1 \textsc{SampleFixedType}: fixed-weight sampling by sorting [CDH+20, §1.10.5]

\textbf{Input:} \((b_0, b_1, \ldots, b_{l−1})\) (random bit string of length \(l = 30(n−1)\))

\textbf{Output:} \(v\) (a polynomial in \(T(q/16−1)\))

\textbf{Notes:} We denote by \(\text{Int}(x_0, \ldots, x_{k−1})\) the unsigned integer with \(x_j (0 \leq j \leq k−1)\) at the \(j\)-th bit of its binary representation.

1: \(a \leftarrow [0, 0, \ldots, 0]\) \hspace{1cm} \(\triangleright\) Array of \(n−1\) zeros
2: \(v \leftarrow 0\) \hspace{1cm} \(\triangleright\) The zero polynomial
3: \(i \leftarrow 0\)
4: \textbf{for} \(i = 0\) \textbf{to} \(q/16−2\) \textbf{do}
5: \(a[i] \leftarrow 1 + 4 \cdot \text{Int}(b_{30i}, \ldots, b_{30i+29})\)
6: \textbf{for} \(i = q/16−1\) \textbf{to} \(q/8−3\) \textbf{do}
7: \(a[i] \leftarrow 2 + 4 \cdot \text{Int}(b_{30i}, \ldots, b_{30i+29})\)
8: \textbf{for} \(i = q/8−2\) \textbf{to} \(n−2\) \textbf{do}
9: \(a[i] \leftarrow 0 + 4 \cdot \text{Int}(b_{30i}, \ldots, b_{30i+29})\)
10: Sort \(a\) in constant time
11: \textbf{for} \(i = 0\) \textbf{to} \(n-2\) \textbf{do}
12: \(v \leftarrow v + (a[i] \mod 4) 2^i\)
13: \textbf{return} \(v\)

2.2 Shuffling algorithms

\textbf{Fisher-Yates.} The Fisher-Yates shuffle algorithm, also known as Knuth’s shuffle [FY38, Dur64, Knu97], is a classical technique for randomly and unbiasedly permuting elements in a collection. It is displayed in Algorithm 2.

Algorithm 2 \textsc{Fisher-Yates}(a, n)

\textbf{Input:} An array \(a\) of \(n\) elements

\textbf{Output:} A random permutation of \(a\)

1: \textbf{for} \(i = n−1\) \textbf{downto} 0 \textbf{do}
2: \(j \leftarrow \text{random integer such that} \ 0 \leq j \leq i\)
3: \(\text{Exchange} a[j] \text{ and } a[i]\)
4: \textbf{return} \(a\)

Fisher-Yates has favourable performance characteristics: \(O(n)\) running time with small constants. However, the pioneering work of [Ber04], applied to S-boxes in the AES cipher, revealed that array accesses indexed by secret data are susceptible to side-channel attacks,
due to timing variabilities induced by the presence or absence of data in CPU caches. A similar attack can be mounted on Algorithm 2 by recovering the indices \( j \) in the accesses to \( a[j] \) in line 3, allowing the permutation to be reconstructed by an attacker. We recall that a constant-time version was proposed in [Sen21]; however, its running time for a full shuffle (as required in NTRU) is degraded to \( O(n^2) \).

**Rao-Sandelius.** A relevant shuffling algorithm is Rao-Sandelius, independently proposed in the 1960s by [Rao61] and [San62]. It relies on a divide-and-conquer strategy.

**Algorithm 3 RS\((a, n)\): Rao-Sandelius shuffle**

**Input:** An array \( a \) of \( n \) elements

**Output:** A random permutation of \( a \)

1: if \( n \leq 1 \) then
2: return \( a \)
3: if \( n = 2 \) then
4: if \( \text{rand-bit} = 1 \) then
5: return \([a[1], a[0]]\)
6: else
7: return \([a[0], a[1]]\)
8: Let \( A_0 \) and \( A_1 \) be two empty arrays
9: for \( i = 0 \) to \( n \) do
10: Add \( a[i] \) into \( A_{\text{rand-bit}} \)
11: return \( \text{RS}(A_0, |A_0|) \| \text{RS}(A_1, |A_1|) \)

The case \( n = 2 \) can be made constant-time using standard techniques. Line 10 directs each element \( a[i] \) to a different array depending on a random bit; by evicting both arrays from the cache for later probing, an attacker can find which array was written to. This can be countered by writing to both arrays regardless of the random bit drawn, but only incrementing the correct pointer. However, the random choice of array for assignment may lead to uneven growth of the arrays. We are unaware of any concrete analyses in the literature, but conjecture that this leaks enough data to mount a cache timing attack.

**MergeShuffle.** Finally, **MergeShuffle**, introduced in [BBHL18], “is an (easy to implement) extremely efficient algorithm to generate random permutations (or to randomly permute an existing array)”. As with the Rao-Sandelius algorithm, **MergeShuffle** uses a divide-and-conquer strategy and is amenable to a parallel implementation.

Let \( k \) be a cut-off threshold to switch to Fisher-Yates. **MergeShuffle** splits an input array \((a_0, a_1, \ldots, a_{n-1})\) into \( 2^k \) blocks to be shuffled using Fisher-Yates (Algorithm 2), and then merges the resulting permutations as presented in Algorithm 4. The merging procedure is similar in spirit to that of e.g. mergesort, but it is performed in-place and uses a random bit to choose whether to swap elements from the two input arrays.

The use of Fisher-Yates as a subroutine of **MergeShuffle** renders it equally susceptible to cache timing attacks. It is also unclear whether the merging step can be vectorized, to attain competitive performance, and implemented in constant-time.

### 3 Fixed-weight sampling by constant-time shuffling

As just discussed, while shuffling is the natural solution to the fixed-weight sampling problem in NTRU, we are unaware of any shuffling algorithm resistant to side-channel attacks. In this section, we propose an efficient, unbiased and timing attack-resistant
shuffling algorithm suitable for NTRU fixed-weight sampling. Throughout this section, $n$ is defined as in the NTRU specification and assumes values of either 509, 677 or 821.

We first describe a subroutine (Algorithm 5) to generate an array of random integers $s_i$ such that $s_i[j] \sim U((0, n - 1 - i))$ for $0 \leq i < n - 1$. It is a slightly modified version of [Lem19, Algorithm 5]. While other approaches exist to achieve the same result, some of which are discussed in the same paper, this method achieves the best performance among all methods we experimented with, while restricting costly (and, in all CPUs we’re familiar with, variable-time) divisions by non-power-of-two integers to a pre-computation step.

Algorithm 5 RejSamplingMod($n$, $\text{rnd}$): Unbiased uniform random number generation

\textbf{Input:} $n$

\textbf{Input:} $\text{rnd}$ (array of random $L$-bit integers; refer to discussion in §4 for its length)

\textbf{Output:} $s_i$ (output array of $n - 1$ integers)

1. for $i = 0 \text{ to } n - 2$ do
2. \hspace{1em} $t[i] = 2^L \mod (n - 1 - i)$ \hfill \Comment{Precomputation}
3. $j \leftarrow 0$
4. for $i = 0 \text{ to } n - 2$ do
5. \hspace{1em} repeat
6. \hspace{2em} $m \leftarrow \text{rnd}[j] \times (n - 1 - i)$
7. \hspace{2em} $j \leftarrow j + 1$
8. \hspace{1em} $l \leftarrow m \mod 2^L$ \hfill \Comment{Implement with a bitmask}
9. until $l \geq t[i]$
10. $s_i[i] \leftarrow \left\lfloor m/2^L \right\rfloor$ \hfill \Comment{Implement using right-shift by $L$ bits}
11. return $s_i$

Lemma 1 (Correctness and unbiasedness of Algorithm 5 [Lem19]). Let $L \in \mathbb{N}^* = \mathbb{N} \setminus \{0\}$. Then, $\forall s \in [0, 2^L)$ and $\forall y \in [0, s)$, with $s, y$ integers, there are $2^L/s$ values of $x \in [0, 2^L)$ such that $[(x \cdot s)/2^L] = y$ and $(x \cdot s) \mod 2^L \geq 2^L \mod s$.

\textbf{Proof.} Let $L \in \mathbb{N}^*$. Take $s \in [0, 2^L)$ and $y \in [0, s)$. Given $x \in [0, 2^L)$, the product $x \cdot s$ maps integers in $[0, 2^L)$ to integers in $[0, s \cdot 2^L)$. Take $z \in [0, s \cdot 2^L)$ an integer, the result of the integer division of $z$ by $2^L$ is 0 if $z \in [0, 2^L)$, 1 if $z \in [2^L, 2^L+1)$, and so on. Every interval $[i \cdot 2^L, (i+1) \cdot 2^L)$ contains $[(i+1) \cdot 2^L - i \cdot 2^L]/s = [2^L/s]$ multiples of $s$; as such, the interval $[i \cdot 2^L + (2^L \mod s), (i+1) \cdot 2^L)$ contains $[((i+1) \cdot 2^L - i \cdot 2^L - (2^L \mod s))/s] = [2^L/s]$ multiples of $s$, since $(2^L - (2^L \mod s))$ is multiple of $s$. \hfill $\square$

Lemma 1 implies that rejecting values in the interval $[i \cdot 2^L, (i+1) \cdot 2^L + 2^L \mod s)$ that are multiple of $s$ produces exactly $[2^L/s]$ multiples of $s$. We discuss issues of timing...
Algorithm 6 SHUFFLE($n, c_0, c_1, \text{rnd}$): Fixed-weight sampling by shuffling

Input: $n$
Input: $c_0, c_1$ (prescribed number of coefficients equal to 0, resp. 1)
Input: $\text{rnd}$ (array of random $L$-bit integers; refer to discussion in §4 for its length)
Output: $v$ (output array of $n - 1$ integers)

1: $s_i \leftarrow \text{RejSamplingMod}(n, \text{rnd})$
2: for $i = 0$ to $n - 2$
3:  if $s_i < c_0$ then \> See text for discussion of constant-time implementation
4:     $v[i] \leftarrow 0$
5:  else if $s_i < c_0 + c_1$ then
6:     $v[i] \leftarrow 1$
7:  else
8:     $v[i] \leftarrow -1$
9: return $v$

attack resistance, as well as the choice of the performance-critical parameter $L$ in §4. Algorithm 6 is our proposed shuffling approach for the fixed-weight sampling problem.

Evidently, the main loop of Algorithm 6, as presented, does not execute in constant time due to the use of branches. However, architecture-agnostic standard techniques, as well as architecture-specific conditional instructions, can be used to obtain a branchless, constant-time implementation; see §4. Moreover, all accesses to the arrays $s_i$ and $v$ are performed sequentially. We exploit the fact that $O(1)$ distinct values need to be shuffled (indeed, only $3: -1, 0, 1$), a situation not considered in the usual shuffling algorithms. Intuitively, one could draw an analogy between Fisher-Yates shuffling and selection sort, and by replacing the latter with counting sort, arrive at our proposed algorithm.

Lemma 2 (Correctness and unbiasedness of Algorithm 6). Let $n$ be an integer and $\text{rnd}$ an array of random $L$-bit integers as in Algorithm 6. Let $c_i$ for $i \in \{-1, 0, 1\}$ be the number of coefficients equal to $i$ in the output polynomial, so that $c_{-1} + c_0 + c_1 = n - 1$, and write $\Sigma = \{-1\}^{c_{-1}} \times \{0\}^{c_0} \times \{1\}^{c_1}$. Then, Algorithm 6 produces an array $v = \sigma(\Sigma)$, where $\sigma \in \text{Perm}(\Sigma; c_{-1}, c_0, c_1)$ is an uniformly taken permutation of $\Sigma$ with $c_{-1}, c_0, c_1$ repetitions.

Proof. It is immediate to verify that the output of Algorithm 6 is $v = (\sigma(\Sigma_1), \ldots, \sigma(\Sigma_{n-1}))$ for some $\sigma \in \text{Perm}(\Sigma; c_{-1}, c_0, c_1)$. To show unbiasedness, we proceed by directly computing $P(v = (\sigma(\Sigma_1), \ldots, \sigma(\Sigma_{n-1})))$ for an arbitrary $\sigma$. Recalling that $s_i[j] \sim \mathcal{U}(0, n - 1 - j)$ for $0 \leq j < n - 1$, we have that

$$P(v = (\sigma(\Sigma_1), \ldots, \sigma(\Sigma_{n-1}))) = \prod_{k=0}^{n-1} P(v_k = \sigma(\Sigma_K))$$
$$= \left( \prod_{l=0}^{c_{-1}-1} \frac{c_{-1}-l}{n-1-l} \right) \left( \prod_{m=0}^{c_0-1} \frac{c_0-m}{n-1-c_1-m} \right) \left( \prod_{p=0}^{c_1-1} \frac{c_1-p}{n-1-c_1-c_0-p} \right)$$
$$= \frac{c_{-1}!}{(c_{-1} - 1)!} \frac{c_0!}{(c_0 - 1)!} \frac{c_1!}{(c_1 - 1)!} \cdot \frac{1}{|\text{Perm}(\Sigma; c_{-1}, c_0, c_1)|}$$
where the fourth equality follows from the fact that $n = c_{-1} + c_0 + c_1$. Hence, Algorithm 6 covers all possible permutations $\sigma \in \text{Perm}(\Sigma; c_{-1}, c_0, c_1)$ with uniform probability. 

Lemma 3. Algorithm 6 executes in time $O(n)$ on average, assuming $2^L > 2n$.

Proof. The loop of line 2 clearly executes in time $O(n)$. Thus, the remaining work consists in analyzing Algorithm 5. The outer loop (line 4) consists of $n-1$ iterations. Noticing that $t[i] < n, \forall i$, the condition in line 9 will be satisfied with probability $1 - \frac{n}{2^L} > 50\%$. Thus, the expected number of iterations in the inner loop (line 5) is less than 2, so Algorithm 5 also executes in time $O(n)$ on average.

We remark that the $O(n)$ running time of Algorithm 6 improves upon the $O(n \log^2 n)$ running time of sorting networks typically used for constant-time sorting implementations [BCS13, BCLvV18], such as Butcher’s odd-even merge sort [Knu98].

The algorithm necessarily consumes at least $n-1$ random $L$-bit integers and may, in principle, consume an infinite number of them due to rejections; however, in §4, we show that, for $L = 16$, generating just 4% to 5.5% extra random integers is sufficient in practice.

4 Implementation aspects

Architectural guarantees regarding constant-time execution. Both ARMv8-A and Intel architectures have recently introduced hardware flags that, when set, guarantee constant-time execution of a subset of CPU instructions, which should generally be sufficient to implement most cryptographic algorithms: FEAT_DIT for ARMv8-A [ARM23, Sections A.2.6.1, B.1.3.6, C.5.2.4] and DOIT for Intel [Int23a, Int23b]. We verified that all instructions handling secret data in our ARMv8-A implementations are included in the affected subset.

These new features do not imply that CPUs launched prior to the introduction of these flags execute these instructions in variable time. Indeed, ARM is unaware of older CPUs with variable timing for instructions now covered by FEAT_DIT [ARM]; and Intel advises developers to assume older microarchitectures behave as if DOIT is enabled [Int23a].

This issue has garnered attention at the beginning of 2024, as Apple ARMv8-A cores (which are designed by Apple and not ARM) are subject to a microarchitectural attack called GoFetch [CWS24]; setting the FEAT_DIT bit on the M3 disables the data memory-dependent prefetchers targeted by the attack, rendering it ineffective.

Resistance against timing attacks of Algorithm 5. There are some possible sources of timing leaks in Algorithm 5, which we enumerate and analyze.

The integer multiplication in line 6 must execute in constant-time, which is the norm in modern CPUs\footnote{Note that multiplication instructions are covered by ARMv8-A’s FEAT_DIT and Intel’s DOIT flags.}, although there are rare exceptions such as the ARM Cortex-M3 for $32 \times 32 = 64$-bit multiplications; however, $32 \times 32 = 32$-bit multiplication suffices for the purpose of this algorithm, and there is evidence that it executes in constant time in the Cortex-M3 [dG15, Por18].

Array accesses in lines 6 and 9 use sequential indices; thus, secret data is not leaked.

The loop in lines 5 to 9 performs rejection sampling based on public data, precomputed in line 2: the remainder of $2^L$ divided by integers in the sequence $n - 1, n - 2, \ldots, 1$, where $L$ and $n$ are public parameters. Nevertheless, given the attack of Guo et al. [GHJ22] targeting rejection sampling in fixed-weight sampling algorithms for BIKE and HQC code-based cryptosystems, it is worth analyzing whether a similar attack could apply here. We note that their attack relies on two key assumptions:
1. A high rejection rate, leading to multiple calls to the seedexpander routine (equivalently in our case, the randombytes routine) which creates a timing distinguisher. As discussed later, the rejection rate for our chosen parameter $L = 16$ is sufficiently small that e.g. a full run of Algorithm 5 in the case $n = 509$ has $> 40\%$ probability of no rejections at all. Due to this low rejection rate, we sample enough uniform random integers from the outset so that the probability is overwhelmingly low ($< 2^{-74}$) that extra samples are required, avoiding potential extra calls to randombytes while introducing a negligible overhead.

2. Derivation of the random seed for fixed-weight sampling from secret data – namely, the output of decryption from the reencryption step of decapsulation, as required by the Fujisaki-Okamoto transform for IND-CCA security of the KEM. The attack starts by trial encrypting many candidate messages until finding an $m$ that requires multiple calls to seedexpander, which gives rise to a timing distinguisher (made considerably harder, but not impossible, in our case by the first point above). Carefully constructed perturbations of the resulting ciphertext $c$ are fed to the decapsulation procedure, while using the timing distinguisher to determine whether the decryption step of reencryption outputs the same $m$ or a different message, allowing the attacker to learn information about the secret key. Repeated application of this procedure extracts the vast majority of key material, and the remaining bits are easily found. However, we note that NTRU does not require reencryption due to the rigidity of the NTRU DPKE [CDH+20, Figures 9 and 10]; indeed, the fixed-weight sampling algorithm is not executed at all during either decapsulation or decryption.

Thus, we conclude that Algorithm 5 does not render NTRU vulnerable to the attack of Guo et al [GHJ+22].

Choosing the parameter $L$. The choice of $L$ in Algorithm 5 is a tradeoff between the cost of random number generation and the frequency of rejections; the latter lead to branch mispredictions and costly pipeline flushes in modern, highly-pipelined superscalar CPUs such as some of the ARMv8-A cores considered in this work. If samples are rarely rejected, a SIMD implementation of the algorithm becomes feasible; one can keep track of which lanes were rejected and resample them later (usually with scalar code). To minimize rejection, one must choose $L$ such that $2^L \gg n - i$, but this translates into added cost for random number generation, and thus $L$ should not be unreasonably larger than $n - i$.

We propose $L = 16$ as a natural choice, supported by all scalar and SIMD instruction sets we’re aware of. The next smaller size, 8 bits, is insufficient for half or more of the values to be sampled in the standard NTRU parameter sets, and for most of the intervals where it is sufficient, it would lead to a high rejection rate, running counter to the SIMD philosophy. By exactly matching an available lane size, no bit shifts/masks/permutations are required to load random integers into SIMD registers, further improving performance. It is also the natural choice for storing the 11- or 12-bit NTRU polynomial coefficients; indeed, it is the representation used by the reference code and the state-of-the-art implementations we chose for performance comparisons, requiring no size conversions.

Finally and most importantly, rejections are relatively rare: a block of 16 samples is fully accepted (zero rejections) with probability at least 94.2%, 91.6% and 90.1% for $n = 509, 677$ and $821$, respectively. These are minimum figures, and as $n - i$ decreases, the acceptance probability increases even further. Furthermore, the probability of accepting all $n - 1$ samples (i.e., no rejections at all during a complete execution of the algorithm) is 40.2%, 18.9% and 8.6% for $n = 509, 677$ and $821$ respectively. These figures are obtained by modeling the number of required samples as a sum of geometric random variables and are displayed in a Jupyter notebook accompanying the source code of our implementation.

Due to the low rejection probability, it is sufficient to generate just a few extra random integers over the lower bound of $n$. For each $n$, we computed the cumulative distribution
function $P(x \leq k)$ and sought the minimum $k$ such that $1 - P(x \leq k) < 2^{-74}$, enough to sample $2^{10} > n$ integers for each of $2^{64}$ key exchanges. For $L = 16$, and rounding up to the next multiple of 8 (the number of 16-bit lanes in a NEON register), we find that 536, 704 and 856 random 16-bit integers are sufficient (i.e., an overhead of 5.5%, 4.1% and 4.4%) for $n = 509, 677$ and 821, respectively; these are the suggested lengths for the array \texttt{rnd} in Algorithms 5 and 6. This calculation is included in the aforementioned Jupyter notebook, and can be used to obtain suggested array lengths for other choices of $L$ if desired.

One might argue that $L = 16$ is a “wasteful” choice, as it requires 123%, 109% and 103% more bits than the (unattainable) lower bound of $\log_2(n!)$ bits for $n = 509, 677$ and 821, respectively. Still, we note this is slightly more than half as many random bits as the approach dictated by the NTRU specification [CDH+20], which calls for $30 \times n$ bits.

Taking $L > 16$ appears counterproductive, e.g. due to reduced computational throughput from using larger SIMD lanes. On the other hand, in scenarios where pseudo-random number generation is expensive, SIMD is not available and pipeline flushes have less performance impact (i.e. deeply embedded cores such as the Cortex-M4), choosing $L < 16$ (say, 12 or 10) may result in better overall performance. One might even conceive of an adaptive choice, decreasing $L$ along with $n - i$, although this results in more complex code.

**SIMD implementation of Algorithm 5.** To minimize the execution time of Algorithm 6, we seek to implement Algorithm 5 using SIMD instructions. At first glance, it is unsuitable for SIMD, as some lanes may be rejected while others are accepted during sampling. However, it is possible to sample a whole SIMD register and take note of which lanes, if any, were rejected, to be fixed up later using scalar code (recall that an adequate choice of $L$ ensures that rejections only happen with low probability, so the effect of this fixup procedure on the running time is small.) We also break the dependency on the index $j$ of the random array by using disjoint ranges of the array for SIMD sampling (indices 0 to $n - 2$) and the fixup procedure ($n - 1$ onwards). This idea is captured in Algorithm 7.

In addition to previously discussed issues of timing attack resistance of Algorithm 5, we note that any non-sequential accesses to the array \texttt{rnd} arise from switching between the ranges of indices $0 \leq i + k < n - 1$ and $j \geq n - 1$, that is, they are due to rejections and thus do not leak secret data; accesses within each range are sequential.

Line 10 should use SIMD comparison instructions (e.g. NEON’s \texttt{CMHI} or AVX2’s \texttt{VPCMPGT}). These create a mask with all bits set or clear in the corresponding lane, while Algorithm 7 as written calls for setting and clearing individual bits, a choice made purely for ease of exposition. Actual implementations are advised to tweak the representation to employ groups of bits instead, so as to achieve an efficient implementation of the inner loop of line 6. For instance, \texttt{VPMOVMSKB} is a natural choice in AVX2, resulting in 2-bit mask groups for 16-bit lanes. In NEON, we extract 8-bit masks with \texttt{UZP1}, and reduce them to 4-bit masks using \texttt{SHRN} by 4. NEON’s 128-bit registers suggest a choice of $W = 8$ if $L = 16$. However, we achieved better performance by taking $W = 16$, implemented as an unrolled 2-iteration loop processing 8-element vectors. We attribute this to the fact that converting a mask with \texttt{UZP1} and \texttt{SHRN} costs the same for 8 or 16 values.

**Constant-time implementation of Algorithm 6.** We now discuss how to implement Algorithm 6 in constant-time. First, we rewrite it using the C language’s ternary operator, as shown in Algorithm 8, and then discuss strategies to implement this operator in constant time, firstly as an architecture-agnostic solution, and then consider conditional instructions present in the ARMv8-A, ARMv7-M and Intel architectures. Note that this version replaces $-1$ coefficients by 2; this is not an issue as the sampled polynomial has coefficients in $\mathbb{Z}/32\mathbb{Z}$, and indeed, the reference NTRU code employs the same representation.

Expressions of the form $(x < y) \times -1 : 0$, in lines 4 and 5 of Algorithm 8, can be made constant-time by noticing that, in two’s complement integer arithmetic (used in nearly all
Algorithm 7 SIMD-RejSamplingMod$(n, \text{rnd})$: SIMD version of Algorithm 5.

Input: $n$
Input: $\text{rnd}$ (array of random $L$-bit integers; refer to previous discussion about its length)
Output: $\text{si}$ (output array of $(W+1)\lceil(n-1)/W \rceil$ integer elements, of which only the first $n-1$ entries are valid.)

1: for $i = 0$ to $n-2$ do \hspace{1em} $\triangleright$ Precomputation
2: \hspace{1em} $t[i] = 2^L \mod (n-1-i)$
3: \hspace{1em} $j \leftarrow n - 1$
4: for $i = 0, W, \ldots, W \lceil(n-1)/W \rceil$ do
5: \hspace{2em} $\text{mask} \leftarrow 0$
6: \hspace{3em} for $k = 0$ to $W-1$ do \hspace{1em} $\triangleright$ Loop body implemented using SIMD code
7: \hspace{4em} $m[k] \leftarrow \text{rnd}[i+k] \times (n-1-(i+k))$
8: \hspace{4em} $l[k] \leftarrow m[k] \mod 2^L$
9: \hspace{3em} if $l[k] < t[i+k]$ then
10: \hspace{4em} $\text{mask}_k \leftarrow 1$ \hspace{1em} $\triangleright$ $\text{mask}_k$ denotes the $k$-th bit of $\text{mask}$
11: \hspace{4em} else
12: \hspace{5em} $\text{mask}_k \leftarrow 0$
13: \hspace{2em} while $\text{mask} \neq 0$ do \hspace{1em} $\triangleright$ Loop body implemented using scalar code
14: \hspace{3em} $k = \text{COUNTTRAILINGZEROS}(\text{mask})$
15: \hspace{3em} repeat
16: \hspace{4em} $m' \leftarrow \text{rnd}[j] \times (n-1-(i+k))$
17: \hspace{4em} $j \leftarrow j + 1$
18: \hspace{4em} $l' \leftarrow m \mod 2^L$
19: \hspace{4em} until $l' \geq t[i+k]$
20: \hspace{4em} $\text{si}[i+k] \leftarrow \lfloor m'/2^L \rfloor$
21: \hspace{4em} $\text{mask}_k \leftarrow 0$
22: return $\text{si}$

modern architectures), $-1$ and 0 have all bits set and cleared, respectively. The sign (most significant) bit of $x - y$ is 1 if $x < y$ and 0 otherwise; an arithmetic right shift by $w - 1$ bits, where $w$ is the word size, replicates the sign bit across the entire word. Concretely, the following C code implements line 4 for 16-bit signed integer variables:

\[
t_0 = (\text{si}[i] - c_0) \gg 15;
\]

While already efficient, better performance is achievable. To that end, we analyze the critical path of the main loop of Algorithm 8, shown in Figure 1. We disregard memory loads and stores, which can be removed from the critical path by proper scheduling. For any mobile-, desktop- or server-class modern CPU, one can assume at least a 2-way superscalar pipeline and single-cycle latency for all used operations, in which case the critical path from lines 4 and 5 of one iteration to the next (the bold arrows in the figure) takes 3 cycles.

In ARMv8-A, arithmetic operations can be encoded so that one of the input operands is shifted; thus, a single instruction can compute both $t_0 = t_0 \gg (w-1)$ and $c_0 = c_0 + t_0$. Unfortunately, ARMv8-A CPUs considered in this work, such as the Apple M1 [Joh22] and Cortex-A72 [ARM15], execute these instructions with a 2-cycle latency, offering no gain in performance (but a slight reduction in code size).

By employing ARMv8-A conditional instructions such as \texttt{CINC} and \texttt{CSET}, it is possible to reduce the critical path to 2 cycles. However, Algorithm 8 calls for decrementing $c_0$ and $c_{01}$, and there is no \texttt{CDEC} instruction in ARMv8-A; we modify the algorithm to use negative values for $c_0$, $c_{01}$ and $\text{si}[i]$, so that we can increment $c_0$ and $c_{01}$ using \texttt{CINC} instead. Thus, we arrive at the following code for the algorithm’s main loop:
Algorithm 8 CT-Shuffle\((n,c_0,c_1,\text{rnd})\): Fixed-weight sampling by shuffling, implemented in constant-time

**Input:** \(n\)

**Input:** \(c_0, c_1\) (prescribed number of coefficients equal to 0, resp. 1)

**Input:** \(\text{rnd}\) (array of random \(L\)-bit integers; refer to previous discussion about its length)

**Output:** \(v\) (output array of \(n-1\) integers)

**Notes:** We employ the C language ternary operator ? to denote constant-time selection between two values based on a condition. See text for implementation possibilities.

1: \(s_i \leftarrow \text{SIMD-RejSamplingMod}(n, \text{rnd})\)
2: \(c_{01} \leftarrow c_0 + c_1\) \(\triangleright\) Note this invariant is maintained in the loop body
3: for \(i = 0\) to \(n-2\) do
4: \(t_0 \leftarrow (si[i] < c_0) ? -1 : 0\)
5: \(t_1 \leftarrow (si[i] < c_{01}) ? -1 : 0\)
6: \(c_0 \leftarrow c_0 + t_0\)
7: \(c_{01} \leftarrow c_0 + t_1\)
8: \(v[i] \leftarrow 2 + t_0 + t_1\)
9: return \(v\)

![Critical path of the main loop of Algorithm 8.](image)

There are two critical paths: one from cmp \(c_0, r\) to cinc \(c_0, c_0, lt\) to the next iteration’s cmp \(c_0, r\); and the second for the same instructions involving \(c_{01}\). In all the considered ARM CPUs, all instructions in the code fragment above have single-cycle latency, and thus the loop has the potential to execute in 2 cycles/iteration.

Unfortunately, we run into throughput issues: in the Apple M1, reverse engineering efforts [Joh22] indicate that, although it is capable of executing 6 scalar instructions/cycle, only 3 execution units can execute flag-setting and conditional instructions, i.e. all instructions in the above code fragment. While theoretically sufficient to run the code at maximum throughput, we have observed instruction scheduling issues while attempting to software-pipeline Algorithms 7 and 8, preventing execution at maximum throughput. The following instruction sequence requires more \(\mu\)ops, but performs better in the M1:

```c
subs tmp, c0, si[i]
```

```c
cmp c0, si[i]
cinc c0, c0, lt
cset v[i], ge
```
cinc c0, c0, lt  
add v[i], two, tmp, asr #31  
subs tmp, c01, si[i]  
cinc c01, c01, lt  
add v[i], v[i], tmp, asr #31

We use 32-bit registers (w0, w1, etc.) and initialize two with the constant 2. It is also advantageous for the Cortex-A72, since the add instruction with shifted argument executes in the M pipeline, whereas all other instructions execute in the I0/I1 pipelines. While other bottlenecks come into play in the Cortex-A72, notably its 3-wide instruction decoder, this alternative instruction sequence performs better than the original.

Intel has conditional instructions for conditional moves (CMOVcc) and sets (SETcc), where cc are condition codes, but no conditional increments or decrements. For positive values of c0 and c01, as in the original version of Algorithm 8, an alternative is to decrement c0 and c01 and use CMOV to select between original and decremented values; decrements can execute in parallel with comparisons, thus the critical path is not lengthened.

Unfortunately, Intel instructions do not offer the three-operand form of ARMv8-A and other RISC architectures, so an extra MOV is required to create a copy prior to decrementing in order to avoid overwriting the original values; this doesn’t necessarily increase the critical path, due to MOV elimination [Fog22], but it does increase front-end pressure. Implementers are advised to keep in mind the achievable performance given the critical path, to benchmark and analyze compiler-generated code if employing a high-level language, and to consider inline assembly (or a full assembly language implementation) to emit instructions that are well-matched to the decoder restrictions.

For the ARMv7-M architecture, a straightforward implementation of Algorithm 8, implementing lines 4 and 5 using the arithmetic right shift trick, works really well; this is aided by the ability to shift one of the input operands to data processing (logical and arithmetic) instructions, resulting in very compact and efficient code.

We have experimented with ARMv7-M’s IT instruction, attempting to improve performance compared to the straightforward implementation, but we were unsuccessful. However, we did find an especially compact instruction sequence, devoid of conditional execution instructions, to implement the main loop of Algorithm 8:

cmp si[i], c0  
sbc c0, #0  
sbc v[i], one  
cmp si[i], c01  
sbc c01, #0  
sbc v[i], #-1

We set one to the constant 1. As the straightforward implementation is already efficient, this alternative saves one clock cycle per loop iteration, i.e. < 1000 cycles for the full algorithm. As fixed-weight sampling is performed only once during key generation and encapsulation, the speedup is just < 0.02% for the former and ≈ 0.15% for the latter.

Software pipelining of Algorithms 7 and 8. Modern superscalar CPUs use distinct execution units for scalar and SIMD instructions. Most of the execution time of Algorithm 7 is spent in SIMD code, while Algorithm 8 is strictly scalar. This is amenable to software pipelining [Lam88]. In the best-case scenario, one can achieve execution time close to the maximum, rather than the sum, of the execution times of Algorithms 7 and 8.

Concretely, we inline Algorithm 7 into Algorithm 8, strip-mine the main loop of the latter, and then fuse the outer loops of both algorithms, processing W entries at a time.
With this approach, we were able to achieve, in the Apple M1, execution times only \( \approx 12\% \) slower than the lower bound (2 cycles/iteration) for the main loop of Algorithm 8 alone. This includes all overhead such as function calls and returns, prologue and epilogue, initialization, and of course, the execution of Algorithm 7 itself, as seen in Table 4. The narrow (3-wide) decoder of the Cortex-A72 precludes achieving a similar result as the M1, but by interleaving instructions of both algorithms to improve scheduling, we achieved results not far from the limit dictated by the decoder bandwidth bottleneck.

**Known Answer Tests.** We note that the Known Answer Tests (KATs) in NTRU’s specification [CDH+20] are tightly coupled to the fixed-weight sampling by sorting approach mandated there. Therefore, an implementation employing Algorithm 6 will fail these KATs for key generation and encapsulation. However, our sampled polynomials meet the fixed-weight requirement imposed by NTRU and are in principle indistinguishable from those generated by the existing approach. Thus, keys generated using our algorithm are valid, and the result of an encapsulation employing our algorithm will produce a correct decapsulation even by an unmodified implementation of the current NTRU proposal.

Given the simplicity and improved performance and code size characteristics of Algorithm 6, we suggest that future standardization attempts of NTRU specify our approach instead of sampling by sorting, and generate KATs accordingly. Implementers attempting to replicate our results, whether on ARMv8-A or other architectures, can use unofficial KATs generated by us, included in our source code package.

## 5 Experimental results

We now present experimental results for implementations of our proposed approach for various 64-bit ARMv8-A cores, as well as the 32-bit ARMv7-M Cortex-M4 core.

### 5.1 Methodology

We implemented reference versions of Algorithms 5 and 6, and optimized versions for ARMv7-M and ARMv8-A by replacing Algorithm 6 with Algorithm 8: for ARMv8-A specifically, we replaced Algorithm 5 by a NEON version of Algorithm 7. We integrated the reference and optimized implementations with existing state-of-the-art implementations of NTRU: pqm4 [KPR+] for ARMv7-M and [GFBL24, NG21, CCHY23] for ARMv8-A. KATs were generated using the reference implementation and compared against the optimized implementations; we added tests to ensure interoperability between a conventional implementation (using sampling by sorting) and our proposed approach.

**Testbeds and measurement methods.** Our testbeds for performance measurement, with their corresponding CPU cores, are:

- Apple M1 P-core at 3200 MHz in an Apple MacBook Air laptop running macOS;
- Apple M3 P-core at 4064 MHz in an Apple MacBook Pro laptop running macOS;
- Cortex-A72 at 1500 MHz in a Raspberry Pi 4 single-board computer running Linux;
- Cortex-A57 at 1430 MHz in an Nvidia Jetson Nano single-board computer running Linux;
- Cortex-A53 at 1400 MHz in a Raspberry Pi 3 single-board computer running Linux;

Although NTRU was removed from the most recent version of pqm4, after Kyber was selected in the NIST post-quantum standardization process, we used the most recent version prior to NTRU’s removal.
• Cortex-M4 at 24 MHz in an STM32F4DISCOVERY development board.

Save for the ARMv7-M Cortex-M4 core, the remaining testbeds are ARMv8-A, running in 64-bit mode. Of these, the Apple M1, M3 and Cortex-A57 cores feature ARMv8-A Cryptographic Extensions, but the Cortex-A72 and the Cortex-A53 do not.

Our ARMv8-A performance measurements use the cycle counting routines originally introduced in [NG21]. Each routine is executed for 1,024 times and the average cycle count is reported. ARMv7-M measurements employ the pqm4 [KPR+] benchmarking harness, which counts cycles using the Cortex-M4 SysTick timer. The number of iterations is set to 10, and the mean of results are reported; although this is a small number, the Cortex-M4 core is much simpler and more deterministic than the large out-of-order ARMv8-A cores, thus exhibiting little run-to-run variability.

While, to a first approximation, cycle counts are not influenced by CPU clock speed, there may be second-order effects such as the decoupling of CPU and bus/RAM/cache clocks. Thus, we take precautions to maximize the likelihood that benchmarks are performed at the nominal clock speeds quoted above: we use the performance scaling governor for Linux systems; in Apple systems, as far as we aware, there is no control over clock speeds, and there is no TurboBoost-like feature. In both cases, we try to avoid thermal throttling by inserting delays between benchmark runs to allow systems to cool down. The Cortex-M4 core does not automatically boost/throttle clock speeds; pqm4 configures it to 24 MHz at startup, ensuring all benchmarks run at that fixed clock speed.

ARMv8-A binaries were compiled with Apple clang 15 (Apple M1 and M3), clang 17 (Cortex-A72 and Cortex-A53), and clang 10 (Cortex-A57), with -o3 and core-specific -mcpu optimization flags. ARMv7-M binaries were compiled with gcc 12.2.1, passing the -o speed flag to the pqm4 benchmark script. We enable the FEAT_DIT bit on ARMv8-A cores where it is available (in the case of our testbeds, only the Apple M1 and M3).

ARMv8-A implementation. Our implementation is based on the source code provided by [GFBL24], which contains their AMX implementation and the NEON implementations of [CCHY23, NG21]. As [CCHY23] is the state-of-the-art in NEON implementations, but targets only the HPS2048677 and HRSS701 parameter sets, [NG21] is included to display HPS2048509 and HPS4096821 results. Importantly, [GFBL24] backports optimized auxiliary routines of [CCHY23] to [NG21] (in particular a NEON implementation of constant-time sorting) and provides an optimized implementation of NIST’s randombytes() AES-CTR-DRBG pseudo-random number generator (PRNG), using ARMv8-A Cryptographic Extensions. These routines are critical to the performance of fixed-weight sampling.

For CPUs without ARMv8-A Cryptographic Extensions, the ChaCha20 PRNG of [CCHY23] is used. As KATs are incompatible across different PRNGs, we supply two sets of KATs for validation, using ChaCha20 and AES-CTR-DRBG generators. We ensure that latter matches those provided in the NTRU specification, which uses the same PRNG.

ARMv7-M implementation. pqm4 [KPR+] is the gold standard for Cortex-M4 implementations of PQC schemes. While its NTRU implementation has highly optimized polynomial multiplication and inversion routines, the constant-time sorting routine in use is the portable3 variant of djbsort [Ber19], using a non-architecture-specific implementation of the core minimum/maximum operation of the sorting network. Compiler output inspection reveals that the compiler the minimum/maximum idiom was not recognized, thus generating suboptimal code without using e.g. conditional instructions. We performed some optimization work on this routine, so as to avoid casting our proposed approach in an excessively favorable light. We switched to the more efficient portable4 variant of djbsort, wrote inline assembly versions of the core minimum/maximum operation using conditional operations and a reduced number of memory accesses, and replaced all long long (64-bit) variables by 32-bit long variables to avoid unnecessary use of multi-precision arithmetic,
given that ARMv7-M is a 32-bit architecture. This range reduction does not present an
issue in NTRU due to the small lengths (hundreds of elements) of the arrays to be sorted.
While it is certainly possible to further optimize this routine, further experiments by us
resulted in code size increases, which are undesirable in deeply embedded environments.

Table 1 compares the performance, code size and stack memory usage of encapsulation
in the existing NTRU scheme (using sampling by sorting), for the original pqm4 implementa-
tion and our optimized version in our STM32F4DISCOVERY testbed; we denote these
as “[KPR+] original” and “[KPR+] optimized”, respectively, in Table 1. It is seen that
our optimizations result in large speedups (44–48%) with negligible effect on code size and
none at all on stack usage. While we omit corresponding figures for key generation, our
optimizations also outperformed stock pqm4, although by smaller amounts (5.4–6.0%); code
size and stack usage differences are similar. Results for decapsulation and for the HRSS701
parameter set are not shown, as they do not call the constant-time sorting routine.

Table 1: Comparison of the original pqm4 [KPR+] NTRU implementation and our
optimized version for encapsulation. Code size and stack usage are in bytes. For differences,
positive values denote an increase in the optimized version relative to the original one.

<table>
<thead>
<tr>
<th>Parameter set</th>
<th>Work</th>
<th>Cycle count</th>
<th>Code size</th>
<th>Stack usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>HPS2048509</td>
<td>[KPR+] original</td>
<td>557 131</td>
<td>192 560</td>
<td>14 084</td>
</tr>
<tr>
<td></td>
<td>[KPR+] optimized</td>
<td>386 278</td>
<td>192 632</td>
<td>14 084</td>
</tr>
<tr>
<td>Speedup</td>
<td></td>
<td>44%</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Difference</td>
<td></td>
<td>+0.04%</td>
<td></td>
<td>0%</td>
</tr>
<tr>
<td>HPS2048677</td>
<td>[KPR+] original</td>
<td>801 357</td>
<td>282 292</td>
<td>19 996</td>
</tr>
<tr>
<td></td>
<td>[KPR+] optimized</td>
<td>546 570</td>
<td>282 364</td>
<td>19 996</td>
</tr>
<tr>
<td>Speedup</td>
<td></td>
<td>47%</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Difference</td>
<td></td>
<td>+0.03%</td>
<td></td>
<td>0%</td>
</tr>
<tr>
<td>HPS4096821</td>
<td>[KPR+] original</td>
<td>997 084</td>
<td>370 808</td>
<td>23 436</td>
</tr>
<tr>
<td></td>
<td>[KPR+] optimized</td>
<td>672 825</td>
<td>370 884</td>
<td>23 436</td>
</tr>
<tr>
<td>Speedup</td>
<td></td>
<td>48%</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Difference</td>
<td></td>
<td>+0.02%</td>
<td></td>
<td>0%</td>
</tr>
</tbody>
</table>

5.2 Performance figures and analysis

We present performance figures for NTRU KEM key generation and encapsulation in
Tables 2 (for Apple SoCs) and 3 (for ARM Cortex cores); decapsulation does not employ
fixed-weight sampling, thus its performance is unaffected by our proposed approach. We
present NEON results from the implementations of [NG21] for the HPS2048509 and
HPS4096821 parameter sets, and [CCHY23] for the HPS2048677 and HRSS701 parameter
sets. AMX results are from the implementation of [GFBL24]. We emphasize that all
ARMv8-A implementations use the NEON optimized constant-time sorting routine of
[CCHY23]. For the Cortex-M4 core, we use the implementation of [KPR+], incorporating
our optimizations for constant-time sorting. We present performance results as cycle
counts, calculating speedups as $c_{\text{sorting}} / c_{\text{shuffling}} - 1$.

Results for the shuffling approach consist in replacing the sample_fixed_type routine
by our proposed algorithms, and adjusting the amount of uniform random bits to match
the requirements of the shuffling algorithms, as discussed in Section 4.

We also present performance figures for fixed-weight sampling, by measuring calls
to the sample_fixed_type routine, whose results are presented in Table 4. Finally, we
present code size (Flash) and stack (RAM) usage figures for the Cortex-M4 in Table 5.
Table 2: Cycle counts (in kilocycles) for NTRU KEM key generation (KG) and encapsulation (Enc.) in the Apple M1 and M3 SoCs.

<table>
<thead>
<tr>
<th>Param. Set</th>
<th>Sampling</th>
<th>Apple M1</th>
<th>Apple M3</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>NEON</td>
<td>AMX</td>
</tr>
<tr>
<td></td>
<td></td>
<td>KG</td>
<td>Enc.</td>
</tr>
<tr>
<td>509</td>
<td>Sorting</td>
<td>218</td>
<td>16.1</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>214</td>
<td>12.5</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>1.7%</td>
<td>29%</td>
</tr>
<tr>
<td>677</td>
<td>Sorting</td>
<td>307</td>
<td>20.6</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>309</td>
<td>14.9</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>-0.7%</td>
<td>39%</td>
</tr>
<tr>
<td>821</td>
<td>Sorting</td>
<td>498</td>
<td>28.0</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>491</td>
<td>21.8</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>1.3%</td>
<td>28%</td>
</tr>
<tr>
<td>701</td>
<td>N/A</td>
<td>323</td>
<td>14.6</td>
</tr>
<tr>
<td></td>
<td>Slowdown vs. 677 sorting</td>
<td>5.4%</td>
<td>-29%</td>
</tr>
<tr>
<td></td>
<td>Slowdown vs. 677 shuffling</td>
<td>4.7%</td>
<td>-1.7%</td>
</tr>
</tbody>
</table>

Key generation and encapsulation. Our proposed approach achieved performance improvements across the board, for both key generation and encapsulation, save for a few outliers in the former. For Cortex-M4, these improvements come at a negligible cost to code size (Flash), and even a slight improvement in stack (RAM) usage, as seen in Table 5.

With regards to key generation, we see improvements of up to 2.7% for ARMv8-A cores and 6.1% for the Cortex-M4. We recall that NTRU key generation is computationally expensive; disregarding simpler operations, it requires a modulo-$q$ inversion (usually realized by a modulo-2 inversion followed by 8 multiplications), a modulo-3 inversion, 5 extra multiplications, 2 different types of sampling (including `sample_fixed_type`) and pseudo-random number generation. Therefore, it is not surprising that optimizing a single sampling routine results in limited performance improvements.

Results are more significant for encapsulation, which are arguably of more interest than key generation, seeing as, for most cryptographic applications, the former will be run far more often than the latter. We see improvements of up to 44% and 50% for NEON and AMX implementations in ARMv8-A, and 71% for the Cortex-M4. Improvements correlate well with polynomial multiplication performance, which is fastest for NEON in the HPS2048677 parameter set (based on the faster TMVP approach of [CCHY23]) and in AMX implementations; this is expected due to Amdahl’s law.

Fixed-weight sampling. Table 4 shows that our shuffling approach significantly improves performance of fixed-weight sampling compared to the sampling by sorting approach of previous works. We see very significant speedups for all platforms: up to 6.91 (591%) in ARMv8-A cores and 12.58 (1158%) in the Cortex-M4. Measurements do not include the cost of pseudo-random number generation (i.e. the `randombytes` routine), which is highly platform-dependent; recall that our approach requires slightly more than half as many pseudo-random bytes as sampling by sorting.

Comparison with NTRU-HRSS. It is instructive to compare NTRU-HPS2048677 to NTRU-HRSS701, as both are designed to the same NIST security level. Fortunately,
Table 3: Cycle counts (in kilocycles) for NTRU KEM key generation (KG) and encapsulation (Enc.) in ARM Cortex cores.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>509</td>
<td>Sorting</td>
<td>884</td>
<td>71.2</td>
<td>884</td>
<td>61.5</td>
<td>1243</td>
<td>111</td>
<td>2674</td>
<td>386</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>860</td>
<td>54.1</td>
<td>892</td>
<td>51.8</td>
<td>1213</td>
<td>86.7</td>
<td>2520</td>
<td>234</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>2.7%</td>
<td>32%</td>
<td>-0.8%</td>
<td>19%</td>
<td>2.4%</td>
<td>28%</td>
<td>6.1%</td>
<td>65%</td>
</tr>
<tr>
<td>677</td>
<td>Sorting</td>
<td>1140</td>
<td>77.4</td>
<td>1146</td>
<td>65.1</td>
<td>1636</td>
<td>121</td>
<td>4317</td>
<td>547</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>1123</td>
<td>55.3</td>
<td>1123</td>
<td>45.3</td>
<td>1600</td>
<td>84.9</td>
<td>4094</td>
<td>327</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>1.5%</td>
<td>40%</td>
<td>2.1%</td>
<td>44%</td>
<td>2.3%</td>
<td>43%</td>
<td>5.4%</td>
<td>67%</td>
</tr>
<tr>
<td>821</td>
<td>Sorting</td>
<td>2156</td>
<td>135</td>
<td>2131</td>
<td>121</td>
<td>2993</td>
<td>197</td>
<td>5705</td>
<td>673</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>2111</td>
<td>109</td>
<td>2102</td>
<td>95.2</td>
<td>2951</td>
<td>156</td>
<td>5149</td>
<td>395</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>2.1%</td>
<td>24%</td>
<td>1.4%</td>
<td>27%</td>
<td>1.4%</td>
<td>26%</td>
<td>5.3%</td>
<td>71%</td>
</tr>
<tr>
<td>701</td>
<td>N/A</td>
<td>1200</td>
<td>58.0</td>
<td>1200</td>
<td>53.6</td>
<td>1579</td>
<td>77.7</td>
<td>4188</td>
<td>368</td>
</tr>
<tr>
<td>Slo慢down vs. 677 sorting</td>
<td>5.2%</td>
<td>-25%</td>
<td>4.7%</td>
<td>-18%</td>
<td>-3.5%</td>
<td>-36%</td>
<td>-3.0%</td>
<td>-33%</td>
<td></td>
</tr>
<tr>
<td>Slo慢down vs. 677 shuffling</td>
<td>6.8%</td>
<td>4.8%</td>
<td>6.9%</td>
<td>18%</td>
<td>-1.3%</td>
<td>-8.5%</td>
<td>2.3%</td>
<td>13%</td>
<td></td>
</tr>
</tbody>
</table>

Table 4: Cycle counts (in kilocycles) for fixed-weight sampling, excluding the cost of uniform random number generation.

<table>
<thead>
<tr>
<th>Param. Set</th>
<th>Sampling</th>
<th>Apple M1</th>
<th>Apple M3</th>
<th>Cortex-A72</th>
<th>Cortex-A57</th>
<th>Cortex-A53</th>
<th>Cortex-M4</th>
</tr>
</thead>
<tbody>
<tr>
<td>509</td>
<td>Sorting</td>
<td>4.49</td>
<td>4.36</td>
<td>13.0</td>
<td>14.0</td>
<td>23.2</td>
<td>152</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>1.11</td>
<td>1.06</td>
<td>2.28</td>
<td>2.30</td>
<td>4.97</td>
<td>13.3</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>4.04×</td>
<td>4.12×</td>
<td>5.71×</td>
<td>6.11×</td>
<td>4.67×</td>
<td>11.45×</td>
</tr>
<tr>
<td>677</td>
<td>Sorting</td>
<td>6.43</td>
<td>6.25</td>
<td>19.5</td>
<td>21.2</td>
<td>35.1</td>
<td>221</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>1.49</td>
<td>1.41</td>
<td>3.11</td>
<td>3.07</td>
<td>6.79</td>
<td>18.4</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>4.32×</td>
<td>4.42×</td>
<td>6.25×</td>
<td>6.91×</td>
<td>5.18×</td>
<td>12.05×</td>
</tr>
<tr>
<td>821</td>
<td>Sorting</td>
<td>7.65</td>
<td>7.48</td>
<td>22.6</td>
<td>24.9</td>
<td>40.9</td>
<td>280</td>
</tr>
<tr>
<td></td>
<td>Shuffling</td>
<td>1.79</td>
<td>1.71</td>
<td>3.75</td>
<td>3.77</td>
<td>8.12</td>
<td>22.3</td>
</tr>
<tr>
<td></td>
<td>Speedup</td>
<td>4.26×</td>
<td>4.37×</td>
<td>6.03×</td>
<td>6.61×</td>
<td>5.03×</td>
<td>12.58×</td>
</tr>
</tbody>
</table>

the state-of-the-art NEON implementation of [CCHY23] implements both parameter sets, allowing for a fair comparison. Tables 2 and 3 include rows marked “Slowdown vs. 677 sorting” and “Slowdown vs. 677 shuffling”, computed as $c_{701}/c_{677} - 1$; thus, positive values indicate that HRSS701 is slower than HPS2048677, and the contrary for negative values.

Even with the sampling by sorting approach, HPS2048677 is usually faster than HRSS701 for key generation, with the exception of the Cortex-A53 and Cortex-M4 cores; with the shuffling approach, HPS2048677 key generation always outperforms HRSS701. As for encapsulation, HPS2048677 using sampling by sorting was significantly slower in all cases, by up to 33% in Apple SoCs and the Cortex-M4, and 36% in ARMv8-A Cortex cores. The shuffling approach closes this gap, with HPS2048677 slower by at most 4.2% in Apple SoCs, and 8.5% in the Cortex-A53; for other ARMv8-A cores, HPS2048677 is actually faster, by up to 18%, and in the Cortex-M4, it is also faster by 13%.
Table 5: Code size (Flash) and stack (RAM) usage, in bytes, for ARMv7-M binaries. Statically allocated data (.data and .bss sections) were reported as zero in all cases. “Diff.” refers to the percentual difference between implementations; positive values denote an increase in our version relative to [KPR⁺].

<table>
<thead>
<tr>
<th>Param. set</th>
<th>Work</th>
<th>Code size</th>
<th>Stack usage</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Key gen.</td>
<td>Encaps.</td>
</tr>
<tr>
<td>HPS 2048509</td>
<td>[KPR⁺]</td>
<td>192632</td>
<td>21376</td>
</tr>
<tr>
<td></td>
<td>Ours</td>
<td>193092</td>
<td>20544</td>
</tr>
<tr>
<td></td>
<td>Diff.</td>
<td>+0.2%</td>
<td>−3.9%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>−6.0%</td>
</tr>
<tr>
<td>HPS 2048677</td>
<td>[KPR⁺]</td>
<td>282364</td>
<td>28480</td>
</tr>
<tr>
<td></td>
<td>Ours</td>
<td>283164</td>
<td>27352</td>
</tr>
<tr>
<td></td>
<td>Diff.</td>
<td>+0.3%</td>
<td>−4.0%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>−5.6%</td>
</tr>
<tr>
<td>HPS 4096821</td>
<td>[KPR⁺]</td>
<td>370884</td>
<td>35216</td>
</tr>
<tr>
<td></td>
<td>Ours</td>
<td>371964</td>
<td>33856</td>
</tr>
<tr>
<td></td>
<td>Diff.</td>
<td>+0.3%</td>
<td>−3.9%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>−5.8%</td>
</tr>
<tr>
<td>HRSS701</td>
<td>[KPR⁺]</td>
<td>265264</td>
<td>27528</td>
</tr>
</tbody>
</table>

6 Conclusion

In this work, we showed that timing attack-resistant fixed-weight sampling can be performed without using constant-time sorting. We have proposed a new algorithm (Algorithm 8) which achieves a running time of \(O(n)\), an improvement over \(O(n \log^2(n))\) for previous, sorting network-based approaches. This results in performance improvements in actual implementations across a range of different platforms, from deeply embedded to high-performance laptop CPUs. Additionally, the amount of random data needed for sampling is reduced by almost half, which is advantageous for architectures without instructions to accelerate cryptographically secure PRNGs. Moreover, our proposed method may be simpler to implement in an optimized fashion than constant-time sorting networks.

This solves a long-standing open problem: to date, the best alternative was the NTRU-HRSS variant, which also seeks to eliminate the cost of constant-time sorting required for sampling fixed-weight polynomials. As discussed in §5, a modified NTRU-HPS2048677, using our proposed approach, closes the performance gap to NTRU-HRSS701 (recalling that both are designed to the same NIST security level). We also note that key and ciphertext sizes for NTRU-HPS2048677 are smaller: 930 (resp. 1138) bytes for the public key and ciphertext, and 1234 (resp. 1450) bytes for the private key, for NTRU-HPS2048677 (resp. NTRU-HRSS701). Finally, the need to support both NTRU-HPS and NTRU-HRSS to achieve different security levels results in increased implementation complexity, e.g., due to the HRSS-specific version of Lift [CDH⁺20, §1.9.3] and the additional Ternary_Plus sampling routine [CDH⁺20, §1.10.4]. In light of these arguments, we call into question the need for a separate NTRU-HRSS parameter set.

Future work Although NTRU is no longer being considered by NIST, we recall that it has been standardized in other forums [IEE09, ASC10]. Since our proposed Algorithm 8 improves upon the existing fixed-weight sampling by sorting approach mandated by the NTRU specification submitted to NIST [CDH⁺20], we suggest amending NTRU specifications to use Algorithm 8, and incorporating it into any future standardization efforts (for instance, we note that FrodoKEM [BCD⁺16] is also no longer under consideration by NIST, but is being considered for standardization by ISO [Int23c]). We also suggest developing implementations for other widely-used architectures, in particular, Intel
AVX2 and AVX-512 SIMD extensions) and the recently released ARMv8.1-M Helium SIMD instruction set for deeply embedded systems [Dir19].

Algorithm 8, as stated, is not amenable to vectorization, due to a loop-carried dependency between iterations of its main loop. Using a similar idea as the initial step of MergeShuffle (Algorithm 4), vectorization becomes possible; we developed a prototype implementation that confirms its potential for large speedups, especially on wide CPUs such as the M1 and M3. However, without applying the remaining steps of MergeShuffle, the resulting permutation is biased, which may create an avenue of attack. An alternative we envisioned involves sampling from the hypergeometric distribution; however, this is an uncommon distribution in cryptography, and we were unable to find any efficient, constant-time algorithms. We invite future work into either modifying MergeShuffle to be constant-time, or to propose efficient, constant-time hypergeometric sampling algorithms.

References


Efficient isochronous fixed-weight sampling with applications to NTRU


