Low-Latency ASIC Algorithms of Modular Squaring of Large Integers for VDF Applications

Ahmet Can Mert, Erdinç Öztürk, Erkay Savaş, Member, IEEE

Abstract—This study is an attempt in quest of the fastest hardware algorithms for the computation of the verifiable delay function (VDF), \(a^2 \mod N\), proposed for use in various distributed protocols, in which no party is assumed to compute it significantly faster than other participants. To this end, we propose a class of modular squaring algorithms suitable for low-latency ASIC implementations. The proposed algorithms aim to achieve highest levels of parallelization that have not been explored in previous works in the literature, which usually pursue more balanced optimization of speed and area. For this, we utilize redundant representations of integers and introduce three modular squaring algorithms that work with integers in redundant forms: i) Montgomery algorithm, ii) memory-based algorithm and iii) direct reduction algorithm for fixed moduli. All algorithms enable \(O(\log k)\) depth circuit implementations, where \(k\) is the bit-size of the modulus \(N\) in the VDF function. We analyze and compare gate level-circuits of the proposed algorithms and provide estimates for their critical path delay and gate count.

Index Terms—Verifiable Delay Functions, Modular Squaring, Reduction, Montgomery, Redundant Representation

1 INTRODUCTION

A Verifiable Delay Function (VDF) is an inherently sequential operation, which takes some prescribed and modifiable amount of time for its computation. The term verifiable emerges from the fact that the computation needs to be verified fairly quickly, without re-doing the entire costly operation. Any function that is inherently sequential, cryptographically secure and publicly and efficiently verifiable can be considered a VDF.

The two recent VDF constructions proposed by Pietrzak [1] and Wesolowski [2] based on time-lock puzzles are analyzed by Boneh et al. in [3]. The notion of time-lock puzzles, introduced by Rivest, Shamir, and Wagner [4] in 1999, is built upon an inherently sequential and cryptographically secure mathematical operation: exponentiation in a group of unknown order [5]. The exponent utilized for time-lock puzzles is of special form, which transforms the exponentiation operation into repeated squaring operations; e.g, \(a^{2^{k}} \mod N\).

Since time-lock puzzles are constructed upon repeated modular squaring operations, VDF constructions utilizing time-lock puzzles rely on efficient implementations of low-latency modular squaring algorithms, optimized for repeated-squaring setting. As the entire exponentiation operation, rather than a single modular squaring, calls for a low-latency design, any redundant representation and lazy reduction technique can be utilized to accelerate intermediate modular squaring computations. While there is a plethora of efficient implementations of modular multiplication and modular squaring in the literature [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], they mostly focus on throughputs, utilizing the time-area metric for measuring performance. Yet, for a VDF implementation, the most important metric is time itself as the computation of VDF should take the minimum amount of time possible.

There are well-studied high-level algorithms and methods enabling efficient modular multiplication implementations. One of the most commonly used modular multiplication algorithm is due to Montgomery [17]. The Montgomery modular multiplication algorithm realizes \(C' = A \cdot B \cdot R^{-1} \mod N\) instead of the desired \(C = A \cdot B \mod N\) result. To compute the desired result, pre- and post-processing steps are usually required. Therefore, the Montgomery modular multiplication algorithm is efficient only for applications involving many modular arithmetic operations, such as exponentiation, due to inherent pre-processing and post-processing overheads.

Barrett reduction [18] can also be utilized for implementing modular multiplication operations. It computes the desired outcome \(C = A \cdot B \mod N\) directly, which makes it a better choice for a single modular multiplication scheme. Although Barrett and Montgomery reduction algorithms are similar in terms of complexity, they yield different performance figures in different implementation settings [19], [20].

There are algorithms that allow lower circuit depth than classical multiplication operation [21], [22]. Schonhage-Strassen is an NTT-based method presented in 1971 that can multiply two \(n\)-bit integers in \(O(n \log n \log \log n)\) time [23]. Harvey et al. presented an improved algorithm that can multiply two \(n\)-bit integers in \(O(n \log n \log^{2} \log^{2} n)\) operations [24], [25]. However, efficient low-latency hardware implementations of these schemes do not exist in literature. While a theoretical log-depth circuit algorithm is presented in [22], no practical low-latency implementation of this algorithm has been reported in the literature to the best of our knowledge.

Depending on specific application and computation platform used for implementation, different constraints and design goals can be adopted in the design and implementation of a modular multiplication circuit. For instance, a low-power design is usually targeted for implementations in resource-constrained devices. For data center applications, an implementation with high throughput is usually the primary requirement. For typical client applications, a public-key operation finished in milliseconds range is sufficient for a practical implementation. Consequently, low-latency is hardly the predomin-
integer multiplication, there are a total of \( k \) bitwise in-
...concepts. We show that the proposed algorithms enable
O(log \( k \)) depth circuit realizations the multiplication operation is performed of integers. We present
an explicit algorithm for Dadda tree. Section 4 presents
two other modular reduction algorithms: i) memory-
AND (\( \land \)) operations and sequential addition operations resulting
in prohibitively long carry chains. Therefore, in hardware
realizations the multiplication operation is performed in
vectors. Algorithm 1 elaborates this method. In the first
phase, all partial product bits are calculated using \( k^2 \) logical-
AND gates in parallel and grouped in the partial product list
\( t \) as shown in Steps 2-6 of Algorithm 1. Here, a list is a
two-dimensional data structure with different sized columns. Fur-
thermore, for \( i = 0, \ldots, 2k-1 \), \( t_i \) s are the columns of the product
list \( t \), where each column is a multiset of partial product bits of
the same weight. Each partial product bit is appended (using the set union operation \( \cup \)) in the corresponding column and
no addition operation is performed in the first phase.

Then in the second phase (Step 7 of Algorithm 1), the partial
product bits are summed using efficient adder tree constructions such as Wallace [28] or Dadda Trees [29]. The res-
ult \( c \) is in the redundant C-S form, which means most of \( c_i \)
consist of two bits: the carry bit \( c_{i,1} \) and the save bit \( c_{i,0} \). Since
the result \( c \) is in redundant form, it is transformed into non-
redundant form using fast carry propagation adders (CPAs)
[30], [31]. As our operands are very large, the CPA still creates
very long carry chains. In this paper, we propose algorithms
either to shorten the carry chain of CPA or eliminate CPA altogether from the hardware implementation for low-latency
applications such as VDF as stated in [26]. To this end, we
utilize various redundant forms for integers.

**Example 1.** Fig. 1 illustrates the operations of Algorithm 1 for
4-bit operands (\( k = 4 \)). As observed from the figure, the columns of the partial product list have different number of
bits.

<table>
<thead>
<tr>
<th>( a_0 )</th>
<th>( a_1 )</th>
<th>( a_2 )</th>
<th>( a_3 )</th>
<th>( b_0 )</th>
<th>( b_1 )</th>
<th>( b_2 )</th>
<th>( b_3 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>( a_0 \land b_0 )</td>
<td>( a_1 \land b_0 )</td>
<td>( a_2 \land b_0 )</td>
<td>( a_3 \land b_0 )</td>
<td>( a_0 \land b_1 )</td>
<td>( a_1 \land b_1 )</td>
<td>( a_2 \land b_1 )</td>
<td>( a_3 \land b_1 )</td>
</tr>
<tr>
<td>( a_0 \land b_2 )</td>
<td>( a_1 \land b_2 )</td>
<td>( a_2 \land b_2 )</td>
<td>( a_3 \land b_2 )</td>
<td>( a_0 \land b_3 )</td>
<td>( a_1 \land b_3 )</td>
<td>( a_2 \land b_3 )</td>
<td>( a_3 \land b_3 )</td>
</tr>
<tr>
<td>( a_0 \land b_4 )</td>
<td>( a_1 \land b_4 )</td>
<td>( a_2 \land b_4 )</td>
<td>( a_3 \land b_4 )</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**Fig. 1. The Operations of Algorithm 1 for \( k = 4 \)**

### 2.1 Bitwise Integer Multiplication

In a straightforward gate-level implementation of bitwise integer multiplication, there are a total of \( k^2 \) bitwise logical-

### Algorithm 1 Bitwise Integer Multiplication Suitable for Hardware Implementation

**Input:** \( a = (a_{k-1}, \ldots, a_1, a_0) \)

\( b = (b_{k-1}, \ldots, b_1, b_0) \)

**Output:** \( c = (c_{2k-1}, \ldots, c_1, c_0) \) where \( c = a \cdot b \)

1. for \( i \) from 0 to \((2k-1)\) do \( t_i \leftarrow (0) \) end for
2. for \( i \) from 0 to \((k-1)\) do
3. for \( j \) from 0 to \((k-1)\) do
4. \( t_{i+j} \leftarrow t_{i+j} \cup (b_j \land a_j) \)
5. end for
6. end for
7. \( t \leftarrow ADDERTREE(t) \)
8. \( c \leftarrow CPA(t) \)

\( \Rightarrow \) First Phase

\( \Rightarrow \) Second Phase

and the only design metric for most cases as others such as
cost and time-area are almost always taken into consideration.

In a typical application of VDF verifiable lottery using a
randomness beacon [5], no participant is allowed to compute
VDF significantly faster than others. Namely, all participants
are supposed to be equipped with sufficiently fast hardware for
the VDF computation. Consequently, for the RSA-based VDF
construction [1], [5], [2], metrics such as throughput, power and
area become either irrelevant or less important emphasizing
the need for a low-latency design. This calls for more aggressive
acceleration techniques, which have not been explored in the
literature before except in [26]. In [26], the author discusses a
low-latency modular multiplication algorithm and construction
for a known modulus, targeting FPGA architecture for proof of
concept performance.

**Contribution:** In this paper, highly parallel, regularly struc-
tured, and bitwise modular squaring algorithms are explored
and proposed for efficient low-latency ASIC implementations.
The algorithms utilize various redundant representations to
reduce circuit depth. To this end, we first introduce a vari-
ant of the Montgomery reduction algorithm that works with
integers in redundant representations, requires no final sub-
traction and utilizes incomplete arithmetic [27]. Then, we
explore two other modular reduction algorithms: i) memory-
based reduction for variable moduli and ii) direct reduction for
fixed moduli. We study the modular multiplication algo-

Section 5 describes Redundant-Polynomial (R-P) representation for integers and presents its utilization
in different modular squaring methods. Section 6 compares
the proposed algorithms and gives estimates for their ASIC
implementations. Section 7 concludes the paper.

### 2 BACKGROUND

Throughout the paper, a \( k \)-bit integer \( a \) is represented in radix 2
as \( a = (a_{k-1}, \ldots, a_1, a_0) \), where \( a_i \in \{0, 1\} \). When a radix larger
than 2 is used, we use uppercase letters to represent the digits
of integers; e.g., \((A_{k-1}, \ldots, A_1, A_0)\), where \( A_i \) is the \( i^{th} \) digit and
\( A_i < 2^r \) with \( r > 1 \).

#### 2.1 Bitwise Integer Multiplication

In a straightforward gate-level implementation of bitwise in-
teger multiplication, there are a total of \( k^2 \) bitwise logical-

#### Example 1.** Fig. 1 illustrates the operations of Algorithm 1 for
4-bit operands (\( k = 4 \)). As observed from the figure, the columns of the partial product list have different number of
bits.
### 2.2 Bitwise Integer Squaring

Algorithm 2 for integer squaring is a simplified version of Algorithm 1 for the case \( b = a \), in which \( a_i \land b_j = a_i \land b_j \) for \( i \neq j \). As we have \( a_i \land a_j + a_j \land a_i = 2 \cdot (a_i \land a_j) \) and multiplication by 2 is realized via a simple shift operation, the weight of \( (a_i \land a_j) \) becomes \( i + j + 1 \), instead of \( i + j \) (see Step 5 of Algorithm 2).

**Algorithm 2 Bitwise Integer Squaring**

**Input:** \( a = (a_{k-1}, \ldots, a_0) \)

**Output:** \( d = (d_{2k-1}, \ldots, d_1, d_0) \), where \( d = a^2 \)

1. for \( i \) from 0 to \( (2k-1) \) do \( t_i \leftarrow \{0\} \) end for
2. for \( i \) from 0 to \( (k-1) \) do \( \triangleright \) First Phase
   3. \( t_{2i} \leftarrow t_{2i} \cup a_i \)
   4. for \( j \) from \( (i+1) \) to \( (k-1) \) do \( \triangleright \) Second Phase
   5. \( t_{i+j+1} \leftarrow t_{i+j+1} \cup (a_i \land a_j) \)
   6. end for
7. end for
8. \( t \leftarrow \text{adderTree}(t) \)
9. \( d \leftarrow \text{cpa}(t) \)

Algorithm 2 shows that there is a total of \( (k^2 + k)/2 \) logical-AND operations, all of which can be performed in parallel. Thus, the circuit delay for the first phase is equivalent to the delay of an AND gate. For the second phase, the circuit delay is determined by the delay of the adder tree, which is logarithmic with respect to the input size \( k \) as shown in Section 3.

**Example 2.** Fig. 2 illustrates the operations of Algorithm 2 for \( k = 4 \). Note that the column \( t_1 \) in the partial product list is an empty set.

We can generalize the C-S representation to have an arbitrary number of bits for each weight as

\[
a = \sum_{i=0}^{k-1} (a_{i,0} + a_{i,1} + \ldots + a_{i,k_i}) 2^i,
\]

where \( i_v \geq 0 \) is an integer. Also, we can combine all the bits with the same weight into a set of bits; e.g., \( a_i = (a_{i,0}, a_{i,1}, \ldots, a_{i,k_i}) \) for \( i = 0, \ldots, k - 1 \). This new notation is especially useful to denote the bits accumulated in the adder tree in the first phase of Algorithms 1 and 2 before the addition operation.

**Example 3.** Using the new notation, the result of the first phase of the bitwise squaring operation of a 4-bit integer \( a, d = a^2 \) in Example 2 can be written in generic redundant representation as \( t_6 = \{t_0, 0\}, t_1 = \{0, 0\}, t_2 = \{t_2, 0, t_2, 1\}, t_3 = \{t_3, 0\}, t_4 = \{t_4, 0, t_4, 1, t_4, 2\}, t_5 = \{t_5, 0\}, t_6 = \{t_6, 0, t_6, 1\} \), where \( t_0, 0 = a_0 \), \( t_2, 0 = a_0 \land a_1, t_2, 1 = a_1 \), \( t_3, 0 = a_0 \land a_2, t_4, 0 = a_0 \land a_3, t_4, 1 = a_1 \land a_2, t_4, 2 = a_2, t_5, 0 = a_1 \land a_3, t_6, 0 = a_2 \land a_3, \) and \( t_6, 0 = a_3 \).

As shown in Example 3, the sets for each weight can contain different number of bits. We adopt the notation that \( |t_i| \) is the number of elements in the set \( t_i \) and \( |t| \) is the number of sets in the redundant representation. Also, \( \text{depth}(t) = \max |t_0|, |t_1|, \ldots, |t_k| \). For example, the classical C-S representation has \( \text{depth}(t) = 2 \) while the redundant representation in Example 3 has depth 3. The actual summation operation can be performed using adder tree constructions as explained in Section 3.

### 3 Adder Tree Constructions

In [28], Wallace proposed adder tree circuits to eliminate long carry chains in Algorithms 1 and 2 using counters, which are combinational logic elements that output the number of ones in their inputs. The two most common counters are \((2 \times 2) \) half adder (HA) and \((3 \times 2) \) full adder (FA) counters. Other counters include \((7 \times 3),(15 \times 4)\), and \((31 \times 5)\) counters.

The Wallace tree consists of layers, in each of which we have a different redundant representation of the same integer; e.g., the result of the squaring operation. A layer is of lower depth than the previous layer and the last layer is of depth two. In a layer, there are counters acting on the bits of the columns of the partial product list \( t \). As the counters in one layer works in parallel, they are referred as parallel counters in [29]. Consequently, the delay of a Wallace tree layer is equivalent to the delay of the slower counter employed in the layer.

There are different adaptations of Wallace trees. For example, the greedy Wallace tree working with HAs and FAs combines the sets in redundant representation in each layer into three-bit groups and apply each group to the inputs of a FA. If there are two bits not included in any group, they are applied to the input of a HA. When there is only a single bit left, it will be copied to the next layer as is. Both FA and HA have two outputs: carry and sum. The sum output will be copied to the set with the same weight in the next layer while the carry output will be appended to the set with one larger weight in the next layer.

An alternative adaptation performs the grouping according to the deepest set; e.g., \( t_{k-1} \) for \( k \)-bit multiplication. The bits of this set are grouped in three. If there are two bits in any other set corresponding to a three-bit group in the longest set, then they are applied to the inputs of a HA. If there is a single bit
in such situations, it will be copied to the next layer as is. If there are one or two rows left after grouping, they will directly be copied to the next layer, as well. The next example illustrates this adaptation.

Example 4. The integer multiplication operation in Example 1 results in a redundant representation of depth 4 after the first phase as shown in Fig. 3 (see Layer I). The Wallace tree first reduces the depth to 3 and then to 2, as can be observed in Layer II of the circuit and the resulting integer in C-S form, respectively. The boxes with two and three dots stand for HA and FA, respectively. The Wallace tree employs 3 HAs and 5 FAs, in total.

![Layer I and Layer II](image)

Fig. 3. The Wallace Tree Layers for k = 4

After the final layer of a Wallace tree, the resulting integer is in redundant representation of depth two, which is the regular C-S form. For the final result, a CPA is used to add carry and save parts. As all HAs and FAs work independently and thus in parallel to each other, the time delay of one layer of Wallace tree is always equivalent to one FA delay.

When FA and HA are used, we can give a lower bound for the number of layers \( L(h) \) to reduce a redundant representation of depth \( h \) to depth 2 representation [32]:

\[
L(h) \geq \log_{1,5}(h/2).
\]  

(2)

The multiplication of two \( k \)-bit integers and squaring of a \( k \)-bit integer lead to trees of depth \( k \) and \( [k/2] + 1 \), respectively. Consequently, the lower bounds for the number of layers to reduce the depths of \( k \)-bit multiplication and squaring to 2 are \( \lceil \log_{1,5}(k/2) \rceil \) and \( \lceil \log_{1,5}((k/2) + 1)/2 \rceil \), respectively. For example, the multiplication of two 2048-bit integers results in a tree of depth 2048 while the squaring of a 2048-bit integer leads to a shorter tree of depth \( [2048/2] + 1 = 1025 \). Consequently, the lower bounds for the number of layers needed to reduce the trees in multiplication of two 2048-bit integers and squaring of a 2048-bit integer are \( \lceil \log_{1,5}(2048/2) \rceil = 18 \) and \( \lceil \log_{1,5}(1025/2) \rceil = 16 \), respectively.

The number of layers is, in fact, determined by the compression factor, which is the ratio of the number of inputs to the number of outputs of the largest counter used in the Wallace tree. For instance, when the largest counter is a FA, then the compression factor is 1.5, which explains the base of the logarithm in Equation 2.

In [29], Dadda observes that the compression factor in fact determines the reduction amount in the depths of the tree from one layer to the next. He, then, proposes a method to minimize the number of counters used in one layer. To this end, Dadda’s method utilizes an array of integers that is recursively generated as \( D[0] = 2 \) and \( D[i] = \lceil 1.5 \cdot D[i-1] \rceil \) for \( i \geq 1 \). This array also determines how to reach a representation of depth 2 progressively.

Algorithm 3 ADDERTREE with Dadda’s Method [29]

Input: \( t = ([f_{k-1,0}, \ldots, f_{k-1,d_{k-1}-1}], \ldots, [f_{0,0}, \ldots, f_{0,d_{0}-1}]) \)

\( D = (D[0], D[1], D[2], \ldots) \)

Output: \( t = ([f_{k-1,0}, f_{k-1,1}, \ldots, f_{0,0}, f_{0,1}]) \)

1: while \( \text{MAXDEPTH}(t) > 2 \) do
2: \( \beta \leftarrow \text{FINDNEXTMAXDEPTH}(\text{DEPTH}(t), D) \)
3: \( v \leftarrow \{0\} \)
4: for \( i \) from 0 to \((k-1)\) by 1 do
5: \( t_i \leftarrow t_i \cup v \)
6: \( v \leftarrow \{0\}; \sigma \leftarrow \{0\} \)
7: if \( \text{DEPTH}(t_i) > \beta \) then
8: \( \Delta \leftarrow \text{DEPTH}(t_i) - \beta \)
9: \( f \leftarrow \lceil \Delta/2 \rceil; h \leftarrow \Delta \mod 2 \)
10: for \( j \) from 0 to \((3f - 1)\) by 3 do
11: \( C, S \leftarrow \text{FA}(t_{i,j}, t_{i,j+1}, t_{i,j+2}) \)
12: \( \sigma \leftarrow \sigma \cup S; v \leftarrow v \cup C \)
13: end for
14: for \( j \) from \((3f)\) to \((3f + 2h - 1)\) by 2 do
15: \( C, S \leftarrow \text{HA}(t_{i,j}, t_{i,j+1}) \)
16: \( \sigma \leftarrow \sigma \cup S; v \leftarrow v \cup C \)
17: end for
18: for \( j \) from \((3f + 2h)\) to \((\text{DEPTH}(t_i) - 1)\) by 1 do
19: \( \sigma \leftarrow \sigma \cup t_{i,j} \)
20: end for
21: \( t_i \leftarrow \sigma \)
22: end if
23: end for
24: end while

Example 5. The array \( D = \{2,3,4,6,9,13,19,28,42,63,94,141,\ldots\} \) and suppose \( k = 128 \). Then, Dadda’s method first reduces the tree depth from 128 to 94. Then, it reduces from 94 to 63. The operation continued until the tree of depth 2 is obtained.

Dadda’s method uses no more counters than sufficient to reduce the depth of a tree layer to next smaller integer in \( D \) than the current depth. The method is given in Algorithm 3, where one iteration of the while loop stands for one layer of Dadda tree.

Algorithm 3 determines the depth of the next layer \( \beta \) in Step 2 using the FINDNEXTMAXDEPTH function, which computes the maximum element in \( D \) that is smaller than the depth of the current layer. Then, using FAs and HAs, the depth of the next layer will be reduced to \( \beta \). If the depth of a set \( t_i \) is already less than or equal to \( \beta \), then no counter is used and the set is copied to the next layer with no change. Otherwise, counters are used to decrease the depth of \( t_i \) to \( \beta \) (Steps 8-21 of Algorithm 3).

Example 6. Fig. 4 shows how Dadda’s method reduces the depth of redundant result in Example 1 from 4 to 2. In Layer I, where \( \beta = 3 \), 2 HAs are used to reduce \( t_3 \) and \( t_4 \) as these sets are likely to produce sets with depths greater than 3 in the next layer if their depths are not reduced otherwise. In Layer II, where \( \beta = 2 \), the three sets \( t_3 \) to \( t_5 \) are reduced using 3 FAs and \( t_2 \) is reduced using 1 HA. Dadda’s method employs 3 HAs and 3 FAs, which results in smaller circuit area than the one produced by the Wallace tree in Example 4 (cf. 3 HAs and 5 FAs) and the circuit delay is equivalent to the total delay of one FA and one HA.
As it is likely to result in slightly smaller circuits, the function ADDER-tree uses Dadda’s method in the subsequent sections of the this paper.

## 4 Modular Squaring Algorithms with Carry-Save Representation

In this section, we present three modular squaring methods suitable for low-latency circuit implementations for the computation of the VDF function \( a^{2^T} \pmod N \). All the methods take an integer \( a \) in C-S form and outputs the result \( c = a^2 \pmod N \) also in C-S form, where the modulus \( N \) is any odd integer. Consequently, no CPA is needed and long carry chains are completely avoided during the computation of \( a^{2^T} \pmod N \). Only one CPA is used after the final modular squaring of the exponentiation. Note that algorithms in this section employ incomplete arithmetic similar to [27], which basically works with inputs and outputs not fully reduced with modulus \( N \), but always in the equivalence class.

The first method is a variant of the Montgomery modular multiplication algorithm, which is efficient for randomly chosen variable moduli. The second method uses a memory-based modular reduction algorithm, where memory cells can be used to configure the circuit for different moduli. The third method is used when the modulus is fixed. In all cases, we assume the modulus is randomly chosen and has no special form as the RSA setting is usually employed in VDF instances.

### 4.1 Montgomery Modular Squaring with Variable Modulus

The Montgomery algorithm [17] eliminates costly division operation from the modular reduction and employs shift operations instead, which are free of cost in hardware implementations. It, however, computes \( abR^{-1} \pmod N \) instead of \( ab \pmod N \), where \( \text{gcd}(R,N) = 1 \) and \( R \) is chosen as a power of two. This, on the other hand, leads to no complications, especially in exponentiation operations as at the end, a simple transformation operation suffices to obtain the result in the correct form. For this the input \( a \) is first transformed into the Montgomery form \( aR \pmod N \) by multiplying it with the precomputed constant \( R^2 \pmod N \) using the Montgomery multiplication algorithm. Then, \( aR \pmod N \) is raised to the power of \( 2^T \), which results in \( a^{2^T}R \pmod N \). Finally, to obtain the desired result \( a^{2^T} \pmod N \), \( a^{2^T}R \pmod N \) is multiplied by \( 1 \).

There are various methods to compute the Montgomery multiplication [34] such as separate operand scanning (SOS), coarsely integrated operand scanning (CIOS) etc. Here, we employ the simple SOS method, which performs the integer squaring first and then the Montgomery reduction operation.

### Algorithm 4 Squaring Algorithm for Integers in C-S form

**Input:** \( a = (\{a_{0,0}, a_{1,0}\}, \ldots, \{a_{0,0}, a_{0,1}\}) \), \( a < 2^{k+1} \)

**Output:** \( d = a^2 = (\{d_{2k+1,0}, d_{2k+1,1}\}, \ldots, \{d_{0,0}, d_{0,1}\}) \), \( d < 2^{2k+2} \)

1. for \( i \) from 0 to \( (2k+1) \) do \( t_i \leftarrow 0 \) end for
2. for \( i \) from 0 to \( k \) do \( d_{i} \leftarrow d_{i} \cup a_{i,0} \cup a_{1,1} \) end for
3. for \( j \) from \( (i+1) \) to \( k \) do \( t_{i+j} \leftarrow t_{i+j} \cup (a_{i,0} \land a_{j,0}) \cup (a_{i,1} \land a_{j,1}) \) end for
4. end for

**Example 7.** For \( k = 2048 \), Algorithm 4 results in a partial product list \( t \) of depth 4098 before the last step using 8,398,851 logical-AND gates where it takes 20 layers of FAs for its reduction to depth 2. And the Dadda tree in the last step utilizes 8,390,654 FAs and 2,054 HAs. Finally the
Algorithm 5 Montgomery Reduction Algorithm for Integers in C-S form

**Input:** $d = (d_{2k+1,0}, d_{2k+1,1})$, ..., $d_{0,0}, d_{0,1})$, $d < 2^{2k+2}$

**Output:** $c = d \cdot 2^{-m} \pmod{N}$, where $c = \{c_{2k,0}, c_{2k,1}, ..., c_{0,0}, c_{0,1}\}$, $c < 2^{k+1}$

1. for $i$ from 0 to $(m-1)$ do $\mu_i \leftarrow \{0\}$ end for
2. for $i$ from 0 to $(m-1)$ do $\mu_i \leftarrow d N' \mod R$
3. for $j$ from 0 to $(m-1-i)$ do $\mu_{i+j} \leftarrow \mu_{i+j} \cup \{N_i \land d_{j,0}\} \cup \{N_i \land d_{j,1}\}$
4. end for
5. $\mu \leftarrow \text{ADDTree}(\mu)$
6. $\mu_{m-1,0} \leftarrow \mu_{m-1,0} \land (\neg \mu_{m-1,1})$
7. $\mu_{m-1,1} \leftarrow \mu_{m-1,1} \land (\neg \mu_{m-1,0})$
8. $t \leftarrow d$
9. for $i$ from 0 to $(k-1)$ do $\Rightarrow III: t = d + N(dN' \mod R)$
10. for $j$ from 0 to $(m-1)$ do $\Rightarrow \{i \leftarrow \text{FA} \leq dN' \land dR \}
11. end for
12. $\Rightarrow \{t \leftarrow \text{AND} \land \text{OR} \}
13. $\Rightarrow \{c \leftarrow \text{ADDTree}(c) \}

**Example 8.** Table 1 lists the delay and the area costs of Algorithm 4 and each stage of Algorithm 5 for $k = 2048$.

### Table 1

<table>
<thead>
<tr>
<th>Operation</th>
<th>Delay</th>
<th>Logical gates</th>
<th>FA</th>
</tr>
</thead>
<tbody>
<tr>
<td>Algorithm 4</td>
<td>$20\tau_{\text{FA}} + \tau_{\text{AND}}$</td>
<td>8,396,851 AND</td>
<td>6,390,654</td>
</tr>
<tr>
<td>Algorithm 5 - I</td>
<td>$20\tau_{\text{FA}} + \tau_{\text{AND}}$</td>
<td>4,208,652 AND</td>
<td>4,208,651</td>
</tr>
<tr>
<td>Algorithm 5 - II</td>
<td>$\tau_{\text{FA}} + \tau_{\text{AND}}$</td>
<td>2 AND + 2 NOT</td>
<td>-</td>
</tr>
<tr>
<td>Algorithm 5 - III</td>
<td>$20\tau_{\text{FA}} + \tau_{\text{AND}}$</td>
<td>8,413,202 AND</td>
<td>8,413,193</td>
</tr>
<tr>
<td>Algorithm 5 - IV</td>
<td>$\tau_{\text{FA}} + \tau_{\text{AND}}$</td>
<td>1 OR</td>
<td>2,049</td>
</tr>
</tbody>
</table>

Note that Algorithm 5 does not require the final subtraction operation as in [35], which proposes to use $R = 2^k + 2$ for a k-bit modulus $N$ to eliminate the final subtraction. Here, we need to add one more bit to all operands and work with $R = 2^{k+3}$. This is especially important as the subtraction operation is prohibitively expensive in C-S form.

### 4.2 Memory-Based Modular Squaring with Variable Modulus

In memory-based modular squaring operation, no sophisticated reduction algorithms such as Montgomery’s method [17] are needed. Instead, we can use lookup table approach for reducing the upper part of the output bits of the integer squaring (see Algorithm 4 for $a < 2^8$ and $d < 2^8$), namely $(d_{2k-1,0}, d_{2k-1,1}, ..., d_{0,0}, d_{0,1})$. In particular, we can precompute and store the integers $(2^i \mod N)$ for $k \leq i < 2k$ in a...
look up table, \( T \). Then, after the integer squaring operation, each lookup table cell goes to input of an AND gate with corresponding \( d_{i,0} \) (as well as \( d_{i,1} \)) for \( k \leq i < 2k \) and appended to the lower part of \( d \).

There is, however, one small concern about the bit-size of the final result \( c = d \mod N \) after the adder tree is applied as \( c \) now requires more than \( k \) bits. The final result \( c \) still needs to be reduced to \( k \) bits. However, this will require another reduction operation. Therefore, to capture all bits of arithmetic, where input and output of our modular squaring algorithm are \((k + \ell)\)-bit integers.

The algorithm for integer squaring operation, which are identical to Algorithm 4 except for the input being now \((k + \ell)\)-bit long, outputs a \((2k + 2\ell)\)-bit integer, \( d \). The uppermost \((k + 2\ell)\)-bits of \( d_0 \) and \( d_1 \) are now reduced modulo \( N \) with the help of the lookup table \( T \) appended to the lower part of \( d \). This generates a partial product list of depth \( 2(k + 2\ell) + 2 \) requiring \( \log_2(2k + 4\ell + 2) \) extra bits for adder tree which should be less than or equal to \( \ell \). By \( \log_2(2k + 4\ell + 2) \leq \ell \), the number of extra bits, \( \ell \), is formulated as shown in Equation 5.

\[
\ell = \begin{cases} 
\log_2 k + 4 & \text{if } k \leq 2 \\
\log_2 k + 3 & \text{if } 2 < k < 16 \\
\log_2 k + 2 & \text{if } k \geq 16 
\end{cases} \tag{5}
\]

Since the uppermost \((k + 2\ell)\) bits of \( d \) are reduced, the lookup table \( T \) has \((k + 2\ell)\) entries where each entry is \( k \)-bit long as shown in Algorithm 6. This explains the upper limit of the iteration in Step 2 of Algorithm 6. Also, the memory cells are appended to the partial product list via \( 2k(k + 2\ell) \) logical-AND gates. After the reduction operation, the adder tree reduces \( d \) to its C-S form. All operations for modular reduction are given in Algorithm 7.

### 4.3 Direct Modular Squaring with Fixed Modulus

In some VDF settings, the modulus \( N \) is generated once and used thereafter. In this scenario, that \( N \) is fixed enables significant optimizations in the circuit complexity and time delay. Direct modular squaring method uses the same approach as the memory-based modular squaring method except for the modular reduction operation. Since the modulus is fixed, then only the nonzero bits of integers \( (2^i \mod N) \) for \( k \leq i < (2k + 2\ell) \) are hardwired in the modular reduction circuit. This means we need no memory cells and only half of the AND gates (i.e., \( k(k + 2\ell) \)), on average. This also simplifies the final addition operation in Step 9 of Algorithm 7 by reducing the depth of the partial product list \( t \) from \( 2(k + 2\ell) + 2 \) to \( k + 2\ell + 2 \).

---

**Algorithm 6** Lookup Table Generation Algorithm for \( k \geq 16 

**Input:** \( N = \{N_{1-1}, \ldots, N_0\} \), \( \ell = \lceil \log_2 k \rceil + 2 \)

**Output:** \( T_{i,j} \) for \( i \in [0, k + 2\ell) \) and \( j \in \{0, k\} \)

1: for \( i \) from 0 to \( k + 2\ell - 1 \) do \( T_i \leftarrow \{0\} \) end for

2: for \( i \) from 0 to \( k + 2\ell - 1 \) do

3: \( a \leftarrow 2^{i+k} \mod N \)

4: for \( j \) from 0 to \( k - 1 \) do

5: \( T_{i,j} \leftarrow T_{i,j} \cup a_j \)

6: end for

7: end for

---

**Algorithm 7** Memory-Based Modular Reduction Algorithm

**Input:** \( d = ((d_{2(k+\ell)-1}, \ldots, d_{2(k+\ell)-1}), \ldots, (d_{0}, d_{0})) \), \( T \) : Lookup table

**Output:** \( c \equiv d \mod N \),

where \( c = ((c_{2(k+\ell)-1}, c_{2(k+\ell)-1}), \ldots, (c_{0}, c_{0})) \)

1: for \( i \) from 0 to \( k + \ell - 1 \) do \( t_i \leftarrow d_i \) end for

2: for \( i \) from \( k \) to \( k + \ell - 1 \) do \( t_i \leftarrow (0) \) end for

3: for \( j \) from \( k \) to \( 2k + 2\ell - 1 \) do

4: for \( i \) from 0 to \( k - 1 \) do

5: \( t_i \leftarrow T_{j-k,i} \wedge d_j \)

6: \( t_i \leftarrow T_{j-k,i} \wedge d_j \)

7: end for

8: end for

9: \( c \leftarrow ADDERTREE(t) \)

---

**Example 9.** Table 2 lists the delay and the area cost of Algorithm 4 and Algorithm 7 with \( k = 2048 \) and \( \ell = 13 \) for the cases when the modulus is variable or fixed.

| TABLE 2 |
|-----------------|-----------------|-----------------|
| **Res.** | **Memory-Based** | **Direct** |
| FA / HA | 16,984,339 | 7,144 | 12,736,787 | 7,132 |
| AND | 16,392,667 | 12,745,035 |
| M.Cells | 4,247,552 | 384FA + 2TAND |
| Delay | 407FA + 2TAND | 384FA + 2TAND |

---

### 5 Modular Squaring Algorithms with Redundant-Polynomial Representation

In this section, we present a novel polynomial representation for integers, referred as Redundant-Polynomial form, which is a variant of redundant representation introduced here and similar to the one in [26]. We also present its utilization in the three modular squaring methods presented in Section 4.

The polynomial representation enables low-latency circuit implementations of modular multiplication and squaring, targeting applications with variable and fixed moduli. Whereas an FPGA-optimized version is presented in [26], where the implementation is optimized to utilize FPGA-specific full-custom built-in blocks such as DSPs and Block RAMs effectively, the polynomial representation can be profitably employed for implementations on different target devices. In this paper, we utilize a novel variant of modular squaring algorithm that is tailored to be more suitable for ASIC implementations. Besides using R-P form for input and output, we take advantage of the inherent simplicity of a squaring operation and the fact that it is used repeatedly for the exponentiation \( a^{2r} \mod N \) in the VDF computation.

Each \( k \)-bit integer \( a \) can also be considered as an \( s \)-digit integer with \( r \)-bit digits \( A_i \), where \( s = \lceil k/r \rceil \) and \( A = \sum_{i=0}^{s-1} A_i \cdot 2^i \). Similarly, the integer \( A \) can also be represented as a polynomial \( A(x) = \sum_{j=0}^{r-1} A_i \cdot x^j \), where the radix \( x = 2^r \) and \( A_i < 2^r \).

In the R-P form presented in [26], an integer \( A \) is represented using a polynomial \( A(x) \) of degree \((s - 1)\) with \((r + 1)\)-bit coefficients \( A_i \) as shown in Equation 6.

\[
A(x) = \sum_{i=0}^{r-1} A_i \cdot x^i = \sum_{i=0}^{r-1} A_i \cdot x^i + \sum_{i=0}^{r-1} e_i \cdot x^{i+1} \tag{6}
\]

Here, \( A_i = (A_i + x e_i) < 2^{r+1} \) and \( e_i \in \{0, 1\} \).
Algorithm 8 CSTOPOLY Algorithm

Input: \( a = ([a_{k-1,0}, a_{k-1,1}],..., [a_0,0,...,a_0]), \) 
\\( s = [k/r] \) where \( r \) : bit-size of a digit 

Output: \( b = a = ([b_{k-1,0}, b_{k-1,1}],..., [b_0,0,...,b_0]) \) 
where \( b_i = e_i \) for \( i = 0 \) (mod \( r \)) and \( i \neq 0 \) 
where \( b_i = 0 \) for \( i \neq 0 \) (mod \( r \)) and \( i = 0 \)

1. for \( i \) from 0 to \( (s-1) \) do 
2. \( C \leftarrow 0 \)
3. for \( j \) from 0 to \( (r-1) \) do 
4. \( C, S \leftarrow FA(a_{i+j,0}, a_{i+j,1}, C) \)
5. \( b_{i+j,0} \leftarrow S \)
6. end for 
7. \( b_{(i+1),1} \leftarrow C \)
8. end for

Equation 6 is indeed a redundant form for integers similar to C-S form, where only certain bits are represented by two bits. More specifically in polynomial form for a \( k \)-bit integer \( a \), we have \( A_i = ([a_{i+1,0}, a_{i+1,1}],..., [a_0,0,...,a_0]) \) for \( (s-1) \geq i \geq 1 \) and \( A_0 = ([a_{i,0},0,...,a_0,0,0]) \) where \( a_{i,0} = a_{i,1} = 0 \) for \( j \geq k \) and \( a_{i,1} = e_i \) for \( i = 1,2,...,s-1 \). Also, \( a = a_1 + a_0 \) where \( a_1 = a_{k-1,...,a_1,0,a_0} \) and \( a_0 = a_{k,0,...,a_0,0,0} \).

Example 10. For \( k = 9 \) and \( r = 3 \), C-S and R-P forms are shown in Fig. 5. The bits representing the first, second and third digits of the R-P form are shown with black, gray and blue dots, respectively. As shown in the Fig. 5, the R-P form can also be viewed as a modified version of C-S form.

<table>
<thead>
<tr>
<th>C-S</th>
<th>R-P</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="image" alt="C-S and R-P Forms for k = 9 and r = 3" /></td>
<td></td>
</tr>
</tbody>
</table>

The basic idea of the polynomial representation is to confine the carry propagation into \( r \)-bit digits and use only \( r \)-bit CPA (CPA.) circuits. During the VDF computation, therefore, the intermediate results are kept in R-P form; but at the end, the final result is converted to the non-redundant form using a CPA.

After applying the adder tree to partial products in our algorithms, we obtain the result in C-S form, which needs to be converted to R-P form. Algorithm 8 shows the steps of this conversion function, CSTOPOLY().

When the new R-P form is used, all algorithms in Section 4 can easily be adapted with minor modifications as discussed in the next sections.

5.1 Squaring Operation for Integers in R-P Form

A low-latency integer multiplication algorithm with the polynomial form is detailed in Algorithm 7 in [26], in which its FPGA-optimized version is implemented. In an FPGA architecture, there are special functional blocks such as DSP including a full-custom \( 25 \times 18 \) signed multiplier [36]. In [26], the entire integer multiplication algorithm was optimized to utilize this basic building block in FPGA for a low-latency implementation. Since the entire algorithm is based on coefficient-wise multiplications, a DSP block is utilized to realize a \( 17 \times 17 \) core multiplication operation, which provides best utilization for \( r = 16 \).

Algorithm 9 Squaring Algorithm for Integers in R-P Form

Input: \( a = ([a_{k,0}, a_{k,1}],..., [a_0,0,...,a_0]) \)

Output: \( d = a^2 = ([d_{2k+1,0}, d_{2k+1,1}],..., [d_0,0,...,d_0]) \)

1. for \( i \) from 0 to \( (2k+1) \) do \( t_i \leftarrow \phi \) end for
2. for \( i \) from \( r \) to \( (k-1) \) by \( r \) do \( \triangleright a_1 \)
3. \( t_{2i} \leftarrow t_{2i} \cup a_{i,0} \)
4. for \( j \) from \( (i+1) \) to \( (k-1) \) by \( r \) do \( \triangleright a_0 \)
5. \( t_{i+j+1} \leftarrow t_{i+j+1} \cup (a_{i,0} \wedge a_{j,0}) \)
6. end for 
7. end for 
8. for \( i \) from 0 to \( k \) do \( \triangleright 2a_1a_0 \)
9. \( t_{2i} \leftarrow t_{2i} \cup a_{i,1} \)
10. for \( j \) from \( (i+1) \) to \( k \) do \( \triangleright a_0 \)
11. \( t_{i+j+1} \leftarrow t_{i+j+1} \cup (a_{i,1} \wedge a_{j,1}) \)
12. end for 
13. end for 
14. for \( i \) from 0 to \( k \) do \( \triangleright 2a_1 \)
15. \( t_i \leftarrow ADDTREE(t) \)
16. \( t \leftarrow CSTOPOLY(t) \)
17. end for 
18. end for 
19. for \( t \) from 0 to \( k \) do
20. \( d \leftarrow CSTOPOLY(t) \)

For ASIC implementation, utilizing core \( 17 \times 17 \) multipliers of non-redundant form is not always most efficient, as these multipliers results in long carry chains in the partial product calculation stage. Therefore, carry propagation should be deferred as long as possible. To this end, we propose Algorithm 9 to compute the square of a \( (k+1) \)-bit integer \( a \) given in R-P form, where \( a = a_1 + a_0 < 2^{k+1} \) and \( a_1,a_0 < 2^{k+1} \). The result \( d \) is also in R-P form, where \( d = d_1 + d_0 < 2^{2k+2} \). Note that Algorithm 9 basically computes Equation 4.

Example 11. For \( k = 2048 \) and \( r = 8 \), Algorithm 9 results in a partial product list \( t \) of depth 1,407 before the last two steps using 2,657,665 logical-AND gates, where it takes 17 layers of FAs for its reduction to depth 2. And the operations in the last two steps employ 2,649,472 FAs, 2,135 HAs and 513 CPA.s. Finally, the total delay of the squaring circuit is \( 17 \times FA + \tau \text{AND} + \tau \text{CPA} \).

Note that all partial product bits are calculated using logical-AND gates in parallel and the resulting bits are grouped together in the partial product list \( t \) as shown in Steps 1-18 of Algorithm 9. This is a totally different approach from the one in Algorithm 7 in [26], which is based on digit-wise multiplications.

As shown in Steps 2-13 of Algorithm 9, the partial product bits generated by squaring of \( a_1 \) and \( a_0 \) are calculated using \( s(\text{s}-1)/2 \) and \( (k+1)(k+2)/2 \) logical-AND gates, respectively.

The partial product bits generated by the multiplication \( 2a_0a_1 \), (see Steps 14-18 of Algorithm 9), need \( (\text{s}-1)(\text{k}+1) \) logical-AND gates. The partial product list \( t \) has depth of \( s + \lfloor (k+1)/2 \rfloor + \lfloor (s-1)/2 \rfloor + 1 \) at most.

After the generation of the partial product list \( t \), the algorithm reduces the resulting list to the regular C-S form using Dadda tree (Step 19 of Algorithm 9). The output of the adder tree represents a \( (2k+2) \)-bit integer in C-S form with \( (2k+2) \)-bit carry and save parts, \( d_1 \) and \( d_0 \), respectively, where \( d_1,d_0 < 2^{2k+2} \). Therefore, the resulting integer is transformed...
to R-P form using CSTOPOLY function, which uses \( \lfloor (2k + 2)/r \rfloor \)
CPA_\text{s}.

**Example 12.** Fig. 6 depicts an example implementation of Algorithm 9 for \( k = 4 \) and \( r = 2 \).

![Fig. 6. Example implementation of Algorithm 9 for \( k = 4 \) and \( r = 2 \)](image)

### 5.2 Modular Reduction for Integers in R-P Form

In this section, we investigate the utilization of the R-P form in the three modular reduction methods already presented in Section 4: Montgomery modular reduction with variable modulus, memory-based modular reduction with variable modulus and direct modular reduction with fixed modulus.

#### 5.2.1 Montgomery Reduction with Variable Modulus

For the R-P form, we utilize the Montgomery reduction algorithm (see Algorithm 5) presented in Section 4 and propose three alternative variations in the actual implementation. In the R-P form, similar to the C-S form, the Montgomery reduction algorithm takes \( (2k + 2) \)-bit integer \( d \) as input and outputs a \( (k + 1) \)-bit integer \( c \) where \( R = 2^m \) and \( m = k + 3 \).

Modular squaring operation with redundant representation requires only its input and output to be in the identical redundant form. Integers in the intermediate steps of modular squaring can have different forms as long as the operation generates the output in the same form as the input. For the integers in R-P form, CSTOPOLY operation is applied after ADDTREE to convert the integer in C-S form to R-P form. Therefore, CSTOPOLY applied in the intermediate steps of modular squaring operation can be omitted as long as it is applied to the final modular squaring result, \( c \).

In modular squaring with Montgomery reduction for integers in R-P form, CSTOPOLY can be applied at three different points. The first one is after squaring operation (see Step 20 of Algorithm 9). The second CSTOPOLY can be applied after \( \mu \) is calculated as shown in Step 7 of Algorithm 5. The third conversion is applied after the output \( c \) is calculated, which is performed in all cases. The first and second CSTOPOLY operations can be skipped as explained next.

We investigate three different Montgomery reduction methods. In the first method (Method 1), CSTOPOLY operations after squaring and the calculation of \( \mu \) are skipped. CSTOPOLY is only applied after the Step 21 of Algorithm 5. In the second method (Method 2), CSTOPOLY is performed after squaring and skipped after the calculation of \( \mu \). Since \( d \) is in the R-P form after the squaring, the partial product list \( \mu \) in Stage I of Algorithm 5 is computed using \( m(m + 1)/2 + m z - r(z + 1)/2 \) logical-AND gates and has depth of \( m + z \) where \( z = |(m - 1)/r| \). In the third method (Method 3), all CSTOPOLY operations are performed. The integer \( \mu \) in C-S form is converted to R-P form using \( |m/r| \) CPA\_\text{s} after the Step 7 of Algorithm 5. Since \( \mu \) is in R-P form, the partial product list \( r \) generated in Stage III is computed using \( k(m + z) \) logical-AND gates and has depth of \( k + z + 2 \) at most where \( z = |(m - 1)/r| \). Finally, the final modular squaring result \( c \) in C-S form is converted to R-P form using \( \lfloor k(1 + 1)/r \rfloor \) CPA\_\text{s}.

#### 5.2.2 Memory-Based Reduction with Variable Modulus

As already explained Section 4.2, we can use lookup tables for reducing integers in R-P form employing Algorithm 9 followed by Algorithm 7. Since input and output of our modular squaring algorithm are \( (k + \ell) \)-bit integers as explained in Section 4.2, the algorithm for integer squaring operation, which is identical to Algorithm 9 except for the input being now \( (k + \ell) \)-bit long, outputs a \( (2k + 2\ell) \)-bit integer, \( d \), where \( \ell \) is formulated as shown in Equation 5.

Following the approach introduced in Section 5.2.1, we can utilize R-P form in memory-based modular squaring algorithm with two different variants. In the first method (Method 1), CSTOPOLY after squaring is skipped and it is applied after the final modular squaring result \( c \) is calculated (see Step 9 of Algorithm 7). In the second method (Method 2), CSTOPOLY is performed after squaring and applied to the upper part of \( d \), namely \( \{d_{2k+2\ell-1,0}, d_{2k+2\ell-1,1}, \ldots, d_{k,0}, d_{k,1}\} \), using \( \lceil (k + 2\ell)/r \rceil \) CPA\_\text{s}. Since the upper part of \( d \) is in R-P form after squaring,
the partial product list \( r \) generated after reduction is computed using 
\[ k + 2\ell + l/\lceil (k + 2\ell - 1)/2 \rceil \] 
logical-AND gates and it has depth of 
\[ k + 2\ell + l/\lceil (k + 2\ell - 1)/2 \rceil + 2. \] 
Finally, the final modular squaring result \( c \) in C-S form is converted to R-P form using 
\[ ((k + \ell)/r) \] 
CPAs.

**Example 14.** Table 4 lists the delay and the area costs of 
memory-based modular squaring operation (Algorithm 9 and Algorithm 7) for 
\( k = 2048, \ell = 13 \) and \( r = [4,8] \).

**Example 15.** Table 5 lists the delay and the area cost of 
direct modular squaring operation with fixed modulus 
(Algorithm 9 and Algorithm 7) for 
\( k = 2048, \ell = 13 \) and \( r = [4,8] \).

### 5.2.3 Direct Reduction with Fixed Modulus

This method uses the same approach as the memory-based 
squaring method except for the modular reduction operation. 
In this method, we need no memory cells and only half of 
the AND gates, on average. This also simplifies the final 
addition operation in Step 9 of Algorithm 7. Also, we have 
two variants due to reasons explained in Section 5.2.2: i) the 
first method (**Method 1**) avoids CSTOPOLY operation after 
the integer squaring operation (Algorithm 9) and ii) the second 
method (**Method 2**) uses two applications of CSTOPOLY operation; 
one after the integer squaring and the other at the end of the 
modular reduction.

**Example 15.** Table 5 lists the delay and the area cost of 
direct modular squaring operation with fixed modulus 
(Algorithm 9 and Algorithm 7) for 
\( k = 2048, \ell = 13 \) and 
\( r = [4,8] \).

### 6 Results and Comparison

In this section, we provide the results and comparison of the 
algorithms presented in this work. To provide a fair comparison, 
we implemented the same algorithms for integers in non-redundant form, 
as well. In Table 6, we present the delay and 
area costs of the modular squaring methods proposed for the 
integers in C-S and non-redundant forms for \( k = [1024,4096] \). 
Similarly, Table 7 presents the delay and area costs of the modular squaring methods proposed for the integers in R-P form for 
\( k = [1024,4096] \) and \( r = [8,16] \). The key points are summarized as follows:

- Algorithms utilizing C-S and R-P forms avoid long carry chains. 
  These representations show superior time performance in comparison with non-redundant representation for modular squaring operation.
- R-P form requires at least one layer of CPA units for modular squaring due to the requirement that input and output be in the same form. Since the carry part of R-P form has many fewer number of bits than the save part, adder trees generated using R-P form has less depth than the adder trees in C-S form. Furthermore, the design with R-P form uses less resources at the expense of extra CPAs.
- Overall, R-P form has fewer number of FAs on the critical path of the design than the C-S form at the expense of additional CPAs. Thus, there is a trade-off 
between CPA and FA delays. For instance, Montgomery multiplication with C-S incurs about the delay of 63\( \tau \)FA (see Table 6) whereas its R-P version does the delay of 61\( \tau \)FA + \( \tau \)CPA (see Method 1 in Table 7). Our experiments with TSMC 65 nm technology indicate that both C-S and R-P based designs feature similar critical path 
 delays whereas the latter consumes much less chip area (see Fig. 7). This is due to the fact that highly optimized CPA designs are available in ASIC libraries. However, it is difficult to derive a definitive conclusion as to which 
redundant form yields a lower latency circuit, which depends on the technology and design optimization efforts; and this can be the focus of a future study.
- Memory-based modular reduction algorithms incur much lower gate count and result in significantly shorter critical path compared to the Montgomery modular reduction algorithms at the expense of chip area for large number of memory cells. As memory access can be done in parallel to the integer squaring operation, 
memory access latency is not on the critical path. In summary, memory based modular squaring algorithms 
can be strong alternatives to Montgomery-based designs subject to the overall cost of memory.
- The direct modular reduction methods offer the best solution 
in terms of area and time performance; however, they support only fixed moduli.

In order to provide a theoretical lower bound for the area and 
time complexities of ASIC implementations, we synthesized 
every building block (AND, OR, NOT, HA, FA, CPAs) used in 
the proposed algorithms using the TSMC 65 nm standard cell library. The time and area costs of different methods and representations are plotted in Fig. 7 for 
\( k = 2048 \) and \( r = 8 \).

As expected, direct modular reduction algorithms yield the 
best area and time performance for all representations since they 
take the advantage of fixed moduli. For variable moduli, 
memory-based modular reduction methods enjoy higher performance than the Montgomery modular reduction algorithm 
as far as the critical path delay is concerned. Memory-based 
modular reduction methods, however, require memory cells; 
and they are not included in the area cost. For each modu-
TABLE 6
Delay and Area Costs of Modular Squaring Methods with C-S and Non-Redundant Forms for \( k = [1024, 4096] \)

<table>
<thead>
<tr>
<th>( k )</th>
<th>Res.</th>
<th>C-S Form</th>
<th>Non-Redundant Form</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Montgomery</td>
<td>Memory-Based</td>
</tr>
<tr>
<td>1024</td>
<td>FA</td>
<td>5,259,274</td>
<td>4,289,766</td>
</tr>
<tr>
<td></td>
<td>HA</td>
<td>4,112</td>
<td>1,078</td>
</tr>
<tr>
<td></td>
<td>Logical Gates</td>
<td>5,267,491 AND + 1 OR + 2 NOT</td>
<td>4,293,932 AND</td>
</tr>
<tr>
<td></td>
<td>M.Cells</td>
<td>–</td>
<td>–</td>
</tr>
<tr>
<td></td>
<td>Delay</td>
<td>54( t ) FA + ( t ) HA + 4( t ) AND + ( t ) OR + ( t ) NOT</td>
<td>36( t ) FA + 2( t ) AND</td>
</tr>
<tr>
<td></td>
<td>HA</td>
<td>16,400</td>
<td>4,202</td>
</tr>
<tr>
<td></td>
<td>Logical Gates</td>
<td>85,894,419 AND + 1 OR + 2 NOT</td>
<td>67,572,118 AND</td>
</tr>
<tr>
<td></td>
<td>M.Cells</td>
<td>–</td>
<td>16,891,904</td>
</tr>
<tr>
<td></td>
<td>Delay</td>
<td>63( t ) FA + ( t ) HA + 4( t ) AND + ( t ) OR + ( t ) NOT</td>
<td>42( t ) FA + 2( t ) AND</td>
</tr>
</tbody>
</table>

Fig. 7. Area (NAND2-based) vs. Time (ns) Graph for the ASIC Implementations of Proposed Modular Squaring Methods with Redundant and Non-Redundant Representations for \( k = 2048 \) and \( r = 8 \).

6.1 Implementation Notes and Alternative Approaches

The adder trees, which constitute an important part of our algorithms, are much larger in comparison with those in the literature. Works on multiplier designs in the literature have never explored adders tree constructions, which are deeper than necessary for the word size of computers as they are concerned with general-purpose computing and optimization of time-area product. As our goal is to find a lower bound for the latency of modular squaring of RSA-size operands, we, for the first time in the literature, explored unconventionally large adder trees optimized for low latency.

Furthermore, it may not be really practicable to construct adder trees of the aforementioned sizes using available CAD tools. Therefore, the actual realizations of the proposed designs may call for a modular approach, in which modules are used to compress only a part of the adder tree. As this requires several iterations incurring extra overhead in latency, the modular approach is likely to yield slower circuits.

An interesting algorithm [37] using the Chinese remainder theorem turns the modular multiplication with large modulus into many modular multiplications with much smaller moduli, all of which can be performed in parallel. By leveraging even larger number of small primes, parallelization can be further utilized to achieve much lower latency values. However, certain aspects of the algorithm should be investigated in more detail for a possible ASIC implementation. While, it can indeed reduce the adder tree depths, one major design issue of the algorithm in [37] is its high fan-out and involved place-and-route requirements. For instance, a 2048-bit modular multiplication requires modular multiplications with 422 small moduli, whose results are inputs to subsequent 422 additions. Employing larger number of small moduli (e.g., 725 for further parallelization) will increase the fan-out requirements by the same amount. Our algorithms, on the other hand, are likely to enable low fan-out designs (especially in the adder tree) and relatively simple place-and-routing. As there is no hardware implementation of the algorithm [37] in the literature, it is not possible to provide its fair comparison with our algorithms. Nevertheless, it would be highly valuable from research point of view to explore its various hardware implementations, which is an interesting future research subject.

7 CONCLUSION

In this work, we provided an in-depth study for utilization of redundant representations of integers to accelerate modular squaring operation, which is the core operation in VDF applications. Our analysis and the results indicate that redundant representations potentially yield a significant reduction in the critical path delay of the modular squaring hardware. In particular, we introduced three different modular reduction algorithms that work with redundant representations, which are amenable to regular, low fan-out and low-latency ASIC implementations. The direct reduction algorithm enjoys the best time as well as best area performance with respect to the other two; but it works for only fixed modulus. With variable moduli, the Montgomery reduction algorithm stands the best solution for memory-constrained designs while designs can greatly benefit from memory-based reduction algorithm provided that the required memory is available.
### TABLE 7
Delay and Area Costs of Modular Squaring Methods with R-P Form for $k = (1024,4096)$ and $r = (8,16)$

<table>
<thead>
<tr>
<th>Res.</th>
<th>Montgomery</th>
<th>Memory-Based</th>
<th>Direct</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Method 1</td>
<td>Method 2</td>
<td>Method 3</td>
</tr>
<tr>
<td>FA</td>
<td>3,822,284</td>
<td>3,360,714</td>
<td>2,437,443</td>
</tr>
<tr>
<td>HA</td>
<td>4,268</td>
<td>3,436</td>
<td>3,470</td>
</tr>
<tr>
<td>CPA6</td>
<td>129</td>
<td>386</td>
<td>515</td>
</tr>
<tr>
<td>Logical Gates + 1 OR + 2 NOT + 1 OR + 2 NOT</td>
<td>2,518,802 AND 1 OR + 2 NOT</td>
<td>2,625,499 AND 1,885,467 AND 1,752,347 AND 1,282,331 AND</td>
<td></td>
</tr>
<tr>
<td>M.Cells</td>
<td>0151FA + TFA + 4TAND + 1TOR + 8TNOT + 2TCPA8</td>
<td>491FA + TFA + 4TAND + 1TOR + 8TNOT + 2TCPA8</td>
<td>33TFA + 2TAND + 2TCPA8</td>
</tr>
<tr>
<td>Delay</td>
<td>0151FA + TFA + 4TAND + 1TOR + 8TNOT + 2TCPA8</td>
<td>491FA + TFA + 4TAND + 1TOR + 8TNOT + 2TCPA8</td>
<td>33TFA + 2TAND + 2TCPA8</td>
</tr>
</tbody>
</table>

We also provided figures for area and time complexities of the proposed hardware algorithms in terms of number of basic building blocks such as logical gates, half- and full-adders and carry propagation adders. These figures are intended to serve as lower bounds for the future studies on ASIC implementations of modular squaring operation, which is urgently needed for the adoption of the verifiable delay function in practical applications.

### REFERENCES


Ahmet Can Mert received the B.S. and M.S. degrees in Electronics Engineering Program from Sabancı University, Istanbul, Turkey, in 2015 and 2017, respectively. Currently, he is working towards the Ph.D. degree in Electronics Engineering Program at Sabancı University, Istanbul, Turkey. His research interest includes homomorphic encryption, lattice-based cryptography and cryptographic hardware design.

Erdinç Öztürk received his MS degree in Microelectronics from Sabancı University in 2003. He received his MS degree in Electrical Engineering in 2005 and PhD degree in Electrical and Computer engineering in 2009 from Worcester Polytechnic Institute. After receiving his PhD degree, he worked at Intel in Hudson, Massachusetts for almost 5 years as a hardware engineer, before joining Istanbul Commerce University in Turkey as an assistant professor. He joined Sabancı University in 2017 as an assistant professor. His research interest focuses on efficient hardware implementations of compute-intensive algorithms.

Erkay Savaş received the BS and MS degrees in electrical engineering from the Electronics and Communications Engineering Department, Istanbul Technical University in 1990 and 1994, respectively. He received the PhD degree from the Department of Electrical and Computer Engineering, Oregon State University in June 2000. He has been a faculty member at Sabancı University since 2002. His research interests include applied cryptography, data and communication security, security and privacy in data mining applications, embedded systems security, and distributed systems.