Speed and Area Optimized Parallel Higher-Radix Modular Multipliers

Khalid Javeed, Xiaojun Wang

Abstract—Modular multiplication is the fundamental and compute-intensive operation in many Public-Key crypto-systems. This paper presents two modular multipliers with their efficient architectures based on Booth encoding, higher-radix, and Montgomery powering ladder approaches. Montgomery powering ladder technique enables concurrent execution of main operations in the proposed designs, while higher-radix techniques have been adopted to reduce an iteration count which formally dictates a cycle count. It is also shown that by adopting Booth encoding logic in the designs helps to reduce their area cost with a slight degradation in the maximum achievable frequencies. The proposed designs are implemented in Verilog HDL and synthesized targeting virtex-6 FPGA platform using Xilinx ISE 14.2 Design suite. The radix-4 multiplier computes a 256-bit modular multiplication in 0.93 µs, occupies 1.6K slices, at 137.87 MHz in a cycle count of \([n/2]+2\), whereas the radix-8 multiplier completes the operation in 0.69 µs, occupies 3.6K slices, achieves 123.43 MHz frequency in a cycle count of \([n/3]+4\). The implementation results reveal that the proposed designs consumes 18% lower FPGA slices without any significant performance degradation as compared to their best contemporary designs.

Index Terms—Finite field, elliptic curve cryptography (ECC), interleaved multiplication, public key cryptography (PKC).

I. INTRODUCTION

Crypto-systems based on public-key cryptography (PKC) [1], [2], [3], [4] are structured using finite field arithmetic primitives such as modular addition, subtraction, multiplication, and inversion. Among these primitives, modular multiplier is the one that is computationally intensive and therefore it must be carefully optimized in order to boost a performance of the associated system. Dedicated hardware architectures have been the most optimum choice for high-performance systems. Field Programmable Gate Array (FPGA) becomes a very popular implementation platform to accelerate many tedious operations because of its low cost, reconfigurability, short design cycle, and many others factors.

Several independent hardware architectures have been proposed to speed-up a modular multiplication primitive in order to reduce the computational complexities of an overall security infrastructure in many communication networks. Modular multiplication of operands \(x, y\) over a modulus \(M\) (denoted as \(z = x \times y \mod M\)) is a two step process: integer multiplication and reduction modulo \(M\). The reduction operation typically requires a division operation which is a very compute-intensive operation, therefore many strategies have been proposed to lower the computational intensity of the reduction step. Generally these can be lined-up in three main categories: designs over standard primes [5], designs based on Montgomery multiplication method [6] and designs over interleaved multiplication method [7].

II. BACKGROUND AND RELATED WORK

In order to lower a computational intensity of the reduction step the National Institute of Standard and Technology (NIST) has recommended five specialized primes \(M\) of size \((M_{192}, M_{224}, M_{256}, M_{384}, M_{512})\). These primes have a special structure (very close to the power of 2) i.e., \(2^n \pm 2^k \pm 1\), and are called pseudo-Mersenne primes. Typically a modular multiplication operation when computed over this prime form results in high performance and lower computational cost. However, as a modulus value is pre-defined, which results in a very-rigid structure not able to opt for any other prime value, hence lacks flexibility. The designs reported in [8], [9] exploited the special structure of NIST recommended primes. The architecture in [9] is developed to support NIST primes \(M_{224}\) and \(M_{256}\) while [8] supports all of five NIST primes. It occupies 8340 slices and 259 dedicated DSPs blocks on Virtex-6 FPGA platform and performs the respective operation between 80-200 ns. These implementations have limited flexibility which may not be suited to many applications.

Montgomery multiplication method converts the required division operation into cheap shift and addition operations. However, to take leverage of the Montgomery method one needs to transform operands and a result from normal to Montgomery representation and vice versa before and after the original operation. The method is suitable where an overall back and forth conversion overhead is negligible as compared to the main operation cost e.g., in exponentiation algorithms. Montgomery multiplication algorithm based designs are reported in [10], [11]. Among these [11] is a based on radix-4 and [10] incorporated radix-16 techniques. Koç et al. in [12] discussed several possible strategies on basis of their performance and implementation cost.

Interleaved multiplication method is proposed by Blakley [7] in 1983. The method is based on iterative addition and reduction of partial products. Partial products accumulation and intermediate results reduction are integrated in a way to eliminate the final division. The idea is to reduce intermediate results below a modulus value in each iteration so that the final division can be omitted. The algorithm starts traversing a multiplier from most-significant-bit (MSB) to least-significant-bit (LSB). Several modifications and hardware architectures...
have been reported [13], [14], [15], [16], [17], [18]. In [17] a faster interleaved modular multiplier based on Montgomery and Barrett reduction techniques is reported. Its 130-nm ASIC implementation runs at a maximum frequency of 320 MHz and computes one 256-bit modular multiplication in 0.05 us. Ghosh et al in [16] reported a radix-2 parallel interleaved modular multiplier. Its Virtex-II Pro FPGA implementation consumes 3475 slices with a latency of 3.2 us and takes n clock cycles to perform n-bit modular multiplication.

A. Contribution

The main objective of this work is to design efficient modular multipliers that are not only optimized for speed and area, but also have a flexibility to adopt any value for a prime \( M \). The main contributions of this work are summarized below.

- A Booth encoded Montgomery powering ladder based radix-4 modular multiplier is presented. Radix-4 method is adopted to reduce iteration count (partial products). Montgomery powering ladder based strategy is incorporated to perform main operations concurrently, and Booth recoding is used to optimize the design space complexity.
- Then, the same idea is extended to a radix-8 Booth encoded Montgomery powering ladder based modular multiplier.
- The radix-4 version of the multipliers performs a \( n \)-bit modular multiplication in \( \lceil n/2 \rceil + 2 \) clock cycles, whereas the radix-8 version executes the operation in \( \lceil n/3 \rceil + 4 \) for the same bit length.
- In both of the proposed modular multipliers addition and subtraction is performed by the available fast carry chains (FCC) on FPGA platform.

The rest of this paper is structured as follows: Section III presents our proposed radix-4 and radix-8 Booth encoded Montgomery Powering ladder based modular multipliers. Hardware architectures of the multipliers are presented in section IV. Section V presents implementation results and performance evaluation to the other related designs, while the paper is concluded in section VI.

III. PROPOSED MODULAR MULTIPLIERS

Typical Higher-radix based multipliers produce faster results because of their lower iteration count as compared to their bit-level implementation. However, these techniques deteriorate the critical path, which limit their maximum achievable frequencies. Several techniques have been proposed to reduce inner complexities that shortened the critical path of higher-radix multipliers. We introduced modifications to the Interleaved Multiplication (IM) algorithm.

![Fig. 1. BE radix-4 scheme](image_url)

Algorithm 1: The Montgomery Powering Ladder

**Input:** \( x, y, M \) : \( y := [1, y_{n-2}, \ldots, y_0] \)

**Output:** \( x^y \mod M \)

1. \( r_1 := x \) and \( r_2 := x^2 \)
2. for \( i := n - 2 \) downto 0 do
3.   if \( (y_i := 1) \) then
4.     \( r_1 := r_1r_2 \) and \( r_2 := (r_2)^2 \)
5.   else
6.     \( r_2 := r_1r_2 \) and \( r_1 := (r_2)^2 \)
7. return \( r_1 \)

Algorithm 2: BE radix-4 IMML Modular Multiplication

**Input:** \( M, x, y : 0 \leq x, y \leq M \)

**Output:** \( z = x \times y \mod M \)

1. \( z := 0, R_1 := x, R_2 := 2x \mod M \)
2. for \( i := 0 \) downto \( n - 1; i := i + 2 \) do
3.   if \( (y_{(i,i+1,i+2)} := (000); (111)) \) then
4.     \( v^{(i)} := 0 \)
5.   else if \( (y_{(i,i+1,i+2)} := (001); (010); (101); (110)) \) then
6.     \( v^{(i)} := R_1^{(i)} \)
7.   else
8.     \( v^{(i)} := R_2^{(i)} \)
9.     \( R_1^{(i+2)} := 4R_1^{(i)} \mod M \)
10. \( R_2^{(i+2)} := 4R_2^{(i)} \mod M \)
11. if \( (BE_{ctr} := 1) \) then
12. \( z^{(i+2)} := z^{(i)} - v^{(i)} \mod M \)
13. else
14. \( z^{(i+2)} := z^{(i)} + v^{(i)} \mod M \)
15. return \( z^{(i)} \)

A. BE Radix-4 IMML Multiplier

A Booth encoded Montgomery powering ladder based radix-4 multiplier (BE radix-4 IMML) is a modification of the interleaved modular multiplication (IM) algorithm which is also known as binary double-and-add algorithm. Montgomery powering ladder (ML) [19] technique was initially proposed to speed-up a square and multiply method of a modular exponentiation. The ML method given in Alg. 1 eliminates conditional branch evaluation and enables parallel execution of a modular multiplication and modular squaring operations as mentioned in steps 4 and 6 of the Alg. 1. Both of these operations are performed at every iteration of the algorithm irrespective of an exponent bit \( y_i \). Serial Booth encoded radix-4 and radix-8 IM multipliers are proposed in [13] and radix-4 and ML based parallel multiplier is reported in [18]. Booth encoding technique was first suggested by Booth [20] and later modified by MacSorley [21]. The modified radix-4 Booth encoding technique is shown in Fig. 1, where it appends zero to the right of least significant bit (LSB) of a multiplier \( y \).

In the case of radix-4, it is operated on triplets with one bit overlap starting from \( y - 1 \) and proceeds towards the most significant bit (MSB). Possible partial products in case of BE radix-4 due to scanning of a triplet of \( y \) are \( \{0, \pm 1, \pm 2\}x \) for a multiplicand \( x \). Our proposed modification to the IM algorithm on basis of BE, radix-4, and ML techniques is given in Alg.
TABLE I

<table>
<thead>
<tr>
<th>$z$ := 0</th>
<th>$R_1$ := $x$</th>
<th>$R_2$ := $2x$ mod $M$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$i = 0$</td>
<td>$z^{(i+2)} := z^{(i)} + v^{(i)}$ mod $M$</td>
<td>$R_1^{(i+2)} := 4R_1^{(i)}$ mod $M$</td>
</tr>
<tr>
<td></td>
<td></td>
<td>$R_2^{(i+2)} := 4R_2^{(i)}$ mod $M$</td>
</tr>
<tr>
<td>$i = n/2$</td>
<td>$z^{(i+2)} := z^{(i)} + v^{(i)}$ mod $M$</td>
<td>$R_1^{(i+2)} := 4R_1^{(i)}$ mod $M$</td>
</tr>
<tr>
<td></td>
<td></td>
<td>$R_2^{(i+2)} := 4R_2^{(i)}$ mod $M$</td>
</tr>
</tbody>
</table>

Fig. 2. BE radix-8 scheme

2. The modified BE radix-4 IMML algorithm scans triplets of a multiplier $y$ from LSB to MSB and instead of shifting an accumulator contents it shifts partial products in each iteration. The Alg. 2 is comprised of several independent steps such as accumulator contents it shifts partial products in each iteration. From LSB to MSB and instead of shifting an accumulator

$z^{(i+3)} := z^{(i)} + v^{(i)}$ mod $M$ $R_1^{(i+3)} := 8R_1^{(i)}$ mod $M$ $R_2^{(i+3)} := 8R_2^{(i)}$ mod $M$ $R_3^{(i+3)} := 8R_3^{(i)}$ mod $M$ $R_4^{(i+3)} := 8R_4^{(i)}$ mod $M$

Algorithm 3: BE radix-8 IMML Modular Multiplication

B. BE Radix-8 IMML Multiplier

A modified BE radix-8 IMML algorithm is given in Alg. 3. A BE radix-8 technique is shown in Fig. 2, where it scans a quadruplet of a multiplier $y$ with a single bit overlap between adjacent quadruplets. Possible partial products in this case are \{0, ±1, ±2 ± 3, ±4\} $x$. The generation of ±3$x$, ±4$x$ is a major difference to the BE radix-4 IMML algorithm.

The Alg. 3 is comprised of six main steps i.e., 14, 15, 16, 17, 19, and 20. In the steps 14, 15, 16, and 17 three-bit left-shift modulo $M$ operation is performed on their respective partial products, while in the steps 15 and 17 modular addition or subtraction of a respective partial product is performed from an accumulator $z$ as detailed in Table II. Note that due to the ML method, there is hardly any data dependency among all these operations, therefore these can be executed concurrently. In the case of BE radix-8 IMML method, the iteration count is further reduced to \lfloor n/3 \rfloor, however it requires more design space due to several parallel units to execute the above mentioned steps which would be discussed in the next section.

IV. HARDWARE ARCHITECTURES

Hardware architectures of the presented BE radix-4 IMML and BE radix-8 IMML algorithms mentioned in Alg. 2, Alg. 3 are shown in Fig. 3(a) and Fig. 4(a), respectively.

The BE radix-4 IMML architecture consists of three main blocks $E_1$, $E_2$ and $A/S$. In addition to these main blocks there is a BE block, three $n$-bit data registers $R_1$, $R_2$, $Z$ and one shift register (SR). Blocks $E_1$ and $E_2$ are exactly identical, and perform two-bit left shift mod $M$ operation i.e., $4R_i$ mod $M$. The A/S block is responsible for performing addition/subtraction mod $M$ operation i.e., $x ± y$ mod $M$. Internal architectures of the blocks $E_1$, $A/S$ and $BE$ are shown in Fig. 3(c), Fig. 3(d) and Fig. 3(e), respectively. Each $E_1$ block consists of two identical smaller $A_i$ blocks cascaded in series, where each of the $A_i$ blocks performs single-bit left shift mod $M$ operation i.e., $2x$ mod $M$ operation. The BE block accepts three respective multiplier bits, $y_i, y_{i+1}, y_{i+2}$ and outputs a single-bit control signal (BE_ctrl) for the A/S block. Initially, the SR register is loaded with multiplier $y$.
which performs two-bit right shift of $y$ in each clock cycle. Steps 9, 10 are performed by the blocks EU$_1$ and EU$_2$ while steps 12 and 14 are performed by the A/S block. In each iteration of the algorithm all these steps are executed on their respective blocks concurrently. It is worth mentioning that at the start an operand $x$ is loaded to register R$_1$, however register R$_2$ needs to be loaded with the $2x \mod M$ value, which is done by a pre-computation process not shown in Fig. 3. In the pre-computation process any of the EU$_1$ blocks can be configured to compute the $2x \mod M$ value which is then saved in register R$_2$. This pre-computation process does not incur any additional combinational blocks and it only costs an extra two clock cycles overhead. As the total number of iterations in the BE radix-4 IMML algorithm in Alg. 2 is $\lceil n/2 \rceil$ and the proposed architecture performs each iteration in a single clock cycle, therefore the latency of the proposed design is $\lceil n/2 \rceil + 2$ clock cycles, where an extra two cycles are consumed in the pre-computation process. Note that the critical path of the BE radix-4 IMML is comprised of an A/S block and a multiplexer, overall which is comprised of four multiplexers and two $n$-bit adders.

Similarly, the BE radix-8 IMML architecture in Fig. 4(a) is comprised of four identical processing units PU$_{1-4}$ and the same A/S block. In addition to these it also contains some data registers R$_{1-3}$ and Z, a BE block and SR. Each PU$_i$ unit performs three-bit left shift mod $M$ operation i.e., $8x \mod M$, and consists of the three A$_1$ units cascaded in series fashion as shown in Fig. 4(b). The BE block as shown in Fig. 4(c) now operates on four respective multiplier bits i.e., $y_i, y_{i+1}, y_{i+2}, y_{i+3}$ and generates a control signal for the A/S block. Note that for the BE radix-4 IMML we need to have a pre-computed value ($2x \mod M$), however in the BE radix-8 IMML registers R$_1$, R$_2$, R$_3$, R$_4$ must be pre-loaded with an operand $x$ and $2x \mod M$, $3x \mod M$, and $4x \mod M$, respectively. The values $2x \mod M$ and $4x \mod M$ are computed by the block PU$_1$ in a single clock cycle, then the registers R$_2$ and R$_3$ are updated with the results. The $3x \mod M$ value is calculated in the A/S block by selecting appropriate inputs (R$_1$, R$_3$) and the result is stored in the register R$_3$ in the next clock cycle. The pre-computation process in the BE
TABLE III
MODULAR MULTIPLIERS IMPLEMENTATION COMPARISON IN FPGA

<table>
<thead>
<tr>
<th>Design</th>
<th>Platform</th>
<th>$M$ size</th>
<th>Area (slices)</th>
<th>LUTs$^a$</th>
<th>S.reg$^b$</th>
<th>Freq. (MHz)</th>
<th>Time (us)</th>
<th>TP$^c$ (Mh/s)</th>
<th>AT$^d$/b$^e$</th>
</tr>
</thead>
<tbody>
<tr>
<td>Radix-2 IM [7]$^f$</td>
<td>Virtex 6</td>
<td>256</td>
<td>1012</td>
<td>2900</td>
<td>777</td>
<td>125</td>
<td>2.03</td>
<td>126</td>
<td>8.02</td>
</tr>
<tr>
<td>Radix-2 PIM [16]</td>
<td>Virtex II pro</td>
<td>256</td>
<td>3475</td>
<td>-</td>
<td>-</td>
<td>80</td>
<td>3.02</td>
<td>84.76</td>
<td>40.99</td>
</tr>
<tr>
<td>Radix-2 PIM [16]$^f$</td>
<td>Virtex 6</td>
<td>256</td>
<td>1190</td>
<td>3207</td>
<td>1075</td>
<td>174.1</td>
<td>1.48</td>
<td>172.9</td>
<td>6.879</td>
</tr>
<tr>
<td>BE Radix-4 IM [13]</td>
<td>Virtex 6</td>
<td>256</td>
<td>1375</td>
<td>4630</td>
<td>-</td>
<td>86.6</td>
<td>1.49</td>
<td>171.8</td>
<td>7.42</td>
</tr>
<tr>
<td>Radix-4 IMML [18]</td>
<td>Virtex 6</td>
<td>256</td>
<td>1985</td>
<td>6300</td>
<td>2187</td>
<td>166</td>
<td>0.78</td>
<td>328</td>
<td>6.04</td>
</tr>
<tr>
<td></td>
<td></td>
<td>224</td>
<td>1745</td>
<td>5367</td>
<td>1883</td>
<td>167.6</td>
<td>0.68</td>
<td>329.4</td>
<td>5.29</td>
</tr>
<tr>
<td></td>
<td></td>
<td>192</td>
<td>1519</td>
<td>4641</td>
<td>1625</td>
<td>168.7</td>
<td>0.57</td>
<td>336.84</td>
<td>4.52</td>
</tr>
<tr>
<td>Radix-8 IMML [18]$^f$</td>
<td>Virtex 6</td>
<td>256</td>
<td>4428</td>
<td>13880</td>
<td>2756</td>
<td>124.4</td>
<td>0.69</td>
<td>79.2</td>
<td>11.93</td>
</tr>
<tr>
<td></td>
<td></td>
<td>224</td>
<td>4014</td>
<td>12377</td>
<td>2436</td>
<td>127.6</td>
<td>0.59</td>
<td>379.66</td>
<td>10.57</td>
</tr>
<tr>
<td></td>
<td></td>
<td>192</td>
<td>3631</td>
<td>10520</td>
<td>2116</td>
<td>132.6</td>
<td>0.48</td>
<td>400</td>
<td>9.07</td>
</tr>
<tr>
<td>BE Radix-4 IMML</td>
<td>Virtex 6</td>
<td>256</td>
<td>1631</td>
<td>4935</td>
<td>1382</td>
<td>137.87</td>
<td>0.93</td>
<td>275.26</td>
<td>5.9</td>
</tr>
<tr>
<td></td>
<td></td>
<td>224</td>
<td>1496</td>
<td>4427</td>
<td>1221</td>
<td>142.7</td>
<td>0.79</td>
<td>283.54</td>
<td>5.27</td>
</tr>
<tr>
<td></td>
<td></td>
<td>192</td>
<td>1395</td>
<td>3846</td>
<td>1057</td>
<td>145.7</td>
<td>0.66</td>
<td>290.9</td>
<td>4.8</td>
</tr>
<tr>
<td></td>
<td></td>
<td>160</td>
<td>1042</td>
<td>3184</td>
<td>910</td>
<td>147</td>
<td>0.54</td>
<td>296.2</td>
<td>3.51</td>
</tr>
<tr>
<td>BE Radix-8 IMML</td>
<td>Virtex 6</td>
<td>256</td>
<td>3622</td>
<td>10284</td>
<td>1952</td>
<td>123.35</td>
<td>0.696</td>
<td>367.81</td>
<td>9.84</td>
</tr>
<tr>
<td></td>
<td></td>
<td>224</td>
<td>3326</td>
<td>9115</td>
<td>1727</td>
<td>125.7</td>
<td>0.61</td>
<td>367.2</td>
<td>9.05</td>
</tr>
<tr>
<td></td>
<td></td>
<td>192</td>
<td>2745</td>
<td>7728</td>
<td>1502</td>
<td>127</td>
<td>0.50</td>
<td>384</td>
<td>7.14</td>
</tr>
<tr>
<td></td>
<td></td>
<td>160</td>
<td>2306</td>
<td>6334</td>
<td>1276</td>
<td>128.4</td>
<td>0.42</td>
<td>609.5</td>
<td>6.05</td>
</tr>
</tbody>
</table>

$^a$Our implementation, $^b$Look-up-tables, $^c$Slice registers, $^d$Throughput, $^e$Slice area times product per bit

The radix-8 IMML multiplier is performed in 4 clock cycles. As the total number of iterations in the Alg. 3 is $\lceil n/3 \rceil$, therefore the proposed architecture computes $n$-bit modular multiplication operation in $\lceil n/2 \rceil + 4$ clock cycles.

V. IMPLEMENTATION AND RESULTS

The proposed BE radix-4 and BE radix-8 IMML multipliers have been coded in Verilog HDL and Xilinx ISE 14.2 Design Suite is used for synthesis, mapping, placement and routing purposes targeting Virtex-6 FPGA device XCV6LX550. Xilinx ISIM simulator has been used for behavioral simulation of the designs. Addition and subtraction is performed through in-built fast carry chains of the device.

Table IV lists PAR (post placement and routing) results of the designs against four different field sizes (160, 192, 224, 256), where the BE radix-4 IMML multiplier computes a 256-bit modular multiplication operation in 0.93 us, consuming 1985 slices (6300 LUTs) whilst running at 137.87 MHz maximum frequency. The BE radix-8 IMML multiplier completes the same bit length operation in 0.696 us, occupies 3622 slices (10284 LUTs). A throughput (TP) of our BE radix-4 IMML is $275.26 \times 10^6$ bits per second (bps), whereas the BE radix-8 IMML has a TP of $367.8 \times 10^6$ bps. Slice area $\times$ time per bit (AT/b) values of the BE radix-4 and BE radix-8 IMML designs are 5.9, 9.84 respectively.

The same Table also demonstrates performance metrics of several other related designs based on the IM algorithm. It is important to mention that to perform a fair and conclusive performance comparison, Table IV only demonstrates designs based on IM algorithm. Our radix-2 implementation of the IM algorithm [7] on the same platform takes 2.03 us to perform a 256-bit modular multiplication operation and occupies 1012 slices (2900 LUTs) whilst achieving a 125 MHz maximum frequency. It is 54% and 66% slower than the proposed BE radix-4 IMML and BE radix-8 IMML multipliers, respectively. The design in [16] is also based on radix-2 implementation of the algorithm on Virtex-II pro, where it computes the operation in 3.02 us and occupies 2475 slices. A direct comparison against our designs and [16] is not conclusive, therefore we implement the design on Virtex-6, where it takes 1.48 us, occupies 1190 slices and achieves 171.8 MHz maximum frequency. It is 1.59 and 2.12 times slower as compared to the proposed BE radix-4 and BE radix-8 IMML modular multipliers respectively.

The BE radix-4 and BE radix-8 based designs reported in [13] are not using ML method, hence, they execute the main operations of IM algorithm in a serial fashion. Though these designs consumes the same amount of clock cycles to perform a modular multiplication operation, however due to serial nature of the designs they are not able to achieve lower critical path delays as compared to the proposed designs. Our BE radix-4 IMML design is 1.6 times faster than the BE radix-4 IM design, while the BE radix-8 IMML design is 1.74 times faster than the BE radix-8 IM design on the same platform.

In [18] radix-4 IMML design is proposed and its implementation on Virtex-6 platform takes 0.77 us, occupies 1985 slices, and runs at a maximum frequency of 166 MHz. It is 0.16 times faster than our BE radix-4 IMML design, however it consumes 0.18 times more FPGA slices. If the same idea [18] is even extended to radix-8 then it requires 4428 slices, which is 1.2 times more than our BE radix-8 IMML design without any significant speed improvement. In Fig. 5 we present comparisons among the above mentioned designs on the basis of throughput versus slice area $\times$ time per bit value (AT/b) for a field size of 256-bit, which depicts that our BE radix-4 IMML design has the lowest area-delay product while the BE radix-8 IMML has the same throughput and much lower area-delay product as compared to the Radix-8 IMML [18]$^f$. Therefore, by combining BE and Montgomery powering ladder approaches to include a parallelism in the IM algorithm result in speed and area optimized modular multipliers. Both of the presented architectures are highly flexible to perform modular
multiplication for any prime number $M$, which demonstrates that these designs are very much compatible to build many crypto-processors based on elliptic curve cryptography (ECC) using several different arbitrary curves [22].

VI. CONCLUSION

This paper has introduced Booth encoded radix-4 and radix-8 Montgomery powering ladder based modular multipliers. Higher-radix approaches have been used to decrease iteration count, which formally is the total number of required clock cycles to compute a modular multiplication operation. We noticed that using the Montgomery powering ladder approach reduces inner multiplier complexities and enables parallel execution of the main operations. Furthermore, we explored that incorporating Booth encoding logic helps to reduce design space complexity with a slight overhead in the critical path as compared to designs not using Booth recoding logic. Radix-4 Booth encoded Montgomery powering ladder version computes a $n$-bit modular multiplication operation in $\lceil n/2 \rceil + 2$ clock cycles and achieves a maximum frequency of 137.87 MHz, while the radix-8 version takes $\lceil n/3 \rceil + 4$ clock cycles and achieves a maximum frequency of 123.43 MHz for a field size of 256-bit.

VII. ACKNOWLEDGMENT

This work is funded by the Higher Education Authority (HEA) Ireland, under Telecommunication Graduate Initiative (TGI) scheme.

REFERENCES