An Instruction Set Extension to Support Software-Based Masking

Si Gao$^1$, Johann Großschädl$^2$, Ben Marshall$^3$, Dan Page$^3$, Thinh Pham$^3$ and Francesco Regazzoni$^{4,5}$

$^1$ Alpen-Adria Universität Klagenfurt.  
$^2$ Department of Computer Science, University of Luxembourg.  
$^3$ Department of Computer Science, University of Bristol.  
$^4$ University of Amsterdam,  
$^5$ ALaRI, University of Lugano.

Abstract. In both hardware and software, masking can represent an effective means of hardening an implementation against side-channel attack vectors such as Differential Power Analysis (DPA). Focusing on software, however, the use of masking can present various challenges: specifically, it often 1) requires significant effort to translate any theoretical security properties into practice, and, even then, 2) imposes a significant overhead in terms of efficiency. To address both challenges, this paper explores use of an Instruction Set Extension (ISE) to support masking in software-based implementations of a range of (symmetric) cryptographic kernels including AES: we design, implement, and evaluate such an ISE, using RISC-V as the base ISA. Our ISE-supported first-order masked implementation of AES, for example, is an order of magnitude more efficient than a software-only alternative wrt. execution latency and memory footprint; this renders it comparable to an unmasked implementation using the same metrics, but also first-order secure.

Keywords: side-channel attack, masking, RISC-V, ISE

1 Introduction

The threat of implementation attacks. Evolution of the technology landscape, for example improvement in storage, computational, and communication capability, has produced a corresponding evolution in user-facing platforms and applications that we now routinely depend on. Many such cases are now deemed security-critical, as a result of trends such increased connectivity (cf. IoT), outsourced computation (cf. cloud computing), and use of identity-, location-, and finance-related data. Within this setting, cryptography often represents a transparent enabler: cryptographic solutions are routinely tasked with ensuring the secrecy, robustness, and provenience of our data (when communicated and/or while stored), plus the authenticity of interacting parties. While mature theoretical foundations often underpin such solutions, their realisation in practice can remain difficult. For example, per the remit above, cryptographic implementations represent an important component of the attack surface; in an attack landscape of increasing breadth and complexity (where “attacks only get better”), the threat of implementation attacks is particularly acute.
The premise of an implementation attack is that by considering a concrete implementation, vs. an abstract specification say, theoretical security properties (however strong) can potentially be bypassed. At a high level, they are often divided into active (e.g., fault injection) or passive (i.e., side-channel) classes. Differential Power Analysis (DPA) [KJJ99, MOP07] is an concrete example\(^1\) of a side-channel attack with particular relevance to embedded devices. Following an optional profiling phase, a typical DPA attack performs an initial, online acquisition phase: (passive) monitoring by the attacker yields traces of power consumption during computation of some target operation by the target device. The underlying assumption is that both operations (e.g., addition vs. multiplication) and the operands they process (e.g., higher vs. lower Hamming weight) contribute to features, or leak information, then evident in the traces. Such features are harnessed by a subsequent, offline analysis phase, which attempts to recover security-critical information (e.g., key material) they relate to.

**Challenges in realisation of countermeasures.** Techniques for mitigating implementation attacks are becoming increasingly well understood. At a high level, examples pertinent to DPA are often classified as based on hiding [MOP07, Chapter 7] and/or masking [MOP07, Chapter 10]. The latter, which is our focus, can be viewed as a low-level realisation of the more typically higher level “computing on encrypted data” concept. For a target operation normally invoked as \( r = f(x) \), application of a given masking scheme demands that 1) \( x \) is masked (resp. encrypted) to yield \( \tilde{x} \), 2) alternative computation is applied to \( \tilde{x} \), i.e., \( \tilde{r} = \tilde{f}(\tilde{x}) \), such that it acts on the underlying \( x \) in a manner compatible with \( f \), then 3) \( \tilde{r} \) is unmasked (resp. decrypted) to yield \( r \); any leakage stemming from the computation of \( \tilde{f} \) will now relate to \( \tilde{x} \) rather than \( x \), so the latter cannot be directly recovered as would likely be the case using \( f \).

In common with other countermeasure techniques, masking can be utilised at various levels in either hardware and/or software: for example, algorithm-level (e.g., to a block cipher such as AES [Mes01]), system-level (e.g., across the data-path of a processor core [GJM+16, MGH19]), and gate-level (e.g., in secure logic style such as MDPL [PM05]) techniques are all viable. For a concrete implementation that uses such techniques, however, at least two significant challenges must be addressed. First, it must translate theoretically modelled security properties into practice. This challenge is neatly illustrated by the contrast between a theoretically, provably secure masking scheme proposed by Rivain and Prouff [RP10], vs. attacks on a practical implementation thereof by Balasch et al. [BGG+14]. Second, it must do so while satisfying other quality metrics such as demand for high-volume, low-latency, high-throughput, low-footprint, and/or low-power.

**An ISE-assisted approach to masking.** Instruction Set Extensions (ISEs) [GB11, BGM09, RI16] have proved an effective implementation technique within the context of cryptography. The idea is to identify, e.g., through benchmarking, a set of additional instructions that allow the target operation to leverage special-purpose, domain-specific functionality in the resulting ISE, vs. general-purpose functionality in the base Instruction Set Architecture (ISA), and thereby deliver improvement wrt. pertinent quality metrics. ISEs are particularly effective for embedded devices, because they afford a compromise improving footprint and latency vs. a software-only option while also improving area and flexibility vs. a hardware-only option.

There is an increasingly accepted argument (see, e.g., [RKL+04, RKKH04, BMT16]) that security should be considered as a first-class metric at design-time, rather than a problem to be addressed in a reactive, post hoc manner. In line with such an argument, this paper explores use of an ISE as a means of supporting masking in software-based

\(^1\)Although our focus is specifically on DPA, we note that associated attack and countermeasure techniques apply more generally, e.g., to classes such as EM.
implementations of cryptography: we design, implement, and evaluate such an ISE using RISC-V as the base ISA. We suggest there are (at least) three reasons an ISE-based approach may be attractive vs. alternatives (e.g., a dedicated IP module). First, use of masking in software-only implementations will impose a significant overhead, e.g., wrt. execution latency and demand for high-quality randomness; our ISE can help mitigate this problem. Second, an ISE is well positioned to act as an interface wrt. security properties. For example, there is increased evidence (see, e.g., [CGD18, GMPO19]) that secure use of masking in software-only implementations is complicated by the lack of guarantees wrt. leakage stemming from the underlying micro-architecture; our ISE can help mitigate this problem, e.g., by adopting an approach similar to the augmented ISA (or aISA) of Ge et al. [GYH18] and constraining the micro-architecture to meet properties demanded by the ISA. Third, the design of masking schemes is a relatively fast-paced field, with novel designs and techniques appearing regularly. Our ISE mitigates this problem by following a RISC-like ethos: it provides a suite of general-purpose “building block” operations, that can be used to support a wide range of cryptographic constructions (e.g., block ciphers) and higher-level masking schemes.

We note that, concurrently with our work, Kiaei and Schaumont [KS20] published a proposal that is similar in some respects. We detail the differences between their and our work in Section 2.3, but, in short, note that we a) enrich the ISE with a wider set of operations, b) provide an implementation of the ISE within an existing RISC-V compliant micro-architecture, and c) evaluate it, wrt. efficiency and security properties, using a suite of representative kernels.

Organisation. Section 2 surveys related work. Section 3 introduces the ISE design. Section 4 looks at the ISE from a hardware perspective, outlining and then evaluating an implementation of the ISE set within the context of an existing RISC-V compliant micro-architecture. Section 5 looks at the ISE from a software perspective, focusing on how it is utilised: we evaluate the ISE when used to implement a range of (symmetric) cryptographic kernels including AES. Finally, Section 6 presents some conclusions and potential directions for future work.

2 Background

2.1 RISC-V

RISC-V (see, e.g., [AP14, Wat16]) is an open ISA specification. It adopts strongly RISC-oriented design principles (so is similar to MIPS) and can be implemented, modified, or extended by anyone with neither licence nor royalty requirements (so is dissimilar to MIPS, ARM, and x86). A central tenet of the ISA is modularity: a general-purpose base ISA can be augmented with some set of special-purpose, standard or non-standard (i.e., custom) extensions. As a result of these features, coupled with the surrounding community and availability of supporting infrastructure such as compilation tool-chains, a range of (typically open-source) RISC-V implementations exist.

We focus wlog. on extending RV32I [RV:19, Section 2], i.e., the 32-bit integer RISC-V base ISA. Let \( GPR[i] \), for \( 0 \leq i < 32 \), denote the \( i \)-th entry of the general-purpose register file. RISC-V uses XLEN to denote the word size; we adopt the same approach, but by focusing on RV32I assume a focus on XLEN = 32.

2.2 Masking

Masking is based on the concept of secret-sharing. In 1999, Chari et al. [CJRR99] leveraged this concept as a countermeasure against side-channel attacks. However, use of the term
masking first appeared in 2000 when Messerges [Mes00] described the use of a “random mask to obscure the calculation made by the fundamental operations” of AES candidates.

A given masking scheme specifies a non-standard representation of data, where each variable $x$ is represented by (or split into) $n$ separate shares, and a non-standard implementation of functions, which operate on said representations. The shares representing some $x$ must fulfill two properties: 1) they must be uniformly distributed, and 2) every subset of shares has to be statistically independent from $x$. An implementation of such a scheme is said to resist a $t$-th order attack (e.g., under the probing model of Ishai, Sahai, and Wagner [ISW03]), if knowledge of $t < n$ shares cannot be used to recover $x$.

2.2.1 Representation

A masking scheme can be classified as Boolean (or additive) or arithmetic (or multiplicative). If $x_i$ denotes the $i$-th share for $0 \leq i < n$, the shares representing $x$ satisfy $x_0 \oplus x_1 \oplus \cdots \oplus x_{n-1}$ under Boolean masking, and $x_0 + x_1 + \cdots + x_{n-1} \mod 2^w$ under arithmetic masking. Consider the specific case of $n = 2$, and let $\hat{x} = (x_0, x_1)$ denote the representation of some $x$ under Boolean masking, i.e., as two shares $x_0$ and $x_1$: this demands $x = x_0 \oplus x_1$. Likewise, let $\bar{x} = (x_0, x_1)$ denote the representation of some $x$ under arithmetic masking: this demands $x = x_0 + x_1 \mod 2^w$, noting that wlog. we set $w = XLEN = 32$.

2.2.2 Hardware-oriented implementation

Classical. Goubin and Patarin [GP99] formalised the idea of replacing each intermediate variable of the computation that is dependent of the inputs or outputs (thus potentially exploitable by an attacker), by a combination of sub-variables. The recovery of the original variable would be possible only when all the sub-variables are combined together. This approach is secure if the function selected for implementing the combination operation allows one to perform computation with the sub-variables without computing the original variable. The two functions analysed are XOR (cf. additive masking) and modular multiplication (cf. multiplicative masking).

Threshold Implementation (TI). Threshold Implementation (TI), presented by Nikova et al. [NRR06], is a countermeasure that is provable secure against first-order attacks (even in the presence of glitches). TI requires use of shares with three properties: correctness, incompleteness, and uniformity. Correctness means that the computation performed on the shares should be correct, i.e., composition of the results of the operations carried out on each shares has to be equal to the shared representation of the original result. Incompleteness means that each share should be independent from at least one input share. To guarantee the security of the scheme, masks must be uniformly distributed. Uniformity is usually the most difficult property to guarantee, but can be relaxed by using non-uniform functions if their randomness is refreshed frequently.

Domain-Oriented Masking (DOM). Domain-Oriented Masking (DOM) is presented by Gross et al. [GMK16]. The main objective of their work was to guarantee security against $t$-th order attacks using $n = t + 1$ shares, reaching the same level of security of TI, but incurring in less area overhead (when implemented in hardware) and requiring less randomness. To achieve this, the authors concentrate their effort in the design of the DOM-dep multiplier, that is a dedicated masked multiplier to implementing the proposed scheme in an efficient and secure way. The approach is evaluated using the AES algorithm as a case of study, which is analysed upto a 15-th order security level.
2.2.3 Software-oriented implementation

Although masking can be applied to more general classes of computation, consider application to block ciphers specifically. The main challenge when applying masking in software is to implement the round functions in such a way that the shares can be processed independently from each other, while it still must be possible to recombine them at the end of the execution to get the correct result. This is fairly easy for all linear operations, but can introduce massive overheads for the non-linear parts of a block cipher, e.g., S-boxes or modular additions/subtractions. Furthermore, all round transformations need to be executed twice (namely for \(x_0\) and for \(x_0\), where \(x = x_0 \oplus x_1\), which imposes additional overhead. Another problem is that a basic 2-share masking scheme is vulnerable to a so-called second-order attack where an attacker combines information from two leakage points. Such a second-order attack can, in turn, be thwarted by second-order masking, in which each sensitive variable is concealed with two random masks and, consequently, represented by three shares.

Depending on the algorithmic properties of a block cipher, a masking scheme can have to protect Boolean operations (e.g., XOR, shift) or arithmetic operations (e.g., modular addition). When a block cipher involves both Boolean and arithmetic operations, it is necessary to convert the masks from one form to the other to obtain the correct ciphertext (or plaintext). Examples of symmetric algorithms that involve arithmetic as well as Boolean operations include the widely-used hash functions SHA-2, Blake and Skein, and any ARX-based block cipher (e.g., Speck). In essence, the basic operations performed by common block ciphers can be divided into three categories depending on how costly they are to mask in software: 1) linear operations (e.g., XOR, NOT, shift, rotation), 2) non-linear Boolean operations (e.g., AND, OR), and 3) non-linear arithmetic operations (e.g., modular addition, and inversion in \(\mathbb{F}_{2^8}\)).

As mentioned before, linear operations like XOR and rotation are fairly easy to mask in software since one just has to apply the operation to each pair of shares individually. The XOR of a constant to a set of shares can be performed by XOR'ing it to a single share. Similarly, the logical NOT operation is masked by applying NOT to one of the shares. Computing a non-linear Boolean function on the shares assuring all variables processed are independent of sensitive variables is more complicated and introduces higher computational overheads. The simplest non-linear Boolean operations is the logical AND, which can be masked in different ways, whereby the different approaches proposed in the literature differ by the amount or randomness and the number of underlying basic operations. The first proposal for a first-order masked AND gate came from Trichina and was published more than 15 years ago \cite{Tri03}. This so-called “Trichina AND-gate” consists of four basic AND operations, four XORs, and requires additional fresh randomness to ensure that the shares are statistically independent of any sensitive variable. Biryukov et al. introduced in \cite{BDCU17} an improved expression for masked AND that consists of only seven basic operations does not require an additional random variable since the shares are inherently refreshed. Furthermore, on ARM micro-controllers, the masked AND can be performed using only six basic instructions. Biryukov et al. also presented a masked OR operation, which consists of only six basic operations (and six basic instructions on ARM) and does not require fresh randomness.

Highly non-linear arithmetic operations, such as modular addition or inversion in a binary field, are the most costly operations when it comes to masking in software. There are two basic options for implementing a masked addition (or subtraction) in software; the first consists of converting the Boolean shares to Arithmetic shares, then performing the addition on the arithmetic shares, and finally converting the arithmetic shares of the sum back to Boolean shares. The second option is to perform the modular addition directly on Boolean shares without conversion. Both options have in common that a straightforward software implementation has a complexity that increases linearly with
the length of the operands to be added. Coron et al. presented in [CGTV15] a recursive formula for arithmetic addition on Boolean shares with logarithmic complexity. This approach is based on the Kogge-Stone adder (a special variant of a carry-lookahead adder) and uses masked AND, masked XOR, and masked shift as sub-operations. Biryukov et al. presented an improved Kogge-Stone adder that uses the more efficient masked operations from [BDCU17] and is able to perform a 32-bit addition on Boolean shares between 14% and 19% faster than the Kogge-Stone adder of Coron et al.

### 2.3 Related work

Gross et al. [GJM+16] propose a SCA-protected processor design based on the open-source V-scale RISC-V processor. The starting point is the experience gained with the study of DOM (introduced in the previous section) which is leveraged to modify the CPU to make it resistant against side-channel attacks. The authors split the processor in a protected and an unprotected part, and realize an ALU protected using the domain oriented approach to carry the needed basic operations. Experimental results show an increased resistance against side-channel attacks and a scale of the system almost linear with the order of protection.

Protection against power analysis attacks for the RISC-V processor have been also proposed by De Mulder et al. [MGH19]. The proposed solution aims at protect against first-order power and electromagnetic attacks. The protection is achieved using a combination of known masking techniques and a masked access to memory. The second mask for accessing the memory is generated on the fly within the boundary of the CPU, and thus, at least in principle, robust. The leakage reduction is demonstrated by a number of practical experiments.

The most relevant work to ours is probably that of Kiaei and Schaumont [KS20]. They propose to extend the RISC-V processor with dedicated instructions to mitigate side-channel attacks, focusing in particular on DOM. Our paper shares the core idea of extending the instruction set of a processor to achieve side-channel resistance, but provides novel contributions. Firstly, our instructions are not limited to the case of DOM, but are suitable for implementing masking countermeasures in general and can protect a wide range of algorithms. Secondly, our instructions are integrated in the core allowing us to provide an quantitative analysis of the achieved robustness and of the performance overhead. Lastly, we show that it is possible to achieve security of masking using dedicated instructions without the need of duplicating the datapath to strongly separate the secure and insecure zone. To the best of our knowledge, previous works on instruction set extension for accelerating masking and for side-channel security in general, have always proposed to have such strong differentiation.

### 3 A design perspective

**Concept.** Focusing wlog. on use of Boolean masking, the ISE targets inclusion of instructions to support 1) binary masked operations, i.e., \( \hat{r} = \hat{x} \ominus \hat{y} \) for some set of \( \ominus \), 2) unary masked operations, i.e., \( \hat{r} = \ominus \hat{x} \) for some set of \( \ominus \), and 3) various auxiliary operations, such as conversion into, from, and between masked representations. The set of supported operations should be general-purpose in the sense they are useful for a range of cryptographic constructions and masking schemes; they often have an equivalent in, and so represent close to a “drop in” replacement for instructions in the base ISA by including, e.g., \( \ominus \in \{ \wedge, \vee, \oplus, +, - \} \) and \( \ominus \in \{ \neg \} \) to mirror the unmasked Boolean operations already available. Doing so is complicated, however, by the fact that for \( n = 2 \) shares we have

\[
\hat{r} = \hat{x} \ominus \hat{y} \implies (r_0, r_1) = (x_0, x_1) \ominus (y_0, y_1),
\]
### Syntax Semantics

<table>
<thead>
<tr>
<th>Class</th>
<th>Syntax</th>
<th>Semantics</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>C mask.a2b (rd1,rd2), (rs1,rs2)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Bool2Arith}((\text{GPR}[rs1], \text{GPR}[rs2]))$</td>
</tr>
<tr>
<td>2</td>
<td>mask.b2a (rd1,rd2), (rs1,rs2)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Arith2Bool}((\text{GPR}[rs1], \text{GPR}[rs2]))$</td>
</tr>
<tr>
<td>3</td>
<td>B mask.b.mask (rd1,rd2), rs1</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{BoolMask}(\text{GPR}[rs1])$</td>
</tr>
<tr>
<td>4</td>
<td>mask.b.umask rd1, (rs1,rs2)</td>
<td>$\Rightarrow \text{GPR}[rd1] \leftarrow \text{BoolUnmask}((\text{GPR}[rs1], \text{GPR}[rs2]))$</td>
</tr>
<tr>
<td>5</td>
<td>mask.b.remask (rd1,rd2), (rs1,rs2)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{BoolRemask}((\text{GPR}[rs1], \text{GPR}[rs2]))$</td>
</tr>
<tr>
<td>6</td>
<td>mask.b.not (rd1,rd2), (rs1,rs2)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{BoolNot}((\text{GPR}[rs1], \text{GPR}[rs2]))$</td>
</tr>
<tr>
<td>7</td>
<td>mask.b.and (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{BoolAnd}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
<tr>
<td>8</td>
<td>mask.b.ior (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{BoolIor}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
<tr>
<td>9</td>
<td>mask.b.xor (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{BoolXor}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
<tr>
<td>10</td>
<td>mask.b.slli (rd1,rd2), (rs1,rs2), imm</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Slli}((\text{GPR}[rs1], \text{GPR}[rs2]), \text{imm})$</td>
</tr>
<tr>
<td>11</td>
<td>mask.b.srli (rd1,rd2), (rs1,rs2), imm</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Srli}((\text{GPR}[rs1], \text{GPR}[rs2]), \text{imm})$</td>
</tr>
<tr>
<td>12</td>
<td>mask.b.rori (rd1,rd2), (rs1,rs2), imm</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Rori}((\text{GPR}[rs1], \text{GPR}[rs2]), \text{imm})$</td>
</tr>
<tr>
<td>13</td>
<td>mask.b.add (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Add}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
<tr>
<td>14</td>
<td>mask.b.sub (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Sub}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
<tr>
<td>15</td>
<td>A mask.a.mask (rd1,rd2), rs1</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{ArithMask}(\text{GPR}[rs1])$</td>
</tr>
<tr>
<td>16</td>
<td>mask.a.umask rd1, (rs1,rs2)</td>
<td>$\Rightarrow \text{GPR}[rd1] \leftarrow \text{ArithUnmask}((\text{GPR}[rs1], \text{GPR}[rs2]))$</td>
</tr>
<tr>
<td>17</td>
<td>mask.a.remask (rd1,rd2), (rs1,rs2)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{ArithRemask}((\text{GPR}[rs1], \text{GPR}[rs2]))$</td>
</tr>
<tr>
<td>18</td>
<td>mask.a.add (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Add}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
<tr>
<td>19</td>
<td>mask.a.sub (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{Sub}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
<tr>
<td>20</td>
<td>F mask.f.sqr (rd1,rd2), (rs1,rs2)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{FieldSqr}((\text{GPR}[rs1], \text{GPR}[rs2]))$</td>
</tr>
<tr>
<td>21</td>
<td>mask.f.mul (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{FieldMul}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
<tr>
<td>22</td>
<td>mask.f.aff (rd1,rd2), (rs1,rs2), (rs3,rs4)</td>
<td>$\Rightarrow (\text{GPR}[rd1], \text{GPR}[rd2]) \leftarrow \text{FieldAff}((\text{GPR}[rs1], \text{GPR}[rs2]), (\text{GPR}[rs3], \text{GPR}[rs4]))$</td>
</tr>
</tbody>
</table>

Table 1: A summary of additional instructions that constitute the ISE, and their mapping onto underlying operations.
for example. That is, doing so increases the number of register indexes required, and, therefore, pressure on instruction encoding: an unmasked binary (resp. unary) operation requires 3 (resp. 2) register indexes, whereas a masked equivalent requires 6 (resp. 4). The same scenario is articulated by Lee et al. [LYS04], who describe use the term Multi-word Operand, Multi-word Result (MOMR) to characterise and thereby distinguish cryptographic operations from the general case. There are various ways to satisfy this requirement: we use an implied approach, where two indexes are encoded as one, i.e., 

\[(i, i + 1) \mapsto i\] 

For example, the even-odd index pair (2, 3) is encoded as the first, even index 2; the second, odd index 3 is then implicit rather than explicit. This is an limited instance of the Register File Extension for Multi-word and Long-word Operation (RFEMLO) approach proposed by Lee and Choi [LC08].

The ISE itself constitutes 22 additional instructions, which can be divided into 4 feature classes. Table 1 offers a high-level summary of these instructions, with the underlying operations captured in an algorithmic format by Appendix C to avoid repetition inline; we discuss their design in more detail below.

**Notation.** The RISC-V naming convention [RV:19, Section 27] for ISEs uses a string of single-character identifiers to specify features realised in an implementation. We adopt a variant of this approach, where, for example, ISE[CBA] denotes a concrete implementation of the ISE that supports the C, B, and A feature classes but not the F feature class.

Given the number of options available, we adopt a short-hand notation for software implementations: we use ISA to denote an unmasked implementation using the base ISA, ISA to denote a masked implementation using the base ISA, and ISE to denote a masked implementation using the ISE.

**Conversion (C-class).** The ISE includes a suite of instructions that support conversion of operands under Boolean masking to/from arithmetic masking. For example, the instruction

\[
\text{mask.b2a (rd1,rd2)}, \ (rs1,rs2)
\]

uses the input \(\hat{x} = (x_0, x_1) = (\text{GPR}[rs1], \text{GPR}[rs2])\) so \(x = x_0 \oplus x_1\); it computes the output \(\hat{r} = (r_0, r_1) = (\text{GPR}[rd1], \text{GPR}[rd2]) = \text{Bool2Arith}((x_0, x_1))\) so \(r = r_0 - r_1 \pmod{2^w} = x\).

**Operations under Boolean masking (B-class).** The ISE includes a suite of instructions that support Boolean masking. They allow masking, unmasking, remasking, and application of operations to (masked) operands: these operations include NOT, AND, OR, XOR, left-and right-shift, right-rotate, addition, and subtraction. For example, the instruction

\[
\text{mask.b.add (rd1,rd2)}, \ (rs1,rs2), \ (rs3,rs4)
\]

uses the inputs \(\hat{x} = (x_0, x_1) = (\text{GPR}[rs1], \text{GPR}[rs2])\) and \(\hat{y} = (y_0, y_1) = (\text{GPR}[rs3], \text{GPR}[rs4])\) so \(x = x_0 \oplus x_1\) and \(y = y_0 \oplus y_1\); it computes \(\hat{r} = (r_0, r_1) = (\text{GPR}[rd1], \text{GPR}[rd2]) = \text{BoolAdd}((x_0, x_1), (y_0, y_1))\) so \(r = r_0 \oplus r_1 = x + y\).

**Operations under arithmetic masking (A-class).** The ISE includes a suite of instructions that support arithmetic masking. They allow masking, unmasking, remasking, and application of operations to (masked) operands: these operations include addition and subtraction. For example, the instruction

\[
\text{mask.a.sub (rd1,rd2)}, \ (rs1,rs2), \ (rs3,rs4)
\]

uses the inputs \(\hat{x} = (x_0, x_1) = (\text{GPR}[rs1], \text{GPR}[rs2])\) and \(\hat{y} = (y_0, y_1) = (\text{GPR}[rs3], \text{GPR}[rs4])\) so \(x = x_0 + x_1 \pmod{2^w}\) and \(y = y_0 + y_1 \pmod{2^w}\); it computes \(\hat{r} = (r_0, r_1) = (\text{GPR}[rd1], \text{GPR}[rd2]) = \text{ArithSub}((x_0, x_1), (y_0, y_1))\) so \(r = r_0 - r_1 \pmod{2^w} = x - y\).
Operations for field arithmetic (F-class). Arithmetic operations in the finite field $\mathbb{F}_{2^8}$ play an essential role in many symmetric cryptosystems, most notable the AES [DR02]. For example, the SubBytes transformation of the AES performs an inversion of an element of $\mathbb{F}_{2^8}$, followed by an affine transformation. The MixColumns transformation can be interpreted as multiplications of polynomials whose coefficients are elements of $\mathbb{F}_{2^8}$. Besides the AES, many other symmetric cryptosystems involve arithmetic operations in $\mathbb{F}_{2^8}$; these include the block ciphers SM4 and Camellia, the hash function Grøstl, the authenticated encryption algorithms COMET and Saturnin (which made it into the second round of the current NIST lightweight cryptography standardization project), and many more.

When it comes to masking of the AES (and AES-like or AES-inspired designs), two basic approaches received particular attention in the recent literature. The first approach uses a bit-sliced implementation as starting point and applies masking to the underlying logical operations [SS16]. Such masked bit-sliced implementations are attractive because they can reach relatively high throughput; for example, Schwabe and Stoffelen [SS16] report an encryption time of 7422.6 cycles per block for first-order masked AES on a Cortex-M4 micro-controller when encrypting 256 consecutive blocks. However, the main disadvantage of bit-slicing is that it can only be applied to non-feedback modes of operation like counter mode. In addition, bit-slicing introduces an disproportionally high overhead when the amount of data to be encrypted is small, as is often the case for applications that run on constrained devices. An alternative approach is the well-known masking technique of Rivain and Prouff [RP10], which is provably secure in the probing model and can be straightforwardly extended to higher orders. The Rivain-Prouff masking technique requires performing the inversion in $\mathbb{F}_{2^8}$ through a sequence of multiplication and squarings along with mask refreshings to inject independent randomness. When properly implemented, the Rivain-Prouff masking can meet the strong theoretical security promises in practice, but introduces a massive penalty in execution time. For example, a first-order masked implementation of AES-128 on an ARM Cortex-M3 micro-controller is between 40 and 60 times slower than unprotected reference implementation [GR17].

The B-class instructions described above, most notably mask.b.xor and mask.b.and, can be applied for Boolean masking of bit-sliced implementations of any symmetric cryptosystem, including the AES. However, given the mentioned limitations of bit-slicing (most notably the restriction to non-feedback modes of operation), which carry over to masked bit-slicing, it makes sense to define an ISE to support the masking of non-bit-sliced implementations of the AES. The SubBytes transformation deserves particular attention since it includes inversion in $\mathbb{F}_{2^8}$, which is non-linear and, therefore, extremely costly to mask. Rivain and Prouff [RP10] proposed to mask SubBytes by performing a sequence of masked multiplications and squarings in $\mathbb{F}_{2^8}$, followed by a masked affine transformation. The masked multiplication and squaring are, in turn, composed of “ordinary” multiplications and squarings in $\mathbb{F}_{2^8}$, which are usually implemented using look-up tables (cf. [GR17, Section 3]). However, look-up tables add to the memory footprint and may enable cache attacks when executed on devices with data cache. Said problems can be easily overcome by defining instructions for multiplication, squaring, and affine transformation in $\mathbb{F}_{2^8}$. These operations are ubiquitous in symmetric cryptography (cf. SM4, Camellia, Grøstl, etc.), which means ISE for masked multiplication, masked squaring and masked affine transformation are in line with the general design concept described at the beginning of this section, namely to support operations that are general-purpose and useful for a wide range of cryptographic constructions. For example, the instruction

\[
\text{mask.f.mul } (\text{rd1}, \text{rd2}), (\text{rs1}, \text{rs2}), (\text{rs3}, \text{rs4})
\]

executes a masked 4-way SIMD Within a Register (SWAR) multiplication in $\mathbb{F}_{2^8}$, interpreting the operand in each source register as four elements of $\mathbb{F}_{2^8}$. This instruction is basically a “packed” version of the masked $\mathbb{F}_{2^8}$ multiplication described by Rivain and
Figure 1: A block diagram illustrating submodules of the masked ALU module, and their internal organisation.

Prouff [RP10]; a more formal specification of mask.f.mul can be found in Appendix C. Also the mask.f.sqr and mask.f.aff instruction execute a masked squaring and masked affine transformation in a 4-way SWAR-parallel fashion, which means they operate on four bytes in parallel, whereby each byte is interpreted as an element of $F_{2^8}$. In essence, mask.f.mul and mask.f.aff can be seen as 32-bit versions of the x86 instructions GF2P8MULB [X8618, Pages 3-447–3-448] and GF2P8AFFINEQB [X8618, Pages 3-445–3-446]. Both mask.f.mul and mask.f.sqr use the irreducible polynomial of the AES, namely $p(x) = x^8 + x^4 + x^3 + x + 1$. Nonetheless, these instructions can still be used for e.g., SM4 and other cryptosystems that use a different irreducible polynomial, since the corresponding field-representations are isomorphic (see Appendix D for details).

4 A hardware perspective: ISE realisation

In this Section, we consider the ISE from a hardware perspective, i.e., how the ISE is realised. Section 4.1 outlines the implementation of a masked ALU module, and integration of said module into an existing RISC-V compliant micro-architecture. Section 4.2 then presents synthesis results, including area, plus analysis of per-instruction (vs. whole kernel) execution properties including execution latency, memory footprint, and leakage.

4.1 Implementation

4.1.1 Implementation of a masking-specific ALU

Each operation underlying an ISE instruction is evaluated using a masked ALU module, a block diagram of which is shown in Figure 1: it accepts two 2-share inputs (rs1_s0, rs1_s1) and (rs2_s0, rs2_s1), and produces one 2-share output (rd_s0, rd_s1) (where, for both input and output, only 1 share may be used in specific cases such as masking and unmasking). Internally, the ALU can be viewed as a collection of submodules which cater for three cases: 1) support for generic functionality, 2) support for A-class and F-class
instructions, and 3) support for C-class and B-class instructions.

Case 1) relates to the RNG submodule, which generates masks, e.g., for (re)masking operations: it provides an implicit input to every other submodule, and so is not shown explicitly in the block diagram. In our implementation, the module is instantiated wlog. using a Linear Feedback Shift Register (LFSR), i.e., a Pseudo-Random Number Generator (PRNG); doing so focuses on area minimisation over randomness quality. The LFSR is updated every clock cycle, and the state cannot be read or written to from outside the ALU (e.g., by software). Case 2) is realised by fairly independent submodules: the Arith submodule supports A-class instructions whereas the Field submodule supports F-class instructions: internally, it consists of a (packed) masked field multiplier based on the technique in [RP10, Section 3.1] and optimised circuits from [fSN], and a (packed) masked field transform (or matrix-vector product) which interprets \((rs2_s0, rs2_s1)\) as a 64-bit matrix. Case 3) is realised by fairly dependent submodules which constitute the rest of the ALU. The organisation of said submodules is area-optimised, in that, if and where appropriate, common functionality is reused between operations, and functionality supporting an operation can be iterative (or multi-cycle) vs. combinatorial (or single-cycle).

The \(\text{Bool\{Re,Un\}Mask}\) submodule computes Boolean masking, remasking, and unmasking. The \(\text{Bool\{SLL,SRL,ROR\}}\) submodule computes Boolean masked left-shift, right-shift, and right-rotate. The \(\text{BoolXOR}\) and \(\text{BoolAND}\) submodules compute Boolean masked XOR and AND respectively; Boolean masked OR is computed by the \(\text{BoolIOR}\) submodule by applying De Morgan’s law to the output of the \(\text{BoolAND}\) submodule. Boolean masked addition and subtraction are computed by a collection of submodules that form an iterative Kogge-Stone [KS73] adder; the adder can be viewed as pre-processing, iteration, and post-processing steps. The pre-processing step is shared with the existing \(\text{BoolXOR}\) and \(\text{BoolAND}\) submodules. The iteration and post-processing steps are performed by the \(\text{Bool\{Add,Sub\}-Step}\) submodule; the \(\text{Bool\{Add,Sub\}-Ctrl}\) submodule is used to control intermediate values in the case of masked subtraction. The adder also supports conversion between Boolean and arithmetic masking; the \(\text{A2BPrologue, B2APrologue, and B2AEpilogue}\) submodules pre-process input to and post-process output from the adder to do so.

4.1.2 Integration of the masked ALU into a RISC-V compliant micro-architecture

The masked ALU described in Section 4.1.1 was integrated into the SCARV\(^2\) core, a micro-controller class 5-stage pipelined micro-architecture that implements RV32IMC, i.e., the RV32I base ISA plus the M(ultiply) [RV:19, Chapter 7], C(ompressed) [RV:19, Chapter 16] and (draft) B(it manipulation) [RV:19, Chapter 17]\(^3\) standard extensions. We supplement this baseline core with an instruction

\[
\text{rngsamp \, rd} \rightarrow \text{GPR[rd]} \overset{\mathcal{D}}{\longleftarrow} \{0,1\}^{XLEN},
\]

which samples \(XLEN\) bits from an underlying Random Number Generator (RNG) and stores the result into a general-purpose register. This is important, because it allows efficient generation of masks in software which uses the base ISA, i.e., it does not artificially penalise the base ISA vs. the ISE wrt. generation of randomness.

A block diagram of the core is show in Figure 2, with all components related to the masked ALU, and ISE more generally, highlighted in yellow. Note that the core is interfaced with a 64kB RAM (left) and a 1kB ROM (top); both have single-cycle access latencies.

Support for paired register file access. Per Section 3, the masked ISE design demands a general-purpose register file which can support paired read (resp. write) access: for

An Instruction Set Extension to Support Software-Based Masking

Figure 2: A block diagram illustrating integration of the masked ALU into the SCARV core: all components related to the masked ALU, and ISE more generally, are highlighted in yellow.

masked operands, an index $i$ implies a need to read from (resp. write to) both $\text{GPR}[i]$ (i.e., the 0-th share) and $\text{GPR}[i + 1]$ (i.e., the 1-st share).

Due to the indexing scheme, restructuring the register file to support such access was fairly straightforward. Specifically, the odd and even registers were split into two groups (or sub-files), each one using a dedicated 16-to-1 multiplexer tree to support read access. This approach means a) both elements of a pair can be accessed in parallel, b) there is no interaction between elements within the multiplexer tree, and c) an underlying implementation based on either flip-flops or latches or SRAM is feasible. For single-register reads, an additional 2-to-1 multiplexer in the decode stage selects between either the odd or even group based on the least-significant bit of the index. For base ISA instructions $rs1$ is stored in $s2_{opr\_a}$ and $rs2$ is stored in $s2_{opr\_b}$. For ISE instructions the 0-th (resp. 1-st) share of $rs1$ is stored in $s2_{opr\_a}$ (resp. $s2_{opr\_c}$), whereas the 0-th (resp. 1-st) share of $rs2$ is stored in $s2_{opr\_b}$ (resp. $s2_{opr\_d}$); for the SCARV core at least, this meant adding an additional pipeline register, i.e., $s2_{opr\_d}$.

Mitigating the impact of accidental share combination. We found that although implementation of a leakage-free masked ALU was straightforward in isolation, integration with the core presented some more subtle challenges from a leakage perspective. In particular, given the masked representation of some $x$, i.e., $\tilde{x} = (x_0, x_1)$, potential sources for accidentally share combination, i.e., interaction between $x_0$ and $x_1$, occur throughout the execution pipeline. One way to address this challenge (see, e.g., [KS20, Figure 1]), is by executing base ISA and ISE instructions using separate insecure and secure datapaths. We instead opted to address it in an integrated datapath, employing two mechanisms.

At an intra-component level, note that the interleaved mapping of shares to pipeline
registers is an intentional design decision: the mapping acts to prevent both shares of either \( rs1 \) or \( rs2 \) entering a functional unit other than the masked ALU, and so eliminates the potential for accidental share combination within such components. At an inter-component level, we adopt a representation where the 1-st share is bit-reversed; this is applied to both the register file and pipeline registers. As a result, any accidental share combination, e.g., switching between shares in the forwarding network, will cause non-corresponding bits of those shares to interact: the 0-th bit of the 0-th share will interact with the \((XLEN - 1)\)-th bit of the 1-st share rather than the 0-th bit of the 1-st share, for example. The representation is managed dynamically and automatically in hardware, so is transparent to software. Instructions from the base ISA unreverse their operands before being stored in the \( s2\_opr\_a \) and \( s2\_opr\_b \) pipeline registers. Instructions from the ISE unreverse their operands before entering the execute stage, immediately before the masked ALU; the 1-st share is rereversed immediately after the masked ALU, before entering the multiplexer tree that selects the next value stored in the \( s3\_opr\_b \) pipeline register. This mechanism is inexpensive to realise in hardware: it requires 1) a 1-bit flag per general-purpose and pipeline register to track reversedness, and 2) a 2-to-1 multiplexer per pipeline register to realise (un)reversal where required.

Overall, these mechanisms allow selective sharing of datapath components in a leakage-concise manner: they reduce area vs. the alternative, while also avoiding leakage plus added latency that may stem from forwarding between base ISA and ISE instructions.

Verifying functional correctness. Functional correctness of some Design Under Test (DUT), in our case the baseline core plus ISE, is a property which captures whether execution of instructions complies with their specification. There are essentially two options for checking this property: 1) given known input, check whether a specific output is produced, and 2) given known input, check whether a relationship holds between said input and the output produced.

The first option offers the strongest assurance that instruction execution matches the associated specification, and enables co-simulation of the DUT with a golden reference model (e.g., QEMU or OVPSim) or formal specification. However, the nature of masking and thus ISE support for it demands that input and output shares are randomised: a relationship between the input and output shares is specified, but their concrete values are not. As a result, use of co-simulation based verification requires auxiliary input (or “hints”) to ensure the model and DUT are synchronised: this becomes complex to maintain, even for simple DUTs. We therefore employed the second option, specifically Bounded Model Checking (BMC), to prove relationships between input and output shares are always correct without a need to know their value. For example, for the instruction

\[
\text{mask.b.xor}(rd1,rd2), (rs1,rs2), (rs3,rs4)
\]

we prove the relationship \( r = x \oplus y \) holds for \( x = \text{BoolUnmask}((\text{GPR}[rs1], \text{GPR}[rs2])) \), \( y = \text{BoolUnmask}((\text{GPR}[rs3], \text{GPR}[rs4])) \), and \( r = \text{BoolUnmask}((\text{GPR}[rd1], \text{GPR}[rd2])) \). The baseline core was already formally verified using the riscv-formal\(^4\) framework, which we extended to support paired register file access. This enabled verification of the base ISA instructions, ISE instructions, and interaction between them (having included checks for access consistency).

4.2 Evaluation

Experimental platform. The augmented SCARV core was implemented on a SASEBO-GIII [HKSS12] side-channel analysis platform, which houses two FPGAs: a Xilinx Kintex-7 (model xc7k160tfbg676) target FPGA, and a Xilinx Spartan-6 (model xc6slx45) support

\(^4\)https://github.com/SymbioticEDA/riscv-formal
Table 2: Synthesis results for the ISE, as integrated into the SCARV core: the rows capture 1) the baseline core with no ISE, 2) the baseline core plus ISE with C, B, and A classes, and 3) the baseline core plus ISE with C, B, A, and F classes. Note that timing slack is quoted for a frequency of $f = 50$ MHz.

<table>
<thead>
<tr>
<th>Implementation</th>
<th>LUTs (1.00×)</th>
<th>FFs (1.00×)</th>
<th>Timing slack</th>
<th>NAND2 cells</th>
</tr>
</thead>
<tbody>
<tr>
<td>SCARV core</td>
<td>4136</td>
<td>2107</td>
<td>4.32 ns</td>
<td>-</td>
</tr>
<tr>
<td>SCARV core + ISE[CBA]</td>
<td>6337 (1.53×)</td>
<td>2358 (1.12×)</td>
<td>3.52 ns</td>
<td>48652 (1.23×)</td>
</tr>
<tr>
<td>SCARV core + ISE[CBAF]</td>
<td>7207 (1.74×)</td>
<td>2356 (1.12×)</td>
<td>3.51 ns</td>
<td>55542 (1.41×)</td>
</tr>
</tbody>
</table>

Table 3: Comparison of a) cycle count (i.e., execution latency), b) instruction count (i.e., number of instructions executed), and c) instruction footprint (measured in bytes), for both unmasked (using the base ISA) and masked (using the base ISA and ISE) variants of individual, underlying operations. Blank entries should be read as not applicable.

<table>
<thead>
<tr>
<th>Operation</th>
<th>Cycle count</th>
<th>Instruction count</th>
<th>Instruction footprint</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>ISA</td>
<td>ISA + ISE</td>
<td>ISA + ISA + ISE</td>
</tr>
<tr>
<td>Bool2Arith</td>
<td>249</td>
<td>205</td>
<td>380</td>
</tr>
<tr>
<td>Arith2Bool</td>
<td>268</td>
<td>220</td>
<td>440</td>
</tr>
<tr>
<td>BoolUnmask</td>
<td>1</td>
<td>1</td>
<td>12</td>
</tr>
<tr>
<td>BoolRemask</td>
<td>3</td>
<td>3</td>
<td>12</td>
</tr>
<tr>
<td>BoolAND</td>
<td>1</td>
<td>1</td>
<td>12</td>
</tr>
<tr>
<td>BoolIOR</td>
<td>1</td>
<td>1</td>
<td>42</td>
</tr>
<tr>
<td>BoolXOR</td>
<td>1</td>
<td>1</td>
<td>42</td>
</tr>
<tr>
<td>BoolSLL</td>
<td>1</td>
<td>1</td>
<td>42</td>
</tr>
<tr>
<td>BoolROR</td>
<td>1</td>
<td>1</td>
<td>42</td>
</tr>
<tr>
<td>BoolADD</td>
<td>1</td>
<td>1</td>
<td>43</td>
</tr>
<tr>
<td>BoolSUB</td>
<td>1</td>
<td>1</td>
<td>43</td>
</tr>
<tr>
<td>ArithMask</td>
<td>3</td>
<td>3</td>
<td>12</td>
</tr>
<tr>
<td>ArithUnmask</td>
<td>1</td>
<td>1</td>
<td>12</td>
</tr>
<tr>
<td>ArithRemask</td>
<td>3</td>
<td>3</td>
<td>12</td>
</tr>
<tr>
<td>ArithAdd</td>
<td>1</td>
<td>1</td>
<td>48</td>
</tr>
<tr>
<td>ArithSUB</td>
<td>1</td>
<td>1</td>
<td>48</td>
</tr>
<tr>
<td>FieldSqr</td>
<td>284</td>
<td>229</td>
<td>52</td>
</tr>
<tr>
<td>FieldMUL</td>
<td>261</td>
<td>393</td>
<td>54</td>
</tr>
<tr>
<td>FieldAff</td>
<td>28</td>
<td>704</td>
<td>50</td>
</tr>
</tbody>
</table>

FPGA. We use the former exclusively, synthesising stand-alone designs for it using Xilinx Vivado 2019.2; default synthesis settings are used, with no effort invested in synthesis or post-implementation optimisation. The FPGA uses a 200 MHz differential external clock source, which is transformed into a 25 MHz internal clock signal for use by the core itself.

A standard pipeline of components is attached to the SASEBO-GIII, allowing acquisition of power consumption traces. These components include a MiniCircuits BLK+89 D/C blocker, a MiniCircuits SLP-30+ 32 MHz low-pass filter, an Agilent 8447D amplifier (with a 100 kHz to 1.3 GHz range, and 25 dB gain), and a PicoScope 5000 series oscilloscope; the latter is configured to use a 250 MHz sample frequency, and 12-bit sampling resolution. Coordination of the acquisition process is managed by a workstation connected to each component: it is tasked with 1) configuration of the FPGA with a synthesised bit-stream, 2) upload of software, as generated by a RISC-V capable instance of the GNU tool-chain including GCC 8.2.0, to the core via a simple boot-loader, 3) communication of input and output to and from the core via a UART-based connection, and 4) configuration and download of traces from the oscilloscope.

Initial experiments with the baseline core highlighted some noise inherent in the acquisition pipeline. This noise removed post-acquisition by using a software-based filter. Specifically, a Butterworth [But30] low-pass filter with 5 taps and a 8 MHz cut-off frequency was used; the 8 MHz cut-off was selected so as to maximise detectable leakage during execution of base ISA instructions.

---

5See, e.g., https://github.com/riscv/riscv-gnu-toolchain
Synthesis results. Table 2 summarises hardware implementation cost, with cases for 1) the baseline core with no ISE, 2) the baseline core plus ISE with C, B, and A classes, and 3) the baseline core plus ISE with C, B, A, and F classes. We consider two implementation targets: the first is FPGA-based, specifically the target outlined in Section 4.2, but we also consider a second which is ASIC-based. We model the ASIC target using the Yosys [Wol] Open SYnthesis Suite, reporting area in terms of NAND2 gate equivalent cells per an bundled abstract library: this renders reproducability and comparison easier, because we therefore do not depend on a particular, e.g., proprietary, concrete library.

Without support for F-class instructions, the area overhead of the ISE is fairly modest for the ASIC target (i.e., 1.23× cells) but significantly higher for the FPGA target (e.g., 1.53× LUTs). With support for F-class instructions, this overhead doubles. Although motivated by a sensible use-case and hence trade-off, support for the (packed) finite field operations is clearly costly wrt. area; where support for F-class instructions is required, however, this cost can be reduced by adopting multi- vs. single-cycle implementations of said operations (i.e., via a different time-area trade-off).

Note the area overhead differs significantly between ASIC target and FPGA target. We believe this is due to the granularity of FPGA cells (i.e., m-input LUTs) vs. 2-input CMOS cells, i.e., that not all LUTs are fully utilised in the FPGA case. Also note that the masked ALU does not appear on the critical path, but including it still results in a reduction in timing slack. We believe this is due to signals being routed via longer paths, rather than the ALU implying any additional depth.

Execution results. Primarily, Table 3 captures 1) the execution latency and 2) the memory footprint, of instruction sequences which realise the underlying operations in Table 1 or unmasked analogies thereof. For example, the row for BoolAND includes implementations of that operation using the base ISA (the column ˜ISA) and the ISE (the column ˜ISE); an unmasked Boolean AND operation using the base ISA (the column ISA) is a suitable analogy for this operation, so this is also included.

There are two clear conclusions; one could argue that both are obvious, up to a point, because both stem from a “shift” of functionality from software into hardware. First, for the majority of operations the ˜ISE case is close to the theISA case, e.g., use of masking with support from the ISE does not significantly increase execution latency or memory footprint vs. an unmasked case. Several exceptions exist, of course: for example the multi-cycle Kogge-Stone adder means BoolAdd and BoolSub require 6 cycles vs. 1. Second, for the majority of operations the ISE case is significantly lower than the ISA case, e.g., use of masking with support from the ISE is significantly lower wrt. execution latency and memory footprint vs. a masked case without such support. For certain operations, e.g., BoolAdd and BoolSub, the former is two orders of magnitude lower than the latter.

Leakage results. Using our FPGA-based target, we were able to perform a (preliminary) leakage evaluation. For each underlying operation, we constructed 1) a masked leakage micro-benchmark using the ISE, and 2) an unmasked leakage micro-benchmark using the base ISA, where, in both cases, the implementation was wrapped in an isolating prelude (resp. epilogue) formed from NOPs plus instructions required to write input (resp. read output). Using randomised input, we then generated 100,000 power consumption traces using each micro-benchmark; these traces were used for T-test and Hamming weight based correlation analysis on the unmasked inputs and outputs. Selected results are captured in Table 3. In essence, these result helped us gain confidence that the masked, ISE cases do not leak in isolation, vs. the unmasked, ISA cases which leak strongly.

For our purposes this was enough to begin evaluating larger specific kernels, which is the focus later in Section 5. However it is important to note that a rigorous, general verification exercise (aligned with standard practice for functional correctness vs. leakage) would need
to consider interaction between adjacent and non-adjacent (for pipeline forwarding) ISE and base ISA instructions. The number of such instructions and their resulting interactions is large; this suggests the verification effort may be intractable, and, either way, much higher than for a dedicated cryptography module (e.g., a masked AES accelerator). Note that the problem is not necessarily solved by adopting an implementation with separate insecure and secure datapaths, because the interaction between ISE and base ISA instructions forms the bulk of the verification problem space. The problem could, nonetheless, be interpreted as an implicit argument for such separate datapaths, or against general-purpose support for masking altogether.

5 A software perspective: ISE utilisation

In this Section we consider the ISE from a software perspective, i.e., how the ISE is utilised. Section 5.1 outlines various implementation challenges wrt. use of the ISE, which collectively inform and explain our implementation strategy. Section 5.2 then harnesses that strategy to evaluate the ISE using a range of (symmetric) cryptographic kernels, all implemented using assembly language: due to the importance of AES we focus on it in specifically, but also demonstrate the generality of the ISE using other kernels selected to span different operations, structures, etc.

5.1 Implementation

Inlining and unrolling. In general terms, the use of function inlining (and loop unrolling) involves a trade-off between execution latency (e.g., overhead wrt. the calling convention) and instruction footprint. This issue is particularly acute in the specific case of software-based masking, because a masking scheme is often (necessarily) presented in terms of underlying operations (or “gadgets”) which are naturally realised as functions: aggressive inlining can cause the instruction footprint to exceed the available memory, for example, whereas the opposite imposes increases execution latency.

Table 3 already illustrates that use of the ISE is an effective solution to this problem, in that the footprint of each underlying operation is limited to 1 instruction (having been “inlined into hardware”). Where the ISE is not used, however, we adopted a consistent strategy by 1) favoring execution latency over instruction footprint by allowing use of entire available memory, but 2) focussing any inlining (or unrolling) on the functions with small footprint and/or execution latency (e.g., the implementation of BoolAND but not BoolAdd).

Register pressure. Use of general-purpose registers to store shares inevitably increases register pressure, whether or not the ISE is used. However, the fact that a stricter even-odd indexing requirement is enforced when using the ISE renders the register allocation problem more complex. In theory, this could be viewed as a burden on the register allocator used by a compiler. However, we note that leakage free software must usually be written in assembly language anyway: as such, we did not encounter the problem in practice. Use of RISC-V as the base ISA reduced the problem further in fact, vs. ARM for example, due to the larger general-purpose register: in the majority of kernels, we were able to minimise or even avoid spilling entirely.

Register access scheduling. The transfer of shares between memory and general-purpose registers requires careful scheduling of memory access, e.g., lw and sw, instructions. For example, it was necessary to re-order instruction sequences such that given the masked representation of some \( x \), i.e., \( \tilde{x} = (x_0, x_1) \), a load of \( x_0 \) was not directly followed by a load of \( x_1 \): doing so prevents accidental share combination, and thus Hamming distance leakage,
in the memory interface. Where this was not possible, or not fully effective, we attempted to use fence-like instructions (see, e.g., [SSB+19]) to load or store “dummy” random value to flush residual state; we note that a dedicated mechanism such as FENL [GMPP20] could serve the same purpose with potentially less overhead.

**Implicit register access.** Before implementing the reversed representation detailed in Section 4.1, we found storing loop-counters or pointers in even-indexed registers could remove some instances of leakage. We believe this is because all ISE instruction encodings include even register indexes, so only explicitly accessing even-numbered registers using the base ISA can remove the possibility of glitching between odd and even register. We also found that enabling the use of compressed [RV:19, Chapter 16] instruction encoding could introduce some instances of leakage. We believe this is due to a race condition when decoding mixed sequences of 32- and 16-bit instructions: the register index produced by the decoder can glitch, as the correct location in the instruction buffer is disambiguated. We resolved the problem by disabling compressed instruction encoding, and aligning all functions to a 4-byte boundary.

We note that both observations are related to the neighbour leakage effect detailed by Papagiannopoulos and Veshchikov [PV17], and suggest implicit register access as a more general, descriptive term for this class of micro-architectural leakage.

**Speculative execution.** Despite being micro-controller class, the SCARV core still implements some degree of speculative execution. It has no branch-predictor and resolves re. control-flow late in the pipeline: instructions immediately following a taken branch will be partially executed, therefore, before being squashed. In some cases (e.g., a `mask.b.unmask` instruction immediately following a loop), this resulted in accidental share combination; we resolved the problem by either reordering or padding instruction sequences to avoid speculatively executing the leakage source.

## 5.2 Evaluation

### 5.2.1 AES cases

**ISA.** By design, AES (Rijndael) is suited to be implemented on various hardware/software platforms. As the minimal operation unit is one byte, in an 8-bit processor, one can opt for a byte-oriented implementation [DR02]. If the target platform has a 32-bit processor, Bertoni et al.’s proposal helps to “pack” the 16 bytes of the state into four 32-bit words [BBF+02]. Instead of the standard column-based indexing, they proposed [BBF+02, Section 3.1] to store each word as one row in the state matrix. The benefit of such transposition is that both ShiftRows and MixColumns can be executed in parallel, making the most of the 32-bit datapath. The only drawback of their approach is from the KeySchedule: the KeySchedule is designed to work efficiently with the column-based indexing, so it is not straightforward to see how it works with the row-based indexing. In their original proposal, Bertoni et al. found a workaround that can directly work on the transposed key state matrix: the four S-boxes still have to execute sequentially, but the following linear transformation can be as parallel as possible (although less trivial to interpret). For unmasked implementations, we found that this approach is slightly better than executing the KeySchedule in a column-based matrix then converted to a row-based matrix each round. Further optimisation is possible if the target device provided a larger memory. Specifically, the T-table based approach [DR02] performs the whole AES encryption round as a few table look-ups and XORs, which leads to much more efficient implementations (Table 4a). AES encryption can be even more efficient with dedicated ISe’s support. The x86, ARM, MIPS, POWER

---

6One could describe this as an “assume always not-taken” branch prediction strategy.
Table 4: Comparison of a) cycle count (i.e., execution latency), b) instruction count (i.e., number of instructions executed), c) instruction footprint (measured in bytes), and d) data footprint (measured in bytes), for both unmasked (using the base ISA) and masked (using the base ISA and ISE) implementations of various cryptographic kernels. Instruction and data footprint entries with only one value represent the total size for Encrypt, Decrypt, and KeySchedule.

<table>
<thead>
<tr>
<th>Implementation</th>
<th>Kernel</th>
<th>Instruction count</th>
<th>Cycle count</th>
<th>Instruction footprint</th>
<th>Data footprint</th>
</tr>
</thead>
<tbody>
<tr>
<td>ISA: Bertoni et al. [BBF02]</td>
<td>KeySchedule</td>
<td>608</td>
<td>842</td>
<td>2148</td>
<td>524</td>
</tr>
<tr>
<td></td>
<td>Encrypt</td>
<td>1932</td>
<td>2427</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Decrypt</td>
<td>2265</td>
<td>2761</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ISA: T-tables [DR02, Section 4.2]</td>
<td>KeySchedule</td>
<td>430</td>
<td>515</td>
<td>1936</td>
<td>5120</td>
</tr>
<tr>
<td></td>
<td>Encrypt</td>
<td>938</td>
<td>1016</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Decrypt</td>
<td>938</td>
<td>1037</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ISA: V3 ISE [MNSW]</td>
<td>KeySchedule</td>
<td>219</td>
<td>312</td>
<td>730</td>
<td>10</td>
</tr>
<tr>
<td></td>
<td>Encrypt</td>
<td>238</td>
<td>291</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Decrypt</td>
<td>239</td>
<td>280</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ISA: Rivain-Prouff [RP10]</td>
<td>KeySchedule</td>
<td>18119</td>
<td>22609</td>
<td>14416</td>
<td>1356</td>
</tr>
<tr>
<td></td>
<td>Encrypt</td>
<td>59823</td>
<td>64200</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Decrypt</td>
<td>61459</td>
<td>65811</td>
<td></td>
<td></td>
</tr>
<tr>
<td>~ISE</td>
<td>KeySchedule</td>
<td>1389</td>
<td>1528</td>
<td>756</td>
<td>84</td>
</tr>
<tr>
<td></td>
<td>Encrypt</td>
<td>1012</td>
<td>1113</td>
<td>968</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Decrypt</td>
<td>1229</td>
<td>1690</td>
<td>1109</td>
<td></td>
</tr>
</tbody>
</table>

(a) AES cases.

(b) Other, non-AES cases.

and SPARC architectures all have dedicated instructions for accelerating AES, most of which re-use SIMD or Vector register files to accommodate the entire 128-block size. The standardisation process for RISC-V AES-specific acceleration instructions is ongoing at the time of writing, with the current proposals outlined in [MNSW].

ISA. Meanwhile, not all implementations above are easy to mask: for instance, each T-table usually takes 1 kB memory. Masking it trivially takes $4 \cdot 2^{18} = 1$ MB of memory. Depending on the specific platform, such memory cost might be infeasible. On the other hand, the T-tables are specific to AES: considering the potential usage in other cryptographic applications, masking the finite field operations seems to be a better option. For easier comparison with the ISE version, we implement the ISA-based masked Sbox in a byte-wise manner, rather than a masked T-table. Considering the register pressure (Section 5.1), the word-oriented approach [BBF02] is preferred, as we can store the entire AES state in registers. The AES function takes 15 additional stack load/store operations; afterwards, only the S-box takes 2 inner-loop stack load/store operations. Avoiding further stack access not only improves the efficiency of our implementation but also alleviates the pain of “leakage debugging”, which often turns out to be quite demanding, even for professional side-channel researchers.

We have faithfully implemented the Rivain-Prouff scheme [RP10] using the base ISA. Since the base ISA does not provide any instruction for finite field multiplication or inversion, we compute the field inversion as the 254-th power, which takes 4 masked
multiplications, 3 squares, 1 power-by-16, and 2 mask refreshing gadgets. Each (byte-wise) multiplication can be efficiently implemented as 2 table look-ups on the logarithm table and one table look-up on the extended exponential table ([GR17, Section 3.2]). As Rivain and Prouff stated, since we are operating on a field of characteristic 2, both square and power-by-16 are linear operations that can be carried out on each share separately [RP10]. In our ISA-based implementation, we simply use 2 additional 256B tables for square and power-by-16. The affine transformation in the S-box, on the other hand, can be implemented as \( x \oplus (x \ll 1) \oplus (x \ll 2) \oplus (x \ll 3) \oplus (x \ll 4) \oplus 63_{(16)} \). Although this is suboptimal since there is not an instruction designed for “rotation shift within one byte”, we can implement this specific transformation using \( \sim 30 \) ALU instructions without any additional memory cost. In summary, the ISA-based implementation uses five 256 B look-up tables, while the randomness cost is the same as the original Rivain-Prouff scheme [RP10] (6 bytes per S-box). As a comparison, Goudarzi and Rivain implemented the same scheme on the ARM architecture: for a 2-share masking scheme, their AES encryption takes around 62 kilo-cycles (vs. 64 kilo-cycles in Table 4). Considering their implementation relies heavily on the “free shift” (i.e., the “flexible second operand”) in ARM instructions [GR17], we believe our ISA-based implementation is efficient enough as a comparison reference.

ISE. With our extended instructions (i.e., \texttt{mask.f.mul} and \texttt{mask.f.sqr}), one can easily replicate the Rivain-Prouff inversion. Since the \texttt{mask.f.sqr} already combines a mask refreshing procedure, there is no need to do additional refreshing. As a consequence, our implementation does use more randomness than the original Rivain-Prouff scheme: however, considering the randomness is generated each cycle on the hardware anyway, one can argue such cost does not count as “extra”. Besides, for engineers who do not necessarily have a deep understanding in masking proofs, our \texttt{mask.f.sqr} reduces the risk caused by ignoring the independent assumption, which might be worthwhile when implementing anything other than Rivain-Prouff AES. Alternatively, one can also implement all linear transformations with \texttt{mask.f.aff}. Although this is a general approach, using \texttt{mask.f.aff} takes extra memory accesses to load the affine matrix, which becomes suboptimal compared with \texttt{mask.f.sqr} in the AES context. For usages other than AES, \texttt{mask.f.aff} might be the only option. Note that in this case, it is essential to perform mask refreshing (with \texttt{mask.b.remask}) as the security proof required, since \texttt{mask.f.aff} does not perform refreshing by itself.

As expected, the extended instructions bring significant performance boost: as we can see in Table 4, the ISE version is even faster than the unprotected ISA version. This is because all table accesses in the ISA version have to be executed in sequence, whereas with the extension, those operations can be executed in parallel. Decryption takes slightly longer, as we compute the inverse \texttt{MixColumns} based on the existing \texttt{MixColumns} [DR02]. Again, we are aware of various existing implementation trade-off options and the fact that our choice is unlikely to be the best in every aspect. Nonetheless, for this paper, our purpose is simply demonstrating the potential of such instruction extensions, instead of setting up a speed record for masked implementations on RISC-V.

5.2.2 Other, non-AES cases

To demonstrate the generality and effectiveness of the ISE beyond AES, we implemented the Speck [BSS+13] and Sparx [DPU+16] block ciphers, plus the ChaCha20 [Ber08] stream cipher. These were chosen because their ARX nature made them good candidates to evaluate the complete set of proposed masking ISE instructions. For Speck and Sparx, the \texttt{KeySchedule}, \texttt{Encrypt}, and \texttt{Decrypt} functions were all implemented in un-masked (ISA), base ISA masked (ISA) and ISE masked (ISE) representations. For ChaCha20, we focus on only the round function.
As can be seen in Table 4b, the base ISA masked implementations suffer enormous increased latency. We found that masked add/subtract operations cause the most dominant part of the performance overhead. Moreover, the masked add/subtract operations require a number of intermediate variables that amplifies the register pressure discussed in Section 5.1. With Boolean masked instruction extension including \texttt{mask.b.add} and \texttt{mask.b.sub}, the ISE masked implementations, as expected, gain more than one order of magnitude overhead reduction in terms of instruction count and cycle count in comparison to the base ISA alternatives.

6 Conclusion

Summary. In this paper, we presented the design, implementation, and evaluation of an ISE to support software-based masking; we pitch it as an option positioned, and so, to some extent, a compromise between software-only and hardware-only alternatives. Accepting the inherent overhead in hardware, our evaluation suggests the ISE can support efficient, secure first-order masked implementations of various kernels. Our ISE-supported first-order masked implementation of AES, for example, is an order of magnitude more efficient than a software-only alternative wrt. both execution latency and memory footprint.

That said, however, implementing the ISE highlighted various challenges which demand care. For example, vs. whole-core masking [GJM+16, MGH19] or segregated secure and unsecure datapaths [KS20], it is challenging to ensure non-interaction between shares: even if the masked ALU is modular and so physically separate, the unmasked datapath may require some changes to ensure this property. It is challenging to both a) identify when and where change is required, and b) implement such change, because the ISE becomes more invasive as a result. Likewise, utilising the ISE is not as “drop in” a process as one may expect intuitively. For example, one must contend with increased register pressure while simultaneously considering the security properties of loading/storing shares from/to memory; doing so is not trivial.

Future work. Beyond any more incremental improvements, such as increasing the number and diversity of kernels we evaluate, various higher-level directions represent either important and/or interesting future work:

- Our ISE directly supports masked operations for \( n = 2 \) shares. The question remains, however, how such support generalises: for example, 1) how to bootstrap higher-order masking schemes using an ISE, such as ours, that includes first-order oriented building blocks, and/or 2) how such an ISE might be altered to enhance generality wrt. said order.

- It is plausible to consider extending the ISE to assist with various challenges wrt. utilisation. For example, one can imagine enforcing secure register access scheduling via ISE-supported, and so “share aware” memory access instructions.

- Belaïd et al. [BDM+20] present Tornado, a tool capable of deriving secure masked implementations from a high-level description. [BDM+20, Section 3.1] states the model of computation assumed, which includes a number of masked “gadgets” whose form implies that Tornado could be retargeted to an ISE-enabled platform.

- Instructions such as \texttt{mask.b.add} and \texttt{mask.a.add} perform arithmetic with an assumed word size of XLEN. However, consider a case where XLEN = 32 but we implement a kernel which demands masked arithmetic modulo \( 2^w \) for some \( w \neq 32 \), e.g., \( w = 16 \) or \( w = 64 \). Doing so may be problematic, because the masked operation lacks any direct support for, e.g., management of carries.
• Ge et al. [GYH18] pitch their aISA as “a new hardware-software contract”, and, as a result, weaken the abstraction afforded by an ISA of some associated micro-architecture. To meet the required security properties, our ISE demands care wrt. implementation (in hardware) and utilisation (in software). Per the aISA, one could therefore question how much the ISE can (or if it even should) dictate details of an associated micro-architectural implementation. Exploration of this question, e.g., suitable requirements and how to specify them, seems an interesting challenge.

Acknowledgements

This work has been supported in part by EPSRC via grant EP/R012288/1, under the RISE (http://www.ukrise.org) programme.

References


A Leakage evaluation plots

Figure 3: Comparison of leakage for selected instructions as executed on the SCARV core, for masked, ISA (left) and unmasked, ISE (right) cases: each case relates to TVLA-based leakage detection, plotting the (absolute) t-statistic stemming from 100,000 power consumption traces.
Figure 4: Comparison of leakage for masked and unmasked implementations of (the Encrypt kernel of) AES as executed on the SCARV core: each case relates to TVLA-based leakage detection, plotting the (absolute) t-statistic stemming from 100,000 power consumption traces.
### Instruction encodings

#### Table 5: Instruction encodings (C-class: conversion).

| 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3  | 2  | 1  | 0  |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0000000 | 0000 | 0 | rsm1 | 0 | 000 | rdm | 0 | 10110 | 11 | mask.b2a |
| 0000000 | 0000 | 0 | rsm1 | 1 | 000 | rdm | 0 | 10110 | 11 | mask.a2b |

#### Table 6: Instruction encodings (B-class: operations under Boolean masking).

| 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3  | 2  | 1  | 0  |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0000000 | 0001 | 0 | rs1 | 000 | rdm | 0 | 10110 | 11 | mask.b.mask |
| 0000000 | 0001 | 1 | rsm1 | 0 | 000 | rd | 0 | 10110 | 11 | mask.b.unmask |
| 0000000 | 0001 | 1 | rsm1 | 1 | 000 | rdm | 0 | 10110 | 11 | mask.b.remask |
| 1000000 | rsm2 | 0 | rsm1 | 0 | 010 | rdm | 0 | 10110 | 11 | mask.b.not |
| 1000000 | rsm2 | 0 | rsm1 | 0 | 111 | rdm | 0 | 10110 | 11 | mask.b.and |
| 1000000 | rsm2 | 0 | rsm1 | 1 | 110 | rdm | 0 | 10110 | 11 | mask.b.or |
| 1000000 | rsm2 | 0 | rsm1 | 0 | 100 | rdm | 0 | 10110 | 11 | mask.b.xor |
| 1000000 | rsm2 | 0 | rsm1 | 0 | 000 | rdm | 0 | 10110 | 11 | mask.b.add |
| 1000000 | rsm2 | 0 | rsm1 | 0 | 001 | rdm | 0 | 10110 | 11 | mask.b.sub |
| 1100000 | shamt | rsm1 | 0 | 000 | rdm | 0 | 10110 | 11 | mask.b.slli |
| 1100000 | shamt | rsm1 | 0 | 001 | rdm | 0 | 10110 | 11 | mask.b.srli |
| 1100000 | shamt | rsm1 | 0 | 010 | rdm | 0 | 10110 | 11 | mask.b.roli |

#### Table 7: Instruction encodings (A-class: operations under arithmetic masking).

| 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3  | 2  | 1  | 0  |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0000000 | 0010 | 0 | rs1 | 000 | rdm | 0 | 10110 | 11 | mask.a.mask |
| 0000000 | 0010 | 1 | rsm1 | 0 | 000 | rd | 0 | 10110 | 11 | mask.a.unmask |
| 0000000 | 0010 | 1 | rsm1 | 1 | 000 | rdm | 0 | 10110 | 11 | mask.a.remask |
| 0100000 | rsm2 | 0 | rsm1 | 0 | 000 | rdm | 0 | 10110 | 11 | mask.a.add |
| 0100000 | rsm2 | 0 | rsm1 | 0 | 001 | rdm | 0 | 10110 | 11 | mask.a.sub |

#### Table 8: Instruction encodings (F-class: operations for field arithmetic).

| 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3  | 2  | 1  | 0  |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 1000000 | 0001 | 0 | rs1 | 0 | 010 | rdm | 0 | 10110 | 11 | mask.f.sqr |
| 1100000 | rsm2 | 0 | rsm1 | 0 | 000 | rdm | 0 | 10110 | 11 | mask.f.mul |
| 1100000 | rsm2 | 0 | rsm1 | 0 | 010 | rdm | 0 | 10110 | 11 | mask.f.aff |
C Additional algorithms

Data: The masked value \( \bar{x} = (x_0, x_1) \).

Result: The masked value \( \bar{r} = (r_0, r_1) \) such that \( r = r_0 - r_1 \pmod{2^w} = x \).

function \textsc{Bool2Arith}(x_0, x_1) begin
    \( t \leftarrow \{0, 1\}^w \)
    \( (s_0, s_1) \leftarrow \textsc{BoolAdd}(x_0, x_1, (t, 0)) \)
    \( r_1 \leftarrow t \)
    \( r_0 \leftarrow s_0 \oplus s_1 \)
    return \((r_0, r_1)\)
end

Algorithm 1: \textsc{Bool2Arith}: convert from Boolean to arithmetic masking.

Data: The masked value \( \bar{x} = (x_0, x_1) \).

Result: The masked value \( \hat{r} = (r_0, r_1) \) such that \( r = r_0 \oplus r_1 = x \).

function \textsc{Arith2Bool}(x_0, x_1) begin
    \( t \leftarrow \{0, 1\}^w \)
    \( x_1 \leftarrow x_1 \oplus t \)
    \( (r_0, r_1) \leftarrow \textsc{BoolSub}(x_0, 0, (t, x_1)) \)
    return \((r_0, r_1)\)
end

Algorithm 2: \textsc{Arith2Bool}: convert from arithmetic to Boolean masking.

Data: The value \( x \).

Result: The masked value \( \hat{r} = (r_0, r_1) \) such that \( r = r_0 \oplus r_1 = x \).

function \textsc{BoolMask}(x_0, x_1) begin
    \( t \leftarrow \{0, 1\}^w \)
    \( r_1 \leftarrow t \)
    \( r_0 \leftarrow x \oplus t \)
    return \((r_0, r_1)\)
end

Algorithm 3: \textsc{BoolMask}: apply mask operation (under Boolean masking).
Data: The masked value $\hat{x} = (x_0, x_1)$.
Result: The value $r = r_0 \oplus r_1 = x$.

function \textsc{BoolUnmask}((x_0, x_1)) begin
  $r \leftarrow x_0 \oplus x_1$
  return $r$
end

Algorithm 4: \textsc{BoolUnmask}: apply unmask operation (under Boolean masking).

Data: The masked value $\hat{x} = (x_0, x_1)$.
Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 \oplus r_1 = x$.

function \textsc{BoolRemask}((x_0, x_1)) begin
  $t \leftarrow \{0, 1\}^w$
  $r_1 \leftarrow x_1 \oplus t$
  $r_0 \leftarrow x_0 \oplus t$
  return $(r_0, r_1)$
end

Algorithm 5: \textsc{BoolRemask}: apply remask operation (under Boolean masking).

Data: The masked values $\hat{x} = (x_0, x_1)$ and $\hat{y} = (y_0, y_1)$.
Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 \oplus r_1 = \overline{x} \land y$.

function \textsc{BoolAND}((x_0, x_1), (y_0, y_1)) begin
  $t \leftarrow \{0, 1\}^w$
  $r_1 \leftarrow t \oplus (x_1 \land \overline{y_1}) \oplus (x_1 \lor \overline{y_0})$
  $r_0 \leftarrow t \oplus (x_0 \land y_1) \oplus (x_0 \lor \overline{y_0})$
  return $(r_0, r_1)$
end

Algorithm 7: \textsc{BoolAND}: apply AND operation (under Boolean masking).

Data: The masked values $\hat{x} = (x_0, x_1)$ and $\hat{y} = (y_0, y_1)$.
Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 \oplus r_1 = x \lor y$.

function \textsc{BoolIOR}((x_0, x_1), (y_0, y_1)) begin
  $(s_0, s_1) \leftarrow \textsc{BoolAND}((x_0, \overline{x_1}), (y_0, \overline{y_1}))$
  $r_1 \leftarrow \overline{s_1}$
  $r_0 \leftarrow s_0$
  return $(r_0, r_1)$
end

Algorithm 8: \textsc{BoolIOR}: apply OR operation (under Boolean masking).
Data: The masked values \( \hat{x} = (x_0, x_1) \) and \( \hat{y} = (y_0, y_1) \).
Result: The masked value \( \hat{r} = (r_0, r_1) \) such that \( r = r_0 \oplus r_1 = x \oplus y \).

function \texttt{BOOLXOR}\((x_0, x_1), (y_0, y_1)\) begin
\[
t \leftarrow \{0, 1\}^w \\
r_0 \leftarrow t \oplus x_0 \oplus y_0 \\
r_1 \leftarrow t \oplus x_1 \oplus y_1 \\
\text{return}(r_0, r_1)
\]
end

Algorithm 9: \texttt{BOOLXOR}: apply XOR operation (under Boolean masking).

Data: The masked values \( \hat{x} = (x_0, x_1) \) and \( \hat{y} = (y_0, y_1) \)
Result: The masked value \( \hat{r} = (r_0, r_1) \) such that \( r = r_0 \oplus r_1 = x + y \).

function \texttt{BOOLADD}\((x_0, x_1), (y_0, y_1)\) begin
\[
(a_0, a_1) \leftarrow \texttt{BOOLXOR}\((x_0, x_1), (y_0, y_1)\) \\
(p_0, p_1) \leftarrow (a_0, a_1) \\
(g_0, g_1) \leftarrow \texttt{BOOLAND}\((x_0, x_1), (y_0, y_1)\) \\
\text{for } i = 1 \text{ upto } \log_2 w \text{ do} \\
(\hat{h}_0, \hat{h}_1) \leftarrow \texttt{BOOLSLL}(g_0, g_1, 2^{i-1}) \\
(u_0, u_1) \leftarrow \texttt{BOOLSLL}(p_0, p_1, 2^{i-1}) \\
(\hat{h}_0, \hat{h}_1) \leftarrow \texttt{BOOLAND}(p_0, p_1, (\hat{h}_0, \hat{h}_1)) \\
(g_0, g_1) \leftarrow \texttt{BOOLXOR}(g_0, g_1, (\hat{h}_0, \hat{h}_1)) \\
(p_0, p_1) \leftarrow \texttt{BOOLAND}(p_0, p_1, (u_0, u_1)) \\
\text{end} \\
(\hat{h}_0, \hat{h}_1) \leftarrow \texttt{BOOLSLL}(g_0, g_1, 1) \\
(r_0, r_1) \leftarrow \texttt{BOOLXOR}(a_0, a_1, (\hat{h}_0, \hat{h}_1)) \\
\text{return}(r_0, r_1)
\]
end

Algorithm 10: \texttt{BOOLADD}: apply addition operation (under Boolean masking).

Data: The masked values \( \hat{x} = (x_0, x_1) \) and \( \hat{y} = (y_0, y_1) \)
Result: The masked value \( \hat{r} = (r_0, r_1) \) such that \( r = r_0 \oplus r_1 = x - y \)

function \texttt{BOOLSUB}\((x_0, x_1), (y_0, y_1)\) begin
\[
(a_0, a_1) \leftarrow \texttt{BOOLXOR}\((x_0, x_1), (y_0, y_1)\) \\
(p_0, p_1) \leftarrow (a_0, a_1) \\
(g_0, g_1) \leftarrow \texttt{BOOLAND}\((x_0, x_1), (y_0, y_1)\) \\
(u_0, u_1) \leftarrow \texttt{BOOLAND}(p_0, p_1, (0, 1)) \\
(g_0, g_1) \leftarrow \texttt{BOOLXOR}(g_0, g_1, (u_0, u_1)) \\
\text{for } i = 1 \text{ upto } \log_2 w \text{ do} \\
(\hat{h}_0, \hat{h}_1) \leftarrow \texttt{BOOLSLL}(g_0, g_1, 2^{i-1}) \\
(u_0, u_1) \leftarrow \texttt{BOOLSLL}(p_0, p_1, 2^{i-1}) \\
(\hat{h}_0, \hat{h}_1) \leftarrow \texttt{BOOLAND}(p_0, p_1, (\hat{h}_0, \hat{h}_1)) \\
(g_0, g_1) \leftarrow \texttt{BOOLXOR}(g_0, g_1, (\hat{h}_0, \hat{h}_1)) \\
(p_0, p_1) \leftarrow \texttt{BOOLAND}(p_0, p_1, (u_0, u_1)) \\
\text{end} \\
(\hat{h}_0, \hat{h}_1) \leftarrow (g_0 \ll 1 \ | \ 0, g_1 \ll 1 \ | \ 1) \\
(r_0, r_1) \leftarrow \texttt{BOOLXOR}(a_0, a_1, (\hat{h}_0, \hat{h}_1)) \\
\text{return}(r_0, r_1)
\]
end

Algorithm 11: \texttt{BOOLSUB}: apply subtraction operation (under Boolean masking).
Data: The masked value $\hat{x} = (x_0, x_1)$, and an integer $0 \leq i < w$.
Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 \oplus r_1 = x \ll i$.

function $\text{BOOLSLL}((x_0, x_1), i)$ begin
  \[ t \leftarrow \{0, 1\}^i \]
  \[ r_1 \leftarrow (x_1 \ll i) \parallel t \]
  \[ r_0 \leftarrow (x_0 \ll i) \parallel t \]
  \[ \text{return}(r_0, r_1) \]
end

Algorithm 12: $\text{BOOLSLL}$: apply left-shift operation (under Boolean masking).

Data: The masked value $\hat{x} = (x_0, x_1)$, and an integer $0 \leq i < w$.
Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 \oplus r_1 = x \gg i$.

function $\text{BOOLSRL}((x_0, x_1), i)$ begin
  \[ t \leftarrow \{0, 1\}^i \]
  \[ r_1 \leftarrow t \parallel (x_1 \gg i) \]
  \[ r_0 \leftarrow t \parallel (x_0 \gg i) \]
  \[ \text{return}(r_0, r_1) \]
end

Algorithm 13: $\text{BOOLSRL}$: apply right-shift operation (under Boolean masking).

Data: The masked value $\hat{x} = (x_0, x_1)$, and an integer $0 \leq i < w$.
Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 \oplus r_1 = x \ggg i$.

function $\text{BOOLROR}((x_0, x_1), i)$ begin
  \[ r_1 \leftarrow x_1 \ggg i \]
  \[ r_0 \leftarrow x_0 \ggg i \]
  \[ \text{return}(r_0, r_1) \]
end

Algorithm 14: $\text{BOOLROR}$: apply right-rotate operation (under Boolean masking).

Data: The value $x$.
Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 - r_1 \pmod{2^w} = x$.

function $\text{ARTHIMASK}((x_0, x_1))$ begin
  \[ t \leftarrow \{0, 1\}^w \]
  \[ r_1 \leftarrow t \]
  \[ r_0 \leftarrow r_0 + t \pmod{2^w} \]
  \[ \text{return}(r_0, r_1) \]
end

Algorithm 15: $\text{ARTHIMASK}$: apply mask operation (under arithmetic masking).

Data: The masked value $\hat{x} = (x_0, x_1)$.
Result: The value $r = r_0 - r_1 \pmod{2^w} = x$.

function $\text{ARITHUNMASK}((x_0, x_1))$ begin
  \[ r \leftarrow x_0 - x_1 \pmod{2^w} \]
  \[ \text{return} \]
end

Algorithm 16: $\text{ARITHUNMASK}$: apply unmask operation (under arithmetic masking).
Data: The masked value \( \bar{x} = (x_0, x_1) \).

Result: The masked value \( \bar{r} = (r_0, r_1) \) such that \( r = r_0 - r_1 \pmod{2^w} = x \).

function \textsc{ArithRemask}((x_0, x_1)) \begin{align*} t & \leftarrow \{0, 1\}^w \\ r_1 & \leftarrow x_1 + t \pmod{2^w} \\ r_0 & \leftarrow x_0 + t \pmod{2^w} \\ \text{return}(r_0, r_1) \end{align*}
end

Algorithm 17: \textsc{ArithRemask}: apply remask operation (under arithmetic masking).

Data: The masked values \( \bar{x} = (x_0, x_1) \) and \( \bar{y} = (y_0, y_1) \).

Result: The masked value \( \bar{r} = (r_0, r_1) \) such that \( r = r_0 - r_1 \pmod{2^w} = x + y \).

function \textsc{ArithAdd}((x_0, x_1), (y_0, y_1)) \begin{align*} r_1 & \leftarrow x_1 + y_1 \pmod{2^w} \\ r_0 & \leftarrow x_0 + y_0 \pmod{2^w} \\ \text{return}(r_0, r_1) \end{align*}
end

Algorithm 18: \textsc{ArithAdd}: apply addition operation (under arithmetic masking).

Data: The masked values \( \bar{x} = (x_0, x_1) \) and \( \bar{y} = (y_0, y_1) \).

Result: The masked value \( \bar{r} = (r_0, r_1) \) such that \( r = r_0 - r_1 \pmod{2^w} = x - y \).

function \textsc{ArithSub}((x_0, x_1), (y_0, y_1)) \begin{align*} r_1 & \leftarrow x_1 - y_1 \pmod{2^w} \\ r_0 & \leftarrow x_0 - y_0 \pmod{2^w} \\ \text{return}(r_0, r_1) \end{align*}
end

Algorithm 19: \textsc{ArithSub}: apply subtraction operation (under arithmetic masking).

Data: The masked value \( \hat{x} = (x_0, x_1) \).

Result: The masked value \( \hat{r} = (r_0, r_1) \) such that \( r = r_0 \oplus r_1 = x^2 \).

function \textsc{FieldSqr} ((x_0, x_1)) \begin{align*} r_0 & \leftarrow 0 \\ r_1 & \leftarrow 0 \\ \text{for} \ i \leftarrow 0 \ \text{to} \ \frac{w}{8} - 1 \ \text{do} \\ \xi_{i0} & \leftarrow (x_0 \gg 8 \cdot i) \wedge FF_{16} \\ \xi_{i1} & \leftarrow (x_1 \gg 8 \cdot i) \wedge FF_{16} \\ t_i & \leftarrow \{0, 1\}^8 \\ r_{i1} & \leftarrow t \oplus (\xi_{i1} \times \xi_{i1}) \\ r_{i0} & \leftarrow t \oplus (\xi_{i0} \times \xi_{i0}) \\ r_0 & \leftarrow r_0 \oplus (r_{i0} \ll 8 \cdot i) \\ r_1 & \leftarrow r_1 \oplus (r_{i1} \ll 8 \cdot i) \end{align*}
end

Algorithm 20: \textsc{FieldSqr}: apply packed \( F_{2^8} \) squaring operation (under Boolean masking).
Data: The masked values $\hat{x} = (x_0, x_1)$ and $\hat{y} = (y_0, y_1)$.

Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 \oplus r_1 = x \otimes y$.

function FieldMul $((x_0, x_1), (y_0, y_1))$ begin
\[ r_0 \leftarrow 0 \]
\[ r_1 \leftarrow 0 \]
for $i \leftarrow 0$ to \( \frac{w}{8} - 1 \) do
\[ x_{i0} \leftarrow (x_0 \gg 8 \cdot i) \land FF_{16} \]
\[ x_{i1} \leftarrow (x_1 \gg 8 \cdot i) \land FF_{16} \]
\[ y_{i0} \leftarrow (y_0 \gg 8 \cdot i) \land FF_{16} \]
\[ y_{i1} \leftarrow (y_1 \gg 8 \cdot i) \land FF_{16} \]
\[ t_i \leftarrow \{0, 1\}^8 \]
\[ r_{i1} \leftarrow t_i \oplus (x_{i1} \otimes y_{i1}) \oplus (x_{i0} \otimes y_{i0}) \]
\[ r_{i0} \leftarrow t_i \oplus (x_{i0} \otimes y_{i0}) \]
\[ r_0 \leftarrow r_0 \oplus (r_{i0} \ll 8 \cdot i) \]
\[ r_1 \leftarrow r_1 \oplus (r_{i1} \ll 8 \cdot i) \]
end
return $(r_0, r_1)$
end

Algorithm 21: FieldMul : apply packed $F_{2^8}$ multiplication operation (under Boolean masking).

Data: The masked values $\hat{x} = (x_0, x_1)$ and a pair $M = (M_0, M_1)$ which combine to specify the $(8 \times 8)$-element transformation matrix.

Result: The masked value $\hat{r} = (r_0, r_1)$ such that $r = r_0 \oplus r_1 = Mx$.

function FieldAff $((x_0, x_1), (M_0, M_1))$ begin
\[ M \leftarrow M_1 || M_0 \]
\[ r_0 \leftarrow \text{FieldAff'}(x_0, M) \]
\[ r_1 \leftarrow \text{FieldAff'}(x_1, M) \]
return $(r_0, r_1)$
end

Algorithm 22: FieldAff : apply packed $F_{2^8}$ affine transform operation (under Boolean masking).

Data: A 32-bit value $x$ and a 64-bit matrix $M$.

Result: A 32-bit value $r$.

function FieldAff'$(x, M)$ begin
\[ r \leftarrow 0 \]
for $i = 0$ upto 7 do
\[ c \leftarrow (M \gg 8 \cdot i) \land FF_{16} \]
\[ c \leftarrow c \cdot ((x \gg i) \land 01010101_{16}) \]
\[ r \leftarrow r \oplus c \]
end
return $r$
end

Algorithm 23: FieldAff': apply packed $F_{2^8}$ affine transform step.
D Using F-class Instructions for SM4

SM4 is a block cipher designed by Shu-Wang Lu that is mandated in the Chinese national standard for wireless LAN WAPI (Wired Authentication and Privacy Infrastructure) [LJH+07]. It has also been standardized by the International Organization for Standardization (ISO) in 2017. SM4 is a 32-round unbalanced Feistel network with a block size and key size of 128 bits. The structure of SM4 shows certain similarities with the AES; most notably, it uses an 8-bit Sbox that can be algebraically expressed through an affine transformation, an inversion in $\mathbb{F}_{2^8}$, followed by another affine transformation. However, the irreducible polynomial of SM4 is $p(x) = x^8 + x^7 + x^6 + x^5 + x^3 + x^2 + 1$, which differs from the irreducible polynomial of the AES.

The two F-class instructions `mask.f.mul` and `mask.f.sqr` described in Section 3 perform a 4-way multiplication and squaring of elements of $\mathbb{F}_{2^8}$ based on the irreducible polynomial of the AES, which is $p(x) = x^8 + x^4 + x^3 + x + 1$. Nonetheless, it is possible to use `mask.f.mul` and `mask.f.sqr` for a masked implementation of SM4 (and any other cipher with an alternative irreducible polynomial) since polynomial basis representations of $\mathbb{F}_{2^8}$ with different irreducible polynomials are isomorphic to each other. Consequently, there exists a one-to-one mapping of field elements between these field-representations, and converting an element from one representation of $\mathbb{F}_{2^8}$ to another one requires just the computation of a vector-matrix product $vM$, i.e. an $(8 \times 8)$-bit change-of-basis matrix $M$ is left-multiplied by an 8-bit vector $v$ that represents an element of $\mathbb{F}_{2^8}$. Section A.7 of [IEE00] describes in detail how this change-of-basis matrix $M$ can be computed given two irreducible polynomials $p_1(x)$ and $p_2(x)$ as input. In short, the first step is to find a root $r(x)$ of $p_1(x)$ with respect to $p_2(x)$; i.e. this root $r(x)$ is an element of $\mathbb{F}_{2^8}$ with the property $p_1(r) \equiv 0 \mod p_2$. Then, the rows of the change-of-basis matrix $M$ are simply the coefficients of the polynomials $r(x)^i \mod p_2(x)$ for $0 \leq i \leq 7$. For example, the change-of-basis matrix for converting elements of the SM4-field to the corresponding elements of the AES-field looks as follows.

\[
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
\end{pmatrix}
\]

The change-of-basis matrix for conversion in the opposite direction (i.e. from AES to SM4 representation) is the inverse of the above matrix. It is not necessary to perform these conversions before and after each execution of `mask.f.mul` and `mask.f.sqr`; instead, it is more efficient to convert the plaintext/ciphertext block at the beginning of an encryption/decryption, then perform all arithmetic operations in $\mathbb{F}_{2^8}$ with the AES representation (using `mask.f.mul` and `mask.f.sqr`), and finally convert the resulting ciphertext/plaintext back to the SM4 representation.