Abstract. We propose a new approach for building efficient, provably secure, and practically hardened implementations of masked algorithms. Our approach is based on a Domain Specific Language in which users can write efficient assembly implementations and fine-grained leakage models. The latter are then used as a basis for formal verification, allowing for the first time formal guarantees for a broad range of device-specific leakage effects not addressed by prior work. The practical benefits of our approach are demonstrated through a case study of the PRESENT S-Box: we develop a highly optimized and provably secure masked implementation, and show through practical evaluation based on TVLA that our implementation is practically resilient. Our approach significantly narrows the gap between formal verification of masking and practical security.

Keywords: Side-channel resilience · Higher-order masking · Probing security · Verification · Domain-Specific-Languages

1 Introduction

Physical measurements reveal information beyond the inputs and outputs of programs as execution on physical devices emits information on intermediate computations steps. This information, encoded in the noise, time, power or electromagnetic radiations, is known as side-channel leakage and can be used to mount effective side-channel attacks.

The masking countermeasure splits secret data \( a \) into \( d \) shares \((a_0, \ldots, a_{d-1})\) such that it is easy to compute \( a \) from all shares but impossible from less than \( d \) shares [CJRR99, ISW03]. This requires attacks to recover \( d \) shares instead of a single secret value. An active line of research considers the construction of masked algorithms, denoted “gadgets”, which compute some functionality on masked inputs while enforcing that secrets cannot be recovered from less than \( d \) intermediate values. Construction of gadgets is particularly difficult when considering side-channel leakage which allows to observe more than just the intermediate computation steps [GMPO19]. Extended leakage models have been devised to consider additional side-channel information in systematic manner [FGP+18, PR13, DDF14, BGI+18].

Naturally, the question arises whether the masking countermeasure has been applied correctly to a gadget and whether it actually improves security. There exist two main, and fairly distinct, approaches to evaluate the effectiveness of the applied countermeasures: (I) Physical validation performing specific attacks or statistical tests on physical...
measurements [DSM17, DSV14, SM15, PV17, MOW17] and (II) Provable resilience based on attacker and leakage models [CJRR99, ISW03, FGP+18, PR13, DDF19] and automated verification [BBD+15, BBD+16, Cor18, EWS14]. We review the strengths and weaknesses of both approaches.

The main benefit of reproducing attacks is the close correspondence to security; a successful attack implies a real threat, an unsuccessful attack rules out a vulnerability from exactly this attack under the specific evaluation parameters. The drawback is the inherently limited attacker scope to only those attacks which have been performed and the fact that exhaustive evaluation of all attacks remains untractable in most cases. Statistical evaluation allows to bound the retrievable side-channel information, the success rate of retrieval, or to detect side-channel information leakage without considering actual attacks [SM15, DSM17, DSV14]. Nonetheless, the evaluation remains specific to the input data and measurement environment used during assessment. In both cases it is difficult to decide at which point to stop the evaluation and to declare an implementation to be secure. In addition, these methods have large computational requirements which imply an increased wait time for the evaluation results. This prevents fast iterative development cycles with repeated proposal of implementations and evaluation thereof. Vice versa; the implementer has to carefully produce good implementations to avoid too frequent evaluation, limiting creative freedom.

Provable resilience provides a rigorous approach for proving the resilience of masked algorithms. The main benefit of this approach is that guarantees hold in all environments which comply with the assumptions of the proof and that assessment ends when such a proof is found. Inherent to all formal security notions for side-channel is (I) a formal leakage model which defines the side-channel characteristics considered in the proof and (II) an attacker model. The leakage model defines which side-channel information leakages (observations) are accessible to the attacker during execution of a masked program whereas the formal attacker model defines the capabilities of the attacker exploiting this information, e.g. how many side-channel measurements an attacker can perform.

Threshold probing security is arguably the most established approach for provable resilience. In this approach, execution leaks the value of intermediate computations, and the attacker can observe at most $t$ side-channel leakages during an execution of a program masked with $d > t$ shares. The notion of threshold probing security proves perfect resilience against adversaries observing at most $t$ leakages but cannot provide assurance for attackers which potentially observe more. Programs enjoy security against practical attackers w.r.t. the chosen notion if the side-channel model accurately captures the device’s leakage characteristics. The main benefit of probing security is that it can be used to rule out classes of attacks entirely, in difference to physical evaluation such as Test Vector Leakage Assessment (TVLA) [SM16]. Variations of threshold probing security such as the $t$–Non-interference ($t$–NI) and $t$–Strong-Non-interference ($t$–SNI) refinements exist which are easier to evaluate (check) or guarantee additional properties [BBD+16].

A further benefit of provable resilience, and in particular of threshold probing security, is that it is amenable to automated verification. The main benefit of automated verification is that it delegates the formal analysis to a computer program and manages the combinatorial explosion that arises when analyzing complex gadgets at high orders.

The main critique of formal security notions for side-channel security is related to the large gap between formal model and behavior in practice, resulting in security assurance that are sometimes hard to interpret as recently shown by Gao et al. [GMPO19]. In particular, implementations of verified threshold probing secure algorithms frequently enjoy much less practical side-channel resilience as precisely analyzed by Balasch et al. [BGG+14] and [GMPO19]. The advantage of physical evaluation is preeminent in that the increasing diversity of discovered side-channel leakage effects is not entirely considered by existing verification frameworks. One of the reasons being that the considered leakage effects are
inherently integrated into the tool and therefore prevent flexible and fine-grained modeling. In the current setting, to consider new leakage with distinct behavior it is required to modify the tool’s implementation. But the diversity of power side-channel leakage encountered in practice is expected to grow as long as new execution platforms are developed [PV17, BGG14, CGD18, MOW17, SSB19, Ves14].

1.1 Our Work

In this paper, we illustrate that automated verification can deliver provably resilient and practically hardened masked implementations with low overhead.

**Fine-Grained Modeling of Leakage** We define a Domain Specific Language (DSL), denoted IL, for modeling assembly implementations and specifying fine-grained leakage models. The dual nature of IL has significant benefits. First, it empowers implementers to capture real leakage behavior in the form of device-specific leakage models, which ultimately ensure that the purported formal resilience guarantees are in close correspondence with practical behavior. Second, it supports efficient assembly level implementations of masked algorithms, and bypasses thorny issues with secure compilation. Third, it forms the basis of a generic automated verification framework in which assembly implementations can be analyzed generically, without the need to commit to a fixed or pre-existing leakage model. Specifically, we present a tool that takes as input an implementation and checks whether the implementation is secure w.r.t., the security notion associated with the leakage models given with the implementation. This stands in sharp contrast with prior work on automated verification, which commits to one or a fixed set of leakage models.

**Optimized Hardening of Masking** The combination of fine-grained leakage models and reliable verification enables the construction of masked implementations which exhibit no detectable leakage in physical assessment, known as “hardened masking” or “hardening” of masked implementations. We demonstrate several improvements in constructing hardened gadgets and a hardened PRESENT S-Box at 1st and 2nd order which exhibit no detectable leakage beyond one million measurements in TVLA. We provide generic optimization strategies which reduce the overhead from hardening by executing the code of a secure composition of gadgets in an altered order instead of introducing overhead by inserting additional instructions as countermeasure. The resulting overhead reduction of almost 73\% for the first order implementation and of 63\% for the second order shows a need to consider composition strategies in addition to established secure composition results. Our contributions outperform the “lazy strategy” [BGG14] of doubling the number of shares in masking instead of performing hardening; the security order can be increased without detrimental impact on performance as our optimized 2nd order hardened PRESENT S-Box is as fast as a non-optimized 1st order hardened PRESENT S-Box, effectively increasing the security order “for free”.

1.2 Related Work

For the sake of clarity, we organize related work by areas:

**Provable Resilience** Provable resilience of masked implementations was initiated by Chari et al. [CJRR99], and later continued by Ishai, Sahai and Wagner (ISW) [ISW03] and many others. As of today, provable resilience remains a thriving area of research, partially summarized in [KR19], with multiple very active sub-areas. One such relevant area is the study of leakage models, involving the definition and comparison of new models, including the noisy leakage model, the random probing model, the threshold probing model with...
glitches [PR13, DDF14, BGI\textsuperscript{+}18]. Leakage effects were for the first time summarized in a general model by the Robust Probing model [FGP\textsuperscript{+}18]. Later, De Meyer et al. in [DBR19], introduce their concept of glitch immunity and unify security concepts such as (Strong) Non-Interference in an information theoretic manner. In comparison to these works, our DSL offers a much higher flexibility in terms of leakages, since it allows to take into account a broader class of leakages, and consequently more realistic scenarios. Another relevant area tackles the problem of composing secure gadgets; a prominent new development is the introduction of strong non-interference, which achieves desirable composition properties that cannot be obtained under the standard notion of threshold probing security [BBD\textsuperscript{+}16]. Belaid, Goudarzi et Rivain present an elegant alternative approach to solve the problem of composition; however their approach is based on the assumption that only ISW gadgets are used [BGR18]. The formal analysis of composability in extended leakage models started to receive more attention with the analysis of Faust et al. in [FGP\textsuperscript{+}18], which formalized the physical leakages of glitches, transitions and couplings with the concept of extended-probes and proved the ISW multiplication scheme to be probing secure against glitches in two cycles. Later, Cassiers et al. in [CGLS20] proposed the concept of Hardware Private Circuits, which formalizes compositional probing security against glitches, and presented gadgets securely composable at arbitrary orders against glitches. Our work augments the $t$–NI and $t$–SNI notions to capture resilience and composition in any fine-grained model which can be expressed using our DSL and in the presence of stateful execution, as required for provably secure compilers such as MaskComp and Tornado [BBD\textsuperscript{+}16, BDM\textsuperscript{+}20]. The research area of optimization of hardened masking did not receive much attention in the literature, for the best of our knowledge.

**Automated Verification** Proving resilience of masked implementations at high orders incurs a significant combinatorial cost, making the task error-prone, even for relatively simple gadgets. Moss et al. [MOPT12] were the first to show how this issue can be managed using program analysis. Although their work is focused on first-order implementations, it has triggered a spate of works, many of which accomodate high orders [BRNI13, EWS14, BBD\textsuperscript{+}15, Cor18, ZGSW18, BGI\textsuperscript{+}18]. MaskVerif [BBD\textsuperscript{+}15, BBC\textsuperscript{+}19], which we use in our work, is arguably one of the most advanced tools, and is able to verify different notions of security, including $t$–NI and $t$–SNI at higher orders, for different models, including ISW, ISW with transitions, and ISW with glitches. Furthermore, the latest version of MaskVerif captures multiple side-channel effects for hardware platforms, which are configurable by the user. However, the input language of MaskVerif lacks the expressiveness of IL, making it difficult to capture the rich class of potential leakage in software implementations.

**Modeling Side-Channel Behavior** Side-channel behavior is also expressed for analysis purposes other than provable resilience. Papagiannopoulos and Veshchikov construct models of platform specific side-channel effects they discover in practice [PV17]. Their tool ASCOLD prevents combinations of shares in the considered leakage effects, which are hard-coded into the tool. Most importantly, they show that implementations enjoy improved practical security when no shares are combined in their leakage model, which is reminiscent of first order probing security in extended leakage models. Our contributions allow users to provide fine-grained leakage specifications in IL to verify widely established formal security notions at higher orders.

ELMO [MOW17], MAPS [CGD18] and SILK [Ves14] intend to simulate physical measurements based on detailed models. The tools assume fixed leakage effects but allow customization by the user in form of valuation functions. This degree of detail is relevant for simulating good physical measurements but not necessary for our information theoretic notions of security. The authors of MAPS distinguish effects which are beyond what is captured in ELMO’s fixed set of combinations and show the need to remain unbiased.
towards leakage specifications when developing tools for side-channel resilience evaluation. Most notably, ELMO is able to accurately simulate measurements from models inferred in an almost automated manner and is now being used in works attempting to automate the construction of hardened implementations [SSB+19].

2 Expressing Side-Channel Leakage

Verification of side-channel resilience requires suitable representation of the implementation under assessment. This representation must express a program’s functional semantic and information observable per side-channel. It is well known that the leakage behavior of execution platforms differs and this diversity must be expressible to gain meaningful security assurance from verification.

2.1 A Domain Specific Language with Explicit Leakage

Already at CHES 2013 Bayrak et al. [BRNI13] point out the difficulty of expressing arbitrary side-channel leakage behavior yet providing a “good interface” to users willing to specify device-specific side-channel characteristics. The reason can be related to the fundamental approach of implicitly augmenting the underlying language’s operators with side-channel. In such setting, the addition of two variables \( c \leftarrow a + b \); implicitly models information observable by an adversary, but what is leaked (e.g. \( a, b, \) or \( a + b \)) must be encoded in the language semantics (i.e., the meaning of \( \leftarrow \) and \( + \)) and thus prevents flexible adoption of leakage characteristics.

The concept of “explicit leakage” is an alternative as it requires to explicitly state what side-channel information is emitted. We present a Domain Specific Language (DSL) adopting the concept; the language’s constructs do not capture side-channel behavior (i.e., their execution provides no observable side-channel information), except for a dedicated statement “\( \text{leak} \)“ which can be understood as providing specific information to an adversary. The given example can now be stated as \( c \leftarrow a + b; \text{leak}\{a + b\} \);, which has two important benefits: First, verification and representation of programs can be decoupled to become two independent tasks. Second, specification of side-channel behavior becomes more flexible in that a diverse set of complex side-channels can be expressed and altered without effort.

Our DSL, named “IL” for “intermediate language” has specific features to support representation of low-level software. A Backus Normal Form representation is given in Figure 1. Its building blocks are states \( \chi \), expressions \( e \), commands \( c \) of multiple statements \( i \) and global declarations \( g \) of variables and macros with local variables \( x_1, \ldots, x_k \).

\[
\begin{align*}
\chi & ::= x \mid x[e] \mid (e) \\
e & ::= \chi \mid n \in \mathbb{Z} \mid t \mid o(e_1, \ldots, e_j) \\
i & ::= \chi \leftarrow e \mid \text{leak}\{e_1, \ldots, e_j\} \mid m(e_1, \ldots, e_j) \\
& \quad \mid \text{label } l \mid \text{goto } e \\
c & ::= i* \\
g & ::= \text{var } x \mid \text{macro } m(x_1, \ldots, x_j) \ x_1, \ldots, x_k \{e\}
\end{align*}
\]

Figure 1: Simplified syntax of the intermediate language where \( n \) ranges on integers, \( x \) on variables, \( m \) on macro identifiers, \( o \) on operations and \( l \) on label identifiers.

A state element \( \chi \) is either a variable \( x \), an array \( x \) with an indexing expression \( e \), or a location in memory \( (e) \). Memory is distinguished to allow specifications of disjoint memory regions which eases formal verification. Expressions are built from state elements
χ, constant integers n, unique labels l, and operators o applied to expressions. Infix abbreviations for logical “and” ⊗, “exclusive-or” ⊕, addition + and right shift ≫ are used in the following. Allowed statements i are assignments χ ← e, explicit leaks leak {e₁, ..., e_j} of one or more expressions and call to a previously defined macro m(e₁, ..., e_j) where m is the name of the macro. Statements for if conditionals and while loops are supported as well. Labels l are needed to represent the execution of microcontrollers (MCUs) which is based on the address of an instruction. They are defined by a dedicated statement, enabling execution to proceed at the instruction subsequent to this label. Static jumps to unique labels and indirect jumps based on expressions of labels are supported to represent control-flow.

In a nutshell, executable implementations consist of an unstructured list of hardware instructions where each instruction is located at a specific address and execution steps over addresses. We represent implementations as a list of label definitions and macro calls: every instruction is represented by an IL label corresponding to the address of this instruction and a macro call representing the hardware instruction and its operands. A line of Assembly code “0x16E: ADDS R0 R1” becomes almost identical IL code: label 0x16E; ADDS(R0, R1); where adds is a call to the model of the “ADDS” instruction.

The DSL allows to express fine-grained leakage models specifying the semantic and side-channel behavior of assembly instructions. In this light, verifying side-channel resilience of implementations involves three steps: (I) modeling behavior of instructions, (II) representing an implementation using such a model and (III) analyzing or verifying the representation (Section 3).

We stress the significant benefit: verification and representation become separate concerns, i.e., automated verification is now defined over the semantic of our DSL and the separate leakage model of step (I) can be freely modified or exchanged without altering the work-flow in stages (II) and (III). In particular, our tool, named “scVerif” allows the user to provide such a leakage specification in conjunction with an implementation for verification of side-channel resilience.

### 2.2 Modeling Instruction Semantics

The DSL allows to construct models which are specific to the device executing an implementation by attaching device specific side-channel behavior. This is especially important for the Arm and RISC-V Instruction Set Architectures (ISAs) since these are implemented in various MCUs which execute instructions differently, potentially giving rise to distinct side-channel information. The instruction semantic must be modeled since some leakage effects depend not only on intermediate state but also on the order of execution (e.g, control flow). In the following, we show construction of models for Arm Cortex M0+ (CM0+) instructions which are augmented with leakage in Section 2.3. The DSL enables construction of leakage models for other architectures or programming languages as well.

IL enables to express architecture flags, carry bits, unsigned/signed operations, cast between data types, bit operations, control flow, etc. in close correspondence to ISA specifications. The instructions of the CM0+ ISA operate on a set of globally accessible registers and flags, denoted architecture state. They can be modeled as global variables in IL: var R0; var R1; ... var PC; var apsrc; (carry flag) var apsrb; (overflow flag) var apsz; (zero flag) var apsrn; (negative flag).

Addition is used in the ADDS instruction and instructions operating on pointers such as LDR (load) and STR (store). Expressing the semantic of addition with carry requires casting 32 bit values to unsigned, respective signed values and comparing the results of addition to assign the carry and overflow flags correctly. The IL model of ADDS is expressed in Algorithm 1, closely following the Arm ISA specification [ARM18] with six
Algorithm 1 Low-level model of addition with carry and instruction for addition.

1:  macro ADDWithCarry (x, y, carry, result, carryOut, overflow)
2:  var unsignedSum, var signedSum {
3:    signedSum ← (uint) x + (uint) y + (uint) carry;
4:    unsignedSum ← (int) x + (int) y + (int) carry;
5:    result ← (w32) unsignedSum;
6:    carryOut ← ¬((uint) result = unsignedSum);
7:    overflow ← ¬((int) result = signedSum);
8:  }
9:  macro ADDS (rd, rn) { ▷ model of rd ← rd + rn
10:    ADDWithCarry(rd, rn, 0, rd, apsrc, apsrv);
11:    apsrz ← rd = 0;
12:    apsrn ← (rd ≃ 31) = 1;
13:    if rd ≃n pc then
14:      goto rd;
15:  end if
16: }

parameters for inputs, output, carry and overflow flags\textsuperscript{1}. unsignedSum and signedSum are local values. The ADDS instruction is modeled by calling the macro and expressing the side-effect on global flags. A special case of addition to pc requires to issue a branch to the resulting address (represented as label). The operator \(\triangleq_n\) is used to compare whether the parameter \(rd\) is equal to the register with name \(pc\) and conditionally issue a branch.

Sampling randomness, e.g. in the form of queries to random number generators, can be expressed by reading from a tape of pre-sampled randomness in global state.

\textbf{2.3 Modeling Leakage}

We augment the instruction models with a representation of power side-channels specific to threshold probing security. For this security notion it is sufficient to model the dependencies of leakages, which is much simpler and more portable than modeling the constituting function defining the actual value observable by an adversary. Specifying multiple expressions within a single \(\text{leak}\{e_{1}, e_{2}, \ldots\}\) statement allows the threshold probing attacker to observe multiple values (expressions) at the cost of a single probe. On hardware this is known from the “glitch” leakage effect which allows to observe multiple values at once [FGP\textsuperscript{+}18]. The \(\text{leak}\) statement allows generic specification of such \textit{multi-variate leakage} both for side-channel leakage effects but also as worst-case specifications of observations. In particular, a program which is resilient w.r.t. \(\text{leak}\{e_{1}, e_{2}\}\) is necessarily resilient w.r.t. any function \(f(a, b)\) in \(\text{leak}\{f(e_{1}, e_{2})\}\) but not vice versa.

The ADDS instruction is augmented with leakage, which is representative for \texttt{ANDS} (logical conjunction) and \texttt{EORS} (exclusive disjunction) as they behave similar in our model. Observable leakage arises from computing the sum and can be modeled by the statement \(\text{leak}\{rd + rn\}\). Transition leakage as in the robust probing model of [FGP\textsuperscript{+}18] is modeled in a worst case manner: instead of the Hamming-Distance there are two values leaked at the cost of a single probe: \(\text{leak}\{rd, rd + rn\}\), covering any exotic combination as e.g. observed in [GMPO19, MOW17]. The order of execution matters, thus this leakage must be added at the top of the function, before assigning \(rd\textsuperscript{2}\). For better clarity we expose these two leakage effects as macros. The resulting specification of ADDS can be found in Algorithm 2.

\textsuperscript{1}Called macros are substituted in-place and modify input parameters instead of returning values.

\textsuperscript{2}The order in which \(\text{leak}\) statements are placed does not matter since leaks have no semantic side-effect.
Definition 1 (Computation Leakage Effect). The computation leakage effect produces an observation on the value resulting from the evaluation of expression $e$.

1: macro EmitComputationLeak ($e$) {
2:     leak {$e$};
3: }

Definition 2 (Transition Leakage Effect). The transition leakage effect provides an observation on state $x$ and the value $e$ which is to be assigned.

1: macro EmitTransitionLeak ($x$, $e$) {
2:     leak {$x$, $e$};
3: }

Algorithm 2 Leakage model of ADDS instruction.

1: macro LEAKYADDS ($rd$, $rn$) {
2:     EmitComputationLeak($rd + rn$);
3:     EmitTransitionLeak($rd$, $rd + rn$);
4:     EmitRevenantLeak($opA$, $rd$);
5:     EmitRevenantLeak($opB$, $rn$);
6:     ADDS($rd$, $rn$);
7: }

Power side-channels encountered in practice sometimes depend on previously executed instructions. Corre et al. describe a leakage effect, named “operand leakage”, which leaks a combination of current and previous operands of two instructions (e.g. parameters to ADDS) [CGD18]. A similar effect on memory accesses was observed by Papagianopoulos and Veshchikov, denoted as “memory remnant” in [PV17]. The explicit leak statement enables modeling of such cross-instruction leakage effects by introducing additional state elements $\chi$, denoted as “leakage state”. In general, leakage effects which depend on one value $p$ from past execution and one value $c$ from current instruction can be modeled by placing $p$ in global state $\text{opA}$ during the first instruction and emitting a leak of global state and current value in $\text{leak} \{\text{opA}, p\}$ in the latter instruction. The operand and memory remnant leakage effects always emit leakage and update leakage state jointly. We put forward a systematic under the name “revenant leakage”, leaning its name to the (unexpected) comeback of sensitive data from past execution steps and, in the figurative sense, haunting the living cryptographer during construction of secure masking. The leakage effect is modeled in Definition 3 and applied to the ADDS instruction in Algorithm 2. The definition can easily be modified such that the state change is conditional to a user-defined predicate or the leakage is extended to a history of more than one instruction.

Definition 3 (Revenant Leakage Effect). The “revenant” leakage effect releases a transition leakage prior to updating some leakage state $x \leftarrow p$.

1: macro EmitRevenantLeak ($x$, $p$) {
2:     leak {$x$, $p$};
3:     $x \leftarrow p$;
4: }

The leakage effects are applied in instruction models by calling EmitRevenantLeak with the distinct leakage state used for caching the value (e.g. $\text{opA}$) and the value leaking in combination, e.g. the first operand to an addition.

The overall leakage model for a simplified ISA is depicted in Algorithm 3, it corresponds to the model used for CM0+ Assembly. In our model the leakage state elements are

\footnote{The full model is provided in combination with our tool scVerif [too20].}
Algorithm 3 Simplified power side-channel leakage model for CM0+ instructions.

```plaintext
1: var R0; var R1; ... var R12; var PC;         ▷ Global registers
2: var opA; var opB; var opR; var opW;         ▷ Global leakage state
3: macro XOR (rd, rn) {
4:   leak {opA, rd, opB, rn};                ▷ combination of revenants
5:   EmitTransitionLeak(rd, rd ⊕ rn);
6:   EmitRevenantLeak(opA, rd);
7:   EmitRevenantLeak(opB, rn);
8:   rd ← rd ⊕ rn;
9: }
10: macro AND (rd, rn) {
11:   leak {opA, rd, opB, rn};                 ▷ combination of revenants
12:   EmitTransitionLeak(rd, rd ⊙ rn);
13:   EmitRevenantLeak(opA, rd);
14:   EmitRevenantLeak(opB, rn);
15:   rd ← rd ⊙ rn;
16: }
17: macro LOAD (rd, rn, i) {
18:   leak {opA, rn, opB, i};                  ▷ Manual multivariate leakage
19:   EmitRevenantLeak(opA, rn);
20:   EmitRevenantLeak(opB, rd);             ▷ mixed mapping
21:   EmitRevenantLeak(opB, ⟨rn, i⟩);
22:   EmitTransitionLeak(rd, ⟨rn, i⟩);
23:   rd ← ⟨rn, i⟩;
24: }
25: macro STORE (rd, rn, i) {
26:   leak {opA, rn, opB, i};                  ▷ Manual multivariate leakage
27:   EmitRevenantLeak(opA, rn);
28:   EmitRevenantLeak(opB, rd);             ▷ mixed mapping
29:   EmitRevenantLeak(opW, rd);             ▷ note: individual state
30:   ⟨rn, i⟩ ← rd;
31: }
```

denoted by \( opA, opB, opR, opW \) to model four distinct revenant effects for the 1\textsuperscript{st} and 2\textsuperscript{nd} operand of computation as well as for \texttt{LOAD} and \texttt{STORE} separately. Some effects have been refined to match the behavior encountered in practice, which diverges in the mapping of operands and an unexpected propagation of the destination register in \texttt{LOAD} instructions.

In \cite{PV17} the “neighboring” leakage is reported, but we did not observe it on CM0+ MCUs during our case-study. The effect represents a coupling between registers, probably related to the special architecture of the “ATMega163”, highlighting the need of device-specific leakage models. Neighboring leakage can be modeled by using the \( \approx_n \) operator as shown in Definition 4.

**Definition 4 (Neighboring Leakage Effect).** The neighboring leakage effect causes a leak of an unrelated register \( RN \) when register \( RM \) is accessed.

```plaintext
1: macro EmitNeighborLeak (e) {
2:   if e \( \approx_n \) RM then
3:     leak {RN, RM};
4:   end if
5: }
```

The DSL in combination with the concept of explicit leakage enables to model all
leakage effects known to us such that verification of threshold probing security becomes aware of these additional leakages. Our effect definitions can serve as building block to construct models such as our model in Algorithm 3 but can be freely modified to model behavior not yet publicly known. In particular, the expressiveness of modeling appears not to be limited except in that further computation operations $o$ might need to be added to our small DSL.

3 Stateful (S)NI and Automated Verification

In this section, we lay the foundations for proving security of IL implementations. We first define security notions for IL gadgets: following a recent trend [BBD+16], we consider two notions: non-interference (NI) and strong non-interference (SNI), which achieve different composability properties. Then, we present an effective method for verifying whether an IL gadget satisfies one of these notions.

3.1 Security Definitions

We first start with a brief explanation of the need for a new security definition. At a high level, security of stateful computations requires dealing with residual effects on state. Indeed, when a gadget is executed on the processor, it does not only return the computed output but it additionally leaves “residue” in registers, memory, or leakage state. Code subsequently executed might produce leakages combining these residues with output shares, breaking secure composability. As an example, let us consider the composition of a stateful refreshing gadget with a stateful multiplication scheme: $\text{Refr} (\text{Mult}(x, y))$. In the case of non-stateful gadgets, if $\text{Mult}$ is $t$–NI and $\text{Refr}$ is $t$–SNI, such a composition is $t$–SNI. However, if the gadgets are stateful this is not necessarily anymore the case. We give a concrete example: Consider a modified ISW multiplication such that it is $t$–SNI even with the leakages defined in the previous chapter, the output state $s_{\text{out}}$ of the multiplication, in combination with the revenant leakage effect in the load of Algorithm 3 can be used to retrieve information about the secret as follows: After the multiplication one register could contain the last output share of the multiplication gadget and the gadget is still secure. If the refreshing first loads the first output share of the multiplication in the same register, the revenant effect emits an observation containing both values (the first and last output share of the multiplication) in a single probe. Thus the last probes can be used to get the remaining output shares of the multiplication, and the composition is clearly vulnerable.

We first introduce the notion of gadget, on which our security definitions are based. Informally, gadgets are IL macros with security annotations.

Definition 5 (Gadget). A gadget is an IL macro with security annotations:

- a security environment, mapping inputs and outputs to a security level: secret ($\mathcal{H}$) or public ($\mathcal{L}$),
- a memory typing, mapping memory locations to a security level: secret ($\mathcal{H}$), public ($\mathcal{L}$), random ($\mathcal{R}$),
- share declarations, consisting of tuples of inputs and outputs. We adopt the convention that all tuples are of the same size, and disjoint, and that all inputs and outputs must belong to a share declaration.

We now state two main notions of security. The first notion is an elaboration of the usual notion of non-interference, and is stated relative to a public input state $s_{\text{in}}$ and public output state $s_{\text{out}}$. The definition is split in two parts: the first part captures
that the gadget does not leak, and the second part captures that the gadget respects the security annotations.

**Definition 6 (Stateful t–NI).** A gadget with input state \( s_{\text{in}} \) and output state \( s_{\text{out}} \) is stateful \( t \)-Non-Interfering (\( t \)-NI) if every set of \( t \) observations can be simulated by using at most \( t \) shares of each input and any number of values from the input state \( s_{\text{in}} \). Moreover, any number of observations on the output state \( s_{\text{out}} \) can be simulated without using any input share, but using any number of values from the input state \( s_{\text{in}} \).

The second notion is an elaboration of strong non-interference. Following standard practice, we distinguish between internal observations (i.e., observations that differ from outputs) and output observations.

**Definition 7 (Stateful t–SNI).** A gadget with input state \( s_{\text{in}} \) and output state \( s_{\text{out}} \) is stateful \( t \)-Strong-Non-Interfering (\( t \)-SNI), if every set of \( t_1 \) observations on the internal observations, \( t_2 \) observations on the output values such that \( t_1 + t_2 \leq t \), combined with any number of observations on the output state \( s_{\text{out}} \), can be simulated by using at most \( t_1 \) shares of each input and any number of values from the input state \( s_{\text{in}} \).

Both definitions require that gadgets have a fixed number of shares. This assumption is made here for the simplicity of presentation but is not required by our tool.

Finally, we note that there exists other notions of security. One such notion is called probing security. We do not define this notion formally here, but note that for stateful gadgets \( t \)-SNI implies probing security, provided the masked inputs are mutually independent families of shares, and the input state is probabilistic independent of masked inputs and internal randomness.

We validate our notions of security through a proof that they are composable — Section 4 introduces new and optimized composition theorems. The general composition results hold for stateful \( t \)-NI, respective stateful \( t \)-SNI, because the notions ensure similar properties as their non-stateful counterparts.

**Proposition 1.** Let \( G_1(\cdot, \cdot) \) and \( G_2(\cdot) \) be two stateful gadgets as in Figure 2. Assuming \( G_2 \) is stateful \( t \)-SNI and \( G_1 \) is stateful \( t \)-NI, then the composition \( G_2(G_1(\cdot, \cdot)) \) is stateful \( t \)-SNI.

*Proof.* Let \( s_{\text{in}}^1 \) and \( s_{\text{out}}^1 \) be respectively the state input and state output of \( G_1 \) and \( s_{\text{in}}^2 \) and \( s_{\text{out}}^2 \) respectively the state input and state output of \( G_2 \). We prove in the following that the composition \( G_2(G_1(\cdot, \cdot)) \) is stateful \( t \)-SNI.

Let \( \Omega = (I, \mathcal{O}) \) be the set of observations on the whole composition, where \( I_i \) are the observations on the internal computation of \( G_i \), \( I = I_1 \cup I_2 \) with \( |I| = |I_1 \cup I_2| \leq t_1 \) and \( |I| + |\mathcal{O}| \leq t \).

Since \( G_2 \) is stateful \( t \)-SNI and \( |I_2 \cup \mathcal{O}| \leq t \), then there exist observation sets \( S_2^1 \) and \( S_2^2 \) such that \( |S_2^1| \leq |I_2| \), \( |S_2^2| \leq |I_2| \) and all the observations on internal and output values combined with any number of observations on the output state \( s_{\text{out}}^2 \) can be simulated by using any number of values from the input state \( s_{\text{in}}^2 \) and the shares of each input with index respectively in \( S_2^1 \) and \( S_2^2 \).

Since \( G_1 \) is stateful \( t \)-NI, \( |I_1 \cup S_2^2| \leq |I_1 \cup I_2| \leq t \) and \( s_{\text{out}}^1 = s_{\text{in}}^2 \), then there exists an observation set \( S^3 \) such that \( |S^3| \leq |I_1| + |S_2^2| \) and all the observations on internal and output values combined with any number of observations on the output state \( s_{\text{out}}^{1} \) can be simulated by using any number of values from the input state \( s_{\text{in}}^{1} \) and the shares of the input with index in \( S^3 \).

Now, composing the simulators that we have for the two gadgets \( G_1 \) and \( G_2 \), all the observations on internal and output values of the circuit combined with any number of observations on the output state can be simulated from \( |S^3| \leq |I_1| + |S_2^2| \leq |I_1| + |I_2| \leq t_1 \) shares of the first input and \( |S_2^2| \leq |I_2| \) shares of the second input and any number of values from the input state \( s_{\text{in}}^{1} \). Therefore we conclude that the circuit is stateful \( t \)-SNI. \( \square \)
3.2 Automated Verification

In this section, we consider the problem of formally verifying that an IL program is secure at order $t$, for $t \geq 1$. The obvious angle for attacking this problem is to extend existing formal verification approaches to IL. However, there are two important caveats. First, some verification approaches make specific assumptions on the programs—e.g. [BGR18] assumes that gadgets are built from ISW core gadgets. Such assumptions are reasonable for more theoretical models, but are difficult to transpose to a more practical model; besides they defeat the purpose of our approach, which is to provide programmers with a flexible environment to build verified implementations. Second, adapting formal verification algorithms to IL is a very significant engineering endeavour. Therefore we follow an alternative method: we define a transformation $T$ that maps IL programs into a fragment that coincides with the core language of MaskVerif, and reuse the verification algorithm of MaskVerif for checking the transformed program. The transformation is explained below, and satisfies correctness and precision. Specifically, the transformation $T$ is correct: if $T(P)$ is secure at order $t$ then $P$ is secure at order $t$ (where security is either $t$–NI or $t$–SNI of $t$). The transformation $T$ is also precise: if $P$ is secure at order $t$ and $T(P)$ is defined then $T(P)$ is secure at order $t$. Thus, the sole concern with the approach is the partial nature of the transformation $T$. While our approach rejects legitimate programs, it works well on a broad range of examples.

**Target language and high-level algorithm** The core language of MaskVerif is a subset of IL:

$$
\begin{align*}
\chi & ::= x | x[n] \\
i & ::= s \leftarrow e | \text{leak}\{e_1, \ldots, e_j\} \\
e & ::= \chi | n | o(e_1, \ldots, e_j) \\
c & ::= i * 
\end{align*}
$$

The main differences between IL and MaskVerif is that the latter does not have memory accesses, macros and control-flow instructions and limits array accesses to constant indices. Our program transformation proceeds in two steps: first, all macros are inlined; then the expanded program is partially evaluated.

**Partial evaluation** The partial evaluator takes as input an IL program and a public initial state and returns another IL program. The output program is equivalent to the original program w.r.t. functionality and leakage, under some mild assumptions about initial memory layout, explained below.

Our partial evaluator manipulates abstract values and tuples of abstract values, and abstract memories. An abstract value $\vartheta$ can be either a base value corresponding to concrete base values like Boolean $b$ or integer $n$, a label $l$ that represent abstract code pointers and are used for indirect jumps, and abstract pointers $\langle x, n \rangle$. The latter are an abstract representation of a real pointer. Formally, the syntax of values is defined by:

$$
\begin{align*}
\vartheta & ::= b | n | l | \langle x, n \rangle | \bot \\
v & ::= \vartheta | [\vartheta; \ldots; \vartheta]
\end{align*}
$$

Initially the abstract memory is split into different (disjoint) regions modeled by fresh arrays with maximal offset that do not exist in the original program. Those regions
what we call the memory layout. A base value \( \langle x, n \rangle \) represents a pointer to the memory region \( x \) with the offset \( n \) (an integer). This encoding is helpful to deal with pointer arithmetic. The following code gives an example of region declarations:

```plaintext
region mem w32 a[0:1]
region mem w32 b[0:1]
region mem w32 c[0:1]
region mem w32 rnd[0:0]
region mem w32 stack[-4:-1]
```

It means that the initial memory is split into 5 distinct region \( a, b, c, \text{ rnd}, \text{ stack} \), where \( a \) is an array of size 2 with index 0 and 1. Remark that the initial assumption is not checked (and cannot be checked by the tool). Then another part of the memory layout provides some initialisation for registers (IL variables):

```plaintext
init r0 <rnd, 0>
init r1 <c, 0>
init r2 <a, 0>
init r3 <b, 0>
init sp <stack, 0>
```

In particular, this specifies that initially the register \( r0 \) is a pointer to the region \( \text{ rnd} \). Some extra information is also provided to indicate which regions initially contain random values, or correspond to input/output shares.

The partial evaluator is parameterized by a state \( \langle p, c, \mu, \rho, ec \rangle \), where \( p \) is the original IL program, \( c \) is the current command, \( \mu \) a mapping from \( p \)'s variables to their abstract value, \( \rho \) a mapping from variable corresponding to memory region to their abstract value, and \( ec \) is the sequence of commands that have been partially executed. The partial evaluator iteratively propagates values, removes branching instructions, and replaces memory accesses by variable accesses (or constant array accesses). Figure 3 provides some selected rules for the partial evaluator. A complete description of the partial evaluator will appear in the full version.

For expressions, the partial evaluator computes the value \( \vartheta \) of \( e \) in \( \mu \) and \( \rho \) (which can be \( \perp \)) and an expression \( e' \) where memory/array accesses are replaced by variables/-constant array accesses, i.e. \( [e]_\mu^\vartheta = (\vartheta, e') \). If the expression is of the form \( o(e_1, \ldots, e_n) \) where all the arguments are partially evaluated, the resulting expression is the operator applied to the resulting expressions \( e'_i \) of the \( e_i \) and the resulting value is the partial evaluation of \( \check{o}(\vartheta_1, \ldots, \vartheta_n) \) where \( \check{o} \) check is the \( \vartheta_i \) are concrete values in that case it computes the concrete value else it return \( \perp \) (the partial evaluator sometime uses more powerful simplification rules like \( 0 \check{\ast} \vartheta \mapsto \vartheta \)).

If the expression is a variable, the partial evaluator simply returns the value stored in \( \mu \) and the variable itself. The case is similar for array accesses, first the index expression is evaluated and the resulting value should be an integer \( n \), the resulting expression is simple \( x[n] \) and the resulting value is the value stored in \( \mu(x) \) at position \( n \) (the partial evaluator checks that \( n \) is in the bound of the array). For memory access \( \langle e \rangle \) the partial evaluation of \( e \) should lead to an abstract pointer \( \langle x, ofs \rangle \), in this case the resulting expression is \( x[ofs] \) and the value is \( \rho(x)[ofs] \).

For assignment, the partial evaluator evaluates the left side of the assignment \( \chi \) as an expression, leading to a refined “left side” expression \( \chi' \), the right part of the assignment \( e \) is also partially evaluated leading to \( (\vartheta, e') \) the partially evaluated assignment is \( \chi' \leftarrow e' \) and the mapping \( \mu \) and \( \rho \) are updated accordly with the value \( \vartheta \). For leak instructions, the partial evaluator simply propagates known information into the command. For control-flow instructions, the partial evaluator tries to resolve the control-flow and eliminates the instruction. For goto statements, the partial evaluator tries to resolve the next instruction to be executed, and eliminates the instruction.
Proposition 2 (Informal). Let $P$ and $P'$ be an IL gadget and the corresponding MaskVerif gadget output by the partial evaluator. For every initial state $s$ satisfying the memory layout assumptions, the global leakage of $P$ w.r.t. $s$ and a set of inputs is equal to the global leakage of $P'$ w.r.t. the same inputs.

We briefly comment on proving Proposition 2. In order to provide a formal proof, a formal semantics of gadgets is needed. Our treatment so far has intentionally been left informal. However, the behavior of gadgets can be made precise using programming language semantics. We briefly explain how. Specifically, the execution of gadgets can be modelled by a small-step semantics that captures one-step execution between states. This semantics is mostly standard, except for the leak statements which generate observations. Using the small-step semantics, one can model global leakage as a function that takes as input initial values for the inputs and an initial state and produces a sequence of observations, a list of outputs and a final state. Last, we transform global leakage into a probabilistic function by sampling all inputs tagged with the security type $R$ independently and uniformly from their underlying set. This yields a function that takes as input initial values for the inputs and an initial partial state (restricted to the non-random values), a list of observations selected by the adversary and returns a joint distribution over tuples of values, where each tuple corresponds to an observation selected by the adversary.

3.3 Implementation

We have implemented the partial evaluator as a front-end to MaskVerif, named “scVerif”. Users can write leakage models, annotations and programs in IL or provide programs in Assembly code. If the output program lies in the MaskVerif fragment, then verification starts with user specified parameters such as security order or which property to verify. Else, the program is rejected. The tool also applies to bit- and n-sliced implementations and provides additional automation for temporal accumulation of probes to represent capacitance in physical measurement. Shareslicing is not yet supported as the scheme is questioned in [GMP019] and fundamentally insecure in our CM0+ models. However, few additional transformations allow to extend our work to these implementations.
4 Representative Proofs of Efficient Masking

We describe the construction and optimization of gadgets that do not exhibit vulnerable leakage at any order \( t \leq d - 1 \), where \( d \) is the number of shares. That is, we harden masked implementations to be secure at the optimal order \( t = d - 1 \) in fine-grained leakage models, opposed to the “lazy” strategy of masking in a basic model at higher orders with the intention to achieve practical security at lower orders \( t < d - 1 \) [BGG+14].

Creating a secure gadget is an iterative process which involves three tasks: (a) understanding and modeling the actual leakage behavior (b) constructing an (efficient) implementation which is secure in the fine-grained model (c) optionally performing physical evaluation of side-channel resilience to assess the quality of the model for the specific target platform. Protecting an implementation against side-channel effects mandates insertion of instructions to circumvent vulnerable combination of masked secrets.

4.1 Hardened Masking

In this section, we discuss the development of gadgets which enjoy security in any fine-grained leakage model. We design gadgets first in the simplified IL model depicted in Algorithm 3. Designing in IL is more flexible than assembly since shortcuts such as leakage free operations and abstract countermeasures are available. Once the gadget is hardened the gadget is implemented in assembly and verified again, which is to a large degree trivial but requires to substitute abstract countermeasures by concrete instructions.

Each gadget takes as input one or two values \( a \) and \( b \), respectively encoded in \((a_0, \ldots, a_{d-1})\) and \((b_0, \ldots, b_{d-1})\), and gives as output the shares \((c_0, \ldots, c_{d-1})\), encoding a value \( c \). By convention, inputs and outputs are stored in memory to allow construction of implementations at higher orders. Our gadgets, provided in the Supplementary material, use the registers \( R0, R1, R2, \) and \( R3 \) as memory addresses pointing to inputs, outputs and random values stored in memory. The registers \( R4, R5, R6, \) and \( R7 \) are used to perform the elementary operations. Registers beyond \( R7 \) are used rarely.

A gadget which is correctly masked in the basic leakage model, i.e., secure against computation leakage (Definition 1), can be secured by purging the architecture and leakage state at selected locations within the code\(^4\). The reason is simple: every leak must be defined over elements of the state and removing sensitive data from these elements prior the instruction causing such leak mitigates the ability to observe the sensitive data.

We distinguish “scrubbing” countermeasures, which purge architecture state, and “clearing” countermeasures, which remove values residing in leakage state. Two macros serve as abstract countermeasures, \( \text{SCRUB}(R0) \) and \( \text{CLEAR}(\text{opA}) \) assign some value which is independent of secrets to \( R0 \), respectively \( \text{opA} \). On assembly level these need to be substituted by available instructions. Clearing \( \text{opA} \) or \( \text{opB} \) is mostly done by ANDS(\( R0, R0 \)) since \( R0 \) is a public memory address. Purging \( \text{opR} \) (respectively \( \text{opW} \)) requires to execute LOAD (respectively STORE) instruction reading (writing) a public value from memory, but the side-effects of both instructions require additional care. Sometimes multiple countermeasures can be combined in assembly.

Moreover we approach the problem of securing a composition against the leakage effects introduced in Section 2.1 by ensuring that all the registers involved in the computation of a gadget are completely cleaned before the composition with the next gadget. This, indeed, easily guarantees the requirements of stateful \( t \)-SNI in Definition 7. We use \( \text{FCLEAR} \) as abstract placeholder for the macro run after each gadget to clear the state \( s_{out} \). Additional clearings are needed between intermediate computations in the gadgets; these macros are represented as \( \text{CLEAR}_i \), where the index distinguishes between the different macros in the gadget since each variety of leakage needs a different countermeasure.

\(^4\)All \( t \)-NI and \( t \)-SNI algorithms enjoy this property since computation leakage is inherent to masking.
Finally, randomness is employed in order to randomize part of the computation, especially in the case of non-linear gadgets, where otherwise with one probe the attacker could get the knowledge of several shares of the inputs. We indicate with RND a value picked uniformly at random from $F_{2^8}^2$, prior to execution.

For giving an intuition of our strategy, we depict in Algorithm 4 and Algorithm 5 respectively an addition and a multiplication scheme at 1st order of security. Some other examples of stateful $t \cdot SNI$ addition, multiplication and refreshing schemes for different orders can be found in section A of the Supplementary material. They have all been verified to be stateful $t \cdot SNI$ with the use of our new tool. Some algorithms are clearly inspired by schemes already existing in the literature, as the ISW multiplication [ISW03] and the schemes in [BBP +16]. We analyze the S-Box of Present and provide a stateful $t \cdot SNI$ secure Algorithm for first and second order in Appendix C. Stateful $t \cdot SNI$ Security can be achieved by refreshing the output with a secure stateful $t \cdot SNI$ Refresh gadget. For reasons of simplicity, we divided the S-Box into three functions and designed stateful $t \cdot SNI$ secure Gadgets accordingly. Considering that all three Gadgets only have one fan-in and fan-out, the composition is also stateful $t \cdot SNI$ secure.

**Algorithm 4** Addition scheme, 1st order stateful $t \cdot SNI$

**Input:** $a = (a_0, a_1), b = (b_0, b_1)$

**Output:** $c = (c_0, c_1)$, such that $(c_0 = a_0 + b_0), (c_1 = a_1 + b_1)$

1: LOAD(R4, R1, 0); ▷ Load $a_0$ into register $r_4$
2: LOAD(R5, R2, 0); ▷ Load $b_0$ into register $r_5$
3: XOR(R4, R5); ▷ after XOR $r_4$ contains $a_0 + b_0$
4: STORE(R4, R0, 0); ▷ Store the value of $r_4$ as output share $c_0$
5: CLEAR(opA);
6: LOAD(R5, R1, 1); ▷ Load $a_1$ into register $r_5$
7: LOAD(R6, R2, 1); ▷ Load $b_1$ into register $r_6$
8: XOR(R5, R6); ▷ after XOR $r_5$ contains $a_1 + b_1$
9: STORE(R5, R0, 1); ▷ Store the value of $r_5$ as output share $c_1$
10: SCRUB(R4); SCRUB(R5); SCRUB(R6);
11: CLEAR(opA); CLEAR(opB); CLEAR(opF); CLEAR(opk);

The methodology just described, despite being easy to apply, can be expensive, as it requires an extensive use of clearings, especially for guaranteeing secure composition. However, a couple of strategies can be adopted in order to overcome this drawback and optimize the use of clearings. We describe such optimization strategies in the following.

### 4.2 Optimized Composition of Linear Gadgets

The first scenario of optimization is the case when linear gadgets are composed to each other. We refer to this situation as *linear composition*. We exploit the fact that linear gadgets never combine multiple shares of an input and thus independent computation are performed on each input share, denoted “share-wise”. We modify the order in which the operation are usually performed, in such a way that initially all the operations of the first shares are applied, then all the ones on the second shares, and so on.

More formally, let $a, b, c$ be $d$-shared encodings $(a_i)_{i \in[d]}, (b_i)_{i \in[d]}, (c_i)_{i \in[d]}$ and let $\mathcal{F}(a, b) := (\mathcal{F}_0(a_0, b_0), \text{CLEAR}_0, \ldots, \mathcal{F}_{d-1}(a_{d-1}, b_{d-1}), \text{CLEAR}_{d-1}, \text{FCLEAR})$ be a share-wise simulatable linear gadget with, e.g. $\mathcal{F}_i(a_i, b_i)$ outputs $a_i \oplus b_i$ as described in Figure 4 (left) and CLEAR are the leakage countermeasures between each share-wise computation as explained in Section 4.1. In the following we consider a composition $\mathcal{F}(\mathcal{F}(a, b), c)$ and present a technique to optimize the efficiency of both gadgets. Instead of performing first
Algorithm 5 Multiplication scheme, 1st order stateful $t$-SNI

Input: $a = (a_0, a_1)$, $b = (b_0, b_1)$
Output: $c = (c_0, c_1)$, such that $(c_0 = a_0b_0 + \text{RND}_0 + a_0b_1)$, $(c_1 = a_1b_1 + \text{RND}_0 + a_1b_0)$

1. LOAD($R4, R2, 0$);
2. LOAD($R5, R1, 0$);
3. AND($R4, R5$);
4. LOAD($R6, R3, 0$);
5. XOR($R6, R4$);
6. LOAD($R7, R2, 1$);
7. AND($R5, R7$);
8. XOR($R5, R6$);
9. STORE($R5, R0, 0$);
10. CLEAR($\text{opA}$); SCRUB($R4$); SCRUB($R6$);
11. LOAD($R4, R1, 1$);
12. AND($R7, R4$);
13. LOAD($R6, R3, 0$);
14. XOR($R6, R7$);
15. LOAD($R5, R2, 0$);
16. AND($R5, R4$);
17. XOR($R6, R5$);
18. STORE($R6, R0, 1$);
19. SCRUB($R4$); SCRUB($R5$); SCRUB($R6$); SCRUB($R7$);
20. CLEAR($\text{opA}$); CLEAR($\text{opB}$); CLEAR($\text{opA}$); CLEAR($\text{opB}$);

the inner function $F(a, b) =: m$ and then the outer function $F(m, c) =: a$, we perform

$$\hat{F}(a, b, c) = \left( (\hat{F}_i(a_i, b_i, c_i), \text{CLEAR}_i) \right)_{i \in [d]}^\text{FCLEAR}$$

with $\hat{F}_i(a_i, b_i, c_i) = F_i(F_i(a_i, b_i), c_i)$. In other words, we change the order of computation to $m_0, a_0, \ldots, m_{d-1}, a_{d-1}$, rather than $m_0, \ldots, m_{d-1}, a_0, \ldots, a_{d-1}$.

This method allows us to save on the number of CLEAR, LOAD, and STORE operations. In a normal execution, the output $m$ of the first gadget needs to be stored in memory, just to be loaded during the execution of the second gadget. With the optimized execution, instead, we do not need to have such LOADs and STOREs, since the two gadgets are performed at the same time. Additionally, by considering the composition as a unique gadget, we can save on the clearings that would be otherwise needed after the first gadget to ensure the stateful $t$-SNI. We provide a security proof for $\hat{F}(a, b, c)$ in Proposition 3 and a concrete application of Proposition 3 to Algorithm 4 in the Supplementary material.

**Proposition 3.** The optimized gadget $\hat{F}(a, b, c)$ as described above, is stateful-$t$-NI.

**Proof.** We show that all observations in the gadget depend on at most one share of each input. Since the attacker can perform at most $n - 1$ observations, this implies that any combination of its observations is independent of at least one share of each input. More precisely, the computation of the $i$th output of $\hat{F}(a, b, c)$ only depends on the $i$th shares of $a$, $b$ or $c$. Hence the observations in each iteration only leak information about the $i$th share since we clear the state after the computation of each output share. Therefore any combination of $t \leq d - 1$ observations is dependent on at most $t$ shares of each input, and any set $t$ observations is simulatable with at most $t$ shares of each input bundle. 

16
4.3 Optimized Composition of Gadgets with Independent Inputs

The second scenario that we take into account is the one described in Figure 4 (right), where two non-linear gadgets, e.g. two multiplication algorithms, sharing one of the inputs are performed. We refer in the following to this situation as non-linear composition. In this case, it is possible to reduce the number of loadings and clearings, by re-using the shares in common, once loaded into the registers and replacing the intermediate clearings of a gadget by independent computations of another gadget.

The optimization technique described to save clearings also holds for two gadgets with independent inputs. The intermediate clearings in a gadget ensure that two computations on two different shares of the same secret do not leak together. Since this clearing is only a computation independent of the secret, the clearing can be replaced by a useful computation of another gadget.

With our tool, we have proven that the merge of stateful $t$–SNI multiplications, given in Appendix A of the Supplementary material, is also stateful $t$–SNI. Since we only need the more efficient special optimization for the PRESENT S-Box, we focus on two multiplications with shared input. In total, we save 59% cycles for second order. Overhead from clearings and scrubs reduces by 75%, the amount of loads and stores by 47%.

4.4 Case study: Masking the PRESENT S-Box

The impact of our methodology is estimated by masking a large circuit, the PRESENT block cipher, at $1^{st}$ and $2^{nd}$ order with the basic rules for composability (Section 3) and the introduced optimizations (Section 4.2 and 4.3). The structure of the S-Box of PRESENT allows the adoption of the optimization techniques, both in the linear and in the non-linear composition. Based on [CFE16], the S-Box consists of two share-wise functions and one non-linear function. The non-linear part is depicted in Figure 5. A complete description of the S-Box is provided in the Supplementary material.

Our masked implementation of the PRESENT S-Box, using the trivial solution for composability, is provided in the Supplementary material. Algorithm 12 in Appendix C depicts the masked S-Box, where the subroutines CALCA in Algorithm 14, CALCB in
Algorithm 15 and calcG in Algorithm 16 are first order NI gadgets. The optimized version of it, instead, employs our optimization techniques which are given in the subroutines calcA_opt, calcB_opt and calcG_opt, respectively in Algorithms 17, 18 and 19. Our focus is the optimization of computational overhead arising from hardening masked implementations. The optimizations reduce the use of randomness in case of probabilistic clearings and scrubs. Furthermore, the tool can verify that manual choices of randomness reuse in large implementations are secure [FPS17].

Our 1st and 2nd order PRESENT S-Box require 7, respective 26 words of entropy, the implementation is given in Algorithm 13. With the help of the tool the requirements can be reduced to 3, respective 18 words of entropy.

As metric to measure the improvements of our optimization techniques, we take the amount of basic operations used in the implementations, as shown in Figure 1. From this comparison, we can see that both implementations use almost the same amount of core operations (xor and and), since the two versions implement the same algorithm. More precisely, the non-optimized version requires two xor operations less, thanks to the parallel calculation of all output values in calcG_opt, where $b \cdot d$ needs to be added to $a'$ and $d'$. On the other hand, since in the non-optimized version more intermediate values need to be stored and loaded inside the functions, while in the optimized version it is only needed to store intermediate values between the functions, the number of stores and loads employed is lower, producing an improvement in terms of operation count. Additionally, the amount of loads is reduced further in the optimized version by loading every input share once per output share. This holds with the exception of the limited amount of registers, requiring to load $a_1$ and $d_1$ twice for the second output share and $b_0$ and $b_1$ only are needed to load once in the whole gadget.

In Table 1 the efficiency of our approach is depicted as the ratio between the operation needed for the calculation and the overhead caused by clearings in both the normally composed and the optimized versions of the PRESENT S-Box. The comparison shows an efficiency improvement of almost 73% for 1st order and of 63% for 2nd order.

In these regards, we underline how the aforementioned optimization is possible thanks to the use of our new tool. The latter, indeed, allows us to first prove the security of combination of stateful gadgets, i.e., the optimized compositions discussed above, and then to verify their security in the biggest context of the S-Box, which would otherwise be too exhaustive to prove by pen and paper.

| Table 1: Operation and cycle count of normally composed and optimized PRESENT S-Box |
|---------------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|
|                                | 1st order       |                 | 2nd order       |                 |                 |
|                                | composition     | optimized       | opt comp        | composition     | optimized       | opt comp        |
| LOAD                            | 115             | 60              | 0.52            | 251             | 136             | 0.54            |
| AND                             | 24              | 24              | 1               | 54              | 54              | 1               |
| XOR                             | 57              | 59              | 1.054           | 133             | 142             | 1.07            |
| STORE                           | 72              | 48              | 0.67            | 93              | 48              | 0.52            |
| SCRUB( )                        | 95              | 16              | 0.17            | 211             | 53              | 0.25            |
| CLEAR( opA)                     | 35              | 12              | 0.34            | 67              | 20              | 0.3             |
| CLEAR( opB)                     | 130             | 43              | 0.33            | 314             | 80              | 0.25            |
| CLEAR( opR)                     | 130             | 26              | 0.2             | 260             | 73              | 0.28            |
| CLEAR( opW)                     | 56              | 4               | 0.07            | 93              | 10              | 0.11            |
| cycles                          | 1097            | 440             | 0.4             | 2173            | 883             | 0.41            |
Table 2: Density of PRESENT S-Box, i.e., the ration between clearings and operation count

<table>
<thead>
<tr>
<th></th>
<th>1st order</th>
<th></th>
<th>2nd order</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Normal composition</td>
<td>Optimization</td>
<td>Normal composition</td>
<td>Optimization</td>
</tr>
<tr>
<td>#clearings</td>
<td>1.66</td>
<td>0.45</td>
<td>1.78</td>
<td>0.62</td>
</tr>
<tr>
<td>#operations</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

4.5 Resilience in Practice

The question whether proofs in fine-grained leakage models connect to resilience in practice was left open so far.

The connection between threshold probing security and resilience in practice is straightforward: the formal property that no combination of \( t \) (modeled) observations provides benefit to an attacker can directly be transposed to the physical setting where no combination of \( t \) measurement samples should provide valuable information on secrets. A threshold probing proof is thus representative whenever the specified leakage model contains all information derivable from measurement samples, i.e., the model is sufficiently complete. Our work enables verification in leakage models with the mandated precision.

The assurance of representative proofs is important in that it provides a lower bound on the attack complexity since at least \( t \) pieces of information have to be recovered and this difficulty is exponential in \( t \) when sufficient noise is present [PR13]. Our systematic approach allows to get the most out of masking by achieving the optimal resilience at security order \( t = d - 1 \) in practice, which is important for efficient implementations.

The question of evaluating the quality of a model is still unanswered for this new kind of specification which expresses data dependency only. Leakage certification is an established approach to systematically validate the quality of leakage models but requires more detail than needed for probing security since the constituting function for each measurement sample must be modeled [DSM17, DSV14]. Leakage detection is a good candidate due to direct connection to probing security and the way models are shared across implementations. Representative proofs of threshold probing security correspond to the hypothesis that the distribution of every combination of \( t \) leakage observations is independent of secrets. Leakage detection methods such as TVLA assess exactly this hypothesis in comparing the distribution of measurement samples taken during execution on a fixed secret with execution on random secrets [SM15]. Informally, TVLA evaluates whether the observable leakage of computation on secrets can be distinguished from leakage generated by random inputs, which should be indistinguishable for secure implementations.

The quality of our model is evaluated by constructing multiple implementations in this shared model and applying physical leakage detection independently on each implementation. We stress that in this assessment strategy the model becomes stronger the more verified implementations are evaluated using leakage detection, which is a significant benefit of systematic hardening in general. Our model is (empirically) qualitative since all implementations are leakage free at their optimal order in physical leakage detection assessment at a minimum of one million traces\(^5\).

The power consumption of two CM0+ MCUs (“FRDM-KL82Z”, “STM32L073RZ”) is each measured with an oscilloscope sampling the current consumption via an inductive current probe at 2.5 GS/s, a bandwidth of 500 MHz and 8bit quantification. The MCUs are clocked at 4 MHz and every 125 samples are averaged resulting in 5 samples per cycle. Each execution is averaged over four repeated executions to further reduce the noise, resulting in an assessment with very little noise. Sets of one million measurements

\(^5\)The IL and Assembly code for stateful-\( t \)-\texttt{SNI} and and \texttt{refresh}, the stateful-\( t \)-\texttt{NI xor}, \texttt{copy} and \texttt{negation} gadgets, stateful-\( t \)-\texttt{NI} compositions thereof, the optimized stateful-\( t \)-\texttt{NI} PRESENT S-Box, all masked at 1st and 2nd order, as well as the full leakage model are provided at https://github.com/scverif/gadgets
each are compared in random vs. fixed Welch \( t \)-test, alpha certainty of 0.0001. Significant leakage is detected when the \( t \)-statistics are larger than the non-adopted threshold of 4.5.

Our 1\textsuperscript{st} order PRESENT S-Box is free of significant leakage on both MCUs, as seen in Figure 6. The need for device-specific models is easily depicted by the fact that our practically resilient and formally secure code emits detectable leakage when executed on the distinct Arm Cortex M4F (CM4F) architecture, namely the “STM32F407”, as seen in Appendix D, Figure 7.

![Figure 6: Physical leakage detection \( t \)-statistics of optimized 1\textsuperscript{st} order PRESENT S-Box assessment, \( x \) axis represents sample points.](image)

To show the applicability of our model at higher order we evaluate our 2\textsuperscript{nd} order PRESENT S-Box in 2\textsuperscript{nd} order multivariate TVLA on the KL82Z by processing the measurements such that every pair of sample points is combined and evaluated, the results are shown in Appendix D, Figure 7. The combinatorial blow-up requires hundreds of CPU hours to evaluate the S-Box, compared to few seconds when using seVerif.

The model sufficiently represents power side-channel leakage for uni-variate (first order) and multi-variate (higher-order) attacks up to one million traces and as such threshold probing security proofs in this particular model appear representative. In general, the combination of probing security and TVLA evaluation is beneficial as strict verification of many implementations depends on a single, shared specification of leakage behavior while physical evaluation strengthens the shared specification by assessing in different contexts. Re-using models in the form of shared libraries allows to reduce the risk of specification errors as well, thus we provide our model as open-source \[too20\]. Moreover, our approach allows to verify concrete implementations at higher orders of security with predictable resilience in practice, scaling beyond the computational bound of multivariate TVLA.

5 Conclusion

In this paper, we show how automated verification can deliver provably resilient and practically hardened masked implementations with low overhead.

Our DSL allows to construct fine-grained models of side-channel behavior which can be adopted flexibly to specific contexts. For the first time, this approach allows to verify formal notions of side-channel resilience in user-provided models at higher orders. The combination of representative leakage models and formal verification enables to rule out entire classes of practical side-channel attacks backed by provable security statements.

New generic optimization strategies are introduced to reduce the overhead mandated by additional countermeasures for security in fine-grained leakage models. The optimizations are applied to a masked PRESENT S-Box and validated to be leak free up to a
high number of traces in physical leakage assessment despite the high efficiency of the constructions. Moreover, the optimized and hardened constructions show that practical resilience and efficiency can go hand in hand, motivating further research.

Our tool scVerif serves as front-end to MaskVerif but the presented concept to model side-channel behavior explicitly is likely adoptable to verification of other security notions such as noisy or random probing security, given that sufficient information such as signal-to-noise ratio or occurrence probabilities are encoded in the model. This could allow to bound the success rate of attacks at order $t > d$ in combination with the powerful but bounded assurance from probing security for $t \leq d$.

Acknowledgements

Clara Paglialonga and Maximilian Orlt are partially funded by the VeriSec project 16KIS0634 from the Federal Ministry of Education and Research (BMBF) and the Hessen State Ministry for Higher Education, Research and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE, and by the Emmy Noether Program FA 1320/1-1. Marc Gourjon is partially funded by the VeriSec project 16KIS0601K from BMBF.

References


Supplementary material

A Basic algorithms

A.1 Addition gadgets

Algorithm 6 SECXOR: Addition scheme at 2\textsuperscript{nd} order of security

\textbf{Input:} \(a = (a_0, a_1, a_2); \ b = (b_0, b_1, b_2)\)

\textbf{Output:} \(c = (c_0, c_1, c_2), \) such that

\[
\begin{align*}
c_0 &= a_0 + b_0 \\
c_1 &= a_1 + b_1 \\
c_2 &= a_2 + b_2
\end{align*}
\]

1: \textsc{load}(R4, R1, 0);
2: \textsc{load}(R5, R2, 0);
3: \textsc{xor}(R4, R5);
4: \textsc{store}(R4, R0, 0);
5: \textsc{clear}(\textsc{opE})
6: \textsc{clear}(\textsc{opW})
7: \textsc{load}(R6, R1, 1);
8: \textsc{load}(R7, R2, 1);
9: \textsc{xor}(R6, R7);
10: \textsc{store}(R7, R0, 1);
11: \textsc{clear}(\textsc{opE})
12: \textsc{clear}(\textsc{opW});
13: \textsc{scrub}(R6);
14: \textsc{load}(R5, R1, 2);
15: \textsc{load}(R6, R2, 2);
16: \textsc{xor}(R5, R6);
17: \textsc{store}(R5, R0, 2);
18: \textsc{scrub}(R4);
19: \textsc{scrub}(R5);
20: \textsc{scrub}(R6);
21: \textsc{clear}(\textsc{opA})
22: \textsc{clear}(\textsc{opB})
23: \textsc{clear}(\textsc{opR})
24: \textsc{clear}(\textsc{opW})

Algorithm 7 Addition scheme at \(n\textsuperscript{th}\) order of security

\textbf{Input:} \(a = (a_0, ..., a_n); \ b = (b_0, ..., b_n)\)

\textbf{Output:} \(c = (c_0, ..., c_n), \) such that

\[
c_i = a_i + b_i, \ 0 \leq i \leq n
\]

1: \textbf{for} \(i = 0\ \text{to} \ n\ \textbf{do}
2: \ \textsc{load}(R4, R1, i);
3: \ \textsc{load}(R5, R2, i);
4: \ \textsc{xor}(R4, R5);
5: \ \textsc{store}(R4, R0, i);
6: \ \textsc{clear}(\textsc{opW})
7: \ \textsc{scrub}(R4);
8: \ \textsc{scrub}(R5);
9: \ \textbf{end for}
10: \ \textsc{clear}(\textsc{opA})
11: \ \textsc{clear}(\textsc{opB})
12: \ \textsc{clear}(\textsc{opR})
### A.2 Multiplication gadgets

**Algorithm 8 SECMULT: Multiplication scheme at 2\textsuperscript{nd} order of security**

**Input:** \( a = (a_0, a_1, a_2), b = (b_0, b_1, b_2) \)

**Output:** \( c = (c_0, c_1, c_2) \), such that

\[
\begin{align*}
c_0 &= a_0 b_0 + \text{rnd}_0 + a_0 b_1 + \text{rnd}_1 + a_1 b_0 \\
c_1 &= a_1 b_1 + \text{rnd}_1 + a_1 b_2 + \text{rnd}_2 + a_2 b_1 \\
c_2 &= a_2 b_2 + \text{rnd}_2 + a_2 b_0 + \text{rnd}_0 + a_0 b_2
\end{align*}
\]

<table>
<thead>
<tr>
<th>Line</th>
<th>Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>LOAD(R5, R1, 0);</td>
</tr>
<tr>
<td>2</td>
<td>LOAD(R4, R2, 0);</td>
</tr>
<tr>
<td>3</td>
<td>AND(R4, R5);</td>
</tr>
<tr>
<td>4</td>
<td>LOAD(R6, R3, 0);</td>
</tr>
<tr>
<td>5</td>
<td>XOR(R6, R4);</td>
</tr>
<tr>
<td>6</td>
<td>CLEAR(opB);</td>
</tr>
<tr>
<td>7</td>
<td>LOAD(R7, R2, 1);</td>
</tr>
<tr>
<td>8</td>
<td>AND(R7, R5);</td>
</tr>
<tr>
<td>9</td>
<td>XOR(R6, R7);</td>
</tr>
<tr>
<td>10</td>
<td>SCRUB(R4);</td>
</tr>
<tr>
<td>11</td>
<td>LOAD(R4, R1, 1);</td>
</tr>
<tr>
<td>12</td>
<td>LOAD(R5, R2, 0);</td>
</tr>
<tr>
<td>13</td>
<td>CLEAR(opB);</td>
</tr>
<tr>
<td>14</td>
<td>AND(R4, R5);</td>
</tr>
<tr>
<td>15</td>
<td>XOR(R6, R4);</td>
</tr>
<tr>
<td>16</td>
<td>CLEAR(opA);</td>
</tr>
<tr>
<td>17</td>
<td>SCRUB(R5);</td>
</tr>
<tr>
<td>18</td>
<td>LOAD(R5, R3, 1);</td>
</tr>
<tr>
<td>19</td>
<td>XOR(R6, R5);</td>
</tr>
<tr>
<td>20</td>
<td>STORE(R6, R0, 0);</td>
</tr>
<tr>
<td>21</td>
<td>CLEAR(op\textsuperscript{b});</td>
</tr>
<tr>
<td>22</td>
<td>SCRUB(R4);</td>
</tr>
<tr>
<td>23</td>
<td>SCRUB(R5);</td>
</tr>
<tr>
<td>24</td>
<td>SCRUB(R7);</td>
</tr>
<tr>
<td>25</td>
<td>LOAD(R5, R1, 1);</td>
</tr>
<tr>
<td>26</td>
<td>LOAD(R4, R2, 1);</td>
</tr>
</tbody>
</table>
A.3 Refreshing gadgets

**Algorithm 9** FIRSTREF: Refreshing scheme at 1st order of security

**Input:** \( a = (a_0, a_1) \)

**Output:** \( c = (c_0, c_1) \), such that

\[
\begin{align*}
c_0 &= a_0 + \text{rnd}_0 \\
c_1 &= a_1 + \text{rnd}_0
\end{align*}
\]

1: \text{LOAD}(R4, R1, 0); \quad \triangleright \text{Load } a_0 \text{ into register } r_4
2: \text{LOAD}(R5, R3, 0); \quad \triangleright \text{Load } \text{rnd}_0 \text{ into register } r_5
3: \text{XOR}(R4, R5); \quad \triangleright \text{after XOR } r_4 \text{ contains } a_0 + \text{rnd}_0
4: \text{STORE}(R4, R0, 0); \quad \triangleright \text{Store the value of } r_4 \text{ as output share } c_0
5: \text{CLEAR}(\text{opW}); \quad \triangleright \text{Load } a_1 \text{ into register } r_6
6: \text{LOAD}(R6, R1, 1); \quad \triangleright \text{after XOR } r_4 \text{ contains } a_1 + \text{rnd}_0
7: \text{XOR}(R6, R4); \quad \triangleright \text{Store the value of } r_4 \text{ as output share } c_1
8: \text{STORE}(R4, R0, 1);
9: \text{SCRUB}(R4);
10: \text{SCRUB}(R5);
11: \text{SCRUB}(R6);
12: \text{CLEAR}(\text{opA});
13: \text{CLEAR}(\text{opB});
14: \text{CLEAR}(\text{opR});
15: \text{CLEAR}(\text{opW});

**Algorithm 10** SECREF: Refreshing scheme at 2nd order of security

**Input:** \( a = (a_0, a_1, a_2) \)

**Output:** \( c = (c_0, c_1, c_2) \), such that

\[
\begin{align*}
c_0 &= a_0 + \text{rnd}_0 \\
c_1 &= a_1 + \text{rnd}_1 \\
c_2 &= a_2 + \text{rnd}_0 + \text{rnd}_1
\end{align*}
\]

1: \text{LOAD}(R4, R3, 0);
2: \text{LOAD}(R6, R1, 0);
3: \text{CLEAR}(\text{opA});
4: \text{LOAD}(R5, R3, 1);
5: \text{XOR}(R6, R4);
6: \text{STORE}(R6, R0, 0);
7: \text{CLEAR}(\text{opW});
8: \text{SCRUB}(R6);
9: \text{LOAD}(R7, R1, 1);
10: \text{CLEAR}(\text{opA});
11: \text{XOR}(R7, R5);
12: \text{STORE}(R7, R0, 1);
13: \text{CLEAR}(\text{opB});
14: \text{CLEAR}(\text{opE});
15: \text{XOR}(R4, R5);
16: \text{SCRUB}(R5);
17: \text{CLEAR}(\text{opB});
18: \text{LOAD}(R5, R1, 2);
19: \text{XOR}(R5, R4);
20: \text{STORE}(R5, R0, 2);
21: \text{SCRUB}(R4);
22: \text{SCRUB}(R5);
23: \text{SCRUB}(R6);
24: \text{SCRUB}(R7);
25: \text{CLEAR}(\text{opA});
26: \text{CLEAR}(\text{opE});
27: \text{CLEAR}(\text{opR});
28: \text{CLEAR}(\text{opW});
B  Optimization with Proposition 3

In Algorithm 11, we give the concrete construction of how Proposition 3 is applied to the standard XOR given in Algorithm 7. We point out that we analyzed the worst-case scenario in Proposition 2, and in Algorithm 11, a complete clear is not needed between the computation of each output share. Table 3 illustrates that all observations never depend on two different shares of the same input and $t$-NI security holds with the same arguments as in the proof.

Algorithm 11 Optimized addition scheme at $n^{th}$ order of security

Input: $a = (a_0, ..., a_n)$, $b = (b_0, ..., b_n)$ and $c = (c_0, ..., c_n)$

Output: $d = (d_0, ..., d_n)$, such that

$$d_i = a_i + b_i + c_i, 0 \leq i \leq n$$

1: for $(i = 0$ to $n)$ do
2: LOAD($R_5$, $R_2$, $i$);
3: LOAD($R_4$, $R_1$, $i$);
4: XOR($R_4$, $R_5$);
5: LOAD($R_5$, $R_3$, $i$);
6: XOR($R_4$, $R_5$);
7: STORE($R_4$, $R_0$, $i$);
8: CLEAR(opW);
9: SCRUB($R_4$);
10: end for
11: SCRUB($R_5$);
12: CLEAR(opA);
13: CLEAR(opB);
14: CLEAR(opR);

Table 3: Observations captured in the $i^{th}$ loop iteration of Algorithm 11

<table>
<thead>
<tr>
<th>Leakage effect</th>
<th>line 2</th>
<th>line 3</th>
<th>line 4</th>
<th>line 5</th>
<th>line 6</th>
<th>line 7</th>
</tr>
</thead>
<tbody>
<tr>
<td>Computation</td>
<td>-</td>
<td>-</td>
<td>$(a_i + b_i)$</td>
<td>$(a_i + b_i + c_i)$</td>
<td>$(a_i + b_i + c_i)$</td>
<td>$(a_i + b_i + q_i)$, $(c_i, a_i + b_i + c_i)$</td>
</tr>
<tr>
<td>Transition</td>
<td>$(c_{i-1}, b_i)$</td>
<td>$(pub, a_i)$</td>
<td>$(a_i, b_i, a_i + b_i)$</td>
<td>$(b_i, c_i)$</td>
<td>$(a_i + b_i, c_i, a_i + b_i + c_i)$</td>
<td></td>
</tr>
<tr>
<td>Revenant</td>
<td>$(b_i, c_{i-1})$</td>
<td>$(b_i, a_i)$</td>
<td>$(pub, a_i)$</td>
<td>$(b_i, b_i)$, $(b_i, c_i)$</td>
<td>$(b_i, c_i)$</td>
<td>$(pub, a_i + b_i + q_i c_i), (c_i, a_i + b_i + c_i)$</td>
</tr>
</tbody>
</table>
C PRESENT Sbox

The PRESENT S-box $S$ of the first order implementation, based on [CFE16], is expressed in the following way:

$$S(x) = A(G(B(x))))$$

with the affine functions $A$ and $B$:

$$A(x) = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 \end{bmatrix} x \oplus \begin{bmatrix} 0 \\ 1 \\ 0 \\ 1 \end{bmatrix} \quad B(x) = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} x \oplus \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$

and the function $G : \{0, 1\}^4 \mapsto \{0, 1\}^4$

$$G(a, b, c, d) = (a', b', c', d')$$

$$a' = a + bc + bd$$

$$b' = d + ab$$

$$c' = b$$

$$d' = c + bd$$

C.1 first order

Algorithm 12 PRESENT s-Box at 1st order of security

Input: $x = (x_0, x_1)$
Output: $y = (y_0, y_1)$, such that

$$y_0 \oplus y_1 = A(G(G(B(x_0 \oplus x_1))))$$

1: calcB();
2: calcG();
3: calcG();
4: calcA();

Algorithm 13 optimized PRESENT s-Box at 1st order of security

Input: $x = (x_0, x_1)$
Output: $y = (y_0, y_1)$, such that

$$y_0 \oplus y_1 = A(G(G(B(x_0 \oplus x_1))))$$

1: calcB_opt();
2: calcG_opt();
3: calcG_opt();
4: calcA_opt();

Algorithm 14 calcA: function A of the PRESENT s-Box at 1st order of security

Input: $x = (x_0, x_1)$
Output: $y = (y_0, y_1)$, such that

$$y_0 \oplus y_1 = A(x_0 \oplus x_1)$$

1: FIRSTXOR(x0, x2, y0);
2: FIRSTXORONE(x1, y1);
3: FIRSTSTORE(x0, y2);
4: FIRSTXOR(x0, x2, y3);
5: FIRSTXOR(y3, x3, y4);
6: FIRSTXORONE(y3, y3);
Algorithm 15 calCB: function B of the PRESENT s-Box at 1st order of security

Input: $x = (x_0, (x_1)$
Output: $y = (y_0, y_1)$, such that

$$y_0 \oplus y_1 = B(x_0 \oplus x_1)$$

1: FIRSTXOR($x_1, x_3, y_1$);
2: FIRSTXORONE($y_3, y_3$);
3: FIRSTSTORE($x_2, y_2$);
4: FIRSTXOR($x_1, x_2, y_1$);
5: FIRSTREF($y_1, R3, 0, y_1$);
6: FIRSTXOR($x_0, x_1, y_0$);

Algorithm 16 calCG: function G of the PRESENT s-Box at 1st order of security

Input: $x = (x_0, x_1)$
Output: $y = (y_0, y_1)$, such that

$$y_0 \oplus y_1 = G((x_0 \oplus x_1))$$

1: FIRSTMULT($b_{in}, d_{in}, R3, d_{out}$);
2: FIRSTMULT($b_{in}, c_{in}, R3, a_{out}$);
3: FIRSTXOR($a_{out}, a_{in}, a_{out}$);
4: FIRSTXOR($a_{out}, d_{out}, a_{out}$);
5: FIRSTXOR($c_{in}, d_{out}, d_{out}$);
6: FIRSTMULT($a_{in}, b_{in}, R3, b_{out}$);
7: FIRSTXOR($b_{out}, d_{in}, b_{out}$);
8: FIRSTSTORE($b_{in}, c_{out}$);
Algorithm 17: \texttt{CALCA\_OPT}: optimized function A of the PRESENT s-Box at 1\textsuperscript{st} order of security

Input: $x = (x_0, x_1)$

Output: $y = (y_0, y_1)$, such that

$$y_0 \oplus y_1 = A(x_0 \oplus x_1)$$

1: \texttt{LOAD}(R4, R1, 0);
2: \texttt{STORE}(R4, R0, 4);
3: \texttt{LOAD}(R5, R1, 4);
4: \texttt{XOR}(R5, R4);
5: \texttt{STORE}(R5, R0, 0);
6: \texttt{XOR}(R5, 0xFFFFFFFF);
7: \texttt{LOAD}(R6, R1, 6);
8: \texttt{XOR}(R5, R6);
9: \texttt{STORE}(R5, R0, 6);
10: \texttt{LOAD}(R5, R1, 2);
11: \texttt{XOR}(R5, 0xFFFFFFFF);
12: \texttt{STORE}(R5, R0, 2);
13: \texttt{LOAD}(R5, R1, 1);
14: \texttt{STORE}(R5, R0, 5);
15: \texttt{LOAD}(R4, R1, 7);
16: \texttt{XOR}(R4, R4);
17: \texttt{STORE}(R4, R0, 1);
18: \texttt{LOAD}(R5, R1, 7);
19: \texttt{XOR}(R4, R5);
20: \texttt{STORE}(R4, R0, 7);
21: \texttt{LOAD}(R5, R1, 3);
22: \texttt{STORE}(R5, R0, 3);
23: \texttt{SCRUB}(R4);
24: \texttt{SCRUB}(R5);
25: \texttt{SCRUB}(R6);
26: \texttt{CLEAR}(\texttt{opA})
27: \texttt{CLEAR}(\texttt{opB})
28: \texttt{CLEAR}(\texttt{opR})
29: \texttt{CLEAR}(\texttt{opW})

\begin{itemize}
\item \texttt{Load} $a_0$ into register $r_4$
\item \texttt{Store} the value of $r_4$ as output share $c'_0$
\item \texttt{after XOR} $r_5$ contains $c_0 + a_0$
\item \texttt{Store} the value of $r_5$ as output share $a'_0$
\item \texttt{after XOR} $r_5$ contains $c_0 + a_0 + 1$
\item \texttt{after XOR} $r_5$ contains $c_0 + a_0 + 1 + d_0$
\item \texttt{Store} the value of $r_5$ as output share $d'_0$
\item \texttt{after XOR} $r_5$ contains $b_0 + 1$
\item \texttt{Store} the value of $r_5$ as output share $b'_0$
\item \texttt{Load} $a_1$ into register $r_5$
\item \texttt{Store} the value of $r_5$ as output share $c'_1$
\item \texttt{after XOR} $r_5$ contains $c_1 + a_1$
\item \texttt{Store} the value of $r_4$ as output share $a'_1$
\item \texttt{after XOR} $r_4$ contains $c_1 + a_1 + d_1$
\item \texttt{Store} the value of $r_4$ as output share $d'_1$
\item \texttt{Load} $b_1$ into register $r_5$
\item \texttt{Store} the value of $r_5$ as output share $b'_1$
\end{itemize}
Algorithm 18 calcB_opt: optimized function B of the PRESENT s-Box at 1st order of security

Input: $x = (x_0, x_1)$
Output: $y = (y_0, y_1)$, such that

$$y_0 \oplus y_1 = B(x_0 \oplus x_1)$$

1: LOAD($R4, R1, 2$);
2: LOAD($R5, R1, 6$);
3: XOR($R5, R4$);
4: XOR($R5, 0xFFFFFFFF$);
5: STORE($R5, R0, 6$);
6: LOAD($R6, R1, 0$);
7: XOR($R6, R4$);
8: STORE($R6, R0, 0$);
9: LOAD($R5, R1, 4$);
10: XOR($R4, R5$);
11: LOAD($R6, R3, 2$);
12: XOR($R4, R6$);
13: STORE($R4, R0, 2$);
14: STORE($R5, R0, 4$);
15: LOAD($R4, R1, 3$);
16: LOAD($R5, R1, 7$);
17: XOR($R5, R4$);
18: STORE($R5, R0, 7$);
19: LOAD($R6, R1, 1$);
20: XOR($R6, R4$);
21: STORE($R6, R0, 1$);
22: LOAD($R5, R1, 7$);
23: XOR($R4, R5$);
24: LOAD($R6, R3, 2$);
25: XOR($R4, R6$);
26: STORE($R4, R0, 3$);
27: STORE($R5, R0, 5$);
28: SCRUB($R4$);
29: SCRUB($R5$);
30: SCRUB($R6$);
31: CLEAR($opA$);
32: CLEAR($opB$);
33: CLEAR($opC$);
34: CLEAR($opD$);
Algorithm 19 calcG_opt: optimized function G of the PRESENT s-Box at 1st order of security

Input: $x = (x_0, x_1)$

Output: $y = (y_0, y_1)$, such that

$$y_0 \oplus y_1 = G(x_0 \oplus x_1)$$

1: LOAD(R4, R1, 2);
2: LOAD(R6, R1, 5);
3: AND(R6, R4);
4: LOAD(R5, R1, 6);
5: AND(R5, R4);
6: XOR(R6, R5);
7: LOAD(R7, R3, 0);
8: XOR(R6, R7);
9: LOAD(R7, R3, 1);
10: XOR(R5, R7);
11: LOAD(R7, R1, 4);
12: XOR(R5, R7);
13: AND(R7, R4);
14: XOR(R6, R7);
15: LOAD(R7, R1, 7);
16: AND(R7, R4);
17: XOR(R5, R7);
18: STORE(R5, R0, 6);
19: CLEAR(opR);
20: LOAD(R5, a_{in}, 0);
21: XOR(R6, R5);
22: AND(R5, R4);
23: CLEAR(opA);
24: XOR(R6, R7);
25: STORE(R6, R0, 0);
26: LOAD(R6, R1, 6);
27: CLEAR(opB);
28: XOR(R5, R6);
29: STORE(R4, R0, 4);
30: CLEAR(opB);
31: LOAD(R7, R1, 1);
32: AND(R4, R7);
33: LOAD(R7, R3, 2);
34: CLEAR(opE);
35: XOR(R5, R7);
36: CLEAR(opE);
37: CLEAR(opA);
38: XOR(R5, R4);
39: STORE(R5, R0, 2);
40: LOAD(R7, R1, 3);
41: LOAD(R5, R1, 4);
42: AND(R5, R7);
43: AND(R6, R7);
44: XOR(R5, R6);
45: LOAD(R4, R3, 0);
46: CLEAR(opB);
47: XOR(R5, R4);
48: LOAD(R4, R3, 1);
49: XOR(R6, R4);
50: LOAD(R4, R1, 5);
51: XOR(R6, R4);
52: AND(R4, R7);
53: XOR(R5, R4);
54: LOAD(R4, R1, 7);
55: AND(R4, R7);
56: XOR(R6, R4);
57: STORE(R6, R0, 7);
58: LOAD(R6, R1, 1);
59: XOR(R5, R6);
60: AND(R6, R7);
61: CLEAR(opA);
62: XOR(R5, R4);
63: STORE(R5, R0, 1);
64: LOAD(R5, R1, 7);
65: CLEAR(opB);
66: XOR(R6, R5);
67: STORE(R7, R0, 5);
68: SCRUB(R4);
69: CLEAR(opR);
70: LOAD(R4, a_{in}, 0);
71: CLEAR(opB);
72: AND(R7, R4);
73: LOAD(R4, R3, 2);
74: CLEAR(opB);
75: XOR(R6, R4);
76: CLEAR(opB);
77: CLEAR(opA);
78: XOR(R5, R4);
79: STORE(R5, R0, 3);
80: SCRUB(R5);
81: SCRUB(R6);
82: SCRUB(R4);
83: SCRUB(R7);
84: CLEAR(opR);
85: CLEAR(opW);
86: CLEAR(opA);
87: CLEAR(opB);
C.2 second order

Algorithm 20 calcA\_opt: optimized function A of the PRESENT s-Box at 2nd order of security

Input: $x = (x_0, x_1, x_2)$

Output: $y = (y_0, y_1, y_2)$, such that

$$y_0 \oplus y_1 \oplus y_2 = A(x_0 \oplus x_1 \oplus x_2)$$

1: load $(R_4, R_1, 0)$;
2: store $(R_4, R_0, 6)$;
3: load $(R_5, R_1, 6)$;
4: xor $(R_5, R_4)$;
5: store $(R_5, R_0, 0)$;
6: xor $(R_5, 0xFFFFFFFF)$;
7: load $(R_6, R_1, 9)$;
8: xor $(R_5, R_6)$;
9: store $(R_5, R_0, 9)$;
10: load $(R_5, R_1, 3)$;
11: xor $(R_5, 0xFFFFFFFF)$;
12: store $(R_5, R_0, 3)$;
13: load $(R_5, R_1, 1)$;
14: store $(R_5, R_0, 7)$;
15: clear $(opB)$;
16: load $(R_4, R_1, 7)$;
17: clear $(opB)$;
18: xor $(R_4, R_5)$;
19: store $(R_4, R_0, 1)$;
20: load $(R_5, R_1, 10)$;
21: xor $(R_4, R_5)$;
22: store $(R_4, R_0, 10)$;
23: load $(R_5, R_1, 4)$;
24: store $(R_5, R_0, 4)$;
25: load $(R_5, R_1, 2)$;
26: store $(R_5, R_0, 8)$;
27: clear $(opB)$;
28: load $(R_4, R_1, 8)$;
29: clear $(opB)$;
30: xor $(R_4, R_5)$;
31: store $(R_4, R_0, 2)$;
32: load $(R_5, R_1, 11)$;
33: xor $(R_4, R_5)$;
34: store $(R_4, R_0, 11)$;
35: load $(R_5, R_1, 5)$;
36: store $(R_5, R_0, 5)$;
37: scrub $(R_4)$;
38: scrub $(R_5)$;
39: scrub $(R_6)$;
40: clear $(opA)$;
41: clear $(opB)$;
42: clear $(opR)$;
43: clear $(op\bar{R})$;
Algorithm 21\textsc{calcb\_opt}: optimized function $B$ of the Present s-Box at 2\textsuperscript{nd} order of security

Input: $x = (x_0, x_1, x_2)$

Output: $y = (y_0, y_1, y_2)$, such that

$$y_0 \oplus y_1 \oplus y_2 = B(x_0 \oplus x_1 \oplus x_2)$$

1: \texttt{LOAD}(R4, R1, 3);
2: \texttt{LOAD}(R5, R1, 9);
3: \texttt{XOR}(R5, R4);
4: \texttt{XOR}(R5, 0xFFFFFFFF);
5: \texttt{LOAD}(R6, R3, 0);
6: \texttt{XOR}(R5, R6);
7: \texttt{STORE}(R5, R0, 9);
8: \texttt{LOAD}(R6, R1, 0);
9: \texttt{XOR}(R6, R4);
10: \texttt{LOAD}(R5, R3, 2);
11: \texttt{XOR}(R6, R5);
12: \texttt{STORE}(R6, R0, 0);
13: \texttt{LOAD}(R5, R1, 6);
14: \texttt{XOR}(R4, R5);
15: \texttt{LOAD}(R6, R3, 4);
16: \texttt{XOR}(R4, R6);
17: \texttt{STORE}(R4, R0, 3);
18: \texttt{LOAD}(R6, R3, 6);
19: \texttt{XOR}(R5, R6);
20: \texttt{STORE}(R5, R0, 6);
21: \texttt{SCRUB}(R4);
22: \texttt{SCRUB}(R5);
23: \texttt{SCRUB}(R6);
24: \texttt{CLEAR}(opA);
25: \texttt{CLEAR}(opB);
26: \texttt{CLEAR}(opR);
27: \texttt{CLEAR}(op\ddag);
28: \texttt{LOAD}(R4, R1, 4);
29: \texttt{LOAD}(R5, R1, 10);
30: \texttt{XOR}(R5, R4);
31: \texttt{LOAD}(R6, R3, 1);
32: \texttt{XOR}(R5, R6);
33: \texttt{STORE}(R5, R0, 10);
34: \texttt{LOAD}(R6, R1, 1);
35: \texttt{XOR}(R6, R4);
36: \texttt{LOAD}(R5, R3, 3);
37: \texttt{XOR}(R6, R5);
38: \texttt{STORE}(R6, R0, 1);
39: \texttt{LOAD}(R5, R1, 7);
40: \texttt{XOR}(R4, R5);
41: \texttt{LOAD}(R6, R3, 5);
42: \texttt{XOR}(R4, R6);
43: \texttt{STORE}(R4, R0, 4);
44: \texttt{LOAD}(R6, R3, 7);
45: \texttt{XOR}(R5, R6);
46: \texttt{STORE}(R5, R0, 7);
47: \texttt{SCRUB}(R4);
48: \texttt{SCRUB}(R5);
49: \texttt{SCRUB}(R6);
50: \texttt{CLEAR}(op\ddag);
51: \texttt{CLEAR}(opB);
52: \texttt{CLEAR}(op\ddag);
53: \texttt{CLEAR}(op\ddag);
54: \texttt{LOAD}(R4, R1, 5);
55: \texttt{LOAD}(R5, R1, 11);
56: \texttt{XOR}(R5, R4);
57: \texttt{SCRUB}(R6);
58: \texttt{LOAD}(R6, R3, 0);
59: \texttt{CLEAR}(opR);
60: \texttt{LOAD}(R7, R3, 1);
61: \texttt{CLEAR}(op\ddag);
62: \texttt{XOR}(R6, R7);
63: \texttt{CLEAR}(op\ddag);
64: \texttt{CLEAR}(op\ddag);
65: \texttt{XOR}(R5, R6);
66: \texttt{STORE}(R5, R0, 11);
67: \texttt{LOAD}(R6, R1, 2);
68: \texttt{XOR}(R6, R4);
69: \texttt{SCRUB}(R5);
70: \texttt{LOAD}(R5, R3, 2);
71: \texttt{CLEAR}(op\ddag);
72: \texttt{LOAD}(R7, R3, 3);
73: \texttt{CLEAR}(opB);
74: \texttt{XOR}(R5, R7);
75: \texttt{CLEAR}(op\ddag);
76: \texttt{CLEAR}(op\ddag);
77: \texttt{XOR}(R6, R5);
78: \texttt{STORE}(R6, R0, 2);
79: \texttt{LOAD}(R5, R1, 8);
80: \texttt{CLEAR}(op\ddag);
81: \texttt{XOR}(R4, R5);
82: \texttt{SCRUB}(R6);
83: \texttt{LOAD}(R6, R3, 4);
84: \texttt{CLEAR}(op\ddag);
85: \texttt{LOAD}(R7, R3, 5);
86: \texttt{CLEAR}(op\ddag);
87: \texttt{XOR}(R6, R7);
88: \texttt{CLEAR}(op\ddag);
89: \texttt{CLEAR}(op\ddag);
90: \texttt{XOR}(R4, R6);
91: \texttt{STORE}(R4, R0, 5);
92: \texttt{SCRUB}(R6);
93: \texttt{LOAD}(R6, R3, 6);
94: \texttt{CLEAR}(op\ddag);
95: \texttt{LOAD}(R7, R3, 7);
96: \texttt{CLEAR}(op\ddag);
97: \texttt{XOR}(R6, R7);
98: \texttt{CLEAR}(op\ddag);
99: \texttt{CLEAR}(op\ddag);
100: \texttt{XOR}(R5, R6);
101: \texttt{STORE}(R5, R0, 8);
102: \texttt{SCRUB}(R4);
103: \texttt{SCRUB}(R5);
104: \texttt{SCRUB}(R6);
105: \texttt{SCRUB}(R7);
106: \texttt{CLEAR}(op\ddag);
107: \texttt{CLEAR}(op\ddag);
108: \texttt{CLEAR}(op\ddag);
109: \texttt{CLEAR}(op\ddag);
Algorithm 22 CALCG\_OPT: optimized function $G$ of PRESENT s-Box at 2\textsuperscript{nd} order of security

Input: $x = \{x_0, x_1, x_2\}$

Output: $y = \{y_0, y_1, y_2\}$, such that

$$y_0 \oplus y_1 \oplus y_2 = G(x_0 \oplus x_1 \oplus x_2)$$

1. LOAD($R_4, R_1, 3$);
2. LOAD($R_5, R_1, 10$);
3. LOAD($R_6, R_1, 5$);
4. LOAD($R_7, R_1, 1$);
5. AND($R_5, R_4$);
6. AND($R_6, R_4$);
7. AND($R_7, R_4$);
8. XOR($R_6, R_5$);
9. LOAD($R_0, R_3, 0$);
10. XOR($R_5, R_0$);
11. LOAD($R_0, R_3, 3$);
12. XOR($R_6, R_0$);
13. LOAD($R_0, R_3, 6$);
14. XOR($R_7, R_0$);
15. LOAD($R_0, R_1, 0$);
16. XOR($R_6, R_0$);
17. AND($R_0, R_4$);
18. XOR($R_7, R_0$);
19. LOAD($R_0, R_1, 6$);
20. XOR($R_5, R_0$);
21. AND($R_0, R_4$);
22. XOR($R_6, R_0$);
23. LOAD($R_0, R_1, 9$);
24. XOR($R_7, R_0$);
25. AND($R_0, R_4$);
26. XOR($R_5, R_0$);
27. XOR($R_6, R_0$);
28. SCRUB($R_0$);
29. CLEAR($opR$);
30. LOAD($R_0, R_3, 1$);
31. CLEAR($opA$);
32. CLEAR($opB$);
33. XOR($R_5, R_0$);
34. LOAD($R_0, R_3, 4$);
35. XOR($R_6, R_0$);
36. LOAD($R_0, R_3, 7$);
37. XOR($R_7, R_0$);
38. LOAD($R_0, R_1, 2$);
39. AND($R_0, R_4$);
40. XOR($R_7, R_0$);
41. LOAD($R_0, R_1, 8$);
42. AND($R_0, R_4$);
43. XOR($R_6, R_0$);
44. LOAD($R_0, R_1, 11$);
45. AND($R_0, R_4$);
46. XOR($R_5, R_0$);
47. XOR($R_6, R_0$);
48. STORE($R_6, R_0, 0$);
49. STORE($R_7, R_0, 3$);
50. STORE($R_4, R_0, 6$);
51. STORE($R_5, R_0, 9$);
52. SCRUB($R_0$);
53. SCRUB($R_5$);
54. SCRUB($R_6$);
55. SCRUB($R_4$);
56. SCRUB($R_7$);
57. CLEAR($opA$);
58. CLEAR($opB$);
59. CLEAR($opR$);
60. CLEAR($opW$);
61. LOAD($R_4, R_1, 4$);
62. LOAD($R_5, R_1, 11$);
63. LOAD($R_6, R_1, 8$);
64. LOAD($R_7, R_1, 2$);
65. AND($R_5, R_4$);
66. AND($R_6, R_4$);
67. AND($R_7, R_4$);
68. XOR($R_6, R_5$);
69. LOAD($R_0, R_3, 1$);
70. XOR($R_5, R_0$);
71. LOAD($R_0, R_3, 4$);
72. XOR($R_6, R_0$);
73. LOAD($R_0, R_3, 7$);
74. XOR($R_7, R_0$);
75. LOAD($R_0, R_1, 1$);
76. XOR($R_6, R_0$);
77. AND($R_0, R_4$);
78. XOR($R_7, R_0$);
79. LOAD($R_0, R_1, 5$);
80. XOR($R_5, R_0$);
81. AND($R_0, R_4$);
82. XOR($R_6, R_0$);
83. LOAD($R_0, R_1, 10$);
84. XOR($R_7, R_0$);
85. AND($R_0, R_4$);
86. XOR($R_5, R_0$);
87. XOR($R_6, R_0$);
88. SCRUB($R_0$);
89. CLEAR($opB$);
90. LOAD($R_0, R_3, 2$);
D TVLA report

Figure 7: Bivariate TVLA of the optimized 2\textsuperscript{nd} order PRESENT S-Box executed on the KL82Z microcontroller. No significant leakage is detected for every pair of sample points on the $x$ and $y$ axis as the values are far below the threshold of 4.5.

Figure 8: Physical leakage is detected when the optimized 1\textsuperscript{st} order PRESENT S-Box is executed on an Arm Cortex M4F microcontroller despite being secure in our model for CM0+ and leakage free on two CM0+ microcontrollers. The $x$ axis represents sample points. A distinct fine-grained leakage model is needed for this processor as there are clear signs of leakage at low number of traces. The three-stage pipeline and three-address arithmetic logic unit of the CM4F are likely causing distinct leakage behavior, amenable to future, fine-grained leakage models.