# Optimizing S-box Implementations Using SAT Solvers: Revisited 

Fuxin Zhang ${ }^{1,2}$ and Zhenyu Huang ${ }^{1,2}$<br>${ }^{1}$ State Key Laboratory of Information Security, Institute of Information Engineering,<br>Chinese Academy of Sciences, Beijing, China<br>${ }^{2}$ School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China<br>\{zhangfuxin, huangzhenyu\}@iie.ac.cn


#### Abstract

We propose a new method to encode the problems of optimizing S-box implementations into SAT problems. By considering the inputs and outputs of gates as Boolean functions, the fundamental idea of our method is representing the relationships between these inputs and outputs according to their algebraic normal forms. Based on this method, we present several encoding schemes for optimizing S-box implementations according to various criteria, such as multiplicative complexity, bitslice gate complexity, gate complexity, and circuit depth complexity. The experimental results of these optimization problems show that, compared to the encoding method proposed in FSE 2016, which represents these relationships between Boolean functions by their truth tables, our new encoding method can significantly accelerate the subsequent solving process by 2-100 times for the majority of instances. To further improve the solving efficiency, we propose several strategies to eliminate the redundancy of the derived equation system and break the symmetry of the solution space. We apply our method in the optimizations of the S-boxes used in Ascon, ICEPOLE, PRIMATEs, Keccak/Ketje/Keyak, Joltik/Piccolo, LAC, Minalpher, Prøst, and RECTANGLE. We achieve some new improved implementations and narrow the range of the optimal values for different optimization criteria of these S-boxes.


Keywords: S-box • SAT solver • Circuit optimization • Algebraic normal form.

## 1 Introduction

As one of the main primitives in symmetric-key cryptography, S-boxes play a pivotal role in providing nonlinearity and confusion. Efficient implementations of S-boxes are of paramount importance in enhancing the overall performance of symmetric-key algorithms. Depending on the requirements of the specific platform, the optimization of S-box implementations involves one or multiple criteria, such as throughput, latency, power consumption, memory usage, and resistance against side-channel attacks.

Courtois et al. were the first to introduce SAT solvers for optimizing S-box implementations [CMH13]. The main idea of their method is encoding the decision problems related to S-box implementations into SAT problems, which are subsequently solved by off-the-shelf SAT solvers. At FSE 2016 [Sto16], Stoffelen generalized this method and proposed specific schemes to encode optimization problems based on various criteria, such as multiplicative complexity, bitslice gate complexity, gate complexity, and circuit depth complexity, into SAT problems. Moreover, some optimized implementations for lightweight S-boxes used in Ascon [DEMS16], ICEPOLE [1MGH ${ }^{+}$15], PRIMATEs [ABB ${ }^{+}$], Keccak $\left[\mathrm{BPVA}^{+} 11\right] /$ Ketje $\left[\mathrm{BDP}^{+}\right] / K e y a k\left[\mathrm{BDP}^{+} 16\right]$, Joltik [JNP15]/Piccolo $\left[\mathrm{SIH}^{+} 11\right]$, LAC $\left[\mathrm{ZWW}^{+} 14\right]$, Minalpher $\left[\mathrm{STA}^{+} 14\right]$, Prøst $\left[\mathrm{KLL}^{+} 14\right]$ and RECTANGLE [ZBL ${ }^{+} 14$ ], were presented in [Sto16]. Stoffelen's method was further extended to search for S-box implementations with the smallest area $\left[\mathrm{LWH}^{+} 21, \mathrm{FWL}^{+} 21\right]$, as well as the lowest AND-depth
and AND-count jointly [ $\left.\mathrm{BMD}^{+} 20\right]$.
In addition to the SAT-based method, some heuristic search-based methods have been developed for optimizing S-box implementations [Osv00, BMP13]. In [JPST17], based on a graph-based meet-in-the-middle search algorithm, Jean et al. proposed an automated tool named LIGHTER, which can generate an efficient implementation of a lightweight S-box given a certain set of available instructions and their corresponding costs. Utilizing the implementation model of LIGHTER, a more efficient and versatile platform named PEIGEN was presented in [BGLS19]. In [Ras22], Rasoolzadeh investigated the latency complexity of S-boxes, and proposed algorithms for determining the latency complexity and finding the circuits with the minimum latency complexity of a given S-box.

Compared to the heuristic search-based methods, an advantage of the SAT-based methods is the theoretical guarantee of obtaining optimal results. Specifically, for the problem of implementing an S-box based on an optimization criterion with a value of $k$, if the SAT solver returns UNSAT(unsatisfiable) for $k=k_{1}$ and returns SAT(satisfiable) for $k=k_{2}$ with $k_{2}=k_{1}+1$, then $k_{2}$ is the optimal value for this criterion. However, the SAT solver cannot always return the results within a reasonable time (especially for the UNSAT instances), hence there is often a gap between $k_{1}$ and $k_{2}$ for many instances. Our motivation is to accelerate the solving process of the SAT-based method, thus bridging the gap between $k_{1}$ and $k_{2}$. For this purpose, we focus on modifying the way of encoding the problems of optimizing S-box implementations into SAT problems.

Our contribution. We propose a new method to encode the decision problems associated with the implementations of S-boxes based on different optimization criteria, such as multiplicative complexity, bitslice gate complexity, gate complexity, and circuit depth complexity, into SAT problems. Compared to the encoding method introduced in FSE 2016 [Sto16], our new encoding method can significantly reduce the running time of the subsequent solving process. For the majority of instances originated from the implementations of various 4 -bit and 5-bit S-boxes used in Ascon, ICEPOLE, Joltik/Piccolo, Keccak/Ketje/Keyak, LAC, Minalpher, Prøst, and RECTANGLE, our method can accelerate the solving processes by approximately 2-100 times.

To further improve our encoding method, we introduce several strategies aimed at eliminating the redundancy within the derived equation system and breaking the symmetry of the solution space. Experimental results show that these improvements lead to additional acceleration in solving all UNSAT instances and the majority of SAT instances.

We achieve some new optimized implementations for the S-boxes used in Ascon, ICEPOLE, Joltik/Piccolo, Keccak/Ketje/Keyak, LAC, Minalpher, Prøst, and RECTANGLE. Moreover, we narrow the range of the possible optimal values for different optimization criteria of these S-boxes. Especially, we determine the multiplicative complexity of the S-box used in PRIMATEs, the gate complexities of the S-box and S-box ${ }^{-1}$ used in RECTANGLE, as well as the optimal widths of the low-depth implementations of the S-boxes used in Keccak/Ketje/Keyak, LAC, and RECTANGLE.

## 2 Preliminaries

### 2.1 Boolean Functions and Boolean Equations

Let $\mathbb{F}_{2}$ be the finite field with only two elements $\{0,1\}$, and $\mathbb{F}_{2}^{n}$ denotes the $n$-dimensional vector space over $\mathbb{F}_{2}$. A $n$-variable Boolean function $f\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ is a function mapping $\mathbb{F}_{2}^{n}$ to $\mathbb{F}_{2}$. Let $B_{n}$ be the set of $n$-variable Boolean functions. A Boolean function $f \in B_{n}$ can be represented by a truth table, which is a list of all $2^{n}$ vectors together with their function values.

A Boolean function can also be represented as an element in the Boolean polynomial ring $\mathbf{R}_{2}=\mathbb{F}_{2}\left[x_{1}, x_{2}, \ldots, x_{n}\right] /\left\langle x_{1}^{2}+x_{1}, x_{2}^{2}+x_{2}, \ldots, x_{n}^{2}+x_{n}\right\rangle$, and has a unique polynomial
form:

$$
f\left(x_{1}, x_{2}, \ldots, x_{n}\right)=\sum_{u=\left\{u_{1}, u_{2}, \ldots, u_{n}\right\} \in \mathbb{F}_{2}^{n}} a_{u} \prod_{i=1}^{n} x_{i}^{u_{i}}
$$

This form is called the Algebraic Normal Form (ANF) of a Boolean function. A Boolean function in form is called a Boolean polynomial. In this paper, when we say a general Boolean polynomial we mean a Boolean polynomial whose $2^{n}$ coefficients are all variables, and when we say a specific Boolean polynomial we mean a Boolean polynomial whose $2^{n}$ coefficients take specific Boolean values.

A vectorial Boolean function is a map from $\mathbb{F}_{2}^{n}$ to $\mathbb{F}_{2}^{m}$ :

$$
\left(x_{1}, x_{2} \ldots, x_{n}\right) \rightarrow\left(f_{1}\left(x_{1}, x_{2}, \ldots, x_{n}\right), f_{2}\left(x_{1}, x_{2}, \ldots, x_{n}\right), \ldots, f_{m}\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right),
$$

where each $f_{i}\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ is a Boolean function. An S-box with $n$-bit input and $m$-bit output can be represented by a vectorial Boolean function from $\mathbb{F}_{2}^{n}$ to $\mathbb{F}_{2}^{m}$.

A (Boolean) circuit is a directed acyclic graph where the inputs and the gates are the nodes, and the edges correspond to the Boolean-valued wires. The most common gates used in a circuit are XOR, AND, NOT, and OR. Each of them corresponds to a specific vectorial Boolean function over $\mathbb{F}_{2}$.

If a circuit implements a given S-box, then by setting the input nodes as variables $\left\{x_{1}, x_{2}, \ldots, x_{n}\right\}$, it outputs the vectorial Boolean function corresponding to the S-box. Note that, in this circuit implementation, the input and output of each gate can be described as Boolean functions with variables $\left\{x_{1}, x_{2}, \ldots, x_{n}\right\}$, and the coefficients of their ANFs will have some relations. For example, if the gate is from the set $\{\mathrm{XOR}, \mathrm{AND}, \mathrm{NOT}\}$, then we have the following relations.

- XOR: Suppose $f=\sum_{u_{i} \in \mathbb{F}_{2}} a_{u_{1}, \ldots, u_{n}} \prod_{i=1}^{n} x_{i}^{u_{i}}, g=\sum_{u_{i} \in \mathbb{F}_{2}} b_{u_{1}, \ldots, u_{n}} \prod_{i=1}^{n} x_{i}^{u_{i}}$ are the inputs of an XOR gate, and $h=\sum_{u_{i} \in \mathbb{F}_{2}} c_{u_{1}, \ldots, u_{n}} \prod_{i=1}^{n} x_{i}^{u_{i}}$ is the output, then $h=f+g$ and we have:

$$
\begin{equation*}
c_{u_{1}, \ldots, u_{n}}=a_{u_{1}, \ldots, u_{n}}+b_{u_{1}, \ldots, u_{n}}, \forall\left(u_{1}, \ldots, u_{n}\right) \in \mathbb{F}_{2}^{n} \tag{1}
\end{equation*}
$$

- AND: Suppose $f=\sum_{u_{i} \in \mathbb{F}_{2}} a_{u_{1}, \ldots, u_{n}} \prod_{i=1}^{n} x_{i}^{u_{i}}, g=\sum_{v_{i} \in \mathbb{F}_{2}} b_{v_{1}, \ldots, v_{n}} \prod_{i=1}^{n} x_{i}^{v_{i}}$ are the inputs of an AND gate, and $h=\sum_{w_{i} \in \mathbb{F}_{2}} c_{w_{1}, \ldots, w_{n}} \prod_{i=1}^{n} x_{i}^{w_{i}}$ is the output, then $h=f \cdot g$ and we have:

$$
\begin{equation*}
c_{w_{1}, \ldots, w_{n}}=\sum_{\max \left(u_{i}, v_{i}\right)=w_{i}, 1 \leq i \leq n} a_{u_{1}, \ldots, u_{n}} \cdot b_{v_{1}, \ldots, v_{n}}, \forall\left(w_{1}, \ldots, w_{n}\right) \in \mathbb{F}_{2}^{n} . \tag{2}
\end{equation*}
$$

- NOT: If the input of a NOT gate is $f=\sum_{u_{i} \in \mathbb{F}_{2}} a_{u_{1}, \ldots, u_{n}} \prod_{i=1}^{n} x_{i}^{u_{i}}$, and the output of the gate is $g=\sum_{u_{i} \in \mathbb{F}_{2}} b_{u_{1}, \ldots, u_{n}} \prod_{i=1}^{n} x_{i}^{u_{i}}$, then $g=f+1$ and we have:

$$
\begin{equation*}
b_{0, \ldots, 0}=a_{0, \ldots, 0}+1, \text { and } b_{u_{1}, \ldots, u_{n}}=a_{u_{1}, \ldots, u_{n}}, \text { if }\left(u_{1}, \ldots, u_{n}\right) \neq(0, \ldots, 0) . \tag{3}
\end{equation*}
$$

More generally, suppose $f_{1}, f_{2}, \ldots, f_{k}$ are some general Boolean polynomials, $g_{1}, g_{2}, \ldots, g_{s}$ some specific Boolean polynomials, and $H$ is either a general or a specific Boolean polynomial. It is easy to see that given an equation

$$
H\left(f_{1}, f_{2}, \ldots, f_{k}, g_{1}, g_{2}, \ldots, g_{s}\right)=0
$$

similarly to generating Equation (1), (2) and (3), by comparing the coefficients of different monomials, we can derive some equations with respect to the coefficients of these general Boolean polynomials. In the following paragraphs of this paper, we use

$$
H\left(f_{1}, f_{2}, \ldots, f_{k}, g_{1}, g_{2}, \ldots, g_{s}\right) \stackrel{c}{=} 0
$$

to denote these coefficient equations generated from the original polynomial equation.
Equations for the at-most-one constraint: Suppose $a_{1}, a_{2}, \ldots, a_{k}$ are some Boolean variables, then the constraint that there is at most one nonzero element in $\left\{a_{1}, a_{2}, \ldots, a_{k}\right\}$ can be depicted by the following Boolean polynomial equations:

$$
a_{i} a_{j}=0, \text { for all } 1 \leq i<j \leq k
$$

In the followings, we use the notation $\operatorname{AtMost}_{1}\left(a_{1}, a_{2}, \ldots, a_{k}\right)$ to denote the above Boolean polynomial equations for the at-most-one constraint.

Equations for the exactly-one constraint: Suppose $a_{1}, a_{2}, \ldots, a_{k}$ are some Boolean variables, then the constraint that there is exactly one nonzero element in $\left\{a_{1}, a_{2}, \ldots, a_{k}\right\}$ can be depicted by the following Boolean polynomial equations:

$$
\left\{\begin{array}{l}
a_{1}+a_{2}+\cdots+a_{k}=1 \\
a_{i} a_{j}=0, \text { for all } 1 \leq i<j \leq k
\end{array}\right.
$$

In the followings, we use the notation Exactly $\mathrm{y}_{1}\left(a_{1}, a_{2}, \ldots, a_{k}\right)$ to denote the above Boolean polynomial equations for the exactly-one constraint.

### 2.2 Optimization Criteria

Since an S-box is equivalent to a vectorial Boolean function, the S-box implementation is sometimes mentioned as the Boolean function implementation. The S-box implementation in software and hardware can be optimized according to various criteria. In this paper, we consider the S-box implementations using gates with fan-in at most two, and focus on the optimization criteria addressed in [Sto16]. Here we reference their definitions from [Sto16]:

Multiplicative Complexity (MC). The multiplicative complexity of an S-box is defined as the minimum number of AND gates required to implement the S-box using the gates from the set $\{\mathrm{XOR}, \mathrm{AND}, \mathrm{NOT}\}$. We shall also call the multiplicative complexity of a given circuit the actual number of AND gates involved in the circuit. Reducing the MC is useful in protecting against side-channel attacks, since the cost of the masking for side-channel analysis countermeasure depends on the MC. Moreover, it is important in multi-party computation and fully homomorphic encryption, where linear operations almost come for free $\left[\mathrm{ARS}^{+} 15\right]$.

Bitslice Gate Complexity (BGC). The bitslice gate complexity of an S-box is defined as the minimum number of gates in $\{\mathrm{XOR}, \mathrm{OR}$, AND, NOT $\}$ required to implement the S-box. Reducing the BGC leads to efficient bitsliced software implementations. Note that, NAND, NOR, and XNOR are not allowed under the BGC measure, since on the most common CPU architectures, no instruction can immediately compute these gates.

Gate Complexity (GC). The gate complexity of an S-box is defined as the minimum number of logic gates required to implement the S-box. In addition to bitslice gates $\{$ XOR, OR, AND, NOT\}, \{NAND, NOR, XNOR\} are now permitted. From a low GC implementation, one can obtain an efficient hardware implementation, although the different area costs of different gates still need to be considered.

Circuit Depth Complexity. For a circuit implementing an S-box, its depth is defined as the length of the longest path from the input gate to the output gate. In hardware implementations, circuit depth is highly relevant to the latency of the circuit. Here, we only consider the circuits consisting of gates from the set \{XOR, OR, AND, NOT, NAND, NOR, XNOR $\}$.

The procedures of using SAT solvers to solve these different optimization problems are similar. Here, we take the problem of finding an implementation with the optimal MC as an example.

- For any fixed $k$, the problem of determining whether there is a circuit that implements an S-box $\mathcal{S}$ with at most $k$ AND gates, which is called the multiplicative complexity decision problem, can be converted to a SAT problem. Then this SAT problem can be solved by an off-the-shelf SAT solver. If the solver returns SAT (satisfiable) for some $k$ and returns UNSAT (unsatisfiable) for $k-1$, then it is proven that $k$ is the minimum number of and gates required to implement $\mathcal{S}$. Moreover, in this case, from a SAT result, we can construct a circuit that implements $S$ with $k$ AND gates.

Similarly, circuit optimizations based on different criteria correspond to different decision problems. For encoding such a decision problem into a SAT problem, we can first encode it into a system of Boolean polynomial equations, then convert this system into a CNF (Conjunctive Normal Form) expression, which serves as the input of SAT solvers. The conversion process can be accomplished by an ANF-to-CNF converter, such as Bosphorus [CSCM19]. Therefore, we focus on the problem of encoding the decision problems into Boolean equation systems.

## 3 Two Encoding Methods

We use the following toy example to illustrate our encoding method and its difference with the one proposed in [Sto16].

Example 1. Implementing an S-box : $\left(x_{1}, x_{2}, x_{3}\right) \rightarrow\left(x_{1} x_{2}+x_{3}, x_{2} x_{3}+x_{1}, x_{1} x_{3}+x_{2}\right)$ with gates in \{AND, XOR, NOT\}, and using at most two AND gates.

First, we present a scheme for encoding the above problem into a Boolean equation system based on our encoding method. We use general Boolean polynomials $X_{1}, X_{2}$, and $X_{3}$ to denote the circuit inputs, and use general Boolean polynomials $Y_{1}, Y_{2}$, and $Y_{3}$ to denote the circuit outputs. In this case, if a circuit implements this S-box, the following coefficient equations must hold.

$$
X_{1} \stackrel{c}{=} x_{1}, X_{2} \stackrel{c}{=} x_{2}, X_{3} \stackrel{c}{=} x_{3}, Y_{1} \stackrel{c}{=} x_{1} x_{2}+x_{3}, Y_{2} \stackrel{c}{=} x_{2} x_{3}+x_{1}, Y_{3} \stackrel{c}{=} x_{1} x_{3}+x_{2}
$$

Suppose the two inputs and one output of the first AND gate are denoted by general Boolean polynomials $Q_{1}, Q_{2}$, and $T_{1}$ respectively. Then $Q_{1}$ and $Q_{2}$ should be constructed from the circuit inputs and previous gates outputs by applying some XOR and NOT gates. Since no AND gate was applied before, we can express $Q_{1}$ and $Q_{2}$ by some affine combinations of the circuit inputs. It means there exist some $a_{i} \in \mathbb{F}_{2}, i \in\{1,2, \ldots, 8\}$ such that

$$
\begin{align*}
& Q_{1}=a_{1}+a_{2} X_{1}+a_{3} X_{2}+a_{4} X_{3}, \\
& Q_{2}=a_{5}+a_{6} X_{1}+a_{7} X_{2}+a_{8} X_{3},  \tag{4}\\
& T_{1}=Q_{1} \cdot Q_{2} .
\end{align*}
$$

Now we denote the two inputs and one output of the second AND gate by general Boolean polynomial $Q_{3}, Q_{4}$, and $T_{2}$ respectively. Similarly to $Q_{1}$ and $Q_{2}, Q_{3}$ and $Q_{4}$ can be expressed as an affine combination of $X_{1}, X_{2}, X_{3}$, and $T_{1}$. Hence, there are some $a_{i} \in \mathbb{F}_{2}, i \in\{9,10, \ldots, 18\}$ such that

$$
\begin{align*}
& Q_{3}=a_{9}+a_{10} X_{1}+a_{11} X_{2}+a_{12} X_{3}+a_{13} T_{1}, \\
& Q_{4}=a_{14}+a_{15} X_{1}+a_{16} X_{2}+a_{17} X_{3}+a_{18} T_{1},  \tag{5}\\
& T_{2}=Q_{3} \cdot Q_{4} .
\end{align*}
$$

Similarly, the circuit outputs can be expressed as some affine combinations of $X_{1}, X_{2}, X_{3}$, $T_{1}$, and $T_{2}$, thus there are some $a_{i} \in \mathbb{F}_{2}, i \in\{19,20, \ldots, 36\}$ such that

$$
\begin{align*}
& Y_{1}=a_{19}+a_{20} X_{1}+a_{21} X_{2}+a_{22} X_{3}+a_{23} T_{1}+a_{24} T_{2}, \\
& Y_{2}=a_{25}+a_{26} X_{1}+a_{27} X_{2}+a_{28} X_{3}+a_{29} T_{1}+a_{30} T_{2},  \tag{6}\\
& Y_{3}=a_{31}+a_{32} X_{1}+a_{33} X_{2}+a_{34} X_{3}+a_{35} T_{1}+a_{36} T_{2} .
\end{align*}
$$

Then by setting these $a_{i}$ 's and the coefficients of these general Boolean polynomials as unknowns, we can generate the following Boolean equation system:

$$
\mathcal{G}=\left\{\begin{array}{l}
X_{1} \stackrel{c}{=} x_{1}, X_{2} \stackrel{c}{=} x_{2}, X_{3} \stackrel{c}{=} x_{3}  \tag{7}\\
Q_{1} \stackrel{c}{=} a_{1}+a_{2} X_{1}+a_{3} X_{2}+a_{4} X_{3} \\
Q_{2} \stackrel{c}{=} a_{5}+a_{6} X_{1}+a_{7} X_{2}+a_{8} X_{3} \\
Q_{3} \stackrel{c}{=} a_{9}+a_{10} X_{1}+a_{11} X_{2}+a_{12} X_{3}+a_{13} T_{1} \\
Q_{4} \stackrel{c}{=} a_{14}+a_{15} X_{1}+a_{16} X_{2}+a_{17} X_{3}+a_{18} T_{1} \\
T_{1} \stackrel{c}{=} Q_{1} \cdot Q_{2}, T_{2} \stackrel{c}{=} Q_{3} \cdot Q_{4} \\
Y_{1} \stackrel{c}{=} a_{19}+a_{20} X_{1}+a_{21} X_{2}+a_{22} X_{3}+a_{23} T_{1}+a_{24} T_{2} \\
Y_{2} \stackrel{c}{=} a_{25}+a_{26} X_{1}+a_{27} X_{2}+a_{28} X_{3}+a_{29} T_{1}+a_{30} T_{2} \\
Y_{3} \stackrel{c}{=} a_{31}+a_{32} X_{1}+a_{33} X_{2}+a_{34} X_{3}+a_{35} T_{1}+a_{36} T_{2} \\
Y_{1} \stackrel{c}{=} x_{1} x_{2}+x_{3}, Y_{2} \stackrel{c}{=} x_{2} x_{3}+x_{1}, Y_{3} \stackrel{c}{=} x_{1} x_{3}+x_{2}
\end{array}\right.
$$

Obviously, a solution of this system corresponds to a circuit that implements the given S-box with at most two AND gates. ${ }^{1}$ In this solution, the values of the coefficient variables determine the ANFs of different wires in the circuit, and from the values of $a_{1}, a_{2}, \ldots, a_{36}$, one can easily construct the sub-circuits that can generate these AND gate inputs by XOR and NOT gates. The number of XOR and NOT gates used to generate these AND gate inputs can be optimized by solving the shortest linear straight-line program problem as mentioned in [Sto16].

Now, based on the method proposed in [Sto16], we present another scheme to encode the same problem into Boolean equation systems. First, suppose we have a circuit that implements the given S-box with 2 AND gates. If the inputs of the circuit take some specific Boolean values $\left\{x_{1}^{0}, x_{2}^{0}, x_{3}^{0}\right\}$, then the outputs of the circuit are some fixed Boolean values $\left\{y_{1}^{0}, y_{2}^{0}, y_{3}^{0}\right\}$, where $\left\{x_{1}^{0}, x_{2}^{0}, x_{3}^{0}\right\}$ and $\left\{y_{1}^{0}, y_{2}^{0}, y_{3}^{0}\right\}$ correspond to a row in the truth table of the S-box. In this case, the input and output wires of all gates will also take some specific Boolean values. It is easy to see that if we substitute the variables $X_{i}, Q_{i}, T_{i}$, and $Y_{i}$ with these specific values, Equations (4), (5), and (6) hold for some assignments of these $a_{i}$ 's. Note that, if the inputs of the circuit take any other values, then the wires of this circuit will also take some other fixed values. If we substitute the variables $X_{i}, Q_{i}, T_{i}$, and $Y_{i}$ with these new values, Equations (4), (5), and (6) still hold for the same assignments of $a_{i}$ 's.

Actually, we can construct a Boolean equation system $\mathcal{F}$ which consists of $2^{3}$ subsystems $\mathcal{F}_{i}\left(0 \leq i \leq 2^{3}-1\right)$ with the same structure, and $\mathcal{F}_{i}$ is generated by setting the input of the S-box as $\left(i_{0}, i_{1}, i_{2}\right)$, where $i_{0} i_{1} i_{2}$ is the binary expression of $i$. Moreover, we have

[^0]\[

\mathcal{F}_{i}=\left\{$$
\begin{array}{l}
x_{i, 1}=i_{0}, x_{i, 2}=i_{1}, x_{i, 3}=i_{2}  \tag{8}\\
q_{i, 1}=a_{1}+a_{2} x_{i, 1}+a_{3} x_{i, 2}+a_{4} x_{i, 3} \\
q_{i, 2}=a_{5}+a_{6} x_{i, 1}+a_{7} x_{i, 2}+a_{8} x_{i, 3} \\
q_{i, 3}=a_{9}+a_{10} x_{i, 1}+a_{11} x_{i, 2}+a_{12} x_{i, 3}+a_{13} t_{i, 1} \\
q_{i, 4}=a_{14}+a_{15} x_{i, 1}+a_{16} x_{i, 2}+a_{17} x_{i, 3}+a_{18} t_{i, 1} \\
t_{i, 1}=q_{i, 1} \cdot q_{i, 2}, t_{i, 2}=q_{i, 3} \cdot q_{i, 4} \\
y_{i, 1}=a_{19}+a_{20} x_{i, 1}+a_{21} x_{i, 2}+a_{22} x_{i, 3}+a_{23} t_{i, 1}+a_{24} t_{i, 2} \\
y_{i, 2}=a_{25}+a_{26} x_{i, 1}+a_{27} x_{i, 2}+a_{28} x_{i, 3}+a_{29} t_{i, 1}+a_{30} t_{i, 2} \\
y_{i, 3}=a_{31}+a_{32} x_{i, 1}+a_{33} x_{i, 2}+a_{34} x_{i, 3}+a_{35} t_{i, 1}+a_{36} t_{i, 2} \\
y_{i, 1}=i_{0} i_{1}+i_{2}, y_{i, 2}=i_{1} i_{2}+i_{0}, y_{i, 3}=i_{0} i_{2}+i_{1}
\end{array}
$$\right.
\]

Apparently, a solution of $\mathcal{F}$ corresponds to a circuit that implements the given S-box with at most two AND gates. In this solution, the values of $x_{i, j}, y_{i, j}, q_{i, j}, t_{i, j}$ are equal to the values taken by the corresponding wires when the circuit input is $\left(i_{2}, i_{1}, i_{0}\right)$.

It is easy to see that, for these two encoding schemes, their basic encoding frameworks, which define the relations among gate inputs, outputs, and $a_{i}$ 's, are the same. The difference is how we generate Boolean equations from these relations. In our method, the equations are generated based on the algebraic expressions of the vectorial Boolean functions corresponding to the gate inputs and outputs. In comparison, in the method proposed in [Sto16], the equations are generated based on the truth tables of these vectorial Boolean functions. Therefore, we call our encoding method the algebraic expression method, and call the method proposed in [Sto16] the truth table method.

For two encoding schemes, which are obtained by the algebraic expression method and the truth table method respectively, if their encoding frameworks are the same, then the two equation systems generated by them have the same number of variables and the same number of equations. For example, for this toy problem, in the first scheme, each " $\stackrel{c}{=}$ " equation corresponds to $2^{3}$ Boolean equations, since there are $2^{3}$ coefficients for a Boolean polynomial with 3 variables. Moreover, each general Boolean polynomial corresponds to $2^{3}$ Boolean variables. Therefore, the system $\mathcal{G}$ has $12 \cdot 2^{3}+36$ variables and $15 \cdot 2^{3}$ equations. For the second scheme, there are $2^{3}$ subsystems and each $\mathcal{F}_{i}$ contains 36 common variables, 12 special variables, and 15 equations. Therefore, $\mathcal{F}$ also has $12 \cdot 2^{3}+36$ variables and $15 \cdot 2^{3}$ Boolean equations.

Although the numbers of variables and equations for the systems generated by these two schemes are the same, the corresponding reduced equations, which are obtained by applying simple substitutions to eliminate intermediate variables, have different sizes. We observed that, in most cases, the reduced equation system generated by the algebraic expression method is simpler than the one generated by the truth table method. This can be illustrated by the following toy example.

Example 2. Implementing the 2-to-1 Boolean function: $f:\left(x_{1}, x_{2}\right) \rightarrow x_{1} x_{2}+x_{1}+x_{2}$ with \{AND, XOR , NOT $\}$ gates, and using 1 AND gate.

Based on the algebraic expression method and the truth table method, we can generate two Boolean equation systems, and both of them have 34 variables and 28 equations. For the algebraic expression method, we have to solve the system:

$$
\left\{\begin{array}{l}
c_{1}=0, c_{2}=1, c_{3}=0, c_{4}=0, c_{5}=0, c_{6}=0, c_{7}=1, c_{8}=0 \\
c_{9}=a_{1}+a_{2} c_{1}+a_{3} c_{5}, c_{10}=a_{2} c_{2}+a_{3} c_{6}, c_{11}=a_{2} c_{3}+a_{3} c_{7}, c_{12}=a_{2} c_{4}+a_{3} c_{8} \\
c_{13}=a_{4}+a_{5} c_{1}+a_{6} c_{5}, c_{14}=a_{5} c_{2}+a_{6} c_{6}, c_{15}=a_{5} c_{3}+a_{6} c_{7}, c_{16}=a_{5} c_{4}+a_{6} c_{8} \\
c_{17}=c_{9} c_{13}, c_{18}=c_{10} c_{14}+c_{10} c_{13}+c_{9} c_{14}, c_{19}=c_{11} c_{15}+c_{11} c_{13}+c_{9} c_{15} \\
c_{20}=c_{9} c_{16}+c_{10} c_{15}+c_{10} c_{16}+c_{11} c_{14}+c_{11} c_{16}+c_{12} c_{13}+c_{12} c_{14}+c_{12} c_{15}+c_{12} c_{16} \\
c_{21}=a_{7}+a_{8} c_{1}+a_{9} c_{5}+a_{10} c_{17}, c_{22}=a_{8} c_{2}+a_{9} c_{6}+a_{10} c_{18} \\
c_{23}=a_{8} c_{3}+a_{9} c_{7}+a_{10} c_{19}, c_{24}=a_{8} c_{4}+a_{9} c_{8}+a_{10} c_{20} \\
c_{21}=0, c_{22}=1, c_{23}=1, c_{24}=1 .
\end{array}\right.
$$

For the truth table method, we have to solve the system:

$$
\left\{\begin{array}{l}
x_{1,1}=0, x_{1,2}=0, \\
q_{1,1}=a_{1}+a_{2} x_{1,1}+a_{3} x_{1,2}, q_{1,2}=a_{4}+a_{5} x_{1,1}+a_{6} x_{1,2} \\
t_{1,1}=q_{1,1} q_{1,2} \\
y_{1,1}=a_{7}+a_{8} x_{1,1}+a_{9} x_{1,2}+a_{10} t_{1,1}, \\
y_{1,1}=0 \\
x_{2,1}=0, x_{2,2}=1, \\
q_{2,1}=a_{1}+a_{2} x_{2,1}+a_{3} x_{2,2}, q_{2,2}=a_{4}+a_{5} x_{2,1}+a_{6} x_{2,2} \\
t_{2,1}=q_{2,1} q_{2,2} \\
y_{2,1}=a_{7}+a_{8} x_{2,1}+a_{9} x_{2,2}+a_{10} t_{2,1}, \\
y_{2,1}=1 \\
x_{3,1}=1, x_{3,2}=0, \\
q_{3,1}=a_{1}+a_{2} x_{3,1}+a_{3} x_{3,2}, q_{3,2}=a_{4}+a_{5} x_{3,1}+a_{6} x_{3,2} \\
t_{3,1}=q_{3,1} q_{3,2} \\
y_{3,1}=a_{7}+a_{8} x_{3,1}+a_{9} x_{3,2}+a_{10} t_{3,1}, \\
y_{3,1}=1 \\
x_{4,1}=1, x_{4,2}=1, \\
q_{4,1}=a_{1}+a_{2} x_{4,1}+a_{3} x_{4,2}, q_{4,2}=a_{4}+a_{5} x_{4,1}+a_{6} x_{4,2} \\
t_{4,1}=q_{4,1} q_{4,2} \\
y_{4,1}=a_{7}+a_{8} x_{4,1}+a_{9} x_{4,2}+a_{10} t_{4,1}, \\
y_{4,1}=1 .
\end{array}\right.
$$

After simple substitution, the reduced system for the algebraic expression method is:

$$
\left\{\begin{array}{l}
a_{7}+a_{10} a_{1} a_{4}=0, a_{10} a_{2} a_{6}+a_{10} a_{3} a_{5}=1, \\
a_{8}+a_{10} a_{2} a_{5}+a_{10} a_{2} a_{4}+a_{10} a_{1} a_{5}=1, \\
a_{9}+a_{10} a_{3} a_{6}+a_{10} a_{3} a_{4}+a_{10} a_{1} a_{6}=1
\end{array}\right.
$$

In comparison, the reduced system for the truth table method is:

$$
\left\{\begin{array}{l}
a_{7}+a_{10} a_{1} a_{4}=0, \\
a_{7}+a_{9}+a_{10} a_{1} a_{4}+a_{10} a_{1} a_{6}+a_{10} a_{3} a_{4}+a_{10} a_{3} a_{6}=1, \\
a_{7}+a_{8}+a_{10} a_{1} a_{4}+a_{10} a_{1} a_{5}+a_{10} a_{2} a_{4}+a_{10} a_{2} a_{5}=1, \\
a_{7}+a_{8}+a_{9}+a_{10} a_{1} a_{4}+a_{10} a_{1} a_{5}+a_{10} a_{1} a_{6}+a_{10} a_{2} a_{4} \\
\quad+a_{10} a_{2} a_{5}+a_{10} a_{2} a_{6}+a_{10} a_{3} a_{4}+a_{10} a_{3} a_{5}+a_{10} a_{3} a_{6}=1 .
\end{array}\right.
$$

Apparently, the reduced systems generated by the algebraic expression method are simpler. Since the complexity of the reduced systems often determines the difficulty of solving the original system, we conjecture that the system generated by the algebraic expression method can be probably solved faster, and our experimental results shown in Section 4.5 confirm this conjecture.

## 4 Algebraic Encoding Schemes for Circuit Optimizations

For the sake of simplicity, in the following paragraphs, we call an encoding scheme obtained by the algebraic expression method, an algebraic expression scheme, and call an encoding scheme obtained by the truth table method, a truth table scheme. According to Section 3, from a truth table scheme, one can easily construct an algebraic expression scheme by using the same encoding framework. Therefore, we can convert the truth table schemes proposed in [Sto16] to algebraic expression schemes. However, we found that there are some minor issues with the basic encoding frameworks proposed in [Sto16], which will lead to incorrect results in certain cases. For this reason, we have slightly modified these flawed encoding frameworks to ensure that the algebraic expression schemes proposed in this section can work well in any case. In the following parts, we will present these
algebraic expression schemes, and show their advantages by comparing the time to solve the equations generated by these schemes and those from [Sto16] for specific problems.

### 4.1 An Encoding Scheme for MC Optimizations

In this subsection, we present an algebraic expression scheme for MC optimizations. This algebraic expression scheme uses the same encoding framework proposed in [Sto16] with slightly modification, since we observed that there is a flaw in this encoding framework.

Given an S-box $\mathcal{S}:\left(x_{0}, \ldots, x_{n-1}\right) \rightarrow\left(S_{0}\left(x_{0}, \ldots, x_{n-1}\right), \ldots, S_{m-1}\left(x_{0}, \ldots, x_{n-1}\right)\right)$, our decision problem is determining whether $\mathcal{S}$ can be implemented with at most $k$ AND gates. We follow the notations used in Section 3:

- $X_{i}$ is the general Boolean polynomial corresponding to the circuit input;
- $Y_{i}$ is the general Boolean polynomial corresponding to the circuit output;
- $Q_{i}$ is the general Boolean polynomial corresponding to the gate input;
- $T_{i}$ is the general Boolean polynomial corresponding to the gate output;
- $a_{i}$ is the variable that describes the expressions of $Q_{i}$ and $Y_{i}$ w.r.t. $X_{i}$ and $T_{i}$.

Then the Boolean equation system can be generated as follows.

1) $\forall i \in\{0, \ldots, n-1\}, \forall j \in\{0, \ldots, m-1\}, X_{i} \stackrel{c}{=} x_{i}, Y_{j} \stackrel{c}{=} S_{j}\left(x_{0}, \ldots, x_{n-1}\right)$. These equations encode the circuit inputs and outputs.
2) $\forall i \in\{0, \ldots, 2 k-1\}$,

$$
Q_{i} \stackrel{c}{=} a_{l_{i}}+\left(\sum_{j=0}^{n-1} a_{l_{i}+j+1} \cdot X_{j}\right)+\left(\sum_{j=0}^{\left\lfloor\frac{i}{2}\right\rfloor-1} a_{l_{i}+n+j+1} \cdot T_{j}\right),
$$

where $l_{i}=i(n+1)+\left\lfloor\frac{i^{2}-2 i+1}{4}\right\rfloor$. These coefficient equations encode that $Q_{i}$, the inputs of the AND gates, can be expressed as an affine combination of the circuit inputs and previous AND gate outputs. Here $a_{l}$ corresponds to a possible NOT gate.
3) $\forall i \in\{0, \ldots, k-1\}, T_{i} \stackrel{c}{=} Q_{2 i} \cdot Q_{2 i+1}$. These equations encode the $k$ AND gates.
4) $\forall i \in\{0, \ldots, m-1\}$,

$$
\begin{equation*}
Y_{i} \stackrel{c}{=} a_{s}+\left(\sum_{j=0}^{n-1} a_{s+j+1} \cdot X_{j}\right)+\left(\sum_{j=0}^{k-1} a_{s+n+j+1} \cdot T_{j}\right) \tag{9}
\end{equation*}
$$

where $s_{i}=2 k(n+1)+k(k-1)+i(n+k+1)$. These equations encode that the S-box outputs can be expressed as an affine combination of the S-box inputs and all AND gate outputs.

Remark 1. If we directly use the encoding framework proposed in [Sto16], Equation (9) should be rewritten as

$$
\begin{equation*}
Y_{i} \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{s+j+1} \cdot X_{j}\right)+\left(\sum_{j=0}^{k-1} a_{s+n+j+1} \cdot T_{j}\right) . \tag{10}
\end{equation*}
$$

Compared to Equation (9), this equation does not contain $a_{s}$, which encodes a possible NOT gate. However, we observed that Equation (10) may lead to an incorrect result when the ANF of the S-box contains the constant term 1. For example, the S-box $\left(x_{1}, x_{2}, x_{3}\right) \rightarrow\left(x_{1} x_{2}+x_{2} x_{3}+1, x_{2} x_{3}+x_{1} x_{3}, x_{1} x_{2}+x_{1} x_{3}\right)$ has multiplicative complexity 2 , since it can be implemented by the following steps:

- $q_{1}=x_{1}+x_{3} ; t_{1}=q_{1} \cdot x_{2}=x_{1} x_{2}+x_{2} x_{3} ; q_{2}=x_{2}+x_{3} ; t_{2}=x_{1} \cdot q_{2}=x_{1} x_{2}+x_{1} x_{3} ;$ $y_{1}=t_{1}+1 ; y_{2}=t_{1}+t_{2} ; y_{3}=t_{2}$.

However, if we utilize the algebraic expression scheme based on Equation (10) or the truth table scheme proposed in [Sto16] with setting $k=2$, the SAT solver will output UNSAT. Therefore, to avoid potentially incorrect results, in our proposed scheme we modify the equations for $Y_{i}$ as Equation (9).

### 4.2 An Encoding Scheme for BGC optimizaitons

Given an S-box $\mathcal{S}:\left(x_{0}, \ldots, x_{n-1}\right) \rightarrow\left(S_{0}\left(x_{0}, \ldots, x_{n-1}\right), \ldots, S_{m-1}\left(x_{0}, \ldots, x_{n-1}\right)\right)$, the BGC decision problem is determining whether $\mathcal{S}$ can be implemented with at most $k$ bitslice gates. Let $X_{i}, Y_{i}, Q_{i}$, and $a_{i}$ be defined as in the MC decision problem, then we can encode this decision problem into a Boolean equation system as follows.

1) $\forall i \in\{0, \ldots, n-1\}, \forall j \in\{0, \ldots, m-1\}, X_{i} \stackrel{c}{=} x_{i}, Y_{j} \stackrel{c}{=} S_{j}\left(x_{0}, \ldots, x_{n-1}\right)$. These equations encode the circuit inputs and outputs.
2) $\forall i \in\{0, \ldots, k-1\}$,

$$
T_{i} \stackrel{c}{=} b_{3 i} \cdot Q_{2 i} \cdot Q_{2 i+1}+b_{3 i+1} \cdot Q_{2 i}+b_{3 i+1} \cdot Q_{2 i+1}+b_{3 i+2} \cdot Q_{2 i}+b_{3 i+2}
$$

These equations encode the $k$ gates from the set \{AND, OR, XOR, NOT\}, where Boolean variables $b_{3 i}, b_{3 i+1}, b_{3 i+2}$ determine the type of the gate. Table 1 shows the correspondence between the values of $\left(b_{3 i}, b_{3 i+1}, b_{3 i+2}\right)$ and the types of gates.
3) $\forall i \in\{0, \ldots, k-1\}, b_{3 i} \cdot b_{3 i+2}=0, b_{3 i+1} \cdot b_{3 i+2}=0$. These equations prevent $\left(b_{3 i}, b_{3 i+1}, b_{3 i+2}\right)$ from taking the values $(0,1,1),(1,0,1),(1,1,1)$, ensuring that only AND, OR, XOR, and NOT are allowed in this circuit.
4) $\forall i \in\{0, \ldots, 2 k-1\}$,

$$
Q_{i} \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{l_{i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{\left\lfloor\frac{i}{2}\right\rfloor-1} a_{l_{i}+n+j} \cdot T_{j}\right), \operatorname{AtMost}_{1}\left(a_{l_{i}}, \ldots, a_{l_{i}+n+\left\lfloor\frac{i}{2}\right\rfloor-1}\right)
$$

where $l_{i}=n i+\left\lfloor\frac{i^{2}-2 i+1}{4}\right\rfloor$. These equations encode that $Q_{i}$, the gate input, can be the constant 0 , an S-box input, or a previous gate output.
5) $\forall i \in\{0, \ldots, m-1\}$ :

$$
Y_{i} \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{s_{i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{k-1} a_{s_{i}+n+j} \cdot T_{j}\right), \operatorname{AtMost}_{1}\left(a_{s_{i}}, \ldots, a_{s_{i}+n+k-1}\right)
$$

where $s_{i}=2 k n+k(k-1)+i(n+k)$. These equations encode that $Y_{i}$, the S-box output, can be the constant 0 , an S-box inputs, or a gate output.

Table 1: Gate types for different $\left(b_{3 i}, b_{3 i+1}, b_{3 i+2}\right)$

| $\left(b_{3 i}, b_{3 i+1}, b_{3 i+2}\right)$ | Expression for $T_{i}$ | Gate represented by $T_{i}$ |
| :--- | :--- | :--- |
| $(0,0,0)$ | 0 | 0 |
| $(0,0,1)$ | $Q_{2 i}+1$ | $\operatorname{NOT}\left(Q_{2 i}\right)$ |
| $(0,1,0)$ | $Q_{2 i}+Q_{2 i+1}$ | $\operatorname{XOR}\left(Q_{2 i}, Q_{2 i+1}\right)$ |
| $(0,1,1)$ | $Q_{2 i+1}+1$ | $\operatorname{Prevented}$ |
| $(1,0,0)$ | $Q_{2 i} \cdot Q_{2 i+1}$ | $\operatorname{AND}\left(Q_{2 i}, Q_{2 i+1}\right)$ |
| $(1,0,1)$ | $Q_{2 i} \cdot Q_{2 i+1}+Q_{2 i}+1$ | $\operatorname{Prevented}$ |
| $(1,1,0)$ | $Q_{2 i} \cdot Q_{2 i+1}+Q_{2 i}+Q_{2 i+1}$ | $\operatorname{OR}\left(Q_{2 i}, Q_{2 i+1}\right)$ |
| $(1,1,1)$ | $Q_{2 i} \cdot Q_{2 i+1}+Q_{2 i+1}+1$ | Prevented |

### 4.3 An Encoding Scheme for GC Optimizations

The encoding scheme for GC optimizations is similar to that for BGC optimizations. There are only two differences:

1) The gate outputting $T_{i}$ is encoded by a different expression:

$$
\forall i \in\{0, \ldots, k-1\}, T_{i} \stackrel{c}{=} b_{3 i} \cdot Q_{2 i} \cdot Q_{2 i+1}+b_{3 i+1} \cdot Q_{2 i}+b_{3 i+1} \cdot Q_{2 i+1}+b_{3 i+2}
$$

These equations encode $k$ gates in $\{\mathrm{XOR}, \mathrm{XNOR}$, AND, OR, NAND, NOR $\}$, where Boolean variables $b_{3 i}, b_{3 i+1}, b_{3 i+2}$ determine the type of the gate. The correspondence between the values of $\left(b_{3 i}, b_{3 i+1}, b_{3 i+2}\right)$ and the types of gates is presented in Table 2. Note that, the NOT gate is not in this table, but it can be implemented by other gates with some specific inputs. For example, if $Q_{2 i}=Q_{2 i+1}, \operatorname{NAND}\left(Q_{2 i}, Q_{2 i+1}\right)=\operatorname{NOT}\left(Q_{2 i}\right)$, and if $Q_{2 i+1}=0, \operatorname{XNOR}\left(Q_{2 i}, Q_{2 i+1}\right)=\operatorname{NOT}\left(Q_{2 i}\right)$. An explanation for the necessity of the NOT gate for GC optimizations can be found in Remark 2.

Table 2: Gate types for different $\left(b_{3 i}, b_{3 i+1}, b_{3 i+2}\right)$

| $\left(b_{3 i}, b_{3 i+1}, b_{3 i+2}\right)$ | Expression for $T_{i}$ | Gate represented by $T_{i}$ |
| :--- | :--- | :--- |
| $(0,0,0)$ | 0 | 0 |
| $(0,0,1)$ | 1 | 1 |
| $(0,1,0)$ | $Q_{2 i}+Q_{2 i+1}$ | $\operatorname{XOR}\left(Q_{2 i}, Q_{2 i+1}\right)$ |
| $(0,1,1)$ | $Q_{2 i}+Q_{2 i+1}+1$ | $\operatorname{XNOR}\left(Q_{2 i}, Q_{2 i+1}\right)$ |
| $(1,0,0)$ | $Q_{2 i} \cdot Q_{2 i+1}$ | $\operatorname{AND}\left(Q_{2 i}, Q_{2 i+1}\right)$ |
| $(1,0,1)$ | $Q_{2 i} \cdot Q_{2 i+1}+1$ | $\operatorname{NAND}\left(Q_{2 i}, Q_{2 i+1}\right)$ |
| $(1,1,0)$ | $Q_{2 i} \cdot Q_{2 i+1}+Q_{2 i}+Q_{2 i+1}$ | $\operatorname{OR}\left(Q_{2 i}, Q_{2 i+1}\right)$ |
| $(1,1,1)$ | $Q_{2 i} \cdot Q_{2 i+1}+Q_{2 i}+Q_{2 i+1}+1$ | $\operatorname{NOR}\left(Q_{2 i}, Q_{2 i+1}\right)$ |

2) There is no constraint for $b_{3 i}, b_{3 i+1}, b_{3 i+2}$, since all types of gates are allowed.

Remark 2. For some problems, by using NOT gates, one can achieve implementations with lower GC. For example, consider the problem of implementing the following 3 -bit S-box: $\left(x_{1}, x_{2}, x_{3}\right) \rightarrow\left(x_{1} x_{2}+x_{2}, x_{1} x_{3}+x_{3}, x_{2} x_{3}\right)$. It is easy to check that without NOT gates, the optimal implementation has GC 5 . In comparison, with NOT gate, the optimal implementation has GC 4: $t_{1}=x_{1}+1, t_{2}=t_{1} \cdot x_{2}, t_{3}=t_{1} \cdot x_{3}, t_{4}=x_{2} \cdot x_{3}, y_{1}=t_{2}$, $y_{2}=t_{3}, y_{3}=t_{4}$. Theoretically, we only need to consider the NOT gates in the first depth layer of a circuit for GC optimization. Actually, if the input of a NOT gate is the output of a previous gate $G$, we can remove this NOT gate by modifying $G$. For example if $G$ is an AND gate, then we can change $G$ to be a NAND gate. By this way, without changing the GC of a circuit, one can remove all NOT gates with depth $\geq 2$ in this circuit. However, in our scheme for GC optimizations, determining a gate in the first depth layer is costly, hence we suppose any gate in the circuit could be a NOT gate.

### 4.4 An Encoding Scheme for Depth Optimizations

The decision problem for circuit depth complexity differs slightly from the former ones. We can divide a circuit into different layers based on the depth of each gate. Note that, if only the depth of a circuit is bounded, the number of gates in each layer can be arbitrarily large, and it is not possible to encode a problem with an infinite number of operations into a fixed Boolean equation system. For this reason, the following decision problem was considered in [Sto16]. Given an S-box $\mathcal{S}:\left(x_{0}, \ldots, x_{n-1}\right) \rightarrow\left(S_{0}\left(x_{0}, \ldots, x_{n-1}\right), \ldots, S_{m-1}\left(x_{0}, \ldots, x_{n-1}\right)\right)$, determine whether there is a circuit implementing $\mathcal{S}$ with depth $\leq k$, and using at most $w$ gates in each layer.

In the encoding framework proposed in [Sto16], the following property was deemed to be valid.

- For a fan-in 2 gate in the $i$-th layer, any of its inputs should be the circuit input or the output of a gate in some layer with depth less than $i$.

However, based on this property, one may not always obtain a gate with depth $i$. For example, if the inputs of this gate are the outputs of a gate with depth $i-2$, then this gate has depth $i-1$. Therefore, when encoding under this property, a gate that was originally intended to be in the $i$-th layer may actually be located in some previous layer. In this case, the number of gates in this previous layer may exceed the width bound $w$. More precisely, the solver may output a solution that corresponds to a circuit containing a layer with more than $w$ gates. For this reason, we modify this property as follows.

- For a fan-in 2 gate in the $i$-th layer, its inputs should be the circuit input or the output of a gate in some layer with depth less than $i$. Moreover, at least one bit of its inputs should be the output of a gate in the $(i-1)$-th layer.

Under this property, the Boolean equation system can be generated as follows.

1) $\forall i \in\{0, \ldots, n-1\}, \forall j \in\{0, \ldots, m-1\}, X_{i} \stackrel{c}{=} x_{i}, Y_{j} \stackrel{c}{=} S_{j}\left(x_{0}, \ldots, x_{n-1}\right)$. These equations encode the circuit inputs and outputs.
2) $\forall i \in\{0, \ldots, k w-1\}$,

$$
T_{i} \stackrel{c}{=} b_{3 i} \cdot Q_{2 i} \cdot Q_{2 i+1}+b_{3 i+1} \cdot Q_{2 i}+b_{3 i+1} \cdot Q_{2 i+1}+b_{3 i+2}
$$

These equations encode the $k w$ gates, with the type of gates being encoded by $\left(b_{3 i}, b_{3 i+1}, b_{3 i+2}\right)$ as shown in Table 2. Similarly, the NOT gate can be implemented by other fan-in 2 gates with some specific inputs.
3) $\forall i \in\{0, \ldots, 2 w-1\}, Q_{i} \stackrel{c}{=} \sum_{j=0}^{n-1} a_{n i+j} \cdot X_{j}$, AtMost $_{1}\left(a_{n i}, \ldots, a_{n i+n-1}\right)$. These equations encode that the input of a gate in first layer can be the constant 0 or an S-box input.
4) $\forall i \in\{w, \ldots, k w-1\}$,

$$
\begin{aligned}
Q_{2 i} & \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{l_{i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{v w-1} a_{l_{i}+n+j} \cdot T_{j}\right), \operatorname{AtMost}_{1}\left(a_{l_{i}}, \ldots, a_{l_{i}+n+v w-1}\right) \\
Q_{2 i+1} & \stackrel{c}{=}\left(\sum_{j=0}^{w-1} a_{l_{i}+n+v w+j} \cdot T_{(v-1) w+j}\right), \operatorname{Exactly}_{1}\left(a_{l_{i}+n+v w}, \ldots, a_{l_{i}+n+(v+1) w-1}\right),
\end{aligned}
$$

where $v=\left\lfloor\frac{i}{w}\right\rfloor, l_{i}=i(n+v w+w)+n w-\frac{v^{2}+v+2}{2} w^{2}$. Here $Q_{2 i}$ and $Q_{2 i+1}$ are two inputs of a gate in the $(v+1)$-th layer of the circuit. These equations encode that $Q_{2 i}$ can be the constant 0 , or an input of the S-box, while $Q_{2 i+1}$ can only be the output of a gate in the $v$-th layer. Note that, if $Q_{2 i+1}$ equals zero, this gate may be a NOT gate with depth less than $v+1$. Thus, in order to avoid this case, $\left\{a_{l_{i}+n+v w}, a_{l_{i}+n+v w+1}, \ldots, a_{l_{i}+n+(v+1) w-1}\right\}$ should satisfy the exactly-one constraint.
5) $\forall i \in\{0, \ldots, m-1\}$,

$$
Y_{i} \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{s_{i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{k w-1} a_{s_{i}+n+j} \cdot T_{j}\right), \operatorname{AtMost}_{1}\left(a_{s_{i}}, \ldots, a_{s_{i}+n+k w-1}\right)
$$

where $s_{i}=k w\left(\frac{k w}{2}+n+\frac{w}{2}\right)+w(n-w)+i(n+k w)$. These equations encode that $Y_{i}$, the S-box outputs, can be the constant 0 , an S -box input, or a gate output.

### 4.5 Algebraic Expression Method v.s. Truth Table Method

In this subsection, we present the experimental comparison between the algebraic expression method and the truth table method by contrasting the time of the solving process. Our experiments focus on optimizing the implementations of the S-boxes considered in [Sto16] for different criteria. These S-boxes are, the S-boxes used in Ascon [DEMS16], ICEPOLE $\left[\mathrm{MGH}^{+} 15\right]$, PRIMATEs $\left[\mathrm{ABB}^{+}\right]$, Keccak $\left[\mathrm{BPVA}^{+} 11\right] / \operatorname{Ketje}\left[\mathrm{BDP}^{+}\right] / \operatorname{Keyak}\left[\mathrm{BDP}^{+} 16\right]$, Joltik [JNP15]/Piccolo [SIH $\left.{ }^{+} 11\right]$, LAC [ZWW $\left.{ }^{+} 14\right]$, Minalpher [STA $\left.{ }^{+} 14\right]$, Prøst [KLL+ 14$]$, RECTANGLE [ZBL $\left.{ }^{+} 14\right]$, and their inverses, if they exist and are used in decryption.

Table 3: The time of the solving process for MC and BGC optimizations

| Result | S-box | Size | MC |  |  | BGC |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  | $k$ | Truth | Alge. | $k$ | Truth | Alge. |
| SAT | Ascon | $5 \times 5$ | 5 | 7.4 s | 1.4 s | 16 | 8477 s | 1813 s |
|  | ICEPOLE | $5 \times 5$ | 6 | 31.2 s | 4.8 s | 26 | 25262 s | 5357 s |
|  | Keccak | $5 \times 5$ | 5 | 3.7 s | 1.2 s | 13 | 19.7 s | 75.5 s |
|  | PRIMATEs | $5 \times 5$ | 7 | 85.4 s | 3.6 s | 27 | \# | 5840 s |
|  | PRIMATEs ${ }^{-1}$ | $5 \times 5$ | 9 | 9234 s | 1750 s | - | - | - |
|  | Joltik | $4 \times 4$ | 4 | 0.91 s | 0.28 s | 10 | 41.8 s | 27.9 s |
|  | Joltik ${ }^{-1}$ | $4 \times 4$ | 4 | 0.72 s | 0.27 s | 10 | 157 s | 77.9 s |
|  | LAC | $4 \times 4$ | 4 | 1.0 s | 0.25 s | 11 | 91.3 s | 66.2 s |
|  | Minalpher | $4 \times 4$ | 5 | 2.0 s | 0.85 s | 18 | \# | 5022 s |
|  | Prøst | $4 \times 4$ | 4 | 1.4 s | 0.25 s | 8 | 3.9 s | 4.4 s |
|  | RECTANGLE | $4 \times 4$ | 4 | 0.53 s | 0.24 s | 12 | 606 s | 98.7 s |
|  | RECTANGLE ${ }^{-1}$ | $4 \times 4$ | 4 | 1.8 s | 0.22 s | 12 | 34.7 s | 81.1 s |
| UNSAT | Ascon | $5 \times 5$ | 4 | 104 s | 0.42 s | 10 | 647 s | 215 s |
|  | ICEPOLE | $5 \times 5$ | 5 | \# | 51.4 s | 10 | 229 s | 132 s |
|  | Keccak | $5 \times 5$ | 4 | 215 s | 0.45 s | 9 | 231 s | 72.8 s |
|  | PRIMATEs | $5 \times 5$ | 5 | 120936 s | 142 s | 10 | 385 s | 182 s |
|  | PRIMATEs ${ }^{-1}$ | $5 \times 5$ | 6 | \# | 658 s | 10 | 436 s | 101 s |
|  | Joltik | $4 \times 4$ | 3 | 2.3 s | 0.08 s | 9 | 1076 s | 193 s |
|  | Joltik ${ }^{-1}$ | $4 \times 4$ | 3 | 2.3 s | 0.07 s | 9 | 897 s | 205 s |
|  | LAC | $4 \times 4$ | 3 | 2.4 s | 0.07 s | 9 | 504 s | 120 s |
|  | Minalpher | $4 \times 4$ | 4 | 99.8 s | 0.17 s | 9 | 240 s | 62.6 s |
|  | Prøst | $4 \times 4$ | 3 | 2.2 s | 0.07 s | 7 | 4.5 s | 3.9 s |
|  | RECTANGLE | $4 \times 4$ | 3 | 2.6 s | 0.04 s | 9 | 500 s | 128 s |
|  | RECTANGLE ${ }^{-1}$ | $4 \times 4$ | 3 | 2.4 s | 0.06 s | 9 | 514 s | 148 s |

In our experiments, we chose Bosphorus [CSCM19] as the ANF-to-CNF converter, and Kissat [BF22] as the SAT solver. Since Bosphorus contains a process of simplifying the equations and the entire converting process may take more time than the SAT problemsolving process, the timings presented in this paper are the sum of the running times of Bosphorus and Kissat. Our experimental platforms are a PC with a 3.4 GHz Intel i7-6700 CPU and a workshop with a 2.9 GHz AMD 3990X CPU. For each optimization problem, timings for different methods were obtained on the same platform, using only one thread.

In Tables 3 and 4, the timings for solving different instances are presented. The instances for the algebraic expression method were generated based on the schemes presented in Section 4. For a fair comparison, the instances for the truth table method should be generated based on the same basic encoding frameworks, hence we modify the code given in [Sto16] accordingly and generated the instances for the truth table method by the modified code.

In these tables, the columns labeled as "Alge." present the timings for solving the equation systems generated by the algebraic expression method, while the columns labeled

Table 4: The time of the solving process for GC and depth optimizations

| Result | S-box | GC |  |  | Depth |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | $k$ | Truth | Alge. | $(k, w)$ | Truth | Alge. |
| SAT | Ascon | 15 | 7303 s | 398 s | $(3,6)$ | 1024 s | 37.1 s |
|  | ICEPOLE | 21 | \# | 27303 s | $(4,7)$ | \# | 2057 s |
|  | Keccak | 13 | 230 s | 21.8 s | $(2,10)$ | 86.0 s | 5.1 s |
|  | PRIMATEs | 26 | 116710 s | 34554 s | $(4,8)$ | 11190 s | 542 s |
|  | Joltik | 8 | 3.3 s | 3.1 s | $(4,2)$ | 2.0 s | 1.4 s |
|  | Joltik ${ }^{-1}$ | 8 | 18.1 s | 8.8 s | $(4,3)$ | 57.1 s | 24.5 s |
|  | LAC | 10 | 45.4 s | 20.4 s | $(3,6)$ | 140 s | 22.5 s |
|  | Minalpher | 16 | \# | 916 s | $(4,5)$ | 27031 s | 396 s |
|  | Prøst | 8 | 4.6 s | 1.8 s | $(4,3)$ | 22.9 s | 10.9 s |
|  | RECTANGLE | 11 | 626 s | 76.5 s | $(3,6)$ | 241 s | 86.9 s |
|  | RECTANGLE ${ }^{-1}$ | 11 | 283 s | 86.5 s | $(3,6)$ | 252 s | 61.9 s |
| UNSAT | Ascon | 8 | 3496 s | 34.1 s | $(2,8)$ | 358 s | 6.3 s |
|  | ICEPOLE | 8 | 57605 s | 221 s | $(3,10)$ | 1298 s | 505 s |
|  | Keccak | 8 | 1991 s | 93.9 s | $(2,7)$ | 52934 s | 1621 s |
|  | PRIMATEs | 8 | 118807 s | 41.9 s | $(2,15)$ | 865 s | 43.4 s |
|  | PRIMATEs ${ }^{-1}$ | 8 | 1215 s | 13.4 s | $(3,10)$ | 1571 s | 377 s |
|  | Joltik | 7 | 27.0 s | 13.8 s | $(3,10)$ | 436 s | 101 s |
|  | Joltik ${ }^{-1}$ | 7 | 37.8 s | 13.1 s | $(3,10)$ | 566 s | 136 s |
|  | LAC | 7 | 52.0 s | 10.9 s | $(2,10)$ | 127 s | 9.6 s |
|  | Minalpher | 7 | 797 s | 8.6 s | $(3,10)$ | 988 s | 233 s |
|  | Prøst | 7 | 19.1 s | 16.1 s | $(3,10)$ | 749 s | 132 s |
|  | RECTANGLE | 8 | 9096 s | 357 s | $(2,10)$ | 517 s | 8.0 s |
|  | RECTANGLE ${ }^{-1}$ | 8 | 4658 s | 323 s | $(2,10)$ | 145 s | 7.1 s |

as "Truth" present the timings for solving the equation systems generated by the truth table method. Here "\#" means running over 2 days without output. For MC, BGC, and GC, the columns labeled as " $k$ " list the maximum number of gates allowed in the decision problems we introduced before. For depth, in the column labeled as " $(k, w)$ ", the first number is the depth bound, and the second one is the maximum number of gates allowed in each layer.

Based on these experimental results, it is evident that the algebraic expression method accelerates the solving process for all instances except three cases for BGC. Moreover, for the majority of instances, the acceleration rate is approximately 2 to 100 times.

## 5 Eliminating Redundancy and Breaking Symmetry

In this section, we further improve the efficiency of the solving process by modifying the basic encoding frameworks. We consider two types of techniques: eliminating redundancy and breaking symmetry.

- For eliminating redundancy, we aim to eliminate redundant variables in the derived equations and prevent the occurrence of redundant gates. Here, redundant variables are variables that can be set to some fixed Boolean values (usually 0 ) without affecting the satisfiability of the equations. In a circuit, redundant gates refer to gates that can be removed directly from the circuit without affecting the outputs of the circuit.
- For breaking symmetry, we aim to break the symmetry of the solution space, which arises from the commutativity of the binary operations corresponding to the fan-in 2 gates.

In the following, we show how to apply such techniques to different optimization problems.

We should notice that for our implementation problems, solving UNSAT instances is much harder than solving SAT instances, since for a SAT instance, there are a lot of solutions, and the SAT solver only needs to return 1 solution. Moreover, the solving time for SAT instances is not very stable, since slightly modifying the equations may vary the solving time. For these reasons, our primary goal is to reduce the solving time for UNSAT instances.

### 5.1 Improvements for MC Optimizations

Eliminating Redundant Variables. In the scheme proposed in Section 4.1, $Q_{i}$ and $Y_{i}$ are encoded as follows:

$$
\begin{gathered}
Q_{i} \stackrel{c}{=} a_{l_{i}}+\left(\sum_{j=0}^{n-1} a_{l_{i}+j+1} \cdot X_{j}\right)+\left(\sum_{j=0}^{\left\lfloor\frac{i}{2}\right\rfloor-1} a_{l_{i}+n+j+1} \cdot T_{j}\right), \\
Y_{i} \stackrel{c}{=} a_{s_{i}}+\left(\sum_{j=0}^{n-1} a_{s_{i}+j+1} \cdot X_{j}\right)+\left(\sum_{j=0}^{k-1} a_{s_{i}+n+j+1} \cdot T_{j}\right) .
\end{gathered}
$$

In these equations, $a_{l_{i}}$ and $a_{s_{i}}$ encode possible NOT gates. The following proposition demonstrates that if the ANF of the S-box has no constant term, then we can remove $a_{l}$ and $a_{s}$ in these equations without affecting the satisfiability of the system. The proof of this proposition can be found in Appendix A.

Proposition 1. Let $\mathcal{S}^{\prime}:\left(x_{0}, \ldots, x_{n-1}\right) \rightarrow\left(S_{0}^{\prime}\left(x_{0}, \ldots, x_{n-1}\right), \ldots, S_{m-1}^{\prime}\left(x_{0}, \ldots, x_{n-1}\right)\right)$ be an $S$-box with $M C k$, and the $A N F$ of each $S_{i}^{\prime}\left(x_{0}, \ldots, x_{n-1}\right)$ has no constant term. Then there exists a circuit implementing $\mathcal{S}^{\prime}$ has $M C k$ and only contains AND and XOR gates.

For any S-box $\mathcal{S}:\left(x_{0}, \ldots, x_{n-1}\right) \rightarrow\left(S_{0}\left(x_{0}, \ldots, x_{n-1}\right), \ldots, S_{m-1}\left(x_{0}, \ldots, x_{n-1}\right)\right)$. Suppose $\mathcal{S}^{\prime}$ is derived from $\mathcal{S}$ by removing the constant term of each $S_{i}$. It is easy to see that $\mathcal{S}$ and $\mathcal{S}^{\prime}$ have the same MC, and based on an implementation of $S^{\prime}$ one can easily obtain an implementation of $S$ by adding some NOT gates to generate the constant term 1. Therefore, we can equivalently consider the MC decision problem for $\mathcal{S}^{\prime}$. Then, based on Proposition 1, we can modify the encoding scheme by rewriting the equations for $Q_{i}$ and $Y_{i}$ as

$$
\begin{aligned}
Q_{i} & \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{l_{i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{\left\lfloor\frac{i}{2}\right\rfloor-1} a_{l_{i}+n+j} \cdot T_{j}\right), \\
Y_{i} & \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{s_{i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{k-1} a_{s_{i}+n+j} \cdot T_{j}\right)
\end{aligned}
$$

By this modification, we can eliminate $2 k+m$ redundant variables, when the decision problem is for testing $k$.

Avoiding Redundant Gates. Here we consider the occurrences of redundant AND gates. For our encoding scheme introduced before, two types of redundant AND gates may appear.

1) The first type is the gates having trivial outputs, and we call this type of gates trivial gates. If the input of a circuit are variables, the output of a gate should be a specific Boolean polynomial. A trivial output means this specific Boolean polynomial is equal to the constant 0 , constant 1 , or one of the gate inputs.
2) The second type is the gates having useless outputs, and we call this type of gates useless gates. If $T_{i}$ is the output of an AND gate, we say $T_{i}$ is useless if its value does not affect the final outputs of the circuit. Specifically, it means $T_{i}$ is neither an input of any subsequent gate nor an output of the circuit.

By adding some equations corresponding to the constraints that at least one of some $a_{i}$ 's should be nonzero, we can avoid these two kinds of redundant gates ${ }^{2}$. However, our

[^1]experimental results show that these at-least-one constraints will slightly decrease the efficiency of the subsequent solving process. Therefore, for MC optimizations, we choose to keep these two types of redundant gates. But for optimization problems based on other criteria, we observed that avoiding the trivial gates, which could be achieved by adding some much simpler equations, can accelerate the subsequent solving processes, and we will introduce this in the following subsections.

Breaking Symmetry. We consider the symmetry derived from the commutativity of the multiplication operation. In our modified scheme, the two inputs of the AND gate that outputs $T_{i}$ are encoded as

$$
\begin{gathered}
Q_{2 i}=\left(\sum_{j=0}^{n-1} a_{l_{2 i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{i-1} a_{l_{2 i}+n+j} \cdot T_{j}\right) \\
Q_{2 i+1}=\left(\sum_{j=0}^{n-1} a_{l_{2 i}+n+i+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{i-1} a_{l_{2 i}+2 n+i+j} \cdot T_{j}\right)
\end{gathered}
$$

Let $N=n+i$ and $l=l_{2 i}$. Suppose $\mathcal{A}=\left\{\left(\alpha_{0}, \ldots, \alpha_{N-1}\right),\left(\beta_{0}, \ldots, \beta_{N-1}\right)\right\}$ is an assignment for variables $a_{l}, \ldots, a_{l+N-1}$, and $a_{l+N}, \ldots, a_{l+2 N-1}$, with $\left(\alpha_{0}, \ldots, \alpha_{N-1}\right) \neq\left(\beta_{0}, \ldots, \beta_{N-1}\right)$. Let $\mathcal{A}^{\prime}=\left\{\left(\beta_{0}, \ldots, \beta_{N-1}\right),\left(\alpha_{0}, \ldots, \alpha_{N-1}\right)\right\}$ be another assignment, then we have $\mathcal{A} \neq \mathcal{A}^{\prime}$. Obviously, $\mathcal{A}$ and $\mathcal{A}^{\prime}$ will result in the same $T_{i}$, since $Q_{2 i} \cdot Q_{2 i+1}=Q_{2 i+1} \cdot Q_{2 i}$. In this case, we say $\mathcal{A}$ and $\mathcal{A}^{\prime}$ are a pair of symmetric assignments. It is easy to check that there are $2^{N-1}\left(2^{N}-1\right)$ pairs of symmetric assignments, and symmetry breaking aims to exclude one assignment from each of these $2^{N-1}\left(2^{N}-1\right)$ pairs of symmetric assignments.

We can completely break this symmetry by imposing the constraint $\left(a_{l}, \ldots, a_{l+N-1}\right) \succeq$ $\left(a_{l+N}, \ldots, a_{l+2 N-1}\right)$, where " $\succeq$ " is a total order, such as the lexicography order " $\succeq_{l e x}$ ", for $N$-dimension Boolean vectors. Then, an assignment $\mathcal{A}^{\prime}=\left\{\left(\beta_{0}, \ldots, \beta_{N-1}\right),\left(\alpha_{0}, \ldots, \alpha_{N-1}\right)\right\}$ with $\left(\beta_{0}, \ldots, \beta_{N-1}\right) \preceq\left(\alpha_{0}, \ldots, \alpha_{N-1}\right)$ will be an invalid assignment. Hence, in the subsequent solving process, the searching branches containing this partial assignment are pruned early. Theoretically, this early pruning may reduce the running time of the SAT solver, if its computational cost is lower than that of determining the invalidity of assignments containing $\mathcal{A}^{\prime}$ in the original system.

If we use the lexicography order, then we can encode the constraint $\left(a_{l}, \ldots, a_{l+N-1}\right) \succeq_{l e x}$ $\left(a_{l+N}, \ldots, a_{l+2 N-1}\right)$ to the following equations.

- $\left(a_{l}+1\right) a_{l+N}=0$, which encodes $a_{l} \geq a_{l+N}$.
- For all $0 \leq j \leq N-2, \prod_{i=0}^{j}\left(a_{l+i}+a_{l+N+i}+1\right)\left(a_{l+j+1}+1\right) a_{l+N+j+1}=0$, which encodes: if $a_{l}=a_{l+N}, \ldots, a_{l+j}=a_{l+N+j}$, then $a_{l+j+1} \geq a_{l+N+j+1}$.
By adding all these equations into the equation system, we can exclude one assignment from each of the $2^{N-1}\left(2^{N}-1\right)$ pairs of symmetric assignments. However, since there are some complicated ones in these equations, if all these equations are added, the size of the system will significantly increase, which will undesirably slow down the subsequent solving process. To address this issue, we consider eliminating a part of these symmetric assignments by just adding some simple equations.

In these equations, the first equation $\left(a_{l}+1\right) a_{l+N}=0$ can exclude assignments with the form $\{(0, *, \ldots, *),(1, *, \ldots, *)\}$. There are $2^{2(N-1)}$ assignments having this form, and each of them contains in a pair of symmetric assignments. These pairs account for $1 /\left(2-\frac{1}{2^{N-1}}\right)>1 / 2$ of all pairs of symmetric assignments. Hence, by adding this equation for each pair of $Q_{2 i}$ and $Q_{2 i+1}$, we can eliminate more than $1 / 2$ of the symmetric assignments we aim to eliminate. Based on our experiments, we observed that compared to the strategy of adding more equations to exclude more symmetric assignments, this strategy provides better acceleration.

Experimental Results. By applying the above strategies for eliminating redundant variables and breaking partial symmetry, we achieve an improved encoding scheme. In

Table 5, we compare this improved scheme and the basic scheme based on the time to solve the UNSAT instances in Table 3. Experimental results show that the improved encoding scheme can accelerate the solving process for all these UNSAT instances. For instances that are not considered "too easy" (with solving time exceeding 0.5 seconds), the acceleration rate is approximately 2 .

Table 5: Comparison of the basic scheme and the improved scheme for MC optimizations

| UNSAT instances |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| S-box | $k$ | Basic | Improved | S-box | $k$ | Basic | Improved |
| Ascon | 4 | 0.42 s | 0.40 s | Joltik ${ }^{-1}$ | 3 | 0.07 s | 0.04 s |
| ICEPOLE | 5 | 51.4 s | 22.7 s | LAC | 3 | 0.07 s | 0.04 s |
| Keccak | 4 | 0.45 s | 0.39 s | Minalpher | 4 | 0.17 s | 0.16 s |
| PRIMATEs | 5 | 142 s | 96.1 s | Prøst | 3 | 0.07 s | 0.03 s |
| PRIMATEs ${ }^{-1}$ | 6 | 658 s | 239 s | RECTANGLE | 3 | 0.04 s | 0.03 s |
| Joltik | 3 | 0.08 s | 0.04 s | RECTANGLE ${ }^{-1}$ | 3 | 0.06 s | 0.04 s |

### 5.2 Improvements for BGC Optimizations

Eliminating Redundant Variables. In the encoding scheme proposed in Section 4.2, the following equations encode that the output $Y_{i}$ is equal to either a circuit input or a gate output.

$$
Y_{i} \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{s_{i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{k-1} a_{s_{i}+n+j} \cdot T_{j}\right), \operatorname{AtMost}_{1}\left(a_{s_{i}}, \ldots, a_{s_{i}+n+k-1}\right)
$$

Note that, in a practical problem, the ANF of an output of a S-box is always not equal to the ANF of any input of the S-box. Therefore, we can modify these equations as follows,

$$
Y_{i} \stackrel{c}{=} \sum_{j=0}^{k-1} a_{s_{i}+j} \cdot T_{j}, \operatorname{AtMost}_{1}\left(a_{s_{i}}, \ldots, a_{s_{i}+k-1}\right)
$$

By this modification, we can eliminate $n \cdot m$ variables.
Avoiding Redundant Gates. Similarly to MC optimizations, for BGC optimizations, the achieved circuits may contain two types of redundant gates: trivial gates and useless gates. To avoid useless gates, as in MC optimizations, we should add at-least-one constraints for certain $a_{j}$ 's, but this will decrease the efficiency of the subsequent solving process as observed in our experiments. Therefore, we focus on avoiding the trivial gates.

In our encoding scheme, the inputs of a fan-in 2 gate that outputs $T_{i}$ are $Q_{2 i}$ and $Q_{2 i+1}$. This gate is trivial, when $Q_{2 i}=Q_{2 i+1}, Q_{2 i}=0$, or $Q_{2 i+1}=0$. The following equations encode $Q_{2 i}$ and $Q_{2 i+1}$ in our scheme.

$$
\begin{aligned}
Q_{2 i} & \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{l_{2 i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{i-1} a_{l_{2 i}+n+j} \cdot T_{j}\right), \operatorname{AtMost}_{1}\left(a_{l_{2 i}}, \ldots, a_{l_{2 i}+n+i-1}\right) \\
Q_{2 i+1} & \stackrel{c}{=}\left(\sum_{j=0}^{n-1} a_{l_{2 i+1}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{i-1} a_{l_{2 i+1}+n+j} \cdot T_{j}\right), \operatorname{AtMost}_{1}\left(a_{l_{2 i+1}}, \ldots, a_{l_{2 i+1}+n+i-1}\right)
\end{aligned}
$$

Thus, to ensure $Q_{2 i} \neq 0$ and $Q_{2 i+1} \neq 0$, we only need to replace AtMost ${ }_{1}$ with Exactly ${ }_{1}$ in the above equations. Moreover, when $Q_{2 i} \neq 0$, to ensure $Q_{2 i} \neq Q_{2 i+1}$, we can impose the constrain that for any $0 \leq j \leq n+i-1, a_{l_{2 i}+j}$ and $a_{l_{2 i+1}+j}$ cannot be 1 simultaneously. For this purpose, we can add the following equations into the system.

$$
a_{l_{2 i}+j} a_{l_{2 i+1}+j}=0, \text { for all } 0 \leq j \leq n+i-1
$$

Additionally, if the gate outputting $T_{i}$ is a NOT gate, $Q_{2 i+1}$ is a free input. In this case, to guarantee this gate being nontrivial, we only need $Q_{2 i} \neq 0$.

Breaking Symmetry. Similarly to MC optimizations, to break the symmetry derived from the commutativity of the binary operation with inputs $Q_{2 i}$ and $Q_{2 i+1}$, one can impose the constraint $\left(a_{l_{2 i}}, \ldots, a_{l_{2 i}+n+i-1}\right) \succeq_{l e x}\left(a_{l_{2 i+1}}, \ldots, a_{l_{2 i+1}+n+i-1}\right)$. We denote this constraint by $Q_{2 i} \succeq_{l e x} Q_{2 i+1}$ for simplicity. Note that, we can combine the constraint $Q_{2 i} \succeq_{\text {lex }} Q_{2 i+1}$ with the constraint $Q_{2 i} \neq Q_{2 i+1}$ (this prevents the trivial gates with two same inputs) to the constraint $Q_{2 i} \succ_{l e x} Q_{2 i+1}$. When $Q_{2 i} \neq 0$, this constraint can be encoded as the following equations.

$$
a_{l_{2 i}+j} a_{l_{2 i+1}}=0, a_{l_{2 i}+j} a_{l_{2 i+1}+1}=0, \ldots, a_{l_{2 i}+j} a_{l_{2 i+1}+j}=0,0 \leq \forall j \leq n+i-1
$$

Note that, if $\left(a_{l_{2 i}}, \ldots, a_{l_{2 i}+n+i-2}, a_{l_{2 i}+n+i-1}\right)=(0, \ldots, 0,1)$, the only valid assignment for $\left(a_{l_{2 i+1}}, \ldots, a_{l_{2 i+1}+n+i-1}\right)$ in the above equations is $(0, \ldots, 0)$. This contradicts the constraint $Q_{2 i+1} \neq 0$. From our experiments, we observed that compared to modifying the constraint $Q_{2 i} \succ Q_{2 i+1}$, removing the constraint $Q_{2 i+1} \neq 0$ is a better choice. Therefore, in our final improved encoding scheme, we impose the constrains $Q_{2 i} \succ Q_{2 i+1}$ and $Q_{2 i} \neq 0$, to avoid the majority of trivial gates and completely break the symmetry derived from commutativity.

Experimental Results. In Table 6, we present the comparison of the performances of our basic encoding scheme and the improved encoding scheme. We can observe that for all UNSAT instances, the solving process can be accelerated by approximately 2 to 6 times Moreover, for most SAT instances, the solving process also has a moderate acceleration.

Table 6: Comparison of the basic scheme and the improved scheme for BGC optimizations

| S-box | SAT instances |  |  | UNSAT instances |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | $k$ | Basic | Improved | $k$ | Basic | Improved |
| Ascon | 16 | 1813 s | 299 s | 10 | 215 s | 118 s |
| ICEPOLE | 26 | 5357 s | 2046 s | 10 | 132 s | 19.2 s |
| Keccak | 13 | 75.5 s | 17.8 s | 9 | 72.8 s | 34.1 s |
| PRIMATEs | 27 | 5840 s | 1900 s | 10 | 182 s | 65.4 s |
| PRIMATEs ${ }^{-1}$ | - | - | - | 10 | 101 s | 50.6 s |
| Joltik | 10 | 27.9 s | 17.4 s | 9 | 193 s | 117 s |
| Joltik ${ }^{-1}$ | 10 | 77.9 s | 84.6 s | 9 | 205 s | 86.8 s |
| LAC | 11 | 66.2 s | 65.3 s | 9 | 120 s | 52.4 s |
| Minalpher | 18 | 5022 s | 3397 s | 9 | 62.6 s | 15.7 s |
| Prøst | 8 | 4.4 s | 3.6 s | 7 | 3.9 s | 2.4 s |
| RECTANGLE | 12 | 98.7 s | 62.9 s | 9 | 128 s | 54.4 s |
| RECTANGLE ${ }^{-1}$ | 12 | 81.1 s | 104 s | 9 | 148 s | 52.0 s |

### 5.3 Improvements for GC Optimizations

Since the only difference of the encoding schemes for BGC optimizations and GC optimizations is the way of encoding different gates, we use the strategies introduced in last subsection. Specifically, we apply the following three strategies:

1) Assuming $Y_{i} \neq X_{j}$ for any $0 \leq i \leq m-1,0 \leq j \leq n-1$; 2) Adding the constraint $\left.Q_{2 i} \neq 0 ; 3\right)$ Adding the constraint $Q_{2 i} \succ_{\text {lex }} Q_{2 i+1}$.
Note that, after this modification, $Q_{2 i+1}$ can still be equal to 0 , thus a NOT gate can be applied as $\operatorname{XNOR}\left(Q_{2 i}, Q_{2 i+1}\right)$ with $Q_{2 i+1}=0$.

Table 7 presents the performances of our basic and improved schemes. It is evident that the improved scheme accelerates the solving process for all UNSAT and SAT instances.

It is surprising that the acceleration rate reaches approximately 40 to 200 for some UNSAT instances.

Table 7: Comparison of the basic scheme and the improved scheme for GC optimizations

| S-box | SAT instances |  |  | UNSAT instances |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | $k$ | Basic | Improved | $k$ | Basic | Improved |
| Ascon | 15 | 398 s | 91.5 s | 8 | 34.1 s | 11.6 s |
| ICEPOLE | 21 | 27303 s | 18631 s | 8 | 221 s | 1.4 s |
| Keccak | 13 | 21.8 s | 19.9 s | 8 | 93.9 s | 6.5 s |
| PRIMATEs | 26 | 34554 s | 4570 s | 8 | 41.9 s | 1.4 s |
| PRIMATEs ${ }^{-1}$ | - | - | - | 8 | 13.4 s | 1.5 s |
| Joltik | 8 | 3.1 s | 2.4 s | 7 | 13.8 s | 3.0 s |
| Joltik ${ }^{-1}$ | 8 | 8.8 s | 4.5 s | 7 | 13.1 s | 2.1 s |
| LAC | 10 | 20.4 s | 12.6 s | 7 | 10.9 s | 1.7 s |
| Minalpher | 16 | 916 s | 800 s | 7 | 8.6 s | 0.42 s |
| Prøst | 8 | 1.8 s | 1.7 s | 7 | 16.1 s | 3.1 s |
| RECTANGLE | 11 | 76.5 s | 9.6 s | 8 | 357 s | 5.5 s |
| RECTANGLE ${ }^{-1}$ | 11 | 86.5 s | 10.6 s | 8 | 323 s | 7.4 s |

### 5.4 Improvements for Depth Optimizations

Similarly, we apply the following three improving strategies:

1) Assuming $Y_{i} \neq X_{j}$ for any $\left.0 \leq i \leq m-1,0 \leq j \leq n-1 ; 2\right)$ For the first layer, adding the constraint $Q_{2 i+1} \neq 0 ; 3$ ) Adding the constraint $Q_{2 i} \prec_{l e x} Q_{2 i+1}{ }^{3}$.
Note that, after applying these strategies, a NOT can only be implemented by an XNOR gate with inputs 0 and $Q_{2 i+1}$, therefore we cannot impose the constraint $Q_{2 i} \neq 0$. Moreover, to be consistent with the property that $Q_{2 i}$ can be equal to 0 and $Q_{2 i+1} \neq 0$, we use the constraint $Q_{2 i} \prec_{\text {lex }} Q_{2 i+1}$ instead of $Q_{2 i} \succ_{\text {lex }} Q_{2 i+1}$.

The experimental results in Table 8 show that for all UNSAT instances and the majority of SAT instances, the solving process can be accelerated moderately.

Table 8: Comparison of the basic scheme and the improved scheme for depth optimizations

| S-box | SAT instances |  |  | UNSAT instances |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | $(k, w)$ | Basic | Improved | ( $k, w$ ) | Basic | Improved |
| Ascon | $(3,6)$ | 37.1 s | 82.1 s | $(2,8)$ | 6.3 s | 3.9 s |
| ICEPOLE | $(4,7)$ | 2057 s | 1360 s | $(3,10)$ | 505 s | 203 s |
| Keccak | $(2,10)$ | 5.1 s | 3.6 s | $(2,7)$ | 1621 s | 42.1 s |
| PRIMATEs | $(4,8)$ | 542 s | 274 s | $(2,15)$ | 43.4 s | 26.2 s |
| PRIMATEs ${ }^{-1}$ | - | - | - | $(3,10)$ | 377 s | 341 s |
| Joltik | $(4,2)$ | 1.4 s | 1.4 s | $(3,10)$ | 101 s | 55.8 s |
| Joltik ${ }^{-1}$ | $(4,3)$ | 24.5 s | 17.8 s | $(3,10)$ | 136 s | 92.0 s |
| LAC | $(3,6)$ | 22.5 s | 7.9 s | $(2,10)$ | 9.6 s | 7.6 s |
| Minalpher | $(4,5)$ | 396 s | 388 s | $(3,10)$ | 233 s | 128 s |
| Prøst | $(4,3)$ | 10.9 s | 3.9 s | $(3,10)$ | 132 s | 76.9 s |
| RECTANGLE | $(3,6)$ | 86.9 s | 15.9 s | $(2,10)$ | 8.0 s | 4.3 s |
| RECTANGLE ${ }^{-1}$ | $(3,6)$ | 61.9 s | 10.1 s | $(2,10)$ | 7.1 s | 4.0 s |

[^2]
## 6 New Results for Optimized S-box Implementations

Based on our improved encoding schemes, we achieved some new results for optimizing the implementations of the S-boxes introduced in Section 4.5. We present these results in this section, and the corresponding detailed implementations are given in Appendix C ${ }^{4}$

Here we denote the range of the optimal value for a criterion (MC, BGC, GC) as [ $k_{1}, k_{2}$ ]. For example, let $k$ be the value we test in the MC decision problem. When we say the MC of an S-box is in $\left[k_{1}, k_{2}\right]$, it means the SAT-solver returns UNSAT when $k=k_{1}-1$, and returns SAT when $k=k_{2}$.

Multiplicative Complexity. The MCs of the S-boxes for Ascon, ICEPOLE, Keccak/Ketje/Keyak, Joltik/Piccolo, Joltik ${ }^{-1}$ Piccolo $^{-1}$, LAC, Minalpher, Prøst, RECTANGLE, RECTANGLE ${ }^{-1}$ have already been determined in [Sto16]. Considering the issue mentioned in Remark 1, which may cause potentially incorrect results, we verified the correctness of these results by our method. Moreover, as shown in Table 9, we determined the MC of the PRIMATEs S-box and narrowed the value range of the MC of the PRIMATEs ${ }^{-1}$ S-box.

Table 9: Multiplicative complexities of PRIMATEs and PRIMATEs ${ }^{-1}$

| S-box | $[$ Sto16 $]$ | This paper |
| :--- | :---: | :---: |
| PRIMATEs | $\in[6,7]$ | 7 |
| PRIMATEs $^{-1}$ | $\in[6,10]$ | $\in[7,8]$ |

Bitslice Gate Complexity. For the S-boxes of Joltik/Piccolo, Joltik ${ }^{-1} /$ Piccolo $^{-1}$, LAC, and Prøst, their BGCs have been determined in [Sto16]. We verified these results, and obtained some new BGC results for other S-boxes. These new results are summarized in Table 10.

Table 10: Bitslice gate complexities of S-boxes

| S-box | [Sto16] | This paper | S-box | [Sto16] | This paper |
| :--- | :---: | :---: | :--- | :---: | :---: |
| Ascon | - | $\in[12,16]$ | Keccak | $\leq 13$ | $\in[12,13]$ |
| ICEPOLE | - | $\in[12,26]$ | Minalpher | $\in[11, \infty]$ | $\in[12,17]$ |
| PRIMATEs | - | $\in[12,27]$ | RECTANGLE | $\in[10,12]$ | $\in[11,12]$ |
| PRIMATEs $^{-1}$ | - | $\in[12,45]$ |  |  |  |

Gate Complexity. The GCs of the S-boxes for Joltik/Piccolo, Joltik ${ }^{-1} /$ Piccolo $^{-1}$, LAC, and Prøst have been determined in [Sto16]. Here, we determined the GCs of RECTANGLE and RECTANGLE ${ }^{-1}$, and obtained some new results for other S-boxes. Table 11 presents these new results.

Table 11: Gate complexities of S-boxes

| S-box | $[$ Sto16 $]$ | This paper | S-box | [Sto16] | This paper |
| :--- | :---: | :---: | :--- | :---: | :---: |
| Ascon | - | $\in[12,15]$ | Keccak | - | $\in[11,13]$ |
| ICEPOLE | - | $\in[12,21]$ | Minalpher | - | $\in[11,14]$ |
| PRIMATEs | - | $\in[12,23]$ | RECTANGLE | $\in[10,11]$ | 11 |
| PRIMATEs $^{-1}$ | - | $\in[12,43]$ | RECTANGLE $^{-1}$ | $\in[10,11]$ | 11 |

[^3]Depth complexity. In comparison to the results presented in [Sto16], we achieved new low-depth implementations and UNSAT boundaries for the 4 -bit Minalpher S-box and all 5 -bit S-boxes. We also refined the UNSAT boundaries for some 4-bit S-boxes (from LAC, RECTANGLE, RECTANGLE ${ }^{-1}$ ). Especially, we can determine the optimal widths for implementing some S-boxes (from Ascon, Keccak, LAC, RECTANGLE, RECTANGLE ${ }^{-1}$ ) with low depth.

Table 12: Circuit depth complexities of S-boxes. $k$ is the depth of the circuit and $w$ is the width of the circuit.

| S-box | $(k, w)$ for SAT results |  | $(k, w)$ for UNSAT bounds |  | Optimal width |
| :--- | :---: | :---: | :---: | :---: | :---: |
|  | $[$ Sto16] | This paper | $[$ Sto16] | This paper |  |
| Ascon | - | $(3,6)$ | - | $(3,5)$ | $\checkmark$ |
| ICEPOLE | - | $(4,7)$ | - | $(4,4)$ | $\times$ |
| PRIMATEs | - | $(4,7)$ | - | $(4,4)$ | $\times$ |
| Keccak | - | $(2,10)$ | - | $(2,9)$ | $\checkmark$ |
| Minalpher | - | $(4,5)$ | - | $(4,3)$ | $\times$ |
| LAC | $(3,6)$ | $(3,6)$ | $(3,4)$ | $(\mathbf{3 , 5 )}$ | $\checkmark$ |
| RECTANGLE | $(3,6)$ | $(3,6)$ | $(3,4)$ | $\mathbf{( 3 , 5 )}$ | $\checkmark$ |
| RECTANGLE $^{-1}$ | $(3,6)$ | $(3,6)$ | $(3,4)$ | $\mathbf{( 3 , 5 )}$ | $\checkmark$ |

## 7 Conclusion

We revisit the problem of optimizing S-box implementations with SAT solvers. For accelerating the SAT problem-solving process, we propose the algebraic expression method for encoding the circuit optimization problems into SAT problems. Experimental results show that, in most cases, compared to the truth table method introduced in FSE 2016, the algebraic expression method can accelerate the subsequent solving process by approximately 2 to 100 times. To further improve the solving efficiency, we propose several strategies to remove redundant variables, avoid redundant gates, and break the symmetry of the solution space. Based on these further improved encoding schemes, we obtain some new optimized implementations for the S-boxes used in Ascon, ICEPOLE, PRIMATEs, Keccak, Minalpher, and RECTANGLE. We also narrow the value ranges of their MC, BGC, GC, and their optimal circuit width for low-depth implementations. It is easy to see that our method can be easily combined with the SAT-based algorithms for optimizing circuit area and AND-depth proposed in previous works.

## References

$\left[\mathrm{ABB}^{+}\right] \quad$ Elena Andreeva, Begül Bilgin, Andrey Bogdanov, Atul Luykx, Florian Mendel, Bart Mennink, Nicky Mouha, Qingju Wang, and Kan Yasuda. Primates v1. 02. submission to caesar. 2016.
[ARS ${ }^{+}$15] Martin R Albrecht, Christian Rechberger, Thomas Schneider, Tyge Tiessen, and Michael Zohner. Ciphers for mpc and fhe. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 430-454. Springer, 2015.
$\left[\mathrm{BDP}^{+}\right]$Guido Bertoni, Joan Daemen, Michaël Peeters, Gilles Van Assche, and Ronny Van Keer. Caesar submission: Ketje v1, march 2014. URL http://competitions.cr.yp.to/round1/ketjev11.pdf.
[ $\left.\mathrm{BDP}^{+} 16\right]$ Guido Bertoni, Joan Daemen, Michaël Peeters, GV Assche, and RV Keer. Keyak v2. Submission to the CAESAR Competition, 2016.
[BF22] Armin Biere and Mathias Fleury. Gimsatul, isasat and kissat entering the sat competition 2022. Proc. of SAT Competition, pages 10-11, 2022.
[BGLS19] Zhenzhen Bao, Jian Guo, San Ling, and Yu Sasaki. PEIGEN - a platform for evaluation, implementation, and generation of s-boxes. IACR Trans. Symmetric Cryptol., 2019(1):330-394, 2019.
$\left[\mathrm{BMD}^{+} 20\right]$ Begül Bilgin, Lauren De Meyer, Sébastien Duval, Itamar Levi, and FrançoisXavier Standaert. Low AND depth and efficient inverses: a guide on s-boxes for low-latency masking. IACR Trans. Symmetric Cryptol., 2020(1):144-184, 2020.
[BMP13] Joan Boyar, Philip Matthews, and René Peralta. Logic minimization techniques with applications to cryptology. J. Cryptol., 26(2):280-312, 2013.
$\left[\mathrm{BPVA}^{+} 11\right]$ Guido Bertoni, Michaël Peeters, Gilles Van Assche, et al. The keccak reference. 2011.
[CMH13] Nicolas Courtois, Theodosis Mourouzis, and Daniel Hulme. Exact logic minimization and multiplicative complexity of concrete algebraic and cryptographic circuits. Int. J. Adv. Intell. Syst, 6(3):165-176, 2013.
[CSCM19] Davin Choo, Mate Soos, Kian Ming A Chai, and Kuldeep S Meel. Bosphorus: Bridging anf and cnf solvers. In 2019 Design, Automation \& Test in Europe Conference \& Exhibition (DATE), pages 468-473. IEEE, 2019.
[DEMS16] Christoph Dobraunig, Maria Eichlseder, Florian Mendel, and Martin Schläffer. Ascon v1. 2. Submission to the CAESAR Competition, 5(6):7, 2016.
[FWL ${ }^{+}$21] Yanhong Fan, Weijia Wang, Zhihu Li, Zhenyu Lu, Siu-Ming Yiu, and Meiqin Wang. Forced independent optimized implementation of 4-bit s-box. In Joonsang Baek and Sushmita Ruj, editors, Information Security and Privacy - 26th Australasian Conference, ACISP 2021, Virtual Event, December 1-3, 2021, Proceedings, volume 13083 of Lecture Notes in Computer Science, pages 151-170. Springer, 2021.
[JNP15] Jérémy Jean, Ivica Nikolić, and Thomas Peyrin. Joltik v1. 3. CAESAR Round, 2, 2015.
[JPST17] Jérémy Jean, Thomas Peyrin, Siang Meng Sim, and Jade Tourteaux. Optimizing implementations of lightweight building blocks. IACR Trans. Symmetric Cryptol., 2017(4):130-168, 2017.
$\left[K_{L L}{ }^{+} 14\right]$ Elif Bilge Kavun, Martin M Lauridsen, Gregor Leander, Christian Rechberger, Peter Schwabe, and Tolga Yalçın. Prøst v1. CAESAR Round, 1, 2014.
$\left[\mathrm{MGH}^{+} 15\right]$ Pawe l Morawiecki, Kris Gaj, Ekawat Homsirikamol, Krystian Matusiewicz, Josef Pieprzyk, Marcin Rogawski, Marian Srebrny, and Marcin Wójcik. Icepole v2. CAESAR submission: http://competitions.cr.yp.to/round2/icepolev2.pdf, 2015.
$\left[\mathrm{LWH}^{+} 21\right]$ Zhenyu Lu, Weijia Wang, Kai Hu, Yanhong Fan, Lixuan Wu, and Meiqin Wang. Pushing the limits: Searching for implementations with the smallest area for lightweight s-boxes. In Avishek Adhikari, Ralf Küsters, and Bart Preneel, editors, Progress in Cryptology - INDOCRYPT 2021 - 22nd International Conference on Cryptology in India, Jaipur, India, December 12-15, 2021, Proceedings, volume 13143 of Lecture Notes in Computer Science, pages 159-178. Springer, 2021.
[Osv00] Dag Arne Osvik. Speeding up serpent. In The Third Advanced Encryption Standard Candidate Conference, April 13-14, 2000, New York, New York, USA, pages 317-329. National Institute of Standards and Technology,, 2000.
[Ras22] Shahram Rasoolzadeh. Low-latency boolean functions and bijective s-boxes. IACR Trans. Symmetric Cryptol., 2022(3):403-447, 2022.
$\left[\right.$ SIH $\left.^{+} 11\right] \quad$ Kyoji Shibutani, Takanori Isobe, Harunaga Hiwatari, Atsushi Mitsuda, Toru Akishita, and Taizo Shirai. Piccolo: an ultra-lightweight blockcipher. In International workshop on cryptographic hardware and embedded systems, pages 342-357. Springer, 2011.
[STA $\left.{ }^{+} 14\right]$ Yu Sasaki, Yosuke Todo, Kazumaro Aoki, Yusuke Naito, Takeshi Sugawara, Yumiko Murakami, Mitsuru Matsui, and Shoichi Hirose. Minalpher v1. CAESAR Round, 1, 2014.
[Sto16] Ko Stoffelen. Optimizing s-box implementations for several criteria using sat solvers. In International Conference on Fast Software Encryption, pages 140-160. Springer, 2016.
$\left[Z_{B L}{ }^{+} 14\right]$ Wentao Zhang, Zhenzhen Bao, Dongdai Lin, Vincent Rijmen, Bohan Yang, and Ingrid Verbauwhede. Rectangle: a bit-slice ultra-lightweight block cipher. Suitable for Multiple Platforms, 2014.
$\left[\right.$ ZWW ${ }^{+}$14] Lei Zhang, Wenling Wu, Yanfeng Wang, Shengbao Wu, and Jian Zhang. Lac: A lightweight authenticated encryption cipher. Submitted to the CAESAR competition, 2014.

## A Proof of Proposition 1

First, we need the following lemma.
Lemma 1. Let $\mathcal{S}:\left(x_{1}, \ldots, x_{n}\right) \rightarrow\left(S_{1}, \ldots, S_{m}\right)$ be an $S$-box with $M C$. There exists a circuit, which implements $\mathcal{S}$ with MC $k$, satisfying that the NOT gates in this circuit all locate after the last AND gate.
Proof. Let $\mathcal{C}_{0}$ be a circuit that implements $\mathcal{S}$ with $k$ and gates. Denote its outputs by $y_{1}, y_{2}, \ldots, y_{m}$. Suppose $\operatorname{AND}_{i}$ is the $i$-th AND gate in $\mathcal{C}_{0}$. We denote its inputs as $p_{i}$ and $q_{i}$, and its output as $t_{i}$. Then we have

$$
p_{i}=a_{i}+L_{i}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i-1}\right)
$$

for some $a_{i}$ and linear function $L_{i}$. If $a_{i}=0$, this is a linear combination of these $x_{j}$ 's and $t_{j}$ 's, and for simplicity, we say $p_{i}$ is a linear combination. In this case, $p_{i}$ can be generated from $x_{j}$ 's and $t_{j}$ 's by only using XOR gates, hence we say NOT gate is not used for generating $p_{i}$. If $a_{1}=1$, we say $p_{i}$ is an affine combination. In this case, one should use at least one NOT gate to generate $p_{i}$ from $x_{j}$ 's and $t_{j}$ 's, hence we say NOT gate is used for generating $p_{i}$ The same terminology applies to $q_{i}$ and $y_{i}$.

Now we prove the following assertion: one can remove the NOT gates used for generating $p_{i}$ and $q_{i}$ by modifying the XOR and NOT gates following $\operatorname{AND}_{i}$, while preserving the circuit outputs.

If NOT gates are used for generating $p_{i}$ or $q_{i}$, we have the following two cases.

1) Both $p_{i}$ and $q_{i}$ are affine combinations. We have

$$
p_{i}=p_{i}^{\prime}+1, q_{i}=q_{i}^{\prime}+1, t_{i}=p_{i} q_{i}=p_{i}^{\prime} q_{i}^{\prime}+p_{i}^{\prime}+q_{i}^{\prime}+1
$$

where $p_{i}^{\prime}=U_{i}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i-1}\right), q_{i}^{\prime}=V_{i}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i-1}\right)$, and $U_{i}, V_{i}$ are two linear functions. We can change the inputs of $\mathrm{AND}_{i}$ to $p_{i}^{\prime}$ and $q_{i}^{\prime}$ by removing the NOT gates used to generate $p_{i}$ and $q_{i}$. Then, the output of AND $_{i}$ will be $t_{i}^{\prime}=p_{i}^{\prime} q_{i}^{\prime}$. Note that in $\mathcal{C}_{0}, t_{i}$ is used to generate the inputs of some $\operatorname{AND}_{j}$ with $j>i$, or generate the circuit outputs. For any case, $t_{i}$ is used to generate some affine combination $L\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i}, \ldots, t_{s}\right)+a$. Note that
$t_{i}=t_{i}^{\prime}+p_{i}^{\prime}+q_{i}^{\prime}+1=t_{i}^{\prime}+U_{i}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i-1}\right)+V_{i}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i-1}\right)+1$.
By substituting $t_{i}$ with this expression in $L$, we have

$$
\begin{equation*}
L\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i}, \ldots, t_{s}\right)+a=L^{\prime}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i}^{\prime}, \ldots, t_{s}\right)+a^{\prime} \tag{11}
\end{equation*}
$$

for some $a^{\prime}$ and linear function $L^{\prime}$.
Equation (11) means that we can modify the XOR and NOT gates following $\mathrm{AND}_{i}$ to preserve the circuit outputs.
2) One of $p_{i}$ and $q_{i}$ is an affine combination. Without loss of generality, suppose $p_{i}$ is an affine combination. Then, we have

$$
p_{i}=p_{i}^{\prime}+1, t_{i}=p_{i} q_{i}=p_{i}^{\prime} q_{i}+q_{i}
$$

where $p_{i}^{\prime}=U_{i}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i-1}\right), q_{i}=V_{i}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i-1}\right)$, and $U_{i}, V_{i}$ are two linear functions.
By removing the NOT gate used to generate $p_{i}$, we can change the inputs of $\mathrm{AND}_{i}$ to $p_{i}^{\prime}$ and $q_{i}$. Then the output of $\operatorname{AND}_{i}$ will be $t_{i}^{\prime}=p_{i}^{\prime} q_{i}$, and we have

$$
t_{i}=t_{i}^{\prime}+V_{i}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i-1}\right)
$$

Similarly, for any affine combination $L\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i}, \ldots, t_{s}\right)+a$, we can substitute $t_{i}$ with the above expression, and obtain an equivalent affine combination $L^{\prime}\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{i}^{\prime}, \ldots, t_{s}\right)+a^{\prime}$. This proves our assertion.

Based on the above assertion, we can recursively reconstruct the circuit as follows. For $i$ from 1 to $k$, remove the NOT gates used to generate the inputs of $\mathrm{AND}_{i}$ and modify the following XOR and NOT gates accordingly. After the above process, the NOT gates only appear after $\mathrm{AND}_{k}$. Note that, we only modify the XOR and NOT gates, hence the new circuit still has $k$ AND gates. This proves the lemma.

Proof of Proposition 1. According to Lemma 1, we can construct a circuit $\mathcal{C}$ that implements $\mathcal{S}$ with multiplicative complexity $k$, and all NOT gates in $\mathcal{C}$ appear after the last AND gate. Suppose $t_{i}$ is the output of the $i$-th AND gate, then the ANF of $t_{i}$ has no constant term, for any $1 \leq i \leq k$. Since $y_{i}$, the $i$-th bit of the circuit output, can be written as $y_{i}=L\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{k}\right)+a_{i}$, where $a_{i} \in \mathbb{F}_{2}$ and $L$ is a linear function. By substituting $t_{i}$ with its ANF in $L$, we have $y_{i}=T\left(x_{1}, \ldots, x_{n}\right)+a_{i}$, where $T$ is a Boolean polynomial without a constant term. This induces that $a_{i}=0$. Therefore, we can construct $y_{i}$ from $\left(x_{1}, \ldots, x_{n}, t_{1}, \ldots, t_{k}\right)$ only by XOR gates. This proves the proposition.

## B Attempts in Eliminating Redundant Gates for MC Optimizations

1) Trivial gates. In our modified scheme (without NOT gates), the two inputs of an AND gate are encoded as

$$
\begin{aligned}
Q_{2 i} & =\left(\sum_{j=0}^{n-1} a_{l_{2 i}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{i-1} a_{l_{2 i}+n+j} \cdot T_{j}\right), \\
Q_{2 i+1} & =\left(\sum_{j=0}^{n-1} a_{l_{2 i+1}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{i-1} a_{l_{2 i+1}+n+j} \cdot T_{j}\right) .
\end{aligned}
$$

Obviously, if $a_{l_{2 i}}=0, \ldots, a_{l_{2 i}+n+i-1}=0$ or $a_{l_{2 i+1}}=0, \ldots, a_{l_{2 i+1}+n+i-1}=0$, which means $Q_{2 i}=0$ or $Q_{2 i+1}=0$, this AND gate will be a trivial gate. Moreover, if $a_{l_{2 i}}=a_{l_{2 i+1}}, \ldots, a_{l_{2 i}+n+i-1}=a_{l_{2 i+1}+n+i-1}$, which means $Q_{2 i}=Q_{2 i+1}$, this AND gate will also be a trivial gate.
2) Useless gates. For our encoding scheme, if $T_{i}$ is useless, then it is irrelevant to any $Q_{r}(r>2 i+1)$ and any $Y_{t}(0 \leq t \leq m-1)$. In our scheme, $Q_{r}$ and $Y_{t}$ are encoded as follows,

$$
\begin{aligned}
Q_{r} & =\left(\sum_{j=0}^{n-1} a_{l_{r}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{\left\lfloor\frac{r}{2}\right\rfloor-1} a_{l_{r}+n+j} \cdot T_{j}\right), \\
Y_{t} & =\left(\sum_{j=0}^{n-1} a_{s_{t}+j} \cdot X_{j}\right)+\left(\sum_{j=0}^{k-1} a_{s_{t}+n+j} \cdot T_{j}\right) .
\end{aligned}
$$

If $T_{i}$ is irrelevant to $Q_{r}$ and $Y_{t}$, we have $a_{l_{r}+n+i}=a_{s_{t}+n+i}=0$.
We use AtLeast ${ }_{1}(S)$ to denote the equations corresponding to the constraint that there is at least one nonzero element in a variable set $S$. Then, to avoid trivial gates, we can impose the constraints $Q_{2 i} \neq 0, Q_{2 i+1} \neq 0$, and $Q_{2 i} \neq Q_{2 i+1}$. This can be done by adding AtLeast ${ }_{1}\left(a_{l_{2 i}}, \ldots, a_{l_{2 i}+n+i-1}\right)$, AtLeast $_{1}\left(a_{l_{2 i+1}}, \ldots, a_{l_{2 i+1}+n+i-1}\right)$, AtLeast $_{1}\left(a_{l}+\right.$ $\left.a_{l+n+i}, \ldots, a_{l+n+i-1}+a_{l+2 n+2 i-1}\right)$, for all $0 \leq i \leq k-1$, into the equation system. Moreover, we can add, AtLeast $_{1}\left(a_{l_{2 i+2}+n+i}, \ldots, a_{l_{2 k-1}+n+i}, a_{s_{0}+n+i}, \ldots, a_{s_{k-1}+n+i}\right)$ for all $0 \leq i \leq k-1$, into the equation system, to avoid the useless gates.

However, our experiments show that adding either of these two groups of equations will slightly decrease the efficiency of the subsequent solving process ${ }^{5}$. Therefore, for MC optimizations, we choose to keep these two types of redundant gates.

[^4]
## C New Optimized S-box Implementations

Here we present some new optimized S-box implementations achieved by our method. In these implementations, logical connectives are used to denote different gates, i.e., let $\wedge, \vee$, $\oplus, \neg$ denote AND, OR, XOR, NOT, respectively. Moreover, we use $x_{0}$ and $y_{0}$ to denote the most significant bit of the S-box input $x$ and the S-box output $y$, respectively.

## C. 1 MC Optimized Implementation

| PRIMATEs ${ }^{-1}$ | $q_{6}=u_{5} \oplus u_{2}$ | $u_{11}=u_{8} \oplus u_{10}$ |
| :---: | :---: | :---: |
| $k=8$ | $q_{7}=x_{1} \oplus q_{5}$ | $q_{12}=\neg u_{11}$ |
| $q_{1}=x_{0} \oplus x_{1}$ | $t_{3}=q_{6} \wedge q_{7}$ | $u_{12}=q_{15} \oplus q_{14}$ |
| $t_{0}=x_{1} \wedge q_{1}$ | $u_{6}=x_{2} \oplus q_{7}$ | $q_{13}=u_{4} \oplus u_{12}$ |
| $u_{0}=x_{4} \oplus t_{0}$ | $q_{8}=t_{3} \oplus u_{6}$ | $t_{6}=q_{12} \wedge q_{13}$ |
| $u_{1}=x_{2} \oplus t_{0}$ | $u_{7}=q_{6} \oplus q_{8}$ | $t_{7}=q_{14} \wedge q_{15}$ |
| $u_{2}=\neg x_{1}$ | $q_{9}=q_{4} \oplus u_{7}$ | $u_{13}=t_{7} \oplus u_{9}$ |
| $q_{2}=x_{1} \oplus u_{0}$ | $t_{4}=q_{8} \wedge q_{9}$ | $u_{14}=t_{6} \oplus u_{7}$ |
| $q_{3}=u_{1} \oplus u_{2}$ | $u_{8}=x_{3} \oplus t_{1}$ | $u_{15}=u_{3} \oplus q_{8}$ |
| $t_{1}=q_{2} \wedge q_{3}$ | $q_{10}=u_{8} \oplus u_{1}$ | $u_{16}=u_{14} \oplus u_{15}$ |
| $u_{3}=x_{3} \oplus q_{3}$ | $q_{11}=t_{4} \oplus u_{8}$ | $y_{0}=u_{13} \oplus u_{15}$ |
| $u_{4}=x_{2} \oplus q_{1}$ | $t_{5}=q_{10} \wedge q_{11}$ | $y_{1}=u_{8} \oplus u_{13}$ |
| $q_{4}=u_{3} \oplus u_{4}$ | $u_{9}=x_{2} \oplus t_{2}$ | $y_{2}=u_{10} \oplus u_{16}$ |
| $q_{5}=x_{4} \oplus u_{3}$ | $q_{14}=t_{5} \oplus u_{9}$ | $y_{4}=q_{13} \oplus u_{14}$ |
| $t_{2}=q_{4} \wedge q_{5}$ | $q_{15}=t_{4} \oplus u_{5}$ | $y_{3}=y_{4} \oplus u_{11}$ |
| $u_{5}=x_{4} \oplus t_{2}$ | $u_{10}=t_{0} \oplus q_{15}$ |  |

## C. 2 BGC Optimized Implementations

| Ascon | $t_{13}=x_{2} \vee x_{1}$ | $t_{18}=x_{1} \wedge t_{11}$ |
| :---: | :---: | :---: |
| $k=16$ | $t_{14}=t_{0} \oplus t_{6}$ | $y_{2}=t_{12} \oplus t_{10}$ |
| $t_{0}=x_{0} \oplus x_{4}$ | $t_{15}=t_{7} \vee t_{12}$ | $t_{20}=\neg t_{4}$ |
| $t_{1}=\neg x_{4}$ | $t_{16}=t_{2} \oplus t_{13}$ | $t_{21}=t_{12} \oplus t_{18}$ |
| $t_{2}=t_{1} \vee x_{3}$ | $t_{17}=t_{15} \oplus t_{13}$ | $y_{3}=t_{13} \oplus t_{16}$ |
| $t_{3}=x_{1} \oplus x_{2}$ | $t_{18}=t_{17} \wedge t_{10}$ | $t_{23}=x_{1} \vee x_{4}$ |
| $t_{4}=x_{3} \oplus x_{2}$ | $t_{19}=t_{18} \vee t_{9}$ | $y_{4}=t_{20} \oplus t_{17}$ |
| $t_{5}=x_{3} \oplus x_{4}$ | $t_{20}=t_{5} \oplus t_{19}$ | $y_{0}=t_{15} \oplus t_{23}$ |
| $t_{6}=t_{0} \vee x_{1}$ | $y_{2}=t_{14} \oplus t_{19}$ | $y_{1}=t_{6} \oplus t_{21}$ |
| $t_{7}=x_{0} \vee t_{5}$ | $y_{3}=t_{20} \oplus t_{8}$ |  |
| $t_{8}=t_{4} \vee t_{3}$ | $y_{1}=t_{12} \oplus t_{20}$ | PRIMATEs ${ }^{-1}$ |
| $y_{1}=t_{0} \oplus t_{8}$ | $y_{0}=t_{16} \oplus t_{19}$ | $k=45$ |
| $y_{3}=t_{3} \oplus t_{7}$ | $y_{4}=t_{19} \oplus t_{1}$ | See C. 1 |
| $t_{11}=x_{2} \wedge t_{3}$ |  | Minalpher |
| $t_{12}=t_{6} \oplus t_{5}$ | PRIMATEs | $k=17$ |
| $y_{2}=t_{3} \oplus t_{2}$ | $k=27$ |  |
| $y_{0}=t_{12} \oplus t_{11}$ | $t_{0}=x_{2} \wedge x_{3}$ | $t_{0}=\neg x_{3}$ |
| $y_{4}=t_{0} \oplus t_{12}$ | $t_{1}=x_{0} \wedge x_{2}$ | $t_{1}=x_{0} \vee t_{0}$ |
|  | $t_{2}=x_{0} \wedge x_{3}$ | $t_{2}=t_{1} \oplus x_{2}$ |
| ICEPOLE | $t_{3}=t_{0} \oplus x_{4}$ | $t_{3}=t_{2} \oplus x_{0}$ |
| $k=26$ | $t_{4}=x_{1} \oplus t_{1}$ | $t_{4}=t_{2} \oplus x_{3}$ |
| $t_{0}=x_{4} \wedge x_{3}$ | $t_{5}=x_{0} \wedge x_{4}$ | $t_{5}=t_{4} \oplus x_{1}$ |
| $t_{1}=x_{1} \vee x_{0}$ | $t_{6}=x_{0} \oplus x_{3}$ | $t_{6}=t_{5} \oplus t_{1}$ |
| $t_{2}=x_{4} \oplus x_{1}$ | $t_{7}=t_{6} \wedge x_{4}$ | $t_{7}=x_{3} \oplus t_{3}$ |
| $t_{3}=\neg x_{1}$ | $t_{8}=x_{1} \vee x_{0}$ | $t_{8}=t_{7} \vee x_{1}$ |
| $t_{4}=x_{4} \wedge x_{0}$ | $t_{9}=t_{1} \oplus t_{2}$ | $t_{9}=t_{6} \wedge t_{4}$ |
| $t_{5}=x_{3} \oplus t_{2}$ | $t_{10}=t_{5} \oplus t_{8}$ | $t_{10}=t_{3} \wedge t_{8}$ |
| $t_{6}=x_{2} \oplus x_{0}$ | $t_{11}=x_{2} \oplus x_{3}$ | $y_{3}=t_{9} \oplus t_{2}$ |
| $t_{7}=\neg x_{3}$ | $t_{12}=x_{2} \wedge t_{3}$ | $y_{2}=t_{10} \vee t_{9}$ |
| $t_{8}=x_{1} \oplus t_{4}$ | $t_{13}=t_{4} \wedge x_{2}$ | $t_{13}=t_{6} \wedge x_{0}$ |
| $t_{9}=x_{4} \oplus x_{0}$ | $t_{14}=t_{9} \oplus t_{3}$ | $t_{14}=t_{0} \oplus t_{10}$ |
| $t_{10}=x_{0} \oplus t_{3}$ | $t_{15}=t_{11} \oplus t_{14}$ | $y_{0}=t_{9} \oplus t_{8}$ |
| $t_{11}=x_{3} \wedge x_{2}$ | $t_{16}=t_{7} \oplus x_{0}$ | $y_{1}=t_{14} \vee t_{13}$ |
| $t_{12}=t_{11} \oplus x_{0}$ | $t_{17}=t_{12} \oplus t_{14}$ |  |

## C. 3 GC Optimized Implementations

| Ascon | $y_{2}=t_{14} \oplus t_{13}$ | $y_{2}=t_{7} \oplus t_{4}$ |
| :---: | :---: | :---: |
| $k=15$ | $t_{16}=t_{14} \oplus x_{3}$ | $t_{11}=t_{5} \wedge x_{1}$ |
| $t_{0}=x_{4} \oplus x_{3}$ | $y_{0}=\neg\left(t_{9} \oplus t_{16}\right)$ | $t_{12}=x_{0} \wedge t_{0}$ |
| $t_{1}=x_{2} \oplus x_{3}$ | $y_{4}=t_{16} \oplus t_{1}$ | $t_{13}=t_{5} \oplus x_{0}$ |
| $t_{2}=x_{1} \oplus x_{2}$ | $y_{1}=t_{12} \oplus y_{4}$ | $t_{14}=t_{12} \oplus t_{11}$ |
| $t_{3}=\neg\left(x_{0} \oplus x_{4}\right)$ | $y_{3}=t_{14} \oplus t_{8}$ | $t_{15}=t_{1} \wedge t_{13}$ |
| $t_{4}=\neg\left(t_{0} \vee x_{0}\right)$ |  | $t_{16}=t_{2} \oplus t_{3}$ |
| $t_{5}=\neg\left(t_{3} \oplus t_{2}\right)$ | Keccak/Ketje/Keyak | $t_{17}=\neg\left(t_{8} \oplus t_{12}\right)$ |
| $t_{6}=t_{5} \vee x_{1}$ | $k=13$ | $t_{18}=t_{15} \oplus t_{14}$ |
| $t_{7}=x_{4} \wedge t_{0}$ | $t_{0}=x_{3} \wedge x_{4}$ | $y_{0}=t_{9} \oplus t_{14}$ |
| $t_{8}=\neg\left(x_{1} \wedge t_{3}\right)$ | $t_{1}=\neg x_{2}$ | $y_{1}=t_{6} \oplus t_{16}$ |
| $t_{9}=t_{2} \vee t_{1}$ | $t_{2}=\neg x_{0}$ | $y_{3}=t_{18} \oplus t_{3}$ |
| $y_{3}=\neg\left(t_{4} \oplus t_{2}\right)$ | $t_{3}=\neg\left(x_{1} \wedge t_{2}\right)$ | $y_{4}=t_{17} \oplus t_{5}$ |
| $y_{4}=\neg\left(t_{8} \oplus t_{0}\right)$ | $t_{4}=\neg\left(t_{2} \vee x_{4}\right)$ |  |
| $y_{0}=t_{6} \oplus t_{0}$ | $t_{5}=\neg\left(t_{1} \oplus t_{0}\right)$ | PRIMATEs ${ }^{\mathbf{1}}$ |
| $y_{1}=\neg\left(t_{3} \oplus t_{9}\right)$ | $t_{6}=\neg\left(t_{1} \wedge x_{3}\right)$ | $k=43$ |
| $y_{2}=\neg\left(t_{2} \oplus t_{7}\right)$ | $y_{1}=\neg\left(x_{1} \oplus t_{6}\right)$ | See C. 1 |
|  | $t_{8}=\neg\left(t_{1} \vee x_{1}\right)$ |  |
| ICEPOLE | $y_{2}=x_{4} \oplus t_{5}$ | Minalpher |
| $k=21$ | $y_{4}=\neg\left(x_{4} \oplus t_{3}\right)$ | $k=14$ |
| $t_{0}=\neg\left(x_{4} \oplus x_{2}\right)$ | $y_{0}=t_{8} \oplus x_{0}$ | $t_{0}=\neg\left(x_{2} \wedge x_{3}\right)$ |
| $t_{1}=\neg\left(x_{4} \oplus x_{1}\right)$ | $y_{3}=t_{4} \oplus x_{3}$ | $t_{1}=x_{1} \oplus t_{0}$ |
| $t_{2}=t_{0} \vee x_{3}$ |  | $t_{2}=t_{1} \vee x_{0}$ |
| $t_{3}=x_{0} \wedge x_{1}$ | PRIMATEs | $t_{3}=\neg\left(x_{2} \oplus t_{2}\right)$ |
| $t_{4}=\neg\left(x_{3} \oplus t_{3}\right)$ | $k=23$ | $t_{4}=\neg\left(t_{3} \oplus x_{3}\right)$ |
| $t_{5}=\neg\left(x_{2} \wedge x_{3}\right)$ | $t_{0}=x_{3} \oplus x_{2}$ | $t_{5}=\neg\left(x_{3} \oplus t_{1}\right)$ |
| $t_{6}=\neg\left(x_{0} \oplus x_{2}\right)$ | $t_{1}=\neg\left(x_{4} \oplus x_{3}\right)$ | $t_{6}=t_{4} \wedge t_{5}$ |
| $t_{7}=\neg\left(t_{4} \oplus t_{5}\right)$ | $t_{2}=\neg\left(x_{2} \vee t_{1}\right)$ | $y_{3}=\neg\left(t_{3} \oplus t_{6}\right)$ |
| $t_{8}=\neg\left(x_{0} \wedge t_{1}\right)$ | $t_{3}=x_{1} \wedge t_{0}$ | $t_{8}=t_{6} \oplus t_{1}$ |
| $t_{9}=\neg\left(x_{1} \vee t_{6}\right)$ | $t_{4}=\neg\left(x_{3} \oplus t_{2}\right)$ | $y_{0}=x_{0} \oplus t_{4}$ |
| $t_{10}=t_{0} \wedge t_{6}$ | $t_{5}=x_{1} \oplus x_{4}$ | $t_{10}=t_{8} \wedge y_{0}$ |
| $t_{11}=\neg\left(t_{1} \wedge t_{10}\right)$ | $t_{6}=x_{0} \oplus x_{4}$ | $y_{2}=t_{5} \oplus t_{10}$ |
| $t_{12}=\neg\left(t_{7} \oplus x_{4}\right)$ | $t_{7}=\neg\left(t_{5} \vee x_{0}\right)$ | $t_{12}=\neg\left(y_{2} \wedge y_{3}\right)$ |
| $t_{13}=\neg\left(t_{2} \oplus t_{7}\right)$ | $t_{8}=x_{2} \wedge t_{6}$ | $y_{1}=\neg\left(t_{12} \oplus t_{8}\right)$ |
| $t_{14}=t_{11} \wedge t_{4}$ | $t_{9}=t_{0} \vee x_{3}$ |  |

## C. 4 Depth Optimized Implementations

## Ascon

$k=3, w=6$
Layer 1
$t_{0}=x_{2} \oplus x_{1}$
$t_{1}=x_{1} \oplus x_{3}$
$t_{2}=\neg\left(x_{3} \oplus x_{4}\right)$
$t_{3}=\neg\left(x_{0} \oplus x_{4}\right)$
$t_{4}=x_{2} \vee x_{1}$
$t_{5}=x_{4} \wedge x_{3}$
Layer 2
$t_{6}=t_{0} \oplus t_{5}$
$t_{7}=\neg\left(t_{2} \wedge t_{3}\right)$
$t_{8}=\neg\left(t_{4} \oplus t_{2}\right)$
$t_{9}=t_{1} \vee t_{0}$
$t_{10}=\neg\left(x_{1} \vee t_{3}\right)$
$t_{11}=x_{1} \wedge t_{3}$
Layer 3
$y_{0}=t_{10} \oplus t_{8}$
$y_{1}=\neg\left(t_{3} \oplus t_{9}\right)$
$y_{2}=\neg\left(x_{4} \oplus t_{6}\right)$
$y_{3}=t_{6} \oplus t_{7}$
$y_{4}=\neg\left(t_{2} \oplus t_{11}\right)$

## ICEPOLE

$k=4, w=7$
Layer 1
$t_{0}=\neg\left(x_{4} \oplus x_{1}\right)$
$t_{1}=\neg\left(x_{4} \wedge x_{3}\right)$
$t_{2}=x_{1} \oplus x_{3}$
$t_{3}=\neg\left(x_{3} \oplus x_{0}\right)$
$t_{4}=\neg x_{1}$
$t_{5}=\neg\left(x_{2} \oplus x_{0}\right)$
$t_{6}=x_{4} \oplus x_{2}$
Layer 2
$t_{7}=\neg\left(x_{2} \vee t_{2}\right)$
$t_{8}=t_{1} \oplus t_{6}$
$t_{9}=\neg\left(x_{0} \vee t_{4}\right)$
$t_{10}=t_{2} \vee t_{6}$
$t_{11}=\neg\left(t_{5} \wedge t_{0}\right)$
$t_{12}=x_{4} \vee t_{3}$
$t_{13}=x_{2} \wedge t_{4}$

Layer 3
$t_{15}=\neg\left(t_{10} \vee t_{11}\right)$
$t_{16}=x_{4} \oplus t_{9}$
$t_{17}=t_{12} \oplus t_{11}$
$t_{18}=\neg\left(t_{1} \oplus t_{12}\right)$
$t_{19}=\neg\left(x_{0} \oplus t_{13}\right)$
$t_{20}=\neg\left(t_{7} \oplus t_{13}\right)$
Layer 4
$y_{0}=\neg\left(t_{15} \oplus t_{19}\right)$
$y_{1}=t_{20} \oplus t_{15}$
$y_{2}=\neg\left(t_{8} \oplus t_{15}\right)$
$y_{3}=\neg\left(t_{18} \oplus t_{15}\right)$
$y_{4}=t_{16} \oplus t_{15}$

## Keccak/Ketje/Keyak

$k=2, w=10$
Layer 1
$t_{0}=\neg\left(x_{1} \vee x_{2}\right)$
$t_{1}=x_{2} \wedge x_{3}$
$t_{2}=x_{3} \wedge x_{4}$
$t_{3}=\neg\left(x_{4} \wedge x_{0}\right)$
$t_{4}=\neg\left(x_{3} \oplus x_{1}\right)$
$t_{5}=\neg\left(x_{0} \oplus x_{1}\right)$
$t_{6}=x_{2} \oplus x_{4}$
$t_{7}=x_{0} \vee x_{1}$
$t_{8}=x_{3} \oplus x_{0}$
$t_{9}=\neg\left(x_{4} \oplus x_{0}\right)$
Layer 2
$y_{0}=t_{5} \oplus t_{0}$
$y_{1}=\neg\left(t_{4} \oplus t_{1}\right)$
$y_{2}=t_{6} \oplus t_{2}$
$y_{3}=\neg\left(t_{8} \oplus t_{3}\right)$
$y_{4}=\neg\left(t_{7} \oplus t_{9}\right)$

## PRIMATEs

$k=4, w=7$
Layer 1
$t_{0}=\neg\left(x_{1} \vee x_{4}\right)$
$t_{1}=x_{0} \oplus x_{2}$
$t_{2}=\neg\left(x_{1} \oplus x_{4}\right)$
$t_{3}=x_{1} \wedge x_{2}$
$t_{4}=x_{3} \oplus x_{0}$
$t_{5}=x_{2} \oplus x_{3}$
$t_{6}=x_{4} \oplus x_{3}$
Layer 2
$t_{7}=\neg\left(x_{0} \wedge t_{2}\right)$
$t_{8}=x_{2} \wedge t_{1}$
$t_{9}=\neg\left(x_{3} \wedge t_{1}\right)$
$t_{10}=x_{2} \wedge t_{6}$
$t_{11}=\neg\left(t_{6} \oplus t_{0}\right)$
$t_{12}=\neg\left(x_{4} \wedge t_{4}\right)$
$t_{13}=\neg\left(x_{1} \wedge t_{5}\right)$
Layer 3
$t_{14}=x_{1} \oplus t_{10}$
$t_{15}=\neg\left(t_{7} \vee t_{8}\right)$
$t_{16}=\neg\left(x_{4} \oplus t_{9}\right)$
$t_{17}=t_{10} \oplus t_{13}$
$t_{18}=t_{1} \oplus t_{8}$
$t_{19}=t_{9} \oplus t_{8}$
$t_{20}=\neg\left(t_{3} \oplus t_{12}\right)$
Layer 4

$$
\begin{aligned}
& y_{0}=\neg\left(t_{11} \oplus t_{19}\right) \\
& y_{1}=\neg\left(t_{4} \oplus t_{17}\right) \\
& y_{2}=t_{14} \oplus t_{15} \\
& y_{3}=t_{20} \oplus t_{18} \\
& y_{4}=\neg\left(t_{14} \oplus t_{16}\right)
\end{aligned}
$$

## Joltik/Piccolo

$k=4, w=2$
Layer 1
$t_{0}=\neg\left(x_{1} \vee x_{2}\right)$
$t_{1}=\neg\left(x_{1} \vee x_{0}\right)$
Layer 2
$y_{0}=x_{3} \oplus t_{1}$
$y_{1}=x_{0} \oplus t_{0}$
Layer 3
$t_{4}=x_{2} \vee y_{0}$
$t_{5}=y_{0} \vee y_{1}$
Layer 4
$y_{2}=x_{1} \oplus t_{4}$
$y_{3}=\neg\left(x_{2} \oplus t_{5}\right)$

Joltik ${ }^{-1} /$ Piccolo $^{-1}$
$k=4, w=3$
Layer 1
$t_{0}=x_{0} \vee x_{1}$
$t_{1}=\neg\left(x_{2} \wedge x_{1}\right)$
$t_{2}=x_{0} \oplus x_{2}$
Layer 2
$t_{3}=x_{3} \oplus t_{2}$
$t_{4}=t_{0} \wedge t_{1}$
$y_{2}=\neg\left(x_{3} \oplus t_{0}\right)$
Layer 3
$t_{6}=t_{2} \vee y_{2}$
$t_{7}=t_{3} \wedge y_{2}$
$t_{8}=x_{0} \vee y_{2}$
Layer 4
$y_{0}=\neg\left(x_{1} \oplus t_{6}\right)$
$y_{1}=x_{2} \oplus t_{8}$
$y_{3}=t_{4} \oplus t_{7}$

## LAC

$k=3, w=6$
Layer 1
$t_{0}=x_{0} \oplus x_{3}$
$t_{1}=\neg\left(x_{3} \vee x_{1}\right)$
$t_{2}=\neg\left(x_{3} \oplus x_{2}\right)$
$t_{3}=\neg\left(x_{1} \oplus x_{0}\right)$
$t_{4}=x_{2} \wedge x_{1}$
$t_{5}=x_{1} \vee x_{0}$
Layer 2
$t_{6}=\neg\left(x_{0} \vee t_{1}\right)$
$t_{7}=x_{1} \wedge t_{2}$
$t_{8}=\neg\left(t_{0} \wedge t_{2}\right)$
$t_{9}=t_{3} \vee t_{4}$
$t_{10}=\neg\left(x_{3} \wedge t_{5}\right)$
$t_{11}=t_{2} \vee t_{2}$
Layer 3
$y_{0}=\neg\left(t_{8} \oplus t_{10}\right)$
$y_{1}=t_{9} \oplus t_{6}$
$y_{2}=\neg\left(t_{0} \oplus t_{7}\right)$
$y_{3}=\neg\left(t_{5} \oplus t_{11}\right)$

## Minalpher

$k=4, w=5$
Layer 1
$t_{0}=\neg\left(x_{2} \wedge x_{1}\right)$
$t_{1}=\neg\left(x_{1} \vee x_{2}\right)$
$t_{2}=\neg\left(x_{2} \oplus x_{3}\right)$
$t_{3}=x_{1} \wedge x_{3}$
$t_{4}=x_{3} \wedge x_{2}$
Layer 2
$t_{5}=x_{0} \oplus t_{2}$
$t_{6}=\neg\left(x_{0} \wedge t_{0}\right)$
$t_{7}=\neg\left(x_{3} \vee t_{1}\right)$
$t_{8}=x_{1} \oplus t_{4}$
$t_{9}=\neg\left(x_{2} \oplus t_{3}\right)$
Layer 3
$t_{10}=t_{6} \oplus t_{9}$
$t_{11}=\neg\left(t_{5} \oplus t_{8}\right)$
$t_{12}=\neg\left(x_{0} \wedge t_{8}\right)$
$t_{13}=\neg\left(t_{5} \vee t_{9}\right)$
$t_{14}=t_{3} \oplus t_{6}$
Layer 4
$y_{0}=t_{12} \oplus t_{11}$
$y_{1}=\neg\left(t_{13} \oplus t_{14}\right)$
$y_{2}=t_{10} \oplus t_{12}$
$y_{3}=\neg\left(t_{7} \vee t_{13}\right)$

## Prøst

$k=4, w=3$
Layer 1
$t_{0}=x_{0} \wedge x_{1}$
$t_{1}=x_{2} \wedge x_{1}$
$t_{2}=\neg\left(x_{3} \wedge x_{0}\right)$
Layer 2
$t_{3}=x_{0} \vee t_{1}$
$y_{0}=x_{2} \oplus t_{0}$
$t_{5}=\neg\left(t_{1} \oplus t_{2}\right)$
Layer 3
$t_{6}=\neg\left(x_{3} \oplus t_{5}\right)$
$t_{7}=x_{1} \oplus t_{5}$
$t_{8}=\neg\left(x_{3} \wedge y_{0}\right)$
Layer 4
$y_{3}=\neg\left(t_{7} \oplus t_{8}\right)$
$y_{1}=t_{2} \oplus t_{6}$
$y_{2}=\neg\left(t_{3} \oplus t_{8}\right)$
$k=3, w=6$
Layer 1
$t_{0}=\neg\left(x_{3} \oplus x_{2}\right)$
$t_{1}=x_{2} \oplus x_{1}$
$t_{2}=x_{2} \vee x_{0}$
$t_{3}=\neg\left(x_{1} \oplus x_{0}\right)$
$t_{4}=x_{3} \vee x_{1}$
$t_{5}=\neg\left(x_{0} \oplus x_{3}\right)$
Layer 2
$t_{6}=\neg\left(t_{2} \oplus t_{5}\right)$
$t_{7}=\neg\left(x_{1} \vee t_{3}\right)$
$t_{8}=t_{4} \wedge t_{0}$
$t_{9}=\neg\left(x_{2} \vee t_{0}\right)$
$t_{10}=t_{3} \vee t_{5}$
$t_{11}=t_{1} \wedge t_{2}$
Layer 3
$y_{0}=\neg\left(t_{10} \oplus t_{11}\right)$
$y_{1}=\neg\left(t_{8} \oplus t_{7}\right)$
$y_{2}=\neg\left(x_{1} \oplus t_{6}\right)$
$y_{3}=\neg\left(t_{3} \oplus t_{9}\right)$

## RECTANGLE ${ }^{-1}$

$k=3, w=6$
Layer 1
$t_{0}=\neg\left(x_{2} \oplus x_{1}\right)$
$t_{1}=x_{0} \oplus x_{3}$
$t_{2}=\neg\left(x_{2} \wedge x_{1}\right)$
$t_{3}=\neg\left(x_{2} \wedge x_{3}\right)$
$t_{4}=x_{1} \oplus x_{0}$
$t_{5}=x_{2} \vee x_{1}$
Layer 2
$t_{6}=t_{5} \wedge t_{4}$
$t_{7}=t_{0} \wedge t_{4}$
$t_{8}=\neg\left(x_{0} \vee t_{1}\right)$
$t_{9}=\neg\left(t_{2} \wedge t_{1}\right)$
$t_{10}=\neg\left(x_{3} \wedge t_{4}\right)$
$t_{11}=t_{3} \wedge t_{1}$
Layer 3
$y_{0}=\neg\left(t_{7} \oplus t_{11}\right)$
$y_{1}=t_{0} \oplus t_{8}$
$y_{2}=t_{0} \oplus t_{10}$
$y_{3}=t_{6} \oplus t_{9}$

## RECTANGLE


[^0]:    ${ }^{1}$ If $a_{24}, a_{30}, a_{36}=0$, then $T_{2}$ is not used in the following gates. This means the AND gate generating $T_{2}$ is redundant, which induces a circuit implementing the S -box with one AND gate.

[^1]:    ${ }^{2} \mathrm{~A}$ specific introduction of how to avoid these redundant gates can be found in Appendix B

[^2]:    ${ }^{3}$ Here we compare the last $w a_{i}$ 's in $Q_{2 i}$ with the $w a_{i}$ 's in $Q_{2 i+1}$

[^3]:    ${ }^{4}$ For the BGC and GC optimizations of PRIMATEs ${ }^{-1}$, the solver cannot return a SAT result in a reasonable time. The BGC-45 and GC-43 implementations are constructed from the MC-8 implementation.

[^4]:    ${ }^{5} \mathrm{We}$ have tested two ways of encoding $\operatorname{AtLeastOne}\left(a_{1}, a_{2}, \ldots, a_{s}\right)$ : 1) directly adding $a_{1} \vee a_{2} \vee \cdots \vee a_{s}$ into the input of the SAT solver; 2) adding $\left(a_{1}+1\right)\left(a_{2}+1\right) \cdots\left(a_{s}+1\right)=0$ into the equation system.

