Fast evaluation of polynomials over binary finite fields and application to side-channel countermeasures

We describe a new technique for evaluating polynomials over binary finite fields. This is useful in the context of anti-DPA countermeasures when an S-box is expressed as a polynomial over a binary finite field. For $n$-bit S-boxes, our new technique has heuristic complexity $O(2^{n/2}/\sqrt{n})$ instead of the $O(2^{n/2})$ proven complexity for the Parity-Split method. We also prove a lower bound of $\Omega(2^{n/2}/\sqrt{n})$ on the complexity of any method to evaluate $n$-bit S-boxes; this shows that our method is asymptotically optimal. Here, complexity refers to the number of non-linear multiplications required to evaluate the polynomial corresponding to an S-box.
In practice, we can evaluate any 8-bit S-box in 10 non-linear multiplications instead of 16 in the Roy–Vivek paper from CHES 2013, and the DES S-boxes in 4 non-linear multiplications instead of 7. We also evaluate any 4-bit S-box in 2 non-linear multiplications instead of 3. Hence our method achieves optimal complexity for the PRESENT S-box.


Introduction
The implementations of cryptographic algorithms on devices like PCs, microcontrollers and smart cards leak secret information to an adversary. Typical examples of such leakages are electromagnetic emissions, power consumption and even acoustic emanations. An adversary can use this information to recover the secret key by applying different statistical techniques. Differential Power Analysis (DPA), the most widely known and powerful technique, is based on statistical analysis of the power consumption of a device [KJJ99]. Other techniques, including Template Attacks and Correlation Power Analysis (CPA), were proposed in the past [CRR02,BCO04]. More recently, a side-channel attack on RSA was proposed using the acoustic emanations from a device [GST13].
Masking. A well-known technique to protect implementations against power-analysis-based side-channel attacks is to mask internal secret variables. This is done by XORing any internal variable with a random variable $r$, e.g., $x' = x \oplus r$. However, this makes the implementation secure against first-order attacks only. Second-order attacks against such countermeasures were proposed in [Mes00]. In this type of attack the adversary combines the information obtained from two internal variables. This requires more data (power consumption traces) in practice, which could make the attack infeasible in certain cases. In general, the above masking technique can be extended to secure an implementation against higher-order attacks. This can be achieved by splitting an internal variable $x$ into $d$ shares, say $x = \bigoplus_{i=1}^{d} x_i$. Using this idea it is easy to compute any linear/affine function $f$ in a secure way, since it is enough to compute $y_i = f(x_i)$ for $1 \le i \le d$. However, it is not obvious how to do this for non-linear functions. In practice, nearly every cryptographic primitive includes some non-linear function, e.g., an S-box or a modular addition.
Generic Higher-Order Masking. The Rivain-Prouff masking scheme is the first provably secure higher-order masking technique for AES [RP10]. The main idea of this method is to perform secure monomial evaluation with $d$ shares of a secret variable using the previously known ISW scheme [ISW03]. Namely, the (non-linear part of the) AES S-box can be represented by the monomial $x^{254}$ over $\mathbb{F}_{2^8}$. Prouff and Rivain showed that this monomial can be evaluated securely using 4 non-linear multiplications and a few linear squarings. By using this scheme the AES S-box can be masked for any order $d$.
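The share-splitting idea described under Masking can be illustrated with a minimal sketch. It assumes 8-bit variables and uses a one-bit rotation as a stand-in for an $\mathbb{F}_2$-linear function; the helper names `share`, `unshare` and `rotl8` are hypothetical and not taken from any of the cited schemes.

```python
import secrets
from functools import reduce
from operator import xor

def share(x, d, nbits=8):
    """Split x into d shares whose XOR equals x (d - 1 shares are uniformly random)."""
    shares = [secrets.randbits(nbits) for _ in range(d - 1)]
    shares.append(reduce(xor, shares, x))
    return shares

def unshare(shares):
    """Recombine the shares by XORing them together."""
    return reduce(xor, shares, 0)

def rotl8(v):
    """Rotate an 8-bit value left by one bit; this map is F_2-linear."""
    return ((v << 1) | (v >> 7)) & 0xFF

x = 0xA7
xs = share(x, d=5)
ys = [rotl8(s) for s in xs]   # the linear map is applied to each share independently
```

Recombining `ys` gives `rotl8(x)` because the map commutes with XOR; no such shortcut exists for a non-linear function, which is the difficulty the text goes on to address.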
This method was extended to a generic technique for higher-order masking in [CGP+12] by Carlet, Goubin, Prouff, Quisquater and Rivain (CGPQR). Any given $n$-bit S-box can be represented by a polynomial $\sum_{i=0}^{2^n-1} a_i x^i$ over $\mathbb{F}_{2^n}$ using Lagrange's interpolation theorem. Hence, any S-box can be masked by secure evaluation of this polynomial with $d$ shares of a secret variable. This is the first generic technique to mask any S-box for any order $d$. In this technique a polynomial evaluation in $\mathbb{F}_{2^n}$ is split into simple operations over $\mathbb{F}_{2^n}$: addition, multiplication by a constant, and regular multiplication of two elements. Note that multiplication of an element by itself (i.e. squaring) and multiplication by a constant are both linear operations over $\mathbb{F}_{2^n}$, hence easy to mask. For performing a secure multiplication of two distinct elements, i.e. a non-linear multiplication, the CGPQR masking scheme uses the ISW method as in [RP10].
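The secure non-linear multiplication mentioned above can be sketched as follows. This is a minimal illustration of an ISW-style multiplication on shares, assuming $\mathbb{F}_{2^8}$ with the AES reduction polynomial $x^8 + x^4 + x^3 + x + 1$; the names `gf_mul`, `share`, `unshare` and `isw_mul` are hypothetical, and the code is neither constant-time nor a hardened implementation.

```python
import secrets
from functools import reduce
from operator import xor

def gf_mul(a, b, n=8, poly=0x11B):
    """Multiply a and b in GF(2^n) with the given reduction polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:
            a ^= poly
    return r

def share(x, d):
    """Split x into d shares whose XOR equals x."""
    s = [secrets.randbits(8) for _ in range(d - 1)]
    return s + [reduce(xor, s, x)]

def unshare(shares):
    return reduce(xor, shares, 0)

def isw_mul(a_sh, b_sh):
    """Return shares of a*b given shares of a and b; uses d^2 field multiplications."""
    d = len(a_sh)
    c = [gf_mul(a_sh[i], b_sh[i]) for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r = secrets.randbits(8)          # fresh randomness r_ij
            c[i] ^= r
            # r_ji = (r_ij + a_i*b_j) + a_j*b_i, evaluated in this order
            c[j] ^= (r ^ gf_mul(a_sh[i], b_sh[j])) ^ gf_mul(a_sh[j], b_sh[i])
    return c

a, b = 0x57, 0x83
c_sh = isw_mul(share(a, 4), share(b, 4))
```

The double loop makes the quadratic cost in the number of shares visible, in contrast to the share-wise (linear-cost) treatment of squarings and constant multiplications.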
Asymptotically, the running time of the Rivain-Prouff and CGPQR masking schemes is dominated by the number of non-linear multiplications required to evaluate a polynomial over $\mathbb{F}_{2^n}$. Namely, with $d$ shares, an affine function can be masked with only $O(d)$ operations over $\mathbb{F}_{2^n}$, whereas a non-linear multiplication using the ISW method requires $O(d^2)$ operations. Note that for achieving $d$-th order security the Rivain-Prouff scheme requires at least $2d + 1$ shares.
Efficient Polynomial Evaluation for Masking. The CGPQR masking scheme can be made efficient by optimizing the number of multiplications required for the polynomial evaluation in $\mathbb{F}_{2^n}$. In [CGP+12] two techniques, Parity-Split and Cyclotomic Class, are used for optimizing the number of such non-linear multiplications. For an arbitrary $n$-bit S-box, or equivalently for evaluating any polynomial over $\mathbb{F}_{2^n}$, the Parity-Split method has proven complexity $O(2^{n/2})$. Here complexity refers to the number of non-linear multiplications required to evaluate the polynomial corresponding to an S-box. For the particular case of monomials (e.g. the AES S-box) the Cyclotomic Class method gives the optimal number of multiplications in $\mathbb{F}_{2^n}$.
At CHES 2013, Roy and Vivek [RV13] adapted a generic method for improving the efficiency of polynomial evaluation in $\mathbb{F}_{2^n}$. They demonstrated the technique for the polynomials corresponding to several well-known S-boxes including DES, PRESENT and CLEFIA. In particular, the Roy-Vivek method reduces the number of non-linear multiplications for DES to 7 (from 10), for CLEFIA to 16 (from 22) and for CAMELLIA to 15 (from 22). This technique also achieves the optimal number of 4 multiplications for the monomial corresponding to the AES S-box.
Our Results. In this article we propose an improved generic technique for fast polynomial evaluation in $\mathbb{F}_{2^n}$. For an arbitrary $n$-bit S-box our method has heuristic complexity $O(2^{n/2}/\sqrt{n})$, compared to the $O(2^{n/2})$ proven complexity for the Parity-Split method from [CGP+12].
Our method is as follows. We first generate a set $L$ of monomials $x^\alpha$, including all the monomials from a cyclotomic class. We then randomly generate a fixed set of "basis" polynomials $q_i(x)$, whose monomials are all in the precomputed set $L$. Then, given a polynomial $P(x)$ over $\mathbb{F}_{2^n}$, we try to write $P(x)$ as
$$P(x) = \sum_{i=1}^{t-1} p_i(x) \cdot q_i(x) + p_t(x), \tag{1}$$
where the $p_i(x)$ are polynomials with monomials also in the set $L$, and $t$ is some parameter. Since the $q_i(x)$ polynomials are fixed, the coefficients of the $p_i(x)$ polynomials can be obtained by solving a system of linear equations in $\mathbb{F}_{2^n}$. Then, to evaluate $P(x)$, one first evaluates all the monomials in the set $L$; the polynomials $p_i(x)$ and $q_i(x)$ can then be evaluated without any further non-linear multiplication. The polynomial $P(x)$ is then evaluated from (1) with $t - 1$ additional non-linear multiplications.
The number of monomials in the set $L$ must be carefully chosen. Namely, the larger the basis set $L$ of monomials, the more degrees of freedom we have in solving (1), with fewer polynomials $p_i(x)$ and therefore fewer additional non-linear multiplications; however, the number of non-linear multiplications to build $L$ will increase. Therefore the number of monomials in the basis set $L$ must be optimized to minimize the total number of non-linear multiplications, namely the non-linear multiplications for building the set $L$ and the additional $t - 1$ non-linear multiplications for evaluating $P(x)$.
As a concrete application of our new method above, we show that for the generic higher-order masking of several well-known S-boxes, e.g. DES, CLEFIA and PRESENT, our method reduces the number of multiplications compared to the previously known methods [CGP+12,RV13]. In particular, using our method PRESENT can be masked with 2 multiplications (instead of 3), and DES with 4 multiplications (instead of 7); see Table 1. Our method achieves optimal complexity for the PRESENT S-box since it was proved in [RV13] that 2 non-linear multiplications are necessary to evaluate it. In Table 5, we report the timing results for DES masked using our technique. We also prove a lower bound of $\Omega(2^{n/2}/\sqrt{n})$ for the complexity of any method to evaluate $n$-bit S-boxes, a.k.a. masking complexity; this shows that our method is asymptotically optimal. Our new lower bound significantly improves upon the previously known bound of $\Omega(\log_2 n)$ from [RV13].

Generic Polynomial Evaluation Technique
Before we describe our improved method to evaluate polynomials over $\mathbb{F}_{2^n}$, let us first recollect in Section 2.1 the method proposed by Roy and Vivek [RV13, Section 4] to evaluate the polynomials (over $\mathbb{F}_{2^6}$) corresponding to the DES S-boxes. Their method requires 7 non-linear multiplications. The method in [RV13] is based on the Divide-and-Conquer strategy, which is an adaptation of a polynomial evaluation technique by Paterson and Stockmeyer [PS73]. The technique consists in decomposing the polynomial to be evaluated in terms of polynomials having their monomials from a precomputed set. Our method is partly based on this idea.

The Roy-Vivek Method for DES S-boxes
Let $P_{DES}(x) \in \mathbb{F}_{2^6}[x]$ be the Lagrange interpolation polynomial corresponding to a DES S-box. Here the 4-bit output of a DES S-box is identified as a 6-bit output with two leading zeroes, and hence these bit strings are naturally identified with the elements of $\mathbb{F}_{2^6}$. Note that for all the DES S-boxes, $\deg(P_{DES}(x)) = 62$. Write
$$P_{DES}(x) = q(x) \cdot x^{36} + R(x),$$
where $\deg(R) \le 35$ and $\deg(q) = 26$. Then divide the polynomial $R(x) - x^{27}$ by $q(x)$: $R(x) - x^{27} = c(x) \cdot q(x) + s(x)$, where $\deg(c) \le 9$ and $\deg(s) \le 25$, which gives
$$P_{DES}(x) = (x^{36} + c(x)) \cdot q(x) + x^{27} + s(x).$$
Next decompose the polynomials $q(x)$ and $x^{27} + s(x)$ in a similar way but, instead, dividing first by $x^{18}$, and then using $x^9$ as the "correction term". One gets
$$q(x) = (x^{18} + c_1(x)) \cdot q_1(x) + x^9 + s_1(x),$$
$$x^{27} + s(x) = (x^{18} + c_2(x)) \cdot q_2(x) + x^9 + s_2(x),$$
where $\deg(q_1) = 8$, $\deg(c_1) \le 9$, $\deg(s_1) \le 7$, $\deg(q_2) = 9$, $\deg(c_2) \le 8$, and $\deg(s_2) \le 8$. Finally,
$$P_{DES}(x) = (x^{36} + c(x)) \cdot \big((x^{18} + c_1(x)) \cdot q_1(x) + x^9 + s_1(x)\big) + (x^{18} + c_2(x)) \cdot q_2(x) + x^9 + s_2(x). \tag{2}$$
In [RV13], the monomials $x, x^2, x^3, x^4, x^5, x^6, x^7, x^8, x^9, x^{18}, x^{36}$ are first evaluated using 4 non-linear multiplications. Namely, a non-linear multiplication is required for each of the monomials $x^3$, $x^5$, $x^7$ and $x^9$; the rest of the monomials can be evaluated using linear squarings only. Each of the individual polynomials in the above expression, such as $x^{36} + c(x)$, $x^{18} + c_1(x)$, $q_1(x)$, and so on, can then be evaluated for free, that is, without further non-linear multiplications. To evaluate $P_{DES}(x)$ from (2), 3 more non-linear multiplications are needed, and hence in total 7 non-linear multiplications are sufficient to evaluate a DES S-box.
To sum up, the basic idea behind the above technique is to precompute a set of monomials, and then obtain a decomposition of the required polynomial in terms of polynomials having their monomials only from the precomputed set. Note that the said decomposition is obtained in a "fixed" way that depends only on the degree of the polynomial, which is required to be of the form $k(2^p - 1) \pm c$, for some parameters $k$, $p$ and $c$; we refer to [RV13] for more details.
In the new method we propose next, we also precompute a set of monomials as above, but we also include every other monomial that can be computed for free by the squaring operation; that is, we always generate the full cyclotomic class for any computed monomial. Then we try to decompose the polynomial as a sum of products of two polynomials having their monomials from the precomputed set. One of the two polynomials in every summand is randomly chosen, and we try to determine the other polynomial by solving (for unknown coefficients) the system of linear equations obtained by evaluating the polynomial at every point of the domain $\mathbb{F}_{2^n}$. This approach of determining the unknown coefficients of the polynomials is similar to the Lagrange interpolation technique.

Our New Generic Method
Let us first recollect the notion of cyclotomic class over $\mathbb{F}_{2^n}$ and introduce some notation. The cyclotomic class of $\alpha$ w.r.t. $n$ ($n \ge 1$, $0 \le \alpha \le 2^n - 2$), denoted by $C_\alpha$, is defined as the set of integers
$$C_\alpha = \{\alpha \cdot 2^i \bmod (2^n - 1) : i = 0, 1, \ldots, n - 1\}.$$
Intuitively, $C_\alpha$ corresponds to the exponents of all the monomials that can be computed from $x^\alpha \in \mathbb{F}_{2^n}[x]$ using only squaring operations (modulo $x^{2^n} + x$). Since our goal is only to evaluate polynomials over $\mathbb{F}_{2^n}$, we will actually be working in the ring $\mathbb{F}_{2^n}[x]/(x^{2^n} + x)$. In other words, we treat any polynomial $P(x) \in \mathbb{F}_{2^n}[x]$ to be the same as $P(x)$ modulo $x^{2^n} + x$; hence $P(x)$ has degree at most $2^n - 1$.
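The definition of a cyclotomic class can be made concrete with a short sketch. The function names `cyclotomic_class` and `classes_hit` are hypothetical; a class is represented here as a sorted list and identified by its smallest element.

```python
def cyclotomic_class(alpha, n):
    """Return C_alpha = {alpha * 2^i mod (2^n - 1) : 0 <= i < n} as a sorted list."""
    m = (1 << n) - 1
    return sorted({(alpha << i) % m for i in range(n)})

def classes_hit(exponents, n):
    """Smallest representatives of the cyclotomic classes containing the given exponents."""
    return {cyclotomic_class(e, n)[0] for e in exponents}

# Monomial set of Section 2.1 over F_{2^6}: x, ..., x^9, x^18, x^36
roy_vivek_exponents = list(range(1, 10)) + [18, 36]
hit = classes_hit(roy_vivek_exponents, 6)
```

For the monomial set recalled in Section 2.1, `hit` contains the class of 1 plus four further classes, matching the count of one non-linear multiplication for each of $x^3$, $x^5$, $x^7$ and $x^9$. Note also that a class can have fewer than $n$ elements, as `cyclotomic_class(9, 6)` shows.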

By $d \xleftarrow{\$} D$ we denote an element $d$ chosen uniformly at random from a set $D$. For any subset $\Lambda \subseteq \{0, 1, \ldots, 2^n - 2\}$, $x^\Lambda$ denotes the set of monomials
$$x^\Lambda = \{x^i : i \in \Lambda\}.$$
Finally, we denote by $\mathcal{P}(x^\Lambda)$ the set of all polynomials in $\mathbb{F}_{2^n}[x]$ whose monomials are only from the set $x^\Lambda$.
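Description. Consider an $n$-bit to $n$-bit S-box represented by a polynomial $P(x) \in \mathbb{F}_{2^n}[x]$. We consider a collection $S$ of $\ell$ cyclotomic classes w.r.t. $n$:
$$S = \{C_{\alpha_1 = 0},\ C_{\alpha_2 = 1},\ C_{\alpha_3}, \ldots, C_{\alpha_\ell}\}. \tag{3}$$
Also, define $L$ as the set of all integers in the cyclotomic classes of $S$:
$$L = \bigcup_{C_{\alpha_i} \in S} C_{\alpha_i}. \tag{4}$$
We choose the set $S$ of $\ell$ cyclotomic classes in (3) so that the set of corresponding monomials $x^L$ can be computed using only $\ell - 2$ non-linear multiplications. We require that every monomial $x^0, x^1, \ldots, x^{2^n - 1}$ can be written as a product of some two monomials in $x^L$. Moreover, we try to choose only those cyclotomic classes with the maximum number of $n$ elements (except $C_0$, which has only a single element). This gives
$$|L| = 1 + n \cdot (\ell - 1). \tag{5}$$
Next, we generate $t - 1$ random polynomials $q_i(x) \xleftarrow{\$} \mathcal{P}(x^L)$ that have their monomials only in $x^L$. Suitable values for the parameters $t$ and $|L|$ will be determined later. Then, we try to find $t$ polynomials $p_i(x) \in \mathcal{P}(x^L)$ such that
$$P(x) = \sum_{i=1}^{t-1} p_i(x) \cdot q_i(x) + p_t(x). \tag{6}$$
It is easy to see that the coefficients of the $p_i(x)$ polynomials can be obtained by solving a system of linear equations in $\mathbb{F}_{2^n}$, as in the Lagrange interpolation theorem. More precisely, to find the polynomials $p_i(x)$, we solve the following system of linear equations over $\mathbb{F}_{2^n}$:
$$A \cdot c = b, \tag{7}$$
where the matrix $A$ is obtained by evaluating the R.H.S. of (6) at every element of $\mathbb{F}_{2^n}$, and by treating the unknown coefficients of $p_i(x)$ as variables. This matrix has $2^n$ rows and $t \cdot |L|$ columns, since each of the $t$ polynomials $p_i(x)$ has $|L|$ unknown coefficients. The matrix $A$ can also be written as a block concatenation of smaller matrices:
$$A = (A_1 \,|\, A_2 \,|\, \cdots \,|\, A_t), \tag{8}$$
where $A_i$ is a $2^n \times |L|$ matrix corresponding to the product $p_i(x) \cdot q_i(x)$. Let $a_j \in \mathbb{F}_{2^n}$ ($j = 0, 1, \ldots, 2^n - 1$) be all the field elements, and let $p_i(x)$ consist of the monomials $x^{k_1}, x^{k_2}, \ldots, x^{k_{|L|}} \in x^L$. Then the matrix $A_i$ has the following structure:
$$A_i = \begin{pmatrix} a_0^{k_1} \cdot q_i(a_0) & a_0^{k_2} \cdot q_i(a_0) & \cdots & a_0^{k_{|L|}} \cdot q_i(a_0) \\ a_1^{k_1} \cdot q_i(a_1) & a_1^{k_2} \cdot q_i(a_1) & \cdots & a_1^{k_{|L|}} \cdot q_i(a_1) \\ \vdots & \vdots & \ddots & \vdots \\ a_{2^n-1}^{k_1} \cdot q_i(a_{2^n-1}) & a_{2^n-1}^{k_2} \cdot q_i(a_{2^n-1}) & \cdots & a_{2^n-1}^{k_{|L|}} \cdot q_i(a_{2^n-1}) \end{pmatrix} \tag{9}$$
for $1 \le i \le t - 1$; the last block $A_t$ has the same form with $q_t(x) = 1$, since $p_t(x)$ appears in (6) without a multiplier. The unknown vector $c$ in (7) corresponds to the unknown coefficients of the polynomials $p_i(x)$. The vector $b$ is formed by evaluating $P(x)$ at every element of $\mathbb{F}_{2^n}$. Note that since $P(x)$ corresponds to an S-box, the vector $b$ can be directly obtained from the corresponding S-box lookup table.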
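The requirement that every monomial be a product of two monomials from the basis set can be checked mechanically. The sketch below does so for $n = 4$ and the classes $C_0$, $C_1$, $C_3$ (the choice used later for 4-bit S-boxes); the helper names are hypothetical, and exponents are reduced modulo $x^{2^n} + x$, which keeps $x^{2^n - 1}$ distinct from $x^0$.

```python
from itertools import product

def cyclotomic_class(alpha, n):
    """Return C_alpha = {alpha * 2^i mod (2^n - 1)} as a sorted list."""
    m = (1 << n) - 1
    return sorted({(alpha << i) % m for i in range(n)})

def reduce_exp(e, n):
    """Reduce an exponent e <= 2 * (2^n - 1) using x^(2^n) = x."""
    m = (1 << n) - 1
    return e if e <= m else e - m

def covered_exponents(L, n):
    """Exponents reachable as a product of two monomials with exponents in L."""
    return {reduce_exp(a + b, n) for a, b in product(L, repeat=2)}

n = 4
L = sorted(set().union(*(cyclotomic_class(a, n) for a in (0, 1, 3))))
covered = covered_exponents(L, n)
```

Here `covered` contains every exponent from 0 to 15, so this choice of classes meets the requirement for $n = 4$.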
If the matrix $A$ has rank $2^n$, then we are able to guarantee that the decomposition in (6) exists for every polynomial $P(x)$. To be of full rank $2^n$ the matrix must have a number of columns $\ge 2^n$. This gives us the necessary condition
$$t \cdot |L| \ge 2^n. \tag{10}$$
We stress that (10) is only a necessary condition. Namely, we do not know how to prove that the matrix $A$ will be of full rank when the previous condition is satisfied; this makes our algorithm heuristic. In practice, for random polynomials $q_i(x)$ we almost always obtain a full rank matrix under condition (10).
From (5), we get the condition
$$t \cdot (1 + n \cdot (\ell - 1)) \ge 2^n, \tag{11}$$
where $t$ is the number of polynomials $p_i(x)$ and $\ell$ the number of cyclotomic classes in the set $S$, to evaluate a polynomial $P(x)$ over $\mathbb{F}_{2^n}$. We summarize the above method in Algorithm 1 below. The number of non-linear multiplications required in the combining step (6) is $t - 1$. As mentioned earlier, we need $\ell - 2$ non-linear multiplications to precompute the set $x^L$. Hence the total number of non-linear multiplications required is
$$N_{mult} = (\ell - 2) + (t - 1) = \ell + t - 3, \tag{12}$$
where $t$ is the number of polynomials $p_i(x)$ and $\ell$ the number of cyclotomic classes in the set $S$.
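Algorithm 1 New generic polynomial decomposition algorithm
Input: $P(x) \in \mathbb{F}_{2^n}[x]$.
Output: Polynomials $p_i(x)$, $q_i(x)$ such that $P(x) = \sum_{i=1}^{t-1} p_i(x) \cdot q_i(x) + p_t(x)$.
1: Choose $\ell$ cyclotomic classes $C_{\alpha_i}$ such that, with $L \leftarrow \bigcup_{i=1}^{\ell} C_{\alpha_i}$, the basis set $x^L$ can be computed using $\ell - 2$ non-linear multiplications.
2: Choose $t$ such that $t \cdot |L| \ge 2^n$.
3: For $1 \le i \le t - 1$, choose $q_i(x) \xleftarrow{\$} \mathcal{P}(x^L)$.
4: Construct the matrix $A \leftarrow (A_1 | A_2 | \cdots | A_t)$, where each $A_i$ is the $2^n \times |L|$ matrix given by (9).
5: Solve the linear system $A \cdot c = b$, where $b$ is the evaluation of $P(x)$ at every element of $\mathbb{F}_{2^n}$.
6: Construct the polynomials $p_i(x)$ from the solution vector $c$.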
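The steps of Algorithm 1 can be sketched end to end for a small case. The following is an illustrative sketch only, assuming $n = 4$, $t = 2$, $L = C_0 \cup C_1 \cup C_3$, the field representation $\mathbb{F}_2[a]/(a^4 + a + 1)$ and the PRESENT S-box table as input; all function names, the fixed seed and the retry loop are assumptions made for the example, not part of the algorithm as stated.

```python
import random

N, POLY = 4, 0b10011   # GF(2^4) = F_2[a]/(a^4 + a + 1)
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]   # PRESENT S-box
L = [0, 1, 2, 4, 8, 3, 6, 12, 9]                  # exponents in C_0, C_1, C_3

def mul(a, b):
    """Multiply two elements of GF(2^N)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= POLY
    return r

def power(a, e):
    """a^e by repeated multiplication; x^0 is the constant 1, also at x = 0."""
    r = 1
    for _ in range(e):
        r = mul(r, a)
    return r

def evalp(coeffs, x):
    """Evaluate sum_k coeffs[k] * x^L[k]."""
    acc = 0
    for c, k in zip(coeffs, L):
        acc ^= mul(c, power(x, k))
    return acc

def solve(A, b):
    """Gauss-Jordan elimination over GF(2^N); one solution of A c = b, or None."""
    rows = [row[:] + [rhs] for row, rhs in zip(A, b)]
    ncols, pivots, r = len(A[0]), [], 0
    for col in range(ncols):
        piv = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        inv = power(rows[r][col], (1 << N) - 2)        # inverse of the pivot
        rows[r] = [mul(inv, v) for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                f = rows[i][col]
                rows[i] = [v ^ mul(f, w) for v, w in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
        if r == len(rows):
            break
    if any(row[-1] for row in rows[r:]):
        return None                                    # inconsistent system
    c = [0] * ncols                                    # free variables set to 0
    for i, col in enumerate(pivots):
        c[col] = rows[i][-1]
    return c

def decompose(sbox, t=2, seed=0, max_tries=1000):
    """Draw random q_i until A c = b is solvable; return (p_1..p_t, q_1..q_{t-1})."""
    rng = random.Random(seed)
    xs = range(1 << N)
    for _ in range(max_tries):
        qs = [[rng.randrange(1 << N) for _ in L] for _ in range(t - 1)]
        A = []
        for x in xs:
            row = []
            for q in qs + [None]:                      # last block uses q_t = 1
                qx = 1 if q is None else evalp(q, x)
                row += [mul(power(x, k), qx) for k in L]
            A.append(row)
        c = solve(A, [sbox[x] for x in xs])
        if c is not None:
            return [c[i * len(L):(i + 1) * len(L)] for i in range(t)], qs
    raise RuntimeError("no decomposition found")

ps, qs = decompose(SBOX)
```

The sketch stops at the first $q_1$ for which the system is solvable for this particular S-box; checking that $A$ has full rank $2^n$, as the text describes, would additionally show that the same $q_1$ works for every 4-bit S-box.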
Remark 1. If $A$ has rank $2^n$, then the same set of basis polynomials $q_i(x)$ will yield a decomposition as in (6) for any polynomial $P(x)$. That is, the matrix $A$ is independent of the polynomial $P(x)$ to be evaluated.
Remark 2. Our decomposition method is heuristic because for a given $n$ we do not know how to guarantee that the matrix $A$ has full rank $2^n$. However, for typical values of $n$, say $n = 4, 6, 8$, we can explicitly check that the matrix $A$ has full rank for a particular choice of random polynomials $q_i(x)$. Then any polynomial $P(x)$ can be decomposed using these polynomials $q_i(x)$. In other words, for a given $n$ we can once and for all generate the random polynomials $q_i(x)$ and check that the matrix $A$ has full rank $2^n$, which proves that any polynomial $P(x) \in \mathbb{F}_{2^n}[x]$ can then be decomposed as above. In summary, our method is heuristic for large values of $n$, but can be proven for small values of $n$. Such a proof requires computing the rank of a matrix with $2^n$ rows and a slightly larger number of columns, which takes $O(2^{3n})$ time using Gaussian elimination.
Asymptotic Analysis. Substituting (12) in (11) to eliminate the parameter $\ell$, we get
$$t \cdot (1 + n \cdot (N_{mult} - t + 2)) \ge 2^n,$$
$$N_{mult} \ge \frac{2^n}{n \cdot t} + t - \left(2 + \frac{1}{n}\right). \tag{13}$$
The R.H.S. of the above expression is minimized when $t \approx \sqrt{2^n/n}$, and hence we obtain
$$N_{mult} \ge 2 \cdot \sqrt{\frac{2^n}{n}} - \left(2 + \frac{1}{n}\right). \tag{14}$$
Hence, our heuristic method requires $O(\sqrt{2^n/n})$ non-linear multiplications, which is asymptotically slightly better than the Parity-Split method [CGP+12], which has proven complexity $O(\sqrt{2^n})$. To rigorously establish the above bound for our method, one would have to prove the following statements, which we leave as open problems:
• We can sample the collection $S$ of cyclotomic classes in (3), each having maximal length $n$ (other than $C_0$), using at most $\ell - 2$ non-linear multiplications.
• The condition $t \cdot |L| \ge 2^n$ suffices to ensure that the matrix $A$ has full rank $2^n$.
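The trade-off between $t$ and $\ell$ can also be explored by brute force over integer parameters. The sketch below minimizes $\ell + t - 3$ subject to the counting condition $t \cdot (1 + n \cdot (\ell - 1)) \ge 2^n$; the function name `min_nlm` is hypothetical, and the result is only what the counting argument predicts, under the assumption that every class other than $C_0$ has $n$ elements and that the two open problems above hold.

```python
def min_nlm(n):
    """Return (cost, t, l) minimizing l + t - 3 subject to t * (1 + n*(l - 1)) >= 2^n."""
    best = None
    for t in range(1, (1 << n) + 1):
        l = 2                                   # C_0 and C_1 are always included
        while t * (1 + n * (l - 1)) < (1 << n):
            l += 1                              # smallest l that works for this t
        cost = l + t - 3
        if best is None or cost < best[0]:
            best = (cost, t, l)
    return best

predicted = {n: min_nlm(n) for n in (4, 6, 8)}
```

For $n = 4$, $6$ and $8$ the predicted costs agree with the counts of 2, 5 and 10 non-linear multiplications quoted elsewhere in the text for these field sizes; several $(t, \ell)$ pairs can reach the same cost, and the function returns the first one it finds.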
Table 2 lists the expected minimum number of non-linear multiplications, as determined by (14), for binary fields $\mathbb{F}_{2^n}$ of practical interest. It also lists the actual number of non-linear multiplications that suffices to evaluate any polynomial, for which we have verified that the matrix $A$ has full rank $2^n$, for a particular random choice of the $q_i(x)$ polynomials. We also provide a performance comparison of our method with that of the Cyclotomic Class and the Parity-Split methods from [CGP+12]. Here we do not compare with the results from [RV13] since that work is mainly concerned with the optimization of specific S-boxes and polynomials of specific degrees; however, such a comparison will be made for specific S-boxes in Section 4. In Appendix B, we list the specific choice of parameters $t$ and $L$ that we used in this experiment.
Counting the Linear Operations. From (5) and (6), we get $(2t - 1) \cdot (|L| - 1) + (t - 1)$ as an upper bound on the number of addition operations required to evaluate $P(x)$. This is because each of the $2t - 1$ polynomials $p_i(x)$ and $q_i(x)$ in (6) has (at most) $|L|$ terms, and there are $t$ summands in (6). From (10), we get:
$$(2t - 1) \cdot (|L| - 1) + (t - 1) \approx 2 \cdot t \cdot |L| \approx 2 \cdot 2^n.$$
Similarly, we get $(2t - 1) \cdot |L| \approx 2 \cdot 2^n$ as an estimate for the number of scalar multiplications. Since the squaring operations are used only to compute the list $L$, we need $|L| - \ell \le |L| \approx \sqrt{n \cdot 2^n}$ of them (cf. (13)).
Remark 3. The above count of the linear operations can be significantly reduced if the linear operations are replaced by table lookups as much as possible. Such an approach is particularly well suited for application in the higher-order masking scheme of [CGP+12], where we need to evaluate a given polynomial with many shares and where the processing of linear polynomials with shares is particularly straightforward. More specifically, we can write each $p_i(x)$ as a sum of $\mathbb{F}_2$-linear polynomials $p_{i,j}$, one for each cyclotomic class in the precomputed set $S$ (cf. (3), (4)):
$$p_i(x) = \sum_{j=1}^{\ell} p_{i,j}(x^{\alpha_j}).$$
The polynomials $p_{i,j}$ are $\mathbb{F}_2$-linear and hence are of the form $p_{i,j}(y) = \sum_{k=0}^{n-1} c_{i,j,k} \, y^{2^k}$ with $c_{i,j,k} \in \mathbb{F}_{2^n}$. Similarly, the polynomials $q_i(x)$ can also be expressed in the above form. If we tabulate the values of each of the linear polynomials $p_{i,j}$ and $q_{i,j}$, then it suffices to evaluate $x^{\alpha_j}$ for each cyclotomic class $C_{\alpha_j} \in S$ using only $\ell - 2$ NLMs. Then the polynomials $p_{i,j}$ and $q_{i,j}$ can be evaluated by just table lookups, and each of the $2t - 1$ polynomials $p_i$ and $q_i$ can then be evaluated with $\ell - 1$ additions each. Finally, we need $t - 1$ more additions in step (6). Hence, we need neither scalar multiplications nor squarings using this table lookup technique. The total number of additions we need is
$$(2t - 1) \cdot (\ell - 1) + (t - 1).$$
Note that this technique is not very effective for the evaluation method of [RV13] since nearly every linear polynomial that appears has at most two non-zero terms.
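The table lookup idea of Remark 3 relies on the additivity of such tabulated polynomials, which a short sketch can make explicit. It assumes $\mathbb{F}_{2^4}$ with reduction polynomial $a^4 + a + 1$; the coefficient values and the names `gf_mul` and `linearized_table` are made up for illustration.

```python
def gf_mul(a, b, n=4, poly=0b10011):
    """Multiply a and b in GF(2^n) with the given reduction polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:
            a ^= poly
    return r

def linearized_table(coeffs, n=4):
    """Tabulate p(y) = sum_k coeffs[k] * y^(2^k) for every y in GF(2^n)."""
    table = []
    for y in range(1 << n):
        acc, yk = 0, y
        for c in coeffs:
            acc ^= gf_mul(c, yk)
            yk = gf_mul(yk, yk)        # next Frobenius power y^(2^(k+1))
        table.append(acc)
    return table

T = linearized_table([0x3, 0x0, 0xA, 0x7])   # hypothetical coefficients

# Lookups on shares recombine to the lookup on the secret value.
y_shares = [0x9, 0x4, 0xE]                   # shares of y = 0x9 ^ 0x4 ^ 0xE
out_shares = [T[s] for s in y_shares]
```

Because `T[a ^ b] == T[a] ^ T[b]` for all inputs, a single table serves every share, which is why only additions remain once the tables are stored.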

New Lower Bound for Polynomial Evaluation
In this section, we show that our method from the previous section is asymptotically optimal. More precisely, we show that there are polynomials over $\mathbb{F}_{2^n}$ for which any evaluation algorithm must use $\Omega(\sqrt{2^n/n})$ non-linear multiplications. This improves the previously known bound of $\Omega(\log_2 n)$ from [RV13].
To establish our lower bound we first need a formal model that describes polynomial evaluation over $\mathbb{F}_{2^n}$. Such a model, the $\mathbb{F}_{2^n}$-polynomial chain, has been described in [RV13, Section 3]. For the sake of completeness, we briefly recollect the definition in Appendix A.
Previous Result. Let us recollect in slightly more detail the previous lower bound of $\Omega(\log_2 n)$. The following proposition gives a lower bound on the number of non-linear multiplications necessary to evaluate a polynomial $P(x)$, a.k.a. the non-linear complexity of $P(x)$, as the maximum of the quantity necessary to evaluate its monomials. Let $\mathcal{M}(P(x))$ denote the non-linear complexity of $P(x)$. If $P(x)$ corresponds to an $n$-bit S-box $S$, then $\mathcal{M}(P(x))$ is also called the masking complexity of $S$.
Proposition 1 [RV13]. Let $P(x) = \sum_{i=0}^{2^n-1} a_i x^i$ be a polynomial in $\mathbb{F}_{2^n}[x]$. Then
$$\mathcal{M}(P(x)) \ge \max_{0 < i < 2^n - 1,\ a_i \ne 0} m_n(i),$$
where $m_n(i)$ is the length of the shortest cyclotomic-class (CC) addition chain of $i$ w.r.t. $n$.
The following result gives a lower bound on the value of $m_n(i)$ in terms of the Hamming weight of $i$.
Proposition 2 [RV13]. $m_n(i) \ge \log_2(\nu(i))$, where $\nu(i)$ is the Hamming weight of the binary representation of $i$ ($0 \le i \le 2^n - 2$).
Since $\nu(2^n - 2) = n - 1$, polynomials having the monomial $x^{2^n - 2}$ have non-linear complexity at least $\log_2(n - 1)$. Hence $\Omega(\log_2 n)$ is a lower bound on the number of non-linear multiplications necessary to evaluate polynomials over $\mathbb{F}_{2^n}$.
New Lower Bound. Our technique to prove the lower bound of $\Omega(\sqrt{2^n/n})$ on the non-linear complexity is similar to the one used in the proof of [PS73, Theorem 2]. However, we emphasize that their result is not applicable to our setting since they work over the integers and the cost model used there is different from the one used in our case.
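Proposition 3. There exists a polynomial $P(x) \in \mathbb{F}_{2^n}[x]$ such that $\mathcal{M}(P(x)) \ge \sqrt{2^n/n} - 2$.
Proof. At a more abstract level, an $\mathbb{F}_{2^n}$-polynomial chain evaluating $P(x) \in \mathbb{F}_{2^n}[x]$ that uses $r$ non-linear multiplications ($r \ge 0$) can be equivalently described as a sequence $Z$ of polynomials $z_{-1}, z_0, \ldots, z_r$, where $z_{-1} = 1$, $z_0 = x$, and
$$z_k = \Big(\beta_{k,-1} + \sum_{i=0}^{k-1} \sum_{j=0}^{n-1} \beta_{k,i,j} \cdot z_i^{2^j}\Big) \cdot \Big(\beta'_{k,-1} + \sum_{i=0}^{k-1} \sum_{j=0}^{n-1} \beta'_{k,i,j} \cdot z_i^{2^j}\Big), \tag{15}$$
where $k = 1, 2, \ldots, r$ and $\beta_{k,-1}, \beta'_{k,-1}, \beta_{k,i,j}, \beta'_{k,i,j} \in \mathbb{F}_{2^n}$. Lastly,
$$P(x) = \beta_{r+1,-1} + \sum_{i=0}^{r} \sum_{j=0}^{n-1} \beta_{r+1,i,j} \cdot z_i^{2^j}, \tag{16}$$
where again $\beta_{r+1,-1}, \beta_{r+1,i,j} \in \mathbb{F}_{2^n}$. Since the squaring operation is $\mathbb{F}_2$-linear in $\mathbb{F}_{2^n}$, and since $x^{2^n} = x$ for all $x \in \mathbb{F}_{2^n}$, it is easy to see that any polynomial that can be evaluated using at most $r$ non-linear multiplications will be of the form given in (16).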
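Since there are only $(2^n)^{2^n}$ distinct polynomials in $\mathbb{F}_{2^n}[x]$ (i.e. up to evaluation), and a given set of values for the parameters $\beta$ enables the evaluation of a single polynomial only, we get the following necessary condition to evaluate all polynomials over $\mathbb{F}_{2^n}$:
$$(2^n)^{n r^2 + (2n + 2) r + (n + 1)} \ge (2^n)^{2^n}, \quad \text{i.e.} \quad n \cdot r^2 + (2n + 2) \cdot r + (n + 1) \ge 2^n, \tag{17}$$
where the exponent on the left counts the parameters $\beta$ of the chain: $2(1 + nk)$ of them for each $z_k$ ($1 \le k \le r$) and $1 + n(r + 1)$ for $P(x)$. Solving for $r$ gives
$$r \ge \sqrt{\frac{2^n + 1 + 1/n}{n}} - \left(1 + \frac{1}{n}\right) \ge \sqrt{\frac{2^n}{n}} - 2.$$
Hence there exist polynomials over $\mathbb{F}_{2^n}$ that require $\Omega(\sqrt{2^n/n})$ non-linear multiplications to evaluate them.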
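The counting argument can be turned into concrete numbers with a few lines of code. The sketch below assumes the parameter count used above, namely that a chain with $r$ non-linear multiplications has $n r^2 + (2n + 2) r + n + 1$ free parameters, and returns the smallest $r$ for which this count reaches $2^n$; the function name `min_r` is hypothetical, and the values it produces are derived from this count alone and are not copied from Table 3.

```python
def min_r(n):
    """Smallest r with n*r^2 + (2n + 2)*r + n + 1 >= 2^n."""
    r = 0
    while n * r * r + (2 * n + 2) * r + n + 1 < (1 << n):
        r += 1
    return r

bounds = {n: min_r(n) for n in range(4, 11)}
```

This bound applies to the hardest polynomial over a given field, so a specific S-box may still have a stronger individual bound, for instance from Proposition 1.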
The above proposition shows that our new method from Section 2.2 is asymptotically optimal.
Concrete Lower Bound. Table 3 gives concrete values of this lower bound. Note that there is still a gap between the lower bound from Table 3 and the achievable value of $N_{mult}$ for our method in Table 2. This is because in our method the decomposition of $P(x)$ as
$$P(x) = \sum_{i=1}^{t-1} p_i(x) \cdot q_i(x) + p_t(x) \tag{18}$$
is performed by first generating the polynomials $q_i(x)$ randomly and independently of $P(x)$, in order to have a linear system of equations over the coefficients of $p_i(x)$. Instead, one could try to solve (18) for both the $p_i(x)$ and the $q_i(x)$ polynomials simultaneously; however, this gives a quadratic system of equations, which is much harder to solve.

Application to various S-boxes
In this section, we apply the generic method described in Section 2, to several well known S-boxes.Using our new method, we reduce the number of nonlinear multiplications required in each case, resulting in an improvement over the previously known techniques.
We stress that in our method for an n-bit S-box, the maximum number of non-linear multiplications required is invariant of the choice of the S-box when n is fixed.Hence, the number of non-linear multiplications obtained for a fixed n actually provides an upper bound on the masking complexity of an S-box of size n.

CLEFIA and Other 8-bit S-boxes
The CLEFIA block cipher has two 8-bit S-boxes [SSA+07]. Let us denote the S-box lookup table for either of the S-boxes as $S_{clefia}$. We choose
$$L = C_0 \cup C_1 \cup C_3 \cup C_7 \cup C_{29} \cup C_{87} \cup C_{251}.$$
This implies that after choosing $t = 6$, and then 5 basis polynomials $q_i \xleftarrow{\$} \mathcal{P}(x^L)$ ($1 \le i \le 5$), the following system of equations is constructed over $\mathbb{F}_{2^8}$:
$$Q(x_j) + p_6(x_j) = S_{clefia}[x_j] \ \text{ for every } x_j \in \mathbb{F}_{2^8}, \quad \text{where } Q(x) = \sum_{i=1}^{5} p_i(x) \cdot q_i(x).$$
We have checked that for some random choice of the polynomials $q_i(x)$ the corresponding matrix $A$ has full rank 256, and therefore we can determine the polynomials $p_i(x)$. Given the solution to the above system, the S-box evaluation is then the same as evaluating the polynomial $Q(x) + p_6(x)$. To evaluate all the monomials in $\{x, x^3, x^7, x^{29}, x^{87}, x^{251}\}$ we need 5 non-linear multiplications, implying that any monomial in $x^L$, any $q_i(x)$ (randomly chosen from $\mathcal{P}(x^L)$) and any $p_i(x)$ can all together be evaluated with 5 non-linear multiplications. Moreover, the evaluation of $Q(x)$ requires 5 additional non-linear multiplications.
Therefore the total number of non-linear multiplications required for evaluating the S-box is 10.
Note that it requires at least 4 non-linear multiplications to evaluate the polynomials corresponding to the two S-boxes of CLEFIA by any method. This is because these two polynomials over $\mathbb{F}_{2^8}$ have degrees 252 (S-box $S_0$) and 254 (S-box $S_1$), and the result follows from Proposition 1.
Invariance. If we choose some other 8-bit S-box, then the matrix corresponding to the resulting system remains the same. Hence, we will still get a solution to the system for the same set of polynomials $q_i(x)$. This implies that we can use the same set of basis polynomials to obtain polynomials $p_i(x)$ for any other 8-bit S-box. Hence, for any S-box of size 8, the number of non-linear multiplications is at most 10.

PRESENT and Other 4-bit S-boxes
For the 4-bit S-box of PRESENT [BKL+07], we choose $t = 2$ and $L = C_0 \cup C_1 \cup C_3$. By selecting $q_1 \xleftarrow{\$} \mathcal{P}(x^L)$, we construct the following linear system of equations:
$$p_1(x_j) \cdot q_1(x_j) + p_2(x_j) = S_{present}[x_j] \ \text{ for every } x_j \in \mathbb{F}_{2^4}.$$
The non-constant monomials used to construct $q_1(x)$, $p_1(x)$ and $p_2(x)$ are $\{x, x^2, x^4, x^8, x^3, x^6, x^{12}, x^9\}$. All of these monomials can be evaluated with a single non-linear multiplication, and to evaluate $p_1(x) \cdot q_1(x)$ we need only one more non-linear multiplication. Hence, the PRESENT S-box evaluation requires 2 multiplications. As in the case of 8-bit S-boxes, this proves that with the same $q_1(x)$ any 4-bit S-box can be evaluated with 2 multiplications. Table 4 gives the corresponding polynomials for the PRESENT S-box. The polynomial corresponding to the PRESENT S-box has degree 14 and hence, from Proposition 1, its masking complexity is at least 2 [RV13]. This implies that our evaluation method achieves optimal complexity for the PRESENT S-box.

(m, n)-bit S-box: Application to DES
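Table 4. Basis polynomial $q_1(x)$ for 4-bit S-boxes, and solutions $p_1(x)$, $p_2(x)$ for the PRESENT S-box. The irreducible polynomial is $a^4 + a + 1$ over $\mathbb{F}_2$.
We now consider S-boxes whose output size $n$ is smaller than the input size $m$, as for the DES S-boxes with $m = 6$ and $n = 4$. We can view an $(m, n)$-bit S-box ($m > n$) as a mapping from $\mathbb{F}_{2^m}$ to $\mathbb{F}_{2^n}$. Given any such S-box table $S$, we want to construct a system of linear equations
$$G(x_j) = S[x_j] \ \text{ for every } x_j \in \mathbb{F}_{2^m}, \quad \text{where } G(x) = \sum_{i=1}^{t-1} p_i(x) \cdot q_i(x) + p_t(x). \tag{22}$$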
Note that each $S[x_j]$ is an element of the smaller field $\mathbb{F}_{2^n}$, but each $G(x_j)$ is an element of the larger field $\mathbb{F}_{2^m}$. One trivial way to remove this inconsistency is to consider $S[x_j]$ as an element of the larger field $\mathbb{F}_{2^m}$, by padding the most significant bits of the S-box output with 0's. Then, we determine the polynomials $p_i(x)$ by solving the corresponding system $A \cdot c = S$, as described in Section 2.2. However, intuitively this is not optimal, since we are creating an artificial constraint to be satisfied by the coefficients of the polynomials $p_i(x)$, namely that the $m - n$ most significant bits of $G(x)$ must be 0, while eventually these most significant bits will simply be discarded after the evaluation of $G(x)$, since to get $S(x)$ we only keep the $n$ least significant bits of $G(x)$. Instead, we consider the representations of the unknown coefficients of the polynomials $p_i(x)$ over $\mathbb{F}_2$ instead of $\mathbb{F}_{2^m}$, and we transform the system of linear equations (22) over $\mathbb{F}_{2^m}$ into a system of linear equations over $\mathbb{F}_2$. By doing this, from each constraint on $G(x_j)$ we generate $m$ equations over $\mathbb{F}_2$, instead of one equation over $\mathbb{F}_{2^m}$. Note that each of these $m$ equations will be an affine combination of the unknown bits of the coefficients of the polynomials $p_i(x)$. Only $n$ of these equations are actually necessary, since the output of the S-box is of size $n$ bits. By equating each of these equations to the corresponding output bit of the S-box, we get a transformed system of linear equations $B \cdot c = S$, where $B$ is an $(n \cdot 2^m) \times (t \cdot |L| \cdot m)$ matrix over $\mathbb{F}_2$ and $L$ is the set of elements from the chosen cyclotomic classes. By solving this transformed system over $\mathbb{F}_2$ we determine the polynomials $p_i(x)$.
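The passage from one equation over $\mathbb{F}_{2^m}$ to $m$ equations over $\mathbb{F}_2$ rests on the fact that multiplying an unknown coefficient $c$ by a known field value $v$ is $\mathbb{F}_2$-linear in the bits of $c$. The sketch below builds the corresponding $m \times m$ binary matrix for $m = 6$; the reduction polynomial $a^6 + a + 1$ and the names `gf_mul`, `mul_matrix` and `apply_bits` are assumptions made for the example.

```python
def gf_mul(a, b, m=6, poly=0b1000011):
    """Multiply a and b in GF(2^m) with the given reduction polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return r

def mul_matrix(v, m=6):
    """Binary matrix M (rows as bitmasks) with bits(c * v) = M * bits(c) over F_2."""
    cols = [gf_mul(1 << k, v) for k in range(m)]     # image of each basis element a^k
    # bit k of row i is set when unknown bit c_k contributes to output bit i
    return [sum(((cols[k] >> i) & 1) << k for k in range(m)) for i in range(m)]

def apply_bits(M, c):
    """Multiply the binary matrix M by the bit vector of c."""
    return sum((bin(row & c).count("1") & 1) << i for i, row in enumerate(M))

v = 0b101101
M = mul_matrix(v)
low_rows = M[:4]      # only the 4 least significant output bits are constrained
```

Each row of `M` yields one equation over $\mathbb{F}_2$ in the bits of the unknown coefficient; for a $(6, 4)$-bit S-box only the rows in `low_rows` are kept, which is where the $n \cdot 2^m$ row count of $B$ comes from.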
Example of DES. The DES block cipher has 8 (6, 4)-bit S-boxes [oST93]. A DES S-box is a mapping from $\mathbb{F}_{2^6}$ to $\mathbb{F}_{2^4}$. In [RV13], the authors consider the S-boxes as a mapping from $\mathbb{F}_{2^6}$ to $\mathbb{F}_{2^6}$, where the two most significant bits of the output of the S-box are fixed to 0, and as recalled in Section 2.1 the evaluation can be done with 7 non-linear multiplications. Also, for the same representation, there is a lower bound of 3 non-linear multiplications necessary to evaluate each DES S-box [RV13]. From Table 2, using our generic method over $\mathbb{F}_{2^6}$ we can perform the evaluation with 5 non-linear multiplications. Below we show that by working over $\mathbb{F}_2$ as explained above, only 4 non-linear multiplications are required.
We choose $L = C_0 \cup C_1 \cup C_3 \cup C_7$, $t = 3$, and $q_1(x), q_2(x) \xleftarrow{\$} \mathcal{P}(x^L)$. Then, using our method, we transform the resulting linear system of equations over $\mathbb{F}_{2^6}$ into a system over $\mathbb{F}_2$. That is, instead of embedding $S_{\mathrm{des}}$ into $\mathbb{F}_{2^6}$, we write the system of equations over $\mathbb{F}_2$. This can be done by considering the binary representation of $x^{\alpha}$ evaluated at any given value in $\mathbb{F}_{2^6}$. This gives 6 equations over $\mathbb{F}_2$ for each constraint on $Q(x_j) + p_3(x_j)$. Out of these 6 equations only 4 are necessary, since the output of a DES S-box is a 4-bit value. By solving this new system of linear equations over $\mathbb{F}_2$ we can determine $p_i(x)$ for each $i$.
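To make the parameter choice concrete, the following short Python sketch enumerates the cyclotomic classes for $m = 6$ and compares the number of unknowns with the number of equations, first over $\mathbb{F}_{2^6}$ and then over $\mathbb{F}_2$, using the matrix dimensions stated above. The function name `cyclotomic_class` and the variable names are illustrative assumptions. Having at least as many unknowns as equations is only a necessary condition for the system to be solvable for every S-box; it does not by itself guarantee full rank.

```python
def cyclotomic_class(alpha, n):
    """C_alpha = {alpha * 2^i mod (2^n - 1) : 0 <= i < n}."""
    return {(alpha << i) % ((1 << n) - 1) for i in range(n)}

m, n_out, t = 6, 4, 3            # input bits, output bits, number of p_i(x)

# L = C_0 u C_1 u C_3 u C_7
L = set().union(*(cyclotomic_class(a, m) for a in (0, 1, 3, 7)))

unknowns_field = t * len(L)          # coefficients in F_{2^6}: 57
equations_field = 1 << m             # one equation per input:   64
unknown_bits = t * len(L) * m        # coefficient bits in F_2:  342
equations_bits = n_out * (1 << m)    # n equations per input:    256

print(len(L), unknowns_field, equations_field, unknown_bits, equations_bits)
```

With these parameters there are fewer unknowns than equations over $\mathbb{F}_{2^6}$ (57 against 64), whereas over $\mathbb{F}_2$ the 342 unknown bits exceed the 256 equations that remain once the two unused output bits are dropped.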
The number of multiplications required to evaluate $q_1(x), q_2(x)$ is 2, and $Q(x)$ can be evaluated with 2 additional multiplications. Hence, the total number of non-linear multiplications required is only 4. In Appendix C we give an example of basis polynomials $q_1(x), q_2(x)$ for DES and the solution polynomials $p_i(x)$ corresponding to the system of linear equations for the first DES S-box $S_1$.
As previously, once we obtain a full rank matrix for a set of randomly fixed $q_1(x), q_2(x)$, for any other (6, 4)-bit S-box we can use this basis to find the corresponding polynomials $p_i(x)$, since the matrix $A$ is independent of the S-box. Hence we can conclude that the masking complexity of any (6, 4)-bit S-box is at most 4.
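The full-rank condition can be checked with Gaussian elimination over $\mathbb{F}_2$. The sketch below is a generic illustration, not the authors' code: it stores each matrix row as an integer bitmask (an assumed representation) and uses hypothetical helper names `gf2_rank` and `solvable_for_all_rhs`. When the rank equals the number of rows, the system has a solution for every right-hand side, which is why one good basis serves every S-box of the same dimensions.

```python
def gf2_rank(rows):
    """Rank over F_2 of a matrix whose rows are given as integer bitmasks."""
    basis = []                      # reduced rows with pairwise distinct leading bits
    for r in rows:
        for b in basis:
            # r ^ b is smaller than r exactly when r contains b's leading bit
            r = min(r, r ^ b)
        if r:
            basis.append(r)
    return len(basis)

def solvable_for_all_rhs(rows):
    """True if the rows are linearly independent, i.e. full row rank."""
    return gf2_rank(rows) == len(rows)

# Tiny example: 2 equations in 3 unknown bits, independent rows.
print(solvable_for_all_rhs([0b011, 0b101]))          # True
# Third row is the XOR of the first two, so the rank stays at 2.
print(solvable_for_all_rhs([0b011, 0b101, 0b110]))   # False
```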

Implementation Results: DES
We have performed a software implementation of the CGPQR countermeasure [CGP+12] for DES that incorporates our new polynomial evaluation technique requiring only 4 NLMs. We have implemented this in C on a Dell Latitude 13 notebook running Ubuntu 12.04 Linux. The processor is an Intel Core 2 Duo (32-bit architecture) running at 1.3 GHz. Our implementation is based on the source code available from [Cor13]. The present implementation is also publicly available at [Cor13]. We have used the technique of tabulating linear polynomials from Remark 3 in the implementation of our polynomial evaluation method. Note that the tables corresponding to the linear polynomials need to be stored only in ROM.
In Table 5, we have compared the above timing results with those of the CGPQR countermeasure implemented with the Roy-Vivek technique [RV13], which requires 7 NLMs, and with those of the higher-order table recomputation method of Coron [Cor14]. In Table 5, the parameter $t$ refers to the order of security and $n$ refers to the number of shares in the full security model of [ISW03]. Note the relation $n = 2t + 1$. The (RAM) memory requirement (in bytes) is provided only for the S-box computations, and the overall execution time for a DES encryption is in milliseconds. The penalty factor (PF) gives the ratio of the execution time of a given method to that of an unprotected implementation. The actual number of calls to the random number generator is 1000 times the reported quantity.

Though $\cdot$ and the scalar-multiplication operator both perform multiplication in $\mathbb{F}_{2^n}$, the latter is reserved for multiplication by a scalar. A step such as $\lambda_j \cdot \lambda_k$ denotes a non-linear multiplication. Let the number of non-linear multiplications involved in a chain $S$ be denoted by $N(S)$. Then the non-linear complexity of $P(x)$, denoted by $M(P(x))$, is defined as $M(P(x)) = \min_{S} N(S)$.
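As an illustration of how $N(S)$ is counted, the following Python sketch runs a chain over $\mathbb{F}_{2^6}$ and tallies only the steps of the form $\lambda_j \cdot \lambda_k$. The step encoding (`"add"`, `"sq"`, `"smul"`, `"mul"`), the function names, and the reduction polynomial $x^6 + x + 1$ are assumptions made for this example; in particular, treating addition and squaring as the other permitted step types follows the usual convention for such chains and is not stated in the text above. The sketch computes $N(S)$ for one given chain; it does not search for the minimum that defines $M(P(x))$.

```python
M, MOD = 6, 0b1000011   # assumed representation of F_{2^6}: x^6 + x + 1

def gf_mul(a, b):
    """Multiply a and b as polynomials over F_2, reduced modulo MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MOD
    return r

def run_chain(x, chain):
    """Evaluate a chain starting from lambda_0 = x; return (last value, N(S))."""
    lam, nlm = [x], 0
    for step in chain:
        op = step[0]
        if op == "add":                       # lambda_j + lambda_k (linear)
            lam.append(lam[step[1]] ^ lam[step[2]])
        elif op == "sq":                      # lambda_j^2 (linear over F_2)
            lam.append(gf_mul(lam[step[1]], lam[step[1]]))
        elif op == "smul":                    # scalar * lambda_j (linear)
            lam.append(gf_mul(step[1], lam[step[2]]))
        elif op == "mul":                     # lambda_j * lambda_k (non-linear)
            lam.append(gf_mul(lam[step[1]], lam[step[2]]))
            nlm += 1
        else:
            raise ValueError("unknown step: %r" % (op,))
    return lam[-1], nlm

# Chain for x^3: lambda_1 = lambda_0^2, lambda_2 = lambda_0 * lambda_1.
chain_x3 = [("sq", 0), ("mul", 0, 1)]
value, n_nlm = run_chain(0b000010, chain_x3)
print(value, n_nlm)   # 8 1
```

Note that the counter treats every `"mul"` step as non-linear, so a squaring should be written as `"sq"` rather than as a product of a value with itself.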
B Heuristics for choosing parameters t and L

C Evaluation Polynomials for DES S-boxes
In Table 6 we give an example of basis polynomials $q_1(x), q_2(x)$ for DES, and Table 7 shows the solution polynomials $p_i(x)$ corresponding to the system of linear equations for the first DES S-box $S_1$.

Table 1. Number of non-linear multiplications required for the CGPQR generic higher-order masking scheme.

Table 2. Minimum values of $N_{\mathrm{mult}}$.

Table 3. Lower bound for non-linear complexity in $\mathbb{F}_{2^n}$. We compare, for various values of $n$, the previously known lower bound for non-linear complexity with the new lower bound as determined by (17).

Table 5. Comparison of secure implementations of DES.