Flush, Gauss, and Reload – A Cache Attack on the BLISS Lattice-Based Signature Scheme

We present the first side-channel attack on a lattice-based signature scheme, using the Flush+Reload cache attack. The attack is targeted at the discrete Gaussian sampler, an important step in the Bimodal Lattice Signature Scheme (BLISS). After observing only 450 signatures with a perfect side-channel, an attacker is able to extract the secret BLISS key in less than 2 minutes, with a success probability of 0.96. Similar results are achieved in a proof-of-concept implementation using the Flush+Reload technique with fewer than 3500 signatures. We show how to attack sampling from a discrete Gaussian using CDT or rejection sampling by showing potential information leakage via cache memory. For both sampling methods, a strategy is given to use this additional information, finalize the attack, and extract the secret key. We provide experimental evidence for the idealized perfect side-channel attacks and for the Flush+Reload attack on two recent CPUs.


Introduction
The possible advent of general purpose quantum computers will undermine the security of all widely deployed public key cryptography. Ongoing progress towards building such quantum computers recently motivated standardization bodies to set up programs for standardizing post-quantum public key primitives, focusing on schemes for digital signatures, public key encryption, and key exchange [6,18,24].
A particularly interesting area of post-quantum cryptography is lattice-based cryptography; there exist efficient lattice-based proposals for signatures, encryption, and key exchange [8,22,14,27,3,37,1], and several of the proposed schemes have implementations, including implementations in open source libraries [34]. While the theoretical and practical security of these schemes is under active research, the security of implementations is an open issue.
In this paper we make a first step towards understanding implementation security, presenting the first side-channel attack on a lattice-based signature scheme. More specifically, we present a cache-attack on the Bimodal Lattice Signature Scheme (BLISS) by Ducas, Durmus, Lepoint, and Lyubashevsky from CRYPTO 2013 [8], attacking a research-oriented implementation made available by the BLISS authors at [7]. We present attacks on the two implemented methods for sampling from a discrete Gaussian and for both successfully obtain the secret signing key.
Note that most recent lattice-based signature schemes use noise sampled according to a discrete Gaussian distribution to achieve provable security and reductions to standard assumptions. Hence, our attack might be applicable to many other implementations. It is possible to avoid our attack by using schemes which avoid discrete Gaussians, at the cost of more aggressive assumptions [13].
1.1. The attack target. BLISS is the most recent piece in a line of work on identification-scheme-based lattice signatures, also known as signatures without trapdoors. An important step in the signature scheme is blinding a secret value in a way that makes the signature statistically independent of the secret key. For this, a blinding (or noise) value y is sampled according to a discrete Gaussian distribution. In the case of BLISS, y is an integer polynomial of degree less than some system parameter n, and each coefficient is sampled separately. Essentially, y is used to hide the secret polynomial s in the signature equation z = y + (−1)^b (s · c), where the noise polynomial y and the bit b are unknown to an attacker and c is the challenge polynomial from the identification scheme, which is given as part of the signature (z, c).
If an attacker learns the noise polynomials y for a few signatures, he can compute the secret key using linear algebra and guessing the bit b per signature. Actually, the attacker will only learn the secret key up to sign, but for BLISS −s is also a valid secret key.
1.2. Our contribution. In this work we present a Flush+Reload attack on BLISS. We implemented the attack for two different algorithms for Gaussian sampling. First we attack the CDT sampler with guide table, as described in [30] and used as the default sampler in the attacked implementation [7]. CDT is the fastest way of sampling discrete Gaussians, but requires a large table stored in memory. Then we also attack a rejection-sampling-based sampler that was proposed in [8] and is also provided in [7].
On a high level, our attacks exploit cache access patterns of the implementations to learn a few coefficients of y per observed signature. We then develop mathematical attacks that use this partial knowledge of the different y_j's, together with the public signature values (z_j, c_j), to compute the secret key, given observations from sufficiently many signatures.
In detail, there is an interplay between the requirements of the offline attack and the restrictions on the sampling. First, restricting to cache access patterns that provide relatively precise information means that the online phase only extracts a few coefficients of y_j per signature. This means that trying all guesses for the bits b per signature becomes a bottleneck. We circumvent this issue by only collecting coefficients of y_j in situations where the respective coefficient of s · c_j is zero, as in these cases the bit b_j has no effect.
Second, each such collected coefficient of y_j leads to an equation with some coefficients of s as unknowns. However, it turns out that for CDT sampling the cache patterns do not give exact equations. Instead, we learn equations which hold with high probability, but might be off by ±1 with non-negligible probability. We managed to turn the computation of s into a lattice problem and show how to solve it using the LLL algorithm [21]. For rejection sampling we can obtain exact equations, but at the expense of requiring more signatures.
We first tweaked the BLISS implementation to provide us with the exact cache lines used, modeling a perfect side-channel. For BLISS-I, designed for 128 bits of security, the attack on CDT needs to observe on average 441 signatures during the online phase. Afterwards, the offline phase succeeds after 37.6 seconds with probability 0.66. This corresponds to running LLL once. If the attack does not succeed at first, a few more signatures (on average a total of 446) are sampled and LLL is run with a randomized selection of inputs. The combined attack succeeds with probability 0.96, taking a total of 85.8 seconds. Similar results hold for other BLISS versions. In the case of rejection sampling, we are given exact equations and can use simple linear algebra to finalize the attack, with a success probability of 1.0, taking 14.7 seconds in total.
To remove the assumption of a perfect side-channel, we performed a proof-of-concept attack using the Flush+Reload technique on a modern laptop. This attack achieves similar success rates, albeit requiring 3438 signatures on average for BLISS-I with CDT sampling. For rejection sampling, we now had to deal with measurement errors. We did this again by formulating a lattice problem and using LLL in the final step. The attack succeeds with a probability of 0.88 after observing an average of 3294 signatures.
1.3. Structure. In Section 2, we give brief introductions to lattices, BLISS, and the used methods for discrete Gaussian sampling, as well as to cache-attacks. In Section 3, we present two information leaks through cache memory for CDT sampling and provide a strategy to exploit this information for secret-key extraction. In Section 4, we present an attack strategy for the case of rejection sampling. In Section 5, we present experimental results for both strategies assuming a perfect side-channel. In Section 6, we show that realistic experiments also succeed, using Flush+Reload attacks.

Preliminaries
This section describes the BLISS signature scheme and the discrete Gaussian samplers it uses. It also provides some background on lattices and cache attacks.
A lattice Λ is the set of all integer linear combinations of a set of linearly independent vectors b_1, . . . , b_m ∈ R^n. We call {b_1, . . . , b_m} a basis of Λ and define m as the rank. We represent the basis as a matrix B = (b_1, . . . , b_m), which contains the vectors b_i as row vectors. In this paper, we mostly consider full-rank lattices, i.e. m = n, unless stated otherwise. Given a basis B ∈ R^{n×n} of a full-rank lattice Λ, we can apply any unimodular transformation matrix U ∈ Z^{n×n} and UB will also be a basis of Λ. The LLL algorithm [21] transforms a basis B into an LLL-reduced basis B′ in polynomial time. An LLL-reduced basis has the property that its shortest vector v satisfies ||v||_2 ≤ 2^{(n−1)/4} (|det(B)|)^{1/n}, with looser bounds for the other basis vectors. Here ||·||_2 denotes the Euclidean norm. Besides the LLL-reduced basis, NTL's [33] implementation of LLL also returns the unimodular transformation matrix U satisfying UB = B′.
In cryptography, lattices are often defined via polynomials, e.g., to take advantage of efficient polynomial arithmetic. The elements of R = Z[x]/(x^n + 1) are represented as polynomials of degree less than n. For each polynomial f(x) ∈ R we define the corresponding vector of coefficients as f = (f_0, f_1, . . . , f_{n−1}). Addition of polynomials f(x) + g(x) corresponds to addition of their coefficient vectors f + g. Additionally, multiplication f(x) · g(x) mod (x^n + 1) defines a multiplication operation on the vectors, f · g = gF = fG, where F, G ∈ Z^{n×n} are matrices whose columns are the rotations of (the coefficient vectors of) f, g, with possibly opposite signs. Lattices using polynomials modulo x^n + 1 are often called NTRU lattices after the NTRU encryption scheme [14].
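As a concrete illustration, the correspondence between ring multiplication and multiplication by a rotation matrix can be checked with a small script. This is an illustrative sketch only; here the rotations are stored as rows so that the product is the row-vector product gF, i.e. the transpose of the column convention in the text.

```python
def negacyclic_mul(f, g):
    """Coefficient vector of f(x)*g(x) mod (x^n + 1), with n = len(f)."""
    n = len(f)
    res = [0] * (2 * n - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            res[i + j] += fi * gj
    # reduce: x^(n+k) = -x^k
    return [res[k] - (res[n + k] if k < n - 1 else 0) for k in range(n)]

def rotation_matrix(f):
    """Rows are the negacyclic rotations f(x)*x^j mod (x^n + 1)."""
    n = len(f)
    rows, row = [], list(f)
    for _ in range(n):
        rows.append(row)
        # multiplying by x shifts coefficients up; the top one wraps negated
        row = [-row[-1]] + row[:-1]
    return rows

def vecmat(g, F):
    """Row vector g times matrix F."""
    return [sum(gj * Fj[k] for gj, Fj in zip(g, F)) for k in range(len(F))]
```

With these helpers, `negacyclic_mul(f, g)` agrees with `vecmat(g, rotation_matrix(f))` and, by commutativity, with `vecmat(f, rotation_matrix(g))`.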
An integer lattice is a lattice whose basis vectors are in Z^n, such as the NTRU lattices just described. For integer lattices it makes sense to consider elements modulo q, so basis vectors and coefficients are taken from Z_q. We represent the ring Z_q by the integers in [−q/2, q/2). We denote the quotient ring R/(qR) by R_q. When we work in R_q = Z_q[x]/(x^n + 1) (or R_{2q}), we assume n is a power of 2 and q is a prime such that q ≡ 1 mod 2n.

BLISS.
We provide the basic algorithms of BLISS, as given in [8]. Details of the motivation behind the construction and the associated security proofs are given in the original work. All arithmetic for BLISS is performed in R, possibly with each coefficient reduced modulo q or 2q. We follow the notation of BLISS and also use boldface notation for polynomials.
By D_σ we denote the discrete Gaussian distribution with standard deviation σ. In the next subsection, we zoom in on this distribution and how to sample from it in practice. The main parameters of BLISS are the dimension n, the modulus q, and the standard deviation σ. BLISS uses a cryptographic hash function H, which outputs binary vectors of length n and weight κ; parameters δ_1 and δ_2, determining the density of the polynomials forming the secret key; and d, determining the length of the second signature component.
Signature generation (Algorithm 2.2) uses p = ⌊2q/2^d⌋, which corresponds to the d highest order bits of the modulus 2q, and the constant ζ = (q − 2)^{−1} mod 2q. In general, by ⌊·⌉_d we denote the d highest order bits of a number. In Step 1 of Algorithm 2.2, two integer vectors are sampled, where each coordinate is drawn independently according to the discrete Gaussian distribution D_σ. This is denoted by y ← D_{Z^n,σ}.
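For illustration, these constants can be computed directly. The sketch below uses the BLISS-I values q = 12289 and d = 10, and reads ⌊·⌉_d as plain truncation of the d lowest-order bits; the implementation's exact rounding variant is defined in [8].

```python
q = 12289                 # BLISS-I modulus (prime, q ≡ 1 mod 2n)
d = 10                    # number of dropped low-order bits (BLISS-I)
p = (2 * q) >> d          # ⌊2q / 2^d⌋, the range of the truncated values

def drop_bits(u, d):
    """Keep the highest-order bits of u by discarding the d lowest bits."""
    return u >> d

# ζ is the inverse of (q - 2) modulo 2q, i.e. ζ·(q − 2) ≡ 1 (mod 2q)
zeta = pow(q - 2, -1, 2 * q)
```

For BLISS-I this gives p = 24, and one can verify that ζ·(q − 2) mod 2q is indeed 1.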

Algorithm 2.2 BLISS Signature Algorithm
Input: Message µ, public key A = (a_1, q − 2), secret key S = (s_1, s_2)
Output: Signature (z_1, z_2†, c)
1: y_1, y_2 ← D_{Z^n,σ}
2: u = ζ · a_1 · y_1 + y_2 mod 2q
3: c = H(⌊u⌉_d mod p, µ)
4: choose a uniformly random bit b
5: z_1 = y_1 + (−1)^b s_1 · c mod 2q
6: z_2 = y_2 + (−1)^b s_2 · c mod 2q
7: continue with a probability based on σ, ||Sc||, ⟨z, Sc⟩ (details in [8]), else restart
8: z_2† = (⌊u⌉_d − ⌊u − z_2⌉_d) mod p
9: return (z_1, z_2†, c)

In the attacks, we concentrate on the first signature vector z_1, since z_2† only contains the d highest order bits and has therefore lost information about s_2 · c; furthermore, A and f determine s_2 as shown above. So in the following, we only consider z_1, y_1, and s_1, and thus leave out the indices.
In lines 5 and 6 of Algorithm 2.2, we compute s · c over R_{2q}. However, since the secret s is sparse and the challenge c is sparse and binary, we have ||s · c||_∞ ≤ 5κ ≪ 2q, with ||·||_∞ the ℓ_∞-norm. This means these computations are simply additions over Z, and we can therefore model this computation as a vector-matrix multiplication over Z: z = y + (−1)^b sC, where C ∈ {−1, 0, 1}^{n×n} is the matrix whose columns are the rotations of the challenge c (with minus signs matching reduction modulo x^n + 1). In the attacks we access individual coefficients of s · c; note that the jth coefficient equals ⟨s, c_j⟩, where c_j is the jth column of C.
For completeness, we also show the verification procedure (Algorithm 2.3), although we do not use it further in this paper. Note that reductions modulo 2q are done before truncating and reducing modulo p.

Discrete Gaussian distribution.
The (centered) discrete Gaussian distribution is a probability distribution over Z with mean 0 and standard deviation σ. A value x ∈ Z is sampled with probability ρ_σ(x)/ρ_σ(Z), where ρ_σ(x) = exp(−x^2/(2σ^2)) and ρ_σ(Z) = Σ_{x∈Z} ρ_σ(x). Note that the sum in the denominator ensures that this is actually a probability distribution.
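This probability mass function can be tabulated directly. The sketch below truncates the infinite sum at an illustrative tail-cut of 12σ (tail-cuts are discussed in the next paragraph); outside that interval the mass is negligible.

```python
import math

def rho(x, sigma):
    """rho_sigma(x) = exp(-x^2 / (2 sigma^2))."""
    return math.exp(-x * x / (2.0 * sigma * sigma))

def dgauss_pmf(sigma, tau=12):
    """Probability mass rho_sigma(x) / rho_sigma(Z), with the support
    truncated to [-tau*sigma, tau*sigma] so the normalizing sum is finite."""
    bound = int(tau * sigma)
    norm = sum(rho(x, sigma) for x in range(-bound, bound + 1))
    return {x: rho(x, sigma) / norm for x in range(-bound, bound + 1)}
```

The resulting table sums to 1, is symmetric around 0, and decays monotonically away from the mean.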
To make sampling practical, most lattice-based schemes use three simplifications. First, a tail-cut τ is used, restricting the support of the Gaussian to the finite interval [−τσ, τσ]; the tail-cut τ is chosen such that the probability of a sample from the exact discrete Gaussian landing outside this interval is negligible. Second, values are sampled from the positive half of the support and a bit is then flipped to determine the sign. For this, the probability of obtaining zero in [0, τσ] needs to be halved; the resulting distribution on the non-negative integers is denoted by D_σ^+. Finally, the precision of the sampler is chosen such that the statistical distance between the output distribution and the exact distribution is negligible.
There are two generic ways of sampling from a discrete Gaussian distribution: using the cumulative distribution function [26] or via rejection sampling [10]. Both methods are deployed with some improvements, which we describe next; the modified versions are implemented in [7]. We note that there are also other ways [9,32,31,4] of efficiently sampling discrete Gaussians.
CDT sampling. The basic idea of using the cumulative distribution function in the sampler is to approximate the probabilities p_y = P[x ≤ y | x ← D_σ], computed with λ bits of precision, and save them in a large table. At sampling time, one samples a uniformly random r ∈ [0, 1) and performs a binary search through the table to locate y ∈ [−τσ, τσ] such that r ∈ [p_{y−1}, p_y). Restricting to the non-negative part [0, τσ] corresponds to using the probabilities p*_y = P[x ≤ y | x ← D_σ^+]. While this is the most efficient approach, it requires a large table. We denote the method that uses the approximate cumulative distribution function, with tail cut and the modifications described next, as the CDT sampling method.
One can speed up the binary search for the correct sample y in the table by using an additional guide table I [30,20,5]. The BLISS implementation we attack uses a guide table I with 256 entries. For each u ∈ {0, . . . , 255}, the guide table stores the smallest interval I[u] = (a_u, b_u) such that p*_{a_u} ≤ u/256 and p*_{b_u} ≥ (u + 1)/256. The first byte of r is used to select I[u], leading to a much smaller interval for the binary search. Effectively, r is picked byte-by-byte, stopping once a unique value for y is obtained. The CDT sampling algorithm with guide table is summarized in Algorithm 2.4.
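A compact sketch of this sampler follows. It is illustrative only: a floating-point r stands in for the byte-by-byte lazy sampling of the implementation, and the tables are built on the fly rather than precomputed to λ bits of precision.

```python
import bisect
import math
import random

def build_cdt(sigma, tau=12):
    """Approximate p*_y = P[x <= y | x <- D^+_sigma] for y in [0, tau*sigma]."""
    w = [math.exp(-y * y / (2.0 * sigma * sigma))
         for y in range(int(tau * sigma) + 1)]
    w[0] /= 2.0                        # halve P[0] so the sign flip is unbiased
    total = sum(w)
    cdf, acc = [], 0.0
    for weight in w:
        acc += weight
        cdf.append(acc / total)
    return cdf

def build_guide(cdf):
    """I[u] = (a_u, b_u): smallest index interval covering [u/256, (u+1)/256)."""
    guide = []
    for u in range(256):
        a = bisect.bisect_right(cdf, u / 256.0)
        b = bisect.bisect_left(cdf, (u + 1) / 256.0)
        guide.append((a, min(b, len(cdf) - 1)))
    return guide

def cdt_sample(cdf, guide):
    r = random.random()
    a, b = guide[min(int(r * 256), 255)]            # first byte selects I[u]
    y = bisect.bisect_right(cdf, r, lo=a, hi=b + 1) # binary search inside I[u]
    y = min(y, len(cdf) - 1)
    return y if random.getrandbits(1) else -y
```

Restricting the binary search to I[u] is exactly what makes the two table look-ups (into I and into the CDT) observable as a pair of cache lines in the attack.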
Rejection sampling. The basic idea behind rejection sampling is to sample a uniformly random integer y ∈ [−τσ, τσ] and accept this sample with probability proportional to ρ_σ(y): a uniformly random value r ∈ [0, 1) is sampled, and y is accepted iff r ≤ ρ_σ(y). This method has two huge downsides: calculating the values ρ_σ(y) to high precision is expensive, and the rejection rate can be quite high.
In the same paper introducing BLISS [8], the authors also propose a more efficient rejection sampling algorithm. We recall the algorithms used (Algorithms 2.5, 2.6, 2.7); more details are given in the original work. We refer to this method as rejection sampling in the remainder of this paper.
The basic idea is to first sample a value x according to the binary discrete Gaussian distribution D_{σ_2}, where σ_2 = sqrt(1/(2 ln 2)) (Step 1 of Algorithm 2.5). This can be done efficiently using uniformly random bits [8]. The actual sample y = Kx + z, where z ∈ {0, . . . , K − 1} is sampled uniformly at random and K = ⌊σ/σ_2⌋ + 1, is then distributed according to the target discrete Gaussian distribution D_σ by rejecting with a certain probability (Step 4 of Algorithm 2.5). The number of rejections in this case is much lower than in the original method. This step still requires computing a bit whose probability is an exponential value. However, it can be done more efficiently using Algorithm 2.7, which samples a bit with probability exp(−x/(2σ^2)) for x ∈ [0, 2^ℓ) and requires only a small table ET.
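The construction can be sketched as follows. This is toy code: a table-weighted choice stands in for the bit-level binary-Gaussian sampler of Algorithm 2.6, and a direct call to exp replaces the table-driven bit sampling of Algorithm 2.7; the acceptance probability exp(−z(z + 2Kx)/(2σ^2)) is the one from [8].

```python
import math
import random

SIGMA2 = math.sqrt(1.0 / (2.0 * math.log(2.0)))  # std dev of the binary Gaussian

def sample_binary_gauss(max_x=16):
    """Stand-in for Algorithm 2.6: x has probability proportional to 2^(-x^2)."""
    weights = [2.0 ** (-(x * x)) for x in range(max_x)]
    return random.choices(range(max_x), weights=weights)[0]

def sample_dgauss_rejection(K):
    """Sample from D_sigma with sigma = K * sigma_2 (BLISS chooses K integral)."""
    sigma = K * SIGMA2
    while True:
        x = sample_binary_gauss()
        z = random.randrange(K)
        y = K * x + z
        # accept with probability exp(-z(z + 2Kx) / (2 sigma^2)); the real
        # implementation evaluates this bit via the small table ET (Alg. 2.7)
        if random.random() > math.exp(-z * (z + 2 * K * x) / (2.0 * sigma * sigma)):
            continue
        if y == 0 and random.getrandbits(1):
            continue                     # halve P[0] before the sign flip
        return y if random.getrandbits(1) else -y
```

The multiplicative structure y = Kx + z is what the attack in Section 4 exploits: when z = 0 no acceptance bit needs to be computed, so the table ET is not touched.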

[Algorithm 2.4: CDT Sampling With Guide Table]

Algorithm 2.7 (remaining steps):
1: sample A_i with probability ET[i]
4: if A_i = 0 then return 0
5: return 1

2.4. Cache attacks. The cache is a small bank of memory which exploits the temporal and spatial locality of memory accesses to bridge the speed gap between the faster processor and the slower memory. The cache consists of cache lines which, on modern Intel architectures, each store a 64-byte aligned block of memory of size 64 bytes.
In a typical processor there are several cache levels. At the top level, closest to the execution core, is the L1 cache, which is the smallest and fastest of the hierarchy. Each successive level (L2, L3, etc.) is bigger and slower than the preceding one.
When the processor accesses a memory address, it looks for the block containing the address in the L1 cache. On a cache hit, the block is found in the cache and the data is accessed. Otherwise, on a cache miss, the search continues in lower levels, eventually retrieving the memory block from a lower level or from memory. The cache then evicts a cache line and replaces its contents with the retrieved block, allowing faster future access to the block.
Because cache misses require searches in lower cache levels, they are slower than cache hits. Cache timing attacks exploit this timing difference to leak information [2,28,25,12,23]. In a nutshell, when an attacker uses the same cache as a victim, the victim's memory accesses change the state of the cache. The attacker can then use the timing variations to check which memory blocks are cached and from that deduce which memory addresses the victim has accessed. Ultimately, the attacker learns the cache line of the victim's table access: a range of possible values for the index of the access.
In this work we use the Flush+Reload attack [36,12]. A Flush+Reload attack uses the clflush instruction of the x86-64 architecture to evict a memory block from the cache. The attacker then lets the victim execute before measuring the time needed to access the memory block. If the victim accessed an address within the block during its execution, the block will be cached and the attacker's access will be fast. If, however, the victim has not accessed the block, the attacker will reload the block from memory, and the access will take much longer. Thus, the attacker learns whether the victim accessed the memory block during its execution. The Flush+Reload attack has been used to attack implementations of RSA [36], AES [12,17], ECDSA [35,29], and other software [38,11].

Attack 1: CDT Sampling
This section presents the mathematical foundations of our cache attack on CDT sampling. We first explain the phenomena we can observe from cache misses and hits in Algorithm 2.4, and then show how to exploit them to derive the secret BLISS signing key using LLL. Sampling of the first noise polynomial y ← D_{Z^n,σ} is done coefficient-wise; similarly, the cache attack targets the coefficients y_i for i = 0, . . . , n − 1 independently.

Weaknesses in cache.
Sampling from a discrete Gaussian distribution using both an interval table I and a table T with the actual values might leak information via cache memory. The best we can hope for is to learn the cache lines of the index u into the interval table and of the index I_z of the table look-up in T. Note that we cannot learn the sign of the sampled coefficient y_i. Also, the cache line of T[I_z] always leaves a range of possible values for |y_i|. However, in some cases we can get more precise information by combining the cache lines of the look-ups in both tables.
Here are two kinds of observations that narrow down the possibilities: Intersection patterns, which combine the observed cache line of I with the observed cache line of T, and Last-Jump patterns, which use intervals I[u] whose binary search touches multiple cache lines of T (examples of both are given in Section 5). We will restrict ourselves to cache access patterns that give even more precision, at the expense of requiring more signatures: 1. The first restriction is to only look at cache weaknesses (of type Intersection or Last-Jump) in which the number of possible values for the sample |y_i| is two. Since we do a binary search within an interval, this is the most precision one can get (unless an interval is unique): after the last comparison (table look-up in T), one of two values will be returned. This means that by picking either of these two values we limit the error in |y_i| to at most 1.

2. The probabilities of sampling values using CDT sampling with guide table I are known to satisfy the probability requirement stated above per interval. Due to this condition, it is possible that adjacent intervals partially overlap, i.e., for some u, v the intervals I[u] and I[v] contain a common value. So we search for values γ_1, γ_2 such that P[y_i = γ_1 | y_i ∈ {γ_1, γ_2}] = 1 − α for small α, which also matches access patterns satisfying the first restriction. Then, if we observe a matching access pattern, it is safe to assume the outcome of the sample is γ_1.

3. The last restriction is to only look at cache-access patterns which reveal that |y_i| ≥ β · E[|⟨s, c⟩|] for some constant β ≥ 1, where the expectation is an easy calculation using the distributions of s and c. If we use this restriction in our attack targeted at coefficient y_i of y, we learn the sign of y_i by looking at the sign of the coefficient z_i of z, since z_i = y_i + (−1)^b ⟨s, c_i⟩ and |y_i| dominates the second term. So by requiring that |y_i| be larger than this expected value, we expect to learn the sign of y_i. We therefore omit the absolute-value signs and simply write that we learn y_i. There is some flexibility in these restrictions when choosing the parameters α, β: choosing them too restrictively might leave no usable cache-access patterns, while choosing them too loosely makes other parts of the attack fail.
In the last part of the attack, described next, we use LLL to find short vectors of a certain (random) lattice that we create using BLISS signatures. We noticed that LLL works very well on these lattices, probably because the bases used are sparse; this implies that the vectors are already relatively short and fairly orthogonal. The parameter α determines the shortness of the vector we look for, and therefore influences whether an algorithm like LLL finds our vector. For the experiments described in Section 5, we required α ≤ 0.1. This made it possible, for every parameter set used in the experiments, to always have at least one cache-access pattern to use.
The parameter β influences the probability of making a huge mistake when comparing the values of y_i and z_i. However, for the parameters we used in the experiments, we did not find recognizable cache-access patterns corresponding to small y_i. This means we did not need to use this last restriction to reject certain cache-access patterns.

Exploitation.
For simplicity, we assume we have one specific cache access pattern which reveals whether y_i ∈ {γ_1, γ_2} for i = 0, . . . , n − 1 of the polynomial y, and, if this is the case, that y_i has probability 1 − α of taking the value γ_1, with small α. In practice, however, there might be more than one cache weakness satisfying the above requirements; this would allow the attacker to search for more than one cache access pattern used by the victim. For the attack, we assume the victim is creating N signatures (z_j, c_j) for j = 1, . . . , N, and an attacker is gathering these signatures with the associated cache information for the noise polynomial y_j. We assume the attacker can search for the specific cache access pattern, for which he can determine whether y_ji ∈ {γ_1, γ_2}. For the cases revealed by cache access patterns, the attacker ends up with the following equation:

z_ji = y_ji + (−1)^{b_j} ⟨s, c_ji⟩,    (2)

where the attacker knows the coefficient z_ji of z_j and the rotated coefficient vectors c_ji of the challenge c_j (both from the signatures), as well as y_ji ∈ {γ_1, γ_2} of the noise polynomial y_j (from the side-channel attack). Unknown to the attacker are the bit b_j and the secret vector s.
If z_ji = γ_1, the attacker knows that ⟨s, c_ji⟩ ∈ {0, 1, −1}. Moreover, with high probability 1 − α the value will be 0, as by the second restriction y_ji is biased towards the value γ_1. So if z_ji = γ_1, the attacker adds ξ_k = c_ji to a list of good vectors. The restriction z_ji = γ_1 means that the attacker will in some cases not use the information in Equation (2), even though he knows that y_ji ∈ {γ_1, γ_2}.
When the attacker has collected enough of these vectors, he can build a matrix L ∈ {−1, 0, 1}^{n×n} whose columns are the ξ_k's. This matrix satisfies

sL = v,    (3)

for some unknown but short vector v. The attacker does not know v, so he cannot simply solve for s, but he does know that v has norm about sqrt(αn) and lies in the lattice spanned by the rows of L. He can use a lattice reduction algorithm like LLL on L to search for v. LLL also outputs the unimodular matrix U satisfying UL = L′. The attack tests, for each row of U (and its rotations), whether it is sparse and could be a candidate for s = f. As stated before, the correctness of a secret-key guess can be verified using the public key. This last step does not always succeed, only with high probability. To make sure the attack succeeds, the process is randomized: instead of collecting exactly n vectors ξ_k = c_ji, we gather m > n vectors and pick a random subset of n vectors as input for LLL. While we do not have a formal analysis of the success probability, experiments (see Section 5) confirm that this method works and finds the secret key (or its negative) within a few rounds of randomization.
A summary of the attack is given in Algorithm 3.1.

Algorithm 3.1 Cache-attack on BLISS with CDT Sampling
Input: Access to the cache memory of a victim with key-pair (A, S); the BLISS parameters n, σ, q, κ; access to signature polynomials (z_1, z_2†, c) produced using S. The victim uses CDT sampling with tables T, I for the noise polynomials y. A cache weakness allows determining whether a coefficient y_i of y satisfies y_i ∈ {γ_1, γ_2}; when this is the case, the value of y_i is biased towards γ_1.
Output: Secret key S.
1: Let k = 0 be the number of vectors collected so far and let M = [] be an empty list of vectors.
2: while (k < m): // Collect m vectors ξ_k before randomizing LLL.
3: Collect a signature (z_1, z_2†, c), together with cache information for each coefficient y_i of the noise polynomial y.

Attack 2: Rejection Sampling
In this section, we discuss the foundations and strategy of our second cache attack, on the rejection-based sampler (Algorithms 2.5, 2.6, and 2.7). We show how to exploit the fact that this method uses a small table ET, leaking very precise information about the sampled value. 4.2. Exploitation. We use the same methods as described in Section 3.2, but now we know that for a certain cache access pattern the coefficient y_i ∈ {0, ±K, ±2K, . . .}, i = 0, . . . , n − 1, of the noise polynomial y. If max |⟨s, c⟩| ≤ κ < K (which anyone can check using the public parameters and which holds for typical implementations), we can determine y_i completely using the knowledge of the signature vector z. When more signatures (z_j, c_j), j = 1, . . . , N, are created, the attacker can search for the specific access pattern and verify whether y_ji ∈ {0, ±K, ±2K, . . .}, where y_ji is the i-th coefficient of the noise polynomial y_j.
If the attacker knows that y_ji ∈ {0, ±K, ±2K, . . .} and additionally z_ji = y_ji holds, where z_ji is the i-th coefficient of the signature polynomial z_j, he knows that ⟨s, c_ji⟩ = 0. If this is the case, the attacker includes the coefficient vector ξ_k = c_ji in the list of good vectors. Also in this attack the attacker will discard some known y_ji if it does not satisfy z_ji = y_ji.
Once the attacker has collected n of these vectors, he can form a matrix L ∈ {−1, 0, 1}^{n×n} whose columns are the ξ_k's, satisfying sL = 0, where 0 is the all-zero vector. With very high probability, the ξ_k's have no dependency other than the one introduced by s. This means s is the only kernel vector.
Note the subtle difference from Equation (3): we do not need to randomize the process, because we know the right-hand side is the all-zero vector. The attack procedure is summarized in Algorithm 4.1.
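The offline part of this attack is just a kernel computation. Below is a toy sketch with n = 8; the real attack uses n = 512 for BLISS-I and columns obtained from observed signatures, and exact rational elimination here stands in for a proper linear-algebra library.

```python
from fractions import Fraction

def left_kernel(L):
    """Basis of {x : x @ L = 0}, via exact RREF on the transpose of L.
    L is a list of n rows; its columns are the collected vectors xi_k."""
    n, m = len(L), len(L[0])
    # rows of A are the columns xi_k of L, so A @ x = 0  <=>  x @ L = 0
    A = [[Fraction(L[i][k]) for i in range(n)] for k in range(m)]
    pivots, r = [], 0
    for c in range(n):
        piv = next((i for i in range(r, m) if A[i][c] != 0), None)
        if piv is None:
            continue                       # free column
        A[r], A[piv] = A[piv], A[r]
        inv = A[r][c]
        A[r] = [v / inv for v in A[r]]     # normalize pivot row
        for i in range(m):
            if i != r and A[i][c] != 0:    # eliminate above and below
                f = A[i][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        pivots.append(c)
        r += 1
        if r == m:
            break
    # standard RREF nullspace construction: one basis vector per free column
    free = [c for c in range(n) if c not in pivots]
    basis = []
    for fc in free:
        v = [Fraction(0)] * n
        v[fc] = Fraction(1)
        for i, pc in enumerate(pivots):
            v[pc] = -A[i][fc]
        basis.append(v)
    return basis
```

If the collected columns span the full orthogonal complement of s, the kernel is one-dimensional and the single basis vector is a scalar multiple of the secret (up to sign), exactly as the attack requires.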

Possible extensions.
One might ask why we do not always use the knowledge of y_ji, since we can completely determine its value, and work with a non-zero right-hand side. Unfortunately, the bits b_j of the signatures are unknown, which means an attacker would have to run a linear solver 2^N times, where N is the number of required signatures (grouping columns appropriately if they come from the same signature). For large N this becomes infeasible, and N is typically on the scale of n. By requiring that z_ji = y_ji, we remove the unknown bit b_j from Equation (2).
Similarly to the first attack, an attacker might also use vectors ξ_k = c_ji with ⟨s, c_ji⟩ ∈ {−1, 0, 1}, in combination with LLL and possibly randomization. This approach might help if fewer signatures are available, but the easiest way is to require exact knowledge, which comes at the expense of needing more signatures but yields a very fast and efficient offline part.

Results With a Perfect Side-Channel
In this section we provide experimental results where we assume the attacker has access to a perfect side-channel: no errors are made in measuring the table look-ups.

5.2. CDT sampling. When the signing algorithm uses CDT sampling as described in Algorithm 2.4, the perfect side-channel provides the values ⌊u/8⌋ and ⌊I_z/8⌋ of the table accesses for u and I_z in tables I and T. We apply the attack strategy of Section 3.
We first need to find cache-line patterns of type Intersection or Last-Jump which reveal that y_i ∈ {γ_1, γ_2} and P[y_i = γ_1 | y_i ∈ {γ_1, γ_2}] = 1 − α with α ≤ 0.1. One way to do that is to construct two tables: one that lists the elements I[u] belonging to certain cache lines of table I, and one that lists the accessed elements I_z inside these intervals I[u] belonging to certain cache lines of table T. We can then brute-force search for all cache weaknesses of type Intersection or Last-Jump. For example, in BLISS-I the first seven elements of I belong to the first cache line of I, but elements in I[7] = {7, 8} access the element I_z = 8, which is part of the second cache line of T. This is an Intersection weakness: if the first cache line of I is accessed and the second cache line of T is accessed, we know y_i ∈ {7, 8}. Similarly, one can find Last-Jump weaknesses by searching for intervals I[u] that access multiple cache lines of T. Once we have these weaknesses, we need to apply the bias restriction with α ≤ 0.1; this can be done by looking at all bytes except the first of the corresponding entries of T. For each set of parameters we found at least one such weakness using the above method (see Table B.1 for the values).
We collect m (possibly rotated) coefficient vectors c_j and then run LLL at most t = 2(m − n) + 1 times, each time searching for s in the unimodular transformation matrix using the public key. We consider the experiment failed if the secret key is not found after this number of trials; the randomly constructed lattices have a lot of overlap in their basis vectors, which means that increasing t further is not likely to help. We performed 1000 repetitions of each experiment (different parameters and sizes of m) and measured the success probability p_succ, the average number of required signatures N to retrieve m usable challenges, and the average length of v when it was found. The expected number of required signatures E[N] is also given, as well as the running time of the LLL trials. This expected number can be computed as E[N] = m/(n · P[CP]), where CP is the event of a usable cache-access pattern for a coordinate of y.
From the results in Table B.2 we see that, although BLISS-0 is a toy example (with security level λ ≤ 60), it requires the largest average number N of signatures to collect m columns, i.e., before the LLL trials can begin. This illustrates that the cache-attack depends mainly on σ rather than on the dimension n. For BLISS-0 with σ = 100, there is only one usable cache weakness under the restrictions we made.
For all cases, we see that a small increase of m greatly increases the success probability p_succ. The experimental results suggest that picking m ≈ 2n suffices to get a success probability close to 1.0. This means that, at the cost of observing somewhat more signatures, the offline part almost always succeeds.

Rejection sampling.
When the signature algorithm uses rejection sampling from Algorithm 2.6, a perfect side-channel determines whether there has been a table access in table ET. Thus, we can apply the attack strategy given in Section 4. We require m = n (possibly rotated) challenges c_i to start the kernel calculation. We learn whether any element has been accessed in table ET, e.g., by checking the cache-lines belonging to the small part of the table. We performed only 100 experiments this time, since we noticed that p_succ = 1.0 for all parameter sets with a perfect side-channel. This means that the probability that n random challenges c are linearly independent is close to 1.0. We state the average number N of required signatures in Table B.3. This time, the expected number is simply E[N] = m/(n · P[CP]), where CP is now the event that y_i ∈ {0, ±K, ±2K, ...}, with K = ⌊σ/σ_2⌋ + 1 and tail-cut τ ≥ 1. Note that the number of required signatures is smaller for BLISS-II than for BLISS-I. This might seem surprising, as one might expect it to increase or to be about the same as for BLISS-I, because the dimensions and security level are the same for these two parameter sets. However, σ is chosen a lot smaller in BLISS-II, which means that the value K is also smaller. This influences N significantly, as the probability of sampling a multiple of K is larger for small σ.
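The effect of σ on this probability can be checked numerically. The sketch below estimates the probability that a discrete Gaussian sample is a multiple of K, taking K = ⌊σ/σ_2⌋ + 1 with σ_2 = sqrt(1/(2 ln 2)) as in the binary-Gaussian construction; the tail cut and the exact σ values used here are illustrative assumptions:

```python
# Numerical estimate of P[y is a multiple of K] for a discrete Gaussian
# with standard deviation sigma, summed up to the tail cut tau * sigma.
import math

def p_multiple_of_K(sigma, tau=13):
    sigma_2 = math.sqrt(1.0 / (2.0 * math.log(2.0)))  # binary Gaussian
    K = int(sigma // sigma_2) + 1
    bound = int(tau * sigma)
    rho = lambda y: math.exp(-y * y / (2.0 * sigma * sigma))
    total = sum(rho(y) for y in range(-bound, bound + 1))
    hits = sum(rho(K * j) for j in range(-(bound // K), bound // K + 1))
    return hits / total

# Smaller sigma gives a smaller K, so multiples of K are sampled more
# often; with sigma = 107 (BLISS-II) versus sigma = 215 (BLISS-I):
assert p_multiple_of_K(215.0) < p_multiple_of_K(107.0)
```

This matches the observation that BLISS-II, despite equal dimension and security level, needs fewer signatures than BLISS-I.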

Proof-of-Concept Implementation
So far, the experimental results were based on the assumption of a perfect side-channel: we assumed that we would get the cache-line of every table look-up in the CDT sampling and rejection sampling. In this section, we relax this assumption and discuss the results of more realistic experiments using the Flush+Reload technique.
When moving to real hardware some of the assumptions made in Section 5 no longer hold.In particular, allocation does not always ensure that tables are aligned at the start of cache lines and processor optimizations may pre-load memory into the cache, resulting in false positives.One such optimization is the spatial prefetcher, which pairs adjacent cache lines into 128-byte chunks and prefetches a cache line if an access to its pair results in a cache miss [16].
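The pairing behaviour can be captured by two small helpers; the 64-byte line size is the common Intel value, and the function names are ours:

```python
# Spatial-prefetcher model: adjacent 64-byte cache lines form aligned
# 128-byte chunks, and touching one line of a chunk may pull in its pair.

LINE_SIZE = 64  # bytes per cache line (typical Intel value)

def buddy(addr):
    """Address of the cache line paired with addr's line in its chunk."""
    line = addr & ~(LINE_SIZE - 1)
    return line ^ LINE_SIZE

def independently_probeable(addr_a, addr_b):
    """True if the two addresses fall in different 128-byte chunks, so a
    prefetch triggered by one cannot create a false positive on the other."""
    return (addr_a // (2 * LINE_SIZE)) != (addr_b // (2 * LINE_SIZE))

# Lines at offsets 0x40 and 0x80 span two chunks and can be probed as a
# pair; lines at 0x00 and 0x40 share a chunk and cannot.
assert independently_probeable(0x40, 0x80)
assert not independently_probeable(0x00, 0x40)
```

This is the constraint behind using "a pair that spans two unpaired cache lines" in the CDT attack below.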
6.1. Flush+Reload on CDT sampling. Due to the spatial prefetcher, Flush+Reload cannot be used consistently to probe two paired cache lines. Consequently, to determine access to two consecutive CDT table elements, we must use a pair that spans two unpaired cache lines. In Table C.1, we show that when the CDT table is aligned at 16 bytes, we can always find such a pair for BLISS-I. Although this is not a proof that our attack works in all scenarios, i.e., for all σ and all offsets, it would also not be a solid defence to pick exactly those scenarios for which our attack would not work, e.g., because α could be increased.
The attack was carried out on an HP Elite 8300 with an i5-3470 processor running CentOS 6.6. Before sampling each coordinate y_i, for i = 0, ..., n − 1, we flush the monitored cache lines using the clflush instruction. After sampling the coordinate, we reload the monitored cache lines and measure the response time. We compare the response times to a pre-defined threshold value to determine whether the cache lines were accessed by the sampling algorithm.
A visualization of the Flush+Reload measurements for CDT sampling is given in Figure 6.1. Using the intersection and last-jump weaknesses of the sampling method in cache-memory, we can determine which value is sampled by the victim by probing two locations in memory. To reduce the number of false positives, we focus on one of the weaknesses from Table B.1 as a target for the Flush+Reload. This means that the other weaknesses are not detected and we need to observe more signatures than with a perfect side-channel before we collect enough columns to start with the offline part of the attack.
We executed 50 repeated attacks against BLISS-I, probing the last-jump weakness for {γ_1, γ_2} = {55, 56}. We completely recovered the private key in 46 out of the 50 cases. On average we required 3438 signatures per attack to collect m = 2n = 1024 equations. We tried LLL five times after the collection and considered the experiment a failure if we did not find the secret key within these five trials. We stress that this is not the optimal strategy to minimize the number of required signatures or to maximize the success probability. However, it indicates that this proof-of-concept attack is feasible.

6.2. Other processors. We also experimented with a newer processor (Intel Core i7-5650U) and found that this processor has a more aggressive prefetcher. In particular, memory locations near the start and the end of the page are more likely to be prefetched. Consequently, the alignment of the tables within the page can affect the attack success rate. We find that in a third of the locations within a page the attack fails, whereas in the other two thirds it succeeds with probabilities similar to those on the older processor. We note that, as demonstrated in Table B.1, there are often multiple weaknesses in the CDT. While some weaknesses may fall in unexploitable memory locations, others may still be exploitable.
Fig. 6.1: Visualization of Flush+Reload measurements of table look-ups for BLISS-I using CDT sampling with guide table I. Two locations in memory are probed, denoted on the vertical axis by 0, 1; they represent two adjacent cache-lines. For interval I[51] = [54, 57], there is a last-jump weakness for {γ_1, γ_2} = {55, 56}, where the outcome of |y_i| is biased towards γ_1 = 55 with α = 0.03. For each coordinate (the horizontal axis), we get a response time for each location we probe: dark regions denote a long response time, while lighter regions denote a short response time. When both of the probed locations give a fast response, the victim accessed both cache-lines for sampling y_i. In this case the attacker knows that |y_i| ∈ {55, 56}; here for i = 8 and i = 41.

6.3. Flush+Reload on rejection sampling. For attacking BLISS using rejection sampling, we need to measure whether table ET has been accessed at all. Due to the spatial prefetcher we are unable to probe all of the cache lines of the table. Instead, we flush all cache lines containing ET before sampling and reload only even cache lines after the sampling. Flushing even cache lines is required for the Flush+Reload attack. We flush the odd cache lines to trigger the spatial prefetcher, which will prefetch the paired even cache lines when the sampling accesses an odd cache line. Thus, flushing all of the cache lines gives us complete coverage of the table even though we only reload half of the cache lines.
Since we do not get error-free side-channel information, we are likely to collect some c_i with ⟨s, c_i⟩ ≠ 0 as columns in L. Instead of computing the kernel (as in the offline part with a perfect side-channel), we used LLL (as in the CDT case) to handle small errors; we gathered more than n columns and randomized the selection of L.
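For intuition on the error-free offline step that LLL replaces here: every usable challenge c satisfies ⟨s, c⟩ = 0, so the secret spans the kernel of the collected challenge matrix. The toy sketch below recovers a hypothetical 4-dimensional secret by exact rational elimination; the real attack works in dimension n = 512 and needs LLL once the side channel introduces errors:

```python
# Toy offline step: recover s (up to scale) as the one-dimensional kernel
# of the matrix of collected challenges, using exact arithmetic.
from fractions import Fraction

def kernel_vector(rows):
    """One nonzero v with row . v = 0 for all rows (kernel dim 1 assumed)."""
    rows = [[Fraction(x) for x in r] for r in rows]
    n = len(rows[0])
    pivots, r = [], 0
    for col in range(n):
        piv = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        rows[r] = [x / rows[r][col] for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                f = rows[i][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
    free = next(c for c in range(n) if c not in pivots)
    v = [Fraction(0)] * n
    v[free] = Fraction(1)
    for i, col in enumerate(pivots):
        v[col] = -rows[i][free]
    return v

# challenges c_i with <s, c_i> = 0 for the toy secret s = (1, -1, 0, 2):
C = [[1, 1, 0, 0], [2, 0, 1, -1], [0, 2, 5, 1]]
v = kernel_vector(C)
v = [x / v[0] for x in v]      # s is recovered up to a scalar multiple
assert v == [1, -1, 0, 2]
```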
We tested the attack on a MacBook Air with the newer processor (Intel Core i7-5650U) running Mac OS X El Capitan. We executed 50 repeated attacks against BLISS-I, probing three out of the six cache lines that cover the ET table. We completely recovered the private key in 44 of these 50 cases. On average we required 3294 signatures per attack to collect m = n + 100 = 612 equations. The experiment is considered a failure if we did not find the secret key after trying LLL five times.

Conclusion.
Our proof-of-concept implementation demonstrates that in many cases we can overcome the limitations of processor optimizations and perform the attack on BLISS. The attack, however, requires a high degree of synchronization between the attacker and the victim, which we achieve by modifying the victim code. For a similar level of synchronization in a real attack scenario, the attacker will have to be able to find out when each coordinate is sampled. One possible approach for achieving this is to use the attack of Gullasch et al. [12] against the Linux Completely Fair Scheduler. The combination of a cache attack with the attack on the scheduler allows the attacker to monitor each and every table access made by the victim, which is more than required for our attacks.

A Parameter Suggestions for BLISS
We provide the parameter suggestions for different levels of security for BLISS, used to test the side-channel attacks. The parameters we focus on are: security level λ, dimension n, modulus q, Gaussian standard deviation σ, secret key sparsity parameters δ_1, δ_2 (so that d_i = nδ_i for i ∈ {1, 2}), and weight κ of the challenge.

2.1. Lattices. We define a lattice Λ as a discrete subgroup of R^n: given m ≤ n linearly independent vectors b_1, ..., b_m ∈ R^n, Λ is the set Λ(b_1, ..., b_m) of all integral linear combinations of the b_i's:

Λ(b_1, ..., b_m) = { Σ_{i=1}^{m} x_i b_i : x_i ∈ Z }.
Intersection: We can intersect knowledge about the used index u in I with the knowledge of the access T[I_z]. Getting the cache-line of I[u] gives a range of intervals, which is simply another (bigger) interval of possible values for sample |y_i|. If the values in the range of intervals are largely non-overlapping with the range of values learned from the access to T[I_z], then the combination gives a much more precise estimate. For example: if the cache-line of I[u] reveals that sample |y_i| is in the set S_1 = {0, 1, 2, 3, 4, 5, 7, 8} and the cache-line of T[I_z] reveals that sample |y_i| must be in the set S_2 = {7, 8, 9, 10, 11, 12, 13, 14, 15}, then by intersecting both sets we know that |y_i| ∈ S_1 ∩ S_2 = {7, 8}, which is much more precise information.

Last-Jump: If the elements of an interval I[u] in I are divided over two cache-lines of T, we can sometimes track the search for the element to sample. If a small part of I[u] is in one cache-line, and the remaining part of I[u] is in another, we are able to distinguish whether this small part has been accessed. For example, interval I[u] = {5, 6, 7, 8, 9} is divided over two cache-lines of T: cache-line T_1 = {0, 1, 2, 3, 4, 5, 6, 7} and cache-line T_2 = {8, 9, 10, 11, 12, 13, 14, 15}. The binary search starts in the middle of I[u], at value 7, which means line T_1 is always accessed. However, only for values {8, 9} is line T_2 also accessed. So if both lines T_1 and T_2 are accessed, we know that sample |y_i| ∈ {8, 9}.
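The last-jump example can be simulated directly. The sketch below runs a binary search over I[u] = {5, ..., 9} that first probes T[7], under the assumed cache-line split T_1 = {0, ..., 7} and T_2 = {8, ..., 15}; the exact probe sequence of a real implementation may differ, but the leakage pattern is the same:

```python
# Which cache lines of T does a binary search over I[u] = {5,...,9}
# touch, as a function of the value being sampled? Line 0 models T_1
# (values 0-7) and line 1 models T_2 (values 8-15).

def touched_lines(target, lo=5, hi=9, line=lambda v: v // 8):
    """Set of T cache lines probed while binary-searching for target."""
    lines = set()
    while lo < hi:
        mid = (lo + hi) // 2       # first probe is T[7] for {5,...,9}
        lines.add(line(mid))
        if target <= mid:
            hi = mid
        else:
            lo = mid + 1
    lines.add(line(lo))            # final comparison against T[lo]
    return lines

# Only targets 8 and 9 ever touch line T_2, so observing both lines
# accessed reveals |y_i| in {8, 9}:
both = [t for t in range(5, 10) if touched_lines(t) == {0, 1}]
assert both == [8, 9]
```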

Input: Big table T[y] containing values p_y of the cumulative distribution function of the discrete Gaussian distribution (using only non-negative values), omitting the first byte. Small table I consisting of the 256 intervals.
In practice, this only happens for u = v + 1, meaning adjacent intervals might overlap. For example, if the probability of sampling x is greater than 1/256, then x has to be an element of at least two intervals I[u]. Because of this, it is possible that for certain parts of an interval I[u], the outcome of the sample is biased. The second restriction is to only consider cache weaknesses for which additionally one of the two values is significantly more likely to be sampled, i.e., if |y_i| ∈ {γ_1, γ_2} ⊂ I[u] is the outcome of cache access patterns, then we further insist on P[y_i = γ_1 | y_i ∈ {γ_1, γ_2}] = 1 − α with α ≤ 0.1.

4.1. Weaknesses in cache. The rejection-sampling algorithm described in Section 2.3 uses a table with exponential values ET[i] = exp(−2^i/(2σ^2)) and inputs of bit-size ℓ = O(log K), which means this table is quite small. Depending on bit i of input x, line 3 of Algorithm 2.7 is performed, requiring a table look-up for value ET[i]. In particular, when input x = 0, no table look-up is required. An attacker can detect this event by examining the cache activity of the sampling process. If this is the case, it means that the sampled value z = 0 in Step 2 of Algorithm 2.5. The possible values for the result of sampling are then y ∈ {0, ±K, ±2K, ...}. So for some cache access patterns, the attacker is able to determine whether y ∈ {0, ±K, ±2K, ...}.
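The x = 0 leakage can be illustrated with a minimal Bernoulli-style rejection sampler in the spirit of Algorithm 2.7. The structure below (one look-up and one coin flip per set bit of x) is a sketch under that reading; function names and the fixed seed are ours:

```python
# Sketch of exponential Bernoulli sampling: accept with probability
# exp(-x / (2 sigma^2)) by flipping, for each set bit i of x, a coin
# with probability ET[i] = exp(-2^i / (2 sigma^2)). Every coin flip
# needs a look-up into ET -- except when x = 0, where no look-up
# happens at all, which is exactly what the cache side channel detects.
import math
import random

random.seed(0)  # deterministic coin flips for the example below

def sample_bernoulli_exp(x, sigma, accessed):
    """Return True with probability exp(-x/(2 sigma^2)); log ET accesses."""
    ell = max(1, x.bit_length())
    ET = [math.exp(-(2 ** i) / (2 * sigma * sigma)) for i in range(ell)]
    for i in range(ell):
        if (x >> i) & 1:
            accessed.append(i)           # observable table access
            if random.random() >= ET[i]:
                return False             # reject
    return True

accessed = []
sample_bernoulli_exp(0, 215.0, accessed)
assert accessed == []                    # x = 0 leaks: no access at all

accessed = []
sample_bernoulli_exp(5, 215.0, accessed) # bits 0 and 2 set
assert accessed == [0, 2]
```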
Table A.1: Parameter suggestions for different security levels of BLISS (from [8]).

B.2. Experimental results for CDT sampling. Table B.2: Experimental results with a perfect side-channel, when BLISS is used with CDT sampling (Algorithm 2.4). For each parameter set, we managed to gather m equations from N signatures. The running time of the offline part is given in seconds.