Design Space Exploration of Galois and Fibonacci Configuration based on Espresso Stream Cipher

Zhengyuan Shi1, Gangqiang Yang∗1, Hailiang Xiong1, Fudong Li2, and Honggang Hu3

1Shandong University
2University of Alberta
3University of Science and Technology of China,
zshi0616@mail.sdu.edu.cn, g37yang@sdu.edu.cn, hailiangxiong@sdu.edu.cn, fudong1@ualberta.ca, hghu2005@ustc.edu.cn

Abstract: Galois and Fibonacci are two different configurations of stream ciphers. Because the Fibonacci configuration is more convenient for cryptanalysis, most ciphers are designed as Fibonacci-configured. So far, although many transformations between Fibonacci and Galois configurations have been proposed, there is no sufficient analysis of their respective hardware performance. The 128-bit secret key stream cipher Espresso, its Fibonacci-configured variant and linear Fibonacci variant have a similar security level. We take them as examples to design the optimization strategies in terms of both area and throughput, investigate which configuration is more efficient in a certain aspect. The Fibonacci-configured Espresso occupies 52 slices on Spartan-3 and 22 slices on Virtex-7, which are the minimum solutions among those three Espresso schemes or even smaller than 80-bit secret key ciphers. Based on our throughput improvement strategy, parallel Espresso design can perform 4.1 Gbps on Virtex-7 FPGA and 1.9 Gbps on Spartan-3 FPGA at most. In brief, the Fibonacci cipher is more suitable for extremely resource-constrained or extremely high-throughput applications, while the Galois cipher seems like a compromise between area and speed. Besides, the transformation from nonlinear feedback to linear feedback is not recommended for any hardware implementations.

Keywords: lightweight cryptography; Espresso; FPGA Optimization; stream cipher; Galois NFSR; Fibonacci NFSR

1 Introduction

The stream ciphers have high throughput for software applications and highly restricted resources for hardware applications. Although some stream ciphers have been proved to contain design flaws [1], [2], [3], but it is undeniable that in addition to security level, throughput and area are also two significant factors for evaluating ciphers. Therefore, ECRYPT launched the eSTREAM project [4], many stream ciphers for confidentiality, integrity and implementation efficiency have been proposed and widely deployed, such as Trivium [5], Grain [6] and MICKEY [7]. Many

∗The corresponding author is Gangqiang Yang
security protocols based on the stream ciphers have also been accepted as international standards, for example, the 3rd Generation Partnership Project (3GPP) specifies UEA2, UIA2 [8] based on stream cipher SNOW 3G [9] and EEA3, EIA3 3GPP:EEA3 based on stream cipher ZUC [10].

Block cipher produces the ciphertext after nonlinear substitution with the plaintext input. Different from block cipher, stream cipher is also regarded as a pseudo-random keystream generator and produces the keystream bits sequence with the same length as plaintext. The ciphertext is generated bit by bit by XOR operation between plaintext and keystream. Generally, the simplified model of stream cipher is divided into two parts: feedback shift register (FSR) and filter function, as in Figure 1. The initial internal state of stream cipher is determined by the input data, including the secret key and initial vector (IV). Then, the internal state, stored in FSR, is updated in each round by the feedback function. When the feedback function in the FSR is nonlinear, the FSR is also called nonlinear feedback shift register (NFSR). Conversely, when the feedback function is linear, the register is called linear feedback shift register (LFSR). The filter function outputs 1 bit as the keystream bit in each round according to the internal state.

The feedback shift register (FSR) is one of the most critical components of stream cipher, which is divided into Galois configuration and Fibonacci configuration according to the feedback bits. As in figure 1, the Fibonacci configuration is more straightforward. Every feedback shift register only has one feedback function which feeds to the \( n - 1 \)th bit at each round, the other bits are consecutive moved one-bit step. Conversely, in the Galois configuration, the FSR has any number of feedback functions.

The stream cipher contains Fibonacci NFSR or Fibonacci LFSR is named as Fibonacci-configured cipher and the cipher contains Galois NFSR or LFSR is called as Galois-configured cipher. Because Fibonacci FSR conforms to cryptanalysis formally, most of the stream ciphers are designed in this configuration. For instance, Grain [6] consists of both Fibonacci NFSR and Fibonacci LFSR, each series [11] [12] [13] in WG family contains a Fibonacci LFSR.

Meanwhile, previous work provided the universal transformations, which can transform the Fibonacci-configured stream cipher and Galois-configured stream cipher to each other [14], and even transform the nonlinear feedback shift register to linear feedback shift register [15].

The analysis and optimization of FSR determine the area occupation and throughput of cipher implementations. In general, the depth of combinational logic circuits is smaller than that in Fibonacci configuration, hence, the Galois-configured stream cipher has a higher maximum frequency. The Galois-configured Grain [16] has a larger throughput of 41.3% than Fibonacci-configured Grain. According to the transformation [14], the feedback login is divided into several parts to drive various bits in feedback shift registers in Galois configuration, without adding any extra feedback login. Therefore, the author implied that in TSMC 90nm ASIC technology library, the circuit area of Galois-configured Grain is not much larger than that of Fibonacci-configured Grain.
1.1 Espresso Stream Cipher

Espresso stream cipher was proposed in 2015 [17] considering hardware size and throughput simultaneously. The original Espresso stream cipher, supporting 128-bit secret key is designed in Galois configuration, consists of a 256-bit Galois-configured non-linear feedback shift register (NFSR) with 14 individual feedback functions and a 20-variable keystream filter function. It supports 128 bits secret key and 96 bits initial vector. The implementations on ATmega328P and ESP8266 processors have shown that Espresso requires the minimum program size, global variables storage and computation time among Grain v1, Grain 128 and Grain 128a [18].

Based on the transformation mentioned above, another two Espresso variants are introduced, Fibonacci-configured Espresso [17] (noted as Espresso-F) and Espresso-like LFSR filter generator [15] (noted as Espresso-L) have been introduced in the previous cryptography designs. Espresso-F can be broken by related key chosen IV attack with $2^{42}$ IVs and time complexity $2^{64}$ [19], Espresso-L can be broken with time complexity $2^{68.44}$ and $2^{66.86}$ under standard algebraic attack and the Ronjom-Helleseth attack respectively [15]. Although the cryptanalysis has proved that the variants do not fulfill the 128-bit security level, they can still be accepted under certain circumstances, such as for lightweight devices [20]. Consequently, the resource occupancy and throughput analysis based on Espresso and its variants may provide guidance and references for the selection of Galois or Fibonacci ciphers.

1.2 Related Work and Motivation

Unlike ASIC, the semicustom Field Programmable Gate Array (FPGA), including reconfigurable combinational logic and flip-flops, is applied to shorten time-to-market, save development costs and implement the domain-specific computing architecture. With those advantages, FPGA is more and more adopted in numerous areas, such as military, industry network, IoT, aeronautics and astronautics. Many kinds of ciphers are deployed in the programmable chips to not only evaluate hardware efficiency but also provide security solutions for FPGA application, such as WG-8 [21], Lizard [22], Grain [23] and Trivium [23].

The eSTREAM project’s [4] requirements for stream cipher are high throughput and more hardware efficiency. Grain [6] and Trivium [5] are regarded as the pioneers of modern lightweight cryptography, because of their compact architectures and efficient hardware performance with only 80-bit security level. Trivium is considered as a kind of Galois-configured cipher and Grain is a typical Fibonacci-configured cipher. So far, there is still no comprehensive discussion of both advantages and disadvantages of Galois and Fibonacci configurations implemented on FPGA. Meanwhile, Espresso and its variants provide an ideal condition for the discussion, they ensure that in the case of similar cryptography attack complexity, the discussion can determine which configuration is more recommended towards a particular scenario.

1.3 Our Contributions

In our paper, we firstly provide a brief overview of original Espresso (Galois configuration [17]) and its variants, including Espresso-F in Fibonacci configuration [17] and Espresso-L, an Espresso-like LFSR filter generator [15]. Based on that, Espresso and its variants are deployed on Xilinx FPGA toward the minimum area and highest throughput, respectively. We then investigate which is the smallest solution aimed for 256 bits NFSR implementation. After that, a hybrid architecture is proposed for throughput improvement, where Espresso initializes serially
and produces multiple keystream bits with a parallel architecture in the running phase during each clock, thus, throughput increases significantly.

We noticed that, unlike AISC implementations, Fibonacci-configured cipher is more area-saved than Galois-configured cipher on FPGA, but the latter is faster. For example, on Virtex-7 FPGA, Fibonacci-configured Espresso only occupies 22 slices with a maximum frequency of 427.2 MHz, while Galois-configured Espresso occupies 25 slices, with a maximum frequency of 491.4 MHz. The reason is that though both NFSRs have the same number of logic gates [14], Galois NFSR feedback functions are independent to be synthesized as look-up-tables, the minimum reconfigurable unit in FPGA, it will lead to more area than Fibonacci NFSR.

Besides, according to our throughput improvement strategy, Fibonacci configuration can support a larger parallel width, i.e., support to use extra resources to increase the number of keystream bits per cycle. The maximum throughput of Fibonacci-configured Espresso is 4.09 Gbps on Virtex-7 FPGA.

Based on our discussion, the transformation from Galois NFSR (original Espresso) to Fibonacci LFSR (Espresso-L) will generate more complex circuit than that to Fibonacci NFSR (Espresso-F). This kind of transformation should be avoided in FPGA implementation.

In brief, Fibonacci-configured stream cipher should be considered for scenarios with area requirements and parallel Fibonacci-configured stream cipher should be used for high throughput applications. Galois-configured stream cipher takes both throughput and area into account, which is suitable for compact devices but slightly requires high throughput.

In a nutshell, we list several main contributions as following:

• We conclude three different configurations of Espresso stream cipher, the original Espresso with Galois NFSR, the variant Espresso with Fibonacci NFSR and another variant Espresso with Fibonacci LFSR. Meanwhile, the comparison of Espresso and its variants show that Galois-configured cipher performs evidently faster than Fibonacci configured cipher on FPGA, but the former occupies more slices. Espresso-like LFSR filter generator variant is not recommended due to larger area and higher latency for hardware implementation and weak resistance for security analysis.

• To deploy cipher on the resource-compact devices, we provide an area optimized method with shift register look-up-table (SRL). According to the ratio of LUT to FF about different series FPGAs. We investigate the minimum area solution by enumerating the SRL replacement length of consecutive registers fragment. Based on the method, the minimum Espresso solution only takes 22 slices on Virtex-7 and 52 slices on Spartan-3. Our efficient Espresso solutions can be deployed on tiny devices, especially emergent IoT wireless devices.

• To fulfill the high throughput requirement, we propose a hybrid architecture which not only realizes the 4-bit solution but improves the parallel width to 8-bit and 16-bit. In short, the additional feedback functions tap more significant internal state bits than the serial feedback function, where, the unloaded feedback function value will be used as a variable if the variable index has been over-range. Although this method will cause more critical path propagation delay, the throughput of 16-bit hybrid architecture can reach more than 4 Gbps on Virtex-7, satisfied high-speed demand.

The following sections are organized as follows. Section 2 lists some cipher variables and notations. Section 2 provides a summary of Espresso design criteria and algorithm flow. The original Espresso in Galois configuration is implemented on FPGA in Section 3, including serial solution and hybrid solution. The Fibonacci-configured Espresso is described in Section 4. What’s more, another throughput improved method achieving more than 4-bit parallel width is
proposed. Section 5 briefs NFSR to Fibonacci LFSR transformation and provides implementation results on FPGA. The comparison between Espresso with its variants and other ciphers is shown in Section 6. Finally, we conclude this paper in Section 7.

2 Overview of Espresso Stream Cipher

2.1 Design of Espresso Stream Cipher

Espresso stream cipher aims at lightweight and high-speed applications. It is designed under the trade-off between hardware area and latency, with 128-bit secret key \((k)\) and 96-bit initial vector \((v)\). Espresso is designed as the fastest lightweight cipher in below 1500 GEs area level, better than Grain-128 and Trivium [17]. The block design diagram of Espresso stream cipher is shown in Figure 2.

![Block Diagram of Espresso Stream Cipher](image)

Figure 2: The block diagram of Espresso stream cipher

As a stream cipher, Espresso has 256-bit nonlinear feedback shift register (NFSR) in Galois configuration with 14 parallel feedback functions specified as follows:

\[
\begin{align*}
    f_{255}(x) &= x_0 \oplus x_{41}x_{70} \\
    f_{251}(x) &= x_{252} \oplus x_{42}x_{83} \oplus x_8 \\
    f_{247}(x) &= x_{248} \oplus x_{44}x_{102} \oplus x_4 \\
    f_{243}(x) &= x_{244} \oplus x_{43}x_{118} \oplus x_{103} \\
    f_{239}(x) &= x_{240} \oplus x_{46}x_{141} \oplus x_{117} \\
    f_{235}(x) &= x_{236} \oplus x_{67}x_{90}x_{110}x_{137} \\
    f_{231}(x) &= x_{232} \oplus x_{50}x_{159} \oplus x_{189} \\
    f_{217}(x) &= x_{214} \oplus x_3x_{32} \\
    f_{213}(x) &= x_{214} \oplus x_4x_{45} \\
    f_{209}(x) &= x_{210} \oplus x_6x_{64} \\
    f_{205}(x) &= x_{206} \oplus x_5x_{80} \\
    f_{201}(x) &= x_{202} \oplus x_8x_{103} \\
    f_{197}(x) &= x_{198} \oplus x_{29}x_{52}x_{72}x_{99} \\
    f_{193}(x) &= x_{194} \oplus x_{12}x_{121}
\end{align*}
\]
We can define those 14 bits in update set \( U \):

\[
U = \{193, 197, 201, 205, 209, 213, 217, 231, 235, 239, 243, 247, 251, 255\}
\] (2)

Note set \( V \) includes all variables used in feedback function. The variable set \( V \) of Espresso is:

\[
V = \{0, 3, 4, 5, 6, 8, 12, 29, 32, 40, 41, 42, 43, 44, 46, 50, 52, 64, 67, 70, 72, 80, 83, 90, 99, 102, 103, 110, 117, 118, 121, 137, 141, 159, 189, 194, 198, 202, 206, 210, 214, 232, 236, 240, 244, 248, 252\}
\] (3)

The NFSR bits which are not in update set \( U \) are loaded from higher bits. Hence, Espresso internal state update is as following:

\[
x_{i}^{t+1} = \begin{cases} 
  f_{i}(x), & i \in U \\
  x_{i}^{t}, & i \notin U 
\end{cases}
\] (4)

It also has a 20-variable nonlinear output function \( h(x) \) as keystream filter function, consists of a 6-variable linear function and a 14-variable nonlinear function:

\[
h(x) = \overline{x_{80}} \oplus x_{99} \oplus x_{137} \oplus x_{227} \oplus x_{222} \oplus x_{187} \oplus x_{243} \oplus x_{217} \oplus x_{247} \oplus x_{231} \oplus x_{213} \oplus x_{235} \\
\oplus x_{255} \oplus x_{251} \oplus x_{181} \oplus x_{239} \oplus x_{174} \oplus x_{164} \oplus x_{29} \oplus x_{255} \oplus x_{247} \oplus x_{243} \oplus x_{213} \oplus x_{181} \oplus x_{174}
\] (5)

### 2.2 Espresso Algorithm Flow

Espresso as a keystream generator only has 2 phases, they are initialization and running phases. Before executing 256 initialization rounds, the 256-bit NFSR is loaded with secret key (128 bits) and initial vector bits (96 bits) as Formula 6. The initial internal state from \( x_{224} \) to \( x_{254} \) is assigned as 1 to ensure there is no full-zero state in NFSR. The most significant bit \( x_{255} \) is set to 0.

\[
x_{i}^{0} = k_{i}, \quad 0 \leq i < 128 \\
x_{i}^{0} = v_{i}, \quad 128 \leq i < 224 \\
x_{i}^{0} = 1, \quad 224 \leq i < 255 \\
x_{i}^{0} = 0, \quad i = 255
\] (6)

The output bit from filter function \( h(x) \) result is fed into feedback functions \( f_{255}(x) \) and \( f_{217}(x) \) as Formula 7 during initialization phase, and the remaining feedback functions are invariable in any phases. After 256 times update, the filter result \( h(x) \) is output as keystream, no longer feeding to internal state update.

\[
f_{217}(x) = x_{218} \oplus x_{3} x_{32} \oplus h(x) \\
f_{255}(x) = x_{0} \oplus x_{41} x_{70} \oplus h(x)
\] (7)

Espresso also introduces a pipeline circuit implementing the 20-variable filter function, shown in Figure 3. As such 3-stage pipeline structure, Espresso has no valid output keystream at the first 3 cycles after initialization. However, the combinational logic circuit is synthesized into look-up-table, which packages 4 or 6 input variables. As a general rule, there are just 2 logic levels about the filter function under Xilinx 7 series FPGA synthesize strategy. Therefore, it’s useless to adopt the pipeline structure in the following FPGA implementation.
Figure 3: The pipeline structure of Espresso filter function

3 Implementation of Galois-configured Espresso

In this section, we explore the efficient FPGA implementation strategies for Espresso stream cipher in Galois configuration serially and parallelly. The hardware architectures in this section and the following sections have no secret key storage component and the input vectors ($k$ and $v$) are loaded into Espresso hardware bit by bit. This is a common and acceptable FPGA hardware performance evaluating method used for Grain [23], ZUC [24], Snow3g [24] and WG-8 [21].

3.1 Overview of Espresso FPGA Implementation

Field programmable gate array (FPGA) has been adopted everywhere, including IoT and artificial intelligence. Different from ASIC, FPGA can be reconfigured after manufacturing like microprocessor, but the hardware circuit (registers, logic gate) can be directly mapped into FPGA resources gearing to the needs of higher speed and lower power consumption. In our paper, the optimized Espresso keystream generator architectures are deployed on Xilinx Spartan-3 and Virtex-7 FPGAs.

Take Galois-configured Espresso as an example, the architecture has clock port $clk$, reset port $rst$, 1-bit width data input port $in$ receiving secret key or initial vector from storage component (testbench) and 1-bit width output port $ks$. It also has two output signal ports $load$ for initial data input enable and $work$ for the keystream output enable after initialization. Hence, Espresso or Espresso-F controller has 4 modes, they are IDLE, LOAD, INIT, WORK implemented as a finite state machine (FSM) with 9-bit counter $cnt$. The FSM transformation is shown in Figure 4. When counter increases to 256 during LOAD stage, the state machine transforms into the next state INIT. When the counter overflows to 0 during INIT state, the state machine changes to WORK and keeps that state until reset. Therefore, $load$ signal is synchronous with the finite state machine’s LOAD state and $work$ is coincident with WORK state. The Espresso-like LFSR filter generator has another procedure for loaded compensated internal state, we will discuss the variant Espresso-L in Section 5.
3.2 Efficient Implementation of NFSR

The 256-bit nonlinear feedback shift register is the only sequential logical component in Espresso hardware determining the latency and area occupation. Galois NFSR includes 14 feedback functions updating 14 bits internal state parallelly once triggered by clock edge. The combinational logic circuit is synthesized as LUT, specially, 6-input LUT on Virtex-7 FPGA. Hence, the 20-variable filter function of Espresso NFSR tapping 20 registers is no longer pipelined with only 2 logic levels.

The most commonly used method to reduce slices on FPGA is replacing consecutive registers (synthesized as FFs) by shift register look-up-table (SRL). On 7 series FPGA, a LUT may also be configured as SRL32 to substitute for 32-bit shift register at most, and 4-input LUT can be synthesized as SRL16 on Spartan-3. However, the internal state of SRL cannot be detected by any wire except an output port. Therefore, when cipher has a large number of consecutive internal states, the SRL optimized method is evidently effective. For instance, Espresso’s registers from $x_{190}$ to $x_{192}$ 3 bits are not tapped for any function, the 3 flip-flops can be replaced by one SRL to save area. This SRL output port is tapped as $x_{190}$ and loaded from $x_{192}$.

We define the consecutive register fragment from $x_a$ to $x_b$ as $R(a, b)$, whose valid length is equal to $b - a - 1$, representing $x_a$ and $x_b$ are implemented as flip-flops to keep only one logic level between the optimized fragment beginning and ending at FFs. The fragment $R(a, b)$ two terminal bits $x_a$ and $x_b$, which is tapped or updated by function. As the above example, the register fragment $R(189, 193)$ left-shown in Figure 5 is optimized designed as right-shown.

Each reconfigurable slice on Virtex-7 has 4 6-input LUTs and 8 flip-flops. In general term, the ratio of LUT to FF in minimum area solutions should approach 1: 2. Noteworthy, the ratio is not necessarily equal to 1: 2 by the reason of placing and routing. Moreover, with a...
few LUTs are used for combinational logic function rather than shift register, replacing all 2-bit length consecutive registers is discouraged. The smallest Espresso architecture has to consider the trade-off between registers implemented by FFs and registers implemented by LUTs. Along this line, we list all consecutive fragments in Table 1 according to the filter function, feedback function tapped positions and feedback function updated positions.

Firstly, every consecutive register fragment with more than or equal to 5 bits length is synthesized as SRL. Despite the longest fragment has 17-bit register, over the maximum SRL length 16 bits on 3 series FPGAs, there are 13 SRLs occupied not only on Virtex-7 but also on Spartan-3. The 17-bit register fragment \( R_{141,159} \) is divided into 1-bit single register and 16-bit register, thus the length changes to 16 bits which may be synthesized as one SRL16.

Secondly, based on above solution, we continue replacing 4 bits length fragments, \( R_{159,164} \), \( R_{217,222} \) and \( R_{222,227} \). There are three more LUTs are used, shown Table 2. Noted that there are 11 flip-flops reduced rather than 12 flip-flops, because the terminal bit \( x_{217} \) is driven by feedback function \( f_{217}(x) \), and insert another flip-flop will reduce one logic level to improve frequency.

Finally, we replace 17 3-bit length fragments and 7 2-bit length fragments respectively. It’s obvious that taking replacement on 3-bit and 2-bit fragments is not necessary on Virtex-7, shown in Table 2. Until now, apply SRL optimization method for every more than or equal to 4 bits length register fragment can lead to the minimum area on Virtex-7. Meanwhile, due to the different ratio of LUT to FF between Virtex-7 and Spartan-3 FPGA, the same method should be applied for Spartan-3 implementation toward the smaller solution. Table 2 also shows the implementation results on Spartan-3. In brief, if a fragment includes more than 1 registers, it should be replaced by SRL.

As a result, Espresso keystream generator architecture occupies 62 slices on Spartan-3 FPGA and 25 slices on Virtex-7 FPGA at least.

### 3.3 Throughput Improvement in Hybrid Architecture

Galois-configured Espresso stream cipher performs high throughput, up to 2.22 Gbps under 90nm CMOS technology [17]. We can improve throughput further by redesigning the NFSR in parallel. Denote parallel width \( w \) means parallelized cipher generates \( w \) keystream bits in each clock. In serial solution, the feedback function \( f_i(x) \) used to update \( x_i \), but in parallel solution, the functions have to be copied for \( w \) times. Table 3 showing internal state update in serial will be conducive to the comprehension and exploration of parallel scheme.
Table 2: The Results of SRL Optimized Method in Galois Configuration

<table>
<thead>
<tr>
<th>Device</th>
<th>SRL replacement</th>
<th>#LUT / #FF</th>
<th>#LUT as SRL (Slices)</th>
<th>Area (MHz)</th>
<th>Freq. (Mbps/Slices)</th>
<th>T./A.</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>R length ≥</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>5 bits</td>
<td>45/148</td>
<td>13</td>
<td>30</td>
<td>572.1</td>
<td>19.07</td>
</tr>
<tr>
<td></td>
<td>4 bits</td>
<td>48/137</td>
<td>16</td>
<td>29</td>
<td>491.4</td>
<td>19.66</td>
</tr>
<tr>
<td></td>
<td>3 bits</td>
<td>65/93</td>
<td>33</td>
<td>30</td>
<td>445.6</td>
<td>15.37</td>
</tr>
<tr>
<td></td>
<td>2 bits</td>
<td>72/79</td>
<td>40</td>
<td>32</td>
<td>422.7</td>
<td>13.21</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>5 bits</td>
<td>63/144</td>
<td>13</td>
<td>85</td>
<td>176.1</td>
<td>2.07</td>
</tr>
<tr>
<td></td>
<td>4 bits</td>
<td>66/133</td>
<td>16</td>
<td>79</td>
<td>179.3</td>
<td>2.27</td>
</tr>
<tr>
<td></td>
<td>3 bits</td>
<td>83/94</td>
<td>33</td>
<td>70</td>
<td>182.5</td>
<td>2.61</td>
</tr>
<tr>
<td></td>
<td>2 bits</td>
<td>90/80</td>
<td>40</td>
<td>62</td>
<td>198.5</td>
<td>3.20</td>
</tr>
</tbody>
</table>

Table 3: The Serially Update of the NFSR from $x_{206}$ to $x_{209}$

<table>
<thead>
<tr>
<th>Round/Cycle</th>
<th>$x_{206}$</th>
<th>$x_{207}$</th>
<th>$x_{208}$</th>
<th>$x_{209}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$t$</td>
<td>$x^{t+1}_{206}$</td>
<td>$x^{t+1}_{207}$</td>
<td>$x^{t+1}_{208}$</td>
<td>$f_{209}(x^t)$</td>
</tr>
<tr>
<td>$t+1$</td>
<td>$x^{t+2}_{206}$</td>
<td>$x^{t+2}_{207}$</td>
<td>$x^{t+2}_{208}$</td>
<td>$f_{209}(x^{t+1})$</td>
</tr>
<tr>
<td>$t+2$</td>
<td>$x^{t+3}_{206}$</td>
<td>$x^{t+3}_{207}$</td>
<td>$x^{t+3}_{208}$</td>
<td>$f_{209}(x^{t+2})$</td>
</tr>
<tr>
<td>$t+3$</td>
<td>$x^{t+4}_{206}$</td>
<td>$f_{209}(x^t)$</td>
<td>$f_{209}(x^{t+1})$</td>
<td>$f_{209}(x^{t+3})$</td>
</tr>
<tr>
<td>$t+4$</td>
<td>$f_{209}(x^t)$</td>
<td>$f_{209}(x^{t+1})$</td>
<td>$f_{209}(x^{t+2})$</td>
<td>$f_{209}(x^{t+3})$</td>
</tr>
</tbody>
</table>
Table 4: The Parallelly Update of the NFSR from $x_{206}$ to $x_{209}$

<table>
<thead>
<tr>
<th>Cycle</th>
<th>$x_{206}$</th>
<th>$x_{207}$</th>
<th>$x_{208}$</th>
<th>$x_{209}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$t$</td>
<td>$x_{206}^t$</td>
<td>$x_{207}^t$</td>
<td>$x_{208}^t$</td>
<td>$x_{209}^t$</td>
</tr>
<tr>
<td>$t + 1$</td>
<td>$f_{209}^0(x^t)$</td>
<td>$f_{209}^1(x^t)$</td>
<td>$f_{209}^2(x^t)$</td>
<td>$f_{209}^3(x^t)$</td>
</tr>
</tbody>
</table>

Take a part of Espresso NFSR (from $x_{206}$ to $x_{209}$) as an example, the feedback function $f_{209}(x)$ drives $x_{209}$. At the next round (Round = $t + 1$), the NFSR part from $x_{206}^{t+1}$ to $x_{209}^{t+1}$ is equal to $x_{207}^{t+1}$, $x_{208}^{t+1}$ and $x_{209}^{t+1}$, which shifts one bit and loads the function $f_{209}(x^t)$. After that, the 4-bit NFSR part shifts one more bits at round $t + 2$, changes to $x_{208}^{t+2}$, $x_{209}^{t+2}$, $f_{209}(x^{t+1})$ and $f_{209}(x^{t+2})$. Meanwhile, $f_{209}(x^{t+1})$ take NFSR $t + 1$ round internal state as variable, which can be calculated by shifting $x^t$ one bit. For the same reason, we can get the internal state at round $t + 3$ and $t + 4$. Thus, Espresso can update multiple rounds in each cycle has been proved.

Here we define $f_i^j(x)$ used to update $x_{i-(w-1-j)}$ in each clock in parallel schemes, as Table 4. Every index of $f_i^j(x)$ variables should add $j$ on the basis of original index, similar as getting the variables for function $f_i(x)$ at the next $j$ rounds in serial. For example, function $f_{209}(x) = x_{210} \oplus x_{3}x_{64}$ drives internal state $x_{209}$, if we note the parallel width is 4 bits, there are another 4 inferred functions based on $f_{209}(x)$:

$$
\begin{align*}
    f_{209}^0(x) &= x_{210+0} \oplus x_{6+0}x_{64+0} \\
    f_{209}^1(x) &= x_{210+1} \oplus x_{6+1}x_{64+1} \\
    f_{209}^2(x) &= x_{210+2} \oplus x_{6+2}x_{64+2} \\
    f_{209}^3(x) &= x_{210+3} \oplus x_{6+3}x_{64+3}
\end{align*}
$$

For the hybrid-designed Espresso, the cipher update by the Formula 4 before generating keystream phase. After 256 clocks for initialization, Espresso update $w$ rounds in each clock following:

$$
    x_{i+1} = \begin{cases} 
    f_i^{w-1-j}(x), & \exists j : (i+j) \in U, j=0,1,...,w-1 \\
    x_{i+w}, & \text{other}
    \end{cases}
$$

(9)

According to the above formulas, $x_{206}$ updates by $f_{209}^0(x)$, $x_{207}$ updates by $f_{209}^1(x)$, $x_{208}$ updates by $f_{209}^2(x)$ and $x_{209}$ updates by $f_{209}^3(x)$ respectively. Hence, the hybrid Espresso take forward for 4 rounds once triggered by clock edge.

However, the minimum number of spare registers between tapped bit and the near bit in update set $U$ determines the maximum parallel width. If we increase one bit parallel ($w = 5$), another additional function may be noted as $f_{209}^4(x) = x_{210+4} \oplus x_{6+4}x_{64+4}$. But the monomial $x_{210+4}$ is not correct because $x_{213}$ is driven by another feedback function instead of the more significant bit $x_{214}$. Hence, one of the $f_{209}^4(x)$ variables is not existed in current internal state, but equal to $f_{213}^4(x)$:

$$
    f_{209}^4(x) = f_{213}^4(x) \oplus x_{6+4}x_{64+4}
$$

(10)

In this case, the data path of $f_{209}^4(x)$ based on $f_{213}^4(x)$ will lead to much latency and we will adopt this method in the following Fibonacci variants Espresso hybrid architecture. Therefore, we stipulate that the maximum parallel width of Galois Espresso is 4 bits. For example, Trivium stream cipher has the maximum parallel width 64 because there are at least 64 non-tapped iterations lower than modified bit [5].

We conclude the maximum parallel width in this method is $w_{max}$, which can be gotten by:

$$
    w_{max} = \min(u - v + 1), (u \in U, v \in V, u > v)
$$

(11)

11
Furthermore, there is another problem which restricts Espresso parallelized implementation. Espresso's filter function $h(x)$ variables $x_{255}$, $x_{247}$, $x_{243}$ and $x_{213}$ belong to set $U$, along the above discussion, the variables of $h^1(x)$ function includes one more significant bit of $h^0(x)$ (i.e. $h(x)$) variables, but we can’t get the next bits in set $U$ just according to current internal state.

In another way, the filter function is no more fed into NFSR after initialization. Espresso can firstly update for 4 rounds and then produce 4 keystream bits, the 4 filter functions are noted as $h_0(z), h^{-1}(z), h^{-2}(z)$ and $h^{-3}(z)$. Consequently, other device collects keystream from $z_{w(t-2)}$ to $z_{w(t-1)-1}$ at clock $t$ rising edge, which represents once hybrid Espresso detects the first rising edge in finite state machine WORK state, there is no valid keystream bit. Precisely, the first $w$ bits keystream $z_0, z_1, ..., z_{w-1}$ are sampled at the second clock $t = 2$ rising edge during running phase. That can be named as first update then filter strategy.

As a result, our optimized Espresso is designed in hybrid architecture, Espresso update serially during load and initialization phases and update multiple rounds in each clock during processing plaintext phase. The FPGA implement results are listed in Table 5.

### 4 Description and Implementation of Fibonacci-configured Espresso

In this section, we will provide a brief description about Galois-to-Fibonacci transformation method [14]. Subsequently, the serial and hybrid architectures are introduced for efficient FPGA implementation.

#### 4.1 Galois-to-Fibonacci Transformation

Fibonacci NFSR is a special kind of Galois NFSR, whose feedback functions $f_i(x) = x_{i+1}$ except the most significant bit. For $n$-bit Fibonacci NFSR, the update set only has one element, $U = \{n-1\}$. Earlier research has concluded that Galois NFSR is more efficient due to the parallel feedback functions [25], for example, Galois-configured Grain-80 [16] can perform 58% higher frequency than that in Fibonacci configuration, but Fibonacci NFSR has significant advantages for security analysis due to few number of feedback functions.

According to transformation method in [14], Espresso NFSR can be configured as two Fi-
Figure 6: The block diagram of Fibonacci-configured Espresso-F

Fibonacci NFSRs, updated by function $f_{255}(x)$ and function $f_{217}(x)$:

$$f_{255}(x) = f_L(x) \oplus f_N(x)$$

- $f_L(x) = x_0 \oplus x_{12} \oplus x_{18} \oplus x_{115} \oplus x_{133} \oplus x_{213}$
- $f_N(x) = x_1 \oplus x_{16} \oplus x_{18} \oplus x_{52} \oplus x_{110} \oplus x_{55} \oplus x_{130} \oplus x_{62} \oplus x_{157} \oplus x_{74} \oplus x_{183} \oplus x_{87} \oplus x_{110} \oplus x_{130} \oplus x_{157}$

$$f_{217}(x) = f_{218} \oplus f'_N(x)$$

- $f_{218} = x_{218}$
- $f'_N(x) = x_{3} \oplus x_{8} \oplus x_{14} \oplus x_{17} \oplus x_{24} \oplus x_{36} \oplus x_{49} \oplus x_{72} \oplus x_{92} \oplus x_{119}$

It’s evidently that $f_{255}(x)$ combines a 6-variable linear function $f_L(x)$ and a 12-variable nonlinear function $f_N(x)$. Meanwhile, $f_{217}(x)$ has the similar specification, XORed by $x_{218}$ and shifted version of 12-variable nonlinear function $f'_N(x)$. We can note the Fibonacci configuration Espresso as Espresso-F, shown in Figure 6.

4.2 Efficient Implementation in Fibonacci Configuration

In the above transformation method, a Fibonacci-configured NFSR has one element in update set $U$. Despite the fact that only one feedback function, the extra monomials in $f_{255}(x)$ are shifted from other functions. Thus, the quantity of logic gates is theoretically constant between Galois-configured NFSR and Fibonacci-configured NFSR. However, hardware implementation on FPGA is different from that on ASIC, namely, combinational logic gate is synthesized as look-up-table similar to distribute RAM stored the logic truth table. The special feature represents several logic gates can be packaged into one LUT, the Fibonacci NFSR with larger depth of logic circuit gates has significant advantage.

On another hand, only one bit $x_{255}$ is driven by feedback function in Espresso-F, 13 update tapped positions has been deleted, potentially forming longer consecutive register fragments. We can also list all consecutive fragments in Table 6. The longest fragment is $R(187, 213)$ with 25 bits length.

13
Table 6: The Consecutive Register Fragments of Espresso-F

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>R(187, 213)</td>
<td>25</td>
<td>12</td>
<td>R(157, 164)</td>
<td>6</td>
<td>23</td>
<td>R(115, 119)</td>
<td>3</td>
<td>34</td>
<td>R(251, 255)</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>R(145, 157)</td>
<td>11</td>
<td>13</td>
<td>R(174, 181)</td>
<td>6</td>
<td>24</td>
<td>R(133, 137)</td>
<td>3</td>
<td>35</td>
<td>R(0, 3)</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>R(99, 110)</td>
<td>10</td>
<td>14</td>
<td>R(74, 80)</td>
<td>5</td>
<td>25</td>
<td>R(183, 187)</td>
<td>3</td>
<td>36</td>
<td>R(14, 17)</td>
<td>2</td>
</tr>
<tr>
<td>4</td>
<td>R(119, 130)</td>
<td>10</td>
<td>15</td>
<td>R(3, 8)</td>
<td>4</td>
<td>26</td>
<td>R(213, 217)</td>
<td>3</td>
<td>37</td>
<td>R(29, 32)</td>
<td>2</td>
</tr>
<tr>
<td>5</td>
<td>R(164, 174)</td>
<td>9</td>
<td>16</td>
<td>R(24, 29)</td>
<td>4</td>
<td>27</td>
<td>R(218, 222)</td>
<td>3</td>
<td>38</td>
<td>R(41, 44)</td>
<td>2</td>
</tr>
<tr>
<td>6</td>
<td>R(62, 70)</td>
<td>7</td>
<td>17</td>
<td>R(36, 41)</td>
<td>4</td>
<td>28</td>
<td>R(227, 231)</td>
<td>3</td>
<td>39</td>
<td>R(49, 52)</td>
<td>2</td>
</tr>
<tr>
<td>7</td>
<td>R(137, 145)</td>
<td>7</td>
<td>18</td>
<td>R(87, 92)</td>
<td>4</td>
<td>29</td>
<td>R(231, 235)</td>
<td>3</td>
<td>40</td>
<td>R(52, 55)</td>
<td>2</td>
</tr>
<tr>
<td>8</td>
<td>R(17, 24)</td>
<td>6</td>
<td>19</td>
<td>R(110, 115)</td>
<td>4</td>
<td>30</td>
<td>R(235, 239)</td>
<td>3</td>
<td>41</td>
<td>R(130, 133)</td>
<td>2</td>
</tr>
<tr>
<td>9</td>
<td>R(55, 62)</td>
<td>6</td>
<td>20</td>
<td>R(222, 227)</td>
<td>4</td>
<td>31</td>
<td>R(239, 243)</td>
<td>3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>R(80, 87)</td>
<td>6</td>
<td>21</td>
<td>R(8, 12)</td>
<td>3</td>
<td>32</td>
<td>R(243, 247)</td>
<td>3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>11</td>
<td>R(92, 99)</td>
<td>6</td>
<td>22</td>
<td>R(32, 36)</td>
<td>3</td>
<td>33</td>
<td>R(247, 251)</td>
<td>3</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 7: Results of SRL Optimized Method in Fibonacci Configuration

<table>
<thead>
<tr>
<th>Device</th>
<th>SRL replacement</th>
<th>#LUT / #FF as SRL</th>
<th>#LUT (Slices)</th>
<th>Area (Slices)</th>
<th>Freq. (MHz)</th>
<th>T./A. (Mbps/Slices)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Virtex-7</td>
<td>5 bits</td>
<td>44/147</td>
<td>14</td>
<td>28</td>
<td>533.3</td>
<td>19.05</td>
</tr>
<tr>
<td></td>
<td>4 bits</td>
<td>50/123</td>
<td>20</td>
<td>24</td>
<td>415.6</td>
<td>17.32</td>
</tr>
<tr>
<td></td>
<td>3 bits</td>
<td>64/81</td>
<td>34</td>
<td>22</td>
<td>427.2</td>
<td>19.42</td>
</tr>
<tr>
<td></td>
<td>2 bits</td>
<td>71/67</td>
<td>41</td>
<td>25</td>
<td>397.1</td>
<td>15.89</td>
</tr>
<tr>
<td>Spartan-3</td>
<td>5 bits</td>
<td>56/141</td>
<td>15</td>
<td>91</td>
<td>167.1</td>
<td>1.84</td>
</tr>
<tr>
<td></td>
<td>4 bits</td>
<td>62/117</td>
<td>21</td>
<td>79</td>
<td>172.6</td>
<td>2.19</td>
</tr>
<tr>
<td></td>
<td>3 bits</td>
<td>76/75</td>
<td>35</td>
<td>59</td>
<td>169.7</td>
<td>2.88</td>
</tr>
<tr>
<td></td>
<td>2 bits</td>
<td>83/61</td>
<td>42</td>
<td>52</td>
<td>195.4</td>
<td>3.76</td>
</tr>
</tbody>
</table>

Based on the table, we investigate the variation between area and SRL replaced fragment threshold length, similar to the serial Galois Espresso discussion. The implementation results are shown in Table 7. There are 25 slices occupied on Virtex-7 and 52 slices on Spartan-3. Compared with Galois Espresso, the Fibonacci-configured Espresso-F uses less reconfigurable resources. Although Espresso-F critical data path has less route delay with smaller placement, more combinational logic level will lead to much propagation time. On balance, Fibonacci configuration is smaller but slower than Galois configuration on FPGA.

It should be noted that R(187, 213) with 25 bits length is divided into R(187, 204) and R(204, 213) on Spartan-3, because for 4-input LUT, the synthesized SRL maximum length is $2^4 = 16$ bits. The solutions on Spartan-3 have one more SRL than that on Virtex-7.

4.3 Hybrid Fibonacci-configured Espresso Implementation

Fibonacci-configured Espresso consists of a 218-bit NFSR and a 38-bit NFSR driven by functions $f_{217}(x)$ and $f_{255}(x)$ respectively. Compared with Galois feedback shift register, each Fibonacci FSR has only one update function, thus, the Espresso-F update set $U_A$ and feedback function...
variable set \( V_A \) are following:

\[
U_A = \{217, 255\} \\
V_A = \{0, 3, 8, 12, 14, 17, 24, 32, 36, 41, 46, 49, 52, 55, 62, 70, \ldots, 74, 87, 92, 110, 115, 119, 130, 133, 145, 157, 183, 213, 218\} 
\]

According to Formula 11, when the iteration element \( u \) in set \( U \) is 213 and \( v \) is 217, the maximum parallel width \( w_{\text{max}} \) is still 5.

To make further investigation, we take the 5-bit parallel solution as an example:

\[
\begin{align*}
  f_{0}^{255}(x) &= f_{L}^{0}(x) \oplus f_{N}^{0}(x) \\
  f_{1}^{255}(x) &= f_{L}^{1}(x) \oplus f_{N}^{1}(x) \\
  f_{2}^{255}(x) &= f_{L}^{2}(x) \oplus f_{N}^{2}(x) \\
  f_{3}^{255}(x) &= f_{L}^{3}(x) \oplus f_{N}^{3}(x) \\
  f_{4}^{255}(x) &= f_{L}^{4}(x) \oplus f_{N}^{4}(x) \\
 \end{align*}
\]

If we increase 1 more bit width to produce 6 keystream bits in each clock, the additional function \( f_{255}^{5} \) should be:

\[
f_{255}^{5}(x) = f_{L}^{5}(x) \oplus f_{N}^{5}(x) = x_{0}+5 \oplus x_{12}+5 \oplus x_{48}+5 \oplus x_{115}+5 \oplus x_{133}+5 + f_{0}^{217}(x) + f_{N}^{5}(x) 
\]

Therefore, the 6 bits parallel width solution in Fibonacci configuration is shown in Figure 7. We can find that \( f_{255}^{5} \) variable includes one bit from \( f_{217}^{0} \). Although this combining strategy will cause much propagation time with longer data path, we can implement the 8-bit and 16-bit parallel width Espresso-F solutions, shown in Table 8.

![Figure 7: The block diagram of hybrid Espresso-F x6](image)

It should be noted that the multiple bits generation strategy of Espresso-F is not first update then filter, but first filter then update, which means Espresso-F produces keystream bits \( z_0, z_1, \ldots, z_{w-1} \) directly at the first clock after initialization. Accordingly, \( w \) filter functions are listed as \( h^0(x), h^1(x), \ldots, h^{w-1}(x) \).
Table 8: Implementation Results of Espresso-F Hybrid Architecture

<table>
<thead>
<tr>
<th>Device</th>
<th>Width (bit)</th>
<th>#LUT/FF</th>
<th>Area (Slices)</th>
<th>Freq. (MHz)</th>
<th>Thro. (Mbps)</th>
<th>T./A. (Mbps/Slices)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Virtex-7</td>
<td>1</td>
<td>64/81</td>
<td>22</td>
<td>427.2</td>
<td>427.2</td>
<td>19.42</td>
</tr>
<tr>
<td></td>
<td>4</td>
<td>207/267</td>
<td>60</td>
<td>348.2</td>
<td>1392.8</td>
<td>23.21</td>
</tr>
<tr>
<td></td>
<td>8</td>
<td>279/267</td>
<td>75</td>
<td>341.1</td>
<td>2728.5</td>
<td>36.38</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>415/267</td>
<td>112</td>
<td>255.4</td>
<td>4085.8</td>
<td>36.48</td>
</tr>
<tr>
<td>Spartan-3</td>
<td>1</td>
<td>83/61</td>
<td>52</td>
<td>195.4</td>
<td>195.4</td>
<td>3.76</td>
</tr>
<tr>
<td></td>
<td>4</td>
<td>356/267</td>
<td>197</td>
<td>134.2</td>
<td>536.8</td>
<td>2.73</td>
</tr>
<tr>
<td></td>
<td>8</td>
<td>449/267</td>
<td>230</td>
<td>123.4</td>
<td>987.2</td>
<td>4.13</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>641/267</td>
<td>332</td>
<td>118.7</td>
<td>1899.1</td>
<td>5.72</td>
</tr>
</tbody>
</table>

Figure 8: The block diagram of Fibonacci-configured Espresso-L

5 Description and Implementation of Espresso-like LFSR filter generator

We have introduced the implementation of Fibonacci-configured Espresso stream cipher in the last section. In this section, another transformation method from Galois NFSR to Fibonacci LFSR with compensation lists is briefly summarized, noted as Espresso-L.

5.1 Galois NFSR to Fibonacci LFSR Transformation

Another Espresso variant [15] is same as a LFSR filter generator, consists of 256-bit LFSR and 104-variables keystream filter function, named as Espresso-L, shown in Figure 8. After shifting all monomials to $f_{255}(x)$, the unique feedback function is linear function following:

$$f_{255}(x) = x_0 \oplus x_{12} \oplus x_{48} \oplus x_{115} \oplus x_{133} \oplus x_{213}$$  \hspace{1cm} (16)

Denote $m|d$ represents each variables $x_i$ in monomial $m$ are change to $x_{i+d}$. Compensation list is generated during monomial transformation from $f_a(x)$ to $f_b(x)$, $b = 255$ in Espresso-L configuration as Formula 17, where $p \in U$.

$$C_p = \begin{cases} 0, p \leq a \\ m|p-a-1, p > a \end{cases}$$  \hspace{1cm} (17)

There are totally 14 compensation lists because 14 feedback functions are transformed to $f_{255}(x)$. In order to achieve better comprehension, we note that $f_{255}(x)$ is also converted to itself.
Figure 9: Espresso-L FSM transformation diagram

\[ f_{255}(x), \text{ while every element in list } C_{255} \text{ is equal to } 0. \]
We combine 14 lists as:

\[ C[i] = \sum_{p \in U} C_p[i] \] (18)

The compensated internal state \( \hat{x}_i \) is generated by \( \hat{x}_i = x_i \oplus C[i] \). It’s obviously that the compensated internal state \( \hat{x}_i \), where \( i \leq 193 \), are same as \( x_i \) because compensation list \( C[i] \) is empty.

The filter function \( h(x) \) variables \( x_i \) are replaced by \( \hat{x}_i \). And the 256 bits initial internal state before rounding are also loaded as \( \hat{x}_i^0(i = 0, 1, ..., 255) \).

\[ h(x) = x_{80} \oplus x_{99} \oplus x_{137} \oplus x_{227} \oplus x_{222} \oplus x_{187} \oplus \hat{x}_{243} \hat{x}_{217} \oplus \hat{x}_{247} \hat{x}_{231} \oplus \hat{x}_{213} \hat{x}_{235} \]
\[ \oplus \hat{x}_{255} \hat{x}_{251} \oplus x_{181} \hat{x}_{239} \oplus x_{174} x_{44} \oplus x_{164} x_{29} \oplus \hat{x}_{255} \hat{x}_{247} \hat{x}_{243} \hat{x}_{213} x_{181} x_{174} \] (19)

Until now, there are three Espresso modes, they are original Galois-configured Espresso, Fibonacci-configured Espresso-F and Fibonacci-configured LFSR filter generator Espresso-L.

5.2 Hardware Design of Espresso-L

Due to the influence of compensation list, the initial internal state \( \hat{x}_i^0 \) is changed to \( x_i^0 \), in especial, the 32 bits from \( \hat{x}_i^{224} \) to \( \hat{x}_i^{256} \) are no longer constant. There should be extra circuit realizing the compensation list and state update, so another state is added based on Galois Espresso finite state machine, SET state represents 256 bits internal state are XORed with \( C[i] \) respectively after loading initial vector. The SET state lasts for one clock period before INIT state, shown in Figure 9.

The internal states from \( x_0 \) to \( x_{254} \) are updated following Formula 20, no longer just relying the more significant bit \( x_{i+1} \). Hence, the SRL optimized is not available for Espresso-L, and each register (synthesized to flip-flop) should be driven by multiplex circuit (synthesized to look-up-table), which causes much more area.

\[ x_i = \begin{cases} 
  x_i \oplus C[i], & \text{state} = \text{SET} \\
  x_{i+1}, & \text{others} 
\end{cases} \] (20)

We implement the Espresso-L in both serial architecture and hybrid architecture, the results are listed in Table 9. It’s obviously that Espresso-L requires much area for compensation list and leads to much critical path delay because of 104-variables filter function.
Table 9: Implementation Results of Espresso-L Variant

<table>
<thead>
<tr>
<th>Device</th>
<th>Width (bit)</th>
<th>#LUT / #FF</th>
<th>Area (Slices)</th>
<th>Freq. (MHz)</th>
<th>Thro. (Mbps)</th>
<th>T./A. (Mbps/Slices)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Virtex-7</td>
<td>1</td>
<td>227/268</td>
<td>72</td>
<td>275.3</td>
<td>275.3</td>
<td>3.82</td>
</tr>
<tr>
<td></td>
<td>4</td>
<td>509/268</td>
<td>143</td>
<td>271.9</td>
<td>1087.5</td>
<td>7.61</td>
</tr>
<tr>
<td></td>
<td>8</td>
<td>572/268</td>
<td>153</td>
<td>269.0</td>
<td>2151.7</td>
<td>14.06</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>735/268</td>
<td>197</td>
<td>208.2</td>
<td>3330.6</td>
<td>16.91</td>
</tr>
<tr>
<td>Spartan-3</td>
<td>1</td>
<td>550/268</td>
<td>303</td>
<td>113.8</td>
<td>113.8</td>
<td>0.38</td>
</tr>
<tr>
<td></td>
<td>4</td>
<td>1104/268</td>
<td>573</td>
<td>113.4</td>
<td>453.6</td>
<td>0.79</td>
</tr>
<tr>
<td></td>
<td>8</td>
<td>1209/268</td>
<td>624</td>
<td>98.2</td>
<td>785.2</td>
<td>1.26</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>1462/268</td>
<td>756</td>
<td>86.9</td>
<td>1390.6</td>
<td>1.84</td>
</tr>
</tbody>
</table>

5.3 Security Analysis on Galois-to-Fibonacci Transformation

So far, there is no method that evinces active attacks on original Espresso, but Galois-to-Fibonacci transformation has been confirmed revealing possible security weakness [19] and [15]. Based on the transformation [14], another Fibonacci variant Espresso-A with two feedback function $f_{254}(x)$ and $f_{255}(x)$ has not 128-bit security level resistance, i.e. the 128-bit secret key can be recovered with only two pairs of related key-IVs, less than $2^{41}$ chosen IVs and $O(2^{64})$ computational complexity [19]. Meanwhile, the LFSR filter generator variant Espresso-L may be broken with complexity $O(2^{68.44})$ under algebraic attack and $O(2^{66.86})$ under Ronjom-Helleseth attack [15]. However, they are sufficiently to be used in the tiny devices. Besides, our results have demonstrated that the Espresso-like LFSR filter generator cannot be implemented efficiently on hardware. Therefore, the LFSR variant Espresso-L is not recommended for ultra-lightweight cases.

6 Hardware Performance Comparison of Espresso and other ciphers

6.1 Comparison of Espresso and its variants

In this paper, in Table 10, we investigate the hardware performance of the stream ciphers Espresso and its two variants, who have the similar security level. Our optimizations and FPGA implementations aimed at evaluating the cipher’s hardware performance in Galois and Fibonacci configuration, targeting for both high-speed and resource-constrained scenes.

The original Galois-configured Espresso has 14 bits internal state driven by the 14 feedback functions. The minimum distance between them is 4 bits, so the original Espresso can be upgraded to the parallel Espresso with 4 bits parallel width, which produces 4 bits keystream at each clock.

The Fibonacci-configured Espresso (Espresso-F) consisting of 2 Fibonacci NFSRs, are transformed [14] by the original Espresso. There are only 2 bits driven by the 2 feedback functions respectively. Espresso-F is not constrained by the maximum parallel width of 4 bits and we implement up to Espresso-F x16 to evaluate the throughput improvement strategy. The comparison between Galois-configured Espresso and Fibonacci-configured Espresso reveals which configuration is more efficient for various FPGA applications.

As shown in Figure 10, for the serial or low-width parallel solutions (i.e. x4), the Galois
configuration has a higher throughput, but is larger than the Fibonacci configuration. The reason is that the split combinational logic in the Galois configuration do not lead to much logic level and route delay, but has to be synthesized in the extra look-up-tables.

<table>
<thead>
<tr>
<th>Cipher (Variant)</th>
<th>#LUTs</th>
<th>Area (Slices)</th>
<th>Freq. (MHz)</th>
<th>Thro. (Mbps)</th>
<th>Pow. (mW)</th>
<th>T./A.</th>
<th>T./P.</th>
<th>T./(P. · A.)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Espresso x1 [Sec. 3]</td>
<td>90/80</td>
<td>62</td>
<td>198.5</td>
<td>198.5</td>
<td>3.20</td>
<td>2</td>
<td>99.2</td>
<td>1.60</td>
</tr>
<tr>
<td>Espresso x4 [Sec. 3]</td>
<td>371/267</td>
<td>198</td>
<td>163.3</td>
<td>653.3</td>
<td>3.30</td>
<td>11</td>
<td>59.4</td>
<td>0.30</td>
</tr>
<tr>
<td>Espresso-F x1 [Sec. 4]</td>
<td>83/61</td>
<td>52</td>
<td>195.4</td>
<td>195.4</td>
<td>3.76</td>
<td>2</td>
<td>97.7</td>
<td>1.88</td>
</tr>
<tr>
<td>Espresso-F x4 [Sec. 4]</td>
<td>356/267</td>
<td>197</td>
<td>134.2</td>
<td>536.8</td>
<td>2.73</td>
<td>11</td>
<td>48.8</td>
<td>0.25</td>
</tr>
<tr>
<td>Espresso-F x8 [Sec. 4]</td>
<td>449/267</td>
<td>239</td>
<td>123.4</td>
<td>987.2</td>
<td>4.13</td>
<td>18</td>
<td>54.8</td>
<td>0.23</td>
</tr>
<tr>
<td>Espresso-F x16 [Sec. 4]</td>
<td>641/267</td>
<td>332</td>
<td>118.7</td>
<td>1899.1</td>
<td>5.72</td>
<td>33</td>
<td>57.5</td>
<td>0.17</td>
</tr>
<tr>
<td>Espresso-L x1 [Sec. 5]</td>
<td>550/268</td>
<td>303</td>
<td>113.8</td>
<td>113.8</td>
<td>0.38</td>
<td>9</td>
<td>12.6</td>
<td>0.04</td>
</tr>
<tr>
<td>Espresso-L x4 [Sec. 5]</td>
<td>1104/268</td>
<td>573</td>
<td>113.4</td>
<td>453.6</td>
<td>0.79</td>
<td>19</td>
<td>23.9</td>
<td>0.04</td>
</tr>
<tr>
<td>Espresso-L x8 [Sec. 5]</td>
<td>1209/268</td>
<td>624</td>
<td>98.2</td>
<td>785.2</td>
<td>1.26</td>
<td>31</td>
<td>25.3</td>
<td>0.04</td>
</tr>
<tr>
<td>Espresso-L x16 [Sec. 5]</td>
<td>1462/268</td>
<td>756</td>
<td>86.9</td>
<td>1390.6</td>
<td>1.84</td>
<td>46</td>
<td>30.2</td>
<td>0.04</td>
</tr>
<tr>
<td>Espresso Best Item</td>
<td>-/-</td>
<td>52</td>
<td>198.5</td>
<td>1899.1</td>
<td>5.72</td>
<td>2</td>
<td>99.2</td>
<td>1.88</td>
</tr>
</tbody>
</table>

Another variant is Fibonacci-configured Espresso 256-bit LFSR filter generator (Espresso-L), transformed [15] from nonlinear FSR to linear FSR. Although there is only one simplified linear feedback function, consisting of only 6 variables, much nonlinear feedback logic is shifted to filter function, forming a 104-variables keystream filter function. This kind of transformation does not increase the throughput or reduce the area, but is not conducive to hardware implementation. The aggregate index Throughput/Area (T./A.) of Espresso-L x1 is only 11.9% of Espresso x1 and 10.1% of Espresso-F x1. This kind of cipher containing linear FSR may only be adopted on specific programmable chips, which has coarse-grain reconfigurable linear feedback shift registers [26], [27], [28].
<table>
<thead>
<tr>
<th>Cipher</th>
<th>#LUTs</th>
<th>Area</th>
<th>Freq.</th>
<th>Thro.</th>
<th>T./A.</th>
<th>Pow.</th>
<th>T./P.</th>
<th>T./(P.·A.)</th>
</tr>
</thead>
<tbody>
<tr>
<td>(Variant)</td>
<td>/#FFs (Slices) (MHz) (Mbps) (Mbps/ (mW) (Mbps/ (mW· Slices))</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Espresso x1 [Sec. 3]</td>
<td>48/137</td>
<td>25</td>
<td>491.4</td>
<td>491.4</td>
<td>19.66</td>
<td>1</td>
<td>491.4</td>
<td>19.66</td>
</tr>
<tr>
<td>Espresso x4 [Sec. 3]</td>
<td>211/267</td>
<td>60</td>
<td>373.3</td>
<td>1493.1</td>
<td>24.88</td>
<td>4</td>
<td>373.3</td>
<td>6.22</td>
</tr>
<tr>
<td>Espresso-F x1 [Sec. 4]</td>
<td>64/81</td>
<td>22</td>
<td>427.2</td>
<td>427.2</td>
<td>19.42</td>
<td>1</td>
<td>427.2</td>
<td>19.42</td>
</tr>
<tr>
<td>Espresso-F x4 [Sec. 4]</td>
<td>207/267</td>
<td>60</td>
<td>348.2</td>
<td>1392.8</td>
<td>23.21</td>
<td>3</td>
<td>464.3</td>
<td>7.74</td>
</tr>
<tr>
<td>Espresso-F x8 [Sec. 4]</td>
<td>279/267</td>
<td>75</td>
<td>341.1</td>
<td>2728.5</td>
<td>36.38</td>
<td>7</td>
<td>389.8</td>
<td>5.20</td>
</tr>
<tr>
<td>Espresso-F x16 [Sec. 4]</td>
<td>415/267</td>
<td>112</td>
<td>255.4</td>
<td>4085.8</td>
<td>36.48</td>
<td>9</td>
<td>454.0</td>
<td>4.05</td>
</tr>
<tr>
<td>Espresso-L x1 [Sec. 5]</td>
<td>227/268</td>
<td>72</td>
<td>275.3</td>
<td>275.3</td>
<td>3.82</td>
<td>4</td>
<td>68.8</td>
<td>0.96</td>
</tr>
<tr>
<td>Espresso-L x4 [Sec. 5]</td>
<td>509/268</td>
<td>143</td>
<td>271.9</td>
<td>1087.5</td>
<td>7.61</td>
<td>6</td>
<td>181.3</td>
<td>1.27</td>
</tr>
<tr>
<td>Espresso-L x8 [Sec. 5]</td>
<td>572/268</td>
<td>153</td>
<td>269.0</td>
<td>2151.7</td>
<td>14.06</td>
<td>9</td>
<td>239.1</td>
<td>1.56</td>
</tr>
<tr>
<td>Espresso-L x16 [Sec. 5]</td>
<td>735/268</td>
<td>197</td>
<td>208.2</td>
<td>3330.6</td>
<td>16.91</td>
<td>16</td>
<td>208.2</td>
<td>1.06</td>
</tr>
<tr>
<td>Espresso Best Item</td>
<td>/-/-</td>
<td>22</td>
<td>491.4</td>
<td>4085.8</td>
<td>36.48</td>
<td>1</td>
<td>491.4</td>
<td>19.66</td>
</tr>
</tbody>
</table>

Overall, in Figure 11, except Espresso-L, the Fibonacci variant support increase throughput by large area, so it is more acceptable for high-throughput application without considering lightweight design. Meanwhile, the Espresso-F x1 is better performance than the original Espresso x1 for compact devices, which does not need to process the large volume of data in a short time. The Galois-configured Espresso seems to be moderate, it is not suitable for straight-forward high-throughput applications, nor for resource-limited devices, but a trade-off between the two factors. As a result, the Galois-configured cipher is able to balance speed and area.

Figure 11: The hardware performance comparison among Espresso and Espresso-F

Besides, We list the optimal implementation results on Virtex-7 FPGA in Table 11 for Espresso and its variants with serial and all typical parallel widths optimizations. The minimum area of Espresso optimization on Virtex-7 FPGA only utilizes 22 slices and the highest throughput of Espresso in hybrid architecture produces keystream with more than 4 Gbps.
### Table 12: Comparison with Other Stream Ciphers

<table>
<thead>
<tr>
<th>Cipher</th>
<th>Security (Variant)</th>
<th>Security Level (bits)</th>
<th>#LUTs</th>
<th>Area /#FFs (Slices)</th>
<th>Freq. (MHz)</th>
<th>Thro. (Mbps)</th>
<th>T./A. (Mbps/ Family Slices)</th>
<th>Devices Device</th>
</tr>
</thead>
<tbody>
<tr>
<td>Grain v1</td>
<td>80</td>
<td>44</td>
<td>196.0</td>
<td>196.0</td>
<td>4.45</td>
<td>S3 XC3S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Grain v1 x16</td>
<td>80</td>
<td>348</td>
<td>130.0</td>
<td>2080.0</td>
<td>5.98</td>
<td>S3 XC3S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>MICKEY 2.0</td>
<td>80</td>
<td>115</td>
<td>233.0</td>
<td>233.0</td>
<td>2.03</td>
<td>S3 XC3S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>MICKEY 2.0</td>
<td>80</td>
<td>98</td>
<td>250.0</td>
<td>250.0</td>
<td>2.55</td>
<td>S3 XC3S700A</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Trivium</td>
<td>80</td>
<td>50</td>
<td>240.0</td>
<td>240.0</td>
<td>4.80</td>
<td>S3 XC3S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Trivium</td>
<td>80</td>
<td>149</td>
<td>326.0</td>
<td>326.0</td>
<td>2.19</td>
<td>S3 XC3S700A</td>
<td></td>
<td></td>
</tr>
<tr>
<td>WG-8</td>
<td>80</td>
<td>137</td>
<td>190.0</td>
<td>190.0</td>
<td>1.39</td>
<td>S3 XC3S1000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>WG-8 x11</td>
<td>80</td>
<td>398</td>
<td>192.0</td>
<td>2112.0</td>
<td>5.31</td>
<td>S3 XC3S1000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>E0</td>
<td>128</td>
<td>140</td>
<td>187.0</td>
<td>187.0</td>
<td>1.34</td>
<td>S3 XC3S700A</td>
<td></td>
<td></td>
</tr>
<tr>
<td>A5/1</td>
<td>128</td>
<td>57</td>
<td>174.0</td>
<td>174.0</td>
<td>3.05</td>
<td>S3 XC3S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>A5/1 x4</td>
<td>128</td>
<td>287</td>
<td>79.0</td>
<td>316.0</td>
<td>1.10</td>
<td>S3 XC3S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ZUC</td>
<td>128</td>
<td>1147</td>
<td>38.0</td>
<td>1216.0</td>
<td>1.06</td>
<td>S3 XC3S700A</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Snow3g</td>
<td>128</td>
<td>3559</td>
<td>104.0</td>
<td>3328.0</td>
<td>0.94</td>
<td>S3 XC3S700A</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Grain v1</td>
<td>80</td>
<td>66/87</td>
<td>250.0</td>
<td>250.0</td>
<td>9.62</td>
<td>S7 XC7S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Grain v1 x16</td>
<td>80</td>
<td>361/166</td>
<td>250.0</td>
<td>4000.0</td>
<td>36.03</td>
<td>S7 XC7S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>MICKEY 2.0</td>
<td>80</td>
<td>171214</td>
<td>51</td>
<td>250.0</td>
<td>4.90</td>
<td>S7 XC7S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Trivium</td>
<td>80</td>
<td>4932</td>
<td>22</td>
<td>385.0</td>
<td>17.50</td>
<td>S7 XC7S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Lizard</td>
<td>80</td>
<td>106/252</td>
<td>60</td>
<td>100.0</td>
<td>1.67</td>
<td>S7 XC7S50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Lizard x6</td>
<td>80</td>
<td>466/241</td>
<td>150</td>
<td>1200.0</td>
<td>8.00</td>
<td>S7 XC7S50</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

6.2 Comparison of Espresso and other ciphers

Espresso stream cipher supports 128 bits secret key and 96 bits initial vector. The standardized cipher supporting 128-bit secret key include stream ciphers SNOW3G [9], ZUC [10] and block cipher AES [29]. In addition, the eStream portfolio lightweight stream ciphers with 80-bit secret key include Trivium [5], Grain [6] and MICKEY [7]. Although some of the other stream ciphers, such as A5/1 [30] for GSM and E0 [31] for Bluetooth, have been confirmed vulnerable to attacks [1], [2], [3], we take their implementation results as references to evaluate Espresso hardware adaptivity.

Table 12 compares our optimized Espresso implementations with other stream ciphers. Our best Espresso FPGA implementations can achieve 52 slices and perform more than 1.8 Gbps on Spartan-3. The results show that our optimized Espresso design on FPGA is the smallest and most efficient (evaluated according to T./A.) solutions among 128-bit secret key ciphers, and it is much smaller than MICKEY 2.0, less 10 slices larger than Grain and Trivium.
7 Conclusions

In this paper, we concluded three stream ciphers Espresso and its variants, original Galois-configured Espresso with 14 feedback functions, Fibonacci-configured Espresso consisting of 2 Fibonacci NFSRs and Fibonacci-configured Espresso 256-bit LFSR filter generator, to investigate which configuration for stream cipher is more efficient, when all of them are implemented under the optimal strategies toward area and throughput respectively.

For serial solution, we explored the smallest area architecture by adjusting the occupation ratio of FF to LUT and apply this strategy for all variants’ implementations. To improve throughput, we designed the hybrid architecture without increasing the critical path delay significantly. After that, another strategy to improve throughput further was proposed, i.e., the higher feedback functions take the lower functions results as variables. This strategy caused much latency but improved throughput evidently.

According to our implementations, Fibonacci-configured Espresso FPGA architecture is smaller than that in Galois configuration, despite they both have the same quantity of logic gates, because several logic gates are packaged and synthesized as one 4-input or 6-input look-up-table. Under the premise of the equal parallel width, the Espresso hybrid architecture in Galois configuration has lower critical path propagation delay, which represents higher frequency than that in Fibonacci configuration. With regard to Espresso LFSR filter generator (Espresso-L), the variant not only has potential security weakness, but also inefficient in hardware implementation. The Espresso-L has a huge filter function with 104 variables, which lead to much combinational logic level and path delay.

The implementations of Espresso on Spartan-3 only take 52 slices under area optimized strategy in Fibonacci configuration and perform about 1.90 Gbps in hybrid architecture. Our optimal serial Espresso hardware scheme is smaller than most 128-bit secret key stream ciphers including ZUC and SNOW3G, even smaller than 80-bit secret key stream cipher MICKEY 2.0. Besides, our Espresso hardware design just occupies 22 slices on Virtex-7 FPGA at least, and parallelized design even performs more than 4 Gbps, satisfying compact design and high throughput demands.

To summarize, for the same series of ciphers, the transformation from nonlinear feedback shift register to linear feedback shift register do not improve security level, but is not suitable for hardware implementation. The Fibonacci configuration is smaller than the Galois configuration on FPGA applications, and the former can support the higher parallel width to improve throughput. When the device is not only resource-limited, but also slightly require high-throughput, i.e., the trade-off between area and speed, the Galois configuration is more worthy of recommendation.

As for the future work, we hope our analysis of Galois and Fibonacci configuration could provide reference for cipher hardware implementations, and our optimizations and throughput improvements’ strategies could be used for more stream ciphers.

Acknowledgement

This work was supported by the Fundamental Research Funds of Shandong University under Grant 2019HW036.

References


