## Cryptology ePrint Archive: Report 2022/124

On the Performance Gap of a Generic C Optimized Assembler and Wide Vector Extensions for Masked Software with an Ascon-$\mathit{p}$ test case

Dor Salomon and Itamar Levi

Abstract: Efficient implementations of masked software designs are both an important goal and a significant challenge for security against Side Channel Analysis (SCA) attacks. In this manuscript we discuss the shortfall between generic C implementations and optimized (inline-)assembler versions, provide a large spectrum of efficient and generic implementations, and exemplify cryptographic algorithms and masking gadgets with reference to the state of the art. We show the main performance gaps to expect between different implementations and suggest how to harness the underlying hardware efficiently, a daunting task for any masking order or masking algorithm (multiplications, refreshing, etc.). This paper focuses on implementations targeting wide-vector bitsliced designs such as the ISAP algorithm. We explore concrete instances of implementations on processors that support wide-vector extensions of the Instruction Set Architecture (ISA); namely, the SSE2/3/4.1, AVX-2 and AVX-512 Streaming Single Instruction Multiple Data (SIMD) extensions. These extensions mainly enable efficient memory-level parallelism and provide a gradual reduction in computation time as a function of the extension level and the hardware support for instruction-level parallelism. We also evaluate the disparities between $\mathit{generic}$ high-level language masking implementations for optimized (inline-)assemblers and conventional single-execution-path data-path architectures such as the ARM architecture. We underscore the crucial trade-off between storing the state in data memory and keeping it in the register file (RF). This trade-off relates specifically to masked designs, and is particularly difficult to resolve because it requires inline-assembler manipulations and is not naively supported by compilers.
Moreover, as the masking order ($d$) increases and the state grows, data-memory accesses for state handling must increase, since the RF is simply not large enough. This requires careful optimization that depends to a considerable extent on the algorithm being implemented. We discuss how full utilization of SSE extensions is not always possible, i.e., when $d$ is not a power of two, and pinpoint the optimal values of $d$ as well as the highly sub-optimal values that severely under-utilize the hardware. More generally, this manuscript presents several different fully generic masked implementations for any order and multiple highly optimized (inline-)assembler instances which are quite generic (for a wide spectrum of ISAs), and provides very specific implementations targeting specific extensions. The goal is to promote open-source availability, research, improvement and implementations relating to SCA security and masked designs. The building blocks and methodologies provided here are portable and can be easily adapted to other algorithms.
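
To make the notion of a generic C masking gadget concrete, the following is a minimal sketch of a Boolean-masked AND (in the style of the well-known ISW multiplication) and a simple refresh, written for any order $d$ with $n = d + 1$ shares and operating on 64-bit bitsliced words. It is not taken from the authors' code: the names `mask`, `unmask`, `masked_and`, `refresh`, `rand64` and the bound `MAX_SHARES` are all hypothetical, and the xorshift generator is only a placeholder for a proper randomness source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on the number of shares (n = d + 1) in this sketch. */
#define MAX_SHARES 8

/* Toy xorshift64 generator. ASSUMPTION: placeholder only, NOT suitable
 * for real masking; a deployment would use a TRNG or a vetted PRNG. */
static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;

static uint64_t rand64(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Split x into n Boolean shares whose XOR equals x. */
static void mask(uint64_t x, uint64_t *s, size_t n)
{
    s[0] = x;
    for (size_t i = 1; i < n; i++) {
        s[i] = rand64();
        s[0] ^= s[i];
    }
}

/* Recombine n shares (for testing only; never done on secrets in practice). */
static uint64_t unmask(const uint64_t *s, size_t n)
{
    uint64_t x = 0;
    for (size_t i = 0; i < n; i++)
        x ^= s[i];
    return x;
}

/* ISW-style masked AND: the XOR of c[] equals (XOR of a[]) & (XOR of b[]).
 * Requires 1 <= n <= MAX_SHARES. Uses n*(n-1)/2 fresh random words. */
static void masked_and(uint64_t *c, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t acc[MAX_SHARES];

    for (size_t i = 0; i < n; i++)
        acc[i] = a[i] & b[i];

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            uint64_t r = rand64();
            acc[i] ^= r;
            /* The random word is added first so that no intermediate
             * value is the unmasked sum of the two cross products. */
            uint64_t t = r ^ (a[i] & b[j]);
            t ^= a[j] & b[i];
            acc[j] ^= t;
        }
    }

    for (size_t i = 0; i < n; i++)
        c[i] = acc[i];
}

/* Simple refresh: re-randomizes the shares without changing their XOR. */
static void refresh(uint64_t *s, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint64_t r = rand64();
        s[0] ^= r;
        s[i] ^= r;
    }
}
```

For example, calling `mask` on two words with `n = 4`, passing the resulting share arrays to `masked_and`, and then calling `unmask` on the output returns the plain AND of the two inputs. Note that an optimizing C compiler is free to reorder or fuse the XORs above, which can silently break the intended evaluation order; that is one reason a sketch like this is functionally correct but carries no security guarantee on its own.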

Category / Keywords: AVX, Countermeasures, Code-Size, Low-Cost, Masking, Side-Channel Analysis, Security Order, SIMD, SSE