May 11, 2022
Alexandre Adomnicai - CryptoNext Security
While the primary goal of the NIST LWC project is to select standards for efficient authenticated encryption on resource-constrained devices (e.g. low-cost microcontrollers), these algorithms will inevitably be deployed on more sophisticated platforms (e.g. smartphones, servers) for interoperability purposes. Although such platforms have fewer computational limitations, dedicated efficient implementations remain valuable since these platforms may have to communicate with many devices simultaneously. When the operating mode in which the internal cryptographic primitive is used allows a high degree of parallelism, one can rely on highly bitsliced implementations that lead to excellent performance. In fact, the best software results reported for Skinny-128 on Intel SIMD platforms are obtained by processing 64 128-bit blocks (i.e. 1 KiB) of data in parallel. However, such highly bitsliced implementations are not relevant when processing small payloads or when using sequential (i.e. non-parallelizable) operating modes, as in Romulus, one of the 10 NIST LWC finalists. In this talk, we introduce an implementation strategy to optimize the performance of Skinny-128 for sequential modes of operation on SIMD platforms. Our main optimization trick consists in decomposing the 8-bit S-box into smaller ones so that we can take advantage of SIMD-specific vector permute instructions, reaching competitive performance without introducing secret-dependent timing variations. We applied our implementation strategy to all Romulus variants on ARM Neon and Intel SSE processors. As a result, we observe a speedup by a factor ranging from 1.5 to 4.5, depending on the computing platform, compared to fixsliced implementations, which constitute the current best constant-time option when processing blocks sequentially. Another benefit of our implementations is the reduced memory footprint, since the stack consumption is lowered by up to a factor of 5.
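The core building block behind this vector-permute approach can be illustrated with a short sketch. The snippet below is a minimal illustration, not the decomposition used in the talk: it applies a hypothetical 4-bit S-box (SBOX4, an assumed table chosen here only for demonstration) to both nibbles of 16 state bytes at once using the SSSE3 PSHUFB instruction (_mm_shuffle_epi8). Since PSHUFB indexes a 16-byte table held in a register, the lookup involves no secret-dependent memory accesses; the actual Skinny-128 decomposition additionally interleaves linear operations between such small table lookups.

/*
 * Minimal sketch, assuming a hypothetical 4-bit S-box SBOX4.
 * It shows how a vector permute (PSHUFB) performs 16 parallel,
 * constant-time 4-bit table lookups on both nibbles of each byte.
 */
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 4-bit S-box, stored as a 16-byte lookup table. */
static const uint8_t SBOX4[16] = {
    0xc, 0x6, 0x9, 0x0, 0x1, 0xa, 0x2, 0xb,
    0x3, 0x8, 0x5, 0xd, 0x4, 0xe, 0x7, 0xf
};

/* Apply SBOX4 to the low and high nibble of each of the 16 bytes in x.
 * PSHUFB selects table entries by the low 4 bits of each index byte, so
 * the lookup is constant-time (no secret-dependent memory accesses). */
static __m128i sbox4_layer(__m128i x)
{
    const __m128i table = _mm_loadu_si128((const __m128i *)SBOX4);
    const __m128i mask  = _mm_set1_epi8(0x0f);
    __m128i lo = _mm_and_si128(x, mask);                     /* low nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(x, 4), mask);  /* high nibbles */
    lo = _mm_shuffle_epi8(table, lo);   /* 16 parallel 4-bit lookups */
    hi = _mm_shuffle_epi8(table, hi);
    return _mm_or_si128(_mm_slli_epi16(hi, 4), lo);
}

int main(void)
{
    uint8_t state[16] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
                         0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff};
    __m128i x = _mm_loadu_si128((const __m128i *)state);
    x = sbox4_layer(x);
    _mm_storeu_si128((__m128i *)state, x);
    for (int i = 0; i < 16; i++)
        printf("%02x ", state[i]);
    printf("\n");
    return 0;
}

On ARM Neon, the same idea maps to the TBL instruction family, which likewise performs register-held table lookups and therefore keeps the S-box layer constant-time.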
LWC Workshop 2022