Presentation

Fast Skinny-128 SIMD Implementations for Sequential Modes of Operation

May 11, 2022

Presenters

Alexandre Adomnicai - CryptoNext Security

Description

While the primary goal of the NIST LWC project is to select standards for efficient authenticated encryption on resource-constrained devices (e.g. low-cost microcontrollers), these algorithms will inevitably be deployed on more sophisticated platforms (e.g. smartphones, servers) for interoperability purposes. Although such platforms have fewer computational limitations, dedicated efficient implementations remain desirable since they may have to communicate with many devices simultaneously. When high parallelism can be achieved in the operating mode in which the internal cryptographic primitive is used, one can always rely on highly bitsliced implementations, which can lead to excellent performance. Indeed, the best software results reported for Skinny-128 on Intel SIMD platforms are obtained by processing 64 128-bit blocks (i.e. 1 KiB) of data in parallel. However, such highly bitsliced implementations are not relevant when processing small payloads or for sequential (i.e. non-parallelizable) operating modes, as used in Romulus, one of the 10 NIST LWC finalists. In this talk, we introduce an implementation strategy to optimize the performance of Skinny-128 for sequential modes of operation on SIMD platforms. Our main optimization trick consists in decomposing the 8-bit S-box into smaller ones so that we can take advantage of SIMD-specific vector permute instructions to reach competitive performance without introducing secret-dependent timing variations. We applied our implementation strategy to all Romulus variants on ARM Neon and Intel SSE processors. As a result, we observe a speedup by a factor ranging from 1.5 to 4.5, depending on the computing platform, compared to fixsliced implementations, which constitute the current best constant-time option when processing blocks sequentially. Another benefit of our implementations is the memory footprint, since stack consumption is reduced by up to a factor of 5.
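For illustration only, the sketch below shows the general vector-permute lookup trick the abstract refers to, in C with SSE intrinsics: _mm_shuffle_epi8 (pshufb) performs sixteen constant-time 4-bit table lookups per instruction, so a small S-box can be applied to every nibble of a 128-bit state without any secret-dependent memory access. The table values and the function name nibble_sbox_layer are placeholders invented for this example, not the actual Skinny-128 decomposition presented in the talk; the same pattern maps to the TBL instructions on ARM Neon.

    /*
     * Minimal sketch of the vector-permute S-box trick (placeholder table,
     * not the actual Skinny-128 decomposition): pshufb performs sixteen
     * 4-bit table lookups per instruction, in constant time.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <tmmintrin.h>  /* _mm_shuffle_epi8 (SSSE3) */

    /* Apply a 4-bit S-box to every nibble of a 128-bit state. */
    static __m128i nibble_sbox_layer(__m128i state)
    {
        /* Placeholder 4-bit S-box: one entry per possible nibble value. */
        const __m128i sbox4 = _mm_setr_epi8(
            0xc, 0x6, 0x9, 0x0, 0x1, 0xa, 0x2, 0xb,
            0x3, 0x8, 0x5, 0xd, 0x4, 0xe, 0x7, 0xf);
        const __m128i mask = _mm_set1_epi8(0x0f);

        /* Split each byte into its low and high nibble. */
        __m128i lo = _mm_and_si128(state, mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(state, 4), mask);

        /* Each index byte is already < 16, so pshufb simply selects a table entry. */
        lo = _mm_shuffle_epi8(sbox4, lo);
        hi = _mm_shuffle_epi8(sbox4, hi);

        /* Recombine the substituted nibbles into bytes. */
        return _mm_or_si128(_mm_slli_epi16(hi, 4), lo);
    }

    int main(void)
    {
        uint8_t block[16];
        for (int i = 0; i < 16; i++)
            block[i] = (uint8_t)(i * 17);  /* sample input: 0x00, 0x11, ..., 0xff */

        __m128i state = _mm_loadu_si128((const __m128i *)block);
        state = nibble_sbox_layer(state);
        _mm_storeu_si128((__m128i *)block, state);

        for (int i = 0; i < 16; i++)
            printf("%02x ", block[i]);
        printf("\n");
        return 0;
    }

In the implementations described in the talk, the 8-bit Skinny S-box is decomposed into smaller S-boxes so that it can be evaluated through a short sequence of such permute-based lookups, which is what keeps a single, non-parallel primitive call both fast and constant-time without bitslicing data across many blocks.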

Presented at

LWC Workshop 2022


Related Topics

Security and Privacy: cryptography

Created May 10, 2022, Updated May 12, 2022