June 9, 2021
Duc Tri Nguyen - George Mason University
This paper focuses on optimized constant-time software implementations of three NIST PQC KEM finalists, CRYSTALS-Kyber, NTRU, and Saber, targeting ARMv8 microprocessor cores. All optimized implementations include explicit calls to Advanced SIMD (Single Instruction, Multiple Data) instructions, also known as NEON instructions. Benchmarking is performed on two platforms: 1) a MacBook Air based on the Apple M1 System on Chip (SoC), which includes four high-performance 'Firestorm' ARMv8 cores running at around 3.2 GHz, and 2) a Raspberry Pi 4 single-board computer based on the Broadcom BCM2711 SoC, with four 1.5 GHz 64-bit Cortex-A72 ARMv8 cores. In each case, only one core of the respective SoC is used for benchmarking. The results demonstrate substantial speed-ups over the best available implementations written in pure C. For the 'Firestorm' core of the Apple M1, the NEON implementations outperform the pure C implementations in decapsulation by factors in the following ranges: 1.55-1.74 for Saber, 2.96-3.04 for Kyber, and 7.24-8.49 for NTRU, depending on the algorithm's variant and security level. For encapsulation, the corresponding ranges are 1.37-1.60 for Saber, 2.33-2.45 for Kyber, and 3.05-6.68 for NTRU. Because these speed-ups are uneven across the three lattice-based KEM finalists, they affect the finalists' rankings for optimized software implementations targeting ARMv8.