Algorithm Accelerators

IP-ALGO-001 v2.0 FPGA Validated

Dyber's algorithm accelerator portfolio provides complete, FPGA-validated hardware implementations of all three NIST post-quantum cryptographic standards (FIPS 203, FIPS 204, and FIPS 205) plus protocol-level offload engines. Each accelerator performs the full algorithm lifecycle (key generation, encapsulation or signing, and decapsulation or verification) entirely in hardware, delivering roughly 8–28× speedups over software implementations, depending on the operation.

Overview

Post-quantum cryptographic algorithms are computationally intensive by design. Lattice-based schemes like ML-KEM and ML-DSA rely on polynomial arithmetic over large rings, requiring thousands of modular multiplications per operation. Hash-based schemes like SLH-DSA require extensive Keccak permutation chains. In software, these operations impose 5–50× overhead versus classical cryptography (RSA/ECC) — a performance gap that hardware acceleration eliminates.

Dyber algorithm accelerators are self-contained cryptographic engines. The integrator instantiates each accelerator as a memory-mapped peripheral, writes input data (public keys, messages, ciphertext) through the AMBA interface, triggers the operation, and reads the result. The internal architecture (NTT engines, hash cores, sampling units, and modular arithmetic) is entirely abstracted from the integrator.
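The write, trigger, and read flow can be sketched from the host side as follows. This is a minimal model, not Dyber's driver API: the register offsets (`REG_CTRL`, `REG_STATUS`), the opcode value, and the `MockAccelerator` class are hypothetical stand-ins, and the mock simply hashes its input so that the sketch runs without hardware.

```python
import hashlib

# Hypothetical register map; real offsets would come from the integration guide.
REG_CTRL = 0x00     # writing an opcode here starts an operation
REG_STATUS = 0x04   # bit 0 = done
OP_ENCAPS = 0x2     # hypothetical opcode
STATUS_DONE = 0x1


class MockAccelerator:
    """Software stand-in for a memory-mapped accelerator (illustration only)."""

    def __init__(self):
        self.regs = {REG_CTRL: 0, REG_STATUS: 0}
        self.in_buf = bytearray()
        self.out_buf = b""

    def write_reg(self, offset, value):
        self.regs[offset] = value
        if offset == REG_CTRL and value:
            # Placeholder "operation": a real core would run the PQC algorithm.
            self.out_buf = hashlib.sha3_256(bytes(self.in_buf)).digest()
            self.regs[REG_STATUS] = STATUS_DONE

    def read_reg(self, offset):
        return self.regs[offset]

    def write_data(self, data):
        self.in_buf += data

    def read_data(self):
        return self.out_buf


def run_operation(dev, opcode, payload, max_polls=1000):
    """Write inputs, trigger the operation, poll for completion, read the result."""
    dev.write_data(payload)
    dev.write_reg(REG_CTRL, opcode)
    for _ in range(max_polls):
        if dev.read_reg(REG_STATUS) & STATUS_DONE:
            return dev.read_data()
    raise TimeoutError("accelerator did not signal completion")


result = run_operation(MockAccelerator(), OP_ENCAPS, b"example public key bytes")
```

A production driver would more likely wait on an interrupt than poll, but the bounded polling loop keeps the sketch short and guarantees it terminates.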

| Accelerator | NIST Standard | Type | Security Levels |
| --- | --- | --- | --- |
| DYBER-MLKEM | FIPS 203 | Key Encapsulation Mechanism | L1 (512), L3 (768), L5 (1024) |
| DYBER-MLDSA | FIPS 204 | Digital Signature | L2 (44), L3 (65), L5 (87) |
| DYBER-SLH | FIPS 205 | Hash-Based Signature | L1 (128f/128s), L5 (256f) |
| DYBER-HKEM | IETF Draft | Hybrid KEM Bridge | ECDH + ML-KEM composite |
| DYBER-TLS | RFC 8446 | TLS 1.3 Handshake Offload | ML-KEM + ML-DSA |
| DYBER-SBOOT | N/A | Secure Boot Verification | ML-DSA chain of trust |

DYBER-MLKEM — ML-KEM Key Encapsulation

The DYBER-MLKEM accelerator implements ML-KEM, the key encapsulation mechanism derived from CRYSTALS-Kyber and standardized in FIPS 203. ML-KEM is the primary algorithm for post-quantum key exchange and is expected to replace ECDH in TLS, IPsec, SSH, and virtually all transport-layer security protocols.

The accelerator performs three operations: KeyGen (generate public/private key pair), Encapsulate (produce ciphertext and shared secret from a public key), and Decapsulate (recover shared secret from ciphertext and private key). All operations execute entirely in hardware with sub-microsecond to low-microsecond latencies — enabling per-connection key exchange at rates that software implementations cannot approach.

ML-KEM Internal Architecture

Internally, the DYBER-MLKEM accelerator is organized as a multi-stage pipeline:

┌──────────────────────────────────────────────────────────────┐
│  AMBA Bus Interface (AXI4-Lite control / AXI4-Stream data)   │
├──────────────────────────────────────────────────────────────┤
│  Command Sequencer  — FSM routing KeyGen/Encaps/Decaps      │
├─────────┬────────────┬──────────────┬────────────────────────┤
│ NTT     │ SHAKE-XOF  │ SAMPLER-CBD  │ POLY-ARITH           │
│ Engine  │ Hash Core  │ Noise Gen    │ Coefficient Ops      │
├─────────┴────────────┴──────────────┴────────────────────────┤
│  Coefficient Memory  — Banked BRAM for polynomial storage    │
├──────────────────────────────────────────────────────────────┤
│  Key Buffer  — Isolated storage for private key material    │
└──────────────────────────────────────────────────────────────┘

The Command Sequencer orchestrates the data flow for each operation. For encapsulation, the sequence is: parse public key → generate randomness via SHAKE → CBD noise sampling → NTT forward transform → polynomial multiply-accumulate → NTT inverse → compress → output ciphertext + shared secret. Each sub-operation executes on a dedicated functional unit with data forwarded through internal FIFOs.
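To make one stage of that sequence concrete, the following software sketch shows the SHAKE-then-CBD step as FIPS 203 defines it (the PRF followed by SamplePolyCBD). It illustrates the arithmetic only and says nothing about how the SAMPLER-CBD block is built internally; the function names are chosen for this example.

```python
import hashlib

Q = 3329  # ML-KEM modulus
N = 256   # coefficients per polynomial


def bytes_to_bits(data):
    """Little-endian bit unpacking, as in FIPS 203 BytesToBits."""
    return [(byte >> j) & 1 for byte in data for j in range(8)]


def sample_poly_cbd(data, eta):
    """Centered binomial sampling: 64*eta bytes in, 256 coefficients mod Q out."""
    if len(data) != 64 * eta:
        raise ValueError("expected 64*eta input bytes")
    bits = bytes_to_bits(data)
    coeffs = []
    for i in range(N):
        x = sum(bits[2 * i * eta + j] for j in range(eta))
        y = sum(bits[2 * i * eta + eta + j] for j in range(eta))
        coeffs.append((x - y) % Q)  # value in [-eta, eta], stored mod Q
    return coeffs


def prf(seed, nonce, eta):
    """PRF_eta(s, b): SHAKE256 over s || b, producing 64*eta bytes."""
    return hashlib.shake_256(seed + bytes([nonce])).digest(64 * eta)


seed = bytes(32)  # all-zero seed, for illustration only
noise = sample_poly_cbd(prf(seed, 0, eta=2), eta=2)
```

Each coefficient consumes only 2η bits of hash output, so a sampler of this shape can run in lockstep with the hash core's output stream.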

The NTT engine within DYBER-MLKEM is configurable at synthesis time — integrators choose from NTT-R2 through NTT-R32 depending on throughput requirements. The NTT configuration determines the overall accelerator area and latency characteristics.

ML-KEM Security Level Variants

| Variant | NIST Level | Classical Equivalent | Key Size (pk/sk) | Ciphertext | Shared Secret |
| --- | --- | --- | --- | --- | --- |
| ML-KEM-512 | Level 1 | AES-128 | 800 / 1,632 bytes | 768 bytes | 32 bytes |
| ML-KEM-768 | Level 3 | AES-192 | 1,184 / 2,400 bytes | 1,088 bytes | 32 bytes |
| ML-KEM-1024 | Level 5 | AES-256 | 1,568 / 3,168 bytes | 1,568 bytes | 32 bytes |

The accelerator supports runtime parameter selection — a single hardware instance can process ML-KEM-512, -768, or -1024 operations without reconfiguration. The parameter set is specified as part of each command. Resource utilization scales with the chosen NTT engine and maximum supported security level.
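Because the parameter set travels with each command, host software has to size its key and ciphertext buffers per operation. The sketch below derives the byte sizes from the FIPS 203 parameters (k, d_u, d_v); the helper name `ml_kem_sizes` is illustrative and not part of any Dyber interface.

```python
# (k, du, dv) for each FIPS 203 parameter set
ML_KEM_PARAMS = {
    "ML-KEM-512": (2, 10, 4),
    "ML-KEM-768": (3, 10, 4),
    "ML-KEM-1024": (4, 11, 5),
}


def ml_kem_sizes(name):
    """Byte sizes of keys, ciphertext, and shared secret for a parameter set."""
    k, du, dv = ML_KEM_PARAMS[name]
    poly_bytes = 256 * 12 // 8  # 256 coefficients at 12 bits each = 384 bytes
    return {
        "public_key": poly_bytes * k + 32,       # t-hat plus the 32-byte seed rho
        "private_key": 2 * poly_bytes * k + 96,  # s-hat, public key, H(pk), z
        "ciphertext": 32 * (du * k + dv),        # compressed u and v
        "shared_secret": 32,
    }


for name in ML_KEM_PARAMS:
    print(name, ml_kem_sizes(name))
```

Sizing buffers for ML-KEM-1024 covers all three variants, at the cost of some unused space for the smaller sets.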

Performance note: FPGA-validated hardware delivers 12–28× acceleration over optimized software (OpenSSL 3.2 on current-generation server processors) for equivalent operations. Exact latency and throughput figures are available under NDA in the evaluation datasheet. Contact ip-sales@dyber.com for access.

DYBER-MLDSA — ML-DSA Digital Signatures

The DYBER-MLDSA accelerator implements ML-DSA, the digital signature algorithm derived from CRYSTALS-Dilithium and standardized in FIPS 204. ML-DSA provides post-quantum digital signatures for code signing, certificate authentication, document integrity, and any application currently using RSA or ECDSA signatures.

Three operations are supported: KeyGen (generate signing/verification key pair), Sign (produce signature over a message), and Verify (validate signature against message and public key).

ML-DSA Internal Architecture

ML-DSA signing is architecturally more complex than ML-KEM due to its rejection sampling loop — the algorithm may need to restart signing if intermediate values exceed certain bounds. Dyber's implementation handles this entirely in hardware with a dedicated retry controller that restarts the signing pipeline without host CPU intervention.

┌──────────────────────────────────────────────────────────────┐
│  AMBA Bus Interface                                          │
├──────────────────────────────────────────────────────────────┤
│  Command Sequencer  +  Rejection Retry Controller           │
├─────────┬──────────┬───────────┬──────────────────────────────┤
│ NTT     │ SHAKE    │ SAMPLER   │ POLY-ARITH                │
│ Engine  │ XOF      │ Uniform + │ Coefficient arithmetic    │
│         │          │ CBD + Rej │ + norm checking           │
├─────────┴──────────┴───────────┴──────────────────────────────┤
│  Polynomial Memory  — Larger than ML-KEM (23-bit coefficients)│
├──────────────────────────────────────────────────────────────┤
│  Key Buffer  — Signing key isolation + hint generation       │
└──────────────────────────────────────────────────────────────┘

The rejection retry controller checks each candidate signature before releasing it. If the infinity norm of z reaches γ1 − β, if the low-order bits of w − cs2 reach γ2 − β, if ct0 reaches γ2, or if the hint vector contains more than ω ones, the controller advances the counter used to derive the masking vector and restarts the pipeline. Average signing latency therefore scales with the expected number of attempts, which FIPS 204 lists as about 4.25 for ML-DSA-44, 5.1 for ML-DSA-65, and 3.85 for ML-DSA-87. The hardware pipeline restarts with no host round trip between attempts, making each retry significantly faster than a software restart.
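The effect of the retry loop on average latency can be modeled with a few lines of arithmetic. The sketch below treats every attempt as an independent trial with a fixed acceptance probability, which is a simplifying assumption, and takes the expected repetition counts from FIPS 204. The `cycles_per_attempt` input is a placeholder; no measured Dyber figure is implied.

```python
import random

# Expected signing-loop repetitions listed in FIPS 204 for each parameter set.
EXPECTED_REPETITIONS = {"ML-DSA-44": 4.25, "ML-DSA-65": 5.1, "ML-DSA-87": 3.85}


def attempts_until_accept(p_accept, rng):
    """Model one signing call: retry until a candidate passes every check."""
    attempts = 1
    while rng.random() >= p_accept:
        attempts += 1  # hardware analogue: advance the counter, restart the pipeline
    return attempts


def mean_attempts(name, trials=20000, seed=1):
    """Monte Carlo estimate of the average number of attempts per signature."""
    rng = random.Random(seed)
    p_accept = 1.0 / EXPECTED_REPETITIONS[name]
    total = sum(attempts_until_accept(p_accept, rng) for _ in range(trials))
    return total / trials


def mean_sign_cycles(name, cycles_per_attempt):
    """Average latency if every attempt costs the same number of cycles (assumption)."""
    return EXPECTED_REPETITIONS[name] * cycles_per_attempt
```

The attempt count follows a geometric distribution, so individual signatures can take far more attempts than the average. Worst-case latency budgets should account for that tail rather than the mean alone.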

Verify-only configuration: For applications that only need signature verification (e.g., secure boot, certificate validation), Dyber offers a verify-only variant of DYBER-MLDSA that omits the signing pipeline and rejection controller. This reduces area by approximately 35–40% while retaining full verification throughput.

ML-DSA Security Level Variants

| Variant | NIST Level | Classical Equivalent | Public Key | Signature |
| --- | --- | --- | --- | --- |
| ML-DSA-44 | Level 2 | ~SHA-256 collision | 1,312 bytes | 2,420 bytes |
| ML-DSA-65 | Level 3 | ~AES-192 | 1,952 bytes | 3,309 bytes |
| ML-DSA-87 | Level 5 | ~AES-256 | 2,592 bytes | 4,627 bytes |

Like DYBER-MLKEM, the signature accelerator supports runtime parameter selection across all three security levels from a single hardware instance.
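For buffer sizing across the three levels, the FIPS 204 public key and signature lengths can be derived from the parameter sets. The sketch below follows the encoding lengths defined in FIPS 204; the helper name `ml_dsa_sizes` is illustrative only.

```python
Q = 8380417  # ML-DSA modulus
D = 13       # low-order bits dropped from t

# (k, l, lambda, gamma1, omega) for each FIPS 204 parameter set
ML_DSA_PARAMS = {
    "ML-DSA-44": (4, 4, 128, 1 << 17, 80),
    "ML-DSA-65": (6, 5, 192, 1 << 19, 55),
    "ML-DSA-87": (8, 7, 256, 1 << 19, 75),
}


def ml_dsa_sizes(name):
    """Byte sizes of the public key and signature for a parameter set."""
    k, l, lam, gamma1, omega = ML_DSA_PARAMS[name]
    t1_bits = (Q - 1).bit_length() - D      # 23 - 13 = 10 bits per t1 coefficient
    z_bits = 1 + (gamma1 - 1).bit_length()  # bits per coefficient of z
    return {
        "public_key": 32 + 32 * k * t1_bits,                     # rho plus packed t1
        "signature": lam // 4 + 32 * l * z_bits + omega + k,     # c-tilde, z, hint
    }


for name in ML_DSA_PARAMS:
    print(name, ml_dsa_sizes(name))
```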

DYBER-SLH — SLH-DSA Hash-Based Signatures

The DYBER-SLH accelerator implements SLH-DSA, the stateless hash-based signature scheme derived from SPHINCS+ and standardized in FIPS 205. Its security relies solely on the security of the underlying hash function, offering a conservative alternative to lattice-based schemes for applications requiring defense-in-depth against potential future cryptanalytic advances in lattice problems.

SLH-DSA Architecture

Unlike lattice-based algorithms, SLH-DSA does not use NTT or polynomial arithmetic. Instead, it relies on extensive hash tree computation — generating and traversing Merkle trees of WOTS+ one-time signatures. The computational bottleneck is raw Keccak permutation throughput.

DYBER-SLH integrates multiple KECCAK-CORE instances operating in parallel to accelerate the hash tree construction. The number of parallel hash instances is configurable at synthesis time, trading area for signing speed.
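The area-for-speed trade can be illustrated with a toy Merkle tree. The sketch below is not the FIPS 205 construction (it omits the address and public-seed inputs and uses a plain SHAKE256 two-to-one hash), and `batches_needed` is a deliberately simple scheduling model that assumes one hash per core per round.

```python
import hashlib


def node_hash(left, right, n=16):
    """Toy two-to-one hash: SHAKE256 truncated to n bytes."""
    return hashlib.shake_256(left + right).digest(n)


def merkle_root(leaves):
    """Return (root, hashes_per_level) for a power-of-two list of leaves."""
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("leaf count must be a power of two")
    level = list(leaves)
    hashes_per_level = []
    while len(level) > 1:
        # Every pair in a level is independent, so parallel hash cores can split it.
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        hashes_per_level.append(len(level))
    return level[0], hashes_per_level


def batches_needed(hashes_per_level, cores):
    """Sequential hash rounds when `cores` instances each do one hash per round."""
    return sum(-(-count // cores) for count in hashes_per_level)


leaves = [hashlib.shake_256(bytes([i])).digest(16) for i in range(8)]
root, per_level = merkle_root(leaves)
```

In this model, adding cores shortens the wide lower levels of the tree but not the narrow top levels, so the speedup flattens out as the core count grows.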

| Variant | NIST Level | Signature Size | Use Case |
| --- | --- | --- | --- |
| SLH-DSA-128f | Level 1 | 17,088 bytes | Fast signing when larger signatures are acceptable |
| SLH-DSA-128s | Level 1 | 7,856 bytes | Small signatures when signing latency is tolerable |
| SLH-DSA-256f | Level 5 | 49,856 bytes | Maximum security with fast signing |

SLH-DSA signatures are significantly larger and slower to produce than ML-DSA signatures, but they rest on a fundamentally different security assumption. Dyber recommends SLH-DSA for long-lived root-of-trust signatures (firmware signing, root CA certificates), where independence from lattice assumptions provides additional assurance, and ML-DSA for high-volume operational signatures (TLS, API authentication).

DYBER-HKEM — Hybrid KEM Bridge

The DYBER-HKEM accelerator implements combined classical + post-quantum key exchange for transitional deployments that require backward compatibility with existing PKI infrastructure. The core performs both ECDH and ML-KEM key exchange in parallel and derives the final shared secret by combining both results.

| Mode | Classical Component | PQC Component | Combined Security |
| --- | --- | --- | --- |
| Hybrid-P256-512 | ECDH P-256 | ML-KEM-512 | 128-bit classical + Level 1 PQC |
| Hybrid-P384-768 | ECDH P-384 | ML-KEM-768 | 192-bit classical + Level 3 PQC |
| Hybrid-X25519-768 | X25519 | ML-KEM-768 | 128-bit classical + Level 3 PQC |

The hybrid bridge is designed for IETF draft compliance (draft-ietf-tls-hybrid-design) and supports the KDF combination methods specified for TLS 1.3 hybrid key exchange. The classical ECDH component uses a hardened ECC core with constant-time scalar multiplication.
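A minimal sketch of the concatenate-then-extract approach is shown below, using HKDF-Extract from RFC 5869. In the TLS 1.3 key schedule the concatenated value takes the place of the (EC)DHE shared secret. The function names are illustrative, and the component order used here (classical first) is an assumption that a real deployment would take from the negotiated group's definition.

```python
import hashlib
import hmac


def hkdf_extract(salt, ikm, hash_name="sha256"):
    """HKDF-Extract (RFC 5869): PRK = HMAC-Hash(salt, IKM)."""
    if not salt:
        salt = bytes(hashlib.new(hash_name).digest_size)  # default salt: zeros
    return hmac.new(salt, ikm, hash_name).digest()


def combine_shared_secrets(classical_ss, pqc_ss, salt=b""):
    """Concatenate both secrets and extract a single key from the result."""
    return hkdf_extract(salt, classical_ss + pqc_ss)


# Placeholder secrets; real values come from the ECDH and ML-KEM operations.
ecdh_secret = bytes(range(32))
mlkem_secret = bytes(range(32, 64))
session_secret = combine_shared_secrets(ecdh_secret, mlkem_secret)
```

Because both inputs feed a single extraction, the output stays unpredictable as long as either component secret does, which is the property hybrid key exchange is meant to provide.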

DYBER-TLS — TLS 1.3 Handshake Offload

The DYBER-TLS engine offloads the complete TLS 1.3 cryptographic handshake from the host CPU. Rather than accelerating individual operations, DYBER-TLS orchestrates the entire key exchange and authentication sequence in hardware:

ClientHello → ServerHello → Key Exchange → Certificate Verify → Finished

The engine integrates DYBER-MLKEM for key exchange, DYBER-MLDSA for certificate verification, and SHA-3 for transcript hashing. The host CPU provides the certificate chain and receives the negotiated session keys — all intermediate cryptographic operations execute in hardware without CPU involvement.
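The ordering constraint that such an engine has to enforce can be sketched as a small state machine. This models only the stage sequence shown above, not the TLS 1.3 message formats or Dyber's control logic; `Stage` and `HandshakeSequencer` are names invented for this example.

```python
from enum import IntEnum


class Stage(IntEnum):
    CLIENT_HELLO = 0
    SERVER_HELLO = 1
    KEY_EXCHANGE = 2
    CERTIFICATE_VERIFY = 3
    FINISHED = 4


class HandshakeSequencer:
    """Accepts handshake stages strictly in order; anything else is rejected."""

    def __init__(self):
        self.next_stage = Stage.CLIENT_HELLO
        self.done = False

    def advance(self, stage):
        if self.done or stage != self.next_stage:
            raise ValueError(f"unexpected stage {stage.name}")
        if stage == Stage.FINISHED:
            self.done = True  # session keys would be released to the host here
        else:
            self.next_stage = Stage(stage + 1)
        return self.done


def run_handshake(stages):
    """Feed stages to a fresh sequencer and report whether the handshake completed."""
    seq = HandshakeSequencer()
    for stage in stages:
        seq.advance(stage)
    return seq.done
```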

Target application: DPU/SmartNIC deployments where TLS termination is performed at line rate. The DYBER-TLS engine can sustain hundreds of thousands of full handshakes per second, enabling cloud providers to terminate PQC-protected TLS connections at the network edge without CPU overhead.

DYBER-SBOOT — PQC Secure Boot

The DYBER-SBOOT accelerator provides hardware-accelerated firmware chain-of-trust verification using ML-DSA signatures. It is optimized for the boot path: minimal area, fast verification, and simple integration with existing boot ROM architectures.

The core reads firmware images from memory, computes the hash, and verifies the ML-DSA signature against a root-of-trust public key stored in OTP or hardware fuses. Verification completes in microseconds, which is negligible compared to typical firmware load times, so quantum-resistant secure boot adds no meaningful boot latency.

DYBER-SBOOT supports chained verification (bootloader → kernel → application) with multiple public key slots for key rotation and revocation.
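The chained flow with key slots and revocation can be sketched as below. The Python standard library has no ML-DSA implementation, so `toy_sign` and `toy_verify` use a keyed hash as a placeholder; they are not a signature scheme and stand in only for the verify call. The `BootVerifier` class and its slot layout are hypothetical.

```python
import hashlib
import hmac


def toy_sign(key, image):
    """Placeholder for ML-DSA signing: a keyed hash over the image digest."""
    return hmac.new(key, hashlib.sha3_256(image).digest(), "sha256").digest()


def toy_verify(key, digest, signature):
    """Placeholder for ML-DSA verification (NOT a real signature check)."""
    expected = hmac.new(key, digest, "sha256").digest()
    return hmac.compare_digest(expected, signature)


class BootVerifier:
    def __init__(self, key_slots):
        self.key_slots = list(key_slots)  # models keys held in OTP or fuses
        self.revoked = set()

    def revoke(self, slot):
        self.revoked.add(slot)

    def verify_stage(self, image, signature, slot):
        if slot in self.revoked or not 0 <= slot < len(self.key_slots):
            return False
        digest = hashlib.sha3_256(image).digest()
        return toy_verify(self.key_slots[slot], digest, signature)

    def verify_chain(self, stages):
        """stages: (name, image, signature, slot) tuples; returns the first failing name."""
        for name, image, signature, slot in stages:
            if not self.verify_stage(image, signature, slot):
                return name  # boot would halt here
        return None  # every stage verified
```

Stopping at the first failure matters: a later stage must never run if the stage that loads it did not verify.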

Performance Comparison

Hardware acceleration provides roughly 8–28× improvement over software implementations across the PQC operations measured. The following table shows representative acceleration factors measured against software baselines on a current-generation server processor (Zen 4):

| Operation | Software Baseline | Hardware Acceleration Factor |
| --- | --- | --- |
| ML-KEM-768 KeyGen | OpenSSL 3.2 / Zen 4 | ~28× faster |
| ML-KEM-768 Encaps | OpenSSL 3.2 / Zen 4 | ~22× faster |
| ML-KEM-768 Decaps | OpenSSL 3.2 / Zen 4 | ~25× faster |
| ML-DSA-65 Sign | OpenSSL 3.2 / Zen 4 | ~12× faster (avg) |
| ML-DSA-65 Verify | OpenSSL 3.2 / Zen 4 | ~18× faster |
| SLH-DSA-128f Sign | Reference C / Zen 4 | ~8× faster |
| SLH-DSA-128f Verify | Reference C / Zen 4 | ~15× faster |

These figures represent FPGA-validated measurements. ASIC implementation at advanced process nodes is expected to further improve both latency and throughput while substantially reducing power consumption.

Why this matters: A single TLS 1.3 handshake with PQC key exchange (ML-KEM-768) and authentication (ML-DSA-65) requires multiple PQC operations. At cloud provider scale — millions of connections per second across thousands of servers — even small per-operation latency improvements translate to significant infrastructure savings. Hardware acceleration makes PQC deployment cost-neutral versus classical crypto at the infrastructure level.

Multi-Algorithm Deployment

Most real-world PQC deployments require multiple algorithms simultaneously — ML-KEM for key exchange and ML-DSA for authentication at minimum. Dyber accelerators are designed for co-deployment with shared submodules to reduce total area.

Shared NTT: ML-KEM and ML-DSA can share a single NTT engine instance (with different modulus configurations). The NTT engine switches between q=3329 (ML-KEM) and q=8380417 (ML-DSA) via register configuration with <10 cycle switching overhead.
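The idea of one butterfly datapath serving both moduli can be sketched in software. The example below runs the same radix-2 routine with either (q, ω) pair, using the roots of unity named in FIPS 203 (17 mod 3329) and FIPS 204 (1753 mod 8380417). As a simplification it uses a plain cyclic length-256 transform, whereas the standards specify negacyclic variants; the `CONFIGS` table merely stands in for the register configuration.

```python
def ntt(values, omega, q):
    """Iterative radix-2 NTT; omega must be a primitive len(values)-th root mod q."""
    a = list(values)
    n = len(a)
    j = 0
    for i in range(1, n):  # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, q)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):  # one butterfly: the same datapath for any q
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                w = w * w_len % q
        length <<= 1
    return a


def intt(values, omega, q):
    """Inverse transform: run the NTT with omega^-1, then scale by n^-1."""
    n_inv = pow(len(values), -1, q)
    return [x * n_inv % q for x in ntt(values, pow(omega, -1, q), q)]


# Switching algorithms changes only the modulus and the root of unity.
CONFIGS = {
    "ML-KEM": {"q": 3329, "omega": 17},                        # primitive 256th root
    "ML-DSA": {"q": 8380417, "omega": pow(1753, 2, 8380417)},  # square of a 512th root
}
```

The butterfly loop never references a specific modulus, which is why a hardware engine can switch between the two by reloading q and the twiddle factors. The datapath must be sized for the wider 23-bit ML-DSA coefficients.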

Shared SHAKE: Both ML-KEM and ML-DSA use SHAKE-128/256 extensively. A single SHAKE-XOF core can be time-multiplexed between algorithm accelerators when concurrent operation is not required.

Bundled Subsystems: Dyber offers pre-configured multi-algorithm bundles optimized for common deployment scenarios:

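| Bundle | Included Cores | Target Use Case |
| --- | --- | --- |
| PQC-TLS Bundle | MLKEM + MLDSA + shared NTT + SHAKE | TLS termination, web servers |
| PQC-HSM Bundle | All algorithms + KMU + MASK + QRNG | Hardware security modules |
| PQC-Edge Bundle | MLKEM-512 + SBOOT + minimal NTT | IoT gateways, constrained devices |
| PQC-DPU Bundle | TLS Engine + HKEM + KMU + QRNG | SmartNIC/DPU line-rate offload |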
PQC-DPU BundleTLS Engine + HKEM + KMU + QRNGSmartNIC/DPU line-rate offload