Algorithm Accelerators
Dyber's algorithm accelerator portfolio provides complete, FPGA-validated hardware implementations of all three NIST post-quantum cryptographic standards plus protocol-level offload engines. Each accelerator performs the full algorithm lifecycle — key generation, encapsulation/signing, and decapsulation/verification — entirely in hardware, delivering orders-of-magnitude performance improvement over software implementations.
Overview
Post-quantum cryptographic algorithms are computationally intensive by design. Lattice-based schemes like ML-KEM and ML-DSA rely on polynomial arithmetic over large rings, requiring thousands of modular multiplications per operation. Hash-based schemes like SLH-DSA require extensive Keccak permutation chains. In software, these operations impose 5–50× overhead versus classical cryptography (RSA/ECC) — a performance gap that hardware acceleration eliminates.
Dyber algorithm accelerators are self-contained cryptographic engines. The integrator instantiates each accelerator as a memory-mapped peripheral, writes input data (public keys, messages, ciphertext) through the AMBA interface, triggers the operation, and reads the result. The internal architecture (NTT engines, hash cores, sampling units, and modular arithmetic) is entirely abstracted from the integrator.
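The write, trigger, read flow described above can be sketched against a software mock. This is a minimal illustration only: the register offsets, bit names, and the `MockAccelerator` class are hypothetical and are not taken from any Dyber register map.

```python
from dataclasses import dataclass, field

# Hypothetical register offsets and bit masks, invented for this sketch.
REG_CTRL, REG_STATUS = 0x00, 0x04
CTRL_START, STATUS_DONE = 0x1, 0x1


@dataclass
class MockAccelerator:
    """Software stand-in for a memory-mapped accelerator (not a real core)."""
    regs: dict = field(default_factory=lambda: {REG_CTRL: 0, REG_STATUS: 0})
    in_buf: bytes = b""
    out_buf: bytes = b""

    def write_reg(self, offset: int, value: int) -> None:
        self.regs[offset] = value
        if offset == REG_CTRL and value & CTRL_START:
            # A real core would run the algorithm here. The mock reverses the
            # input so that the data flow is observable.
            self.out_buf = self.in_buf[::-1]
            self.regs[REG_STATUS] = STATUS_DONE

    def read_reg(self, offset: int) -> int:
        return self.regs[offset]


def run_operation(dev: MockAccelerator, payload: bytes) -> bytes:
    """Write input, trigger the operation, poll for completion, read result."""
    dev.in_buf = payload                      # 1. write input data
    dev.write_reg(REG_CTRL, CTRL_START)       # 2. trigger the operation
    while not dev.read_reg(REG_STATUS) & STATUS_DONE:
        pass                                  # 3. poll (or wait for an interrupt)
    return dev.out_buf                        # 4. read the result


result = run_operation(MockAccelerator(), b"public-key-bytes")
```

A production driver would typically replace the busy-wait with an interrupt or a bounded timeout.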
| Accelerator | NIST Standard | Type | Security Levels |
|---|---|---|---|
| DYBER-MLKEM | FIPS 203 | Key Encapsulation Mechanism | L1 (512), L3 (768), L5 (1024) |
| DYBER-MLDSA | FIPS 204 | Digital Signature | L2 (44), L3 (65), L5 (87) |
| DYBER-SLH | FIPS 205 | Hash-Based Signature | L1 (128f/128s), L5 (256f) |
| DYBER-HKEM | IETF Draft | Hybrid KEM Bridge | ECDH + ML-KEM composite |
| DYBER-TLS | RFC 8446 | TLS 1.3 Handshake Offload | ML-KEM + ML-DSA |
| DYBER-SBOOT | — | Secure Boot Verification | ML-DSA chain of trust |
DYBER-MLKEM: ML-KEM Key Encapsulation
The DYBER-MLKEM accelerator implements the complete ML-KEM key encapsulation mechanism, which is derived from CRYSTALS-Kyber and standardized in FIPS 203. ML-KEM is the primary algorithm for post-quantum key exchange and is expected to replace ECDH in TLS, IPsec, SSH, and virtually all transport-layer security protocols.
The accelerator performs three operations: KeyGen (generate public/private key pair), Encapsulate (produce ciphertext and shared secret from a public key), and Decapsulate (recover shared secret from ciphertext and private key). All operations execute entirely in hardware with sub-microsecond to low-microsecond latencies — enabling per-connection key exchange at rates that software implementations cannot approach.
ML-KEM Internal Architecture
Internally, the DYBER-MLKEM accelerator is organized as a multi-stage pipeline:
┌──────────────────────────────────────────────────────────────┐
│ AMBA Bus Interface (AXI4-Lite control / AXI4-Stream data) │
├──────────────────────────────────────────────────────────────┤
│ Command Sequencer — FSM routing KeyGen/Encaps/Decaps │
├─────────┬────────────┬──────────────┬────────────────────────┤
│ NTT │ SHAKE-XOF │ SAMPLER-CBD │ POLY-ARITH │
│ Engine │ Hash Core │ Noise Gen │ Coefficient Ops │
├─────────┴────────────┴──────────────┴────────────────────────┤
│ Coefficient Memory — Banked BRAM for polynomial storage │
├──────────────────────────────────────────────────────────────┤
│ Key Buffer — Isolated storage for private key material │
└──────────────────────────────────────────────────────────────┘
The Command Sequencer orchestrates the data flow for each operation. For encapsulation, the sequence is: parse public key → generate randomness via SHAKE → CBD noise sampling → NTT forward transform → polynomial multiply-accumulate → NTT inverse → compress → output ciphertext + shared secret. Each sub-operation executes on a dedicated functional unit with data forwarded through internal FIFOs.
The NTT engine within DYBER-MLKEM is configurable at synthesis time — integrators choose from NTT-R2 through NTT-R32 depending on throughput requirements. The NTT configuration determines the overall accelerator area and latency characteristics.
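As a software reference for what the NTT engine and polynomial multiply stage compute, the following sketch follows the NTT, inverse NTT, and base-case multiplication described in FIPS 203 for q = 3329, and checks the result against schoolbook multiplication modulo X^256 + 1. It is a plain, unoptimized model and says nothing about how the hardware radix variants are organized internally.

```python
import random

Q, N, ZETA = 3329, 256, 17  # ML-KEM modulus, degree, primitive 256th root of unity


def bitrev7(i: int) -> int:
    """Reverse the 7-bit binary representation of i."""
    return int(f"{i:07b}"[::-1], 2)


ZETAS = [pow(ZETA, bitrev7(i), Q) for i in range(128)]
GAMMAS = [pow(ZETA, 2 * bitrev7(i) + 1, Q) for i in range(128)]


def ntt(f):
    """Forward NTT: seven layers of butterflies, lengths 128 down to 2."""
    f, i, length = list(f), 1, 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            z = ZETAS[i]
            i += 1
            for j in range(start, start + length):
                t = z * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


def intt(f):
    """Inverse NTT, followed by scaling with 128^-1 mod q."""
    f, i, length = list(f), 127, 2
    while length <= 128:
        for start in range(0, N, 2 * length):
            z = ZETAS[i]
            i -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = z * (f[j + length] - t) % Q
        length *= 2
    inv128 = pow(128, -1, Q)
    return [x * inv128 % Q for x in f]


def multiply_ntts(f, g):
    """Pairwise product of degree-1 residues modulo X^2 - gamma_i."""
    h = [0] * N
    for i in range(128):
        a0, a1, b0, b1 = f[2 * i], f[2 * i + 1], g[2 * i], g[2 * i + 1]
        h[2 * i] = (a0 * b0 + a1 * b1 * GAMMAS[i]) % Q
        h[2 * i + 1] = (a0 * b1 + a1 * b0) % Q
    return h


def schoolbook(a, b):
    """Reference negacyclic product in Z_q[X]/(X^256 + 1)."""
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                c[k] = (c[k] + ai * bj) % Q
            else:
                c[k - N] = (c[k - N] - ai * bj) % Q  # X^256 = -1
    return c


rng = random.Random(0)
a = [rng.randrange(Q) for _ in range(N)]
b = [rng.randrange(Q) for _ in range(N)]
product = intt(multiply_ntts(ntt(a), ntt(b)))
```

The NTT path needs on the order of N log N multiplications per product, compared with N^2 for the schoolbook loop, which is why the NTT engine dominates throughput.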
ML-KEM Security Level Variants
| Variant | NIST Level | Classical Equivalent | Key Size (pk/sk) | Ciphertext | Shared Secret |
|---|---|---|---|---|---|
| ML-KEM-512 | Level 1 | AES-128 | 800 / 1,632 bytes | 768 bytes | 32 bytes |
| ML-KEM-768 | Level 3 | AES-192 | 1,184 / 2,400 bytes | 1,088 bytes | 32 bytes |
| ML-KEM-1024 | Level 5 | AES-256 | 1,568 / 3,168 bytes | 1,568 bytes | 32 bytes |
The accelerator supports runtime parameter selection — a single hardware instance can process ML-KEM-512, -768, or -1024 operations without reconfiguration. The parameter set is specified as part of each command. Resource utilization scales with the chosen NTT engine and maximum supported security level.
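The byte sizes for each parameter set follow directly from the FIPS 203 parameters k, d_u, and d_v. The sketch below derives them in software; the `MlKemParams` class and `select_params` helper are illustrative names, not part of any Dyber driver API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MlKemParams:
    k: int    # module rank
    du: int   # compression bits for the vector part of the ciphertext
    dv: int   # compression bits for the scalar part of the ciphertext

    @property
    def public_key_bytes(self) -> int:
        return 384 * self.k + 32

    @property
    def private_key_bytes(self) -> int:
        return 768 * self.k + 96

    @property
    def ciphertext_bytes(self) -> int:
        return 32 * (self.du * self.k + self.dv)


PARAMS = {
    512: MlKemParams(k=2, du=10, dv=4),
    768: MlKemParams(k=3, du=10, dv=4),
    1024: MlKemParams(k=4, du=11, dv=5),
}


def select_params(variant: int) -> MlKemParams:
    """Pick the parameter set for one command (hypothetical helper)."""
    return PARAMS[variant]


for variant in (512, 768, 1024):
    p = select_params(variant)
    print(variant, p.public_key_bytes, p.private_key_bytes, p.ciphertext_bytes)
```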
DYBER-MLDSA: ML-DSA Digital Signatures
The DYBER-MLDSA accelerator implements the complete ML-DSA digital signature algorithm, which is derived from CRYSTALS-Dilithium and standardized in FIPS 204. ML-DSA provides post-quantum digital signatures for code signing, certificate authentication, document integrity, and any application currently using RSA or ECDSA signatures.
Three operations are supported: KeyGen (generate signing/verification key pair), Sign (produce signature over a message), and Verify (validate signature against message and public key).
ML-DSA Internal Architecture
ML-DSA signing is architecturally more complex than ML-KEM due to its rejection sampling loop — the algorithm may need to restart signing if intermediate values exceed certain bounds. Dyber's implementation handles this entirely in hardware with a dedicated retry controller that restarts the signing pipeline without host CPU intervention.
┌──────────────────────────────────────────────────────────────┐
│ AMBA Bus Interface │
├──────────────────────────────────────────────────────────────┤
│ Command Sequencer + Rejection Retry Controller │
├─────────┬──────────┬───────────┬──────────────────────────────┤
│ NTT │ SHAKE │ SAMPLER │ POLY-ARITH │
│ Engine │ XOF │ Uniform + │ Coefficient arithmetic │
│ │ │ CBD + Rej │ + norm checking │
├─────────┴──────────┴───────────┴──────────────────────────────┤
│ Polynomial Memory — Larger than ML-KEM (23-bit coefficients)│
├──────────────────────────────────────────────────────────────┤
│ Key Buffer — Signing key isolation + hint generation │
└──────────────────────────────────────────────────────────────┘
The rejection retry controller checks intermediate values after each signing attempt. If the infinity norm of z reaches γ1 − β, the infinity norm of the low-order part r0 reaches γ2 − β, the infinity norm of ct0 reaches γ2, or the number of 1s in the hint vector exceeds ω, the controller advances the counter κ and restarts the pipeline. Average signing latency depends on the rejection rate (FIPS 204 lists an expected 5.1 attempts for ML-DSA-65), but the hardware pipeline restarts with zero overhead, making each retry significantly faster than a software restart.
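The accept/reject decision in the FIPS 204 signing loop can be modelled in a few lines. The sketch below uses the ML-DSA-65 constants from FIPS 204 and takes precomputed norms as inputs; the function names and the idea of feeding candidate tuples to a loop are illustrative assumptions, not a description of the controller's actual interface.

```python
from dataclasses import dataclass

Q = 8380417  # ML-DSA modulus


@dataclass(frozen=True)
class MlDsaParams:
    ell: int      # number of polynomials in the mask vector y
    gamma1: int   # coefficient range of y
    gamma2: int   # low-order rounding range
    beta: int     # tau * eta
    omega: int    # maximum number of 1s in the hint


ML_DSA_65 = MlDsaParams(ell=5, gamma1=1 << 19, gamma2=(Q - 1) // 32,
                        beta=196, omega=55)


def attempt_accepted(p, z_norm, r0_norm, ct0_norm, hint_weight) -> bool:
    """Apply the four validity checks to one signing attempt."""
    if z_norm >= p.gamma1 - p.beta:
        return False
    if r0_norm >= p.gamma2 - p.beta:
        return False
    if ct0_norm >= p.gamma2:
        return False
    return hint_weight <= p.omega


def run_retry_loop(p, candidates):
    """Return (attempt number, kappa) for the first accepted candidate.

    `candidates` is an iterable of (z_norm, r0_norm, ct0_norm, hint_weight)
    tuples standing in for the values a real pipeline would compute.
    """
    kappa = 0
    for attempt, values in enumerate(candidates, start=1):
        if attempt_accepted(p, *values):
            return attempt, kappa
        kappa += p.ell  # the counter advances by ell on every rejection
    raise RuntimeError("no candidate accepted")


candidates = [
    (1 << 19, 0, 0, 0),        # rejected: z too large
    (100, (Q - 1) // 32, 0, 0),  # rejected: r0 too large
    (100, 100, 100, 55),       # accepted
]
attempt, kappa = run_retry_loop(ML_DSA_65, candidates)
```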
Verify-only configuration: For applications that only need signature verification (e.g., secure boot, certificate validation), Dyber offers a verify-only variant of DYBER-MLDSA that omits the signing pipeline and rejection controller. This reduces area by approximately 35–40% while retaining full verification throughput.
ML-DSA Security Level Variants
| Variant | NIST Level | Classical Equivalent | Public Key | Signature |
|---|---|---|---|---|
| ML-DSA-44 | Level 2 | ~SHA-256 collision | 1,312 bytes | 2,420 bytes |
| ML-DSA-65 | Level 3 | ~AES-192 | 1,952 bytes | 3,309 bytes |
| ML-DSA-87 | Level 5 | ~AES-256 | 2,592 bytes | 4,627 bytes |
Like DYBER-MLKEM, the signature accelerator supports runtime parameter selection across all three security levels from a single hardware instance.
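For reference, the public key and signature sizes of each ML-DSA parameter set can be derived from the FIPS 204 parameters. The sketch below encodes those parameters in a small table; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MlDsaSizes:
    k: int             # rows of the public matrix
    ell: int           # columns of the public matrix
    lambda_bits: int   # collision strength of the challenge hash
    gamma1_bits: int   # bit length of gamma1 - 1
    omega: int         # maximum hint weight

    @property
    def public_key_bytes(self) -> int:
        # 32-byte seed plus k polynomials with 10 bits per coefficient
        return 32 + 32 * self.k * 10

    @property
    def signature_bytes(self) -> int:
        challenge = self.lambda_bits // 4
        z_part = self.ell * 32 * (1 + self.gamma1_bits)
        hint = self.omega + self.k
        return challenge + z_part + hint


ML_DSA = {
    44: MlDsaSizes(k=4, ell=4, lambda_bits=128, gamma1_bits=17, omega=80),
    65: MlDsaSizes(k=6, ell=5, lambda_bits=192, gamma1_bits=19, omega=55),
    87: MlDsaSizes(k=8, ell=7, lambda_bits=256, gamma1_bits=19, omega=75),
}

for variant, p in ML_DSA.items():
    print(f"ML-DSA-{variant}: pk={p.public_key_bytes} sig={p.signature_bytes}")
```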
DYBER-SLH: SLH-DSA Hash-Based Signatures
The DYBER-SLH accelerator implements SLH-DSA, which is derived from SPHINCS+ and standardized in FIPS 205. SLH-DSA provides stateless hash-based digital signatures whose security relies solely on the security of the underlying hash function, offering a conservative alternative to lattice-based schemes for applications requiring defense-in-depth against potential future cryptanalytic advances in lattice problems.
SLH-DSA Architecture
Unlike lattice-based algorithms, SLH-DSA does not use NTT or polynomial arithmetic. Instead, it relies on extensive hash tree computation — generating and traversing Merkle trees of WOTS+ one-time signatures. The computational bottleneck is raw Keccak permutation throughput.
DYBER-SLH integrates multiple KECCAK-CORE instances operating in parallel to accelerate the hash tree construction. The number of parallel hash instances is configurable at synthesis time, trading area for signing speed.
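The reason parallel hash instances help is that all nodes on one level of a Merkle tree are independent of each other. The sketch below builds a plain binary Merkle tree with SHAKE-256 and a worker pool standing in for parallel hash cores. It is a simplified illustration: real SLH-DSA hashing also mixes in public seeds and address fields, which are omitted here.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def shake_hash(data: bytes, n: int = 16) -> bytes:
    """n-byte SHAKE-256 digest (n = 16 matches the 128-bit parameter sets)."""
    return hashlib.shake_256(data).digest(n)


def merkle_root(leaves, workers: int = 4) -> bytes:
    """Hash leaves, then combine pairs level by level until one node is left.

    Each level is dispatched to a worker pool, because the nodes of a level
    do not depend on each other.
    """
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("leaf count must be a power of two")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        level = list(pool.map(shake_hash, leaves))
        while len(level) > 1:
            pairs = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
            level = list(pool.map(shake_hash, pairs))
    return level[0]


leaves = [bytes([i]) * 16 for i in range(8)]
root = merkle_root(leaves)
```

The result does not depend on the number of workers, only the wall-clock time does, which mirrors the area-for-speed trade-off described above.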
| Variant | NIST Level | Signature Size | Use Case |
|---|---|---|---|
| SLH-DSA-128f | Level 1 | 17,088 bytes | Fast signing when larger signatures are acceptable |
| SLH-DSA-128s | Level 1 | 7,856 bytes | Small signatures when signing latency is tolerable |
| SLH-DSA-256f | Level 5 | 49,856 bytes | Maximum security with fast signing |
SLH-DSA signatures are significantly larger, and signing is significantly slower, than with ML-DSA, but SLH-DSA rests on a fundamentally different security assumption. Dyber recommends SLH-DSA for long-lived root-of-trust signatures (firmware signing, root CA certificates) where independence from lattice assumptions provides additional assurance, and ML-DSA for high-volume operational signatures (TLS, API authentication).
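The signature sizes follow from the FIPS 205 parameters: a signature contains a randomizer, a FORS signature, and one WOTS+ signature plus authentication path per hypertree layer. The sketch below recomputes the sizes for the three variants discussed here; the helper names are illustrative.

```python
from dataclasses import dataclass


def wots_len(n: int, lg_w: int = 4) -> int:
    """Number of n-byte chains in one WOTS+ signature."""
    w = 1 << lg_w
    len1 = -(-8 * n // lg_w)  # ceil(8n / lg_w)
    # floor(log2(x)) computed with integers to avoid float rounding
    len2 = ((len1 * (w - 1)).bit_length() - 1) // lg_w + 1
    return len1 + len2


@dataclass(frozen=True)
class SlhDsaParams:
    n: int   # hash output length in bytes
    h: int   # total hypertree height
    d: int   # number of hypertree layers
    a: int   # height of each FORS tree
    k: int   # number of FORS trees

    @property
    def signature_bytes(self) -> int:
        randomizer = 1
        fors = self.k * (1 + self.a)
        hypertree = self.h + self.d * wots_len(self.n)
        return (randomizer + fors + hypertree) * self.n


SLH_DSA = {
    "128s": SlhDsaParams(n=16, h=63, d=7, a=12, k=14),
    "128f": SlhDsaParams(n=16, h=66, d=22, a=6, k=33),
    "256f": SlhDsaParams(n=32, h=68, d=17, a=9, k=35),
}

for name, p in SLH_DSA.items():
    print(f"SLH-DSA-{name}: {p.signature_bytes} bytes")
```

The "f" variants use more, shallower hypertree layers, which shortens signing at the cost of carrying more WOTS+ signatures.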
DYBER-HKEM: Hybrid KEM Bridge
The DYBER-HKEM accelerator implements combined classical + post-quantum key exchange for transitional deployments that require backward compatibility with existing PKI. The core performs both ECDH and ML-KEM key exchange in parallel and derives the final shared secret by combining both results.
| Mode | Classical Component | PQC Component | Combined Security |
|---|---|---|---|
| Hybrid-P256-512 | ECDH P-256 | ML-KEM-512 | 128-bit classical + Level 1 PQC |
| Hybrid-P384-768 | ECDH P-384 | ML-KEM-768 | 192-bit classical + Level 3 PQC |
| Hybrid-X25519-768 | X25519 | ML-KEM-768 | 128-bit classical + Level 3 PQC |
The hybrid bridge is designed for IETF draft compliance (draft-ietf-tls-hybrid-design) and supports the KDF combination methods specified for TLS 1.3 hybrid key exchange. The classical ECDH component uses a hardened ECC core with constant-time scalar multiplication.
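One common way to combine the two results is to concatenate the shared secrets and feed the result into the key schedule's extract step. The sketch below shows that idea with HKDF-Extract over SHA-256 from the Python standard library. The concatenation order, the all-zero salt, and the placeholder secrets are assumptions made for illustration; the order actually required depends on the negotiated hybrid group.

```python
import hashlib
import hmac


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256: PRK = HMAC-SHA256(salt, IKM)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def combine_shared_secrets(classical_ss: bytes, pqc_ss: bytes) -> bytes:
    """Concatenate both secrets (order is an assumption in this sketch)."""
    return classical_ss + pqc_ss


# Placeholder values standing in for real ECDH and ML-KEM outputs.
ecdh_ss = bytes(range(32))
mlkem_ss = bytes(range(32, 64))
salt = bytes(32)  # placeholder; TLS 1.3 derives this from the key schedule

prk = hkdf_extract(salt, combine_shared_secrets(ecdh_ss, mlkem_ss))
```

Because both secrets enter the extract step, the derived key stays secret as long as at least one of the two exchanges remains unbroken.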
DYBER-TLS: TLS 1.3 Handshake Offload
The DYBER-TLS engine offloads the complete TLS 1.3 cryptographic handshake from the host CPU. Rather than accelerating individual operations, DYBER-TLS orchestrates the entire key exchange and authentication sequence in hardware:
ClientHello → ServerHello → Key Exchange → Certificate Verify → Finished
The engine integrates DYBER-MLKEM for key exchange, DYBER-MLDSA for certificate verification, and SHA-3 for transcript hashing. The host CPU provides the certificate chain and receives the negotiated session keys — all intermediate cryptographic operations execute in hardware without CPU involvement.
Target application: DPU/SmartNIC deployments where TLS termination is performed at line rate. The DYBER-TLS engine can sustain hundreds of thousands of full handshakes per second, enabling cloud providers to terminate PQC-protected TLS connections at the network edge without CPU overhead.
DYBER-SBOOT: PQC Secure Boot
The DYBER-SBOOT accelerator provides hardware-accelerated firmware chain-of-trust verification using ML-DSA signatures. It is optimized for the boot path: minimal area, fast verification, and simple integration with existing boot ROM architectures.
The core reads firmware images from memory, computes the hash, and verifies the ML-DSA signature against a root-of-trust public key stored in OTP or hardware fuses. Verification completes in microseconds — negligible compared to typical firmware load times — enabling quantum-resistant secure boot with zero impact on boot latency.
DYBER-SBOOT supports chained verification (bootloader → kernel → application) with multiple public key slots for key rotation and revocation.
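The control flow of chained verification with key slots and revocation can be sketched as follows. The signature check is injected as a callable, and the demo uses an HMAC-based stand-in that is explicitly not ML-DSA (the Python standard library has no ML-DSA implementation). The `Stage` structure, slot numbering, and the use of SHA3-256 for the image hash are all assumptions for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    image: bytes
    signature: bytes
    key_slot: int


def verify_chain(stages, key_slots: dict, revoked: set,
                 verify: Callable[[bytes, bytes, bytes], bool]
                 ) -> tuple[bool, Optional[str]]:
    """Verify each stage in order; stop at the first failure.

    Returns (True, None) on success or (False, failing stage name).
    """
    for stage in stages:
        if stage.key_slot in revoked or stage.key_slot not in key_slots:
            return False, stage.name
        digest = hashlib.sha3_256(stage.image).digest()
        if not verify(key_slots[stage.key_slot], digest, stage.signature):
            return False, stage.name
    return True, None


def toy_verify(key: bytes, digest: bytes, signature: bytes) -> bool:
    """Stand-in verifier for the demo only. This is NOT ML-DSA."""
    expected = hmac.new(key, digest, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)


def toy_sign(key: bytes, image: bytes) -> bytes:
    """Matching stand-in signer for the demo only."""
    return hmac.new(key, hashlib.sha3_256(image).digest(), hashlib.sha256).digest()


slots = {0: b"root-key-0", 1: b"root-key-1"}
chain = [
    Stage("bootloader", b"BL", toy_sign(slots[0], b"BL"), 0),
    Stage("kernel", b"KERNEL", toy_sign(slots[0], b"KERNEL"), 0),
    Stage("application", b"APP", toy_sign(slots[1], b"APP"), 1),
]
ok, failed = verify_chain(chain, slots, revoked=set(), verify=toy_verify)
```

Revoking slot 1 in this model makes the chain fail at the application stage while earlier stages still verify.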
Performance Comparison
Hardware acceleration provides roughly an order-of-magnitude improvement over software implementations for all PQC operations, ranging from about 8× to 28× in the measurements below. The following table shows representative acceleration factors measured against optimized software on current-generation server processors:
| Operation | Software Baseline | Hardware Acceleration Factor |
|---|---|---|
| ML-KEM-768 KeyGen | OpenSSL 3.2 / Zen 4 | ~28× faster |
| ML-KEM-768 Encaps | OpenSSL 3.2 / Zen 4 | ~22× faster |
| ML-KEM-768 Decaps | OpenSSL 3.2 / Zen 4 | ~25× faster |
| ML-DSA-65 Sign | OpenSSL 3.2 / Zen 4 | ~12× faster (avg) |
| ML-DSA-65 Verify | OpenSSL 3.2 / Zen 4 | ~18× faster |
| SLH-DSA-128f Sign | Reference C / Zen 4 | ~8× faster |
| SLH-DSA-128f Verify | Reference C / Zen 4 | ~15× faster |
These figures represent FPGA-validated measurements. ASIC implementation at advanced process nodes is expected to further improve both latency and throughput while substantially reducing power consumption.
Multi-Algorithm Deployment
Most real-world PQC deployments require multiple algorithms simultaneously — ML-KEM for key exchange and ML-DSA for authentication at minimum. Dyber accelerators are designed for co-deployment with shared submodules to reduce total area.
Shared NTT: ML-KEM and ML-DSA can share a single NTT engine instance (with different modulus configurations). The NTT engine switches between q=3329 (ML-KEM) and q=8380417 (ML-DSA) via register configuration with <10 cycle switching overhead.
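The two moduli can share one butterfly datapath because the arithmetic is identical apart from q and the root of unity. The sketch below models a reconfigurable butterfly in software, using the roots given in FIPS 203 (ζ = 17, order 256) and FIPS 204 (ζ = 1753, order 512). The `SharedButterfly` class and its `configure` method are hypothetical names and do not describe the actual register interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NttConfig:
    q: int           # prime modulus
    zeta: int        # primitive root of unity
    root_order: int  # multiplicative order of zeta (a power of two)


ML_KEM_NTT = NttConfig(q=3329, zeta=17, root_order=256)
ML_DSA_NTT = NttConfig(q=8380417, zeta=1753, root_order=512)


def is_primitive_root(cfg: NttConfig) -> bool:
    """For a power-of-two order, zeta^(order/2) == -1 proves primitivity."""
    return pow(cfg.zeta, cfg.root_order // 2, cfg.q) == cfg.q - 1


class SharedButterfly:
    """Software model of one butterfly unit with a switchable modulus."""

    def __init__(self, cfg: NttConfig) -> None:
        self.cfg = cfg

    def configure(self, cfg: NttConfig) -> None:
        # Models the register write that switches the modulus.
        self.cfg = cfg

    def butterfly(self, a: int, b: int, w: int) -> tuple[int, int]:
        """Cooley-Tukey butterfly: (a + w*b, a - w*b) mod q."""
        q = self.cfg.q
        t = w * b % q
        return (a + t) % q, (a - t) % q


unit = SharedButterfly(ML_KEM_NTT)
kem_out = unit.butterfly(1, 2, ML_KEM_NTT.zeta)
unit.configure(ML_DSA_NTT)
dsa_out = unit.butterfly(1, 2, ML_DSA_NTT.zeta)
```

In hardware the multiplier and reduction logic must be sized for the 23-bit ML-DSA modulus, so sharing saves area only when the wider datapath is needed anyway.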
Shared SHAKE: Both ML-KEM and ML-DSA use SHAKE-128/256 extensively. A single SHAKE-XOF core can be time-multiplexed between algorithm accelerators when concurrent operation is not required.
Bundled Subsystems: Dyber offers pre-configured multi-algorithm bundles optimized for common deployment scenarios:
| Bundle | Included Cores | Target Use Case |
|---|---|---|
| PQC-TLS Bundle | MLKEM + MLDSA + shared NTT + SHAKE | TLS termination, web servers |
| PQC-HSM Bundle | All algorithms + KMU + MASK + QRNG | Hardware security modules |
| PQC-Edge Bundle | MLKEM-512 + SBOOT + minimal NTT | IoT gateways, constrained devices |
| PQC-DPU Bundle | TLS Engine + HKEM + KMU + QRNG | SmartNIC/DPU line-rate offload |