Integration Guide

IP-INT-001 v2.0

This guide covers the integration of Dyber PQC IP cores into SoC designs. Dyber IP operates as dedicated acceleration blocks that interface via standard AMBA buses — the same RTL integrates identically whether paired with x86, ARM, RISC-V, or any other processor architecture.

Overview

Integrating a Dyber IP core follows the same pattern as integrating any AMBA-compliant peripheral. The host CPU communicates with the accelerator through standard bus interfaces for control and data. The accelerator handles the computationally intensive cryptographic operations at hardware speed, returning results through the same bus interface.

A typical integration involves: instantiating the IP core RTL in the SoC design, connecting AMBA bus signals to the system interconnect, providing clock and reset, writing synthesis constraints, and building the software driver from the provided reference implementation.

Architecture-Agnostic Model

Dyber IP cores are designed as co-processor blocks. The CPU handles control flow, key management orchestration, and application logic, while the accelerator executes the mathematically intensive operations (NTTs, polynomial arithmetic, hash computations) in hardware. This is the same proven model used for existing hardware offload blocks such as AES/SHA co-processors and DMA engines.

The architecture-agnostic property arises from standardized bus interfaces. The AMBA protocol suite (AXI4, APB, AHB) is the de facto standard for on-chip interconnect across SoC architectures: the same protocols appear in x86 server SoCs, ARM-based DPUs, RISC-V microcontrollers, and FPGA soft-processor systems. Because Dyber IP speaks the same bus language as the other peripherals in the SoC, no architecture-specific modifications are required.

┌───────────────┐    ┌──────────────┐    ┌────────────────────────┐
│ Host CPU      │    │ AMBA         │    │ Dyber IP Core          │
│ (x86/ARM/     │◄──►│ Interconnect │◄──►│ (MLKEM, MLDSA, NTT...) │
│  RISC-V/any)  │    │              │    │                        │
└───────────────┘    └──────────────┘    └────────────────────────┘
         AXI4 / APB / AHB: same interface regardless of CPU

AMBA Bus Interfaces

Each IP core exposes one or more AMBA interfaces depending on its complexity and performance requirements:

| Interface | Role | When to Use |
|---|---|---|
| AXI4-Lite | Control registers, status, configuration | All cores (always present for command/status) |
| AXI4 (Full) | Bulk data transfer (keys, ciphertext, messages) | Algorithm accelerators handling large data payloads |
| AXI4-Stream | Streaming data path | High-throughput pipeline architectures (DPU, TLS offload) |
| APB | Low-power peripheral access | IoT / ultra-low-power deployments |
| AHB | Legacy system compatibility | Designs using AHB-based interconnect |
| Native FIFO | Minimal-overhead streaming | Direct IP-to-IP connections without bus overhead |

The interface selection is a compile-time configuration parameter. The memory-mapped control interfaces (AXI4-Lite, APB, AHB) expose the same register map and programming model, so switching from AXI4-Lite to APB changes only the bus protocol wrapper, not the core functionality or driver API.

FPGA Integration

FPGA integration is the primary validation path and the fastest way to evaluate Dyber IP in a working system. The IP is delivered with synthesis scripts and constraint templates for AMD/Xilinx FPGA families.

Vivado block design: Dyber IP cores are packaged as Vivado IP catalog components. Integration into a block design is drag-and-drop: instantiate the core, connect AXI interfaces to the processing system or system interconnect, and run connection automation.

Synthesis constraints: Clock constraints, I/O timing, and placement directives are provided as template XDC files. Timing closure has been achieved on the validated FPGA platforms with the provided constraints — integrators may need to adjust based on their specific design context and utilization.

FPGA-optimized vs. generic RTL: The FPGA-optimized variant uses DSP48E2 inference hints and BRAM primitive optimizations for maximum performance on Xilinx UltraScale+. The generic RTL variant uses pure inference-based coding with no vendor-specific constructs, suitable for porting to other FPGA families or ASIC flows.

Validated platforms: The primary validated platform is Zynq UltraScale+ (XCZU7EV) with ARM Cortex-A53 host integration via AXI. Validation on the Zynq-7000 series is also complete, which demonstrates migration across DSP primitive architectures (DSP48E1 on Zynq-7000 versus DSP48E2 on UltraScale+). For Versal AI Core/Premium, synthesis scripts are provided.

ASIC Integration Path

ASIC integration requires technology-specific work that Dyber scopes as joint engineering with the integrator's implementation team. The generic RTL variant (inference-only code with no vendor primitives) is the starting point for all ASIC flows.

| ASIC Deliverable | Description |
|---|---|
| Generic RTL | Inference-only code with no vendor primitives, included with production license |
| Synthesis support | Timing constraints, synthesis guidance, and critical path documentation for target process node |
| STA collaboration | Support for static timing analysis across PVT corners using the integrator's libraries |
| DFT integration | RTL structured for scan insertion; BIST hooks provided for memory and logic testing |
| Gate-level verification | Support for gate-level simulation with post-synthesis netlists |

Dyber provides dedicated engineering support throughout the ASIC flow. Final implementation uses the integrator's standard cell libraries, PDK, and design methodology — Dyber provides the design content and integration expertise.

Driver Architecture

Dyber provides reference driver implementations that demonstrate the complete programming model. Drivers are provided in bare-metal C and Linux kernel module form, targeting both x86-64 and ARM64 architectures.

Programming model: All Dyber IP cores follow a consistent command-based programming model:

1. Configure  — Write parameter set and operation type to control registers
2. Load       — Write input data (public key, message, ciphertext) to data registers or DMA buffer
3. Execute    — Write GO bit to command register
4. Wait       — Poll status register or wait for interrupt (deterministic cycle count)
5. Read       — Read output data (shared secret, signature, verification result)
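
The five steps above can be sketched as a single blocking helper in C. This is a minimal illustration, not the shipped driver API: the register offsets, bit positions, the `dyber_dev` type, and the `dyber_run` function are all hypothetical placeholders, and the real values come from the register map delivered with the IP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout (32-bit word offsets). Placeholders only;
 * the real offsets come from the register map shipped with the IP. */
enum {
    DYBER_REG_CTRL     = 0,   /* parameter set and operation type */
    DYBER_REG_CMD      = 1,   /* bit 0 = GO                       */
    DYBER_REG_STATUS   = 2,   /* bit 0 = DONE                     */
    DYBER_REG_DATA_IN  = 4,   /* first input data word            */
    DYBER_REG_DATA_OUT = 12,  /* first output data word           */
    DYBER_DATA_WORDS   = 8    /* size of each data window         */
};
#define DYBER_CMD_GO      0x1u
#define DYBER_STATUS_DONE 0x1u

typedef struct {
    volatile uint32_t *regs;  /* mapped base address of one core */
} dyber_dev;

static void reg_write(dyber_dev *d, size_t off, uint32_t v) { d->regs[off] = v; }
static uint32_t reg_read(dyber_dev *d, size_t off) { return d->regs[off]; }

/* Runs one operation using register-based transfer and polling.
 * Returns 0 on success, -1 on bad arguments or poll timeout. */
int dyber_run(dyber_dev *d, uint32_t ctrl,
              const uint32_t *in, size_t n_in,
              uint32_t *out, size_t n_out, unsigned max_polls)
{
    assert(d != NULL && d->regs != NULL);
    if (n_in > DYBER_DATA_WORDS || n_out > DYBER_DATA_WORDS)
        return -1;

    reg_write(d, DYBER_REG_CTRL, ctrl);                 /* 1. Configure */
    for (size_t i = 0; i < n_in; i++)                   /* 2. Load      */
        reg_write(d, DYBER_REG_DATA_IN + i, in[i]);
    reg_write(d, DYBER_REG_CMD, DYBER_CMD_GO);          /* 3. Execute   */

    while (!(reg_read(d, DYBER_REG_STATUS) & DYBER_STATUS_DONE)) {
        if (max_polls-- == 0)                           /* 4. Wait      */
            return -1;
    }
    for (size_t i = 0; i < n_out; i++)                  /* 5. Read      */
        out[i] = reg_read(d, DYBER_REG_DATA_OUT + i);
    return 0;
}
```

During bring-up, `regs` would point at the mapped base address of the core; in a host-side unit test it can point at an ordinary array standing in for the register file. An interrupt-driven variant would replace the polling loop in step 4 with a wait on the completion interrupt.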

Bare-metal library: Architecture-independent C library with a clean API. Provides register access abstractions, operation wrappers, and self-test functions. No OS dependencies — suitable for firmware, RTOS, or bare-metal environments.

Linux kernel module: Reference kernel module implementing a character device interface (/dev/dyber_pqc0). Supports both x86-64 and ARM64 with architecture-specific DMA configuration. Submission to the mainline kernel is planned in collaboration with integrators.

Adaptation: Reference drivers are starting points, not finished products. Platform-specific adaptation (register base addresses, interrupt routing, DMA configuration, integration with existing crypto subsystems) is expected as part of the integration process. Dyber provides engineering support for driver adaptation.

DMA & High-Throughput Patterns

For high-throughput applications (TLS termination, bulk signing), register-based data transfer becomes a bottleneck. Dyber IP cores with AXI4 or AXI4-Stream interfaces support DMA-based data transfer for maximum throughput.

Scatter-gather DMA: The AXI4 interface supports scatter-gather descriptor lists for processing multiple operations without CPU intervention between operations. The host CPU populates a descriptor ring with input buffer pointers and output buffer pointers, then kicks the DMA engine. The accelerator processes operations back-to-back from the descriptor ring.
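
The host side of such a descriptor ring might look like the sketch below. The descriptor layout, the ring size, and all names (`dyber_desc`, `dyber_ring`, `ring_post`, `ring_consume`) are assumptions made for illustration; the actual descriptor format is defined by the IP's programming guide. `ring_consume` stands in for the accelerator so the example can run on a host without hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u  /* power of two, so indices wrap with a mask */

/* Hypothetical descriptor: one input buffer and one output buffer. */
typedef struct {
    uint64_t src_addr;  /* bus address of input buffer  */
    uint64_t dst_addr;  /* bus address of output buffer */
    uint32_t src_len;   /* input length in bytes        */
    uint32_t dst_len;   /* output capacity in bytes     */
} dyber_desc;

typedef struct {
    dyber_desc desc[RING_SIZE];
    uint32_t head;  /* free-running count of descriptors posted by the host */
    uint32_t tail;  /* free-running count of descriptors consumed           */
} dyber_ring;

/* Number of free slots; unsigned subtraction handles counter wrap. */
uint32_t ring_free(const dyber_ring *r)
{
    return RING_SIZE - (r->head - r->tail);
}

/* Host side: queue one operation. Returns 0, or -1 if the ring is full.
 * A real driver would follow this with a memory barrier and a doorbell
 * register write to kick the DMA engine. */
int ring_post(dyber_ring *r, uint64_t src, uint32_t src_len,
              uint64_t dst, uint32_t dst_len)
{
    assert(r != NULL);
    if (ring_free(r) == 0)
        return -1;
    dyber_desc *d = &r->desc[r->head & (RING_SIZE - 1u)];
    d->src_addr = src;
    d->src_len  = src_len;
    d->dst_addr = dst;
    d->dst_len  = dst_len;
    r->head++;
    return 0;
}

/* Stand-in for the accelerator: take the oldest pending descriptor,
 * or return NULL when the ring is empty. */
const dyber_desc *ring_consume(dyber_ring *r)
{
    assert(r != NULL);
    if (r->head == r->tail)
        return NULL;
    return &r->desc[r->tail++ & (RING_SIZE - 1u)];
}
```

Keeping `head` and `tail` as free-running counters and masking only when indexing lets the ring use all `RING_SIZE` slots while still telling a full ring from an empty one.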

Streaming mode: AXI4-Stream interfaces enable continuous pipeline operation where input data flows through the accelerator without store-and-forward overhead. Best for network processing paths where packets flow through a pipeline of processing stages.

Interrupt coalescing: For high-throughput workloads, per-operation interrupts create excessive overhead. The IP supports configurable interrupt coalescing — raising a single interrupt after N operations complete or after a timeout, whichever comes first.
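
The "N operations or timeout, whichever comes first" rule can be modeled in a few lines of C. This is a behavioral sketch of the policy only, not the IP's implementation: the `coalescer` structure, the tick-based time base, and the choice to start the timer at the first unsignalled completion are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Behavioral model of interrupt coalescing (hypothetical names). */
typedef struct {
    uint32_t threshold;      /* N: raise after this many completions        */
    uint32_t timeout;        /* or after this many ticks with work pending  */
    uint32_t pending;        /* completions not yet signalled to the host   */
    uint32_t ticks_waiting;  /* ticks counted while pending is non-zero     */
} coalescer;

/* Advance the model by one tick. `completed` is the number of operations
 * that finished during this tick. Returns true when an interrupt should
 * be raised, and resets the counters in that case. */
bool coalescer_tick(coalescer *c, uint32_t completed)
{
    assert(c != NULL);
    c->pending += completed;
    if (c->pending == 0)
        return false;  /* nothing to report, timer stays idle */

    c->ticks_waiting++;
    if (c->pending >= c->threshold || c->ticks_waiting >= c->timeout) {
        c->pending = 0;
        c->ticks_waiting = 0;
        return true;
    }
    return false;
}
```

With `threshold = 4` and `timeout = 10`, four completions on consecutive ticks raise one interrupt on the fourth tick, while a single completion followed by idle ticks raises one interrupt on the tenth tick after it starts waiting. A larger threshold lowers interrupt load at the cost of completion latency, which the timeout bounds.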

Multi-Instance Deployment

For applications requiring higher throughput than a single core can provide, multiple IP core instances can be deployed in parallel. Each instance is independently addressable at a different base address on the AMBA bus.

Load balancing: The host driver or a hardware dispatcher distributes operations across available instances. Round-robin scheduling is simplest; priority-based scheduling enables latency-sensitive operations (handshake completion) to preempt bulk operations (certificate validation).
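
A driver-side round-robin dispatcher can be as small as the sketch below. The pool structure, the instance count, and the function names are hypothetical; the only property taken from this guide is that each instance sits at its own base address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_INSTANCES 4u  /* assumed instance count for this sketch */

typedef struct {
    uintptr_t base;  /* base address of this instance on the bus */
    bool busy;       /* true while an operation is in flight     */
} dyber_instance;

typedef struct {
    dyber_instance inst[NUM_INSTANCES];
    size_t next;     /* where the next round-robin search starts */
} dyber_pool;

/* Claim the next idle instance in round-robin order.
 * Returns its index, or -1 if every instance is busy. */
int pool_acquire(dyber_pool *p)
{
    assert(p != NULL);
    for (size_t k = 0; k < NUM_INSTANCES; k++) {
        size_t i = (p->next + k) % NUM_INSTANCES;
        if (!p->inst[i].busy) {
            p->inst[i].busy = true;
            p->next = (i + 1) % NUM_INSTANCES;
            return (int)i;
        }
    }
    return -1;
}

/* Mark an instance idle again once its operation has completed. */
void pool_release(dyber_pool *p, int i)
{
    assert(p != NULL && i >= 0 && (size_t)i < NUM_INSTANCES);
    p->inst[i].busy = false;
}
```

Priority-based scheduling could be layered on top of this, for example by keeping separate queues for latency-sensitive and bulk requests and draining the latency-sensitive queue first whenever `pool_acquire` returns an instance. In a multi-threaded driver the pool would also need a lock or atomic updates, which this sketch omits.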

Shared resources: Multiple algorithm accelerator instances can share a single DYBER-KMU for key management and a single DYBER-QRNG for entropy, connected via an internal AXI crossbar. This avoids duplicating support infrastructure when scaling compute cores.

Power & Clocking

Clock domain: Each IP core operates in a single clock domain. The core clock can run at a different frequency than the bus clock — an asynchronous FIFO bridge (provided as part of the interface wrapper) handles clock domain crossing. This enables the core to run at its optimal frequency independent of the system bus clock.

Clock gating: IP cores support fine-grained clock gating for power management. When idle (no operation in progress), the internal pipeline clock is gated, reducing the pipeline's dynamic power to near zero. The bus interface clock remains active so that registers stay accessible. Wake-from-idle latency is a single clock cycle.

Power domains: For ASIC integration, the IP can be placed in an independent power domain with its own voltage rail, enabling power-down of the PQC subsystem when not in use. The key buffer includes a power-down zeroization trigger to protect key material during power state transitions.

Platform Scenarios

| Platform Type | Integration Pattern | Key Considerations |
|---|---|---|
| Adaptive SoC (Zynq/Versal) | PL instantiation, PS-to-PL AXI bridge | Validated reference design available. Block design integration. |
| FPGA (Virtex/Kintex/Artix) | Soft-processor + IP core, or standalone accelerator | MicroBlaze or external host via PCIe/Ethernet. |
| Server CPU (ASIC) | On-die block alongside existing crypto co-processor | PSP firmware integration, MMIO or dedicated command interface. Joint engineering. |
| DPU / SmartNIC | Dedicated PQC block in packet processing pipeline | AXI4-Stream for line-rate operation. P4 pipeline integration. |
| Client CPU (ASIC) | Security co-processor enhancement | TPM integration, fTPM offload, minimal area configuration. |
| MCU / IoT | APB peripheral in constrained SoC | NTT-R2 + ML-KEM-512 minimal configuration. Ultra-low power. |

Integration Deliverables

All IP licenses include the following integration deliverables:

| Category | Contents |
|---|---|
| Design files | Synthesizable RTL (Verilog/SystemVerilog/VHDL), FPGA synthesis scripts, constraint templates, reference block designs |
| Verification | UVM testbench, NIST KAT vectors, functional coverage models, formal verification assertions |
| Software | Bare-metal C driver library, reference Linux kernel module (x86-64 + ARM64), API headers and documentation |
| Documentation | Datasheet with measured specifications, integration guide, register map, programming guide, security target |

Production licenses additionally include generic RTL for ASIC migration, ASIC synthesis guidance, and DFT integration hooks.