NTT Engine Family

IP-NTT-001 v2.0

The Number Theoretic Transform (NTT) is a dominant computational bottleneck in lattice-based post-quantum cryptography schemes such as ML-KEM and ML-DSA. Dyber's NTT engine family provides five radix configurations spanning roughly a 20× range in area and throughput (see Configuration Comparison), so that resources can be matched to targets ranging from constrained IoT endpoints to datacenter-class accelerators.

Overview

Every ML-KEM encapsulation/decapsulation and every ML-DSA sign/verify operation requires multiple NTT forward and inverse transforms. In a typical ML-KEM-768 encapsulation, NTT operations account for over 60% of total computation time. Hardware acceleration of NTT is therefore the single highest-leverage optimization for PQC performance.

Dyber's NTT engines are designed as standalone, reusable IP blocks that can be instantiated independently or composed into complete algorithm accelerators. Each engine supports both forward NTT (coefficient → NTT domain) and inverse NTT (NTT domain → coefficient) transforms with integrated twiddle factor ROM and modular reduction.

Property	NTT-R2	NTT-R4	NTT-R8	NTT-R16	NTT-R32
Radix	2	4	8	16	32
Butterflies/cycle	1	2	4	8	16
Area class	Ultra-compact	Compact	Standard	Large	Very large
Throughput class	Base	Mid-range	High	Very high	Maximum
Target market	IoT, wearable	Client, mobile	Server, network	Datacenter, DPU	HFT, hyperscale
Transform sizes	256, 512, 1024 points (configurable)
Moduli supported	q=3329 (ML-KEM), q=8380417 (ML-DSA), custom

Detailed resource utilization and exact performance figures are available under NDA as part of the evaluation package. Contact ip-sales@dyber.com for access to full datasheets with measured FPGA results.

Butterfly Architecture

All NTT engines use the Cooley-Tukey decimation-in-time algorithm for forward transforms and Gentleman-Sande decimation-in-frequency for inverse transforms. With this pairing, the bit-reversed output order of the forward transform is the input order the inverse transform expects, so no separate bit-reversal permutation pass is needed and the butterfly pipeline stays busy.

Each butterfly unit performs the core NTT operation:

// Cooley-Tukey butterfly (forward NTT)
a' = a + w·b  (mod q)
b' = a - w·b  (mod q)

// Gentleman-Sande butterfly (inverse NTT)
a' = a + b      (mod q)
b' = w·(a - b)  (mod q)

Here w is the twiddle factor (a power of a primitive root of unity modulo q). Each butterfly requires one modular multiplication, one modular addition, and one modular subtraction. The modular multiplication is the critical-path operation that determines maximum clock frequency.
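The two butterflies can be modelled in a few lines of software. The following is a minimal sketch, assuming plain Python integers, the ML-KEM modulus q = 3329, and the twiddle value 17 (a primitive 256th root of unity modulo 3329). The function names are illustrative and are not part of the Dyber IP. It shows that a Gentleman-Sande butterfly with the inverse twiddle undoes a Cooley-Tukey butterfly up to a factor of 2, which is why an inverse transform ends with a scaling step.

```python
Q = 3329  # ML-KEM modulus


def ct_butterfly(a, b, w, q=Q):
    """Cooley-Tukey butterfly: returns (a + w*b, a - w*b) mod q."""
    t = (w * b) % q
    return (a + t) % q, (a - t) % q


def gs_butterfly(a, b, w, q=Q):
    """Gentleman-Sande butterfly: returns (a + b, w*(a - b)) mod q."""
    return (a + b) % q, (w * (a - b)) % q


w = 17
w_inv = pow(w, -1, Q)   # modular inverse (Python 3.8+)
half = pow(2, -1, Q)    # inverse of 2 modulo q

x, y = ct_butterfly(1234, 567, w)
u, v = gs_butterfly(x, y, w_inv)   # yields (2*1234, 2*567) mod q
assert ((u * half) % Q, (v * half) % Q) == (1234, 567)
```

In a full inverse transform the per-butterfly factors of 2 accumulate across log2(n) stages and are removed by a single multiplication by the inverse of n.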

Higher-radix configurations process more butterflies per clock cycle by instantiating multiple parallel butterfly units with shared twiddle factor ROM and coordinated memory access scheduling. The Radix-4 engine processes two butterflies simultaneously (equivalent to two Radix-2 butterfly operations per cycle), Radix-8 processes four, Radix-16 processes eight, and Radix-32 processes sixteen.
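As a rough guide to how butterfly parallelism translates into cycle counts, the sketch below computes an ideal figure: log2(n) stages, each with n/2 butterflies shared across the parallel units. It is an assumption-laden estimate that ignores pipeline fill, memory scheduling, and clock-frequency differences, so it is not a substitute for the measured datasheet figures. The function and dictionary names are illustrative.

```python
def ideal_ntt_cycles(n, butterflies_per_cycle):
    """Ideal cycle count for an n-point NTT built from radix-2 butterflies.

    n must be a power of two. Pipeline overhead is deliberately excluded.
    """
    stages = n.bit_length() - 1                          # log2(n)
    per_stage = -(-(n // 2) // butterflies_per_cycle)    # ceiling division
    return stages * per_stage


# Butterflies per cycle, taken from the overview table.
BUTTERFLIES_PER_CYCLE = {
    "NTT-R2": 1,
    "NTT-R4": 2,
    "NTT-R8": 4,
    "NTT-R16": 8,
    "NTT-R32": 16,
}

for name, bpc in BUTTERFLIES_PER_CYCLE.items():
    print(name, ideal_ntt_cycles(256, bpc))
```

For a 256-point transform this gives 1024, 512, 256, 128, and 64 ideal cycles respectively.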

NTT-R2: Radix-2

The smallest NTT configuration, designed for area-constrained applications where PQC capability must fit within a tight silicon budget. A single butterfly unit processes one coefficient pair per clock cycle.

Architecture: Single Cooley-Tukey/Gentleman-Sande butterfly with one modular multiplier, one twiddle factor ROM read port, and ping-pong coefficient memory. A 256-point transform requires 8 stages × 128 butterfly operations = 1024 cycles, plus pipeline overhead.
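A software model of a single-butterfly engine makes the 8 × 128 count concrete. The sketch below is a plain cyclic 256-point NTT over q = 3329 with 17 as the primitive 256th root of unity. This is an assumption chosen for simplicity and is not identical to the 7-layer negacyclic transform specified in FIPS 203, nor is it a description of Dyber's RTL. The forward pass uses Cooley-Tukey butterflies (natural-order input, bit-reversed output) and the inverse uses Gentleman-Sande butterflies (bit-reversed input, natural-order output).

```python
def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def ntt_forward(a, omega, q):
    """Cooley-Tukey forward NTT. Returns (bit-reversed spectrum, butterfly count)."""
    a, n, ops = list(a), len(a), 0
    length, layer = n // 2, 0
    while length >= 1:
        for block, start in enumerate(range(0, n, 2 * length)):
            w = pow(omega, bit_reverse(block, layer) * length, q)
            for j in range(start, start + length):
                t = (w * a[j + length]) % q
                a[j + length] = (a[j] - t) % q
                a[j] = (a[j] + t) % q
                ops += 1                      # one butterfly per iteration
        length //= 2
        layer += 1
    return a, ops


def ntt_inverse(a, omega, q):
    """Gentleman-Sande inverse NTT of a bit-reversed spectrum."""
    a, n = list(a), len(a)
    omega_inv = pow(omega, -1, q)             # Python 3.8+
    length, layer = 1, n.bit_length() - 2
    while length < n:
        for block, start in enumerate(range(0, n, 2 * length)):
            w = pow(omega_inv, bit_reverse(block, layer) * length, q)
            for j in range(start, start + length):
                u, v = a[j], a[j + length]
                a[j] = (u + v) % q
                a[j + length] = (w * (u - v)) % q
        length *= 2
        layer -= 1
    n_inv = pow(n, -1, q)                     # final scaling by 1/n
    return [(x * n_inv) % q for x in a]


Q, OMEGA, N = 3329, 17, 256
poly = [(7 * i + 3) % Q for i in range(N)]
spectrum, ops = ntt_forward(poly, OMEGA, Q)
assert ops == 8 * 128                         # 1024 butterfly operations
assert ntt_inverse(spectrum, OMEGA, Q) == poly
```

At one butterfly per clock, the 1024 operations counted here correspond to the 1024 cycles quoted above, before pipeline overhead.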

Best for: IoT security elements, smart card controllers, low-power sensor nodes, and any application where area dominates over throughput. The NTT-R2 fits comfortably alongside a 32-bit RISC-V core in a compact secure microcontroller.

NTT-R4: Radix-4

The recommended configuration for most client and edge applications. Two parallel butterfly units deliver approximately 3.5× the throughput of NTT-R2 at roughly 2.5× the area, which makes it the efficiency sweet spot of the family.

Architecture: Two parallel butterfly units with independent modular multipliers sharing a dual-port twiddle factor ROM. Coefficient memory uses a 4-bank interleaved architecture that supports conflict-free access for both butterflies. 256-point transform completes in approximately half the cycles of NTT-R2.

Best for: Laptop and desktop security processors, mobile SoCs, edge AI accelerators with security requirements, and general-purpose PQC acceleration where area and power are constrained but single-connection latency matters.

Validation: This is the production-validated configuration with the most extensive FPGA characterization data, including full ARM host integration over AXI on Zynq UltraScale+ platforms.

NTT-R8: Radix-8

High-throughput configuration for server and networking applications where PQC operations per second must scale to support hundreds of thousands of concurrent connections.

Architecture: Four parallel butterfly units with dedicated modular multipliers, 8-bank interleaved coefficient memory, and a 4-port twiddle factor ROM. Advanced memory scheduling eliminates bank conflicts across all four butterfly units operating simultaneously.

Best for: Server CPUs, network processors, DPU/SmartNIC acceleration, enterprise HSMs, and PKI infrastructure. The NTT-R8 supports connection densities typical of TLS termination at cloud provider edge locations.

NTT-R16: Radix-16

Very-high-throughput configuration for applications where per-core PQC operation rate takes priority over area cost. Eight parallel butterfly units are paired with a crossbar-based memory architecture.

Architecture: Eight parallel butterfly units with 16-bank coefficient memory and fully pipelined modular multiplier array. The memory subsystem uses a multi-level crossbar interconnect to maintain conflict-free access at full bandwidth. Twiddle factors are distributed across multiple ROM banks with speculative prefetch.

Best for: Datacenter accelerator cards, high-frequency trading infrastructure, large-scale PKI migration workloads, and any application where the NTT engine is the dedicated accelerator purpose (not sharing die area with general-purpose compute).

NTT-R32: Radix-32

The highest-throughput NTT configuration in the Dyber portfolio, designed for applications where absolute maximum cryptographic operations per second is the primary design objective. Sixteen parallel butterfly units deliver approximately twice the throughput of NTT-R16 for workloads that demand the ultimate in PQC acceleration performance.

Architecture: Sixteen parallel butterfly units with 32-bank coefficient memory and a two-level crossbar interconnect. The modular multiplier array is fully pipelined with dedicated per-butterfly reduction units, eliminating any sharing bottleneck. Twiddle factors are distributed across 16 ROM banks with two-cycle speculative prefetch, ensuring the butterfly array is never stalled waiting for twiddle data. For a 256-point transform, the engine processes 16 coefficient pairs per cycle, completing a full NTT stage (128 butterflies) in just 8 cycles.

Memory subsystem: The 32-bank memory architecture uses a two-level crossbar: a first-level 16×32 switch connects butterfly inputs to coefficient banks, and a second-level 32×16 switch routes results back. A hardware scheduling unit generates conflict-free access patterns for all 16 butterflies across all NTT stages without software intervention. The memory controller supports overlapped operation (loading the next polynomial while the current transform is completing) to minimize inter-transform idle time.

Best for: Hyperscale datacenter accelerator cards, high-frequency trading infrastructure requiring sub-microsecond PQC latency, large-scale certificate authority operations processing millions of signatures per second, and national-scale PKI migration workloads. The NTT-R32 is the dedicated engine for Dyber's QUAC 100 accelerator card, where it drives the industry-leading 1.4M+ ML-KEM operations per second throughput.

NTT-R32 is designed for dedicated acceleration platforms where the NTT engine is the primary silicon consumer. Due to the large memory crossbar and 16-wide butterfly array, NTT-R32 requires substantial FPGA fabric or ASIC area. For designs where PQC shares die area with other functions, NTT-R8 or NTT-R16 typically offers better overall system efficiency.

Configurable Moduli

All NTT engines support runtime-selectable moduli through parameterized modular reduction units:

Modulus	Value	Algorithm	Reduction Method
q_Kyber	3329	ML-KEM (FIPS 203)	Barrett reduction, 12-bit operands
q_Dilithium	8380417	ML-DSA (FIPS 204)	Barrett reduction, 23-bit operands
Custom	User-defined	Research / future standards	Generic Montgomery, configurable width

The modulus is selected via a configuration register at initialization time. Switching between ML-KEM and ML-DSA modes requires a register write and pipeline flush (typically < 10 clock cycles). This enables a single NTT engine to service both key exchange and digital signature workloads in multi-algorithm deployments.
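For readers unfamiliar with Barrett reduction, the following minimal sketch shows the textbook form: one precomputed constant, one multiplication, one shift, and one conditional subtraction. The shift width k = 2 × bitlength(q) and the helper names are assumptions made for illustration; the constants and operand widths used inside the Dyber reduction units are not stated here.

```python
def barrett_params(q):
    """Return (k, m) with m = floor(2**k / q) and k = 2 * bitlength(q)."""
    k = 2 * q.bit_length()
    return k, (1 << k) // q


def barrett_reduce(x, q, k, m):
    """Reduce 0 <= x < 2**k modulo q without a division.

    The quotient estimate t is at most one below floor(x / q),
    so a single conditional subtraction completes the reduction.
    """
    t = (x * m) >> k
    r = x - t * q
    if r >= q:
        r -= q
    return r


# Products of two reduced operands fit below 2**k for both standard moduli.
for q in (3329, 8380417):
    k, m = barrett_params(q)
    a, b = q - 1, q - 2
    assert barrett_reduce(a * b, q, k, m) == (a * b) % q
```

Because only the constants k and m depend on q, switching modulus in such a scheme amounts to loading a different pair of parameters, which is consistent with the register-write model described above.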

Pipeline Architecture

Each NTT engine uses a multi-stage pipeline designed for sustained throughput:

Stage 1: Address Generation  - Compute butterfly pair addresses + twiddle factor index
Stage 2: Memory Read         - Fetch coefficient pair (a, b) and twiddle factor (w)
Stage 3: Multiply            - Compute w·b using modular multiplier
Stage 4: Add/Subtract        - Compute a±(w·b) mod q (butterfly output pair)
Stage 5: Memory Write        - Write results back to coefficient memory

The pipeline operates continuously with no stalls for single-butterfly configurations (NTT-R2). Multi-butterfly configurations use carefully scheduled memory access patterns that guarantee conflict-free operation across all parallel units. Pipeline occupancy exceeds 95% for all transform sizes.
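The work done by Stage 1 can be sketched in software as a generator that emits, for every butterfly, the two coefficient addresses and a twiddle exponent. This is a hypothetical model for a single-butterfly forward transform with natural-order input. The ordering, the twiddle indexing scheme, and the names are assumptions, and the hardware address generator may sequence accesses differently.

```python
def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def butterfly_schedule(n):
    """Yield (ntt_layer, addr_a, addr_b, twiddle_exp) in processing order.

    twiddle_exp is the exponent of the primitive n-th root of unity
    used by that butterfly (assumed indexing, for illustration only).
    """
    length, layer = n // 2, 0
    while length >= 1:
        for block, start in enumerate(range(0, n, 2 * length)):
            exp = bit_reverse(block, layer) * length
            for j in range(start, start + length):
                yield layer, j, j + length, exp
        length //= 2
        layer += 1


schedule = list(butterfly_schedule(256))
assert len(schedule) == 8 * 128      # one entry per butterfly
print(schedule[0])                   # first butterfly of the first layer
```

Each NTT layer reads and writes every coefficient address exactly once, which is what allows a new butterfly to enter the pipeline on every cycle within a layer.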

Memory Organization

Coefficient storage uses dual-port block RAM organized in a banked architecture that scales with radix. The twiddle factor ROM stores precomputed powers of the primitive root of unity for each supported modulus and transform size.

Configuration	Coefficient Banks	Twiddle ROM Ports	Memory Strategy
NTT-R2	2 (ping-pong)	1	Simple alternating read/write
NTT-R4	4 (interleaved)	2	Stride-based conflict-free scheduling
NTT-R8	8 (interleaved)	4	Multi-stride with bank rotation
NTT-R16	16 (crossbar)	8	Crossbar interconnect with prefetch
NTT-R32	32 (two-level crossbar)	16	Two-level crossbar with speculative prefetch
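The table does not spell out the bank-mapping functions, so the following is only an illustration of why a conflict-free mapping is possible, not a description of any Dyber engine. It uses one textbook technique for the two-operand case: assigning each address to a bank by the parity of its bits. Because the two operands of a radix-2 butterfly differ in exactly one address bit, they always land in different banks, whereas a naive low-bit interleave collides in most stages.

```python
def parity_bank(addr):
    """Two-bank mapping by the XOR of all address bits."""
    return bin(addr).count("1") & 1


def low_bit_bank(addr):
    """Naive two-bank interleave on the least significant address bit."""
    return addr & 1


def radix2_pairs(n):
    """Yield the (addr_a, addr_b) operand pairs of every butterfly."""
    length = n // 2
    while length >= 1:
        for start in range(0, n, 2 * length):
            for j in range(start, start + length):
                yield j, j + length
        length //= 2


pairs = list(radix2_pairs(256))
parity_conflicts = sum(parity_bank(a) == parity_bank(b) for a, b in pairs)
naive_conflicts = sum(low_bit_bank(a) == low_bit_bank(b) for a, b in pairs)
assert parity_conflicts == 0
assert naive_conflicts == 7 * 128   # every stage except the stride-1 stage
```

Wider engines need mappings that keep several butterflies apart at once, which is where the stride-based, rotating, and crossbar strategies listed in the table come in.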

Integration Options

NTT engines can be instantiated in three modes:

Standalone Accelerator: The NTT engine is exposed as a memory-mapped peripheral via AMBA AXI4-Lite. The host CPU writes coefficient data, triggers the transform, and reads results. Best for software-driven PQC implementations that accelerate only the NTT bottleneck.

Streaming Co-processor: The NTT engine connects via AXI4-Stream interfaces, accepting coefficient data as a continuous stream and producing transformed output. Best for pipeline architectures where NTT is one stage in a larger datapath.

Submodule: The NTT engine is instantiated internally within a DYBER-MLKEM or DYBER-MLDSA algorithm accelerator. Memory interfaces are connected to the accelerator's internal data fabric. This is the default mode in algorithm accelerator IP and requires no separate integration effort.

Configuration Comparison

Metric	NTT-R2	NTT-R4	NTT-R8	NTT-R16	NTT-R32
Relative area	1×	~2.5×	~5.5×	~11×	~23×
Relative throughput	1×	~3.5×	~6.5×	~12×	~22×
Efficiency (throughput/area)	1.0	1.4	1.2	1.1	~1.0
Max instances per design	Many	Multiple	Several	1–2	1
DSP utilization	Minimal	Moderate	High	Very high	Extreme
BRAM utilization	Low	Moderate	Moderate-high	High	Very high
Best for	Area floor	Efficiency sweet spot	High throughput	Very high throughput	Max throughput

NTT-R4 is the recommended default for most applications. It offers the best throughput-per-gate efficiency and has the most extensive validation history. Start with NTT-R4 unless your application has specific area constraints (→ R2) or throughput requirements (→ R8/R16/R32) that justify a different configuration.
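The efficiency row in the comparison table is the ratio of the relative throughput and relative area rows. The short sketch below reproduces it from the approximate values in the table; the variable names are illustrative.

```python
# (relative area, relative throughput), approximate values from the table
relative = {
    "NTT-R2": (1.0, 1.0),
    "NTT-R4": (2.5, 3.5),
    "NTT-R8": (5.5, 6.5),
    "NTT-R16": (11.0, 12.0),
    "NTT-R32": (23.0, 22.0),
}

efficiency = {
    name: round(throughput / area, 1)
    for name, (area, throughput) in relative.items()
}

best = max(efficiency, key=efficiency.get)
print(efficiency)
print("Most efficient:", best)
```

With these inputs NTT-R4 comes out highest at 1.4, matching the recommendation above.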
