Architecture Deep Dive

QCORE-ARCH-001 Rev 0.9, January 2026

Detailed internal architecture of the QCORE-C1 post-quantum cryptographic accelerator chiplet. This document covers the 8-way Radix-16 NTT butterfly array, Keccak-f[1600] hashing core, polynomial arithmetic unit, Centered Binomial Distribution (CBD) sampler, Kyber control finite state machine, secure SRAM subsystem, and overall design hierarchy.

Design Hierarchy

The QCORE-C1 top-level module (qcore_c1_top) integrates five major subsystems connected through an internal crossbar with round-robin arbitration to the shared 64KB secure SRAM:

Table 1: Top-Level Module Hierarchy
Module	Description	Area (SKY130)	Area %
`qli_interface`	QLI Protocol + PHY (TX/RX datapaths, link training, credit management)	0.2 mm²	5%
`kyber_engine`	ML-KEM accelerator (NTT array, poly arithmetic, CBD, Kyber FSM)	1.1 mm²	27.5%
`keccak_core`	Keccak-f[1600] hashing (6-stage unrolled pipeline, SHA-3/SHAKE)	0.2 mm²	5%
`secure_sram`	64KB SRAM with ECC, tamper detection, dual-port arbiter	0.5 mm²	12.5%
`dft_subsystem`	JTAG TAP, scan chains, memory BIST controller	0.1 mm²	2.5%
`clock_reset`	PLL, clock distribution, power-on reset, clock gating cells	0.1 mm²	2.5%
I/O ring	128 TX + 128 RX + 30 control pads	1.2 mm²	30%
Routing & misc	Interconnect, fill, guard rings	0.6 mm²	15%

8-Way Radix-16 NTT Array

The Number Theoretic Transform (NTT) is the computational bottleneck in ML-KEM. The QCORE-C1 implements eight parallel Radix-16 NTT butterfly engines that process 256-coefficient polynomials over the ring ℤ_q[x]/(x²⁵⁶ + 1), where q = 3329.
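As a point of reference, the transform that the array accelerates can be modeled in software. The sketch below follows the layered radix-2 structure of the NTT and inverse NTT as given in FIPS 203 (ζ = 17, seven layers). It is a behavioral model for checking results, not a description of the Radix-16 datapath, and the function names are illustrative.

```python
Q = 3329      # ML-KEM modulus
ZETA = 17     # primitive 256th root of unity mod Q (FIPS 203)
N_INV = 3303  # 128^-1 mod Q, final scaling of the inverse transform


def bitrev7(i: int) -> int:
    """Reverse the 7-bit binary representation of i."""
    return int(f"{i:07b}"[::-1], 2)


# Twiddle factors in the bit-reversed order the layered NTT consumes them.
ZETAS = [pow(ZETA, bitrev7(i), Q) for i in range(128)]


def ntt(f):
    """Forward NTT of a 256-coefficient polynomial (coefficients in [0, Q))."""
    f = list(f)
    i, length = 1, 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[i]
            i += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


def intt(f_hat):
    """Inverse NTT; undoes ntt() including the 1/128 scaling."""
    f = list(f_hat)
    i, length = 127, 2
    while length <= 128:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[i]
            i -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = zeta * (f[j + length] - t) % Q
        length *= 2
    return [x * N_INV % Q for x in f]
```

A round trip such as `intt(ntt(f)) == f` is a convenient sanity check when comparing hardware output against a software model.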

NTT Engine Architecture

Each NTT engine implements a Radix-16 butterfly unit that performs 16-point NTT operations in a single clock cycle using a network of modular multipliers and adders. Each Radix-16 pass collapses four Radix-2 (Cooley-Tukey) stages, so the decomposition reduces the number of pipeline stages by a factor of 4 and completes a full 256-point NTT in 16 cycles instead of 64.

Table 2: NTT Engine Parameters
Parameter	Value
Butterfly radix	Radix-16 (4 stages collapsed)
Parallel engines	8
Coefficient width	12 bits (⌈log₂(3329)⌉)
Pipeline depth	4 stages per butterfly
Cycles per 256-point NTT	16 cycles
Twiddle factor storage	Compressed ROM (symmetry-exploited), 2.4 KB
Modular reduction	Barrett reduction (constant-time)
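The Barrett reduction listed in Table 2 can be sketched as follows. This is a minimal model assuming a 24-bit shift, which is enough to cover the product of two 12-bit coefficients. The constant names and the shift width are assumptions for illustration, not values taken from the QCORE-C1 RTL.

```python
Q = 3329
K = 24              # assumed shift; (Q - 1)**2 < 2**24, so any product fits
M = (1 << K) // Q   # precomputed floor(2**K / Q)


def barrett_reduce(x: int) -> int:
    """Return x mod Q for 0 <= x < 2**K using a multiply, a shift, and one fix-up."""
    t = (x * M) >> K        # quotient estimate, at most 1 below x // Q
    r = x - t * Q           # so r lies in [0, 2Q)
    # Single conditional subtraction; in hardware this is a subtractor and a mux,
    # so the operation count is the same for every input.
    return r - Q * (r >= Q)


def mod_mul(a: int, b: int) -> int:
    """Modular multiply of two coefficients in [0, Q)."""
    return barrett_reduce(a * b)
```

Because the quotient estimate is never off by more than one, a single fix-up step is sufficient, and the sequence of operations does not depend on the data.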

Twiddle Factor ROM

The twiddle factors (powers of the primitive 256th root of unity modulo 3329, ζ = 17 in FIPS 203) are stored in a compressed ROM that exploits conjugate symmetry and periodicity properties. This reduces storage from 4 KB (raw) to 2.4 KB, a 40% reduction achieved without runtime decompression overhead. The ROM is synthesized as combinational logic for the SKY130 prototype (no dedicated SRAM macro required).
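One symmetry that such a ROM can exploit is negation. Since 3329 − 1 = 2⁸ · 13, the largest power-of-two root-of-unity order modulo 3329 is 256, and ζ¹²⁸ = −1, so ζ^(e+128) = −ζ^e. The sketch below rebuilds all 256 powers from a 128-entry half table. Whether the QCORE-C1 ROM uses this exact scheme is not stated here; the table layout and names are assumptions.

```python
Q = 3329
ZETA = 17  # primitive 256th root of unity mod Q (FIPS 203)

# Only the first 128 powers are stored; the rest follow by negation.
HALF_TABLE = [pow(ZETA, e, Q) for e in range(128)]


def twiddle(exp: int) -> int:
    """Return ZETA**exp mod Q for any exponent, using the half table."""
    exp %= 256                       # periodicity: ZETA**256 == 1
    value = HALF_TABLE[exp % 128]
    if exp >= 128:                   # symmetry: ZETA**(e + 128) == -ZETA**e
        value = Q - value
    return value
```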

Zero-Bubble Scheduling

The 8-way NTT array uses a pre-computed schedule that interleaves forward and inverse NTT operations across engines to achieve 100% pipeline utilization. When the Kyber FSM requires back-to-back NTT operations (e.g., polynomial multiplication via NTT → pointwise multiply → inverse NTT), the scheduler assigns alternating engines to eliminate pipeline drain cycles.

Keccak-f[1600] Core

The Keccak core implements the Keccak-f[1600] permutation used by SHA-3, SHAKE-128, and SHAKE-256, all of which ML-KEM requires for hash functions, XOF (extendable output function), and PRF (pseudorandom function) operations.

Table 3: Keccak Core Specifications
Parameter	Value
State width	1600 bits (5 × 5 × 64)
Pipeline architecture	6-stage unrolled (4 rounds per stage)
Rounds per permutation	24 (Keccak-f[1600])
Permutation latency	6 clock cycles
Permutation throughput	1 per 6 cycles (pipelined: 1 per cycle after fill)
Supported modes	SHA3-256, SHA3-512, SHAKE-128, SHAKE-256
Padding	Automatic multi-rate padding (pad10*1) with SHA-3/SHAKE domain-separation suffix

The 6-stage unrolled design processes 4 Keccak rounds per pipeline stage, reducing permutation latency from 24 cycles (fully iterative) to 6 cycles. The pipeline accepts a new 1600-bit state every cycle once filled, achieving sustained throughput of one permutation per clock for streaming SHAKE operations used in matrix expansion (A-hat generation) during ML-KEM KeyGen.
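The matrix expansion that drives this streaming workload can be modeled with Python's `hashlib`. The sketch below follows the rejection-sampling procedure for one NTT-domain polynomial as described in FIPS 203: squeeze SHAKE-128, split every 3 bytes into two 12-bit candidates, and keep those below q. The function names, the buffer-growth strategy, and the index byte order (column index before row index) reflect one reading of the standard and should be treated as assumptions.

```python
import hashlib

Q = 3329
SHAKE128_RATE = 168  # bytes squeezed per Keccak-f[1600] permutation


def sample_ntt(rho: bytes, i: int, j: int):
    """Sample matrix entry A_hat[i][j] as 256 coefficients in [0, Q)."""
    seed = rho + bytes([j, i])
    nbytes = 3 * SHAKE128_RATE
    while True:
        # hashlib has no incremental squeeze, so re-request a longer prefix if needed.
        stream = hashlib.shake_128(seed).digest(nbytes)
        coeffs = []
        for k in range(0, nbytes, 3):
            b0, b1, b2 = stream[k], stream[k + 1], stream[k + 2]
            d1 = b0 | ((b1 & 0x0F) << 8)   # low 12 bits
            d2 = (b1 >> 4) | (b2 << 4)     # high 12 bits
            for d in (d1, d2):
                if d < Q and len(coeffs) < 256:
                    coeffs.append(d)
        if len(coeffs) == 256:
            return coeffs
        nbytes += SHAKE128_RATE  # 168 is a multiple of 3, so parsing stays aligned


def expand_matrix(rho: bytes, k: int = 3):
    """Build the k x k matrix A_hat (k = 3 for ML-KEM-768, i.e. 9 XOF streams)."""
    return [[sample_ntt(rho, i, j) for j in range(k)] for i in range(k)]
```

Each 168-byte block in this model corresponds to one Keccak-f[1600] permutation, which is why sustained permutation throughput matters for this step.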

Polynomial Arithmetic Unit

The polynomial arithmetic unit performs coefficient-level operations on 256-element vectors in the NTT domain and standard domain:

Table 4: Polynomial Arithmetic Operations
Operation	Module	Latency	Description
Pointwise multiply	`poly_mul_coeff`	1 cycle	256 parallel modular multiplications (NTT domain)
Polynomial add	`poly_add`	1 cycle	256-wide coefficient addition mod q
Polynomial subtract	`poly_sub`	1 cycle	256-wide coefficient subtraction mod q
Compress	`poly_compress`	1 cycle	Coefficient compression for ciphertext encoding
Decompress	`poly_decompress`	1 cycle	Coefficient decompression from ciphertext
Encode/Decode	`poly_codec`	2 cycles	Byte-to-coefficient and coefficient-to-byte packing
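For the compress and decompress rows, a per-coefficient software model is short enough to show in full. The sketch uses the rounding definitions from FIPS 203 (round half up) with integer arithmetic only. The function names are illustrative and do not come from the `poly_compress` / `poly_decompress` RTL.

```python
Q = 3329


def compress(x: int, d: int) -> int:
    """Map x in [0, Q) to d bits: round(2**d / Q * x) mod 2**d."""
    return (((x << d) + Q // 2) // Q) % (1 << d)


def decompress(y: int, d: int) -> int:
    """Map a d-bit value back to [0, Q): round(Q / 2**d * y)."""
    return (y * Q + (1 << (d - 1))) >> d


def poly_compress(coeffs, d: int):
    """Apply compress to every coefficient of a polynomial."""
    return [compress(c, d) for c in coeffs]
```

Compression is lossy in general, but decompressing and then compressing returns the original d-bit value. ML-KEM-768, for example, uses d = 10 and d = 4 for the two ciphertext components.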

CBD Sampler

The Centered Binomial Distribution sampler generates noise polynomials required by ML-KEM for key generation and encryption. It consumes random bytes from the Keccak core (via SHAKE-256 PRF) and produces polynomial coefficients distributed according to CBD(η).

Table 5: CBD Sampler Parameters
Parameter Set	η	Input Bytes per Polynomial	Coefficient Range
ML-KEM-512 noise	3	192 bytes (3 × 256 / 4)	[−3, +3]
ML-KEM-768 noise	2	128 bytes (2 × 256 / 4)	[−2, +2]
ML-KEM-1024 noise	2	128 bytes (2 × 256 / 4)	[−2, +2]

The sampler processes 4 coefficients per clock cycle using a parallel bit-counting network. For η=2, each coefficient requires 4 random bits (two pairs summed and subtracted). For η=3, each coefficient requires 6 random bits. The sampler includes a constant-time implementation that prevents timing side-channel leakage of the noise values.
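The bit-counting step can be sketched in software as follows. The model takes 64·η bytes, unpacks them into bits (least significant bit first, as in FIPS 203), and forms each coefficient as the difference of two η-bit popcounts. It returns signed values to match the ranges in Table 5; the function name is illustrative.

```python
import hashlib


def sample_poly_cbd(data: bytes, eta: int):
    """Return 256 coefficients in [-eta, +eta] from 64 * eta input bytes."""
    if len(data) != 64 * eta:
        raise ValueError("expected 64 * eta input bytes")
    # Little-endian bit order within each byte.
    bits = [(byte >> k) & 1 for byte in data for k in range(8)]
    coeffs = []
    for i in range(256):
        base = 2 * i * eta
        x = sum(bits[base:base + eta])             # first eta bits
        y = sum(bits[base + eta:base + 2 * eta])   # next eta bits
        coeffs.append(x - y)
    return coeffs


# Example: derive a noise polynomial for eta = 2 from a SHAKE-256 stream.
# The 33-byte all-zero input is a placeholder seed, not a real key.
noise = sample_poly_cbd(hashlib.shake_256(bytes(33)).digest(128), eta=2)
```

A hardware implementation would typically reduce the signed result modulo q before storing it; the signed form is kept here for readability.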

Kyber Control FSM

The Kyber control finite state machine orchestrates the complete ML-KEM protocol by sequencing operations across the NTT array, Keccak core, polynomial arithmetic unit, and CBD sampler. Three top-level operations are supported:

Table 6: Kyber FSM Operation Breakdown (ML-KEM-768)
Operation	Sub-Steps	NTT Calls	Keccak Calls	Total Cycles
KeyGen	A-hat generation (9 XOF), s/e sampling (6 CBD), NTT(s), NTT(e), t=As+e	12	~72	~42
Encaps	Hash(pk), r sampling (6 CBD), NTT(r), u=A^Tr+e₁, v=t^Tr+e₂+m	15	~78	~51
Decaps	Decrypt, re-encrypt, compare (constant-time)	18	~84	~61

The FSM implements implicit rejection for decapsulation failures (as required by FIPS 203) using constant-time conditional selection: if the re-encrypted ciphertext does not match the received ciphertext, the FSM outputs a pseudorandom shared secret derived from the secret key and ciphertext, preventing chosen-ciphertext attacks.
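The selection logic can be sketched as a mask-based multiplexer. The model below compares the two ciphertexts without an early exit, then picks either the candidate key or a rejection key computed as SHAKE-256 over the secret value z and the ciphertext, as FIPS 203 describes. Python itself gives no constant-time guarantees, so this shows only the data flow. All function names are hypothetical.

```python
import hashlib


def ct_mismatch(a: bytes, b: bytes) -> int:
    """Return 1 if equal-length a and b differ, else 0, scanning every byte."""
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return (diff + 0xFF) >> 8  # 0 stays 0; any value in 1..255 becomes 1


def ct_select(a: bytes, b: bytes, choose_b: int) -> bytes:
    """Return b if choose_b == 1, else a, using a byte mask instead of a branch."""
    mask = -choose_b & 0xFF
    return bytes(x ^ (mask & (x ^ y)) for x, y in zip(a, b))


def finalize_decaps(c: bytes, c_reenc: bytes, k_candidate: bytes, z: bytes) -> bytes:
    """Pick the shared secret: k_candidate on match, rejection key otherwise."""
    k_reject = hashlib.shake_256(z + c).digest(32)  # always computed
    return ct_select(k_candidate, k_reject, ct_mismatch(c, c_reenc))
```

Computing the rejection key unconditionally keeps the work identical on both paths, which is the property the hardware FSM needs to preserve.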

Secure SRAM Subsystem

Table 7: SRAM Specifications
Parameter	Value
Total capacity	64 KB (65,536 bytes)
Organization	4 × 16KB banks
Data width	256 bits per bank (with 8-bit ECC per 64-bit word)
Access latency	1 cycle (single-port), 2 cycles (dual-port arbitrated)
ECC	SECDED (single-error correct, double-error detect)
Tamper detection	Address/data parity checking, voltage glitch detector
Zeroization	Hardware-triggered full-bank zero in 64 cycles
Implementation (SKY130)	Synthesized flip-flop arrays (no SRAM macros available)
Implementation (GF22FDX)	Embedded SRAM macros (4× density improvement)

The SRAM is partitioned into four banks to allow concurrent access from the NTT array (polynomial storage) and Keccak core (hash state buffering). A dual-port arbiter with round-robin priority prevents starvation while maintaining deterministic access latency for side-channel resistance.
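The round-robin policy can be illustrated with a small behavioral model. The sketch grants one requester per cycle and starts the next search just after the most recent winner, so a continuously requesting client cannot starve the other. The class and method names are hypothetical and the model ignores bank selection and ECC timing.

```python
class RoundRobinArbiter:
    """Grant one of n requesters per cycle, rotating priority after each grant."""

    def __init__(self, n: int):
        self.n = n
        self.last = n - 1  # so requester 0 has priority on the first cycle

    def grant(self, requests):
        """requests[i] is True if client i wants access; return the winner or None."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None


# Example: port 0 = NTT array, port 1 = Keccak core, both requesting every cycle.
arbiter = RoundRobinArbiter(2)
grants = [arbiter.grant([True, True]) for _ in range(4)]
```

With two clients that both request on every cycle, the grants alternate, which bounds the wait for either client to one cycle.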

Area Breakdown

Table 8: Die Area Comparison
Block	SKY130 (4.0 mm²)	GF22FDX (1.44 mm² projected)
NTT Array	0.80 mm² (20%)	0.18 mm² (12.5%)
Kyber Control + Poly Arith	0.30 mm² (7.5%)	0.08 mm² (5.6%)
Keccak/SHA-3	0.20 mm² (5%)	0.06 mm² (4.2%)
Secure SRAM	0.50 mm² (12.5%)	0.12 mm² (8.3%)
QLI Protocol + PHY	0.20 mm² (5%)	0.10 mm² (6.9%)
Clock/PLL	0.10 mm² (2.5%)	0.06 mm² (4.2%)
I/O Ring	1.20 mm² (30%)	0.60 mm² (41.7%)
Routing/Misc	0.70 mm² (17.5%)	0.24 mm² (16.7%)

SKY130 → GF22FDX Migration

Table 9: Process Comparison
Parameter	SkyWater SKY130	GF 22FDX (Target)
Node	130nm CMOS	22nm FD-SOI
Purpose	Prototype / validation	Production
Die size	2.0 × 2.0 mm	1.2 × 1.2 mm (projected)
Max clock	100–120 MHz	500–800 MHz
Typical power	350 mW	200 mW
US fabrication	✓ (SkyWater, Minnesota)	✓ (GlobalFoundries, Vermont/New York)
CHIPS Act eligible	-	✓
EDA flow	OpenLane 2 (open-source)	Cadence/Synopsys (commercial)
SRAM macros	None (synthesized FFs)	Available (4× density)
Shuttle cost	~$15K (Efabless chipIgnite)	~$500K (engineering lot)

See also: Datasheet for electrical specifications, Developer Guide for register-level programming, and the QLI Interface Reference for interconnect details.
