Architecture Deep Dive

QCORE-ARCH-001 Rev 0.9 — January 2026

Detailed internal architecture of the QCORE-C1 post-quantum cryptographic accelerator chiplet. This document covers the 8-way Radix-16 NTT butterfly array, Keccak-f[1600] hashing core, polynomial arithmetic unit, Centered Binomial Distribution (CBD) sampler, Kyber control finite state machine, secure SRAM subsystem, and overall design hierarchy.

Design Hierarchy

The QCORE-C1 top-level module (qcore_c1_top) integrates five major subsystems connected through an internal crossbar with round-robin arbitration to the shared 64KB secure SRAM:

Table 1 — Top-Level Module Hierarchy
| Module | Description | Area (SKY130) | Area % |
|---|---|---|---|
| qli_interface | QLI Protocol + PHY (TX/RX datapaths, link training, credit management) | 0.2 mm² | 5% |
| kyber_engine | ML-KEM accelerator (NTT array, poly arithmetic, CBD, Kyber FSM) | 1.1 mm² | 27.5% |
| keccak_core | Keccak-f[1600] hashing (6-stage unrolled pipeline, SHA-3/SHAKE) | 0.2 mm² | 5% |
| secure_sram | 64KB SRAM with ECC, tamper detection, dual-port arbiter | 0.5 mm² | 12.5% |
| dft_subsystem | JTAG TAP, scan chains, memory BIST controller | 0.1 mm² | 2.5% |
| clock_reset | PLL, clock distribution, power-on reset, clock gating cells | 0.1 mm² | 2.5% |
| I/O ring | 128 TX + 128 RX + 30 control pads | 1.2 mm² | 30% |
| Routing & misc | Interconnect, fill, guard rings | 0.6 mm² | 15% |

8-Way Radix-16 NTT Array

The Number Theoretic Transform (NTT) is the computational bottleneck in ML-KEM. The QCORE-C1 implements eight parallel Radix-16 NTT butterfly engines that process 256-coefficient polynomials over the ring ℤq[x]/(x²⁵⁶+1) where q = 3329.
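
A software reference model is useful for checking the outputs of any hardware NTT. The sketch below follows the 7-layer forward and inverse transforms defined in FIPS 203, written in plain radix-2 form. It is not a model of the Radix-16 datapath, and the function names (`ntt`, `intt`, `bitrev7`) are illustrative rather than taken from the QCORE-C1 sources.

```python
Q = 3329      # ML-KEM modulus
ZETA = 17     # primitive 256th root of unity mod Q (FIPS 203)
N = 256       # coefficients per polynomial


def bitrev7(i: int) -> int:
    """Reverse the low 7 bits of i."""
    return int(f"{i:07b}"[::-1], 2)


# 128 twiddles, stored in the bit-reversed order the loops below consume.
ZETAS = [pow(ZETA, bitrev7(i), Q) for i in range(128)]


def ntt(f):
    """Forward NTT: seven radix-2 layers with len = 128, 64, ..., 2."""
    a = list(f)
    i = 1
    length = 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[i]
            i += 1
            for j in range(start, start + length):
                t = zeta * a[j + length] % Q
                a[j + length] = (a[j] - t) % Q
                a[j] = (a[j] + t) % Q
        length //= 2
    return a


def intt(f_hat):
    """Inverse NTT: the same layers in reverse, then scaling by 128^-1 mod Q."""
    a = list(f_hat)
    i = 127
    length = 2
    while length <= 128:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[i]
            i -= 1
            for j in range(start, start + length):
                t = a[j]
                a[j] = (t + a[j + length]) % Q
                a[j + length] = zeta * (a[j + length] - t) % Q
        length *= 2
    inv128 = pow(128, -1, Q)  # modular inverse of 128
    return [x * inv128 % Q for x in a]
```

A round trip such as `intt(ntt(f)) == f` on arbitrary coefficient vectors is a quick sanity check before comparing against RTL simulation output.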

NTT Engine Architecture

Each NTT engine implements a Radix-16 butterfly unit built from a network of modular multipliers and adders. The unit is pipelined four stages deep (Table 2) and accepts one 16-point butterfly per clock cycle. Because each Radix-16 butterfly collapses four Radix-2 stages into one, the number of passes over the data drops by a factor of 4 relative to a Radix-2 Cooley-Tukey implementation, and a full 256-point NTT completes in 16 cycles instead of 64.

Table 2 — NTT Engine Parameters
| Parameter | Value |
|---|---|
| Butterfly radix | Radix-16 (4 stages collapsed) |
| Parallel engines | 8 |
| Coefficient width | 12 bits (⌈log₂(3329)⌉) |
| Pipeline depth | 4 stages per butterfly |
| Cycles per 256-point NTT | 16 cycles |
| Twiddle factor storage | Compressed ROM (symmetry-exploited), 2.4 KB |
| Modular reduction | Barrett reduction (constant-time) |
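
To illustrate the Barrett reduction named in Table 2, here is a minimal sketch for q = 3329. The shift width `K = 24` and the mask-based correction are assumptions chosen for the example, not the documented hardware parameters. Python itself gives no timing guarantees; the masking only mirrors the branch-free structure that a constant-time datapath would use.

```python
Q = 3329
K = 24                 # assumed shift; 2**24 exceeds Q*Q, so any coefficient product fits
M = (1 << K) // Q      # precomputed reciprocal estimate, floor(2**K / Q)


def barrett_reduce(x: int) -> int:
    """Reduce 0 <= x < 2**K modulo Q with one multiply, one shift, one masked subtract."""
    t = (x * M) >> K       # quotient estimate; at most 1 below the true quotient
    r = x - t * Q          # therefore 0 <= r < 2*Q
    mask = -int(r >= Q)    # -1 (all ones) when a correction is needed, else 0
    return r - (Q & mask)


def mulmod(a: int, b: int) -> int:
    """Modular product of two coefficients in [0, Q)."""
    return barrett_reduce(a * b)
```

The single conditional correction is enough because the quotient estimate never overshoots and undershoots by at most one for inputs below 2**K.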

Twiddle Factor ROM

The twiddle factors are powers of ζ = 17, the primitive 256th root of unity modulo 3329 specified by FIPS 203. (Since 3329 − 1 = 2⁸ × 13, this field has no primitive 512th root of unity.) They are stored in a compressed ROM that exploits symmetry and periodicity among these powers, reducing storage from 4 KB (raw) to 2.4 KB, a 40% saving, with no runtime decompression overhead. For the SKY130 prototype the ROM is synthesized as combinational logic, so no dedicated SRAM macro is required.
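
One symmetry such a ROM could exploit is that ζ¹²⁸ ≡ −1 (mod 3329) for ζ = 17, the primitive 256th root of unity used by FIPS 203, so the upper half of the power table is the negation of the lower half. The sketch below demonstrates only that property. The actual ROM encoding is not described here, and this example does not attempt to reproduce the 2.4 KB figure.

```python
Q = 3329
ZETA = 17  # primitive 256th root of unity mod Q (FIPS 203)

# Full table: all 256 powers of ZETA.
full = [pow(ZETA, k, Q) for k in range(256)]

# Half table: only the first 128 powers are kept.
half = full[:128]


def twiddle(k: int) -> int:
    """Return ZETA**k mod Q from the half table, using ZETA**128 == -1 (mod Q)."""
    k %= 256                      # periodicity: ZETA**256 == 1
    if k < 128:
        return half[k]
    return Q - half[k - 128]      # negation symmetry; entries are never zero
```

Every exponent resolves with at most one subtraction from q, which is cheap enough in hardware to count as decompression-free.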

Zero-Bubble Scheduling

The 8-way NTT array uses a pre-computed schedule that interleaves forward and inverse NTT operations across engines to achieve 100% pipeline utilization. When the Kyber FSM requires back-to-back NTT operations (e.g., polynomial multiplication via NTT → pointwise-multiply → inverse NTT), the scheduler assigns alternating engines to eliminate pipeline drain cycles.

Keccak-f[1600] Core

The Keccak core implements the Keccak-f[1600] permutation used by SHA-3, SHAKE-128, and SHAKE-256 — all required by ML-KEM for hash functions, XOF (extendable output function), and PRF (pseudorandom function) operations.

Table 3 — Keccak Core Specifications
| Parameter | Value |
|---|---|
| State width | 1600 bits (5 × 5 × 64) |
| Pipeline architecture | 6-stage unrolled (4 rounds per stage) |
| Rounds per permutation | 24 (Keccak-f[1600]) |
| Permutation latency | 6 clock cycles |
| Permutation throughput | 1 per 6 cycles (pipelined: 1 per cycle after fill) |
| Supported modes | SHA3-256, SHA3-512, SHAKE-128, SHAKE-256 |
| Padding | Automatic multi-rate padding (10*1 domain separation) |

The 6-stage unrolled design processes 4 Keccak rounds per pipeline stage, reducing permutation latency from 24 cycles (fully iterative) to 6 cycles. Once filled, the pipeline accepts a new 1600-bit state every cycle, giving a sustained throughput of one permutation per clock. Successive permutations of a single sponge depend on each other, so this rate applies when independent states are in flight, as in the streaming SHAKE operations used for matrix expansion (A-hat generation) during ML-KEM KeyGen.
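
The latency and throughput figures in Table 3 can be turned into a small timing model. The sketch below assumes an ideal pipeline with one issue per cycle and no stalls. The distinction between independent states and a dependent chain follows from the sponge construction, where each permutation of one stream consumes the previous result. The function names are illustrative.

```python
ROUNDS = 24                            # Keccak-f[1600]
ROUNDS_PER_STAGE = 4                   # from Table 3
STAGES = ROUNDS // ROUNDS_PER_STAGE    # pipeline depth in cycles


def cycles_for_independent(states: int) -> int:
    """Cycles to permute N independent states, issuing one per cycle."""
    if states <= 0:
        return 0
    return STAGES + (states - 1)


def cycles_for_chain(permutations: int) -> int:
    """Cycles for N permutations of one sponge, each waiting on the previous."""
    return STAGES * max(permutations, 0)
```

Under these assumptions, one permutation for each of the nine XOF streams of ML-KEM-768 matrix expansion takes `cycles_for_independent(9)` cycles, whereas nine back-to-back permutations of a single stream take `cycles_for_chain(9)`.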

Polynomial Arithmetic Unit

The polynomial arithmetic unit performs coefficient-level operations on 256-element vectors in the NTT domain and standard domain:

Table 4 — Polynomial Arithmetic Operations
| Operation | Module | Latency | Description |
|---|---|---|---|
| Pointwise multiply | poly_mul_coeff | 1 cycle | 256 parallel modular multiplications (NTT domain) |
| Polynomial add | poly_add | 1 cycle | 256-wide coefficient addition mod q |
| Polynomial subtract | poly_sub | 1 cycle | 256-wide coefficient subtraction mod q |
| Compress | poly_compress | 1 cycle | Coefficient compression for ciphertext encoding |
| Decompress | poly_decompress | 1 cycle | Coefficient decompression from ciphertext |
| Encode/Decode | poly_codec | 2 cycles | Byte-to-coefficient and coefficient-to-byte packing |
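
As a behavioral reference for the add, subtract, compress, and decompress rows of Table 4, the sketch below implements the coefficient-level definitions from FIPS 203 in integer arithmetic. It uses Python's `//` for clarity; a hardware implementation would more likely use a constant multiply and shift. The function names are illustrative and are not the RTL module interfaces.

```python
Q = 3329


def compress(x: int, d: int) -> int:
    """Compress_d: round(2**d / Q * x) mod 2**d, rounding half up."""
    return (((x << d) + Q // 2) // Q) & ((1 << d) - 1)


def decompress(y: int, d: int) -> int:
    """Decompress_d: round(Q / 2**d * y), rounding half up."""
    return (Q * y + (1 << (d - 1))) >> d


def poly_add(a, b):
    """Coefficient-wise addition mod Q."""
    return [(x + y) % Q for x, y in zip(a, b)]


def poly_sub(a, b):
    """Coefficient-wise subtraction mod Q."""
    return [(x - y) % Q for x, y in zip(a, b)]
```

Compression is lossy in one direction only: for d < 12, `compress(decompress(y, d), d)` returns `y`, while `decompress(compress(x, d), d)` is merely close to `x`.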

CBD Sampler

The Centered Binomial Distribution sampler generates noise polynomials required by ML-KEM for key generation and encryption. It consumes random bytes from the Keccak core (via SHAKE-256 PRF) and produces polynomial coefficients distributed according to CBD(η).

Table 5 — CBD Sampler Parameters
| Parameter Set | η | Input Bytes per Polynomial | Coefficient Range |
|---|---|---|---|
| ML-KEM-512 noise | 3 | 192 bytes (3 × 256 / 4) | [−3, +3] |
| ML-KEM-768 noise | 2 | 128 bytes (2 × 256 / 4) | [−2, +2] |
| ML-KEM-1024 noise | 2 | 128 bytes (2 × 256 / 4) | [−2, +2] |

The sampler processes 4 coefficients per clock cycle using a parallel bit-counting network. Each coefficient consumes 2η random bits: the sum of η bits minus the sum of another η bits. That is 4 bits per coefficient for η=2 and 6 bits for η=3. The implementation is constant-time, which prevents timing side-channel leakage of the noise values.
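
The sampling rule can be written down directly. The sketch below follows the CBD sampling and PRF definitions in FIPS 203, using `hashlib.shake_256` as the PRF. It is a software reference that produces one coefficient per loop iteration and returns centered values in [−η, +η]; it does not model the 4-coefficients-per-cycle hardware network. The function names are illustrative.

```python
import hashlib


def bytes_to_bits(data: bytes):
    """Expand bytes to bits, least-significant bit of each byte first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def sample_poly_cbd(data: bytes, eta: int):
    """Map 64*eta bytes to 256 coefficients drawn from CBD(eta)."""
    if len(data) != 64 * eta:
        raise ValueError("expected 64 * eta input bytes")
    bits = bytes_to_bits(data)
    coeffs = []
    for i in range(256):
        base = 2 * i * eta
        x = sum(bits[base:base + eta])              # first eta bits
        y = sum(bits[base + eta:base + 2 * eta])    # next eta bits
        coeffs.append(x - y)
    return coeffs


def prf(seed: bytes, nonce: int, eta: int) -> bytes:
    """PRF_eta(seed, nonce): SHAKE256(seed || nonce) truncated to 64*eta bytes."""
    return hashlib.shake_256(seed + bytes([nonce])).digest(64 * eta)
```

For example, `sample_poly_cbd(prf(seed, 0, 2), 2)` yields one noise polynomial for ML-KEM-768 from a 32-byte `seed`; reducing each coefficient mod q gives the stored representation.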

Kyber Control FSM

The Kyber control finite state machine orchestrates the complete ML-KEM protocol by sequencing operations across the NTT array, Keccak core, polynomial arithmetic unit, and CBD sampler. Three top-level operations are supported:

Table 6 — Kyber FSM Operation Breakdown (ML-KEM-768)
| Operation | Sub-Steps | NTT Calls | Keccak Calls | Total Cycles |
|---|---|---|---|---|
| KeyGen | A-hat generation (9 XOF), s/e sampling (6 CBD), NTT(s), NTT(e), t = As + e | 12 | ~72 | ~42 |
| Encaps | Hash(pk), r sampling (6 CBD), NTT(r), u = Aᵀr + e₁, v = tᵀr + e₂ + m | 15 | ~78 | ~51 |
| Decaps | Decrypt, re-encrypt, compare (constant-time) | 18 | ~84 | ~61 |

The FSM implements implicit rejection for decapsulation failures (as required by FIPS 203) using constant-time conditional selection — if the re-encrypted ciphertext does not match the received ciphertext, the FSM outputs a pseudorandom shared secret derived from the secret key and ciphertext, preventing chosen-ciphertext attacks.
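
The selection step can be sketched in software as follows. The rejection key uses the J function from FIPS 203 (SHAKE256 over z followed by the ciphertext, truncated to 32 bytes), where z is the secret rejection value stored in the decapsulation key. The function names and the mask-based select are illustrative assumptions, and Python offers no real constant-time guarantee; the point is the branch-free structure.

```python
import hashlib
import hmac


def j_hash(z: bytes, ciphertext: bytes) -> bytes:
    """Rejection key: SHAKE256(z || c) truncated to 32 bytes."""
    return hashlib.shake_256(z + ciphertext).digest(32)


def select_shared_secret(k_prime: bytes, k_bar: bytes,
                         ciphertext: bytes, reencrypted: bytes) -> bytes:
    """Return k_prime if the ciphertexts match, otherwise k_bar."""
    match = hmac.compare_digest(ciphertext, reencrypted)
    mask = -int(match) & 0xFF   # 0xFF on match, 0x00 on mismatch
    # Byte-wise select: both inputs are always read, only the mask differs.
    return bytes((a & mask) | (b & (mask ^ 0xFF))
                 for a, b in zip(k_prime, k_bar))
```

Both candidate keys are computed on every decapsulation, so the work done does not reveal whether the ciphertext was accepted.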

Secure SRAM Subsystem

Table 7 — SRAM Specifications
| Parameter | Value |
|---|---|
| Total capacity | 64 KB (65,536 bytes) |
| Organization | 4 × 16 KB banks |
| Data width | 256 bits per bank (with 8-bit ECC per 64-bit word) |
| Access latency | 1 cycle (single-port), 2 cycles (dual-port arbitrated) |
| ECC | SECDED (single-error correct, double-error detect) |
| Tamper detection | Address/data parity checking, voltage glitch detector |
| Zeroization | Hardware-triggered full-bank zero in 64 cycles |
| Implementation (SKY130) | Synthesized flip-flop arrays (no SRAM macros available) |
| Implementation (GF22FDX) | Embedded SRAM macros (4× density improvement) |

The SRAM is partitioned into four banks to allow concurrent access from the NTT array (polynomial storage) and Keccak core (hash state buffering). A dual-port arbiter with round-robin priority prevents starvation while maintaining deterministic access latency for side-channel resistance.
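
Round-robin arbitration of the kind described above can be modeled in a few lines. The sketch below is a generic behavioral model, not the QCORE-C1 arbiter RTL; the requester indices and function names are hypothetical. It shows why no requester can starve: the search for the next grant always starts just after the previous winner.

```python
from typing import List, Optional


def round_robin_grant(requests: List[bool], last_grant: int) -> Optional[int]:
    """Return the first requesting index after last_grant, or None if idle."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_grant + offset) % n
        if requests[candidate]:
            return candidate
    return None


def simulate(request_trace: List[List[bool]], last_grant: int = -1) -> List[Optional[int]]:
    """Apply the arbiter cycle by cycle and record which requester wins."""
    grants = []
    for requests in request_trace:
        winner = round_robin_grant(requests, last_grant)
        if winner is not None:
            last_grant = winner
        grants.append(winner)
    return grants
```

With two requesters that both assert every cycle (for instance, index 0 for the NTT array and index 1 for the Keccak core), `simulate` alternates grants between them, so each waits at most one cycle.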

Area Breakdown

Table 8 — Die Area Comparison
| Block | SKY130 (4.0 mm²) | GF22FDX (1.44 mm² projected) |
|---|---|---|
| NTT Array | 0.80 mm² (20%) | 0.18 mm² (12.5%) |
| Kyber Control + Poly Arith | 0.30 mm² (7.5%) | 0.08 mm² (5.6%) |
| Keccak/SHA-3 | 0.20 mm² (5%) | 0.06 mm² (4.2%) |
| Secure SRAM | 0.50 mm² (12.5%) | 0.12 mm² (8.3%) |
| QLI Protocol + PHY | 0.20 mm² (5%) | 0.10 mm² (6.9%) |
| Clock/PLL | 0.10 mm² (2.5%) | 0.06 mm² (4.2%) |
| I/O Ring | 1.20 mm² (30%) | 0.60 mm² (41.7%) |
| Routing/Misc | 0.70 mm² (17.5%) | 0.24 mm² (16.7%) |
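
The figures in Table 8 can be cross-checked with a short script. The sketch below copies the per-block areas from the table, confirms that each column sums to the stated die area, and prints the per-block shrink factor. It introduces no data beyond what the table lists.

```python
# (SKY130 mm^2, GF22FDX projected mm^2), copied from Table 8
AREAS = {
    "NTT Array": (0.80, 0.18),
    "Kyber Control + Poly Arith": (0.30, 0.08),
    "Keccak/SHA-3": (0.20, 0.06),
    "Secure SRAM": (0.50, 0.12),
    "QLI Protocol + PHY": (0.20, 0.10),
    "Clock/PLL": (0.10, 0.06),
    "I/O Ring": (1.20, 0.60),
    "Routing/Misc": (0.70, 0.24),
}

sky_total = sum(sky for sky, _ in AREAS.values())
gf_total = sum(gf for _, gf in AREAS.values())


def shrink(block: str) -> float:
    """Ratio of SKY130 area to projected GF22FDX area for one block."""
    sky, gf = AREAS[block]
    return sky / gf


for name, (sky, gf) in AREAS.items():
    print(f"{name:28s} {100 * sky / sky_total:5.1f}% -> "
          f"{100 * gf / gf_total:5.1f}%  shrink {shrink(name):.1f}x")
```

The shrink factors make the trend in the table explicit: digital logic scales considerably more than the I/O ring, which is why the pad ring's share of the die grows in the projected 22FDX floorplan.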

SKY130 → GF22FDX Migration

Table 9 — Process Comparison
| Parameter | SkyWater SKY130 | GF 22FDX (Target) |
|---|---|---|
| Node | 130nm CMOS | 22nm FD-SOI |
| Purpose | Prototype / validation | Production |
| Die size | 2.0 × 2.0 mm | 1.2 × 1.2 mm (projected) |
| Max clock | 100–120 MHz | 500–800 MHz |
| Typical power | 350 mW | 200 mW |
| US fabrication | ✓ (SkyWater, Minnesota) | ✓ (GlobalFoundries, Vermont/New York) |
| CHIPS Act eligible | - | ✓ |
| EDA flow | OpenLane 2 (open-source) | Cadence/Synopsys (commercial) |
| SRAM macros | None (synthesized FFs) | Available (4× density) |
| Shuttle cost | ~$15K (Efabless chipIgnite) | ~$500K (engineering lot) |

See also: Datasheet for electrical specifications, Developer Guide for register-level programming, and the QLI Interface Reference for interconnect details.