Architecture Deep Dive
Detailed internal architecture of the QCORE-C1 post-quantum cryptographic accelerator chiplet. This document covers the 8-way Radix-16 NTT butterfly array, Keccak-f[1600] hashing core, polynomial arithmetic unit, Centered Binomial Distribution (CBD) sampler, Kyber control finite state machine, secure SRAM subsystem, and overall design hierarchy.
Design Hierarchy
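The QCORE-C1 top-level module (qcore_c1_top) integrates six major modules, which reach the shared 64 KB secure SRAM through an internal crossbar with round-robin arbitration. The table also lists the I/O ring and routing overhead so that the areas sum to the 4.0 mm² SKY130 die: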
| Module | Description | Area (SKY130) | Area % |
|---|---|---|---|
| qli_interface | QLI Protocol + PHY (TX/RX datapaths, link training, credit management) | 0.2 mm² | 5% |
| kyber_engine | ML-KEM accelerator (NTT array, poly arithmetic, CBD, Kyber FSM) | 1.1 mm² | 27.5% |
| keccak_core | Keccak-f[1600] hashing (6-stage unrolled pipeline, SHA-3/SHAKE) | 0.2 mm² | 5% |
| secure_sram | 64 KB SRAM with ECC, tamper detection, dual-port arbiter | 0.5 mm² | 12.5% |
| dft_subsystem | JTAG TAP, scan chains, memory BIST controller | 0.1 mm² | 2.5% |
| clock_reset | PLL, clock distribution, power-on reset, clock gating cells | 0.1 mm² | 2.5% |
| I/O ring | 128 TX + 128 RX + 30 control pads | 1.2 mm² | 30% |
| Routing & misc | Interconnect, fill, guard rings | 0.6 mm² | 15% |
8-Way Radix-16 NTT Array
The Number Theoretic Transform (NTT) is the computational bottleneck in ML-KEM. The QCORE-C1 implements eight parallel Radix-16 NTT butterfly engines that process 256-coefficient polynomials over the ring ℤ_q[x]/(x²⁵⁶ + 1), where q = 3329.
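As a point of reference for the transform itself, the following is a minimal software model of the ML-KEM NTT and its inverse, written to follow the radix-2 layer structure of FIPS 203 rather than the chip's Radix-16 datapath. The names `ntt`, `intt`, `bitrev7`, and `ZETAS` are illustrative and not taken from the RTL; how the hardware fuses these layers into Radix-16 passes is not modeled here.

```python
Q = 3329      # ML-KEM modulus
N = 256       # coefficients per polynomial
ZETA = 17     # primitive 256th root of unity mod Q (FIPS 203)


def bitrev7(i):
    """Reverse the 7 low-order bits of i."""
    return int(f"{i:07b}"[::-1], 2)


# 128 twiddle factors, stored in the bit-reversed order the loops consume them.
ZETAS = [pow(ZETA, bitrev7(i), Q) for i in range(128)]


def ntt(f):
    """Forward NTT: seven layers of radix-2 butterflies (length 128 down to 2)."""
    a = list(f)
    i = 1
    length = 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[i]
            i += 1
            for j in range(start, start + length):
                t = zeta * a[j + length] % Q
                a[j + length] = (a[j] - t) % Q
                a[j] = (a[j] + t) % Q
        length //= 2
    return a


def intt(f_hat):
    """Inverse NTT: layers in reverse order, then scaling by 128^-1 mod Q."""
    a = list(f_hat)
    i = 127
    length = 2
    while length <= 128:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[i]
            i -= 1
            for j in range(start, start + length):
                t = a[j]
                a[j] = (t + a[j + length]) % Q
                a[j + length] = zeta * (a[j + length] - t) % Q
        length *= 2
    n_inv = pow(128, -1, Q)   # modular inverse, Python 3.8+
    return [x * n_inv % Q for x in a]
```

A model like this is mainly useful as a golden reference: a testbench can feed the same polynomial to the RTL and to `ntt`, then compare the coefficient vectors.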
NTT Engine Architecture
Each NTT engine implements a Radix-16 butterfly unit that performs 16-point NTT operations in a single clock cycle using a network of modular multipliers and adders. The Radix-16 decomposition reduces the number of pipeline stages compared to Radix-2 (Cooley-Tukey) by a factor of 4, completing a full 256-point NTT in 16 cycles instead of 64.
| Parameter | Value |
|---|---|
| Butterfly radix | Radix-16 (4 stages collapsed) |
| Parallel engines | 8 |
| Coefficient width | 12 bits (⌈log₂(3329)⌉) |
| Pipeline depth | 4 stages per butterfly |
| Cycles per 256-point NTT | 16 cycles |
| Twiddle factor storage | Compressed ROM (symmetry-exploited), 2.4 KB |
| Modular reduction | Barrett reduction (constant-time) |
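The Barrett reduction named in the table can be sketched in a few lines. This is an illustrative software model only: the shift width `K = 24` and the helper names are assumptions, not values from the RTL, and Python integers do not give real constant-time guarantees. The point is the structure, which is one multiply, one shift, and one masked subtraction with no data-dependent branch.

```python
Q = 3329
K = 24                 # assumed shift width; Q*Q < 2**24, so every product fits
M = (1 << K) // Q      # precomputed floor(2**K / Q)


def barrett_reduce(x):
    """Reduce 0 <= x < 2**K modulo Q without a data-dependent branch."""
    t = (x * M) >> K       # quotient estimate: floor(x / Q) or one less
    r = x - t * Q          # therefore 0 <= r < 2*Q
    r -= Q & -(r >= Q)     # masked subtract: -(True) is all ones, -(False) is 0
    return r


def mulmod(a, b):
    """Modular multiply of two coefficients in [0, Q)."""
    return barrett_reduce(a * b)
```

With these constants the quotient estimate is off by at most one, which is why a single conditional subtraction is enough.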
Twiddle Factor ROM
The twiddle factors (powers of the primitive 256th root of unity modulo 3329; q − 1 = 2⁸ · 13, so no 512th root exists) are stored in a compressed ROM that exploits the half-period negation symmetry ζ^(k+128) ≡ −ζ^k (mod q) together with periodicity. This reduces storage from 4 KB (raw) to 2.4 KB, a 40% reduction achieved without runtime decompression overhead. The ROM is synthesized as combinational logic for the SKY130 prototype (no dedicated SRAM macro required).
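The symmetry that makes twiddle compression possible can be checked directly. The sketch below assumes the FIPS 203 root ζ = 17 and only demonstrates the arithmetic property; it says nothing about how the ROM is actually laid out.

```python
Q = 3329
ZETA = 17   # root of unity used by FIPS 203

# Q - 1 = 2**8 * 13, so elements of order 256 exist but none of order 512.
order_256_exists = (Q - 1) % 256 == 0
order_512_exists = (Q - 1) % 512 == 0

# All 256 powers of zeta.
powers = [pow(ZETA, k, Q) for k in range(256)]

# Half-period negation: zeta**(k + 128) == -zeta**k (mod Q), so only the
# first 128 powers need storing; the rest follow from a modular negation.
half_symmetric = all(powers[k + 128] == Q - powers[k] for k in range(128))
```

Because the second half of the table is the negation of the first, a ROM only needs to hold half of the entries and can derive the others with a subtractor.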
Zero-Bubble Scheduling
The 8-way NTT array uses a pre-computed schedule that interleaves forward and inverse NTT operations across engines to achieve 100% pipeline utilization. When the Kyber FSM requires back-to-back NTT operations (e.g., polynomial multiplication via NTT → pointwise-multiply → inverse NTT), the scheduler assigns alternating engines to eliminate pipeline drain cycles.
Keccak-f[1600] Core
The Keccak core implements the Keccak-f[1600] permutation used by SHA-3, SHAKE-128, and SHAKE-256, all of which ML-KEM requires for its hash functions, XOF (extendable-output function), and PRF (pseudorandom function) operations.
| Parameter | Value |
|---|---|
| State width | 1600 bits (5 × 5 × 64) |
| Pipeline architecture | 6-stage unrolled (4 rounds per stage) |
| Rounds per permutation | 24 (Keccak-f[1600]) |
| Permutation latency | 6 clock cycles |
| Permutation throughput | 1 per 6 cycles (pipelined: 1 per cycle after fill) |
| Supported modes | SHA3-256, SHA3-512, SHAKE-128, SHAKE-256 |
| Padding | Automatic multi-rate padding (pad10*1) with SHA-3/SHAKE domain-separation suffix bits |
The 6-stage unrolled design processes 4 Keccak rounds per pipeline stage, reducing permutation latency from 24 cycles (fully iterative) to 6 cycles. The pipeline accepts a new 1600-bit state every cycle once filled, achieving sustained throughput of one permutation per clock for streaming SHAKE operations used in matrix expansion (A-hat generation) during ML-KEM KeyGen.
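The relationship between rounds, pipeline stages, and sponge rate can be made concrete with a small calculation. The capacities below are the FIPS 202 values for the four supported modes. The helper names, the 34-byte default input (a 32-byte seed plus two index bytes, as in ML-KEM matrix expansion), and the 504-byte output in the usage note are illustrative assumptions rather than figures from this design.

```python
import hashlib
import math

STATE_BITS = 1600
ROUNDS = 24
ROUNDS_PER_STAGE = 4                          # from the parameter table
PIPELINE_STAGES = ROUNDS // ROUNDS_PER_STAGE  # 24 / 4 = 6 stages

# Capacity in bits per mode (FIPS 202); rate = state width - capacity.
CAPACITY_BITS = {
    "SHA3-256": 512,
    "SHA3-512": 1024,
    "SHAKE-128": 256,
    "SHAKE-256": 512,
}


def rate_bytes(mode):
    """Bytes absorbed or squeezed per permutation call."""
    return (STATE_BITS - CAPACITY_BITS[mode]) // 8


def permutations_needed(mode, out_bytes, in_bytes=34):
    """Permutation calls to absorb in_bytes and then squeeze out_bytes."""
    r = rate_bytes(mode)
    absorb_blocks = in_bytes // r + 1          # padding always adds at least one byte
    squeeze_blocks = max(1, math.ceil(out_bytes / r))
    # The last absorb permutation also produces the first output block.
    return absorb_blocks + squeeze_blocks - 1


# Software equivalent of one XOF call, usable as a test reference.
xof_reference = hashlib.shake_128(bytes(34)).digest(504)
```

For example, squeezing 504 bytes from SHAKE-128 after a 34-byte input takes `permutations_needed("SHAKE-128", 504)` permutation calls, each with the 6-cycle latency listed above.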
Polynomial Arithmetic Unit
The polynomial arithmetic unit performs coefficient-level operations on 256-element vectors in the NTT domain and standard domain:
| Operation | Module | Latency | Description |
|---|---|---|---|
| Pointwise multiply | poly_mul_coeff | 1 cycle | 256 parallel modular multiplications (NTT domain) |
| Polynomial add | poly_add | 1 cycle | 256-wide coefficient addition mod q |
| Polynomial subtract | poly_sub | 1 cycle | 256-wide coefficient subtraction mod q |
| Compress | poly_compress | 1 cycle | Coefficient compression for ciphertext encoding |
| Decompress | poly_decompress | 1 cycle | Coefficient decompression from ciphertext |
| Encode/Decode | poly_codec | 2 cycles | Byte-to-coefficient and coefficient-to-byte packing |
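The compress and decompress operations in the table correspond to the rounding maps defined in FIPS 203. A minimal integer-only sketch is shown below. The function names are illustrative, and the hardware presumably replaces the division with a constant multiply and shift, which is not modeled here.

```python
Q = 3329


def compress(x, d):
    """round(2**d * x / Q) mod 2**d, for 0 <= x < Q, integer arithmetic only."""
    return (((x << d) + Q // 2) // Q) & ((1 << d) - 1)


def decompress(y, d):
    """round(Q * y / 2**d), for 0 <= y < 2**d, integer arithmetic only."""
    return (Q * y + (1 << (d - 1))) >> d
```

Compression is lossy, but decompressing and then compressing again returns the original value for every d below 12, which makes a convenient self-check for both a model and the RTL.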
CBD Sampler
The Centered Binomial Distribution sampler generates noise polynomials required by ML-KEM for key generation and encryption. It consumes random bytes from the Keccak core (via SHAKE-256 PRF) and produces polynomial coefficients distributed according to CBD(η).
| Parameter Set | η | Input Bytes per Polynomial | Coefficient Range |
|---|---|---|---|
| ML-KEM-512 noise | 3 (η₁; η₂ = 2) | 192 bytes (3 × 256 / 4) | [−3, +3] |
| ML-KEM-768 noise | 2 | 128 bytes (2 × 256 / 4) | [−2, +2] |
| ML-KEM-1024 noise | 2 | 128 bytes (2 × 256 / 4) | [−2, +2] |
The sampler processes 4 coefficients per clock cycle using a parallel bit-counting network. For η=2, each coefficient requires 4 random bits (two 2-bit groups are each summed, and the second sum is subtracted from the first). For η=3, each coefficient requires 6 random bits. The sampler runs in constant time, which prevents timing side-channel leakage of the noise values.
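The sampling rule described above can be written as a short reference model. The sketch follows the FIPS 203 definitions of the PRF and of CBD sampling, with little-endian bit order inside each byte. The function names are illustrative, and the model produces one coefficient per loop iteration rather than four per cycle.

```python
import hashlib


def prf(eta, seed, nonce):
    """PRF_eta(s, b): SHAKE-256 of s || b, truncated to 64 * eta bytes."""
    return hashlib.shake_256(seed + bytes([nonce])).digest(64 * eta)


def bytes_to_bits(data):
    """Expand bytes to bits, least significant bit of each byte first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def sample_poly_cbd(eta, data):
    """Return 256 signed coefficients, each in [-eta, +eta]."""
    if len(data) != 64 * eta:
        raise ValueError("CBD input must be 64 * eta bytes")
    bits = bytes_to_bits(data)
    coeffs = []
    for i in range(256):
        base = 2 * i * eta
        x = sum(bits[base:base + eta])              # first group of eta bits
        y = sum(bits[base + eta:base + 2 * eta])    # second group of eta bits
        coeffs.append(x - y)
    return coeffs
```

Calling `sample_poly_cbd(2, prf(2, seed, 0))` with a 32-byte `seed` yields one noise polynomial; a hardware implementation would then map negative values into [0, q) before storage.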
Kyber Control FSM
The Kyber control finite state machine orchestrates the complete ML-KEM protocol by sequencing operations across the NTT array, Keccak core, polynomial arithmetic unit, and CBD sampler. Three top-level operations are supported:
| Operation | Sub-Steps | NTT Calls | Keccak Calls | Total Cycles |
|---|---|---|---|---|
| KeyGen | A-hat generation (9 XOF), s/e sampling (6 CBD), NTT(s), NTT(e), t=As+e | 12 | ~72 | ~42 |
| Encaps | Hash(pk), r sampling (6 CBD), NTT(r), u = Aᵀr + e₁, v = tᵀr + e₂ + m | 15 | ~78 | ~51 |
| Decaps | Decrypt, re-encrypt, compare (constant-time) | 18 | ~84 | ~61 |
The FSM implements implicit rejection for decapsulation failures (as required by FIPS 203) using constant-time conditional selection: if the re-encrypted ciphertext does not match the received ciphertext, the FSM outputs a pseudorandom shared secret derived from the secret key and ciphertext, preventing chosen-ciphertext attacks.
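The selection step can be illustrated with a small software sketch. It assumes the FIPS 203 construction, in which the rejection key is SHAKE-256 over the secret value z concatenated with the ciphertext, and it uses a byte mask instead of a branch. The function name and argument names are hypothetical, and Python itself does not guarantee constant-time execution.

```python
import hashlib
import hmac


def implicit_reject_select(k_prime, z, c, c_prime):
    """Return k_prime if c == c_prime, otherwise the rejection key J(z || c)."""
    # The rejection key is always computed, so both paths do the same work.
    k_bar = hashlib.shake_256(z + c).digest(32)
    match = hmac.compare_digest(c, c_prime)   # comparison without early exit
    mask = -int(match) & 0xFF                 # 0xFF on match, 0x00 otherwise
    return bytes((a & mask) | (b & (mask ^ 0xFF))
                 for a, b in zip(k_prime, k_bar))
```

The important property is that the rejection key is computed unconditionally and the final choice is a bitwise select, so neither the amount of work nor the output length reveals whether the ciphertext check passed.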
Secure SRAM Subsystem
| Parameter | Value |
|---|---|
| Total capacity | 64 KB (65,536 bytes) |
| Organization | 4 × 16 KB banks |
| Data width | 256 bits per bank (with 8-bit ECC per 64-bit word) |
| Access latency | 1 cycle (single-port), 2 cycles (dual-port arbitrated) |
| ECC | SECDED (single-error correct, double-error detect) |
| Tamper detection | Address/data parity checking, voltage glitch detector |
| Zeroization | Hardware-triggered full-bank zero in 64 cycles |
| Implementation (SKY130) | Synthesized flip-flop arrays (no SRAM macros available) |
| Implementation (GF22FDX) | Embedded SRAM macros (4× density improvement) |
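Eight check bits per 64-bit word is consistent with a (72,64) extended Hamming code, the usual way to obtain SECDED. The sketch below models that scheme under the assumption that this is the code in use; the bit layout (check bits at power-of-two positions, overall parity at position 0) and the function names are illustrative and may differ from the actual design.

```python
CODE_BITS = 71   # Hamming positions 1..71; position 0 holds overall parity

# Data bits occupy the 64 positions that are not powers of two.
DATA_POSITIONS = [p for p in range(1, CODE_BITS + 1) if p & (p - 1)]


def encode(word):
    """Encode a 64-bit integer into a list of 72 bits."""
    bits = [0] * (CODE_BITS + 1)
    syndrome = 0
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
        if bits[pos]:
            syndrome ^= pos
    for i in range(7):                      # 7 Hamming check bits
        bits[1 << i] = (syndrome >> i) & 1
    bits[0] = sum(bits) & 1                 # overall parity makes the total even
    return bits


def decode(bits):
    """Return (word, status); status is ok, corrected, or double_error."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_BITS + 1):
        if bits[pos]:
            syndrome ^= pos
    if sum(bits) & 1:                       # odd parity: assume a single flip
        if syndrome > CODE_BITS:
            return None, "uncorrectable"
        bits[syndrome] ^= 1                 # syndrome 0 points at the parity bit
        status = "corrected"
    elif syndrome:
        status = "double_error"             # even parity, nonzero syndrome
    else:
        status = "ok"
    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return word, status
```

A single flipped bit anywhere in the 72-bit codeword is located by the syndrome and corrected, while any two flipped bits leave the overall parity even with a nonzero syndrome and are reported without attempting a correction.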
The SRAM is partitioned into four banks to allow concurrent access from the NTT array (polynomial storage) and Keccak core (hash state buffering). A dual-port arbiter with round-robin priority prevents starvation while maintaining deterministic access latency for side-channel resistance.
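Round-robin arbitration of the kind described here can be modeled in a few lines. The class below is a behavioral sketch with hypothetical names, not the RTL arbiter. It grants at most one requester per call and rotates priority so that the most recently granted requester is considered last.

```python
class RoundRobinArbiter:
    """Behavioral model of a round-robin arbiter for n requesters."""

    def __init__(self, num_requesters):
        self.n = num_requesters
        self.last = num_requesters - 1   # requester 0 gets first priority

    def grant(self, requests):
        """requests: sequence of booleans. Returns the granted index or None."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx          # granted requester moves to lowest priority
                return idx
        return None
```

With two requesters that both assert every cycle (for example the NTT array and the Keccak core contending for one bank), the grants alternate, so neither side waits more than one arbitration slot.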
Area Breakdown
| Block | SKY130 (4.0 mm²) | GF22FDX (1.44 mm² projected) |
|---|---|---|
| NTT Array | 0.80 mm² (20%) | 0.18 mm² (12.5%) |
| Kyber Control + Poly Arith | 0.30 mm² (7.5%) | 0.08 mm² (5.6%) |
| Keccak/SHA-3 | 0.20 mm² (5%) | 0.06 mm² (4.2%) |
| Secure SRAM | 0.50 mm² (12.5%) | 0.12 mm² (8.3%) |
| QLI Protocol + PHY | 0.20 mm² (5%) | 0.10 mm² (6.9%) |
| Clock/PLL | 0.10 mm² (2.5%) | 0.06 mm² (4.2%) |
| I/O Ring | 1.20 mm² (30%) | 0.60 mm² (41.7%) |
| Routing/Misc | 0.70 mm² (17.5%) | 0.24 mm² (16.7%) |
SKY130 → GF22FDX Migration
| Parameter | SkyWater SKY130 | GF 22FDX (Target) |
|---|---|---|
| Node | 130nm CMOS | 22nm FD-SOI |
| Purpose | Prototype / validation | Production |
| Die size | 2.0 × 2.0 mm | 1.2 × 1.2 mm (projected) |
| Max clock | 100–120 MHz | 500–800 MHz |
| Typical power | 350 mW | 200 mW |
| US fabrication | ✓ (SkyWater, Minnesota) | ✓ (GlobalFoundries, Vermont/New York) |
| CHIPS Act eligible | - | ✓ |
| EDA flow | OpenLane 2 (open-source) | Cadence/Synopsys (commercial) |
| SRAM macros | None (synthesized FFs) | Available (4× density) |
| Shuttle cost | ~$15K (Efabless chipIgnite) | ~$500K (engineering lot) |