Async & Batch Operations

SDK-ASYNC-007 v1.0

The QUAC 100 hardware processes cryptographic operations through a multi-stage pipeline capable of sustaining over 1.4 million ML-KEM operations per second. To fully exploit this throughput, the QuantaCore SDK provides an asynchronous job submission model that decouples application threads from hardware execution, enabling non-blocking cryptographic processing, batch aggregation, and DMA-optimized data transfers.

Overview

The synchronous API (quac_kem_keygen, quac_sign, etc.) is convenient for simple applications but serializes operations: each call blocks until the hardware completes. For high-throughput deployments such as TLS termination proxies, payment processing, or certificate authorities, the asynchronous API provides dramatically higher aggregate throughput by keeping the hardware pipeline saturated.

Characteristic	Synchronous API	Asynchronous API	Batch API
Call model	Blocking	Non-blocking, callback	Non-blocking, vectored
Latency per op	~5 μs (includes round-trip)	~0.7 μs (submission only)	~0.3 μs amortized
Throughput (ML-KEM-768)	~200K ops/s (1 thread)	~800K ops/s (4 threads)	~1.4M ops/s (peak)
CPU utilization	High (spin-wait)	Low (event-driven)	Minimal (DMA offload)
Complexity	Trivial	Moderate	Higher
Best for	Prototyping, low-rate	Servers, middleware	High-frequency, bulk

All three models operate on the same hardware and produce identical cryptographic outputs. You can freely mix synchronous and asynchronous calls within the same application; the SDK manages internal queue arbitration automatically.

Synchronous vs. Asynchronous

The synchronous API wraps asynchronous submission internally. When you call quac_kem_keygen(), the SDK submits a job descriptor to the hardware command ring, then spin-waits on the completion ring until the result is available. This simplicity has a cost: the calling thread is blocked for the full hardware round-trip, typically 3–5 microseconds, during which it cannot submit additional work.

The asynchronous API exposes the underlying submission/completion separation. You submit operations via quac_async_submit() which returns immediately with a job handle. Completion is signaled through one of three mechanisms: polling, callbacks, or event file descriptors suitable for integration with epoll/kqueue event loops.

// Synchronous: blocks until completion
quac_kem_keypair_t kp;
quac_status_t rc = quac_kem_keygen(dev, QUAC_KEM_ML_KEM_768, &kp);
// When the call returns, rc holds the result and kp is populated

// Asynchronous: returns immediately
quac_job_t job;
rc = quac_async_submit(dev, &(quac_async_desc_t){
    .op      = QUAC_OP_KEM_KEYGEN,
    .alg     = QUAC_KEM_ML_KEM_768,
    .out     = &kp,
    .out_len = sizeof(kp),
    .cb      = my_keygen_callback,
    .cb_ctx  = my_context,
}, &job);
// rc reports only whether submission succeeded; job is a handle and kp is NOT yet populated
// my_keygen_callback fires when hardware completes

Choosing the Right Model

Use synchronous calls when your application processes fewer than 50,000 operations per second or when integration simplicity outweighs throughput. Switch to asynchronous when you need to overlap cryptographic work with I/O, serve multiple concurrent TLS handshakes, or saturate the hardware pipeline. Use the batch API when processing large volumes of similar operations, such as signing a batch of certificates, encrypting a queue of messages, or performing bulk key exchanges.

Job Submission Model

Every asynchronous operation begins with a job descriptor (quac_async_desc_t) that fully describes the operation to perform. The descriptor is a value type: the SDK copies its contents into the hardware command ring, so the caller can reuse or free the descriptor immediately after quac_async_submit() returns. This applies to the descriptor only: the output buffer it points to must remain valid until the job completes.

typedef struct {
    quac_op_t         op;          // Operation type (KEYGEN, ENCAPS, SIGN, etc.)
    quac_alg_t        alg;         // Algorithm identifier
    const void       *in;          // Input data (message, ciphertext, etc.)
    size_t            in_len;      // Input data length
    const void       *key;         // Key material (for sign/verify/decaps)
    size_t            key_len;     // Key length
    void             *out;         // Output buffer (must remain valid until completion)
    size_t            out_len;     // Output buffer capacity
    size_t           *actual_len;  // Receives actual output length (optional)
    quac_priority_t   priority;    // QUAC_PRIO_LOW / NORMAL / HIGH / REALTIME
    quac_async_cb_t   cb;          // Completion callback (optional)
    void             *cb_ctx;      // Callback context pointer
    uint64_t          tag;         // User-defined correlation tag
} quac_async_desc_t;
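
The value-type behaviour described above can be illustrated with a toy ring buffer. This is a minimal sketch, not the SDK's implementation: toy_desc_t, toy_ring_t, toy_ring_submit and toy_ring_pop are hypothetical names invented for the example, and the four-slot capacity is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the SDK's real types. */
typedef struct {
    int      op;   /* operation code */
    uint64_t tag;  /* user correlation tag */
} toy_desc_t;

#define TOY_RING_SLOTS 4

typedef struct {
    toy_desc_t slots[TOY_RING_SLOTS];
    size_t     head;   /* index of the oldest occupied slot */
    size_t     count;  /* number of occupied slots */
} toy_ring_t;

/* Copy the descriptor into the ring. Returns 0 on success, -1 if full. */
int toy_ring_submit(toy_ring_t *ring, const toy_desc_t *desc) {
    assert(ring != NULL && desc != NULL);
    if (ring->count == TOY_RING_SLOTS) {
        return -1;  /* analogous to a queue-full error */
    }
    size_t tail = (ring->head + ring->count) % TOY_RING_SLOTS;
    ring->slots[tail] = *desc;  /* struct copy: the caller's storage is free again */
    ring->count++;
    return 0;
}

/* Remove the oldest descriptor. Returns 0 on success, -1 if empty. */
int toy_ring_pop(toy_ring_t *ring, toy_desc_t *out) {
    assert(ring != NULL && out != NULL);
    if (ring->count == 0) {
        return -1;
    }
    *out = ring->slots[ring->head];
    ring->head = (ring->head + 1) % TOY_RING_SLOTS;
    ring->count--;
    return 0;
}
```

Because the ring stores its own copy, changing or discarding the caller's descriptor after submission does not affect the queued job. The copy is shallow, though: any buffer a descriptor points to still belongs to the caller, which is why the out field is documented as needing to remain valid until completion.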

Job Handles

On successful submission, quac_async_submit() returns a quac_job_t handle. This handle is a lightweight 64-bit token that can be used to poll for completion, cancel the operation, or query its status. Handles are valid until explicitly released via quac_async_release() or until the device context is closed.

quac_job_t job;
quac_status_t rc = quac_async_submit(dev, &desc, &job);
if (rc != QUAC_OK) {
    // Submission failed: queue full, invalid params, etc.
    handle_error(rc);
    return;
}

// Option 1: Poll for completion
quac_job_status_t status;
while ((status = quac_async_poll(job)) == QUAC_JOB_PENDING) {
    // Do other work or yield
    sched_yield();
}

// Option 2: Blocking wait with timeout
rc = quac_async_wait(job, 5000);  // 5 second timeout (ms)
if (rc == QUAC_ERR_TIMEOUT) { ... }

// Option 3: Wait on multiple jobs
quac_job_t jobs[4] = { j1, j2, j3, j4 };
size_t completed_idx;
rc = quac_async_wait_any(jobs, 4, &completed_idx, 5000);

// Release handle when done
quac_async_release(job);

Priority Levels

The hardware command ring supports four priority levels. Higher-priority jobs are dequeued before lower-priority ones, but the scheduler guarantees forward progress: low-priority jobs are never starved indefinitely. Priority scheduling uses a weighted fair-queuing algorithm with configurable weights.

Priority	Constant	Default Weight	Use Case
Realtime	`QUAC_PRIO_REALTIME`	8	Interactive TLS, financial trading
High	`QUAC_PRIO_HIGH`	4	Certificate signing, session keys
Normal	`QUAC_PRIO_NORMAL`	2	General application traffic (default)
Low	`QUAC_PRIO_LOW`	1	Background key rotation, pre-generation
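
To get a feel for what the default weights mean, the sketch below uses weighted round-robin, a simple approximation of weighted fair queuing. The hardware's actual scheduling algorithm is not specified here, so treat this as an illustration of the proportions only; wrr_dispatch and the array layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define N_PRIO 4  /* index 0 = LOW, 1 = NORMAL, 2 = HIGH, 3 = REALTIME */

/* Default weights from the table above, indexed LOW..REALTIME. */
static const unsigned default_weights[N_PRIO] = { 1, 2, 4, 8 };

/*
 * Dequeue up to `budget` jobs from per-priority backlogs.
 * Each round visits priorities from highest to lowest and takes at most
 * weights[p] jobs from priority p. served[p] receives the jobs taken.
 */
void wrr_dispatch(unsigned backlog[N_PRIO], const unsigned weights[N_PRIO],
                  unsigned budget, unsigned served[N_PRIO]) {
    assert(backlog != NULL && weights != NULL && served != NULL);
    for (size_t p = 0; p < N_PRIO; p++) {
        served[p] = 0;
    }
    while (budget > 0) {
        unsigned taken_this_round = 0;
        for (int p = N_PRIO - 1; p >= 0 && budget > 0; p--) {
            unsigned quota = weights[p];
            while (quota > 0 && backlog[p] > 0 && budget > 0) {
                backlog[p]--;
                served[p]++;
                quota--;
                budget--;
                taken_this_round++;
            }
        }
        if (taken_this_round == 0) {
            break;  /* every backlog is empty (or every weight is zero) */
        }
    }
}
```

With all four classes backlogged, 150 dispatch slots split 80/40/20/10 from Realtime down to Low, so low-priority work still advances. When only low-priority jobs are queued, they receive every slot, because unused quota is not reserved for idle classes.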

Job Cancellation

Jobs that have not yet been dispatched to a hardware execution unit can be cancelled via quac_async_cancel(). Once a job is executing on hardware it cannot be stopped: the operation runs to completion and its result is discarded. The callback, if registered, is still invoked with status QUAC_JOB_CANCELLED.

quac_status_t rc = quac_async_cancel(job);
if (rc == QUAC_OK) {
    // Successfully cancelled before hardware dispatch
} else if (rc == QUAC_ERR_IN_PROGRESS) {
    // Already executing: will complete, callback fires with CANCELLED status
}

Callbacks & Notifications

The callback model is the most efficient way to handle completions in event-driven architectures. The SDK maintains a completion thread pool that monitors hardware completion rings and dispatches callbacks. Callbacks execute on SDK-managed threads, not on the submitting thread.

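// Runs on an SDK-managed thread, not the submitting thread, so anything it
// touches (here, the session) must be safe to access from another thread.
void my_kem_callback(quac_job_t job, quac_job_status_t status,
                     void *ctx) {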
    my_session_t *session = (my_session_t *)ctx;

    if (status == QUAC_JOB_COMPLETED) {
        // Output buffer (desc.out) is now populated
        session->state = SESSION_KEY_READY;
        session_resume(session);
    } else if (status == QUAC_JOB_FAILED) {
        quac_status_t err = quac_async_get_error(job);
        log_error("KEM failed: %s", quac_strerror(err));
        session_abort(session);
    }
    quac_async_release(job);
}

// Submit with callback
quac_async_desc_t desc = {
    .op     = QUAC_OP_KEM_ENCAPS,
    .alg    = QUAC_KEM_ML_KEM_768,
    .key    = peer_public_key,
    .key_len= peer_pk_len,
    .out    = &session->encaps_result,
    .out_len= sizeof(session->encaps_result),
    .cb     = my_kem_callback,
    .cb_ctx = session,
    .priority = QUAC_PRIO_HIGH,
};
quac_async_submit(dev, &desc, &session->job);
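
Because callbacks run on SDK-managed threads, state they share with application threads needs synchronization. The sketch below shows one common approach using C11 atomics; completion_stats_t, stats_init, record_completion, completed_count and failed_count are hypothetical helpers, not part of the SDK.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Counters shared between callback threads and application threads. */
typedef struct {
    atomic_uint completed;  /* jobs that finished successfully */
    atomic_uint failed;     /* jobs that finished with an error */
} completion_stats_t;

void stats_init(completion_stats_t *stats) {
    assert(stats != NULL);
    atomic_init(&stats->completed, 0u);
    atomic_init(&stats->failed, 0u);
}

/* Intended to be called from a completion callback; safe from any thread. */
void record_completion(completion_stats_t *stats, bool ok) {
    assert(stats != NULL);
    if (ok) {
        atomic_fetch_add_explicit(&stats->completed, 1u, memory_order_release);
    } else {
        atomic_fetch_add_explicit(&stats->failed, 1u, memory_order_release);
    }
}

/* Read the counters from an application thread. */
unsigned completed_count(completion_stats_t *stats) {
    return atomic_load_explicit(&stats->completed, memory_order_acquire);
}

unsigned failed_count(completion_stats_t *stats) {
    return atomic_load_explicit(&stats->failed, memory_order_acquire);
}
```

A callback would pass a pointer to such a structure through cb_ctx and call record_completion() with the job outcome. Anything more complex than a counter, such as resuming a session state machine, typically calls for a mutex or a hand-off queue to the owning thread instead.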

Event File Descriptors

For integration with epoll, kqueue, or io_uring event loops, the SDK exposes a file descriptor that becomes readable when one or more jobs complete. This avoids the overhead of the callback thread pool and gives you complete control over completion processing.

// Get the completion event fd
int comp_fd = quac_async_get_event_fd(dev);

// Add to epoll
struct epoll_event ev = { .events = EPOLLIN, .data.fd = comp_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, comp_fd, &ev);

// In event loop
if (events[i].data.fd == comp_fd) {
    quac_job_t completed[64];
    size_t n_completed;
    quac_async_reap(dev, completed, 64, &n_completed);

    for (size_t j = 0; j < n_completed; j++) {
        quac_job_status_t st = quac_async_get_status(completed[j]);
        uint64_t tag = quac_async_get_tag(completed[j]);
        // Route completion by tag
        dispatch_completion(tag, st);
        quac_async_release(completed[j]);
    }
}
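
The reap loop above drains up to 64 completions per wakeup because a readable descriptor can stand for several events at once. The following Linux-only sketch demonstrates that behaviour with a plain eventfd(2) and epoll. It assumes the SDK's descriptor behaves like an eventfd counter, which this page does not confirm; wait_and_drain and eventfd_demo are names made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for efd to become readable, then drain it.
 * Returns the accumulated counter value, 0 on timeout, or -1 on error.
 */
int64_t wait_and_drain(int epfd, int efd, int timeout_ms) {
    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, timeout_ms);
    if (n < 0) return -1;
    if (n == 0) return 0;
    assert(out.data.fd == efd);

    uint64_t count = 0;
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) return -1;
    return (int64_t)count;
}

/* Signal three events, then check that a single wakeup reports all three. */
int64_t eventfd_demo(void) {
    int efd = eventfd(0, EFD_NONBLOCK);
    if (efd < 0) return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) { close(efd); return -1; }

    int64_t result = -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0) {
        int ok = 1;
        for (int i = 0; i < 3 && ok; i++) {
            uint64_t one = 1;  /* each write adds to the eventfd counter */
            ok = (write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one));
        }
        if (ok) {
            int64_t drained = wait_and_drain(epfd, efd, 100);
            int64_t after   = wait_and_drain(epfd, efd, 0);  /* nothing left */
            if (after == 0) result = drained;
        }
    }
    close(epfd);
    close(efd);
    return result;
}
```

Three signals arrive before the loop wakes, and one read reports all of them. An event loop that handled only one completion per wakeup would fall behind under load, so draining in a loop (or reaping in batches, as above) is the safer pattern.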

Callback Thread Configuration

The number of callback threads and their CPU affinity can be configured at device open time or dynamically at runtime:

// Configure at device open
quac_device_config_t cfg = QUAC_DEVICE_CONFIG_INIT;
cfg.async_callback_threads = 4;
cfg.async_callback_cpu_mask = 0xF0;  // CPUs 4-7
quac_device_t *dev = quac_open_ex(0, &cfg);

// Adjust at runtime
quac_async_set_callback_threads(dev, 8);
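
Hand-written hex masks such as 0xF0 are easy to get wrong. The helper below builds a mask from a CPU range. It assumes the mask uses bit i for CPU i (consistent with the "CPUs 4-7" comment above) and fits in 64 bits; cpu_range_mask is a hypothetical helper, not an SDK function.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with bits first_cpu..last_cpu (inclusive) set; bit i = CPU i. */
uint64_t cpu_range_mask(unsigned first_cpu, unsigned last_cpu) {
    assert(first_cpu <= last_cpu && last_cpu < 64);
    unsigned width = last_cpu - first_cpu + 1;
    /* Shifting a 64-bit value by 64 is undefined, so treat full width separately. */
    uint64_t ones = (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
    return ones << first_cpu;
}
```

For example, cpu_range_mask(4, 7) yields 0xF0, the value used in the configuration above.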

Batch Processing API

The batch API extends the asynchronous model by submitting arrays of related operations in a single call. This amortizes per-submission overhead, enables the SDK to optimize DMA scatter-gather lists, and allows the hardware scheduler to parallelize across internal execution units more effectively.

// Batch ML-KEM encapsulations for 1000 sessions
#define BATCH_SIZE 1000

quac_batch_desc_t descs[BATCH_SIZE];
quac_kem_encaps_result_t results[BATCH_SIZE];

for (int i = 0; i < BATCH_SIZE; i++) {
    descs[i] = (quac_batch_desc_t){
        .op      = QUAC_OP_KEM_ENCAPS,
        .alg     = QUAC_KEM_ML_KEM_768,
        .key     = peer_keys[i].pk,
        .key_len = peer_keys[i].pk_len,
        .out     = &results[i],
        .out_len = sizeof(results[i]),
        .tag     = i,
    };
}

// Submit entire batch at once
quac_batch_t batch;
quac_status_t rc = quac_batch_submit(dev, descs, BATCH_SIZE, &batch);
if (rc != QUAC_OK) {
    handle_error(rc);
    return;  // No batch handle to wait on or release
}

// Wait for all to complete
rc = quac_batch_wait(batch, 10000);  // 10s timeout (ms)
if (rc != QUAC_OK) { handle_error(rc); }

// Check individual results
for (int i = 0; i < BATCH_SIZE; i++) {
    quac_job_status_t st = quac_batch_get_status(batch, i);
    if (st != QUAC_JOB_COMPLETED) {
        log_error("Job %d failed: %d", i, st);
    }
}

quac_batch_release(batch);
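
When a workload is larger than the batch size you want to submit at once (the Tuning Checklist suggests 32–256 for homogeneous operations), it can be split into fixed-size chunks. The arithmetic is sketched below; chunk_t, chunk_count and chunk_at are hypothetical helpers, not SDK functions.

```c
#include <assert.h>
#include <stddef.h>

/* A contiguous slice of a larger descriptor array. */
typedef struct {
    size_t offset;  /* index of the first item in the chunk */
    size_t len;     /* number of items in the chunk */
} chunk_t;

/* Number of chunks needed to cover `total` items, at most `chunk_size` each. */
size_t chunk_count(size_t total, size_t chunk_size) {
    assert(chunk_size > 0);
    return total / chunk_size + (total % chunk_size != 0);
}

/* Describe chunk number `index` (0-based); the last chunk may be shorter. */
chunk_t chunk_at(size_t total, size_t chunk_size, size_t index) {
    assert(index < chunk_count(total, chunk_size));
    chunk_t c;
    c.offset = index * chunk_size;
    c.len = (total - c.offset < chunk_size) ? total - c.offset : chunk_size;
    return c;
}
```

For 1000 descriptors and a chunk size of 256, this gives four chunks: three of 256 and a final one of 232 starting at index 768. Each chunk could then be passed to a batch submission call as descs + offset with a count of len.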

Batch Options

Option	Default	Description
`QUAC_BATCH_ORDERED`	Off	Guarantee results are populated in submission order
`QUAC_BATCH_ATOMIC`	Off	All-or-nothing: if any operation fails, cancel remaining
`QUAC_BATCH_PROGRESS`	Off	Enable progress callbacks (fires every N completions)
`QUAC_BATCH_COALESCE`	On	Coalesce DMA transfers for same-algorithm operations
`QUAC_BATCH_MAX_INFLIGHT`	256	Maximum concurrent hardware submissions from this batch

// Atomic batch with progress reporting
quac_batch_opts_t opts = {
    .flags = QUAC_BATCH_ATOMIC | QUAC_BATCH_PROGRESS,
    .progress_interval = 100,  // Callback every 100 completions
    .progress_cb = my_progress_cb,
    .progress_ctx = &my_progress_state,
};
quac_batch_submit_ex(dev, descs, BATCH_SIZE, &opts, &batch);

Heterogeneous Batches

Batches can mix different operation types and algorithms. The scheduler automatically routes operations to the appropriate hardware execution units. Heterogeneous batches are useful when processing complete protocol flows, for example performing a KEM encapsulation and a digital signature for a single TLS handshake:

quac_batch_desc_t handshake_ops[2] = {
    { .op = QUAC_OP_KEM_ENCAPS,  .alg = QUAC_KEM_ML_KEM_768,  ... },
    { .op = QUAC_OP_SIGN,        .alg = QUAC_SIG_ML_DSA_65,   ... },
};
quac_batch_submit(dev, handshake_ops, 2, &batch);

Pipeline Architecture

Understanding the hardware pipeline is essential for maximizing throughput. The QUAC 100 processes operations through a five-stage pipeline, with each stage operating concurrently on different operations:

Stage	Duration	Description
1. Command Fetch	~100 ns	DMA engine reads job descriptor from host command ring
2. Input Transfer	~200 ns	Input data (keys, messages) DMA'd to on-chip HBM
3. Compute	~500 ns	Cryptographic operation executes on NTT/hash/sampler units
4. Output Transfer	~200 ns	Results DMA'd back to host memory
5. Completion Post	~50 ns	Completion descriptor written to host completion ring

With N operations in flight, stages overlap. The theoretical maximum throughput is limited by the longest stage (compute at ~500 ns), yielding ~2M ops/s. Practical throughput reaches 1.4M ops/s due to PCIe bus contention, host-side scheduling, and varying operation sizes.
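
The bottleneck arithmetic can be checked with a few lines of C. This sketch treats each stage as a single unit handling one operation at a time, which is a simplification; the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stage durations in nanoseconds, taken from the table above. */
static const uint32_t stage_ns[5] = { 100, 200, 500, 200, 50 };

/* Latency of one operation passing through every stage in sequence. */
uint32_t pipeline_latency_ns(const uint32_t *stages, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += stages[i];
    }
    return sum;
}

/* Steady-state bound: one result per slowest-stage interval. */
uint64_t pipeline_max_ops_per_sec(const uint32_t *stages, size_t n) {
    assert(n > 0);
    uint32_t slowest = 0;
    for (size_t i = 0; i < n; i++) {
        if (stages[i] > slowest) {
            slowest = stages[i];
        }
    }
    assert(slowest > 0);
    return UINT64_C(1000000000) / slowest;
}
```

With the table's figures, one operation takes 1050 ns end to end, so running operations strictly one after another would cap throughput at roughly 950K ops/s. Overlapping the stages raises the bound to 1e9 / 500 = 2,000,000 ops/s, the ~2M figure quoted above.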

To keep the pipeline full, the SDK maintains an in-flight window: the number of operations that have been submitted to hardware but not yet completed. The optimal window size depends on operation type and system configuration:

// Query recommended in-flight window
size_t optimal_window;
quac_async_get_optimal_window(dev, QUAC_OP_KEM_ENCAPS,
                               QUAC_KEM_ML_KEM_768, &optimal_window);
// Typically returns 32–128 depending on operation type

// Set maximum in-flight depth
quac_async_set_max_inflight(dev, 256);

DMA & Zero-Copy

For maximum throughput, the SDK supports zero-copy DMA using pinned (page-locked) host memory. When input and output buffers are allocated through the SDK's DMA allocator, the driver can program scatter-gather DMA directly to user-space memory, eliminating kernel buffer copies.

// Allocate DMA-capable buffer pool
quac_dma_pool_t *pool = quac_dma_pool_create(dev,
    1024 * 1024,    // 1 MB total pool size
    4096,           // 4 KB alignment (page-aligned for DMA)
    QUAC_DMA_PINNED // Pin pages to prevent swapping
);

// Allocate buffers from pool
void *input_buf  = quac_dma_alloc(pool, input_size);
void *output_buf = quac_dma_alloc(pool, output_size);

// Use in async operations: the SDK detects DMA-capable buffers
// and programs zero-copy DMA automatically
quac_async_desc_t desc = {
    .op      = QUAC_OP_KEM_ENCAPS,
    .alg     = QUAC_KEM_ML_KEM_768,
    .key     = pk_buf,         // Regular memory: will be copied
    .key_len = pk_len,
    .out     = output_buf,     // DMA pool memory: zero-copy
    .out_len = output_size,
};

// Cleanup
quac_dma_free(pool, input_buf);
quac_dma_free(pool, output_buf);
quac_dma_pool_destroy(pool);

IOMMU and NUMA Considerations

On systems with IOMMUs (Intel VT-d, AMD-Vi), the driver programs IOMMU page tables to map DMA-capable user buffers into the device's I/O address space. For NUMA systems, allocating DMA pools on the same NUMA node as the QUAC 100's PCIe root complex reduces cross-node memory access latency:

// Query the NUMA node of the device
int numa_node;
quac_get_device_numa_node(dev, &numa_node);

// Create NUMA-aware DMA pool
quac_dma_pool_t *pool = quac_dma_pool_create_ex(dev, &(quac_dma_pool_config_t){
    .size      = 4 * 1024 * 1024,
    .alignment = 4096,
    .flags     = QUAC_DMA_PINNED | QUAC_DMA_NUMA_LOCAL,
    .numa_node = numa_node,
});

Huge Pages

For very large DMA pools (100+ MB), using huge pages (2 MB or 1 GB) reduces TLB pressure and IOMMU page table overhead. The SDK automatically detects hugetlbfs availability:

// Enable huge page backing for DMA pool
quac_dma_pool_t *pool = quac_dma_pool_create_ex(dev, &(quac_dma_pool_config_t){
    .size      = 128 * 1024 * 1024,  // 128 MB
    .alignment = 2 * 1024 * 1024,    // 2 MB huge pages
    .flags     = QUAC_DMA_PINNED | QUAC_DMA_HUGEPAGES,
});
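
The benefit of huge pages comes down to how many page mappings the pool needs. The sketch below does the arithmetic; pages_needed is a hypothetical helper and the page sizes are the common x86-64 values.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages required to back pool_bytes, rounding up. */
uint64_t pages_needed(uint64_t pool_bytes, uint64_t page_bytes) {
    assert(page_bytes > 0);
    return pool_bytes / page_bytes + (pool_bytes % page_bytes != 0);
}
```

For the 128 MB pool above, 4 KB pages require 32,768 mappings, 2 MB huge pages require 64, and a single 1 GB page would cover the whole pool. Fewer mappings means fewer TLB and IOMMU page table entries to populate and look up.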

Queue Management

The QUAC 100 supports up to 16 independent hardware command/completion ring pairs, called queue pairs. Each queue pair can be assigned to a different thread or CPU core, eliminating lock contention on the submission path.

// Query available queue pairs
size_t n_queues;
quac_get_queue_count(dev, &n_queues);

// Bind current thread to a specific queue
quac_queue_bind(dev, 3);  // Use queue pair 3 for this thread

// Or let the SDK auto-assign per thread
quac_queue_bind(dev, QUAC_QUEUE_AUTO);

Queue Depth and Backpressure

Each queue pair has a configurable depth (number of outstanding operations). When a queue is full, quac_async_submit() returns QUAC_ERR_QUEUE_FULL. The SDK provides several strategies for handling backpressure:

// Strategy 1: Blocking wait for queue space
quac_async_set_submit_mode(dev, QUAC_SUBMIT_BLOCKING);

// Strategy 2: Non-blocking with error return (default)
quac_async_set_submit_mode(dev, QUAC_SUBMIT_NONBLOCKING);

// Strategy 3: Adaptive (block briefly, then return error)
quac_async_set_submit_mode_ex(dev, &(quac_submit_config_t){
    .mode          = QUAC_SUBMIT_ADAPTIVE,
    .spin_us       = 10,     // Spin for up to 10 μs
    .backoff_us    = 100,    // Then sleep for 100 μs
    .max_retries   = 3,      // Retry up to 3 times
});
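
If you prefer to keep the default non-blocking mode and handle a full queue in application code, a spin-then-back-off loop similar in spirit to the adaptive strategy can be written by hand. The sketch below is not the SDK's implementation: it counts immediate attempts instead of measuring spin time, and every name in it (retry_policy_t, submit_with_backoff, the fake_* test doubles) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int  (*try_submit_fn)(void *ctx);  /* 0 on success, -1 if queue full */
typedef void (*pause_fn)(unsigned micros, void *ctx);

typedef struct {
    unsigned spin_attempts;  /* immediate attempts before backing off */
    unsigned backoff_us;     /* pause length between later attempts */
    unsigned max_retries;    /* attempts made after the spin phase */
} retry_policy_t;

/* Returns 0 once a submission succeeds, -1 if every attempt found the queue full. */
int submit_with_backoff(try_submit_fn try_submit, pause_fn pause, void *ctx,
                        const retry_policy_t *policy) {
    assert(try_submit != NULL && pause != NULL && policy != NULL);
    /* Phase 1: retry immediately, hoping a slot frees up within microseconds. */
    for (unsigned i = 0; i < policy->spin_attempts; i++) {
        if (try_submit(ctx) == 0) return 0;
    }
    /* Phase 2: give up the CPU between attempts. */
    for (unsigned r = 0; r < policy->max_retries; r++) {
        pause(policy->backoff_us, ctx);
        if (try_submit(ctx) == 0) return 0;
    }
    return -1;
}

/* Test double: reports "full" for the first full_for attempts. */
typedef struct {
    unsigned full_for;
    unsigned attempts;
    unsigned pauses;
} fake_queue_t;

int fake_try_submit(void *ctx) {
    fake_queue_t *q = ctx;
    q->attempts++;
    return (q->attempts > q->full_for) ? 0 : -1;
}

void fake_pause(unsigned micros, void *ctx) {
    (void)micros;  /* a real implementation would sleep here */
    ((fake_queue_t *)ctx)->pauses++;
}
```

In real use, try_submit would wrap the submission call and map a queue-full status to -1, and pause would sleep for the requested time. Bounding the retries matters: if the queue stays full, the caller gets an error it can surface as backpressure instead of stalling indefinitely.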

Queue Statistics

quac_queue_stats_t stats;
quac_get_queue_stats(dev, 0, &stats);
printf("Queue 0: depth=%zu inflight=%zu submitted=%llu completed=%llu\n",
       stats.depth, stats.inflight,
       (unsigned long long)stats.total_submitted,
       (unsigned long long)stats.total_completed);
printf("  avg_latency=%.1f μs  p99_latency=%.1f μs\n",
       stats.avg_latency_us, stats.p99_latency_us);

Throughput Tuning

Achieving peak throughput requires careful tuning of several parameters. The following guidelines are based on benchmarks with ML-KEM-768 operations on an AMD EPYC 7763 system with PCIe Gen4 x16.

Tuning Checklist

Parameter	Recommendation	Impact
Queue pairs	1 per submitting thread, min 4	Eliminates submission lock contention
In-flight depth	64–128 per queue	Keeps hardware pipeline saturated
DMA pools	Use pinned, NUMA-local memory	10–30% throughput improvement
Batch size	32–256 for homogeneous ops	Amortizes per-submission overhead
Completion model	Event FD with epoll for servers	Lowest CPU overhead
CPU affinity	Pin threads to NUMA-local cores	Reduces cache thrashing
Interrupt coalescing	16–64 completions or 10 μs	Reduces interrupt overhead
PCIe MPS	Set to 256B via BIOS	Matches typical descriptor size

Interrupt Coalescing

By default, the hardware generates one interrupt per completion. For high-throughput workloads, coalescing multiple completions into a single interrupt reduces CPU overhead significantly:

// Coalesce: interrupt after 32 completions OR 10 μs, whichever comes first
quac_set_interrupt_coalescing(dev, &(quac_irq_coalesce_t){
    .count_threshold = 32,
    .timer_us        = 10,
});
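
A rough estimate of the resulting interrupt rate helps when choosing these two values. The sketch below assumes a steady completion rate and a simplified model in which an interrupt fires as soon as either threshold is reached; the hardware's exact timer semantics are not described here, so read the result as an upper-bound estimate. est_interrupts_per_sec is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate interrupts per second for one queue under count/timer coalescing.
 * Whichever threshold is reached first triggers the interrupt, so the rate is
 * the larger of the two candidate rates, and never more than one interrupt
 * per completion.
 */
uint64_t est_interrupts_per_sec(uint64_t completions_per_sec,
                                uint32_t count_threshold, uint32_t timer_us) {
    assert(count_threshold > 0 && timer_us > 0);
    uint64_t by_count = completions_per_sec / count_threshold;
    uint64_t by_timer = UINT64_C(1000000) / timer_us;
    uint64_t rate = (by_count > by_timer) ? by_count : by_timer;
    return (rate < completions_per_sec) ? rate : completions_per_sec;
}
```

As an example, take a queue completing 350K operations per second (an assumed figure, about 1.4M ops/s spread over 4 queues). With a count threshold of 32 and a 10 μs timer, 32 completions take about 91 μs to accumulate, so the timer fires first and the estimate is about 100,000 interrupts per second instead of 350,000 without coalescing. Under this model the count threshold only takes over once a queue exceeds 3.2M completions per second.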

Adaptive Polling

For ultra-low latency workloads where interrupt coalescing adds unacceptable delay, the SDK supports busy-poll mode on the completion ring. This trades CPU cycles for sub-microsecond completion latency:

// Enable busy-poll mode on queue 0
quac_async_set_completion_mode(dev, 0, QUAC_COMP_BUSYPOLL);

// Hybrid: busy-poll for 5 μs, then fall back to interrupt
quac_async_set_completion_mode_ex(dev, 0, &(quac_comp_config_t){
    .mode           = QUAC_COMP_HYBRID,
    .busypoll_us    = 5,
});

Performance Benchmarks

The following benchmarks were measured on an AMD EPYC 7763 (64 cores) with QUAC 100 Rev B installed in a PCIe Gen4 x16 slot, using 4 queue pairs and 128-depth in-flight windows with NUMA-local DMA pools.

Operation	Sync (1 thread)	Async (4 threads)	Batch (4×256)	Peak (16 threads)
ML-KEM-768 Keygen	198K ops/s	762K ops/s	1.12M ops/s	1.41M ops/s
ML-KEM-768 Encaps	185K ops/s	714K ops/s	1.05M ops/s	1.38M ops/s
ML-KEM-768 Decaps	180K ops/s	698K ops/s	1.01M ops/s	1.35M ops/s
ML-DSA-65 Sign	145K ops/s	548K ops/s	820K ops/s	1.08M ops/s
ML-DSA-65 Verify	210K ops/s	802K ops/s	1.18M ops/s	1.45M ops/s
SLH-DSA-128s Sign	42K ops/s	162K ops/s	245K ops/s	318K ops/s
QRNG (256-bit)	890K ops/s	2.1M ops/s	3.4M ops/s	4.2M ops/s

The SDK ships with a built-in benchmarking tool that produces results calibrated to your specific hardware and system configuration:

$ quac-bench --alg ml-kem-768 --mode async --threads 4 --duration 30
Algorithm:    ML-KEM-768
Mode:         Async (4 threads, 128 inflight/queue)
Duration:     30.0 seconds
Operations:   21,384,000
Throughput:   712,800 ops/sec
Avg Latency:  0.89 μs
P50 Latency:  0.82 μs
P99 Latency:  1.47 μs
P99.9 Latency: 3.21 μs
CPU Usage:    12.4% (4 cores)

Next Steps

With asynchronous and batch operations enabling maximum hardware utilization, explore the Security & Compliance page to understand FIPS 140-3 operational requirements, key lifecycle management, and audit logging configuration. For a complete function-level reference of all async and batch APIs, see the API Reference.
