Achieving deterministic LLM inference has recently attracted serious engineering attention. Three notable efforts:

  • EigenAI's deterministic inference stack for GPUs
  • Thinking Machines' work on batch-invariant kernels
  • Groq's LPU, which makes determinism a hardware guarantee

All three tackle the problem from different angles and arrive at overlapping but distinct insights. This post covers two things:

  • Why would you want to make inference deterministic?
  • A synthesis of the three approaches, aiming at an approachable yet complete picture of what technically causes nondeterminism and how to eliminate it.

Why is deterministic inference useful?

The use cases below are inspired by vLLM’s documentation of its deterministic mode (via batch invariance):

  • Reinforcement Learning (RL): Deterministic inference removes the randomness from an agent’s decision-making process, ensuring that identical environmental states always result in identical actions. This consistency is essential for reproducibility because it allows researchers to perfectly replicate scenarios to isolate bugs or verify hyperparameter changes, which is impossible if the agent’s behavior shifts unpredictably between runs. It also promotes stable training and accurate evaluation by reducing the variance in collected trajectories, providing a clear, noise-free signal of the policy’s true quality rather than outcomes skewed by lucky or unlucky random sampling.

  • Framework debugging: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.

  • Model debugging: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.

  • Large-scale inference systems: Systems that use an inference engine that supports deterministic inference as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.

  • Auditability and Compliance Requirements: Enterprises often need audit trails for compliance, and deterministic inference makes those trails verifiable: any recorded output can be reproduced exactly on demand.

How does deterministic inference work?

The Problem: Why Can't We Just Re-Run the Model?

Here's something that might surprise you: if you send the exact same prompt to the exact same large language model twice, you might get different outputs. This happens even with the temperature set to zero and it's a fundamental property of how modern GPUs execute neural network computations. It's not a bug in the code or a quirk of the API.

Understanding why this happens requires us to dig into the guts of GPU computing. The culprits are:

  1. floating-point non-associativity and
  2. nondeterministic execution ordering.

Floating-Point Math Isn't What You Think

In school, we learned that addition is associative: (a + b) + c = a + (b + c). This is true for real numbers, but it's not true for floating-point numbers on a computer. Due to rounding errors, the order in which you add numbers changes the final result.

Consider a simple example. When you sum three floating-point values, the intermediate rounding differs based on grouping (in IEEE 754 double precision):

(0.1 + 0.2) + 0.3  →  0.6000000000000001
0.1 + (0.2 + 0.3)  →  0.6

These differences are tiny but they compound. In a single transformer forward pass, you're performing billions of such operations. A minuscule difference early in the computation can cascade, ultimately changing the probability distribution from which the next token gets sampled.
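This is easy to verify in pure Python (IEEE 754 doubles, no GPU required):

```python
# Floating-point addition is not associative: grouping changes rounding.
left = (0.1 + 0.2) + 0.3   # 0.1 + 0.2 rounds to 0.30000000000000004
right = 0.1 + (0.2 + 0.3)  # 0.2 + 0.3 rounds to exactly 0.5

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False

# The absolute difference is ~1e-16, but in a transformer forward pass
# billions of such discrepancies can cascade into a different token.
print(abs(left - right))
```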

GPUs Make This Worse

Modern GPUs are massively parallel processors. When you perform a matrix multiplication or a reduction (like summing a vector), the work gets split across thousands of threads. These threads race to complete their partial computations, and the order in which their results get combined depends on how the computation is structured.

This nondeterminism exists at multiple layers of the stack:

  • Hardware layer: Different GPU architectures (A100 vs. H100) implement fused multiply-add operations and rounding modes with subtle architectural distinctions. Even "identical" operations can produce different bit patterns across hardware generations.

  • Library layer: Core libraries like cuBLAS (CUDA Basic Linear Algebra Subprograms) and cuDNN (CUDA Deep Neural Network) may use different kernel implementations depending on input shapes. Their "fast math" modes trade reproducibility for throughput.

  • Framework layer: Deep learning frameworks apply dynamic optimizations: graph fusion, asynchronous kernel launches, automatic mixed precision, among others. These all introduce additional variability.

  • Decoding layer: Token sampling from probability distributions involves random number generation. Even with fixed seeds, the iteration order through logits can vary.

The Atomic Add Hypothesis: Does it hold up?

A common explanation for GPU nondeterminism is what the Thinking Machines team calls the "concurrency + floating point" hypothesis: if you use atomic adds to accumulate results from parallel threads, the order in which threads finish is nondeterministic, so the accumulation order varies between runs.

While atomic adds can cause nondeterminism, here's the surprising part: the forward pass of a typical LLM contains no atomic adds at all. The Thinking Machines blog demonstrates this with a simple test:

import torch

A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
ref = torch.mm(A, B)
for _ in range(1000):
    assert (torch.mm(A, B) - ref).abs().max().item() == 0

This test passes. Matrix multiplication is deterministic on a single GPU with the same inputs. If atomic adds were the culprit, we'd expect different results each time. So what's really going on? We'll cover that in the Thinking Machines section below.


High-level Comparison of the Approaches

Before we dive deep into each approach, the table below shows a high-level overview of how they compare:

Aspect                    | EigenAI (GPU)               | Thinking Machines (GPU)       | Groq (LPU)
Determinism source        | Software constraints        | Kernel design                 | Hardware architecture
Execution scheduling      | Runtime (constrained)       | Runtime (constrained)         | Compile-time (static)
Memory architecture       | HBM + caches                | HBM + caches                  | SRAM only
Cross-run reproducibility | Achieved via pinning        | Achieved via batch invariance | Guaranteed by design
Performance overhead      | ~2%                         | ~40-60%                       | None (baseline)
Hardware flexibility      | Any compatible GPU          | Any compatible GPU            | Groq LPUs only
Multi-chip coordination   | Nondeterministic networking | Nondeterministic networking   | Plesiosynchronous protocol

The fundamental difference is where determinism lives in the stack. GPU-based approaches achieve determinism by constraining a fundamentally nondeterministic system; either by pinning the environment (EigenAI) or by designing kernels that produce identical results despite varying conditions like batch sizes (Thinking Machines). Groq achieves determinism by building a system where nondeterminism never enters in the first place, as the software schedule is always known ahead of time.

EigenAI's Approach: Lock Everything Down

EigenAI treats determinism as an infrastructure problem. The philosophy: if every variable that could affect computation is fixed and identical between runs, the outputs must be identical. This is achieved through exhaustive environment pinning, deterministic library modes, and, where existing kernels cannot be inspected, custom replacements.

The approach treats inference as a pure mathematical function:

F: (model, architecture, prompt, seed, decode_policy) → output

When all these inputs are fixed, the output must be bit-identical across any number of re-executions.


Principle 1: Single-Architecture Enforcement

The first and most important rule is simple: pick one GPU architecture and stick with it.

Cross-architecture reproducibility is essentially impossible because different GPU generations implement floating-point operations differently at the silicon level. EigenAI's experiments report that A100 and H100 produce outputs that differ by approximately 10⁻⁷ in logit values - small, but enough to change token selections.

EigenAI enforces what it calls a "single-architecture policy": every execution environment that needs to produce matching outputs must use identical GPU SKUs. An H100-based system only ever compares results with other H100 systems.

Additionally, EigenAI enables GPU persistence mode: a setting that keeps the GPU driver loaded in memory even when no applications are actively using the GPU. Without this, the GPU driver can reset state between workloads, potentially altering kernel execution order on subsequent runs.


Principle 2: Pinned Software Stack

Every piece of software that touches the computation must be version-locked and immutable. This means:

  • Container digests, not tags: Container images are referenced by their content hash (digest), never by mutable tags like latest or even version numbers. This guarantees that every execution uses exactly the same binaries.

  • Pinned CUDA and driver versions: The recommended configuration pins fixed combinations such as "CUDA 12.4 with the R550 driver". Different driver versions can alter kernel scheduling and autotuner behavior.

  • Deterministic library modes: cuBLAS and cuBLASLt provide configuration flags that force reproducible execution paths:

// Force deterministic behavior - no nondeterministic atomics,
// no non-associative reduced-precision reductions
cublasSetMathMode(handle, (cublasMath_t)(CUBLAS_DEFAULT_MATH |
    CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION));
cublasLtMatmulPreferenceSetAttribute(pref,
    CUBLASLT_MATMUL_PREF_MATH_MODE_MASK,
    &deterministicMode, sizeof(deterministicMode));

  • Disabled autotuners: Many frameworks automatically benchmark different kernel implementations and cache the fastest one. This introduces nondeterminism because the "winner" can vary between runs. EigenAI disables all such autotuning.
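For comparison, PyTorch exposes analogous switches. A sketch of the equivalent framework-level configuration (these are standard PyTorch APIs, not EigenAI's actual code):

```python
import os
import torch

# Required for deterministic cuBLAS workspaces on CUDA >= 10.2.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Disable cuDNN's autotuner, which benchmarks kernels at runtime and can
# pick a different "winner" (and thus accumulation order) on each run.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Error out on any op that has no deterministic implementation.
torch.use_deterministic_algorithms(True)

# Disable TF32, which trades reproducible fp32 numerics for throughput.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```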

Principle 3: Deterministic Kernels with Fixed Reduction Order

EigenAI uses a small set of custom GPU kernels designed for reproducibility. They build on llama.cpp, an open-source C/CUDA implementation with a minimal and auditable surface area.

These kernels satisfy three critical invariants:

  • Fixed block-thread mapping: Each thread block is deterministically assigned to specific output tiles. There's no inter-block communication that could introduce ordering dependencies.

  • Warp-synchronous reductions: Instead of allowing threads to combine results in arbitrary order, reductions follow a fixed binary-tree pattern using warp-level primitives:

// Canonical reduction tree - same order every time
for (int offset = warpSize/2; offset > 0; offset /= 2) {
    sum += __shfl_down_sync(0xffffffff, sum, offset);
}

The __shfl_down_sync intrinsic performs a warp-wide shuffle operation that is guaranteed to execute in the same order on every run. All 32 threads in a warp participate in a synchronized binary-tree reduction where thread 0 combines with thread 16, thread 1 with thread 17, and so on, halving the active threads at each step.

This creates a "canonical reduction order": the same accumulation sequence every time, producing identical rounding behavior.

  • No floating-point atomics: Atomic operations are banned entirely. All accumulations use explicit ordering through registers or shared memory.

Despite these constraints, EigenAI's deterministic kernels achieve 95-98% of standard cuBLAS throughput on Hopper-class hardware.
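The effect of a canonical reduction order can be illustrated in pure Python (a sketch of the idea, not the CUDA kernel itself):

```python
import random

def tree_reduce(vals):
    """Sum with a fixed binary-tree pairing: identical rounding on every call."""
    vals = list(vals)
    while len(vals) > 1:
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # odd leftover element carries over unchanged
            paired.append(vals[-1])
        vals = paired
    return vals[0]

random.seed(0)
data = [random.uniform(-1e6, 1e6) for _ in range(1024)]

ref = tree_reduce(data)
# Re-running the same tree gives bit-identical results...
assert all(tree_reduce(data) == ref for _ in range(100))
# ...whereas a left-to-right fold is a *different* (also deterministic)
# rounding sequence, and may disagree with the tree in the last bits.
print(ref, sum(data))
```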

Principle 4: Deterministic Decoding

Even after computing deterministic logits, token generation can introduce variability. Sampling methods like top-k or nucleus sampling draw from probability distributions, which requires random number generation.

EigenAI enforces deterministic decoding through two mechanisms:

  • Fixed-seed PRNG: The pseudorandom number generator is initialized with a known seed that's recorded as part of the inference request. Given the same seed, the same sequence of random numbers is generated.

  • Canonical iteration order: When iterating through logits to apply sampling, the order must be fixed. Different iteration orders can produce different results due to floating-point rounding in probability normalization.

The decode policy and seed become explicit parameters of the inference function. Users who want behavioral variety can simply vary the seed, but they can always verify that any specific output matches its declared seed.
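A minimal sketch of seed-pinned, canonically-ordered sampling (illustrative only; the function names are made up, not EigenAI's API):

```python
import math
import random

def sample_top_k(logits, k, seed):
    """Deterministic top-k sampling: fixed-seed PRNG + canonical iteration order."""
    rng = random.Random(seed)  # fixed-seed PRNG, recorded with the request
    # Canonical order: sort indices by (-logit, index) so ties break identically.
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    # Normalize in that fixed order so rounding is reproducible.
    weights = [math.exp(logits[i]) for i in top]
    total = sum(weights)
    r = rng.random()
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w / total
        if r < acc:
            return idx
    return top[-1]

logits = [2.0, 1.0, 0.5, 3.0, -1.0]
picks = {sample_top_k(logits, k=3, seed=42) for _ in range(100)}
assert len(picks) == 1  # same seed and logits → same token, every time
```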


Principle 5: Quantized Matrix Multiplication

llama.cpp's quantized matrix-multiplication kernels (Q4, Q5, etc.) are also "inherently deterministic." This is because quantization constrains the numerical representation enough that the operations become more predictable.

Quantized inference uses integer or low-bit representations for weights, with floating-point only at specific stages. The reduced numerical surface area means fewer opportunities for rounding divergence, and the custom kernels are written with determinism as a first-class requirement.
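One reason quantized kernels are easier to make deterministic: integer addition, unlike floating-point addition, is associative, so the accumulation order of the int8 products doesn't matter. A pure-Python sketch (simplified per-tensor symmetric quantization, not llama.cpp's actual Q4/Q5 formats):

```python
def quantize(xs, scale=127.0):
    """Symmetric int8 quantization with a single per-tensor scale."""
    m = max(abs(x) for x in xs)
    q = [round(x / m * scale) for x in xs]
    return q, m / scale

def int8_dot(qa, sa, qb, sb):
    # Integer accumulation is exact and associative: any summation order
    # yields the same integer, hence the same final float after rescaling.
    acc = sum(x * y for x, y in zip(qa, qb))
    return acc * sa * sb

a = [0.12, -0.5, 0.33, 0.9]
b = [1.1, 0.4, -0.2, 0.05]
qa, sa = quantize(a)
qb, sb = quantize(b)

fwd = int8_dot(qa, sa, qb, sb)
rev = int8_dot(list(reversed(qa)), sa, list(reversed(qb)), sb)
assert fwd == rev  # order-invariant, unlike float accumulation
```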


Empirical Validation

The EigenAI team validated their approach with systematic experiments on NVIDIA Hopper GPUs. Their test suite processed 1,000 prompts spanning summarization, reasoning, and code generation tasks, recording cryptographic hashes of the full output:

SHA256(prompt || logits || tokens)
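Such a transcript hash is straightforward to compute with Python's hashlib (an illustration of the scheme; the serialization details are assumptions, not EigenAI's exact format):

```python
import hashlib

def output_hash(prompt, logits, tokens):
    """SHA256 over a canonical serialization of prompt || logits || tokens."""
    h = hashlib.sha256()
    h.update(prompt.encode("utf-8"))
    for x in logits:
        h.update(x.hex().encode("ascii"))  # float.hex() is bit-faithful
    for t in tokens:
        h.update(t.to_bytes(4, "little"))
    return h.hexdigest()

d1 = output_hash("hello", [0.125, -3.5], [42, 7])
d2 = output_hash("hello", [0.125, -3.5], [42, 7])
assert d1 == d2  # bitwise-identical runs → equal hashes
assert d1 != output_hash("hello", [0.125, -3.5000001], [42, 7])
```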

The results demonstrate that their engineering discipline works:

Test Condition                          | Match Rate | Notes
Same host, same GPU                     | 100.0%     | Bitwise identical
Same host, different GPU count          | 100.0%     | Bitwise identical
Different hosts, same GPU SKU           | 100.0%     | Bitwise identical
Different architecture (A100 vs. H100)  | 0.0%       | Expected: different rounding

Across 10,000 runs, all hashes matched exactly when the architecture was held constant. No bit-level divergence was observed.

The team also tested robustness under operational stress by co-scheduling background GPU workloads that induced synthetic jitter and scheduling variability. Despite this perturbation, all runs still produced identical outputs, confirming that their deterministic kernel design effectively isolates inference from transient runtime effects.

Performance Overhead

A common concern is that determinism comes at a steep performance cost. The measurements suggest otherwise:

Kernel Type                   | Relative Throughput | Overhead
cuBLAS (baseline)             | 1.00×               | -
Deterministic GEMM            | 0.97×               | +2.4%
Deterministic mixed-precision | 0.95×               | +4.1%
End-to-end LLM inference      | 0.98×               | +1.8%

The end-to-end overhead of roughly 2% is remarkably small. The key insight is that modern GPUs have enough parallelism that carefully designed deterministic algorithms can still saturate the hardware.


Thinking Machines' Approach: Batch Invariance

The Thinking Machines analysis reveals a subtler problem: batch invariance. Even though individual kernels are run-to-run deterministic (same inputs → same outputs), they're not batch-invariant. The output for a given input can change depending on how many other inputs are processed in the same batch.

Here's their demonstration:

import torch

B = 2048
D = 4096
a = torch.linspace(-1000, 1000, B * D, device='cuda').reshape(B, D)
b = torch.linspace(-1000, 1000, D * D, device='cuda').reshape(D, D)

# Matrix-vector multiplication (batch size 1)
out1 = torch.mm(a[:1], b)

# Matrix-matrix multiplication, then take the first row
out2 = torch.mm(a, b)[:1]

print((out1 - out2).abs().max())  # tensor(1669.2500, device='cuda:0')

The same mathematical operation on the same data produces different results depending on batch size! This is because different batch sizes trigger different kernel implementations, tile sizes, or reduction strategies - each with different accumulation orders.

In a production inference server, the batch size varies constantly based on concurrent load. From any individual user's perspective, the "other requests being processed simultaneously" is effectively random. This randomness in batch size translates to randomness in outputs, even though the kernels themselves are technically deterministic.

The Thinking Machines team tested this with Qwen-3-235B, generating 1000 completions of "Tell me about Richard Feynman" at temperature 0. They observed 80 unique completions, with divergence first appearing at the 103rd token. Some outputs said "Queens, New York" while others said "New York City"—a meaningful semantic difference caused purely by numerical variance.

This reframes the problem. Instead of locking down the entire environment (which may not be practical for a shared inference service), you design kernels that produce identical outputs regardless of concurrent load.

Why Kernels Lack Batch Invariance

The Thinking Machines analysis identifies three key operations that need batch-invariant implementations: RMSNorm, matrix multiplication, and attention. Each presents different challenges.

  • RMSNorm: The standard "data-parallel" approach assigns each input row to a separate GPU core. This works well when batch sizes are large enough to saturate all cores. But when batch sizes are small, kernels may switch to "split-reduction" strategies that parallelize within each row. This changes the accumulation order.

  • Matrix Multiplication: Similarly, matmuls typically tile the output into 2D chunks assigned to different cores. Small batch sizes can trigger "Split-K" strategies that parallelize along the reduction dimension, changing the accumulation order.

  • Attention: This is the trickiest case. The FlashAttention algorithm parallelizes along the query dimension. During decode (when you're generating one token at a time), the query length is tiny, so kernels switch to "Split-KV" or "FlashDecoding" strategies that parallelize along the key-value dimension. The specific split strategy typically depends on how many queries are being processed. This breaks batch invariance.
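Why a change in reduction strategy changes numerics can be seen with a contrived pure-Python example: summing the same vector as a different number of partial sums produces different rounding (the values are deliberately chosen to make the effect visible):

```python
def reduce_split(vals, num_splits):
    """Sum vals as num_splits sequential partial sums, then combine them."""
    n = len(vals)
    chunk = (n + num_splits - 1) // num_splits
    partials = [sum(vals[i:i + chunk]) for i in range(0, n, chunk)]
    return sum(partials)

# Values chosen so rounding visibly depends on grouping:
vals = [1e16, 1.0, -1e16, 1.0]

seq = reduce_split(vals, 1)    # ((1e16 + 1) + -1e16) + 1  →  1.0
split = reduce_split(vals, 2)  # (1e16 + 1) + (-1e16 + 1)  →  0.0
print(seq, split)
assert seq != split  # same data, different split strategy, different result
```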

Achieving Batch Invariance

The Thinking Machines solution is to design kernels with fixed reduction strategies regardless of batch size:

  • For RMSNorm and Matmuls: Use a consistent reduction strategy that has enough parallelism even for small batch sizes. This might sacrifice some peak performance for large batches, but guarantees identical numerics.

  • For Attention: Use a "fixed split-size" strategy instead of a "fixed split-count" strategy. Rather than dividing the KV dimension into N equal parts (where N varies based on load), divide it into chunks of fixed size. The number of chunks varies, but each chunk's reduction order stays constant.

# Fixed split-count (NOT batch-invariant):
# KV length 1000 with 4 splits → chunks of 250 each

# Fixed split-size (batch-invariant):
# KV length 1000 with chunk size 256 → three 256-chunks + one 232-chunk

This ensures that processing token 1000 in a sequence has identical numerics whether 0 tokens or 999 tokens are already in the KV cache.
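The two splitting strategies are easy to contrast in code (a sketch; the chunk size of 256 follows the example above):

```python
def split_fixed_count(kv_len, num_splits):
    """NOT batch-invariant: chunk sizes depend on how many splits we want."""
    base, rem = divmod(kv_len, num_splits)
    return [base + (1 if i < rem else 0) for i in range(num_splits)]

def split_fixed_size(kv_len, chunk_size=256):
    """Batch-invariant: chunk boundaries depend only on the KV length."""
    full, rem = divmod(kv_len, chunk_size)
    return [chunk_size] * full + ([rem] if rem else [])

print(split_fixed_count(1000, 4))  # [250, 250, 250, 250]
print(split_fixed_size(1000))      # [256, 256, 256, 232]

# With fixed split-count, a different load (→ different num_splits) moves
# every chunk boundary; with fixed split-size, existing boundaries never move,
# so each chunk's reduction order stays constant as the KV cache grows.
assert split_fixed_size(1000)[:3] == split_fixed_size(768)
```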


A Different Path: Groq's Hardware-Level Determinism

Both EigenAI and Thinking Machines approach determinism as a software problem to be solved on top of GPU hardware. They accept that GPUs are fundamentally nondeterministic and work around those limitations through careful kernel design and environment control. But there's a third approach that's worth understanding: designing hardware that's deterministic from the ground up.

Groq's Language Processing Unit (LPU) represents a fundamentally different philosophy. Rather than fighting GPU nondeterminism with software constraints, Groq built custom silicon where determinism is an architectural guarantee. Understanding how they achieve this illuminates why GPU-based determinism requires such elaborate engineering in the first place.

Why GPUs Are Inherently Probabilistic

To appreciate Groq's approach, we need to understand what makes GPUs nondeterministic at the hardware level. Modern GPUs are designed around a principle of dynamic scheduling - the hardware itself decides, at runtime, how to allocate resources and order operations. This includes branch predictors that guess which code path will execute, arbiters that resolve conflicts when multiple threads need the same resource, reordering buffers that rearrange instructions for efficiency, and caches that introduce variable latency depending on hit rates.

These components exist because GPUs evolved from graphics workloads where maximizing average throughput matters more than predictable latency. When you're rendering a game at 60 frames per second, it doesn't matter if individual pixels take slightly different amounts of time - what matters is that the whole frame finishes fast enough. The hardware optimizes for the average case, and the variance is hidden by the sheer volume of work.

This design philosophy creates problems for LLM inference. Token generation is inherently sequential (each token depends on the previous one), and even tiny timing variations can compound into different execution orders, which, as we've seen, mean different floating-point accumulation sequences and different numerical results.

The LPU: Determinism by Design

Groq's LPU takes the opposite approach: it eliminates all the hardware components that introduce unpredictability. There are no branch predictors because all execution paths are determined at compile time. There are no arbiters because the compiler pre-allocates every resource. There are no reordering buffers because instructions execute in exactly the order specified. There are no caches because all memory access patterns are known in advance.

The key insight is that AI inference workloads are remarkably predictable. Unlike general-purpose computing, where program behavior depends on runtime data, an LLM's computational graph is fixed. The same model architecture always performs the same operations in the same order. Groq's compiler exploits this predictability by pre-computing the entire execution plan down to individual clock cycles before any inference begins.

This creates what Groq calls a "software-scheduled" architecture. The compiler tells the hardware exactly when each operation will occur and where each piece of data will be at every moment. The hardware simply follows these instructions without making any autonomous decisions. The same input to the same compiled model produces the same execution trace every single time.

SRAM vs. HBM: The Memory Trade-off

One of Groq's most radical design choices is abandoning High Bandwidth Memory (HBM) entirely in favor of on-chip SRAM.

Standard GPUs like the H100 pair their compute cores with large HBM stacks - 141GB in the case of the H200. HBM provides massive capacity but introduces variable latency. Cache misses, memory controller scheduling, and refresh cycles all create unpredictable delays. These delays propagate through the computation, affecting when different threads complete their work and ultimately influencing accumulation order.

Groq's LPU instead integrates hundreds of megabytes of SRAM directly on the chip—not as a cache, but as the primary weight storage. SRAM is roughly 100x faster than HBM and provides completely deterministic access times. Every memory read takes exactly the same number of clock cycles, every time. This eliminates a major source of timing variability that GPU-based systems must carefully engineer around.

The trade-off is capacity. A single LPU chip has only about 230MB of SRAM, compared to 80GB+ of HBM on a modern GPU. This means running a large model like Llama-3 70B requires hundreds of LPU chips working together, while the same model fits on just a few GPUs. Groq addresses this through their multi-chip architecture, which deserves its own discussion.

Plesiosynchronous Multi-Chip Coordination

Running inference across hundreds of chips introduces its own coordination challenges. In GPU clusters, inter-chip communication typically uses network protocols that introduce nondeterministic latency: packets can take different routes, switches introduce queuing delays, and collective operations must wait for the slowest participant.

Groq sidesteps these issues with a "plesiosynchronous" chip-to-chip protocol. This term, borrowed from telecommunications, describes a system where multiple clocks run at nominally the same frequency but without a single shared clock source. Groq's chips cancel natural clock drift through periodic software synchronization, allowing hundreds of LPUs to behave as a single logical processor.

Because the compiler knows exactly when data will arrive at each chip, it can schedule inter-chip communication just as precisely as on-chip operations. There's no runtime negotiation or dynamic routing. The data flow is completely predetermined. This extends the determinism guarantee across the entire rack-scale system.

TruePoint Numerics: Precision Without Sacrifice

Another source of variability in GPU inference comes from aggressive quantization. To improve throughput, many GPU deployments compress model weights to INT8 or even INT4, accepting some accuracy loss for speed. The specific rounding behavior of these low-precision operations can vary based on implementation details.

Groq's approach, which they call "TruePoint numerics," maintains higher precision where it matters for accuracy while still achieving high throughput. Their intermediate accumulations use 100-bit precision. This is sufficient range to guarantee lossless accumulation regardless of input bit width. This eliminates a class of numerical instability that can cause divergent outputs in GPU systems using mixed-precision computation.

Trade-offs and Implications

As with any design, Groq's approach has its own set of trade-offs.

  • Capacity constraints: The SRAM-only architecture means large models require massive chip counts. A deployment that fits on a handful of GPUs might need multiple racks of LPUs. This increases physical footprint and infrastructure complexity.

  • Flexibility: GPUs can run any workload, from training to inference to general-purpose computing. LPUs are purpose-built for inference and can't be repurposed for training or other tasks. The compiler-first approach also means that novel model architectures may require significant compiler work before they can run efficiently.

  • Ecosystem: CUDA has decades of tooling, libraries, and expertise. Groq's software stack is newer and less battle-tested, though this gap is narrowing.

That said, for applications where determinism is critical, whether it's regulated industries requiring audit trails, verification systems that need re-execution, or mission-critical deployments where inconsistent behavior is unacceptable, Groq's architectural guarantees eliminate entire categories of engineering complexity. You don't need to worry about batch invariance, environment pinning, or kernel auditing because the hardware simply doesn't permit the variability that makes those measures necessary on GPUs.

The Broader Lesson

Groq's existence validates the thesis that GPU nondeterminism is a design choice, not an immutable law of physics. After all, a significant percentage of the world's software today is deterministic and relies on this property. GPUs are nondeterministic because they were designed for workloads where predictability didn't matter. LPUs demonstrate that alternative designs can achieve determinism without sacrificing performance, at least for the specific workload of LLM inference.

This doesn't diminish the value of GPU-based determinism engineering. Many organizations will still run inference on GPUs for the foreseeable future, and the techniques developed by EigenAI and Thinking Machines are essential for achieving reproducibility on that hardware. But understanding Groq's approach helps clarify why those techniques are necessary and hints at a future where determinism might come standard rather than requiring elaborate software scaffolding.


Conclusion

The conventional wisdom that GPU-based neural network inference is inherently nondeterministic turns out to be wrong, or at least surmountable. Through careful engineering at every layer of the stack, from hardware configuration through kernel design to decoding policy, bit-exact reproducibility is achievable with minimal performance overhead.