CircuitExplorer: Interpretability Research Infrastructure for Automated Circuit Discovery (Long)

Circuit analysis is still too manual.

Attribution graphs make mechanistic interpretability more concrete by giving researchers a prompt-specific graph of features, edges, and target-logit influence. But they do not solve the main workflow problem by themselves. A researcher still has to decide which nodes belong in the circuit, which paths matter, which features are adjacent but not mechanistic, and when the circuit is complete enough to validate.

CircuitExplorer is my attempt to turn that workflow into research infrastructure. It extends Neuronpedia's attribution-graph interface with automated circuit search, live fidelity feedback, semantic grouping, causal validation, and a benchmark harness for stress-testing the method across prompt categories and models.

The core result is:

On 15 researcher-verified circuits, CircuitExplorer's graph search algorithm matches expert-crafted circuits on median causal necessity (46.1pp vs 42.5pp) while improving median sufficiency (62.3% vs 51.2%), and it builds circuits in 6 seconds on average instead of ~10 minutes (100x faster).

The shorter write-up focuses on that story. This post fills in the technical details: how the search is formulated, why the algorithm uses completeness as its main graph-side signal, how the evaluation is separated from the search, how performance engineering brought benchmark runtime down, how the system is wired into Neuronpedia, and where the method currently fails.

The Circuit Discovery Problem
Attribution Graphs as a Search Substrate
Problem Formulation
IA+PC: Influence-Aware Search with Pathway Completion
Why Completeness Beats Replacement for Search
Causal Validation
Evaluation Design
Main Results
Counterfactual Mechanism-Reuse Analysis
Cross-Model Transfer
Runtime Engineering
Performance Engineering
UI and Research Workflow
System Architecture
Failure Modes
Next Method Step: Feature-Role Labeling
Next Scaling Considerations
What This Shows

The Circuit Discovery Problem

An attribution graph is already a major improvement over staring at raw activations. For a given prompt and target output, it gives a directed graph whose nodes correspond to interpretable features and whose edges estimate influence between features. In a factual prompt like:

The capital of the state containing Dallas is -> Austin

the graph can expose features related to Dallas, Texas, capitals, locations, and the target token. The issue is scale. A typical graph for Gemma-2-2B can contain hundreds to thousands of feature nodes and tens of thousands of edges. Somewhere inside that graph is the mechanism the researcher cares about, but the graph does not identify that mechanism automatically.

The manual workflow looks roughly like this:

Pick a target output token.
Follow high-influence features backward from the target logit.
Read feature labels.
Pin features that look relevant.
Follow upstream or downstream connections from those features.
Watch for error nodes that mark unexplained computation.
Repeat until the circuit looks plausible.
Run steering or ablation to test whether the pinned features matter.

This is an expert-guided search procedure, but the search policy is mostly in the researcher's head. The researcher has to infer which features matter, which paths are redundant, and whether the graph is missing a downstream pathway. Even when the result is good, it is difficult to scale and difficult to compare systematically across prompts.

CircuitExplorer's goal is not to remove the researcher from the loop. The goal is to automate the expensive first pass: propose candidate circuits quickly, attach graph-side scores to them, make them inspectable, and connect them to causal validation.

Attribution Graphs as a Search Substrate

CircuitExplorer builds on the circuit-tracer attribution-graph setup. Current attribution graphs decompose MLP computation using cross-layer transcoders (CLTs) while freezing attention patterns. Features become graph nodes, linearized influence becomes graph edges, and error nodes capture computation the transcoders fail to reconstruct.

There are three details that matter for CircuitExplorer.

First, the graph is feature-level. Instead of searching over layers or attention heads, CircuitExplorer searches over interpretable feature nodes (to the extent that CLT quality allows). This makes the resulting circuits more directly inspectable than a component-level circuit like "attention head 7.3 to MLP layer 12."

Second, scores computed from attribution graphs are cheap. Replacement and Completeness can be computed from the linearized adjacency matrix, so the search loop does not need to run model inference for every candidate feature.

Third, graph-side scores are not ground truth. The linearized replacement model is a useful search guide, but the final question is still causal: does intervening on these features change the real model's behavior?

That separation is central to the design. CircuitExplorer uses attribution-graph linear algebra to search, then uses model interventions to validate.

Problem Formulation

For each target behavior, CircuitExplorer receives:

a prompt
a target token or target logit
an attribution graph
feature labels and metadata
influence edges
error nodes

The output is a candidate circuit: a subset of feature nodes that should explain the target behavior.

A good circuit should satisfy four constraints:

Necessary: removing the circuit should reduce the target probability.
Sufficient: keeping only the circuit should preserve much of the target behavior.
Inspectable: the circuit should be small and structured enough for a researcher to understand.
Fast to build: discovery should be fast enough for interactive use and broad evaluation.

These constraints are in tension. A tiny hand-picked circuit can be interpretable but slow to build and low in necessity. A large high-sufficiency circuit can include support features that are not part of the mechanism of interest. A purely graph-score-optimized circuit can look coherent in the replacement model but fail under real intervention.

The search problem is therefore not "find the highest scoring subgraph" in isolation. It is:

Quickly find a small, target-relevant, mechanism-coherent feature subset that is likely to survive causal validation and is useful as a research object.

IA+PC: Influence-Aware Search with Pathway Completion

CircuitExplorer's main algorithm is IA+PC: Influence-Aware search followed by Pathway Completion.

Baseline: Completeness-Only Greedy Search

The simplest version of the method is a greedy search over graph features using Completeness as the objective. Completeness measures how much of the incoming influence to selected features is explained by interpretable graph features rather than error nodes.

This is a reasonable starting point because it pushes the search toward coherent causal chains. A feature whose activation is mostly explained by other selected features is easier to trust than a feature whose activation is dominated by error nodes.

But completeness-only search has a failure mode: it can find well-explained chains that are not target-specific enough. In other words, it can produce plausible-looking circuits that preserve graph coherence but miss features the model actually uses to push the target token.

Influence-Aware (IA) Search

Influence-Aware search adds target relevance to the greedy process. Candidate features are evaluated not only by how they affect Completeness, but also by whether they have positive influence on the target logit.

This matters because target influence and graph coherence are not the same thing. A feature can be cleanly explained by other features but weakly related to the target. Conversely, a target-promoting feature can be partly error-node-dependent, which makes a pure completeness search avoid it.

The IA stage gives the algorithm a way to include features that are useful for the target while still using Completeness as the main guardrail against opaque, error-node-driven computation.

Pathway Completion (PC)

Pathway Completion addresses the opposite problem: the IA stage can find causally important features but leave downstream gaps. A circuit can be necessary because it contains a critical upstream feature, while still being insufficient because it omits downstream features needed to carry that signal to the output.

PC expands the circuit after the greedy stage by adding nearby features that close those gaps without substantially degrading Completeness. The number of PC passes is a search parameter.

The conceptual split is:

IA improves target-specific necessity.
PC improves pathway-level sufficiency.
Completeness keeps the circuit anchored to interpretable graph structure.
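The split above can be sketched as a two-stage loop. This is a minimal illustration under stated assumptions, not the actual IA+PC implementation: the function name `ia_pc_search`, the `max_drop` tolerance, the `neighbors` map, and the toy `completeness` callable are all hypothetical, and the real search scores candidates against the attribution graph rather than a lookup table.

```python
from typing import Callable, Dict, Iterable, Set


def ia_pc_search(
    seeds: Set[str],
    candidates: Iterable[str],
    target_influence: Dict[str, float],
    neighbors: Dict[str, Set[str]],
    completeness: Callable[[Set[str]], float],
    ia_steps: int = 3,
    pc_passes: int = 1,
    max_drop: float = 0.02,  # hypothetical tolerance for Completeness loss in PC
) -> Set[str]:
    circuit = set(seeds)
    pool = set(candidates) - circuit

    # IA stage: greedy additions, restricted to target-promoting features,
    # choosing the candidate that leaves Completeness highest.
    for _ in range(ia_steps):
        promoting = sorted(f for f in pool if target_influence.get(f, 0.0) > 0.0)
        if not promoting:
            break
        best = max(promoting, key=lambda f: completeness(circuit | {f}))
        circuit.add(best)
        pool.discard(best)

    # PC stage: try features adjacent to the circuit and keep those that do
    # not reduce Completeness by more than max_drop.
    for _ in range(pc_passes):
        frontier = set().union(*(neighbors.get(f, set()) for f in circuit)) - circuit
        for f in sorted(frontier):
            if completeness(circuit | {f}) >= completeness(circuit) - max_drop:
                circuit.add(f)
    return circuit


# Toy stand-in for Completeness: mean "explained" fraction of selected features.
explained = {"dallas": 1.0, "texas": 0.9, "capital": 0.8,
             "tourism": 0.2, "austin_readout": 0.85}


def toy_completeness(circuit: Set[str]) -> float:
    return sum(explained[f] for f in circuit) / len(circuit)


toy_circuit = ia_pc_search(
    seeds={"dallas"},
    candidates=["texas", "capital", "tourism", "austin_readout"],
    target_influence={"texas": 0.5, "capital": 0.4, "tourism": -0.1,
                      "austin_readout": 0.0},
    neighbors={"capital": {"austin_readout"}, "texas": {"tourism"}},
    completeness=toy_completeness,
    ia_steps=2,
    pc_passes=1,
    max_drop=0.05,
)
print(sorted(toy_circuit))
```

In this toy run, IA adds the two target-promoting features, and PC then admits the downstream readout feature while rejecting the poorly explained one.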

Graph Pruning

Before search, CircuitExplorer prunes the graph to a smaller candidate set using target influence and connectivity. Non-feature nodes such as embeddings, logits, and error nodes are preserved, but low-value feature nodes are filtered out.

The pruning step is deliberately conservative. It is not trying to solve circuit discovery before the actual search starts. It is only trying to remove feature nodes that are unlikely to be selected by IA+PC. To do that, it assigns each feature a composite pre-search score.

The strongest signal is direct target influence: if a feature has an outgoing edge into the target logit, it gets a large boost. The next signal is one-hop indirect target influence: if a feature points to an intermediate node that then points to the target logit, it gets a smaller boost proportional to the product of those edge weights. This keeps features that may not directly touch the output but appear to be part of a short target-directed pathway.

The pruning score also includes two broader relevance signals. First, it uses the feature's existing graph influence as a weak tiebreaker, so globally important features are not discarded solely because they do not have a direct target edge. Second, it adds a small connectivity term based on the total absolute outgoing edge weight from the feature. This favors features that are structurally connected enough to participate in a pathway, while still letting target-specific influence dominate the ranking.

In implementation terms, the pre-search score is roughly:

score =
  2.0 * direct_target_edge_weight
  + 0.3 * one_hop_target_path_weight
  + 0.1 * node_influence
  + 0.05 * outgoing_edge_weight

After scoring, CircuitExplorer keeps the top fraction of feature nodes, with a minimum feature count so small graphs are not over-pruned. It then filters links to edges whose source and target both survived. Embeddings, logits, and error nodes are always kept because they define the input/output boundary and preserve the scoring semantics for Replacement and Completeness.
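A minimal sketch of the scoring-and-pruning step described above, assuming each feature is summarized by the four quantities in the formula. The weights and the 40% keep fraction come from the text; the function names, the dictionary layout, and the `min_features` default are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def pre_search_score(f: Dict[str, float]) -> float:
    # Weights taken from the composite score above.
    return (
        2.0 * f["direct_target_edge_weight"]
        + 0.3 * f["one_hop_target_path_weight"]
        + 0.1 * f["node_influence"]
        + 0.05 * f["outgoing_edge_weight"]
    )


def prune_features(
    features: Dict[str, Dict[str, float]],
    keep_fraction: float = 0.4,
    min_features: int = 2,  # hypothetical floor so small graphs are not over-pruned
) -> List[str]:
    ranked = sorted(features, key=lambda name: pre_search_score(features[name]),
                    reverse=True)
    n_keep = max(min_features, math.ceil(keep_fraction * len(ranked)))
    return ranked[: min(n_keep, len(ranked))]


def filter_links(links: List[dict], kept_features: Iterable[str],
                 boundary_nodes: Iterable[str]) -> List[dict]:
    # Embeddings, logits, and error nodes (the boundary) always survive.
    alive = set(kept_features) | set(boundary_nodes)
    return [l for l in links if l["source"] in alive and l["target"] in alive]


def feat(direct: float, one_hop: float, influence: float, outgoing: float) -> Dict[str, float]:
    return {
        "direct_target_edge_weight": direct,
        "one_hop_target_path_weight": one_hop,
        "node_influence": influence,
        "outgoing_edge_weight": outgoing,
    }


toy_features = {
    "dallas": feat(0.0, 0.2, 0.4, 1.5),
    "texas": feat(0.0, 0.6, 0.5, 1.2),
    "capital": feat(0.4, 0.1, 0.3, 0.8),
    "austin": feat(0.9, 0.0, 0.6, 0.9),
    "tourism": feat(0.0, 0.0, 0.05, 0.2),
}
toy_links = [
    {"source": "austin", "target": "logit"},
    {"source": "tourism", "target": "austin"},
    {"source": "capital", "target": "austin"},
    {"source": "error_0", "target": "capital"},
]

kept = prune_features(toy_features)
kept_links = filter_links(toy_links, kept, boundary_nodes=["logit", "error_0"])
print(kept, len(kept_links))
```

Because the direct-edge weight is large relative to the other terms, a feature with even a modest edge into the target logit outranks well-connected features that only reach the target indirectly.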

Pruning is not just a speed optimization. In practice, pruning low-value features can improve circuit quality because it removes distracting features that inflate the search space without adding useful target-specific structure. It is also part of the answer to whether CircuitExplorer can scale to much larger models or larger transcoder dictionaries: as graphs get bigger, the search cannot treat every feature as an equally plausible candidate. The README reports that pruning to the top 40% of features by composite score reduces a typical Gemma-2-2B scoring graph from about 855 nodes to about 340 nodes, a roughly 60% node-count reduction. For a dense scoring matrix, that is about an 84% reduction in matrix entries, from roughly 731k to 116k. Per-seed build time falls from about 60 seconds to about 5-7 seconds, saving roughly 53-55 seconds per seed while preserving or improving circuit quality.
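The matrix-entry figures follow from squaring the node counts. A quick check, assuming a dense n x n scoring matrix as stated above:

```python
nodes_before, nodes_after = 855, 340

# A dense scoring matrix has one entry per (target, source) pair.
entries_before = nodes_before ** 2
entries_after = nodes_after ** 2

node_reduction = 1 - nodes_after / nodes_before
entry_reduction = 1 - entries_after / entries_before

print(entries_before, entries_after)
print(f"{node_reduction:.1%} fewer nodes, {entry_reduction:.1%} fewer entries")
```

The quadratic relationship is why a 60% cut in nodes removes well over 80% of the dense-matrix work.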

Why Completeness Beats Replacement for Search

The circuit-tracer framework exposes two important graph-side scores:

Replacement: how much end-to-end influence flows through selected features rather than error nodes.
Completeness: how much of each selected feature's incoming influence is explained by interpretable features rather than error nodes.

Replacement sounds like the more obvious search objective. If the goal is to explain the target logit, why not maximize the amount of target-directed influence that flows through the circuit?

The empirical answer is that Replacement can select high-influence features that are not connected into transparent causal chains. It can find leaves near the output without finding the branches that explain how those leaves became active.

Completeness behaves differently. It rewards circuits whose features are explained by other interpretable features, and it treats error-node dependence as a warning signal. That turns out to matter for causal validation. In the weighting analysis summarized in the main README, C-only search produced the best mean and median necessity, the best mean sufficiency, and tied for the best median sufficiency among tested R/C weighting strategies:

Strategy	Avg Features	Mean Necessity	Median Necessity	Mean Sufficiency	Median Sufficiency
C only	25.1	0.196	0.182	40.7%	34.5%
Adaptive C->R	20.4	0.189	0.173	38.4%	34.5%
0.3R + 0.7C	18.2	0.185	0.171	38.5%	33.5%
R+C equal	16.9	0.178	0.169	35.5%	32.3%
R only	14.8	0.165	0.163	34.6%	32.4%

The lesson I take from this is that error-node dependence should be treated as a causal uncertainty signal. A feature can point toward the target logit but still be hard to trust if the graph cannot explain why it activated. Completeness is useful because it favors features whose upstream story is visible inside the graph.

This is also why CircuitExplorer does not stop at graph scores. Completeness is a search heuristic, not a calibrated prediction of model behavior.
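To make error-node dependence concrete, here is a deliberately simplified per-feature illustration. It is not circuit-tracer's scoring code: it skips the propagation through the linearized adjacency matrix and only measures, for each selected feature, what share of its absolute incoming weight comes from non-error sources. The function names and the optional `weights` argument are assumptions made for this sketch.

```python
from typing import Dict, Iterable, Optional


def non_error_fraction(incoming: Dict[str, float], errors: set) -> float:
    """Share of a node's absolute incoming weight that comes from non-error sources."""
    total = sum(abs(w) for w in incoming.values())
    if total == 0.0:
        return 0.0  # assumption: a node with no recorded inputs counts as unexplained
    explained = sum(abs(w) for src, w in incoming.items() if src not in errors)
    return explained / total


def completeness_sketch(
    incoming_by_feature: Dict[str, Dict[str, float]],
    error_nodes: Iterable[str],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of per-feature non-error fractions (uniform weights by default)."""
    errors = set(error_nodes)
    names = list(incoming_by_feature)
    if not names:
        return 0.0
    w = {n: (weights or {}).get(n, 1.0) for n in names}
    numerator = sum(w[n] * non_error_fraction(incoming_by_feature[n], errors)
                    for n in names)
    return numerator / sum(w.values())


toy_incoming = {
    "texas": {"dallas": 0.75, "err_L3": 0.25},    # mostly explained by a feature
    "capital": {"texas": 0.5, "err_L5": 0.5},     # half of its input is error
}
toy_errors = {"err_L3", "err_L5"}

print(completeness_sketch(toy_incoming, toy_errors))
print(completeness_sketch(toy_incoming, toy_errors,
                          weights={"texas": 1.0, "capital": 3.0}))
```

Weighting the error-dependent feature more heavily pulls the score down, which is the behavior the search relies on when it treats error dependence as a warning signal.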

Causal Validation

CircuitExplorer validates circuits with ablation-based interventions against the model.

For each candidate circuit, the validation loop measures three conditions:

Baseline: the model runs normally on the prompt.
Necessity: circuit features are ablated.
Sufficiency: everything except the circuit features is ablated.

Necessity asks whether the circuit matters:

necessity = P(target | baseline) - P(target | circuit ablated)

If ablating the circuit substantially reduces target probability, the circuit contains features the model relies on.

Sufficiency asks whether the circuit is complete:

sufficiency = P(target | complement ablated) / P(target | baseline)

If keeping only the circuit preserves the target probability, the circuit captures enough of the pathway to reproduce the behavior under heavy ablation.
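Both metrics reduce to simple arithmetic on three target probabilities. A small sketch, with hypothetical function names and made-up example probabilities:

```python
def necessity_pp(p_baseline: float, p_circuit_ablated: float) -> float:
    """Drop in target probability when the circuit is ablated, in percentage points."""
    return 100.0 * (p_baseline - p_circuit_ablated)


def sufficiency_pct(p_baseline: float, p_complement_ablated: float) -> float:
    """Share of baseline target probability retained when only the circuit is kept."""
    if p_baseline <= 0.0:
        raise ValueError("baseline target probability must be positive")
    return 100.0 * p_complement_ablated / p_baseline


# Illustrative numbers only, not results from the evaluation.
p_base, p_ablated, p_only_circuit = 0.80, 0.35, 0.60
print(f"necessity   = {necessity_pp(p_base, p_ablated):.1f} pp")
print(f"sufficiency = {sufficiency_pct(p_base, p_only_circuit):.1f} %")
```

Necessity is a difference and is naturally reported in percentage points, while sufficiency is a ratio against the baseline and is reported as a percentage.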

The two metrics answer different questions. High necessity with low sufficiency often means the circuit found an important upstream junction but omitted downstream or parallel support. High sufficiency with weak necessity can mean the circuit contains features that preserve the ablated model's function but are not the target-specific mechanism.

This distinction becomes important in the failure analysis.

Evaluation Design

CircuitExplorer is evaluated in four main ways.

First, there is a researcher comparison set: 15 researcher-verified circuits on Gemma-2-2B with gemmascope-transcoder-16k. This is the central comparison because it tests whether the automated method can match the workflow it is meant to accelerate.

Second, there is a broader prompt benchmark: 62 prompts across 9 categories (factual recall, multi-step factual lookup, cross-lingual identification, conditional reasoning, syntactic agreement, negation, antonyms, irregular morphology, and transitive reasoning).

Third, there is a counterfactual mechanism-reuse suite: 40 prompt pairs across 8 categories. The goal is to test whether circuits share causally important features across prompt variants that instantiate the same abstract mechanism.

Fourth, there is a cross-model transfer check on Qwen3-4B using a different transcoder set.

The main metrics are:

necessity
sufficiency
circuit size
runtime
feature overlap for counterfactual pairs
category-level strength and weakness patterns

The 15-circuit researcher set is not large enough to prove universal generality. It is enough to make a meaningful comparison against an expert workflow, especially when combined with broader category-level and counterfactual evidence.

Main Results

On the 15 researcher-verified circuits:

Method	Avg Necessity (pp)	Avg Sufficiency
Researcher	32.7	79.1%
C-only greedy	30.8	33.3%
C-only + PC	30.8	86.0%
IA+PC	32.7	86.2%

C-only greedy search finds useful interpretable chains, but its sufficiency is poor. Adding Pathway Completion raises sufficiency from 33.3% to 86.0%, which suggests the missing pieces were often downstream or adjacent features rather than the core target-relevant chain. Adding Influence-Aware search recovers the necessity gap, bringing average necessity from 30.8pp to 32.7pp, matching the researcher average.

That decomposition is important because it makes the method less like an opaque "algorithm that worked" and more like a set of targeted fixes:

Completeness finds graph-coherent chains.
Pathway Completion fills missing pathway support.
Influence-Aware search preserves target relevance.

The category-level pattern is also informative. CircuitExplorer is strongest on entity-specific factual recall, cross-lingual identification, and some arithmetic or multi-step factual tasks. It is weaker on analogy-style prompts and attention-heavy categories such as antonyms and parts of syntactic agreement.

This is roughly what the architecture of the method predicts. CircuitExplorer works best when the target mechanism is represented in the MLP-focused attribution graph. It struggles when important computation is mediated by attention, hidden in error nodes, or entangled with suppressive features.

Counterfactual Mechanism-Reuse Analysis

A natural objection to automated circuit discovery is that the algorithm might find correlated feature bundles rather than mechanisms.

For example, a prompt about Paris and France might activate features for the Eiffel Tower, French language, tourism, European cities, and Paris. Those features are semantically related, but not all of them are necessarily part of the mechanism that maps "Paris" to "France."

The counterfactual prompt-pair suite tests this by asking whether two prompts that instantiate the same mechanism share causally important features.

Examples:

Bonjour means hello in -> French
Hola means hello in -> Spanish

If the model uses a reusable language-identification mechanism, the two circuits should share language-identification features even though the source word and target language differ.
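One way to quantify sharing between two circuits is Jaccard overlap on their feature sets. The text here does not specify the exact overlap measure used in the evaluation, so treat this as one reasonable option rather than the reported metric; the `(layer, feature_index)` identifiers are hypothetical.

```python
from typing import Iterable, Tuple

Feature = Tuple[int, int]  # hypothetical (layer, feature_index) identifier


def feature_overlap(circuit_a: Iterable[Feature], circuit_b: Iterable[Feature]) -> float:
    """Jaccard overlap: shared features divided by all features in either circuit."""
    a, b = set(circuit_a), set(circuit_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


bonjour_circuit = {(3, 101), (7, 55), (12, 9)}
hola_circuit = {(3, 101), (7, 55), (14, 2), (20, 8)}

print(feature_overlap(bonjour_circuit, hola_circuit))  # 2 shared out of 5 total
```

Jaccard penalizes size mismatch between the two circuits; dividing by the smaller circuit's size instead would answer a slightly different question, namely how much of the smaller circuit is reused.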

Across the prompt-pair suite, the strongest sharing appears in cross-lingual and multi-hop factual categories. The README reports cross-lingual overlap around the mid-40% range, with pairs such as "Bonjour -> French" and "Ciao -> Italian" sharing a large fraction of circuit features. Many shared features have language-related labels across layers, forming a coherent language-identification pathway.

Multi-step factual prompts also show reusable structure, especially where the prompt asks for an entity-to-intermediate-to-property lookup. Some features correspond to political entities, locations, languages, or currencies depending on the output type.

As for the negative results, antonyms show near-zero overlap across prompt pairs. Irregular morphology also shows low overlap, with the few shared features often looking like generic prompt or question-answering infrastructure rather than a morphology-specific mechanism. That suggests these tasks are either more word-specific, more attention-mediated, or less well covered by the available transcoder features.

This gives CircuitExplorer a more nuanced claim:

The discovered circuits reflect reusable MLP-based mechanisms when those mechanisms are present in the attribution graph, and the counterfactual analysis exposes categories where that assumption breaks down.

Cross-Model Transfer

The primary results use Gemma-2-2B. To test whether the search method is brittle to one model family, I also ran the evaluation on Qwen3-4B with a different transcoder set and no algorithm changes.

The transfer result:

Metric	Gemma-2-2B	Qwen3-4B
Mean necessity (pp)	36.0	37.4
Mean sufficiency	86.8%	27.9%

The promising part is necessity. IA+PC finds causally important circuits on Qwen3-4B with comparable average necessity to Gemma-2-2B. The category-level pattern also broadly transfers: factual recall and cross-lingual tasks remain strong, while attention-heavy tasks remain weak.

The caveat is sufficiency. The selected circuits are usually incomplete as a self-sufficient mechanism for Qwen3-4B. Necessity only asks whether the discovered features are load-bearing. Sufficiency asks whether the discovered features, by themselves, can carry the behavior after almost everything else is ablated. A circuit can therefore be highly necessary and still not sufficient if it captures a key junction but misses distributed support features, downstream readout features, or computation that the graph represents only through error nodes.

The right conclusion is therefore narrow:

The search procedure appears to transfer across model families, while the achievable circuit quality remains bounded by transcoder quality and graph coverage.

That is useful evidence, but it is not a claim of broad model independence; more models would need to be tested.

Runtime Engineering

CircuitExplorer has two runtime targets:

interactive use in the graph UI
benchmark execution fast enough to evaluate dozens of prompts systematically

The main design decision is to avoid model inference during search. Candidate scoring runs over the attribution graph's linearized representation. Model inference is reserved for causal validation after candidate circuits are built.

That separation keeps the interactive path responsive. The graph UI can stream candidate circuits as they are built, while the expensive causal question is deferred until the researcher asks to validate a circuit or an evaluation runner reaches the validation stage.

It also makes the benchmark harness tractable. Discovery can be run across many prompts using graph-side operations, and validation can be batched separately through the graph server's intervention endpoints.

Performance Engineering

The runtime target was not just "faster than manual tracing." The system needed to be fast enough that repeated evaluation was realistic. A method that takes seconds in the UI but hours to benchmark is hard to iterate on, and circuit quality claims become expensive to re-check after every algorithm change.

The full 62-prompt discovery and validation suite was sped up 15x, from roughly 90 minutes to about 6 minutes, without changing circuit quality. The main improvements came from moving hot loops into tensor operations, reducing client/server round trips, and overlapping build and validation work.

In this section, a mask means a boolean tensor that selects a subset of graph features. A pinned mask marks which features are currently in the candidate circuit; a trial mask is a temporary variant of that pinned mask with one or more candidate features added so the scorer can evaluate that possible circuit.

For example:

Feature index:     0        1        2        3        4
Feature:        Dallas   Texas   capital   Austin   tourism

Pinned mask:    true     true    false     false    false
Trial mask:     true     true    true      false    false
                                  ^ candidate feature added

The pinned mask says "score the circuit with features 0 and 1 selected." The trial mask adds feature 2 temporarily, so the scorer can ask whether adding that candidate improves Completeness or target influence.
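The mask bookkeeping above can be sketched with plain Python lists. The real scorer uses boolean tensors and stacks the trial masks for batched scoring; the helper name `build_trial_masks` is hypothetical.

```python
from typing import Dict, List

FEATURES = ["Dallas", "Texas", "capital", "Austin", "tourism"]


def build_trial_masks(pinned: List[bool], candidates: List[int]) -> Dict[int, List[bool]]:
    """Return one trial mask per candidate index: the pinned mask plus that candidate."""
    trials: Dict[int, List[bool]] = {}
    for idx in candidates:
        if pinned[idx]:
            continue  # already in the circuit, nothing to try
        trial = list(pinned)  # copy so the pinned mask itself is never mutated
        trial[idx] = True
        trials[idx] = trial
    return trials


pinned_mask = [True, True, False, False, False]
trial_masks = build_trial_masks(pinned_mask, candidates=[2, 3])

for idx, mask in trial_masks.items():
    print(f"try adding {FEATURES[idx]!r}: {mask}")
```

Each trial mask differs from the pinned mask in exactly one position, which is what makes the candidates cheap to score together against the same base graph.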

Optimization	Effect
Server-side PyTorch scoring	moved matrix operations out of client-side JavaScript
`scatter_add` pin merging	replaced a Python loop over many features with vectorized tensor merging
Batched candidate evaluation	scored several candidates with one tensor operation
Batched steer validation	submitted necessity and sufficiency ablations together through `/steer-batch`
Pipelined build and validation	overlapped the next circuit build with the previous validation

Server-side PyTorch scoring. (graph scorer setup, device tensors, server-side build endpoint) The first bottleneck was where scoring ran. The UI already had a JavaScript/Web Worker implementation for live Replacement and Completeness feedback, which is useful for interactive editing. But benchmark search repeatedly scores many candidate pin sets, and doing that in client-side JavaScript creates avoidable overhead around data movement, matrix construction, and per-candidate execution. The optimized path moves benchmark scoring to the graph server, where the adjacency matrix is preprocessed once as a PyTorch tensor and reused across candidate evaluations. That keeps the hot path closer to the graph data and lets the same scoring operation run on CPU, GPU, or MPS through tensor kernels.

The linked scorer setup is the place where the graph is converted into a reusable tensor representation, and the /build-circuit endpoint is where IA+PC runs against that server-side scorer rather than in the browser.

The server-side scorer builds the dense adjacency once and keeps the graph metadata as device tensors:

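# Excerpt from the scorer's setup method (requires `import torch`).
# Dense adjacency indexed as base[target, source]: column j holds node j's outgoing influence.
base = torch.zeros(self.n, self.n, dtype=torch.float32)
for link in links_data:
    # Link endpoints may be plain node-id strings or node objects carrying a node_id.
    src = link["source"] if isinstance(link["source"], str) else link["source"]["node_id"]
    tgt = link["target"] if isinstance(link["target"], str) else link["target"]["node_id"]
    si, ti = self.node_id_to_idx.get(src), self.node_id_to_idx.get(tgt)
    # Skip links whose endpoints are missing from the node index.
    if si is not None and ti is not None:
        base[ti, si] = link["weight"]

# Move the adjacency and index metadata to the scoring device once, for reuse across candidates.
self.base_adj = base.to(self.device)
self.feature_indices_t = torch.tensor(self.feature_indices, dtype=torch.long, device=self.device)
self.feature_error_indices_t = torch.tensor(self.feature_error_indices, dtype=torch.long, device=self.device)
self.logit_weights = lw.to(self.device)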

scatter_add pin merging. (vectorized pin merging) Each candidate circuit is scored by asking: "what if only these pinned features remained interpretable, and every unpinned feature were treated as unexplained error?" To simulate that, every unpinned feature's outgoing contribution has to be moved into the error node for the same layer and token position.

The key line is adj.scatter_add_(...): it adds all unpinned feature columns into their matching error-node columns in one tensor operation. The surrounding code then zeroes the original feature rows and columns.

Remember that the search's objective is Completeness. Completeness penalizes circuits whose selected features depend on error-mediated or unexplained computation. Redirecting unpinned feature influence into error nodes keeps that penalty honest.

The naive implementation did this one feature at a time:

for each unpinned feature:
    move that feature's contribution into its matching error node
    remove the feature itself from the graph

Conceptually, this is like moving hundreds of small piles into a smaller number of labeled buckets, one pile at a time.

scatter_add does the same bookkeeping in one tensor operation. It gathers all unpinned feature contributions, builds a vector saying which error node each one should go to, and asks PyTorch to add them into their destination error nodes all at once:

unpinned features:  F1   F2   F3   F4   F5
error destinations: E7   E7   E9   E7   E12

scatter_add:
  E7  += F1 + F2 + F4
  E9  += F3
  E12 += F5

In the adjacency matrix, that reassignment corresponds to adding each unpinned feature's outgoing-influence column into the column for its matching error node, then zeroing the feature's row and column.
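The bookkeeping can be shown on a tiny dense matrix in plain Python. This sketch only illustrates the semantics that scatter_add vectorizes on the server; it is not the tensor implementation. It assumes the adjacency is indexed as adj[target][source], and the helper name and node layout are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def merge_unpinned_into_errors(adj: Matrix, error_of: Dict[int, int],
                               unpinned: List[int]) -> Matrix:
    """Fold each unpinned feature's outgoing column into its error node's column."""
    n = len(adj)
    out = [row[:] for row in adj]  # work on a copy; the base graph is reused
    # Step 1: add every unpinned feature column into its matching error column.
    for f in unpinned:
        e = error_of[f]
        for r in range(n):
            out[r][e] += adj[r][f]
    # Step 2: remove the unpinned features themselves (row and column).
    for f in unpinned:
        for k in range(n):
            out[f][k] = 0.0
            out[k][f] = 0.0
    return out


# Node order: 0 = F1, 1 = F2, 2 = E7 (error node), 3 = target logit.
toy_adj = [
    [0.0, 0.0, 0.0, 0.0],      # inputs to F1
    [0.5, 0.0, 0.0, 0.0],      # inputs to F2: F1 -> F2
    [0.0, 0.0, 0.0, 0.0],      # inputs to E7
    [0.5, 0.25, 0.125, 0.0],   # inputs to logit: F1, F2, E7
]
toy_error_of = {0: 2, 1: 2}    # both features map to error node E7

only_f2_pinned = merge_unpinned_into_errors(toy_adj, toy_error_of, unpinned=[0])
nothing_pinned = merge_unpinned_into_errors(toy_adj, toy_error_of, unpinned=[0, 1])
print(only_f2_pinned[3], only_f2_pinned[1])
```

With F1 unpinned, F2's input now arrives entirely from the error node, so a Completeness-style score would flag F2 as poorly explained even though its own edge to the logit is unchanged.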

Batched candidate evaluation. (batched scoring kernel, batch scoring wrapper, IA trial batches, PC trial batches) IA and PC both evaluate small groups of possible next features. In IA, the algorithm ranks candidate additions and chooses the one that best preserves Completeness while adding target influence. In PC, it tests adjacent features that might close missing pathways. The unoptimized version scores these trial circuits one at a time. The optimized version builds a batch of trial pin masks, stacks them into a tensor, and scores the whole group with a batched kernel. That matters because each trial shares the same base graph and differs only in which features are pinned. The server can stack the trial adjacency matrices into a batch and run the same matrix multiplication across all candidates at once with PyTorch's batched matrix multiplication (torch.bmm), instead of looping over candidates in Python.

The IA and PC links show the trial_masks tensors being constructed for multiple candidate additions, while score_batch_masks(...) and _jit_score_batch(...) score those masks together with torch.bmm.

Batched steer validation. (batched validation request, necessity+sufficiency packed into one batch, graph server /steer-batch endpoint) Causal validation is still real model intervention, so it is much more expensive than graph-side scoring. Each circuit needs at least two intervention conditions: ablate the circuit for necessity, and ablate the complement for sufficiency. The original validation path could submit those as separate /steer calls, paying repeated request overhead and repeated baseline work. The optimized path submits the intervention feature sets together through /steer-batch, receives the baseline logits once, and then reads necessity and sufficiency from the returned steered results. This does not make intervention cheap, but it removes unnecessary round trips from every evaluated circuit.

The validation runner builds one feature_sets array containing the necessity ablation and the sufficiency ablation, and the /steer-batch handler computes the default logits once before applying each intervention set.
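A sketch of how the two ablation conditions might be packed into one request body. Only the `/steer-batch` endpoint and the `feature_sets` array are named in the text; every other field name here (`prompt`, `label`, `ablate`) and the feature-id format are hypothetical, and the real payload schema may differ.

```python
from typing import Dict, List


def build_steer_batch_payload(prompt: str, circuit: List[str],
                              all_features: List[str]) -> Dict[str, object]:
    """Pack the necessity and sufficiency ablations into a single batch request."""
    circuit_set = set(circuit)
    in_circuit = [f for f in all_features if f in circuit_set]
    complement = [f for f in all_features if f not in circuit_set]
    return {
        "prompt": prompt,  # hypothetical field name
        "feature_sets": [
            {"label": "necessity", "ablate": in_circuit},    # ablate the circuit
            {"label": "sufficiency", "ablate": complement},  # ablate everything else
        ],
    }


payload = build_steer_batch_payload(
    prompt="The capital of the state containing Dallas is",
    circuit=["L3/101", "L7/55"],
    all_features=["L3/101", "L5/9", "L7/55", "L12/4"],
)
print(payload["feature_sets"])
```

Sending both sets in one request is what lets the server compute the baseline logits once and reuse them for both conditions.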

Pipelined build and validation. (validation queue setup, single-worker validation pool, submit validation while continuing builds, serial vs wall-time summary) The final speedup came from reducing idle time in the benchmark runner. Discovery and validation stress different parts of the stack: circuit building is mostly graph-side scoring, while validation waits on model intervention. Instead of building a circuit, waiting for its validation, and only then starting the next prompt, the optimized runner overlaps those stages where possible. While one circuit is being validated, the next circuit can already be built. That turns the benchmark from a strictly serial process into a small pipeline, which matters when the same build-then-validate pattern repeats across dozens of prompts.

The runner submits validation as a future after a circuit is built, flushes only completed validations during the prompt loop, and compares the serial estimate with total wall time to show the overlap.
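The overlap pattern can be sketched with a single-worker thread pool from the standard library. The `build_circuit` and `validate_circuit` functions below are stand-ins that just sleep; the real runner calls the graph server, and its structure may differ from this sketch.

```python
import time
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Deque, Dict, List


def build_circuit(prompt: str) -> str:
    time.sleep(0.01)  # stand-in for graph-side search
    return f"circuit({prompt})"


def validate_circuit(circuit: str) -> Dict[str, object]:
    time.sleep(0.02)  # stand-in for model intervention
    return {"circuit": circuit, "validated": True}


def run_pipelined(prompts: List[str]) -> List[Dict[str, object]]:
    results: List[Dict[str, object]] = []
    pending: Deque[Future] = deque()
    # One worker: validations run in submission order while builds continue.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for prompt in prompts:
            circuit = build_circuit(prompt)
            pending.append(pool.submit(validate_circuit, circuit))
            # Flush only validations that have already finished, oldest first.
            while pending and pending[0].done():
                results.append(pending.popleft().result())
        # Drain whatever is still in flight after the last build.
        while pending:
            results.append(pending.popleft().result())
    return results


pipeline_results = run_pipelined(["p1", "p2", "p3"])
print([r["circuit"] for r in pipeline_results])
```

Flushing from the front of the queue keeps results in prompt order without ever blocking the build loop on an unfinished validation.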

These optimizations changed the shape of the bottleneck. Before them, benchmark time was dominated by repeated validation calls and Python-side overhead around scoring. After batching and pipelining, the optimized path is build-bound rather than validation-bound. That means future speedups would likely come from improving the scoring path itself, especially as models and transcoders get larger: sparse matrices, more GPU-friendly kernels, or better candidate pruning.

The important constraint was preserving the evaluation boundary. None of these optimizations changed what IA+PC selected or how causal validation was measured. They changed where the same work ran, how much of it was batched, and how idle time was removed from the benchmark pipeline.

UI and Research Workflow

For interpretability work, the user interface is part of the research method. A circuit proposal is only useful if a researcher can inspect it, compare it, edit it, and validate it.

CircuitExplorer extends Neuronpedia's graph interface with several workflow pieces:

Circuit Explorer modal: launches automated circuit search with configurable starting features, IA steps, PC passes, and pruning.
SSE streaming: returns candidate circuits as they finish rather than waiting for a full batch.
Explored circuits panel: lists candidate circuits with feature counts, scores, grouping status, comparison controls, and validation actions.
Circuit viewing: applies a candidate circuit to the graph as pins and grouped supernodes.
Fidelity dashboard: gives live Replacement and Completeness feedback as features are pinned or unpinned.
Causality testing: runs ablation validation from the graph interface.
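
To make the streaming item concrete, here is a small sketch of how candidate circuits could be framed as Server-Sent Events. The event names (`circuit`, `done`) and payload fields are assumptions for illustration; only the wire format itself (an `event:` line, a `data:` line, then a blank line) comes from the SSE standard.

```python
import json
from typing import Dict, Iterable, Iterator


def sse_frame(event: str, payload: Dict[str, object]) -> str:
    """Format one Server-Sent Events frame: event line, data line, blank line."""
    # json.dumps without indent emits a single line, so one data: line suffices.
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_circuits(candidates: Iterable[Dict[str, object]]) -> Iterator[str]:
    """Yield one frame per finished candidate, then a terminal 'done' frame."""
    count = 0
    for circuit in candidates:
        count += 1
        yield sse_frame("circuit", circuit)
    yield sse_frame("done", {"count": count})
```

Because `candidates` can itself be a generator, each frame is emitted as soon as its circuit finishes, which is what lets the explored-circuits panel fill in incrementally.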

The UI supports two modes of use. A researcher can use automated search to get candidate circuits quickly, or they can manually edit pins while watching the fidelity dashboard update. In both cases, the causal validation path stays available from the same workspace.

System Architecture

Since CircuitExplorer extends Neuronpedia, it spans the frontend, Next.js API layer, graph server, model-inference path, and evaluation harness.

At a high level:

Frontend: Next.js, React, TypeScript, D3 graph visualization.
Client-side scoring: Web Worker for live Replacement and Completeness feedback.
Graph server: FastAPI/Python, PyTorch, TransformerLens, circuit-tracer, transcoders.
Validation: steer and steer-batch endpoints for feature ablation.
Persistence/platform: Prisma/Postgres through the Neuronpedia app.
Evaluation: TypeScript and Python runners using the same discovery and validation machinery as the UI.

Failure Modes

The clearest failure mode for CircuitExplorer is analogy-style prompts, such as:

Mexico:Spanish :: US:English
Mexico:peso :: US:dollar

These are cases where researcher-built circuits still beat IA+PC. A possible issue is suppressive-feature contamination. The search can include features that help the target win by suppressing competitors rather than by participating in the target mechanism itself.

This interacts with a broader sufficiency problem. The sufficiency test is a severe intervention: it ablates everything except the circuit. Under that kind of intervention, features that preserve the residual stream's functional support can look useful even if they are not specific to the mechanism being studied.

I think of the problematic features in two broad categories:

Suppressive features: features that help mainly by reducing competitors.
Infrastructure features: features that keep the model functional under heavy ablation.

Both can improve measured sufficiency without making the circuit more mechanism-clean.

There is also the attention gap. Current attribution graphs decompose MLP computation but freeze attention patterns. Tasks like antonyms, syntactic agreement, and some irregular morphology cases can have weak or near-zero shared-feature overlap because the important computation may not be represented cleanly in the MLP feature graph.

Finally, all results are bounded by transcoder quality. If the replacement model does not decompose a relevant direction into interpretable features, CircuitExplorer cannot select it. It will appear as error-node-mediated computation or be absent from the candidate feature set.

Next Method Step: Feature-Role Labeling

The most obvious next method step is to classify discovered features by role:

core
suppressive
infrastructure

The current version of the labeling uses existing graph-side signals, is applied post hoc, and does not change search behavior yet. That scope is intentional. A hard search-time penalty would be premature because some suppressive-looking features are not pure contamination; they can provide real support structure.

The next evaluation should be narrow:

inspect analogy circuits
check whether strong factual-recall circuits are mostly labeled core
check whether cross-lingual circuits are mostly labeled core
measure whether flagged suppressive/infrastructure features concentrate in known failures
avoid adding a search penalty until the post-hoc labels are validated
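
The concentration check in that list is simple enough to sketch. The input shape here is an assumption: one `(prompt_category, role_label)` pair per discovered feature, with role labels drawn from the three roles above. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Roles that count as "flagged" for the concentration check.
NON_CORE = {"suppressive", "infrastructure"}


def flagged_share_by_category(rows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of discovered features labeled non-core, per prompt category.

    rows holds (prompt_category, role_label) pairs, one per discovered feature.
    """
    total: Dict[str, int] = defaultdict(int)
    flagged: Dict[str, int] = defaultdict(int)
    for category, role in rows:
        total[category] += 1
        if role in NON_CORE:
            flagged[category] += 1
    return {category: flagged[category] / total[category] for category in total}
```

If the labels are meaningful, the share should be visibly higher for analogy prompts than for factual-recall or cross-lingual prompts. If it is flat across categories, the labels are not yet explaining the failures.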

If that works, the smallest search-time change would be a soft penalty:

penalize suppressive candidates late in IA
penalize infrastructure candidates during PC
keep the penalty soft enough to avoid deleting useful support structure
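
One possible shape for such a soft penalty is sketched below. Everything specific in it is hypothetical: the penalty weights, the linear ramp over IA progress, and the function name are placeholders, since no concrete values have been chosen. The sketch also assumes candidate gains are non-negative.

```python
# Hypothetical penalty weights; no concrete values have been chosen yet.
IA_LATE_SUPPRESSIVE_PENALTY = 0.25
PC_INFRASTRUCTURE_PENALTY = 0.25


def penalized_gain(gain: float, role: str, phase: str, ia_progress: float = 0.0) -> float:
    """Scale a candidate's (non-negative) score gain down, never to zero.

    phase is "IA" or "PC"; ia_progress in [0, 1] is the fraction of IA steps used.
    """
    penalty = 0.0
    if phase == "IA" and role == "suppressive":
        # Ramps from 0 at the start of IA to the full weight at the end.
        penalty = IA_LATE_SUPPRESSIVE_PENALTY * ia_progress
    elif phase == "PC" and role == "infrastructure":
        penalty = PC_INFRASTRUCTURE_PENALTY
    return gain * (1.0 - penalty)
```

Because the penalty is multiplicative and bounded below 1, a penalized candidate with a large enough gain can still win, which is the property that keeps useful support structure from being deleted outright.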

Next Scaling Considerations

Current evidence is intentionally small-model evidence. The strongest causal comparison is the Gemma-2-2B verified-circuit run in apps/graph/eval/verified-circuits/results/verified-circuit-comparison.md: 15 researcher-verified circuits, zero eval errors, median verified necessity 42.5pp, median CircuitExplorer necessity 46.1pp, median verified sufficiency 51.2%, and median CircuitExplorer sufficiency 62.3%. The broader build-only timing run in apps/graph/eval/models/gemma-2-2b/circuit-build-only-eval.md covers 62 Gemma-2-2B prompts and reports 6.05s average IA+PC build time on CPU, excluding causal validation. The Qwen3-4B notes in apps/graph/eval/models/qwen3-4b/iapc-parameters.md record the next-size-up operating point: graph generation with batch_size=1, edge_threshold=0.98, max_feature_nodes=10000, IA search budget 200, and PC candidate budget 30. There is not yet evidence here for 10B+ or 100B-class models.

The main scaling boundary is that CircuitExplorer is cheap only after an attribution graph already exists. The search path consumes graph JSON: nodes, links, target logits, and endpoint pins. It does not directly run a full model forward pass during circuit search. However, larger base models usually make every upstream and downstream step harder: graph generation needs a larger loaded model plus more replacement modules, the resulting graph can contain more layers and candidate features, and causal validation requires intervention forwards through the large model.

The current graph server is a single-model, mostly single-request service. server.py loads one ReplacementModel at process startup, and graph generation, steering, forward-pass, and steer-batch routes share a process-level request_lock. That is reasonable for a local Gemma-2-2B workbench, but it becomes a throughput bottleneck for larger models. A 10B-100B deployment would need separate pools for graph generation, CircuitExplorer scoring, and causal validation; explicit queueing and cancellation; model workers pinned to specific GPU sets; and autoscaling around loaded model/transcoder state rather than stateless HTTP workers.

Model and transcoder residency is the first hard systems bottleneck. The current circuit-tracer backend assumes the base model and transcoders/CLTs can be loaded by one graph server process. For 10B-class models this likely means high-memory GPUs, careful dtype/quantization choices, and prewarmed workers. For 100B-class models it likely means tensor or pipeline parallel inference plus the following systems changes:

Sharded replacement modules: split transcoders, CLTs, or other replacement components across GPUs, machines, layers, or model blocks instead of assuming one process can own every module. The graph server would need to route activations through the shard that owns the relevant layer or block.
Remote model-serving integration: call a dedicated model-serving backend instead of loading the full base model inside the CircuitExplorer graph server process. That backend could be a distributed TransformerLens service, vLLM/SGLang-style serving path, TensorRT-LLM deployment, or custom inference cluster that accepts graph-generation and intervention requests.
Artifact locality: keep model weights, transcoder weights, graph files, caches, and intermediate tensors close to the workers that use them. At 100B scale, repeatedly downloading from object storage or moving tensors across machines can dominate runtime, so workers need prewarmed caches, local or high-throughput shared storage, and fewer serialization/network hops.

The largest missing prerequisite is often not the CircuitExplorer search algorithm; it is high-quality replacement modules for every layer or a credible subset of layers.

Attribution graph generation is the next bottleneck. The Qwen3-4B parameters already use batch_size=1 and a high edge_threshold while still allowing up to 10000 feature nodes. Larger models will put pressure on node and edge counts before CircuitExplorer sees the graph. If graph size scales linearly with layers and candidate features, browser transfer, server JSON parsing, scorer construction, and UI rendering all degrade. The practical path is hierarchical graph generation: layer/window budgets, adaptive per-layer feature caps, early target-specific pruning, sparse graph serialization, and progressive graph loading instead of one large JSON blob.

The scorer is currently dense-matrix based. BatchGraphScorer builds an n x n float32 adjacency matrix, and batched scoring expands that into B x n x n tensors. The JIT scorer clones adjacency, removes unpinned features, normalizes rows, then iterates influence propagation up to 200 steps per score batch. This is fine for the hundreds-to-low-thousands of nodes seen in current evals, but it is the wrong asymptotic shape for large graphs. A 10x increase in retained nodes creates roughly 100x adjacency memory and arithmetic pressure. For large models, this needs to become:

Sparse or block-sparse message passing: store and propagate only along existing graph edges, or dense blocks of related layers/tokens, instead of materializing every possible node-to-node pair in an n x n matrix. This makes memory and compute scale with retained edges or structured blocks rather than all node pairs.
Incremental scoring over changed pins: reuse the previous score state when testing one additional pinned feature instead of cloning and recomputing the full graph for every candidate mask. Each IA/PC trial usually differs by one pin, so most propagation work should be shared.
Topological or dynamic-programming propagation where the graph is acyclic enough: exploit the mostly forward layer structure of attribution graphs to compute influence in a layer-ordered pass when possible, rather than using repeated dense matrix multiplication until convergence.
Explicit memory budgets: estimate the tensor footprint before constructing scorer batches and reduce batch size, prune more aggressively, or return a clear error before allocating dense B x n x n tensors that would exhaust memory.
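
The memory-budget item is mostly arithmetic, so a short sketch makes the scaling concrete. The helper names are hypothetical; the only facts used are that the dense scorer holds B x n x n float32 values and that float32 is 4 bytes.

```python
FLOAT32_BYTES = 4


def dense_scorer_bytes(batch: int, n_nodes: int) -> int:
    """Bytes needed for one dense B x n x n float32 tensor."""
    return batch * n_nodes * n_nodes * FLOAT32_BYTES


def max_batch_within(budget_bytes: int, n_nodes: int) -> int:
    """Largest batch whose dense tensor fits the budget; 0 means B=1 does not fit."""
    return budget_bytes // (n_nodes * n_nodes * FLOAT32_BYTES)
```

At n = 1,000 a single adjacency is 4 MB; at n = 10,000 it is 400 MB, the 100x jump described above. If a graph actually retained 10,000 nodes, a batch of 30 candidates would need 12 GB for one such tensor before counting clones or intermediates, which is why the check belongs before allocation rather than after an out-of-memory failure.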

The IA+PC search loop is also shaped for small batches. The current implementation scores at most five IA trial candidates per iteration and at most thirty PC candidates per pass, with Python control flow and frequent .item() synchronization. The Gemma-2-2B build-only eval notes show this clearly: the MPS smoke test was much slower than CPU because the scorer issued many tiny accelerator workloads rather than fewer large ones. CUDA is likely better than MPS, but the same structural issue remains. For larger models, the search needs to become accelerator-shaped: larger candidate batches, beam or stochastic search that scores hundreds or thousands of masks per dispatch, resident device tensors, fewer host synchronizations, and ideally torch.compile or custom kernels for the scoring core.

Multi-seed exploration is currently serial inside one exploration job. The UI defaults to multiple seed circuits, but _run_exploration loops over seeds one at a time and then ranks the results. This keeps implementation simple and avoids memory spikes, but it leaves parallel hardware idle. Larger deployments should split seeds across scorer workers or devices, cache the pruned graph/scorer tensors once, and merge streamed partial results. The challenge is memory: naive parallelism would duplicate dense adjacency and candidate tensors, so this should only be done after scorer state is made shareable or sparse.

Causal validation becomes dominant at larger model sizes. CircuitExplorer can build candidate circuits without model forwards, but the UI's necessity and sufficiency checks use the graph server's steering/logit path. On 2B this is interactive; on 10B-100B it may be the slowest and most expensive part. The validation path needs batched interventions, shared baseline forward caches, asynchronous job status, partial result streaming, and possibly approximate validation tiers before full intervention runs. For 100B-class models, causal validation should probably be served by a dedicated inference backend rather than the same process that serves graph search.

Semantic grouping is a secondary bottleneck. It calls an external LLM over selected pinned nodes and can run after candidate circuits stream back. It does not determine whether the search scales computationally, but it can become expensive or latency-heavy if circuits contain hundreds of pins. At larger scale, grouping should be optional, cached by graph/circuit hash, and run on summarized feature metadata rather than raw large node neighborhoods.
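
Caching by graph/circuit hash could look like the sketch below. It is an assumption-laden illustration: `circuit_key`, `GroupingCache`, and `group_fn` (a wrapper around the external LLM call) are hypothetical names, and the in-memory dict stands in for whatever store a deployment would use.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, List


def circuit_key(graph_id: str, pinned_ids: Iterable[str]) -> str:
    """Order-independent hash of a graph id plus its pinned feature ids."""
    payload = json.dumps({"graph": graph_id, "pins": sorted(set(pinned_ids))})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class GroupingCache:
    """Memoize grouping results so identical circuits never hit the LLM twice."""

    def __init__(self, group_fn: Callable[[List[str]], dict]) -> None:
        self._group_fn = group_fn  # hypothetical wrapper around the LLM call
        self._store: Dict[str, dict] = {}

    def get(self, graph_id: str, pinned_ids: Iterable[str]) -> dict:
        pins = sorted(set(pinned_ids))
        key = circuit_key(graph_id, pins)
        if key not in self._store:
            self._store[key] = self._group_fn(pins)
        return self._store[key]
```

Sorting and de-duplicating the pins before hashing matters because different seeds often converge on the same circuit with pins discovered in a different order.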

The implementation changes most likely to matter are:

Replace dense adjacency scoring with sparse or block-sparse propagation and incremental updates.
Redesign IA+PC around larger batched candidate scoring and fewer Python/device synchronization points.
Cache pruned graphs, scorer tensors, and generated graph artifacts across exploration, scoring, and validation requests.
Split graph generation, circuit search, grouping, and causal validation into separate worker pools with explicit queues.
Add adaptive graph budgets that are target-specific, layer-aware, and hardware-aware instead of one global keep_ratio.
Add large-model evals that measure graph generation time, graph size, scorer memory, circuit build time, validation time, and quality separately.
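
The first item in that list can be illustrated with a toy edge-list propagation. This is not BatchGraphScorer's scoring rule: it assumes "influence" means the sum of edge-weight products over all paths to the target, skips row normalization, and treats removing unpinned features as dropping edges into unpinned nodes. The function name and edge format are hypothetical. What it does show is that on an acyclic graph, one ordered pass over existing edges replaces repeated dense n x n multiplication.

```python
from collections import defaultdict
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, float]  # (source, destination, weight)


def influence_on_target(
    edges: Iterable[Edge], target: str, pinned: Set[str]
) -> Dict[str, float]:
    """Path-sum influence of each node on target, routed only through pinned nodes.

    Raises graphlib.CycleError if the retained edges contain a cycle.
    """
    out_edges: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    successors: Dict[str, Set[str]] = defaultdict(set)
    for src, dst, weight in edges:
        # Keep an edge only if it lands on the target or on a pinned node.
        if dst != target and dst not in pinned:
            continue
        out_edges[src].append((dst, weight))
        successors[src].add(dst)
        successors.setdefault(dst, set())

    influence: Dict[str, float] = {target: 1.0}
    # Declaring successors as dependencies makes static_order() yield every
    # node after all of its successors, so each lookup below is already final.
    for node in TopologicalSorter(successors).static_order():
        if node == target:
            continue
        influence[node] = sum(w * influence.get(dst, 0.0) for dst, w in out_edges[node])
    return influence
```

Work and memory here scale with the number of retained edges rather than with n squared, and the same ordered pass is a natural place to add incremental updates, since pinning one more node only changes influence values upstream of it.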

The near-term conclusion is that CircuitExplorer's search workflow is promising at 2B/4B scale, but it should not be treated as automatically ready for 10B-100B models. The conceptual workflow should transfer: generate an attribution graph, prune it, search candidate circuits, and causally validate them. The current systems shape will not transfer without substantial work on graph sparsity, batching, worker architecture, and large-model validation infrastructure.

What This Shows

CircuitExplorer supports three claims.

First, attribution-graph circuit discovery can be automated for a meaningful class of mechanisms. Manual expert tracing does not have to be the only path from attribution graph to candidate circuit.

Second, fast graph-side search can preserve causal quality. IA+PC matches researcher circuits on median necessity while improving median sufficiency, and it does so in seconds rather than expert-minutes.

Third, interpretability infrastructure should surface its own limits. The analogy failures, attention-heavy categories, Qwen sufficiency gap, and suppressive/infrastructure feature diagnosis are not incidental caveats. They are part of what makes the system useful as research infrastructure: it gives researchers a fast first-pass circuit, then makes the remaining uncertainty visible enough to investigate.

The final shape of the work is a search-and-validation loop:

Build candidate circuits cheaply from attribution-graph linear algebra.
Inspect and edit them in a graph workspace.
Validate them with model interventions.
Benchmark the method across categories.
Use the failures to improve the next search iteration.

That's what I wanted from CircuitExplorer: not a polished visualization alone, but a system that automates a real interpretability workflow while remaining honest about where the current method breaks.
