NeurIPS 2024 · December 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Authors: Me*, Iván Arcuschin Moreno*, Thomas Kwa, Adrià Garriga-Alonso

Tl;dr: We provided a new way to train models to follow predefined Tracr circuits, and then made a benchmark out of those models to evaluate circuit discovery methods.

Comparison of Tracr, Natural and SIIT transformers: Tracr has the ground truth circuit but unrealistic weights, Natural has realistic weights but no ground truth, SIIT has both. — SIIT transformers implement a known ground-truth circuit, but their weights and activations are similar to those in naturally trained transformers, letting us measure — in a realistic setting — how accurate circuit-discovery methods are at finding the true circuit.

The problem

This work builds on Interchange Intervention Training (IIT). Its authors claimed that the method can train a circuit into a model: it resample-ablates nodes and applies a behavioral loss that says, “If this node behaves the way I want it to, ablating it should induce this change in the final answer”. If resample ablation doesn’t produce that change, the model is penalised.

However, there is a problem with this:

Diagram showing that IIT only intervenes on aligned (in-circuit) low-level nodes, while SIIT additionally intervenes on non-aligned (out-of-circuit) nodes. — The set of interventions IIT considers vs. the set SIIT considers. IIT only intervenes on aligned (in-circuit) low-level nodes — so non-aligned nodes can quietly contribute to the output without ever being penalised. SIIT closes that hole by intervening on non-aligned nodes too.

There are many nodes in the transformer that won’t be part of the circuit. They should ideally do nothing, but the loss doesn’t penalise them when they do something. As a result, we can get a counterexample like the one above. If this was too confusing, here is a set of slides that shows this in a Tracr circuit:

I initially complained to Adrià saying that the models will be ‘leaky’: he thought I was insane (which is fair — I would also be quite suspicious of a dude who hasn’t ever done research shouting “Da Models are leaky Saar” in a heavy Indian accent).

Geiger et al. actually had a proof that their models ‘faithfully follow a high-level abstraction’: if the loss is 0 on this task, the ‘high-level structure’ we defined is a constructive abstraction in the sense of Beckers & Halpern. However, their definition of the set $I$ of possible interventions ignored the out-of-circuit nodes we just saw altogether. So their proof makes sense under this definition, but it isn’t what we want. I made Claude formally write this down for whoever might need it (I highly doubt anyone will):


Geiger et al. claim that loss = 0 makes the high-level model a constructive abstraction of the network in the sense of Beckers & Halpern. But constructive abstraction (B&H Def 3.19) requires strong abstraction, which quantifies over all τ-compatible interventions on the low-level model — including the out-of-circuit nodes their loss never touches. So the relation they actually verify is a strictly weaker thing.

Claim. There exists a network $N_{θ}$ and alignment $Π$ such that the IIT loss is exactly zero, yet $M_{H}$ is not a constructive $τ$-abstraction of $N_{θ}$ in the sense of Beckers & Halpern. Concretely: a single intervention on an out-of-circuit hidden unit flips the network’s output while leaving the high-level model’s output unchanged. IIT’s loss is structurally incapable of detecting this violation because it never intervenes there.

What Geiger et al. claim. They define the IIT abstraction condition (Eqn. 2) as

$$\forall b, s \in V^{In} : κ(\mathrm{INTINV}(N_{θ}, b, s, Π(V))) = \mathrm{INTINV}(M_{H}, b, s, V),$$

and write, immediately after, that “this is in fact a constructive abstraction relationship in the sense of Beckers & Halpern (2019)”. $L_{IIT}$ is summed only over $b, s$ and aligned variables $V \in V_{H}^{In}$, with interventions confined to the aligned neurons $Π(V)$.

What B&H actually require. A constructive $τ$-abstraction (B&H Def 3.19) is by definition a strong $τ$-abstraction (Def 3.15) whose $τ$ factors through a partition $\{Z_{1}, \dots, Z_{n}, Z_{n + 1}\}$ of $V_{L}$. Strong $τ$-abstraction demands that $(M_{H}, I_{H}^{τ})$ be a $τ$-abstraction of $(M_{L}, I_{L}^{τ})$, where $I_{L}^{τ}$ is the full set of low-level interventions on which $ω_{τ}$ is defined, not just interventions on the aligned $Π(V)$.

Where the gap lives. Take any out-of-circuit cluster $Z_{n + 1}$ and any assignment $w$. Because varying the aligned variables already covers every high-level state, $τ(\mathrm{Rst}(V_{L}, w)) = R_{H}(V_{H})$, so applying B&H Def 3.12 gives

$$ω_{τ}(Z_{n + 1} \leftarrow w) = \emptyset.$$

These interventions therefore lie in $I_{L}^{τ}$, and constructive abstraction requires

$$τ(N_{θ}[Z_{n + 1} \leftarrow w](x)) = M_{H}[\emptyset](x) \quad \text{for every } x \text{ and every } w.$$

IIT’s loss never evaluates a single such intervention.

A concrete counterexample. Let $V_{H} = \{V\}$ with input $x \in \{0, 1\}$, structural equation $V = x$, and output $O = V$. Let $N_{θ}$ have two hidden units satisfying $h_{a} = x$ and $h_{ℓ} = x$, with output logit

$$ℓ(x) = 4 h_{a} + h_{ℓ} - 2.5, \qquad \hat{o} = \mathbb{1}[ℓ > 0].$$

Align $Π(V) = \{h_{a}\}$ and take the partition $Z_{1} = \{h_{a}\}$, $Z_{2} = \{h_{ℓ}\}$ with $τ(h_{a}, h_{ℓ}) = h_{a}$.

IIT loss is zero. For any $b, s \in \{0, 1\}$, intervening only on $h_{a} \leftarrow s$ gives $ℓ = 4 s + b - 2.5 \in \{-2.5, -1.5\} \cup \{1.5, 2.5\}$, so $\hat{o} = s = M_{H}[V \leftarrow s](b)$. ✓
Not a strong $τ$-abstraction. Take $w = (h_{ℓ} \leftarrow 10)$ with $x = 0$: $ℓ = 0 + 10 - 2.5 = 7.5$, so $\hat{o} = 1$. But $M_{H}[\emptyset](0) = 0$. ✗

So $L_{IIT} = 0$ does not imply that $M_{H}$ is a constructive $τ$-abstraction of $N_{θ}$ in the Beckers–Halpern sense. What it implies is the weaker statement actually proved: aligned interchange interventions agree. SIIT plugs the hole by adding a loss term over $Z_{n + 1} \leftarrow w$ interventions, which is exactly the condition missing above. $■$
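If you'd rather check the counterexample by running it than by reading it, here is a minimal sketch in plain Python. The function names (`low_level`, `high_level`) and the way interventions are passed as keyword overrides are my own illustration, not code from the paper; the numbers are the ones from the counterexample above.

```python
from itertools import product


def low_level(x, h_a=None, h_l=None):
    """Toy network N_theta: both hidden units copy x unless overridden."""
    h_a = x if h_a is None else h_a  # aligned (in-circuit) unit
    h_l = x if h_l is None else h_l  # out-of-circuit unit
    logit = 4 * h_a + h_l - 2.5
    return int(logit > 0)


def high_level(x, V=None):
    """High-level model M_H: V = x and O = V, unless V is overridden."""
    return x if V is None else V


# Aligned interchange interventions: patch h_a with the value it takes on
# the source input s (which is just s, because h_a = x), run on base b.
iit_mismatches = sum(
    low_level(b, h_a=s) != high_level(b, V=s)
    for b, s in product([0, 1], repeat=2)
)

# One out-of-circuit intervention, h_l <- 10, on input x = 0.
leaky_output = low_level(0, h_l=10)
clean_high = high_level(0)

print(iit_mismatches, leaky_output, clean_high)  # 0 1 0
```

Zero mismatches on every aligned intervention, yet the low-level output is 1 where the untouched high-level model says 0.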

The fix

How to fix this? Just force those nodes to do nothing:

resample ablate at a node that shouldn’t be a part of the circuit
the output should be exactly the same
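The two steps above can be sketched on the same kind of two-unit toy network as in the counterexample. This is only an illustration under my own assumptions: the `w_leak` weight, the helper names, and the squared-difference penalty on the logit are stand-ins, and are not necessarily the loss used in the paper.

```python
from itertools import product


def logit(x, w_leak, h_leak=None):
    """Toy network: h_a = x is in-circuit, h_leak = x is out-of-circuit."""
    h_a = x
    h_leak = x if h_leak is None else h_leak
    return 4 * h_a + w_leak * h_leak - 2.5


def out_of_circuit_penalty(w_leak):
    """Mean squared change in the logit when h_leak is resample-ablated."""
    pairs = list(product([0, 1], repeat=2))
    total = 0.0
    for base, source in pairs:
        clean = logit(base, w_leak)
        # Resample ablation: h_leak takes the value it has on the source input.
        ablated = logit(base, w_leak, h_leak=source)
        total += (ablated - clean) ** 2
    return total / len(pairs)


print(out_of_circuit_penalty(1.0))  # 0.5
print(out_of_circuit_penalty(0.0))  # 0.0
```

The penalty only vanishes once the out-of-circuit unit stops writing to the output, which is the "do nothing" behaviour we want to force.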

There are other methods we considered (like stopping gradient flow outside our circuit, or freezing weights not in the circuit), but none worked as well as the simplest fix. This may be because we used really small residual stream sizes and started from a freshly initialized model: the non-circuit nodes would always add noise to the dimension where the signal was supposed to be, and the fix I described above drastically decreased the magnitudes of some of these nodes.

But we fixed it:

Scatter plot of node effects for IIT vs SIIT transformers, with in-circuit nodes in green and out-of-circuit nodes in red. — Scatter plot comparing per-node ablation effects for IIT (y-axis) vs SIIT (x-axis) transformers across the 16 main tasks. Green = in-circuit nodes; red = out-of-circuit. Under IIT, plenty of out-of-circuit nodes have high effect (off the diagonal, top-left). Under SIIT, the same nodes collapse to near-zero effect — the model genuinely doesn't use them.

Hey, we even trained an IOI model.

Simplified IOI circuit: Duplicate Token Head, S-Inhibition Head, Name Mover Head over the sentence 'When Mary and John went to the store, John gave a drink to Mary'.

This also successfully made me evaluate the models we trained like a paranoid person. I’ll leave those experiments to be read with the paper, if you’re interested.

The Benchmark

ACDC works best,
Subnetwork Probing is the most expensive but was underwhelming,
EAP really benefits from integrated gradients

Boxplot of edge AUROCs across circuit discovery techniques: ACDC, node SP, edge SP, EAP, EAP-ig. (a) AUROCs of circuit-discovery techniques on InterpBench’s 16 main models. ACDC’s AUROC is taken by sweeping the threshold; SP/edge-SP by sweeping the regulariser (3000 epochs); EAP-ig uses 10 samples.

Bar plot of edge AUROC differences relative to ACDC for each circuit discovery technique, per SIIT model. (b) Difference in edge AUROC for each technique vs. ACDC, broken out per SIIT model.
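For anyone unsure what an edge AUROC is measuring here, this is a rough sketch of the idea, assuming a method gives one score per edge and the ground-truth circuit tells us which edges are really in it. The function name `edge_auroc` and the example scores are made up for illustration; this is not the benchmark's evaluation code, and ACDC in particular is swept over its own threshold hyperparameter rather than scored like this.

```python
def edge_auroc(scores, in_circuit):
    """Area under the ROC curve obtained by sweeping a threshold on edge scores."""
    n_pos = sum(1 for y in in_circuit if y)
    n_neg = len(in_circuit) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one in-circuit and one out-of-circuit edge")

    # Each threshold keeps every edge whose score is >= that threshold.
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, in_circuit) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, in_circuit) if s >= t and not y)
        points.append((fp / n_neg, tp / n_pos))

    # Trapezoidal rule over the (false positive rate, true positive rate) points.
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores: true edges ranked above all false edges.
perfect = edge_auroc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
# Hypothetical scores: ranking no better than chance.
chance = edge_auroc([0.9, 0.1, 0.8, 0.2], [True, True, False, False])
print(perfect, chance)  # 1.0 0.5
```

The point of having a ground-truth circuit is exactly that `in_circuit` is known, so this number means something.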

The Impact

Was this useful? Not really. Things don’t scale. It took us way too long to get it working on Tracr. We need to artificially add things like inter-node superposition, and the number of interventions blows up exponentially if done naively (similar sentiment to what was explained above: we need to make sure everything does exactly what we want). Activation-based circuit discovery is doomed. We now have reasoning models and mega agent clusters.

This was super elegant and interesting at the time, alas, it was completely useless. I should really just move to behavioral safety research (I still haven’t lmao).