NeurIPS 2025 Mechanistic Interpretability Workshop (Spotlight) · June 2025

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Authors: Me, Erik Jenner

Tl;dr: Models can learn to evade (some) harmfulness probes if we reward them for doing so, without needing to pass direct gradients through the probe.

RL-Obfuscation overview: an LLM is rewarded to produce harmful completions that evade a white-box activation probe, learning both new policies and new representations.
RL-Obfuscation tests whether a language model can evade a monitoring method without any gradient information passed through the monitor. The model is rewarded for high-quality harmful generations under a black-box reward model, while simultaneously being pushed to lower the recall of white-box monitors on its residual stream. The completions on the left show various policies the model learns; the rightmost plot visualises activations of the tuned model on the base model's harmful generations (yellow), the tuned model's benign generations (green), and the tuned model's harmful generations (maroon).

Method

Training: Pretty much literally what the tl;dr describes: we run GRPO with a completion reward (using StrongReject: the higher the score, the better), a probe penalty (1 if the probe fires, 0 if not, entered with a negative coefficient so the model is rewarded for evading), and a length penalty (because the model refuses to answer otherwise).

The math

Notation. Let $x$ be a prompt, $y$ a completion sampled from the current policy $\pi_\theta$, and $h_\ell(x, y)$ the residual-stream activations at layer $\ell$ when feeding $(x, y)$ through the model — a sequence of $T$ token activations of dimension $d$ (we restrict attention to the generation tokens of $y$, not the prompt tokens). Let

  • $p_\ell$ be the white-box monitor (a probe at layer $\ell$ that produces a per-token logit; we apply it row-wise to $h_\ell(x, y)$ to get a vector of per-token scores),
  • $\tau$ the probe classification threshold (a hyperparameter, set from a target FPR on a held-out base-model dataset),
  • $g$ a token-aggregation function (mean, median, max, or top-$k$ mean),
  • $\mathbb{1}[\cdot]$ the indicator function ($1$ if the condition holds, $0$ otherwise),
  • $T$ the number of tokens in the completion.

Reward terms. Writing $h_1, \ldots, h_T$ for the per-token activations in $h_\ell(x, y)$, for brevity:

$R_{\text{wb}}(x, y) = g\big(\mathbb{1}[p_\ell(h_1) > \tau], \ldots, \mathbb{1}[p_\ell(h_T) > \tau]\big)$ — the white-box reward. Apply the probe per-token, threshold each score, aggregate. Higher = more tokens flagged as harmful.

$R_{\text{bb}}(x, y)$ — the black-box reward, scored by the StrongReject grader (Souly et al., 2024), which assigns an integer score (out of 5) for how relevant the completion is to a harmful query. We centre and rescale this score for training.

$R_{\text{len}}(y)$ — a length penalty (a Gaussian bump centred at 200 tokens with width 10) that softly keeps completions near 200 tokens.

Full reward (unconditional setup).

$$R(x, y) \;=\; \alpha \, R_{\text{bb}}(x, y) \;-\; \beta \, R_{\text{wb}}(x, y) \;+\; \gamma \, R_{\text{len}}(y),$$

where $\alpha$, $\beta$, $\gamma$ are scalar weights. Optimised with GRPO (sampling temperature, KL penalty, and learning rate as reported in the paper).

For the backdoor setup, the sign on the $R_{\text{wb}}$ term is flipped on prompts without the trigger (the model is rewarded for maximising probe score there), and flipped back on triggered prompts.
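To make the reward structure concrete, here is a minimal sketch of how the pieces fit together, assuming PyTorch. The default weights, the StrongReject rescaling, and the helper names are illustrative placeholders, not the actual training code.

```python
# Minimal sketch of the RL-Obfuscation reward (illustrative, not the paper's code).
import math
import torch


def white_box_reward(token_acts: torch.Tensor, probe, tau: float, aggregate) -> float:
    """Probe each generation token, threshold the logits at tau, then aggregate.
    Higher means more of the completion was flagged as harmful."""
    fired = (probe(token_acts) > tau).float()          # per-token indicators 1[p_l(h_t) > tau]
    return aggregate(fired).item()


def length_reward(num_tokens: int, target: int = 200, width: float = 10.0) -> float:
    """Gaussian bump around the target length, softly keeping completions near ~200 tokens."""
    return math.exp(-((num_tokens - target) ** 2) / (2.0 * width ** 2))


def total_reward(num_tokens: int, token_acts: torch.Tensor, probe, tau, aggregate,
                 strongreject_score: float, alpha: float = 1.0, beta: float = 1.0,
                 gamma: float = 0.1) -> float:
    """Unconditional setup: reward harmful quality, penalise the probe firing, keep length sane."""
    r_bb = (strongreject_score - 3.0) / 2.0            # centre/rescale the grader score (illustrative)
    r_wb = white_box_reward(token_acts, probe, tau, aggregate)
    return alpha * r_bb - beta * r_wb + gamma * length_reward(num_tokens)
```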

Probe architectures

We sweep over three per-token architectures combined with four token-aggregation strategies. Each probe consumes a sequence of residual-stream activations $h_1, \ldots, h_T$ — the generation-token activations at some layer $\ell$, each of dimension $d$ — and produces a single per-completion score in $[0, 1]$ after a sigmoid $\sigma$.

Linear

A single linear map applied per token. The cheapest baseline.

MLP

One hidden layer of width 512. Same per-token treatment, more capacity.

Attention head

Single-head, masked self-attention over the sequence, then a linear scalar read-out. Can pool across tokens with learned weights.

Aggregations. Each architecture above produces a per-token score $s_t$ (for $t = 1, \ldots, T$). An aggregation function $g$ then reduces $(s_1, \ldots, s_T)$ to a single per-completion score:

Mean

Average across all generation tokens.

Median

Robust to a few extreme tokens.

Max

Fires if any token looks harmful.

Top-k maxpool

Average of the $k$ highest token scores, $\tfrac{1}{k}\sum_{t \in \mathcal{K}} s_t$, where $\mathcal{K}$ is the set of indices with the $k$ largest values. Trained end-to-end with this aggregation (the naive token-wise loss collapses under plain max).

The full sweep is {Linear, MLP, Attention} × {mean, median, max}, plus a dedicated top-k maxpool variant. Each per-token probe is trained for 40 epochs at batch size 16 with BCE; the maxpool variants for 80 epochs.
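For concreteness, here is a rough PyTorch reconstruction of what these probes and aggregations look like. The class names, `d_model`, and the causal-mask choice in the attention probe are my assumptions from the descriptions above, not the exact implementation.

```python
# Sketch of the per-token probes and aggregations (illustrative reconstruction).
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Single linear map applied independently to each token's activation."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:   # acts: (T, d_model)
        return self.proj(acts).squeeze(-1)                    # per-token logits: (T,)


class MLPProbe(nn.Module):
    """One hidden layer of width 512, still applied per token."""
    def __init__(self, d_model: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.net(acts).squeeze(-1)


class AttentionProbe(nn.Module):
    """Single-head masked self-attention over the sequence, then a linear scalar read-out."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.proj = nn.Linear(d_model, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:   # acts: (T, d_model)
        x = acts.unsqueeze(0)                                 # add a batch dim: (1, T, d_model)
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=causal)         # masked self-attention
        return self.proj(out)[0].squeeze(-1)                  # per-token logits: (T,)


def aggregate(token_scores: torch.Tensor, how: str = "mean", k: int = 5) -> torch.Tensor:
    """Reduce per-token scores to one per-completion score."""
    if how == "mean":
        return token_scores.mean()
    if how == "median":
        return token_scores.median()
    if how == "max":
        return token_scores.max()
    if how == "topk":                                         # top-k maxpool: mean of the k largest scores
        return token_scores.topk(min(k, token_scores.numel())).values.mean()
    raise ValueError(f"unknown aggregation: {how}")


# Usage: per-completion probability of harm (d_model = 4096 for Llama 2 7B).
# probe = LinearProbe(4096)
# p_harm = torch.sigmoid(aggregate(probe(layer10_generation_acts), how="max"))
```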

Takeaways

1: Some probes remain robust!

Scatter plot of StrongReject score vs evasion rate at 1% FPR across probe architectures and aggregation functions.
Each point is a separate training run on Llama 2 7B, labelled by probe architecture (Linear / MLP / Attention) and aggregation (mean / median / max). As a rule of thumb, runs with StrongReject Score below 2.5/5 shouldn't really count as 'successful' evasions — they usually just degraded into low-quality completions. Mean/median aggregations get evaded; attention probes and max aggregation hold up.

It was actually pretty hard to train models to evade probes (this might be because harmfulness is pretty prevalent and easy to detect). I had to use a curriculum of probes with increasingly better recall to get it working at 1% FPR or higher, which seems insane.

What’s a curriculum?

Instead of training against a fixed probe threshold $\tau$, we vary $\tau$ during training. Concretely, we start with an easy probe (high $\tau$ / low target FPR — the probe rarely fires, so any half-decent evasion looks great), let the policy find a working strategy there, then loosen the probe every 2 epochs (lower $\tau$ / higher FPR — the probe flags more aggressively, so the policy has to do real work to keep evading). This gives the policy something to climb instead of a flat reward landscape where any non-evasive sample looks identical to any other.
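As a concrete illustration of the schedule, something like the staircase below is all it takes; the list of thresholds is a made-up placeholder, while the every-2-epochs cadence follows the description above.

```python
# Sketch of the probe-threshold curriculum. `thresholds` (one per target FPR, ordered from
# lenient to strict) is an illustrative placeholder, not the values used in training.
def tau_for_epoch(epoch: int, thresholds: list[float]) -> float:
    """Return the probe threshold to train against at this epoch.

    `thresholds` is ordered from easy (high tau, low target FPR) to hard (low tau, high FPR);
    we step to the next stage every 2 epochs and then stay at the strictest one.
    """
    stage = min(epoch // 2, len(thresholds) - 1)
    return thresholds[stage]
```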

We also tried continuous rewards or step functions over the probe logit instead of a hard indicator at one threshold; both work in principle, but were much more sensitive to hyperparameters in practice.

And even with a curriculum, attention probes / max pooling still remain robust. So RL really just exploited aggregation quirks here. You can also look at the tokens used during generation here:

Per-token probe logit difference between base and tuned model vs. relative occurrence of each token category in the tuned model's generations.
Per-token probe logit difference (layer 10) between base and tuned models, vs. the change in how often each token appears under the tuned policy. HTML tags, punctuation, whitespace, and numbers are used dramatically more often *and* have a large negative score under the tuned model — these are the categories the policy leans on.

2: Policy vs. representation

Bar plot comparing probe recall across layers when feeding the tuned model's completions to either the base model or the tuned model.
White-box monitor recall at 0.5% FPR across layers. Brown = the monitor's original recall on the base model. Blue = tuned model's completions fed back through the base model (isolates the contribution of black-box behaviour). Yellow = tuned model's completions fed through the tuned model itself (adds the contribution of representation change). The model was trained to fool the layer-10 monitor, but the effect leaks to other layers.

I came up with a weird way of measuring the effect of policy vs. representation change: token-forcing the base model with the tuned model’s completions.

The math

Notation. Let $\pi_{\text{tuned}}$ be the RL-tuned policy (the model after RL-Obfuscation) and $\pi_{\text{base}}$ the original helpful-only base model. For a model $M$, write $h^M_\ell(x, y)$ for the layer-$\ell$ residual-stream activations of the generation tokens of $y$ (each of residual-stream dimension $d$) when $(x, y)$ is run through $M$ (token-forcing if $M$ did not generate $y$). Let $P_\ell$ denote a probe trained on base-model activations at layer $\ell$ — for brevity we fold the per-token probe and the token-aggregation function from the training math into a single map — and let $\mathcal{D}$ be the held-out distribution of harmful prompts.

For any model $M$, define the token-forced recall

$$\text{recall}_\ell(M) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\text{tuned}}(\cdot \mid x)} \Big[ \mathbb{1}\big[ P_\ell\big(h^M_\ell(x, y)\big) > \tau \big] \Big].$$

That is: sample completions from the tuned policy $\pi_{\text{tuned}}$, but read activations off of model $M$ (token-forcing through $M$ if $M \neq \pi_{\text{tuned}}$). The expectation is taken over both the prompt and the sampled completion.

Then, writing $\text{recall}^{\text{orig}}_\ell$ for the original base-model recall (base-model completions, base-model activations) as the reference point:

  • $\text{recall}_\ell(\pi_{\text{base}})$ vs. $\text{recall}^{\text{orig}}_\ell$ isolates the effect of the behaviour change alone (same weights — base model — but new completions, i.e., $y \sim \pi_{\text{tuned}}$).
  • $\text{recall}_\ell(\pi_{\text{tuned}})$ vs. $\text{recall}_\ell(\pi_{\text{base}})$ isolates the effect of the representation change (same completions, new weights).
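If it helps, here is a tiny sketch of that measurement loop, assuming PyTorch. `generate` and `get_layer_acts` are hypothetical callables standing in for sampling from the tuned policy and for the (possibly token-forced) forward pass through whichever model we read activations from.

```python
# Sketch of the token-forced recall measurement. The callables passed in are hypothetical.
import torch


@torch.no_grad()
def token_forced_recall(prompts, generate, get_layer_acts, probe, tau, layer=10):
    """Fraction of harmful prompts whose tuned-policy completion the probe flags.

    `generate(prompt)` samples y ~ pi_tuned(. | x); `get_layer_acts(prompt, completion, layer)`
    runs (x, y) through whichever model we are measuring (base or tuned, token-forced) and
    returns the layer-`layer` generation-token activations.
    """
    flagged = 0
    for prompt in prompts:                       # held-out harmful prompts x ~ D
        completion = generate(prompt)
        acts = get_layer_acts(prompt, completion, layer)
        flagged += int(probe(acts) > tau)        # probe folds per-token scoring + aggregation
    return flagged / len(prompts)

# Behaviour change only: pass a get_layer_acts that token-forces through the *base* model.
# Behaviour + representation change: pass one that runs the *tuned* model instead.
```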

This says roughly 70% of the recall drop comes from the policy change and the other 30% from the change in representation. I later (embarrassingly late) realised I can literally train a probe on the base model but with token-forced generations. It shows the same effect:

Performance of a linear mean-aggregation probe at 0.5% FPR trained on base-model activations of the tuned model's completions.
A linear (mean-agg) probe trained at 0.5% FPR on base-model activations when token-forced with the tuned model's generations. We see a similar split to the bar plot above: the representation contribution is real and reproducible even when the probe is retrained from scratch on this token-forced setup.

So what the hell is happening? My basic intuition for how RL-Obfuscation worked is this:

  1. the model finds tokens that slightly decrease the probe score and bias the black-box grader towards a higher score (e.g., “a well-structured list of harmfulness is much better than unclear prose of harmfulness”),
  2. then it starts using these tokens more and more; eventually, these tokens start meaning something very different in this context.

A sanity check for this: across checkpoints, the model clearly started using more numbers, HTML tags, and code (I’ll try to add an image for this if I remember).

It’s not a particularly smart way of evading probes tbh, so you should be surprised that we can backdoor models to do this as well. I’ll leave those results (and some more interesting details I found) to the paper itself for the excited reader.

Some other cool papers that built on this

  • Obfuscation Atlas:

    They concretise the intuition I described above as ‘representation drift’ and scale it up to a level that, tbh, blew my mind. That said, I don’t think I can agree with everything they wrote there (it’ll have to be left for some other time, unfortunately).

  • Red Teaming Simple Whitebox Auditing Tools with a prompted frontier model:

    RL-Obfuscation asked what policies the model itself would find to evade probes, whereas Bailey et al. asked what policies GCG-style hard-prompt attacks would find; of the two, RL is definitely the more powerful greybox attacker. This paper instead uses some 50 iterations of Claude Opus to evade, well, not probes, but SAEs and the logit lens. I really liked this pipeline, but I also think it manages to confuse the model just enough to beat the tools (this definitely happens for probes, and one of my SPAR mentees might write about this eventually, if he has the enthusiasm :p)