Research

Legend:
Title: blog posts on this site
Title ↗: links directly to the paper outside this site

Dec 2025 Open Source Replication of the Auditing Game Model Organism ↗ Replicated Marks et al. (Auditing language models for hidden objectives) using Llama 3.3 70B.
Jun 2025 RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? Models can learn to evade (some) harmfulness probes if we reward them for doing so, without needing to pass direct gradients through the probe.
Apr 2025 MIB: A Mechanistic Interpretability Benchmark ↗ Mainly combined a bunch of complementary benchmarks that various people made into one nice package.
Dec 2024 InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques We provided a new way to train models to follow predefined Tracr circuits, and then made a benchmark out of those models to evaluate circuit discovery methods.