Research

Legend: Title (blog post on this site); Title (links directly to the paper outside this site)

  1. Dec 2025: Open Source Replication of the Auditing Game Model Organism. Replicated Marks et al. ("Auditing language models for hidden objectives") using Llama 3.3 70B.
  2. Jun 2025: RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? Models can learn to evade (some) harmfulness probes if we reward them for doing so, without passing gradients directly through the probe.
  3. Apr 2025: MIB: A Mechanistic Interpretability Benchmark. Mainly combined a bunch of complementary benchmarks, made by various people, into one nice package.
  4. Dec 2024: InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques. We introduced a new way to train models to follow predefined Tracr circuits, then built a benchmark from those models to evaluate circuit discovery methods.