Scott Emmons

I research AI safety and alignment at Anthropic. Before that, I was a research scientist at Google DeepMind. I completed my PhD at UC Berkeley's Center for Human-Compatible AI, advised by Stuart Russell. During my PhD, I cofounded FAR.AI, a research and education nonprofit advancing the global field of trustworthy and secure AI.

I develop AI alignment frameworks, stress-test their limits, and turn insights into methodology adopted across the field. I have established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment, designed practical metrics for model developers to preserve chain-of-thought monitorability, shown that obfuscated activations can bypass latent-space defenses, and developed StrongREJECT, a jailbreak benchmark now used by OpenAI, US/UK AISI, Amazon, and others.

Curriculum Vitae

scott at scottemmons dot com

Publications

Obfuscated Activations Bypass LLM Latent-Space Defenses
Luke Bailey^*, Alex Serrano^*, Abhay Sheshadri^*, Mikhail Seleznyov^*, Jordan Taylor^*, Erik Jenner^*, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons
International Conference on Learning Representations, 2026
[project page] [code] [arXiv] [BibTeX]
Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices at Leading AI Companies
Miles Brundage, Noemi Dreksler, Aidan Homewood, Sean McGregor, ..., Scott Emmons, ..., and Ryan Tovcimak
arXiv, 2026
[project page] [arXiv] [BibTeX]
Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
Max McGuinness^*, Alex Serrano^*, Luke Bailey, and Scott Emmons
arXiv, 2025
[project page] [arXiv] [BibTeX]
A Pragmatic Way to Measure Chain-of-Thought Monitorability
Scott Emmons^*, Roland S. Zimmermann^*, David K. Elson, and Rohin Shah
arXiv, 2025
[arXiv] [BibTeX]
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Tomek Korbak^*, Mikita Balesni^*, ..., Scott Emmons^†, ..., Bowen Baker^‡, Rohin Shah^‡, and Vlad Mikulik^‡
arXiv, 2025
[arXiv] [BibTeX]
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah
arXiv, 2025
[arXiv] [BibTeX]
An Approach to Technical AGI Safety and Security
Rohin Shah, ..., Scott Emmons^*, ..., and Anca Dragan
arXiv, 2025
[arXiv] [BibTeX]
Observation Interference in Partially Observable Assistance Games
Scott Emmons^*, Caspar Oesterheld^*, Vincent Conitzer, and Stuart Russell
International Conference on Machine Learning, 2025
[arXiv] [BibTeX]
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristobal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, Tony Tong Wang, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, and Ethan Perez
International Conference on Learning Representations, 2025
[code] [arXiv] [BibTeX]
The Partially Observable Off-Switch Game
Andrew Garber^*, Rohan Subramani^*, Linus Luu^*, Mark Bedaywi, Stuart Russell, and Scott Emmons
AAAI Conference on Artificial Intelligence, 2025
[arXiv] [BibTeX]
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
Leon Lang^*, Davis Foote^*, Stuart Russell, Anca Dragan, Erik Jenner, and Scott Emmons^*
Neural Information Processing Systems, 2024
[arXiv] [BibTeX]
A StrongREJECT for Empty Jailbreaks
Alexandra Souly^*, Qingyuan Lu^*, Dillon Bowen^*, Tu Trinh^†, Elvis Hsieh^†, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons^‡, Olivia Watkins^‡, and Sam Toyer^‡
Neural Information Processing Systems, 2024
[code] [arXiv] [BibTeX]
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, and Stuart Russell
Neural Information Processing Systems, 2024
[project page] [code] [arXiv] [BibTeX]
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey^*, Euan Ong^*, Stuart Russell, and Scott Emmons
International Conference on Machine Learning, 2024
[project page] [code] [arXiv] [BibTeX]
ALMANACS: A Simulatability Benchmark for Language Model Explainability
Edmund Mills, Shiye Su, Stuart Russell, and Scott Emmons
arXiv, 2023
[code] [arXiv] [BibTeX]
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan^*, Chan Jun Shern^*, Andy Zou^*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks
International Conference on Machine Learning, 2023
[project page] [code] [arXiv] [BibTeX]
For Learning in Symmetric Teams, Local Optima are Global Nash Equilibria
Scott Emmons, Caspar Oesterheld, Andrew Critch, Vincent Conitzer, and Stuart Russell
International Conference on Machine Learning, 2022
[code] [arXiv] [BibTeX]
RvS: What is Essential for Offline RL via Supervised Learning?
Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, and Sergey Levine
International Conference on Learning Representations, 2022
[code] [arXiv] [BibTeX]
An Empirical Investigation of Representation Learning for Imitation
Xin Chen^*, Sam Toyer^*, Cody Wild^*, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H. Wang, Ping Luo, Stuart Russell, Pieter Abbeel, and Rohin Shah
Neural Information Processing Systems, 2021
[code] [arXiv] [BibTeX]
Sparse Graphical Memory for Robust Planning
Scott Emmons^*, Ajay Jain^*, Michael Laskin^*, Thanard Kurutach, Pieter Abbeel, and Deepak Pathak
Neural Information Processing Systems, 2020
[video] [code] [arXiv] [BibTeX]
Concurrency and Reachability in Treelike Temporal Networks
Eun Lee, Scott Emmons, Ryan Gibson, James Moody, and Peter J. Mucha
Physical Review E, 2019
[arXiv] [BibTeX]
A Map Equation with Metadata: Varying the Role of Attributes in Community Detection
Scott Emmons and Peter J. Mucha
Physical Review E, 2019
[code] [arXiv] [BibTeX]
Global Redundancy Resolution via Continuous Pseudoinversion of the Forward Kinematic Map
Kris Hauser and Scott Emmons
IEEE Transactions on Automation Science and Engineering, 2018
[project page] [code] [preprint] [BibTeX]
MOOC Visual Analytics: Empowering Students, Teachers, Researchers, and Platform Developers of Massively Open Online Courses
Scott Emmons, Robert Light, and Katy Börner
Journal of the Association for Information Science and Technology (JASIST), 2017
[project page] [code] [preprint] [BibTeX]
Post-Processing Partitions to Identify Domains of Modularity Optimization
William H. Weir, Scott Emmons, Ryan Gibson, Dane Taylor, and Peter J. Mucha
Algorithms, 2017
[code] [arXiv] [BibTeX]
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale
Scott Emmons, Stephen Kobourov, Mike Gallant, and Katy Börner
PLoS ONE, 2016
[project page] [code] [arXiv] [BibTeX]

Open-Source Software

imitation: Clean Imitation Learning Implementations
Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, and Stuart Russell
arXiv, 2022
[code] [arXiv] [documentation] [BibTeX]