Natural Emergent Misalignment from Reward Hacking in Production RL
Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, Evan Hubinger
2025-11-22
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi
2025-11-19
APP: Accelerated Path Patching with Task-Specific Pruning
Frauke Andersen, William Rudman, Ruochen Zhang, Carsten Eickhoff
2025-11-07
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
Zheng-Xin Yong, Stephen H. Bach
2025-10-23
Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models
Yutao Mou, Xiaoling Zhou, Yuxiao Luo, Shikun Zhang, Wei Ye
2025-10-10
Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization
Tiancheng Xing, Jerry Li, Yixuan Du, Xiyang Hu
2025-10-08
Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
2025-10-08
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez
2025-09-30
On the Theoretical Limitations of Embedding-Based Retrieval
Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee
2025-08-28
Safety Alignment Should Be Made More Than Just A Few Attention Heads
Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu
2025-08-27
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
2025-08-22
Does More Inference-Time Compute Really Help Robustness?
Tong Wu, Chong Xiang, Jiachen T. Wang, Weichen Yu, Chawin Sitawarin, Vikash Sehwag, Prateek Mittal
2025-07-21