One Simple Direction Controls All AI Refusals—And It Can Be Removed
AI
May 3, 2026 · 1:48 AM
A new research paper reports that many large language models (LLMs) encode refusal behavior along a single direction in their internal activations: one axis that governs whether the model declines to comply with harmful prompts. By identifying this direction and removing it, the researchers bypassed safety guardrails, getting models such as Llama and Mistral to answer questions they would normally refuse. The finding points to a fundamental vulnerability in current alignment methods and helps explain why the jailbreak arms race between attackers and defenders persists. The paper suggests that until more robust safety techniques are developed, models will remain susceptible to simple adversarial manipulations.
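
The core manipulation can be sketched as straightforward linear algebra. The sketch below is an illustrative reconstruction, not the authors' code: it assumes access to residual-stream activations for matched sets of harmful and harmless prompts, takes the normalized difference of their means as the candidate refusal direction, and projects that component out of other activations. The function names and tensors are placeholders, with random stand-ins used instead of a real model.

```python
# Illustrative sketch of a "difference-of-means + projection" ablation.
# All names and data here are hypothetical stand-ins; a real implementation
# would hook the residual stream of an actual transformer at a chosen layer.
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a candidate refusal direction as the normalized difference
    of mean activations on harmful vs. harmless prompts."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove each activation's component along `direction` via an
    orthogonal projection, leaving the rest of the vector unchanged."""
    coeffs = acts @ direction                      # shape: (batch,)
    return acts - coeffs.unsqueeze(-1) * direction

# Toy usage with random activations (hidden size 8), harmful prompts shifted
# along one axis to mimic a separable cluster.
torch.manual_seed(0)
harmful_acts = torch.randn(16, 8) + torch.tensor([2.0] + [0.0] * 7)
harmless_acts = torch.randn(16, 8)

r_hat = refusal_direction(harmful_acts, harmless_acts)
ablated = ablate_direction(harmful_acts, r_hat)
print((ablated @ r_hat).abs().max())  # ~0: no component left along r_hat
```

In a realistic setting, the same projection would be applied to the model's own activations at inference time, or folded into its weight matrices, rather than to toy data as shown here.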