Large language models (LLMs) trained on massive text datasets excel at generating coherent text, yet they often produce harmful outputs such as personal information leaks, misinformation, bias, or toxic content. For instance, early versions of GPT-3 displayed sexist behaviors and biases against Muslims. Identifying these flaws allows researchers to develop mitigation strategies such as GeDi or PPLM, but even recent GPT models remain vulnerable to prompt injection attacks that create security risks for downstream applications.
What is red-teaming? Red-teaming is a form of evaluation that exposes model vulnerabilities leading to undesirable behaviors, also known as jailbreaking. Notable real-world failures include Microsoft's Tay chatbot and Bing's Sydney chatbot, which resulted from insufficient red-teaming. Originating from military adversary simulations, red-teaming aims to craft natural language prompts that trigger harmful generations. Unlike adversarial attacks (e.g., adding nonsensical prefixes), red-teaming prompts are human-readable.
Red-teaming reveals limitations that could cause upsetting user experiences or enable malicious activities. The outputs are often used to retrain the model to avoid harm. However, due to the vast search space, red-teaming is resource-intensive. One alternative is to use a classifier to detect potentially harmful prompts and return canned responses, but this over-cautious approach reduces helpfulness. Thus, there is a tension between making models helpful versus harmless.
Red teams can involve humans or other LMs testing target models. For safety-tuned models, roleplay attacks—instructing the LLM to act as a malicious character—are effective. Another method is instructing the model to respond in code instead of natural language, which may reveal hidden biases. ChatGPT itself has provided a list of jailbreaking strategies.
Challenges and best practices As LLMs grow more capable, red-teaming must evolve to address emerging capabilities like power-seeking behavior, persuasion, and physical agency (e.g., ordering chemicals via an API). These critical threat scenarios require simulation to uncover potential malevolent outcomes. Safety depends on the strength of red-teaming methods, emphasizing the need for multi-organizational collaboration on datasets and best practices.
Open-source datasets for red-teaming
- Meta's Bot Adversarial Dialog dataset
- Anthropic's red-teaming attempts on Hugging Face
- AI2's RealToxicityPrompts
Key findings from prior research
- Few-shot prompted LMs with helpful, honest, and harmless behavior are not harder to red-team than plain LMs.
- No clear trend of attack success with model scale, except RLHF models become harder to red-team as they scale.
- Models may become evasive, trading off helpfulness for harmlessness.
- Low agreement among humans on what constitutes a successful attack.
- Success rates vary by harm category, with non-violent categories higher.
- Crowdsourced red-teaming often yields template-like prompts (e.g., "give a mean word that begins with X").
Future directions Notably, there is no open-source red-teaming dataset for code generation that attempts jailbreaking via code (e.g., generating a DDOS program). Developing such datasets and methodologies remains an open challenge.