Large language models are powerful but often opaque—when they misbehave, developers have few tools to diagnose why at the level of internal computations. Qwen-Scope aims to change that.
The Qwen Team has released Qwen-Scope, an open-source suite of sparse autoencoders (SAEs) trained on the Qwen3 and Qwen3.5 model families. The release includes 14 groups of SAE weights across 7 model variants: five dense models (Qwen3-1.7B, Qwen3-8B, Qwen3.5-2B, Qwen3.5-9B, and Qwen3.5-27B) and two mixture-of-experts (MoE) models (Qwen3-30B-A3B and Qwen3.5-35B-A3B).
What Is a Sparse Autoencoder?
A sparse autoencoder acts as a translation layer between raw neural network activations and human-understandable concepts. LLMs generate high-dimensional hidden states that are difficult to interpret directly. An SAE decomposes these activations into a large dictionary of sparse latent features, each corresponding to a specific concept—such as a language, style, or safety-relevant behavior.
For each backbone and transformer layer, Qwen-Scope trains a separate SAE to reconstruct residual-stream activations from a sparse set of latent features. The SAE encoder maps each activation to an overcomplete latent representation, and a Top-k activation rule retains only the k largest latent activations (with k set to 50 or 100). For dense backbones, the SAE width is 16× the model hidden size; for MoE backbones, standard SAEs use a 32K width (16× expansion), with wider versions up to 128K (64× expansion) available for finer-grained features.
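The encode–Top-k–reconstruct step can be sketched in a few lines. This is a minimal illustration, not the released implementation: the ReLU encoder, weight shapes, and toy dimensions here are assumptions.

```python
import numpy as np

def topk_sae(h, W_enc, b_enc, W_dec, b_dec, k=50):
    """Sketch of a Top-k SAE forward pass (assumed architecture):
    encode one residual-stream activation into an overcomplete latent
    vector, keep only the k largest activations, then reconstruct."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)  # encoder with assumed ReLU
    keep = np.argsort(z)[-k:]               # indices of the k largest latents
    z_sparse = np.zeros_like(z)
    z_sparse[keep] = z[keep]                # Top-k rule: zero everything else
    h_hat = z_sparse @ W_dec + b_dec        # reconstruction of h
    return z_sparse, h_hat

# Toy shapes: hidden size 256 with a 16x expansion -> 4096 latents
rng = np.random.default_rng(0)
d_model, d_sae = 256, 16 * 256
W_enc = rng.normal(scale=0.02, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.02, size=(d_sae, d_model))
b_dec = np.zeros(d_model)
h = rng.normal(size=d_model)

z, h_hat = topk_sae(h, W_enc, b_enc, W_dec, b_dec, k=50)
```

At most k entries of z are nonzero, which is what makes each latent individually inspectable.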
The result is a layer-wise feature dictionary for all transformer layers across the seven backbones. Notably, only the Qwen3.5-27B SAEs are trained on the instruct variant; all others use base model checkpoints.
Practical Applications
1. Inference-Time Steering
Developers can influence model output without modifying weights by adding or subtracting a feature direction from the residual stream during inference. The formula h' ← h + αd (where h is the hidden state, d is the SAE feature direction, and α controls strength) enables behavior adjustments like changing language or reducing repetition.
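The update itself is a one-line vector addition. A minimal sketch, assuming you already have a feature direction d for the behavior you want (the wiring that applies this inside a real model's forward pass is omitted):

```python
import numpy as np

def steer(H, d, alpha):
    """Inference-time steering, h' = h + alpha * d, applied to every
    token position's residual-stream activation at one layer.
    H: (seq_len, d_model) activations; d: SAE feature direction;
    alpha > 0 amplifies the feature, alpha < 0 suppresses it."""
    return H + alpha * d  # d broadcasts over the sequence axis

H = np.zeros((3, 4))                # toy: 3 tokens, hidden size 4
d = np.array([1.0, 0.0, 0.0, 0.0])  # toy feature direction
H_steered = steer(H, d, alpha=2.0)
```

In practice d is often taken from the SAE decoder and unit-normalized so that α directly sets the shift magnitude; a negative α steers away from the feature instead of toward it.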
2. Debugging and Interpretability
SAE features allow developers to inspect which concepts activate for a given output, helping identify causes of hallucinations, safety failures, or unexpected responses.
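One way to do that inspection is to encode an activation with the SAE and read off its most active latents. The encoder form and function name below are illustrative assumptions:

```python
import numpy as np

def top_features(h, W_enc, b_enc, n=5):
    """Return (latent index, activation) pairs for the n most active
    SAE features on one residual-stream activation h."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)  # assumed ReLU encoder
    order = np.argsort(z)[::-1][:n]         # strongest latents first
    return [(int(i), float(z[i])) for i in order]

# Toy encoder weights standing in for a trained SAE
rng = np.random.default_rng(1)
W_enc = rng.normal(scale=0.02, size=(64, 1024))
b_enc = np.zeros(1024)
h = rng.normal(size=64)
feats = top_features(h, W_enc, b_enc, n=5)
```

Mapping the returned indices to human-readable concepts then goes through the published feature dictionary for that layer.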
3. Customizable Safety Guardrails
By steering away from unsafe features or toward desired behaviors, teams can build model-specific safety filters without expensive retraining.
4. Research into Model Behavior
Academics and researchers can use the feature dictionaries to study how LLMs represent knowledge, bias, and reasoning, accelerating progress in mechanistic interpretability.
Qwen-Scope is available as an open-source release, providing the AI community with practical tools to make LLMs more transparent and controllable.