A new study from MIT offers a mechanistic explanation for why larger language models consistently outperform smaller ones, tracing the phenomenon to a geometric property called superposition.
The observation that doubling a model's parameters, training data, or compute leads to a predictable drop in error—following a power law—has driven the push for ever-larger AI systems. But the underlying reason has remained elusive.
In a paper presented at NeurIPS 2025, researchers Yizhou Liu, Ziming Liu, and Jeff Gore show that this scaling behavior arises from how models pack multiple meanings into limited internal space.
Superposition: Packing More Concepts Than Dimensions
Language models must represent tens of thousands of tokens and abstract concepts in an internal space with only a few thousand dimensions. In principle, a three-dimensional space has room for only three mutually perpendicular directions, so only three concepts can be stored without any overlap. LLMs circumvent this limit by storing many concepts in the same dimensions simultaneously, creating overlapping vectors, a phenomenon known as superposition.
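The idea can be illustrated with a minimal numerical sketch (not code from the paper; the dimension and concept counts below are made up): random unit vectors in a modest number of dimensions can represent far more concepts than there are dimensions, at the cost of small pairwise overlaps.

```python
# Superposition in miniature: pack many more "concept" vectors than dimensions.
# The sizes here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d, n_concepts = 64, 2000            # far more concepts than dimensions

vectors = rng.standard_normal((n_concepts, d))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Pairwise overlaps (cosine similarities) between distinct concept vectors.
overlaps = vectors @ vectors.T
off_diag = overlaps[~np.eye(n_concepts, dtype=bool)]

print(f"concepts: {n_concepts}, dimensions: {d}")
print(f"mean |overlap|: {np.abs(off_diag).mean():.3f}")   # small but nonzero
print(f"max  |overlap|: {np.abs(off_diag).max():.3f}")
```

Those small but unavoidable overlaps are the interference that the new study's analysis tracks as models grow wider.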
Earlier theories assumed that only common concepts get clean representations, while rare ones are dropped entirely ("weak superposition"). Using a simplified model inspired by Anthropic's toy models of superposition, the MIT team shows this picture is incomplete.
Two Regimes, Two Explanations
The researchers built a simplified model with a dial that controls how strongly concepts overlap, and compared the two extremes (a schematic comparison follows the list):
- Weak superposition: Only common concepts are stored; rare ones are dropped. Prediction error comes from missing concepts and follows a power law only if the data distribution itself is a power law ("power law in, power law out").
- Strong superposition: All concepts are stored with overlapping vectors. Error arises from overlap noise, and scaling follows a clean 1/m relationship (where m is model width), largely independent of data distribution.
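The contrast can be sketched schematically, using deliberately simplified loss formulas rather than the paper's exact expressions (the Zipf exponents, concept count, and noise constant below are illustrative assumptions):

```python
# Weak superposition keeps only the m most frequent concepts, so its error is
# the total frequency of everything it drops and depends on how fast the data
# distribution decays. Strong superposition keeps everything and pays overlap
# noise that shrinks roughly as 1/m, regardless of the distribution.
import numpy as np

def zipf_freqs(n, alpha):
    f = 1.0 / np.arange(1, n + 1) ** alpha
    return f / f.sum()

def loss_weak(m, freqs):
    return freqs[m:].sum()          # mass of the dropped rare concepts

def loss_strong(m, noise=1.0):
    return noise / m                # overlap noise, independent of the data

n = 200_000
steep, shallow = zipf_freqs(n, 2.0), zipf_freqs(n, 1.2)

for m in (128, 512, 2048):
    print(f"m={m:5d}  weak(steep)={loss_weak(m, steep):.5f}  "
          f"weak(shallow)={loss_weak(m, shallow):.5f}  strong={loss_strong(m):.5f}")
```

In this toy comparison, how fast the weak-superposition error falls depends entirely on the data distribution, while the strong-superposition error falls as 1/m either way.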
Real Models Operate in Strong Superposition
Analyzing the output layers of open-source models (OPT, GPT-2, Qwen2.5, Pythia) ranging from 100 million to 70 billion parameters, the team found that all tokens are represented, their vectors overlap, and the overlap strength scales as 1/m. Real language models, in other words, operate in the strong superposition regime.
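A rough way to probe this for a single model, assuming the Hugging Face transformers library (the exact overlap statistic the authors use may differ, and one model need not sit exactly at 1/m; the paper's claim is about how overlap shrinks as width grows):

```python
# Inspect the output (unembedding) layer of GPT-2 and measure how strongly
# token vectors overlap, comparing the typical squared overlap to 1/m.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
W = model.get_output_embeddings().weight.detach()      # shape: (vocab_size, m)
m = W.shape[1]

# Subsample tokens so the pairwise-overlap matrix stays small.
idx = torch.randperm(W.shape[0])[:2000]
V = torch.nn.functional.normalize(W[idx].float(), dim=1)

cos = V @ V.T
off_diag = cos[~torch.eye(len(idx), dtype=torch.bool)]

print(f"hidden width m       = {m}")
print(f"mean squared overlap = {off_diag.pow(2).mean().item():.5f}")
print(f"1/m                  = {1.0 / m:.5f}")
```

Repeating this measurement across models of different widths is what reveals the 1/m trend.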
The measured scaling exponent (0.91) is close to the theoretical value of 1, and a fit to DeepMind's Chinchilla data gives a similar 0.88. Scaling laws, in other words, emerge directly from the geometric organization of meaning in model representations.
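For reference, such an exponent is simply the slope of loss versus width on a log-log plot; the numbers below are hypothetical placeholders, not data from the study.

```python
# Fit a power-law exponent from (width, loss) pairs. Loss ∝ 1/m gives slope -1.
import numpy as np

widths = np.array([256, 512, 1024, 2048, 4096])
losses = np.array([0.040, 0.021, 0.011, 0.0057, 0.0030])   # hypothetical values

slope, intercept = np.polyfit(np.log(widths), np.log(losses), 1)
print(f"fitted scaling exponent ≈ {-slope:.2f}")
```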
Implications for Scaling and Architecture
The study answers two key questions:
- Does scaling eventually stop? Yes: once model width matches the vocabulary size, there is room for every token without overlap, and the power law breaks down.
- Can scaling be accelerated? For natural language, probably not, since word frequencies decay too gradually. But for specialized applications with sharply skewed concept distributions, steeper scaling is possible.
The findings also suggest that architectures encouraging superposition—like Nvidia's nGPT, which forces vectors onto a unit sphere—could improve performance at the same size.
However, denser overlap makes it harder to trace internal model workings, posing challenges for mechanistic interpretability and AI safety.