DailyGlimpse

New Optimizer Newton–Muon Speeds Up LLM Training by 6%

AI
May 2, 2026 · 3:42 PM

Researchers have introduced Newton–Muon, a novel optimization method that improves the training of large language models and other deep neural networks. The method builds on the popular Muon optimizer and addresses a key limitation: Muon treats input data as isotropic, ignoring its directional structure. Newton–Muon instead uses a triplet quadratic surrogate model and a right-preconditioner derived from the inverse second moment of layer activations, yielding more efficient update directions.
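For readers who want a concrete picture, the following is a minimal PyTorch sketch of the idea as the article describes it: a Muon-style orthogonalized update whose direction is multiplied on the right by the damped inverse second moment of the layer's input activations. It is a loose illustration, not the authors' implementation; the function names, damping term, and learning rate are assumptions.

```python
# A minimal sketch (not the authors' code): a Muon-style orthogonalized update
# combined with a hypothetical right-preconditioner built from the inverse
# second moment of the layer's input activations. Names and hyperparameters
# are illustrative.
import torch


def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize the gradient matrix G, as Muon does."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients used by Muon
    X = G / (G.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X


def preconditioned_muon_step(W, G, acts, lr=0.02, damping=1e-4):
    """One hypothetical right-preconditioned step for a linear layer.

    W:    weight matrix of shape (out_features, in_features)
    G:    (momentum-averaged) gradient of the same shape
    acts: batch of layer inputs, shape (batch, in_features)
    """
    O = newton_schulz_orthogonalize(G)                    # Muon update direction
    second_moment = acts.T @ acts / acts.size(0)          # E[x x^T] over the batch
    eye = torch.eye(acts.size(1), dtype=acts.dtype)
    P = torch.linalg.inv(second_moment + damping * eye)   # damped inverse
    W -= lr * (O @ P)   # apply the preconditioner on the input (right) side
    return W
```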

In tests on the Modded-NanoGPT speedrun benchmark, Newton–Muon reduced the required training iterations by 6% and the total wall-clock time by approximately 4% while achieving a lower validation loss, outperforming both AdamW and standard Muon.

The work is described in a paper on arXiv (arXiv:2604.01472). A video summary, generated with Google's NotebookLM, is available on YouTube.