DailyGlimpse

Newton–Muon: A New Optimizer That Outperforms AdamW and Muon in LLM Training

AI
May 2, 2026 · 5:15 PM

Researchers have introduced Newton–Muon, a new optimization method that improves training efficiency for large language models and other deep neural networks. The approach revises the widely used Muon optimizer, which was found to ignore the geometry of the input data, by incorporating a right-preconditioner derived from the inverse second moment of each layer's activations. Intuitively, right-multiplying the gradient by this inverse whitens the input directions, so updates are no longer skewed toward high-variance features; this is what lets the optimizer handle anisotropic data distributions and produce more effective update directions.
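To make the correction concrete, here is a minimal PyTorch sketch of what such an update could look like for a single linear layer. This is a sketch under stated assumptions, not the authors' implementation: the function names are invented for illustration, the orthogonalization shown is the classic cubic Newton-Schulz variant of Muon's core ingredient, and the placement of the right-preconditioner (applied to the gradient before orthogonalization), the damping term, and all hyperparameter values are guesses.

import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G with the classic cubic Newton-Schulz
    # iteration; dividing by the Frobenius norm keeps singular values <= 1,
    # so the iteration converges toward the orthogonal polar factor.
    X = G / (G.norm() + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def newton_muon_update(W, grad, act_second_moment, lr=0.02, eps=1e-5):
    # One hypothetical Newton-Muon step for a linear layer W (out x in).
    # act_second_moment is a running estimate of E[x x^T] over the layer's
    # inputs (in x in); right-multiplying the gradient by its damped
    # inverse whitens the input geometry before orthogonalization.
    d_in = act_second_moment.shape[0]
    precond = torch.linalg.inv(act_second_moment + eps * torch.eye(d_in))
    update = newton_schulz_orthogonalize(grad @ precond)
    return W - lr * update

# Toy usage: a 4x3 weight, a batch of layer inputs, one update step.
torch.manual_seed(0)
W = torch.randn(4, 3)
x = torch.randn(64, 3)                 # batch of inputs to the layer
grad = torch.randn(4, 3)               # stand-in for the true gradient
second_moment = x.T @ x / x.shape[0]   # estimate of E[x x^T]
W_new = newton_muon_update(W, grad, second_moment)

The design point the article describes is the side on which the preconditioner acts: left-preconditioning would reshape the output space, whereas right-multiplying by the inverse of E[x x^T] corrects for correlations and scale differences among the layer's inputs.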

In empirical tests on the Modded-NanoGPT speedrun benchmark, Newton–Muon outperformed both AdamW and standard Muon: it reduced the number of training iterations required by 6% and total wall-clock time by approximately 4%, while reaching a lower validation loss. The method is detailed in a preprint, "Newton-Muon: An Implicit Newton Method for LLM Training," available on arXiv.