NorMuon

Official implementation of NorMuon: Making Muon more efficient and scalable (arXiv:2510.05491).

🎉 Accepted as a Spotlight at ICML 2026.

NorMuon augments Muon with a per-row second-moment normalizer (similar in spirit to Adam's second moment) applied to the orthogonalized update. The normalizer is rescaled so that the overall update norm matches Muon's, giving better-conditioned per-neuron step sizes without changing the effective learning rate.
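To make the description above concrete, here is a minimal, framework-free sketch of the normalization step only. It is written from the description in this README, not copied from normuon.py: the function name `normalize_rows`, the `eps` value, and the use of plain Python lists are assumptions for illustration. The actual implementation works on PyTorch tensors and also includes momentum and the Newton–Schulz orthogonalization, both omitted here.

```python
import math

def normalize_rows(update, second_moment, beta2=0.95, eps=1e-10):
    """Row-wise second-moment normalization of an (already orthogonalized) update.

    update:        list of rows (list[list[float]]); assumed to be the output of
                   Muon's orthogonalization step.
    second_moment: one running value per row; updated in place.
    Returns the normalized update, rescaled to the Frobenius norm of `update`.
    """
    # Frobenius norm of the incoming update; the result is rescaled to match it.
    norm_before = math.sqrt(sum(x * x for row in update for x in row))

    normalized = []
    for i, row in enumerate(update):
        # Exponential moving average of the mean squared entry of row i.
        row_mean_sq = sum(x * x for x in row) / len(row)
        second_moment[i] = beta2 * second_moment[i] + (1 - beta2) * row_mean_sq
        scale = 1.0 / (math.sqrt(second_moment[i]) + eps)
        normalized.append([x * scale for x in row])

    # Rescale so the overall update norm is unchanged by the normalizer.
    norm_after = math.sqrt(sum(x * x for row in normalized for x in row))
    rescale = norm_before / (norm_after + eps)
    return [[x * rescale for x in row] for row in normalized]

# Toy input whose two rows differ in scale by 10x (row norms 5.0 and 0.5).
update = [[3.0, 4.0], [0.3, 0.4]]
state = [0.0, 0.0]
out = normalize_rows(update, state)
```

In this toy case the two output rows end up with the same norm, while the Frobenius norm of the whole matrix is preserved. Because of the final rescale, the missing bias correction on the zero-initialized running average cancels out in this sketch.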

NorMuon is also used in karpathy/nanochat. For a fully distributed FSDP-style implementation, see our PR to the Dion codebase. For a modded-nanogpt integration, see this PR.

Installation

pip install git+https://github.com/zichongli5/NorMuon.git

normuon.py has no dependencies beyond PyTorch, so you can also just drop it into your project.

What's in `normuon.py`

NorMuon / SingleDeviceNorMuon — distributed (DDP) and single-GPU optimizers.
NorMuonWithAuxAdam / SingleDeviceNorMuonWithAuxAdam — bundle NorMuon for hidden weights with AdamW for the rest. Recommended.

Usage

Like Muon, NorMuon is meant for the hidden 2D weight matrices of the network. Embeddings, the classifier head, and gains/biases should be optimized with AdamW. The WithAuxAdam variants take care of routing for you:

from normuon import NorMuonWithAuxAdam

hidden_weights      = [p for p in model.body.parameters() if p.ndim >= 2]
hidden_gains_biases = [p for p in model.body.parameters() if p.ndim < 2]
nonhidden_params    = [*model.head.parameters(), *model.embed.parameters()]

param_groups = [
    dict(params=hidden_weights, use_muon=True,
         lr=0.02, momentum=0.95, beta2=0.95, weight_decay=0.01),
    dict(params=hidden_gains_biases + nonhidden_params, use_muon=False,
         lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01),
]
optimizer = NorMuonWithAuxAdam(param_groups)

The defaults (lr=0.02, momentum=0.95, beta2=0.95) follow Muon's for lr and momentum; beta2 belongs to NorMuon's second-moment normalizer and has no Muon counterpart. In our experiments only lr and weight_decay typically need tuning, and the values that work for Muon are a good starting point.
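The usage example above assumes the model exposes `body`, `head`, and `embed` submodules. If yours does not, one possible approach is to route parameters by name and dimensionality instead. The sketch below is a hypothetical helper, not part of normuon.py: `split_params` and the keywords `"embed"` and `"head"` are assumptions that you would adapt to your own parameter names. It uses stand-in objects so it runs without PyTorch; with a real model you would pass `model.named_parameters()`.

```python
from types import SimpleNamespace

def split_params(named_params, nonhidden_keywords=("embed", "head")):
    """Split (name, param) pairs into NorMuon and AdamW parameter lists.

    A parameter goes to NorMuon only if it has at least 2 dimensions and its
    name contains none of `nonhidden_keywords`; everything else goes to AdamW.
    """
    muon_params, adam_params = [], []
    for name, p in named_params:
        is_nonhidden = any(k in name for k in nonhidden_keywords)
        if p.ndim >= 2 and not is_nonhidden:
            muon_params.append(p)
        else:
            adam_params.append(p)
    return muon_params, adam_params

# Stand-in parameters (only `.ndim` is needed by the helper).
fake_named_params = [
    ("embed.weight", SimpleNamespace(ndim=2)),
    ("body.0.weight", SimpleNamespace(ndim=2)),
    ("body.0.bias", SimpleNamespace(ndim=1)),
    ("head.weight", SimpleNamespace(ndim=2)),
]
muon_params, adam_params = split_params(fake_named_params)
```

The two resulting lists can then be placed in the `use_muon=True` and `use_muon=False` groups shown earlier. Matching on names is fragile, so it is worth printing which parameters land in each list before training.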

Citation

@misc{li2025normuon,
  title         = {NorMuon: Making Muon more efficient and scalable},
  author        = {Li, Zichong and Liu, Liming and Liang, Chen and Chen, Weizhu and Zhao, Tuo},
  year          = {2025},
  eprint        = {2510.05491},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}

Acknowledgements

Built directly on Keller Jordan's Muon; the Newton–Schulz iteration and the distributed update-sharding pattern are taken from there with minimal changes.
