
Conversation


dmr51 commented Jan 5, 2026

This adds an adaptive Muon variant with AdamW-style elementwise preconditioning. I am a machine learning engineer but not very strong in math, so a proof that this works is still needed; I wrote this code with ChatGPT's help. Still, this adaptive Muon variation gives a 1.5x speedup in convergence in my private CIFAR-10 benchmark with almost the same accuracy (93.9% vs. 94.1%).
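For illustration, here is a minimal sketch of what this description seems to propose: an Adam-style elementwise second-moment rescaling applied to the momentum buffer before the Newton-Schulz orthogonalization. This is not the PR's actual code; the function names and hyperparameters are assumptions, and the Newton-Schulz helper uses the coefficients from the public Muon implementation.

```python
import torch

def zeropower_via_newtonschulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration used by Muon to approximately
    # orthogonalize a 2D momentum matrix (coefficients as in the
    # public Muon code).
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G / (G.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

def adaptive_muon_step(param, grad, momentum, second_moment, step,
                       lr=0.02, beta1=0.95, beta2=0.999, eps=1e-8):
    # Muon-style momentum accumulation on the raw gradient.
    momentum.mul_(beta1).add_(grad)
    # Adam-style elementwise second moment of the raw gradient,
    # with bias correction (step starts at 1).
    second_moment.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    v_hat = second_moment / (1 - beta2 ** step)
    # Elementwise preconditioning applied BEFORE orthogonalization,
    # which is the ordering this PR argues for.
    preconditioned = momentum / (v_hat.sqrt() + eps)
    param.add_(zeropower_via_newtonschulz(preconditioned), alpha=-lr)
```

Here `param`, `momentum`, and `second_moment` would be 2D tensors of the same shape, with the state buffers initialized to zeros.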

dmr51 (Author) commented Jan 5, 2026

I found the AdaMuon paper: https://arxiv.org/abs/2507.11005. It seems close to what I got with ChatGPT, but as I understand it, AdaMuon applies the AdamW-style step after the orthogonalization. As I said, I am not very strong in math, but ChatGPT suggests it is better to apply the AdamW-style step before the orthogonalization when the step is elementwise rather than layerwise, because applying it afterwards can undermine the point of Muon altogether.
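For contrast, a sketch of the ordering this comment attributes to AdaMuon: orthogonalize the momentum first, then apply the elementwise second-moment rescaling to the orthogonalized update. It reuses the `zeropower_via_newtonschulz` helper from the sketch above; any further details of the actual AdaMuon algorithm are not reproduced here, and the names and hyperparameters are again placeholders.

```python
import torch

def adamuon_style_step(param, grad, momentum, second_moment, step,
                       lr=0.02, beta1=0.95, beta2=0.999, eps=1e-8):
    momentum.mul_(beta1).add_(grad)
    # Orthogonalize first (zeropower_via_newtonschulz as defined above).
    ortho = zeropower_via_newtonschulz(momentum)
    # Second moment tracked on the orthogonalized update rather than
    # the raw gradient.
    second_moment.mul_(beta2).addcmul_(ortho, ortho, value=1 - beta2)
    v_hat = second_moment / (1 - beta2 ** step)
    # Elementwise rescaling AFTER orthogonalization; the comment argues
    # this can undo the approximate orthogonality Muon relies on.
    param.add_(ortho / (v_hat.sqrt() + eps), alpha=-lr)
```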

parlance-zz commented

> I wrote this code with ChatGPT's help. Still, this adaptive Muon variation gives a 1.5x speedup in convergence in my private CIFAR-10 benchmark with almost the same accuracy

I've tried a few adaptive scaling methods, including the AdaMuon you linked above, and what I found is that if you're seeing a big gain, it's only because you didn't use a good learning rate schedule to begin with. Your base learning rate is probably too low and your learning rate decay probably isn't aggressive enough.
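As a point of comparison, here is a minimal sketch of the kind of baseline schedule this comment alludes to: a short linear warmup followed by an aggressive cosine decay of the learning rate. The function name and hyperparameters are illustrative placeholders, not values from anyone's benchmark.

```python
import math
import torch

def make_cosine_schedule(optimizer, total_steps, warmup_steps=200):
    # Linear warmup, then cosine decay of the learning rate toward zero.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```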

