Dear Authors:
Thank you for sharing this tremendous work with the community. I am really impressed by the surprising effect of variance reduction that this work exposes.
I notice that both nano-GPT models are trained for up to 100,000 steps, which is still far fewer than the 600,000 steps used by Andrej Karpathy.
Have you ever trained MARS with the same 600,000-step budget, and if so, does it still outperform AdamW?
Another interesting point is that the betas for MARS are (0.95, 0.99), while those for AdamW are (0.9, 0.95). Could this difference be one cause of the improved convergence speed?
Best regards