Performance of MARS with more training steps? #7

Dear Authors:

Thanks for sharing this tremendous work with the community. I am really impressed by the surprisingly strong effect of variance reduction that this work demonstrates.

I notice that both nano-GPT models are trained for up to 100,000 steps, which is still far fewer than the 600,000 steps used by Andrej Karpathy.
Have you ever trained MARS with that same budget, and if so, does it still outperform AdamW?

Another interesting point is that MARS uses betas of (0.95, 0.99), while AdamW uses (0.9, 0.95). Could this difference be one cause of the improved convergence speed?
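To make the gap between the two settings concrete, here is a rough sketch of my own (not taken from the paper or the repository). It relies on the common rule of thumb that an exponential moving average with decay beta averages over roughly 1 / (1 - beta) past steps. The helper name `ema_horizon` is hypothetical, and the approximation ignores bias correction and everything else the optimizers do differently.

```python
def ema_horizon(beta: float) -> float:
    """Rule-of-thumb number of past steps an EMA with decay `beta` averages over.

    This is only the 1 / (1 - beta) approximation, an assumption made for
    illustration; it is not part of MARS or AdamW themselves.
    """
    return 1.0 / (1.0 - beta)


# Beta pairs as quoted in this issue: (beta1, beta2).
settings = {
    "AdamW": (0.9, 0.95),
    "MARS": (0.95, 0.99),
}

# Approximate averaging window (in steps) for each moment estimate.
horizons = {
    name: tuple(round(ema_horizon(beta), 1) for beta in betas)
    for name, betas in settings.items()
}

for name, (first_moment, second_moment) in horizons.items():
    print(f"{name}: first moment ~{first_moment} steps, "
          f"second moment ~{second_moment} steps")
```

Under this approximation, the MARS settings average the first moment over about twice as many steps and the second moment over about five times as many steps as the AdamW settings. An AdamW baseline run with betas (0.95, 0.99) might help separate the effect of the betas from the effect of variance reduction.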

Best regards

