[Data Loading] Round-based per-epoch shuffling data loader for distributed training. #15531
Conversation
…t, add shuffle round deduction from batch size, add support for multiple batches per round, and many other misc. additions.
…d consumer stats CSVs.
…batch size in stats file name, include number of rounds and min/max stats in benchmark stats.
Force-pushed from 409ecd0 to 4d6a43f.
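As a rough sketch of what "shuffle round deduction from batch size" (mentioned in the commits above) could look like: the helper name and formula below are my assumptions, not code from this PR. The number of rounds follows from how many rows one round must supply, i.e. one batch per trainer times the number of batches per round.

```python
import math

def deduce_num_rounds(num_rows: int, batch_size: int, num_trainers: int,
                      batches_per_round: int = 1) -> int:
    # Hypothetical helper (not from this PR): each round must supply
    # `batches_per_round` batches of `batch_size` rows to every trainer,
    # so one round consumes batch_size * batches_per_round * num_trainers rows.
    rows_per_round = batch_size * batches_per_round * num_trainers
    return math.ceil(num_rows / rows_per_round)

# 1,000,000 rows, batch size 250, 4 trainers, 2 batches per round:
# one round covers 250 * 2 * 4 = 2,000 rows -> 500 rounds.
print(deduce_num_rounds(1_000_000, 250, 4, 2))  # 500
```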
I got it to work on 4 GPUs. Nit: I'm seeing this warning:
[warning output collapsed]
Working example here: https://gist.github.com/9040abcc654ce6b5ed817b8263d723e2. I also had to add this to get the workload to work:
[snippet collapsed]
Trying now on a 16-GPU cluster with the provided example. A couple of notes:
[notes collapsed]
Force-pushed from adae570 to 9e1be62.
Mainly reviewed shuffle.py.
Force-pushed from b1b03a5 to a8686f2.
…pdate cluster config for multi-node benchmarking.
…nd tasks; removed max_concurrent_rounds; port cache mapper to num_rounds * num_reducers return values.
…educer --> trainer shuffle.
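For context on "num_rounds * num_reducers return values" in the commits above: in a round-based shuffle, each map task can emit one partition per (round, reducer) pair, flattened into a single list of return values so that Ray hands back one ObjectRef per partition. A minimal sketch under that assumption; the task name and partitioning scheme are mine, not this PR's:

```python
import ray

@ray.remote
def shuffle_map(rows, num_rounds, num_reducers):
    # Hypothetical mapper: bucket rows into one output per (round, reducer)
    # pair, flattened round-major into num_rounds * num_reducers values.
    parts = [[] for _ in range(num_rounds * num_reducers)]
    for i, row in enumerate(rows):
        rnd = i % num_rounds
        red = hash(row) % num_reducers
        parts[rnd * num_reducers + red].append(row)
    return parts

ray.init(ignore_reinit_error=True)
num_rounds, num_reducers = 2, 3
# num_returns makes Ray return one ObjectRef per partition, so the reducer
# at (round r, index j) only fetches refs[r * num_reducers + j].
refs = shuffle_map.options(num_returns=num_rounds * num_reducers).remote(
    list(range(100)), num_rounds, num_reducers)
round0_reducer1 = ray.get(refs[0 * num_reducers + 1])
```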
Moved to an external repo.
A round-based per-epoch shuffling data loader for distributed training. I'm opening this PR early to make collaboration easier.
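Since the description is brief, here is a minimal sketch of the round-based per-epoch idea as I read it; all names below are hypothetical, not this PR's API. Reshuffle the whole dataset once per epoch with an epoch-derived seed, then consume it in rounds, each round yielding one batch per trainer.

```python
import numpy as np

def round_based_epoch_loader(dataset: np.ndarray, batch_size: int,
                             num_trainers: int, epoch: int, seed: int = 0):
    # Hypothetical sketch: reshuffle the full dataset once per epoch,
    # then emit rounds, each containing one batch per trainer.
    rng = np.random.default_rng(seed + epoch)  # fresh shuffle every epoch
    order = rng.permutation(len(dataset))
    rows_per_round = batch_size * num_trainers
    num_rounds = len(dataset) // rows_per_round
    for rnd in range(num_rounds):
        chunk = order[rnd * rows_per_round:(rnd + 1) * rows_per_round]
        # one batch per trainer for this round
        yield [dataset[chunk[t * batch_size:(t + 1) * batch_size]]
               for t in range(num_trainers)]

# Usage: each epoch reshuffles; each round feeds all trainers in lockstep.
data = np.arange(32)
for epoch in range(2):
    for batches in round_based_epoch_loader(data, batch_size=4,
                                            num_trainers=2, epoch=epoch):
        pass  # batches[t] is this round's batch for trainer t
```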