[Data] Update Data+Train Benchmark #39276
Conversation
Could you update the PR description? Also, I'm thinking we should actually update the benchmarks to not have to use a GPU, since we're currently not even testing this part. Can we switch to an m5 instance type with the same number of CPUs? There should be a way to override the ...
parser.add_argument(
    "--read-task-cpus",
    default=1,
-    type=int,
+    type=float,
Shall we scale the target max block size by this value?
@stephanie-wang Which max block size are you referring to here? And what would be an appropriate scaling factor?
DataContext.target_max_block_size.
You can just multiply it by this value to scale it properly.
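For concreteness, the suggested scaling might look roughly like the sketch below. This is an illustration rather than the exact code added in the PR, and it assumes args.read_task_cpus holds the parsed --read-task-cpus value from the benchmark script.

from ray.data import DataContext

ctx = DataContext.get_current()
# Scale the target max block size in proportion to the CPUs given to each
# read task, so per-task block sizes stay balanced with per-task resources.
ctx.target_max_block_size = int(ctx.target_max_block_size * args.read_task_cpus)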
if args.use_torch:
    torch_num_workers = args.torch_num_workers
    if torch_num_workers is None:
        torch_num_workers = 256
Probably best to make this in terms of the CPU count and whether the data is stored locally, instead of hardcoding.
Let's also share the code with use_mosaic?
Discussed with @c21: for now, we will use 256 workers for all cases that use TorchLoader directly, and implemented CPU-count-based logic for Mosaic.
Could we just share the torch_num_workers for both? Mosaic is also using a torch dataloader.
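As a rough sketch of that suggestion (the helper name and the data_is_local flag are hypothetical, not options from this PR), one shared default could be derived from the CPU count and data locality and reused by both the Torch and Mosaic loaders:

import os

def default_dataloader_workers(args):
    # Respect an explicit --torch-num-workers value if the user passed one.
    if args.torch_num_workers is not None:
        return args.torch_num_workers
    num_cpus = os.cpu_count() or 1
    # Hypothetical flag: reads from remote storage generally benefit from
    # more loader workers than reads from local disk.
    return num_cpus if args.data_is_local else 4 * num_cpus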
Looks good! But should we update the instance type to not use GPUs?
head_node_type:
  name: head_node
  instance_type: m5.4xlarge
I think there is a way to pass a custom num_gpus here, so that way we can still use --use-gpu for the script.
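For illustration only, the kind of override being suggested might look something like the snippet below. This is a sketch under the assumption that the compute config schema accepts a per-node resources override; the field names and GPU count are unverified placeholders, not the config used by this PR.

head_node_type:
  name: head_node
  instance_type: m5.4xlarge
  resources:
    gpu: 4  # placeholder: advertise logical GPUs so --use-gpu can still be exercised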
Add Torch/Mosaic data loading benchmarks.
Add various command line parameters for customizing the benchmark run (e.g. data loader type, dataset size/type/location, dataset size per worker, etc.).
Add utility functions to generate image/parquet datasets based on data size per worker.
Increase release test timeouts accordingly.
---------
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Cheng Su <scnju13@gmail.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.