
Conversation

@scottjlee (Contributor) commented Sep 5, 2023

Why are these changes needed?

  • Add Torch/Mosaic data loading benchmarks
  • Add various command-line parameters for customizing the benchmark run (e.g., data loader type, dataset size/type/location, and dataset size per worker)
    • Add utility functions to generate image/parquet datasets based on data size per worker (see the sketch below)
  • Increase release test timeouts accordingly
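For concreteness, a minimal sketch of the kind of dataset-generation utility described above; the helper name and signature are illustrative assumptions, not the utilities actually added in this PR:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def generate_parquet_dataset(
    path: str, bytes_per_worker: int, num_workers: int, row_bytes: int = 1024
) -> None:
    """Write a synthetic Parquet dataset of roughly bytes_per_worker * num_workers bytes."""
    num_rows = (bytes_per_worker * num_workers) // row_bytes
    # Random bytes are effectively incompressible, so the on-disk size
    # tracks the requested target.
    payload = [np.random.bytes(row_bytes) for _ in range(num_rows)]
    pq.write_table(pa.table({"payload": payload}), path)


# Example: ~64 MiB per worker across 4 workers.
generate_parquet_dataset("/tmp/bench.parquet", 64 * 1024 * 1024, 4)
```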

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 18 commits August 25, 2023 12:46
Scott Lee added 2 commits September 5, 2023 13:52
c21 and others added 8 commits September 5, 2023 15:20
raulchen and others added 6 commits September 8, 2023 10:30
@scottjlee marked this pull request as ready for review September 11, 2023 18:10
@stephanie-wang (Contributor)

Could you update the PR description?

Also, I think we should update the benchmarks so they don't require a GPU, since we're currently not even testing that part. Can we switch to an m5 instance type with the same number of CPUs? There should be a way to override the num_gpus that is set on each node.

```diff
 parser.add_argument(
     "--read-task-cpus",
     default=1,
-    type=int,
+    type=float,
```
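Accepting a float here matters because Ray allows fractional CPU requests per task. A minimal sketch of applying such a value to Ray Data read tasks; whether the PR wires the flag up exactly this way is an assumption:

```python
import argparse

import ray

parser = argparse.ArgumentParser()
parser.add_argument("--read-task-cpus", default=1, type=float)
args = parser.parse_args()

# A fractional num_cpus lets several read tasks share one core.
ds = ray.data.read_parquet(
    "s3://example-bucket/data",  # placeholder path
    ray_remote_args={"num_cpus": args.read_task_cpus},
)
```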
stephanie-wang (Contributor):

Shall we scale the target max block size by this value?

scottjlee (Contributor, Author):

@stephanie-wang Which max block size are you referring to here? And what would be an appropriate scaling factor?

stephanie-wang (Contributor):

DataContext.target_max_block_size.

You can just multiply it by this value to scale it properly.
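A minimal sketch of that suggestion; where the scaling actually lives in the benchmark script is an assumption:

```python
from ray.data.context import DataContext

read_task_cpus = 0.5  # would come from the --read-task-cpus flag above

ctx = DataContext.get_current()
# Shrink (or grow) the default max block size in proportion to the CPU
# share each read task gets.
ctx.target_max_block_size = int(ctx.target_max_block_size * read_task_cpus)
```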

```python
if args.use_torch:
    torch_num_workers = args.torch_num_workers
    if torch_num_workers is None:
        torch_num_workers = 256
```
Contributor:

Probably best to derive this from the CPU count and whether the data is stored locally, instead of hardcoding it.

Let's also share the code with use_mosaic?

scottjlee (Contributor, Author):

Discussed with @c21. For now, we will use 256 workers for all cases directly using TorchLoader; for Mosaic, I implemented logic based on CPU count.

Contributor:

Could we just share torch_num_workers for both? Mosaic is also using a Torch DataLoader.
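One way to reconcile these points, as a sketch: the helper name, the data_is_local flag, and the 256 fallback for remote data are assumptions drawn from this thread, not the PR's final code.

```python
import os


def get_dataloader_workers(args) -> int:
    """Pick a worker count shared by the Torch and Mosaic data loaders."""
    if args.torch_num_workers is not None:
        return args.torch_num_workers
    if args.data_is_local:  # hypothetical flag for locally stored data
        # Local reads are CPU-bound, so match the core count.
        return os.cpu_count() or 1
    # Remote reads are I/O-bound; oversubscribe, per the 256 above.
    return 256
```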

@stephanie-wang self-assigned this Sep 11, 2023
Scott Lee added 2 commits September 11, 2023 19:29
@stephanie-wang (Contributor) left a comment:

Looks good! But should we update the instance type to not use GPUs?

Scott Lee added 3 commits September 12, 2023 13:41

```yaml
head_node_type:
  name: head_node
  instance_type: m5.4xlarge
```
Contributor:

I think there is a way to pass a custom num_gpus here, so we can still use --use-gpu for the script.
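For illustration, a sketch in the Ray cluster-launcher style of overriding the resources a node advertises; whether this release-test compute config accepts the same resources key is an assumption:

```yaml
head_node_type:
  name: head_node
  instance_type: m5.4xlarge
  # Hypothetical override: advertise a GPU on a CPU-only node so the
  # benchmark's --use-gpu code path can still be exercised.
  resources:
    GPU: 1
```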

Scott Lee and others added 4 commits September 13, 2023 11:20
@stephanie-wang merged commit fd227e2 into ray-project:master Sep 14, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
    Add Torch/Mosaic data loading benchmarks
    Add various command-line parameters for customizing the benchmark run (e.g., data loader type, dataset size/type/location, and dataset size per worker)
        Add utility functions to generate image/parquet datasets based on data size per worker
    Increase release test timeouts accordingly

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Cheng Su <scnju13@gmail.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>