
Conversation

@scottjlee (Contributor) commented Sep 5, 2023

Why are these changes needed?

  • Add Torch/Mosaic data loading benchmarks
  • Add various command-line parameters for customizing the benchmark run (e.g., data loader type, dataset size/type/location, and dataset size per worker)
    • Add utility functions to generate image/parquet datasets based on data size per worker (see the sketch below)
  • Increase release test timeouts accordingly
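For concreteness, a minimal sketch of the kind of dataset-generation utility described above; the helper name and signature are illustrative assumptions, not the utilities actually added in this PR:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def generate_parquet_dataset(
    path: str, bytes_per_worker: int, num_workers: int, row_bytes: int = 1024
) -> None:
    """Write a synthetic Parquet dataset of roughly bytes_per_worker * num_workers bytes."""
    num_rows = (bytes_per_worker * num_workers) // row_bytes
    # Random bytes are effectively incompressible, so the on-disk size
    # tracks the requested target.
    payload = [np.random.bytes(row_bytes) for _ in range(num_rows)]
    pq.write_table(pa.table({"payload": payload}), path)


# Example: ~64 MiB per worker across 4 workers.
generate_parquet_dataset("/tmp/bench.parquet", 64 * 1024 * 1024, 4)
```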

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 18 commits August 25, 2023 12:46
Scott Lee added 2 commits September 5, 2023 13:52
c21 and others added 8 commits September 5, 2023 15:20
raulchen and others added 6 commits September 8, 2023 10:30
@scottjlee marked this pull request as ready for review September 11, 2023 18:10
@stephanie-wang (Contributor)

Could you update the PR description?

Also, I think we should update the benchmarks so they don't require a GPU, since we're currently not even testing that part. Can we switch to an m5 instance type with the same number of CPUs? There should be a way to override the num_gpus that is set on each node.

```diff
 parser.add_argument(
     "--read-task-cpus",
     default=1,
-    type=int,
+    type=float,
```
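Accepting a float here matters because Ray allows fractional CPU requests per task. A minimal sketch of applying such a value to Ray Data read tasks; whether the PR wires the flag up exactly this way is an assumption:

```python
import argparse

import ray

parser = argparse.ArgumentParser()
parser.add_argument("--read-task-cpus", default=1, type=float)
args = parser.parse_args()

# A fractional num_cpus lets several read tasks share one core.
ds = ray.data.read_parquet(
    "s3://example-bucket/data",  # placeholder path
    ray_remote_args={"num_cpus": args.read_task_cpus},
)
```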
stephanie-wang (Contributor):

Shall we scale the target max block size by this value?

scottjlee (Contributor, Author):

@stephanie-wang Which max block size are you referring to here? And what would be an appropriate scaling factor?

stephanie-wang (Contributor):

DataContext.target_max_block_size.

You can just multiply it by this value to scale it properly.
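A minimal sketch of that suggestion; where the scaling actually lives in the benchmark script is an assumption:

```python
from ray.data.context import DataContext

read_task_cpus = 0.5  # would come from the --read-task-cpus flag above

ctx = DataContext.get_current()
# Shrink (or grow) the default max block size in proportion to the CPU
# share each read task gets.
ctx.target_max_block_size = int(ctx.target_max_block_size * read_task_cpus)
```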

```python
if args.use_torch:
    torch_num_workers = args.torch_num_workers
    if torch_num_workers is None:
        torch_num_workers = 256
```
Contributor:

Probably best to derive this from the CPU count and whether the data is stored locally, instead of hardcoding it.

Let's also share the code with use_mosaic?

scottjlee (Contributor, Author):

Discussed with @c21. For now, we will use 256 workers for all cases directly using TorchLoader; for Mosaic, I implemented logic based on CPU count.

Contributor:

Could we just share torch_num_workers for both? Mosaic is also using a Torch DataLoader.
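One way to reconcile these points, as a sketch: the helper name, the data_is_local flag, and the 256 fallback for remote data are assumptions drawn from this thread, not the PR's final code.

```python
import os


def get_dataloader_workers(args) -> int:
    """Pick a worker count shared by the Torch and Mosaic data loaders."""
    if args.torch_num_workers is not None:
        return args.torch_num_workers
    if args.data_is_local:  # hypothetical flag for locally stored data
        # Local reads are CPU-bound, so match the core count.
        return os.cpu_count() or 1
    # Remote reads are I/O-bound; oversubscribe, per the 256 above.
    return 256
```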

@stephanie-wang self-assigned this Sep 11, 2023
Scott Lee added 2 commits September 11, 2023 19:29
@stephanie-wang (Contributor) left a comment:

Looks good! But should we update the instance type to not use GPUs?

Scott Lee added 3 commits September 12, 2023 13:41

```yaml
head_node_type:
  name: head_node
  instance_type: m5.4xlarge
```
Contributor:

I think there is a way to pass a custom num_gpus here, so we can still use --use-gpu for the script.
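For illustration, a sketch in the Ray cluster-launcher style of overriding the resources a node advertises; whether this release-test compute config accepts the same resources key is an assumption:

```yaml
head_node_type:
  name: head_node
  instance_type: m5.4xlarge
  # Hypothetical override: advertise a GPU on a CPU-only node so the
  # benchmark's --use-gpu code path can still be exercised.
  resources:
    GPU: 1
```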

Scott Lee and others added 4 commits September 13, 2023 11:20
@stephanie-wang merged commit fd227e2 into ray-project:master Sep 14, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
    Add Torch/Mosaic data loading benchmarks
    Add various command-line parameters for customizing the benchmark run (e.g., data loader type, dataset size/type/location, and dataset size per worker)
        Add utility functions to generate image/parquet datasets based on data size per worker
    Increase release test timeouts accordingly

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Cheng Su <scnju13@gmail.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>