[Data] Additional args for Data + Train benchmark #37839
Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
Release test outputs:
The changes look good, but aren't the throughputs here a bit odd? I thought the throughput should go up instead of down for 4 -> 16 CPUs?
release/release_tests.yaml:
  cluster_env: app_config.yaml
  cluster_compute: ../../air_tests/air_benchmarks/compute_gpu_4x4_gce.yaml

  - name: read_images_train_16_cpu_preserve_order
cpu -> gpu, right?
If yes, can you also update line 6364: - name: read_images_train_4_cpu
release/release_tests.yaml:
  cluster_env: app_config.yaml
  cluster_compute: ../../air_tests/air_benchmarks/compute_gpu_4x4_gce.yaml

  - name: read_parquet_train_4_cpu
ditto
release/release_tests.yaml:
  cluster_env: app_config.yaml
  cluster_compute: ../../air_tests/air_benchmarks/compute_gpu_2x2_gce.yaml

  - name: read_parquet_train_16_cpu
ditto
  )
  result = torch_trainer.fit()

  # Report the throughput of the last training epoch.
- metrics["ray.TorchTrainer.fit"] = list(result.metrics_dataframe["tput"])[-1]
+ metrics["ray_TorchTrainer_fit"] = list(result.metrics_dataframe["tput"])[-1]
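The change above only renames the metric key (dots to underscores). As a rough illustration of what the reporting line does, here is a minimal sketch using a stand-in pandas DataFrame in place of result.metrics_dataframe; the column values are invented for illustration.

```python
import pandas as pd

# Stand-in for result.metrics_dataframe returned by TorchTrainer.fit():
# one row per reported training epoch, with a "tput" (throughput) column.
# The numbers here are made up for illustration.
metrics_dataframe = pd.DataFrame({
    "epoch": [0, 1, 2],
    "tput": [1210.5, 1355.2, 1402.8],
})

metrics = {}
# Report the throughput of the last training epoch, using an
# underscore-separated key as in the renamed metric above.
metrics["ray_TorchTrainer_fit"] = list(metrics_dataframe["tput"])[-1]
print(metrics["ray_TorchTrainer_fit"])  # 1402.8
```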
How is this metrics_dataframe aggregated across training workers? Is it the sum of the individual tput values from each worker?
LG, let's debug the throughput performance separately.
As a followup to ray-project#37624, add the following additional parameters for the multi-node training benchmark:
- File type (image, parquet)
- local shuffle buffer size
- preserve_order (train config)
- increases default # epochs to 10

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
Why are these changes needed?
As a followup to #37624, add the following additional parameters for the multi-node training benchmark:
- File type (image, parquet)
- local shuffle buffer size
- preserve_order (train config)
- increases default # epochs to 10
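As a hedged sketch of how these parameters might be exposed as CLI flags in the benchmark script, the flag names below are assumptions for illustration, not the actual names used in the PR.

```python
import argparse

# Hypothetical flag names for the new benchmark parameters described above;
# the real benchmark script may spell them differently.
parser = argparse.ArgumentParser(description="Data + Train benchmark (sketch)")
parser.add_argument("--file-type", choices=["image", "parquet"], default="image",
                    help="Input file type to read.")
parser.add_argument("--local-shuffle-buffer-size", type=int, default=None,
                    help="Local shuffle buffer size for the dataset iterator.")
parser.add_argument("--preserve-order", action="store_true",
                    help="Set preserve_order in the train config.")
parser.add_argument("--num-epochs", type=int, default=10,
                    help="Number of training epochs (default raised to 10).")

args = parser.parse_args(["--file-type", "parquet", "--preserve-order"])
print(args.file_type, args.preserve_order, args.num_epochs)  # parquet True 10
```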
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.