
Commit 115f395
Merge branch 'master' into lkchen-ray_data_llm
Signed-off-by: Linkun Chen <github@lkchen.net>
2 parents: 579c42f + 75941e7

38 files changed: +799 −887 lines

.buildkite/core.rayci.yml

Lines changed: 3 additions & 1 deletion
@@ -214,7 +214,9 @@ steps:
       --except-tags kubernetes,manual
 
   - label: ":ray: core: asan tests"
-    tags: python
+    tags:
+      - python
+      - skip-on-premerge # currently failing
     instance_type: medium
     commands:
       - bazel run //ci/ray_ci:test_in_docker -- //python/ray/tests/... core
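Both the `--except-tags` flag above and the new `skip-on-premerge` tag rely on tag-based test selection: a job excludes any test carrying one of the listed tags. A minimal sketch of that filtering, with hypothetical names (not Ray's actual CI code):

```python
def select_tests(tests, except_tags):
    """Keep only the tests that carry none of the excluded tags.

    `tests` maps a test name to its set of tags; `except_tags` is the list
    passed via a flag like `--except-tags kubernetes,manual`.
    """
    excluded = set(except_tags)
    return [name for name, tags in tests.items() if not (tags & excluded)]


tests = {
    "test_basic": {"python"},
    "test_k8s": {"python", "kubernetes"},
    "test_asan": {"python", "skip-on-premerge"},
}
# Excluding skip-on-premerge drops the currently failing asan tests.
print(select_tests(tests, ["kubernetes", "skip-on-premerge"]))
```

This is why tagging the asan step `skip-on-premerge` removes it from premerge runs without deleting the step itself.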

.buildkite/release-automation/pre_release.rayci.yml

Lines changed: 12 additions & 12 deletions
@@ -51,11 +51,22 @@ steps:
       RAYCI_RELEASE: 1
       RAYCI_SCHEDULE: "nightly"
 
+  - label: "Check Ray commit in {{matrix}} nightly images"
+    key: check-ray-commit
+    if: build.branch !~ /^releases\// && build.env("RAYCI_WEEKLY_RELEASE_NIGHTLY") == "1"
+    depends_on: trigger-postmerge-nightly
+    allow_dependency_failure: true
+    commands:
+      - bazel run //ci/ray_ci/automation:check_nightly_ray_commit -- --ray_type={{matrix}} --expected_commit="${BUILDKITE_COMMIT}"
+    matrix:
+      - ray
+      - ray-ml
+
   - label: "Trigger :kubernetes: Kuberay CI Tests"
     if: build.env("RAYCI_WEEKLY_RELEASE_NIGHTLY") == "1"
     trigger: "ray-ecosystem-ci-kuberay-ci"
     key: trigger-kuberay
-    depends_on: trigger-postmerge-nightly
+    depends_on: check-ray-commit
     build:
       branch: "release-1.3"
       message: "Triggered by release-automation build #${BUILDKITE_BUILD_NUMBER}"
@@ -125,14 +136,3 @@ steps:
     env:
       AUTOMATIC: 1
       RELEASE_FREQUENCY: "weekly"
-
-  - label: "Check Ray commit in {{matrix}} nightly images"
-    key: check-ray-commit
-    if: build.branch !~ /^releases\// && build.env("RAYCI_WEEKLY_RELEASE_NIGHTLY") == "1"
-    depends_on: trigger-postmerge-nightly
-    allow_dependency_failure: true
-    commands:
-      - bazel run //ci/ray_ci/automation:check_nightly_ray_commit -- --ray_type={{matrix}} --expected_commit="${BUILDKITE_COMMIT}"
-    matrix:
-      - ray
-      - ray-ml
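The moved step uses Buildkite's `matrix` key, which expands one step definition into one job per value, substituting `{{matrix}}` wherever it appears (here producing a `ray` job and a `ray-ml` job). A simplified sketch of that expansion, not Buildkite's actual implementation:

```python
def expand_matrix(step):
    """Expand a step with a `matrix` key into one concrete step per value,
    substituting the `{{matrix}}` placeholder in top-level string fields."""
    values = step.get("matrix", [None])
    expanded = []
    for value in values:
        concrete = {}
        for key, field in step.items():
            if key == "matrix":
                continue  # the matrix key itself is consumed by expansion
            if isinstance(field, str) and value is not None:
                field = field.replace("{{matrix}}", value)
            concrete[key] = field
        expanded.append(concrete)
    return expanded


step = {
    "label": "Check Ray commit in {{matrix}} nightly images",
    "matrix": ["ray", "ray-ml"],
}
for s in expand_matrix(step):
    print(s["label"])
```

Because both expanded jobs share the `check-ray-commit` key, the KubeRay trigger's `depends_on: check-ray-commit` now waits for the commit check on both image types.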

.github/CODEOWNERS

Lines changed: 4 additions & 0 deletions
@@ -68,6 +68,10 @@
 
 # ==== Libraries and frameworks ====
 
+# Common directory shared by core and the libraries.
+# @edoakes is the czar for now because the pattern is new.
+/python/ray/_common/ @edoakes @aslonnie
+
 # Ray data.
 /python/ray/data/ @ray-project/ray-data
 /doc/source/data/ @ray-project/ray-data
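A CODEOWNERS rule ending in `/` claims every file below that directory, and when several rules match, the last one wins. A simplified model of that matching (directory-prefix rules only, not GitHub's full glob syntax):

```python
def owners_for(path, rules):
    """Return the owners for `path` under a simplified CODEOWNERS model:
    a rule ending in '/' claims everything under that directory, and the
    last matching rule wins (as in real CODEOWNERS)."""
    matched = []
    for pattern, owners in rules:
        prefix = pattern.lstrip("/")
        if pattern.endswith("/") and path.startswith(prefix):
            matched = owners
        elif path == prefix:
            matched = owners
    return matched


rules = [
    ("/python/ray/_common/", ["@edoakes", "@aslonnie"]),
    ("/python/ray/data/", ["@ray-project/ray-data"]),
]
print(owners_for("python/ray/_common/utils.py", rules))
```

So any PR touching `/python/ray/_common/` now requests review from @edoakes and @aslonnie.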

BUILD.bazel

Lines changed: 26 additions & 26 deletions
@@ -142,8 +142,8 @@ ray_cc_library(
     hdrs = ["src/ray/rpc/grpc_client.h"],
     deps = [
         ":grpc_common_base",
-        ":rpc_client_call",
         ":rpc_chaos",
+        ":rpc_client_call",
         "//src/ray/common:grpc_util",
         "//src/ray/common:ray_config",
         "//src/ray/common:status",
@@ -157,9 +157,9 @@ ray_cc_library(
     deps = [
         ":stats_metric",
         "//src/ray/common:asio",
-        "//src/ray/common:ray_config",
         "//src/ray/common:grpc_util",
         "//src/ray/common:id",
+        "//src/ray/common:ray_config",
         "//src/ray/common:status",
         "@com_github_grpc_grpc//:grpc++",
     ],
@@ -170,8 +170,8 @@ ray_cc_library(
     srcs = ["src/ray/rpc/retryable_grpc_client.cc"],
     hdrs = ["src/ray/rpc/retryable_grpc_client.h"],
     deps = [
-        ":rpc_client_call",
         ":grpc_client",
+        ":rpc_client_call",
         "@com_google_absl//absl/container:btree",
         "@com_google_absl//absl/strings:str_format",
         "@com_google_absl//absl/time",
@@ -184,8 +184,8 @@ ray_cc_library(
     deps = [
         ":grpc_client",
         "//src/ray/common:status",
-        "//src/ray/util:logging",
         "//src/ray/protobuf:reporter_cc_proto",
+        "//src/ray/util:logging",
         "@com_github_grpc_grpc//:grpc++",
     ],
 )
@@ -198,8 +198,8 @@ ray_cc_library(
         ":grpc_common_base",
         ":rpc_server_call",
         "//src/ray/common:asio",
-        "//src/ray/common:status",
         "//src/ray/common:ray_config",
+        "//src/ray/common:status",
         "//src/ray/util:thread_utils",
         "@com_github_grpc_grpc//:grpc++",
         "@com_github_grpc_grpc//:grpc++_reflection",
@@ -216,9 +216,9 @@ ray_cc_library(
     ],
     # TODO(core): These three dependencies come from raylet client, should be able to remove after we split node rpc and raylet client into smaller targets.
     deps = [
+        "//src/ray/common:network",
         "//src/ray/common:ray_object",
         "//src/ray/common:task_common",
-        "//src/ray/common:network",
     ] + [
         ":grpc_client",
         ":grpc_common_base",
@@ -459,9 +459,9 @@ ray_cc_library(
     }),
     linkopts = PLASMA_LINKOPTS,
     deps = [
+        ":object_manager_common",
         ":plasma_fbs",
         ":ray_common",
-        ":object_manager_common",
         "//src/ray/protobuf:common_cc_proto",
         "//src/ray/util",
         "//src/ray/util:compat",
@@ -526,13 +526,13 @@ ray_cc_library(
     name = "ray_mock",
     hdrs = glob(
         ["src/mock/**/*.h"],
-        exclude = ["src/mock/ray/common/ray_syncer/ray_syncer.h"]
+        exclude = ["src/mock/ray/common/ray_syncer/ray_syncer.h"],
     ),
 )
 
 ray_cc_library(
     name = "ray_mock_syncer",
-    hdrs = ["src/mock/ray/common/ray_syncer/ray_syncer.h"]
+    hdrs = ["src/mock/ray/common/ray_syncer/ray_syncer.h"],
 )
 
 cc_grpc_library(
@@ -573,8 +573,8 @@ ray_cc_binary(
         ":raylet_lib",
         "//src/ray/util",
         "//src/ray/util:cmd_line_utils",
-        "//src/ray/util:stream_redirection_options",
         "//src/ray/util:stream_redirection",
+        "//src/ray/util:stream_redirection_options",
         "@com_github_gflags_gflags//:gflags",
     ],
 )
@@ -810,8 +810,8 @@ ray_cc_binary(
     deps = [
         ":gcs_server_lib",
         ":stats_lib",
-        "//src/ray/util:stream_redirection_options",
         "//src/ray/util:stream_redirection",
+        "//src/ray/util:stream_redirection_options",
         "@com_github_gflags_gflags//:gflags",
     ],
 )
@@ -864,8 +864,8 @@ ray_cc_library(
     name = "stats_opentelemetry",
     srcs = ["src/ray/stats/opentelemetry_metrics.cc"],
     deps = [
-        "@io_opentelemetry_cpp//sdk/src/logs:logs",
-        "@io_opentelemetry_cpp//sdk/src/trace:trace",
+        "@io_opentelemetry_cpp//sdk/src/logs",
+        "@io_opentelemetry_cpp//sdk/src/trace",
     ],
 )
 
@@ -1069,8 +1069,8 @@ ray_cc_library(
         "@com_google_absl//absl/base:core_headers",
         "@com_google_absl//absl/container:flat_hash_set",
         "@com_google_absl//absl/memory",
-        "@com_google_absl//absl/strings:str_format",
         "@com_google_absl//absl/strings",
+        "@com_google_absl//absl/strings:str_format",
         "@com_google_googletest//:gtest",
         "@io_opencensus_cpp//opencensus/exporters/stats/prometheus:prometheus_exporter",
         "@io_opencensus_cpp//opencensus/stats",
@@ -1093,16 +1093,16 @@ ray_cc_library(
     srcs = ["src/ray/raylet_client/raylet_client.cc"],
     hdrs = ["src/ray/raylet_client/raylet_client.h"],
     deps = [
-        ":raylet_client_connection_lib",
         ":node_manager_rpc",
-        "//src/ray/common:id",
+        ":raylet_client_connection_lib",
         "//src/ray/common:asio",
+        "//src/ray/common:id",
+        "//src/ray/common:network",
         "//src/ray/common:ray_object",
         "//src/ray/common:status",
-        "//src/ray/common:network",
         "//src/ray/common:task_common",
-        "//src/ray/util:logging",
         "//src/ray/protobuf:common_cc_proto",
+        "//src/ray/util:logging",
     ],
 )
 
@@ -1191,8 +1191,8 @@ ray_cc_library(
         "//src/ray/util:mutex_protected",
         "//src/ray/util:process",
         "//src/ray/util:shared_lru",
-        "//src/ray/util:stream_redirection_options",
         "//src/ray/util:stream_redirection",
+        "//src/ray/util:stream_redirection_options",
         "@boost//:circular_buffer",
         "@boost//:fiber",
         "@com_google_absl//absl/cleanup",
@@ -2514,10 +2514,10 @@ ray_cc_library(
     deps = [
         ":chunk_object_reader",
         ":object_buffer_pool",
-        ":object_manager_common",
         ":object_directory",
-        ":ownership_based_object_directory",
+        ":object_manager_common",
         ":object_manager_rpc",
+        ":ownership_based_object_directory",
         ":plasma_store_server_lib",
         ":pull_manager",
         ":push_manager",
@@ -2562,8 +2562,8 @@ ray_cc_library(
         "//src/ray/common:ray_config",
         "//src/ray/common:ray_object",
         "//src/ray/common:status",
-        "//src/ray/util:counter_map",
         "//src/ray/util:container_util",
+        "//src/ray/util:counter_map",
         "@boost//:asio",
         "@boost//:bind",
         "@com_google_absl//absl/container:flat_hash_map",
@@ -2937,11 +2937,11 @@ ray_cc_library(
     name = "gcs",
     deps = [
         ":gcs_callback",
+        ":gcs_pb_util",
         ":node_manager_fbs",
         ":node_manager_rpc",
-        ":gcs_pb_util",
         ":redis_client",
-    ]
+    ],
 )
 
 ray_cc_test(
@@ -3055,7 +3055,7 @@ pyx_library(
     cc_kwargs = dict(
         srcs = PYX_SRCS,
         # cython code is auto-generated, which is out of our control.
-        copts = COPTS + PYX_COPTS + ["-Wno-shadow"],
+        copts = COPTS + PYX_COPTS,
        # see https://github.com/tensorflow/tensorflow/blob/r2.1/tensorflow/lite/BUILD#L444
         linkopts = select({
             "@platforms//os:osx": [
@@ -3082,8 +3082,8 @@ pyx_library(
         "//src/ray/protobuf:serialization_cc_proto",
         "//src/ray/util",
         "//src/ray/util:memory",
-        "//src/ray/util:stream_redirection_options",
         "//src/ray/util:stream_redirection",
+        "//src/ray/util:stream_redirection_options",
     ],
 )
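Most of the churn in this file is buildifier-style alphabetization of `deps` lists: relative `:target` labels sort first, then `//`-absolute labels, then `@external` repository labels, each group alphabetically. A sort key that reproduces the ordering seen in these hunks (an approximation of buildifier's behavior, not the tool itself):

```python
def bazel_dep_sort_key(dep):
    """Order deps the way the cleanup above does: relative ':' targets
    first, then '//' absolute targets, then '@' external repos, with
    each group sorted alphabetically."""
    if dep.startswith(":"):
        group = 0
    elif dep.startswith("//"):
        group = 1
    else:  # '@external//...' labels sort last
        group = 2
    return (group, dep)


# The grpc_client deps from the first hunk, in their pre-change order.
deps = [
    ":grpc_common_base",
    ":rpc_client_call",
    ":rpc_chaos",
    "//src/ray/common:grpc_util",
    "//src/ray/common:ray_config",
    "//src/ray/common:status",
]
print(sorted(deps, key=bazel_dep_sort_key))
```

Sorting with this key yields exactly the post-change order shown in the `@@ -142,8 +142,8 @@` hunk.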

bazel/ray.bzl

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ PYX_COPTS = select({
     "//conditions:default": [
         # Ignore this warning since CPython and Cython have issue removing deprecated tp_print on MacOS
         "-Wno-deprecated-declarations",
+        "-Wno-shadow",
     ],
 }) + select({
     "@platforms//os:windows": [
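Together with the BUILD.bazel hunk that drops `+ ["-Wno-shadow"]` from the `pyx_library` call site, this moves the flag into the shared `PYX_COPTS`, so every Cython target picks it up; the resolved flag set on default platforms is unchanged. A toy model of how `select()` resolves and the lists concatenate (not Bazel's actual resolution logic):

```python
def resolve_select(branches, condition):
    """Toy model of Bazel select(): return the branch for the matching
    condition, falling back to //conditions:default."""
    return branches.get(condition, branches.get("//conditions:default", []))


# Before this change: -Wno-shadow was appended at the single pyx_library call site.
copts_before = resolve_select(
    {"//conditions:default": ["-Wno-deprecated-declarations"]}, "linux"
) + ["-Wno-shadow"]

# After: the flag lives in the shared PYX_COPTS, so all Cython targets get it.
copts_after = resolve_select(
    {"//conditions:default": ["-Wno-deprecated-declarations", "-Wno-shadow"]},
    "linux",
)

print(sorted(copts_before) == sorted(copts_after))  # same flag set either way
```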

doc/source/train/common/torch-configure-run.rst

Lines changed: 2 additions & 3 deletions
@@ -1,5 +1,5 @@
-Configure scale and resources
------------------------------
+Configure scale and GPUs
+------------------------
 
 Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure:
 
@@ -11,7 +11,6 @@ Outside of your training function, create a :class:`~ray.train.ScalingConfig` ob
     from ray.train import ScalingConfig
     scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
 
-3. (Optional) :class:`resources_per_worker <ray.train.ScalingConfig>` - The resources reserved for each worker. If you want to allocate more than one CPU or GPU per training worker, or if you need to specify other accelerators, set this attribute.
 
 For more details, see :ref:`train_scaling_config`.
doc/source/train/examples/lightning/dolly_lightning_fsdp_finetuning.ipynb

Lines changed: 1 addition & 1 deletion
@@ -338,7 +338,7 @@
    "source": [
     "## Fine-tune with Ray TorchTrainer\n",
     "\n",
-    "Ray TorchTrainer allows you to scale your PyTorch Lightning training workload over multiple nodes. See {ref}`Configuring Scale and Resources <train_scaling_config>` for more details."
+    "Ray TorchTrainer allows you to scale your PyTorch Lightning training workload over multiple nodes. See {ref}`Configuring Scale and GPUs <train_scaling_config>` for more details."
    ]
   },
   {

doc/source/train/getting-started-pytorch-lightning.rst

Lines changed: 3 additions & 3 deletions
@@ -7,9 +7,9 @@ This tutorial walks through the process of converting an existing PyTorch Lightn
 
 Learn how to:
 
-1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU, GPU, or other accelerator device.
+1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU or GPU device.
 2. Configure :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
-3. Configure :ref:`scaling <train-overview-scaling-config>` and CPU, GPU, or other accelerator resource requirements for a training job.
+3. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for a training job.
 4. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`.
 
 Quickstart
@@ -31,7 +31,7 @@ For reference, the final code is as follows:
     result = trainer.fit()
 
 1. `train_func` is the Python code that executes on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs or other types of accelerators.
+2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs.
 3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job.
 
 Compare a PyTorch Lightning training script with and without Ray Train.

doc/source/train/getting-started-pytorch.rst

Lines changed: 4 additions & 4 deletions
@@ -7,10 +7,10 @@ This tutorial walks through the process of converting an existing PyTorch script
 
 Learn how to:
 
-1. Configure a model to run distributed and on the correct CPU, GPU, or other accelerator device.
-2. Configure a dataloader to shard data across the :ref:`workers <train-overview-worker>` and place data on the correct CPU, GPU, or other accelerator device.
+1. Configure a model to run distributed and on the correct CPU/GPU device.
+2. Configure a dataloader to shard data across the :ref:`workers <train-overview-worker>` and place data on the correct CPU or GPU device.
 3. Configure a :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
-4. Configure :ref:`scaling <train-overview-scaling-config>` and CPU, GPU, or other accelerator resource requirements for a training job.
+4. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for a training job.
 5. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer` class.
 
 Quickstart
@@ -33,7 +33,7 @@ For reference, the final code will look something like the following:
     result = trainer.fit()
 
 1. `train_func` is the Python code that executes on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers, and whether to use CPUs, GPUs, or other types of accelerator devices.
+2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs.
 3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job.
 
 Compare a PyTorch training script with and without Ray Train.

doc/source/train/getting-started-transformers.rst

Lines changed: 4 additions & 3 deletions
@@ -7,8 +7,8 @@ This tutorial shows you how to convert an existing Hugging Face Transformers scr
 
 In this guide, learn how to:
 
-1. Configure a :ref:`training function <train-overview-training-function>` that reports metrics and saves checkpoints.
-2. Configure :ref:`scaling <train-overview-scaling-config>` and resource requirements for CPUs, GPUs or other accelerators for your distributed training job.
+1. Configure a :ref:`training function <train-overview-training-function>` that properly reports metrics and saves checkpoints.
+2. Configure :ref:`scaling <train-overview-scaling-config>` and resource requirements for CPUs or GPUs for your distributed training job.
 3. Launch a distributed training job with :class:`~ray.train.torch.TorchTrainer`.
 
 
@@ -21,6 +21,7 @@ Install the necessary packages before you begin:
 
     pip install "ray[train]" torch "transformers[torch]" datasets evaluate numpy scikit-learn
 
+
 Quickstart
 ----------
 
@@ -43,7 +44,7 @@ Here's a quick overview of the final code structure:
 The key components are:
 
 1. `train_func`: Python code that runs on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig`: Defines the number of distributed training workers and their CPUs, GPUs, or other types of accelerator devices.
+2. :class:`~ray.train.ScalingConfig`: Defines the number of distributed training workers and GPU usage.
 3. :class:`~ray.train.torch.TorchTrainer`: Launches and manages the distributed training job.
 
 Code Comparison: Hugging Face Transformers vs. Ray Train Integration
