add external scaler enabled #57554
Conversation
Signed-off-by: harshit <harshit@anyscale.com>
Code Review
This pull request introduces an external_scaler_enabled flag, which is a great addition for custom autoscaling. The implementation is mostly solid, but I've found a few critical and high-severity issues related to correctness that should be addressed. Specifically, there are issues with the order of operations in the controller, a bug in how the new flag is passed in application_state.py, and incorrect boolean logic in deployment_info.py. I've also included a medium-severity suggestion to improve code style. Please review my comments for details and suggested fixes.
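For readers following along, here is a purely hypothetical sketch of how an `external_scaler_enabled` flag might appear in a Serve application config. The field name comes from this PR's description, but its exact placement and the surrounding schema are assumptions, not the PR's actual diff.

```python
# Hypothetical illustration only, not the schema added by this PR.
app_config = {
    "name": "my_app",
    "import_path": "my_module:app",
    # When an external scaler is enabled, the intent is that Serve's built-in
    # autoscaler stands down and replica counts are driven by an external
    # controller (e.g. through a management API) instead.
    "external_scaler_enabled": True,
}
```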
and allow a test to match one of the filters on the same attribute instead of reporting a mismatch if any single filter fails Signed-off-by: kevin <kevin@anyscale.com>
…57663) also fixes a typo in the build file Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
… Troubleshooting Guide (#55236)

## Why are these changes needed?

Some users may not know how to configure `ReconcileConcurrency` in KubeRay.

Docs link: https://anyscale-ray--55236.com.readthedocs.build/en/55236/cluster/kubernetes/troubleshooting/troubleshooting.html#how-to-configure-reconcile-concurrency-when-there-are-large-mount-of-crs

ray-project/kuberay#3909

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
…ses. (#57269)

This PR stacks on #57244. For more details about the resource isolation project, see #54703.

In the previous Ray cgroup hierarchy, all processes in the path `--cgroup-path` were moved into the system cgroup. This change gives all non-Ray processes their own cgroup. The new cgroup hierarchy looks like:

```
cgroup_path (e.g. /sys/fs/cgroup)
└── ray-node_<node_id>
    ├── system
    │   └── leaf
    └── user
        ├── workers
        └── non-ray
```

The cgroups contain the following processes:

* system/leaf: all Ray non-worker processes (e.g. raylet, runtime_env_agent, gcs_server, ...)
* user/workers: all Ray worker processes
* user/non-ray: all non-Ray processes migrated from cgroup_path. Note: if you're running Ray inside a container, all non-Ray processes running in the container will be migrated to `user/non-ray`.

The following controllers will be enabled:

* cgroup_path (cpu, memory)
* ray-node_<node_id> (cpu, memory)
* system (memory)

The following constraints are applied:

* system (cpu.weight, memory.min)
* user (cpu.weight)

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
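Not code from this PR: a rough sketch of how a cgroup v2 hierarchy like the one above could be created from Python (cgroup v2 is driven by plain file writes), assuming root privileges and that the required controllers are already enabled at the root. Paths and values are illustrative.

```python
import os

# Assumed inputs: in Ray these would come from --cgroup-path and the node id.
CGROUP_ROOT = "/sys/fs/cgroup"
NODE_CGROUP = os.path.join(CGROUP_ROOT, "ray-node_example")


def write_cgroup_file(cgroup: str, name: str, value: str) -> None:
    with open(os.path.join(cgroup, name), "w") as f:
        f.write(value)


# Create the node cgroup and its system/user subtrees.
for sub in ("", "system", "system/leaf", "user", "user/workers", "user/non-ray"):
    os.makedirs(os.path.join(NODE_CGROUP, sub), exist_ok=True)

# Enable cpu and memory controllers for the node cgroup's children
# (the root's cgroup.subtree_control must already include them).
write_cgroup_file(NODE_CGROUP, "cgroup.subtree_control", "+cpu +memory")

# Example constraints mirroring the commit message: reserve memory for the
# system subtree and split cpu by weight between system and user.
write_cgroup_file(os.path.join(NODE_CGROUP, "system"), "memory.min", str(512 * 1024 * 1024))
write_cgroup_file(os.path.join(NODE_CGROUP, "system"), "cpu.weight", "1000")
write_cgroup_file(os.path.join(NODE_CGROUP, "user"), "cpu.weight", "100")

# Processes are then migrated by writing their pids to <cgroup>/cgroup.procs.
```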
… to fix over-provisioning (#57130)

This PR is related to [https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864)

The v1 autoscaler monitor currently pulls metrics from two different modules in GCS:

- **`GcsResourceManager`** (v1, legacy): manages `node_resource_usages_` and updates it at two different intervals (`UpdateNodeResourceUsage` every 0.1s, `UpdateResourceLoads` every 1s).
- **`GcsAutoscalerStateManager`** (v2): manages `node_resource_info_` and updates it via `UpdateResourceLoadAndUsage`. This module is already the source for the v2 autoscaler.

| Field | Source | Update method | Update interval | Code reference |
| --- | --- | --- | --- | --- |
| current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |

Because these two modules update asynchronously, the autoscaler can end up seeing inconsistent resource states. That causes a race condition where extra nodes may be launched before the updated availability actually shows up. In practice, this means clusters can become over-provisioned even though the demand was already satisfied.

In the long run, the right fix is to fully switch the v1 autoscaler over to `GcsAutoscalerStateManager::HandleGetClusterResourceState`, just like v2 already does. But since v1 will eventually be deprecated, this PR takes a practical interim step: it merges the necessary info from both `GcsResourceManager::HandleGetAllResourceUsage` and `GcsAutoscalerStateManager::HandleGetClusterResourceState` in a hybrid approach. This keeps v1 correct without big changes, while still leaving the path open for a clean migration to v2 later on.

## Details

This PR follows the fix suggested by @rueian in #52864 by switching the v1 autoscaler's node state source from `GcsResourceManager::HandleGetAllResourceUsage` to `GcsAutoscalerStateManager::HandleGetClusterResourceState`.

Root cause: the v1 autoscaler previously got its data from two asynchronous update cycles:

- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`

This created a race condition where newly allocated resources would be visible before demand metrics updated, causing the autoscaler to incorrectly perceive unmet demand and launch extra nodes.

The fix: by using v2's `HandleGetClusterResourceState` for node iteration, both current resources and pending demands now come from the same consistent snapshot (same tick), so the extra-node race condition goes away.

## Proposed Changes in update_load_metrics()

This PR updates how the v1 autoscaler collects cluster metrics. Most node state information is now taken from **v2 (`GcsAutoscalerStateManager::HandleGetClusterResourceState`)**, while certain fields still rely on **v1 (`GcsResourceManager::HandleGetAllResourceUsage`)** because v2 doesn't have an equivalent yet.

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | O | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | X | v2 does not separate ready vs infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | X | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | X | No replacement in v2 fields, so kept as is. |

### Additional Notes

- This hybrid approach addresses the race condition while still using legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, which shows that the change is backward-compatible and does not break existing behavior.

## Changed Behavior (Observed)

(Autoscaler config & serving code are the same as in [https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864))

After switching to the v2 autoscaler state (cluster resource), the issue no longer occurs:

- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, the Node Provider only launches a single `ray.worker.4090.standard` node. (No extra requests for additional nodes are observed.) [debug.log](https://github.com/user-attachments/files/22659163/debug.log)

## Related issue number

Closes #52864

Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
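A simplified sketch of the hybrid merge described above. The field names (`node_states`, `resource_load_by_shape`, `placement_group_load`, `cluster_full_of_actors_detected`) are taken from the tables in this commit message, but the function, attribute access, and return shape are illustrative, not the actual `update_load_metrics()` code.

```python
def merge_autoscaler_snapshot(cluster_resource_state, legacy_resource_usage):
    """Illustrative merge of v2 node states with legacy-only demand fields.

    cluster_resource_state: reply of
        GcsAutoscalerStateManager::HandleGetClusterResourceState (v2).
    legacy_resource_usage: reply of
        GcsResourceManager::HandleGetAllResourceUsage (v1).
    """
    # Node availability, usage, and idle time all come from the same v2
    # snapshot, so they are mutually consistent (same tick).
    nodes = [
        {
            "node_id": node.node_id,
            "total": dict(node.total_resources),
            "available": dict(node.available_resources),
            "idle_duration_ms": node.idle_duration_ms,
        }
        for node in cluster_resource_state.node_states
    ]

    # Fields with no validated v2 equivalent yet still come from the legacy
    # reply, exactly as the tables above describe.
    demand = legacy_resource_usage.resource_load_by_shape
    pending_placement_groups = legacy_resource_usage.placement_group_load
    cluster_full = legacy_resource_usage.cluster_full_of_actors_detected

    return nodes, demand, pending_placement_groups, cluster_full
```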
towards fixing test_in_docker for Windows Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
as cluster envs are Anyscale-specific concepts Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
out of `util.py`; also adding its own `py_library` Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
## Why are these changes needed?

I've encountered an issue where Ray sends SIGKILL to child processes launched by a Ray actor (the grandchild processes never receive the signal). As a result, the subprocess cannot catch the signal and gracefully clean up its own children, so the actor's grandchild processes leak. I'm glad to see #56476 by @codope, and I had built a similar solution myself; this PR adds the case I ran into. @codope why not enable this feature by default?

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
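To illustrate the failure mode above outside of Ray: a process that receives SIGKILL cannot forward anything to its children, so a common workaround is to start the subprocess in its own session and signal the whole process group. A minimal, self-contained sketch (POSIX only; not the mechanism added in #56476):

```python
import os
import signal
import subprocess
import time

# Start the child in a new session so it (and any grandchildren it spawns)
# share a process group separate from ours.
child = subprocess.Popen(
    ["bash", "-c", "sleep 300 & wait"],  # the `sleep` is the grandchild
    start_new_session=True,
)

time.sleep(1)

# Signal the whole group with SIGTERM: the child can trap it and clean up,
# and the grandchild receives it directly. Sending SIGKILL to child.pid
# alone would orphan (leak) the grandchild `sleep` process.
os.killpg(os.getpgid(child.pid), signal.SIGTERM)
child.wait(timeout=10)
```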
into `anyscale_job_runner`; it is only used in `anyscale_job_runner` now Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
it is only used in glue.py Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
should all be using hermetic Python with Python 3.8 or above now Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
make `test_in_docker` not depend on the entire `ray_release` library, but only on the Python files required for the test DB to work. This removes the dependency on the `cryptography` library from `ray_ci`, so that Windows wheels can be built and Windows tests can run again. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ystem reserved resources (#57653) Signed-off-by: irabbani <israbbani@gmail.com> Signed-off-by: israbbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: harshit <harshit@anyscale.com>
Cleaning out plasma, the plasma client, and its neighbors. The plasma client had a pimpl implementation, even though we didn't really need anything that pimpl provides here, so this kills the separate impl class and keeps just the plasma client and its interface. One note: the client needs `shared_from_this`, and the old plasma client always held a shared ptr to the impl, so the raylet had to be refactored to hold a shared ptr to the plasma client so `shared_from_this` keeps working.

Other cleanup:

- a lot of the IPC functions always returned `status::ok`, so they were changed to return void
- some extra reserving of vectors and moving
- unnecessary consts in the pull manager that would prevent moves, etc.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
#57626) The test times out frequently in CI. Before this change, the test took `~40s` to run on my laptop. After the change, the test took `~15s` to run on my laptop. There also seems to be hanging related to in-order execution semantics, so for now flipping to `allow_out_of_order_exection=True`. @dayshah will add the `False` variant when he fixes the underlying issue. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This PR refactors the `TaskExecutionEvent` proto in two ways:
- Rename the file to `events_task_lifecycle_event.proto`
- Refactor the task_state from a map to an array of TaskState and
timestamp. Also rename the field to `state_transitions` for consistency.
This PR depends on the upstream to update their logic to consume this
new schema.
Test:
- CI
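To make the schema change concrete, here is an illustration (written as Python dataclasses rather than protobuf) of the before/after shapes described above; names follow the commit message, not the exact generated proto classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Before: TaskExecutionEvent carried a map keyed by task state, which cannot
# represent revisiting a state and does not preserve transition order.
@dataclass
class TaskExecutionEventShape:
    task_state: Dict[str, int] = field(default_factory=dict)  # state -> timestamp


# After: TaskLifecycleEvent carries an ordered list of transitions.
@dataclass
class StateTransition:
    state: str
    timestamp: int


@dataclass
class TaskLifecycleEventShape:
    state_transitions: List[StateTransition] = field(default_factory=list)
```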
---
> [!NOTE]
> Renames task execution event to task lifecycle event and changes its
schema from a state map to an ordered state_transitions list, updating
core, GCS, dashboard, builds, tests, and docs.
>
> - **Proto/API changes (breaking)**
> - Rename `TaskExecutionEvent` → `TaskLifecycleEvent` and update
`RayEvent.EventType` (`TASK_EXECUTION_EVENT` → `TASK_LIFECYCLE_EVENT`).
> - Replace `task_state` map with `state_transitions` (list of `{state,
timestamp}`) in `events_task_lifecycle_event.proto`.
> - Update `events_base_event.proto` field from `task_execution_event` →
`task_lifecycle_event` and imports/BUILD deps accordingly.
> - **Core worker**
> - Update buffer/conversion logic in
`src/ray/core_worker/task_event_buffer.{h,cc}` to populate/emit
`TaskLifecycleEvent` with `state_transitions`.
> - **GCS**
> - Update `GcsRayEventConverter` to consume `TASK_LIFECYCLE_EVENT` and
convert `state_transitions` to `state_ts_ns`.
> - **Dashboard/Aggregator**
> - Switch exposable type defaults/env to `TASK_LIFECYCLE_EVENT` in
`python/.../aggregator_agent.py`.
> - **Tests**
> - Adjust unit tests to new event/type and schema across core worker,
GCS, and dashboard.
> - **Docs**
> - Update event export guide references to new lifecycle event proto.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 61507e8.</sup>
Signed-off-by: Cuong Nguyen <can@anyscale.com>
- removing the capability enum as it is not being used; for more details: #56707 (comment)

---------

Signed-off-by: harshit <harshit@anyscale.com>
… default (#57623)

Previously we were using `DeprecationWarning` which is silenced by default. Now this is printed:

```
>>> ray.init(local_mode=True)
/Users/eoakes/code/ray/python/ray/_private/client_mode_hook.py:104: FutureWarning: `local_mode` is an experimental feature that is no longer maintained and will be removed in the near future. For debugging consider using the Ray distributed debugger.
  return func(*args, **kwargs)
```

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
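The behavior relied on here is standard CPython: the default warning filters ignore `DeprecationWarning` unless it is attributed to code in `__main__`, while `FutureWarning` is always shown. A small standalone illustration (not Ray code):

```python
import warnings


def old_api_deprecation():
    # With stacklevel=2 the warning is attributed to the caller. If the
    # caller lives in a library module (not __main__), the default filters
    # silence this warning entirely.
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)


def old_api_future():
    # FutureWarning is not silenced by the default filters, so end users see
    # it no matter where the call originates.
    warnings.warn(
        "`local_mode` is an experimental feature that is no longer maintained",
        FutureWarning,
        stacklevel=2,
    )
```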
- adding a new note about using the filesystem as a broker in Celery

---------

Signed-off-by: harshit <harshit@anyscale.com>
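For context on the note being added: Celery (through Kombu) ships a filesystem transport that exchanges messages via shared directories instead of a broker server, which can be handy for simple single-host or test setups. A minimal sketch, assuming the folder already exists:

```python
from celery import Celery

app = Celery("tasks", broker="filesystem://")
app.conf.broker_transport_options = {
    # Messages are written to and read from these folders; pointing both at
    # the same directory gives a simple local queue with no Redis/RabbitMQ.
    "data_folder_in": "/tmp/celery-broker",
    "data_folder_out": "/tmp/celery-broker",
}


@app.task
def add(x, y):
    return x + y
```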
## Description

Improved the Ray pull request template to make it less overwhelming for contributors while giving maintainers better information for reviews and release notes. The new template has clearer sections and organized checklists that are much easier to fill out. This should encourage more contributions while making the review process smoother and release note generation more straightforward.

## Related issues

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**

- [ ] Yes ⚠️
- [x] No

**Testing:**

- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**

- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**

- [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

---------

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…epseek support) (#56906) Signed-off-by: Jiang Wu <JWu@cclgroup.com> Signed-off-by: Jiang Wu <jwu@cclgroup.com> Co-authored-by: Nikhil G <nrghosh@users.noreply.github.com>
This PR refactors the operator metrics logging tests in `test_stats.py` to improve clarity, reliability, and maintainability.

- Replaced `test_op_metrics_logging` and `test_op_state_logging` with a single, more focused test: `test_executor_logs_metrics_on_operator_completion`
- Uses pytest's `caplog` fixture instead of mocking the logger (more idiomatic)
- Tests the core behavior (operator completion metrics logged exactly once) without depending on exact log message formatting
- Eliminates reliance on helper functions and complex string matching
- More descriptive test name following unit testing best practices
- Reduced test code complexity while maintaining coverage of critical logging behavior

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
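As a generic illustration of the `caplog`-based pattern described above (not the actual test from `test_stats.py`; the logger name and log message are made up):

```python
import logging


def finish_operator(logger: logging.Logger) -> None:
    # Stand-in for the executor behavior under test: metrics are logged
    # exactly once when an operator completes.
    logger.info("Operator InputDataBuffer completed. Metrics: {...}")


def test_executor_logs_metrics_on_operator_completion(caplog):
    logger_name = "ray.data._internal.execution"  # illustrative name
    with caplog.at_level(logging.INFO, logger=logger_name):
        finish_operator(logging.getLogger(logger_name))

    completion_records = [
        r for r in caplog.records if "completed" in r.getMessage()
    ]
    assert len(completion_records) == 1  # logged exactly once
```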
## Why are these changes needed?

as titled

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ray into add-external-scaler-enabled
Signed-off-by: harshit <harshit@anyscale.com>
Creating a new PR for the same change.

New PR link: #57727
- adding docs for the POST API; more details here: https://docs.google.com/document/d/1KtMUDz1O3koihG6eh-QcUqudZjNAX3NsqqOMYh3BoWA/edit?tab=t.0#heading=h.2vf4s2d7ca46
- also making changes for external scaler enabled in the existing Serve application docs, to be merged after #57554

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
This is being done as part of the custom autoscaling story.