add external scaler enabled #57554
Conversation
Signed-off-by: harshit <harshit@anyscale.com>
Code Review
This pull request introduces an external_scaler_enabled flag, which is a great addition for custom autoscaling. The implementation is mostly solid, but I've found a few critical and high-severity issues related to correctness that should be addressed. Specifically, there are issues with the order of operations in the controller, a bug in how the new flag is passed in application_state.py, and incorrect boolean logic in deployment_info.py. I've also included a medium-severity suggestion to improve code style. Please review my comments for details and suggested fixes.
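For readers following along, here is a purely hypothetical sketch of how an `external_scaler_enabled` flag might appear in a Serve application config. The field name comes from this PR's description, but its exact placement and the surrounding schema are assumptions, not the PR's actual diff.

```python
# Hypothetical illustration only, not the schema added by this PR.
app_config = {
    "name": "my_app",
    "import_path": "my_module:app",
    # When an external scaler is enabled, the intent is that Serve's built-in
    # autoscaler stands down and replica counts are driven by an external
    # controller (e.g. through a management API) instead.
    "external_scaler_enabled": True,
}
```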
and allow a test to match one of the filters on the same attribute instead of reporting a mismatch if any single filter fails Signed-off-by: kevin <kevin@anyscale.com>
…57663) also fixes a typo in the build file Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
… Troubleshooting Guide (#55236)

## Why are these changes needed?

Some users may not know how to configure `ReconcileConcurrency` in KubeRay.

Docs link: https://anyscale-ray--55236.com.readthedocs.build/en/55236/cluster/kubernetes/troubleshooting/troubleshooting.html#how-to-configure-reconcile-concurrency-when-there-are-large-mount-of-crs

ray-project/kuberay#3909

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
…ses. (#57269)

This PR stacks on #57244. For more details about the resource isolation project, see #54703.

In the previous Ray cgroup hierarchy, all processes in the path `--cgroup-path` were moved into the system cgroup. This change gives all non-Ray processes their own cgroup. The new cgroup hierarchy looks like:

```
cgroup_path (e.g. /sys/fs/cgroup)
└── ray-node_<node_id>
    ├── system
    │   └── leaf
    └── user
        ├── workers
        └── non-ray
```

The cgroups contain the following processes:

* system/leaf: all Ray non-worker processes (e.g. raylet, runtime_env_agent, gcs_server, ...)
* user/workers: all Ray worker processes
* user/non-ray: all non-Ray processes migrated from cgroup_path. Note: if you're running Ray inside a container, all non-Ray processes running in the container will be migrated to `user/non-ray`.

The following controllers will be enabled:

* cgroup_path (cpu, memory)
* ray-node_<node_id> (cpu, memory)
* system (memory)

The following constraints are applied:

* system (cpu.weight, memory.min)
* user (cpu.weight)

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
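Not code from this PR: a rough sketch of how a cgroup v2 hierarchy like the one above could be created from Python (cgroup v2 is driven by plain file writes), assuming root privileges and that the required controllers are already enabled at the root. Paths and values are illustrative.

```python
import os

# Assumed inputs: in Ray these would come from --cgroup-path and the node id.
CGROUP_ROOT = "/sys/fs/cgroup"
NODE_CGROUP = os.path.join(CGROUP_ROOT, "ray-node_example")


def write_cgroup_file(cgroup: str, name: str, value: str) -> None:
    with open(os.path.join(cgroup, name), "w") as f:
        f.write(value)


# Create the node cgroup and its system/user subtrees.
for sub in ("", "system", "system/leaf", "user", "user/workers", "user/non-ray"):
    os.makedirs(os.path.join(NODE_CGROUP, sub), exist_ok=True)

# Enable cpu and memory controllers for the node cgroup's children
# (the root's cgroup.subtree_control must already include them).
write_cgroup_file(NODE_CGROUP, "cgroup.subtree_control", "+cpu +memory")

# Example constraints mirroring the commit message: reserve memory for the
# system subtree and split cpu by weight between system and user.
write_cgroup_file(os.path.join(NODE_CGROUP, "system"), "memory.min", str(512 * 1024 * 1024))
write_cgroup_file(os.path.join(NODE_CGROUP, "system"), "cpu.weight", "1000")
write_cgroup_file(os.path.join(NODE_CGROUP, "user"), "cpu.weight", "100")

# Processes are then migrated by writing their pids to <cgroup>/cgroup.procs.
```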
… to fix over-provisioning (#57130)

This PR is related to [https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864)

The v1 autoscaler monitor currently pulls metrics from two different modules in GCS:

- **`GcsResourceManager`** (v1, legacy): manages `node_resource_usages_` and updates it at two different intervals (`UpdateNodeResourceUsage` every 0.1s, `UpdateResourceLoads` every 1s).
- **`GcsAutoscalerStateManager`** (v2): manages `node_resource_info_` and updates it via `UpdateResourceLoadAndUsage`. This module is already the source for the v2 autoscaler.

| Field | Source | Update method | Update interval | Code reference |
| --- | --- | --- | --- | --- |
| current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |

Because these two modules update asynchronously, the autoscaler can end up seeing inconsistent resource states. That causes a race condition where extra nodes may be launched before the updated availability actually shows up. In practice, this means clusters can become over-provisioned even though the demand was already satisfied.

In the long run, the right fix is to fully switch the v1 autoscaler over to `GcsAutoscalerStateManager::HandleGetClusterResourceState`, just like v2 already does. But since v1 will eventually be deprecated, this PR takes a practical interim step: it merges the necessary info from both `GcsResourceManager::HandleGetAllResourceUsage` and `GcsAutoscalerStateManager::HandleGetClusterResourceState` in a hybrid approach. This keeps v1 correct without big changes, while still leaving the path open for a clean migration to v2 later on.

## Details

This PR follows the fix suggested by @rueian in #52864 by switching the v1 autoscaler's node state source from `GcsResourceManager::HandleGetAllResourceUsage` to `GcsAutoscalerStateManager::HandleGetClusterResourceState`.

Root cause: the v1 autoscaler previously got its data from two asynchronous update cycles:

- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`

This created a race condition where newly allocated resources would be visible before demand metrics updated, causing the autoscaler to incorrectly perceive unmet demand and launch extra nodes.

The fix: by using v2's `HandleGetClusterResourceState` for node iteration, both current resources and pending demands now come from the same consistent snapshot (same tick), so the extra-node race condition goes away.

## Proposed Changes in update_load_metrics()

This PR updates how the v1 autoscaler collects cluster metrics. Most node state information is now taken from **v2 (`GcsAutoscalerStateManager::HandleGetClusterResourceState`)**, while certain fields still rely on **v1 (`GcsResourceManager::HandleGetAllResourceUsage`)** because v2 doesn't have an equivalent yet.

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | O | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | X | v2 does not separate ready vs infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | X | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | X | No replacement in v2 fields, so kept as is. |

### Additional Notes

- This hybrid approach addresses the race condition while still using legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, which shows that the change is backward-compatible and does not break existing behavior.

## Changed Behavior (Observed)

(Autoscaler config & serving code are the same as in [https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864))

After switching to the v2 autoscaler state (cluster resource), the issue no longer occurs:

- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, the Node Provider only launches a single `ray.worker.4090.standard` node. (No extra requests for additional nodes are observed.) [debug.log](https://github.com/user-attachments/files/22659163/debug.log)

## Related issue number

Closes #52864

Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
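A simplified sketch of the hybrid merge described above. The field names (`node_states`, `resource_load_by_shape`, `placement_group_load`, `cluster_full_of_actors_detected`) are taken from the tables in this commit message, but the function, attribute access, and return shape are illustrative, not the actual `update_load_metrics()` code.

```python
def merge_autoscaler_snapshot(cluster_resource_state, legacy_resource_usage):
    """Illustrative merge of v2 node states with legacy-only demand fields.

    cluster_resource_state: reply of
        GcsAutoscalerStateManager::HandleGetClusterResourceState (v2).
    legacy_resource_usage: reply of
        GcsResourceManager::HandleGetAllResourceUsage (v1).
    """
    # Node availability, usage, and idle time all come from the same v2
    # snapshot, so they are mutually consistent (same tick).
    nodes = [
        {
            "node_id": node.node_id,
            "total": dict(node.total_resources),
            "available": dict(node.available_resources),
            "idle_duration_ms": node.idle_duration_ms,
        }
        for node in cluster_resource_state.node_states
    ]

    # Fields with no validated v2 equivalent yet still come from the legacy
    # reply, exactly as the tables above describe.
    demand = legacy_resource_usage.resource_load_by_shape
    pending_placement_groups = legacy_resource_usage.placement_group_load
    cluster_full = legacy_resource_usage.cluster_full_of_actors_detected

    return nodes, demand, pending_placement_groups, cluster_full
```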
towards fixing test_in_docker for Windows Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
as cluster envs are Anyscale-specific concepts Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
out of `util.py`; also adding its own `py_library` Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
## Why are these changes needed?

I've encountered an issue where Ray sends SIGKILL to child processes launched by a Ray actor (the grandchild processes never receive the signal). As a result, the subprocess cannot catch the signal and gracefully clean up its own children, so the actor's grandchild processes leak. I'm glad to see #56476 by @codope, and I had built a similar solution myself; this PR adds the case I ran into. @codope why not enable this feature by default?

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
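To illustrate the failure mode above outside of Ray: a process that receives SIGKILL cannot forward anything to its children, so a common workaround is to start the subprocess in its own session and signal the whole process group. A minimal, self-contained sketch (POSIX only; not the mechanism added in #56476):

```python
import os
import signal
import subprocess
import time

# Start the child in a new session so it (and any grandchildren it spawns)
# share a process group separate from ours.
child = subprocess.Popen(
    ["bash", "-c", "sleep 300 & wait"],  # the `sleep` is the grandchild
    start_new_session=True,
)

time.sleep(1)

# Signal the whole group with SIGTERM: the child can trap it and clean up,
# and the grandchild receives it directly. Sending SIGKILL to child.pid
# alone would orphan (leak) the grandchild `sleep` process.
os.killpg(os.getpgid(child.pid), signal.SIGTERM)
child.wait(timeout=10)
```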
into `anyscale_job_runner`; it is only used in `anyscale_job_runner` now Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
it is only used in glue.py Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
should all be using hermetic Python with Python 3.8 or above now Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
make `test_in_docker` not depend on the entire `ray_release` library, but only on the Python files required for the test DB to work. This removes the dependency on the `cryptography` library from `ray_ci`, so that Windows wheels can be built and Windows tests can run again. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ystem reserved resources (#57653) Signed-off-by: irabbani <israbbani@gmail.com> Signed-off-by: israbbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: harshit <harshit@anyscale.com>
Cleaning out plasma, the plasma client, and its neighbors. The plasma client had a pimpl implementation, even though we didn't really need anything that pimpl provides here, so this kills the separate impl class and keeps just the plasma client and its interface. One note: the client needs `shared_from_this`, and the old plasma client always held a shared ptr to the impl, so the raylet had to be refactored to hold a shared ptr to the plasma client so `shared_from_this` keeps working.

Other cleanup:

- a lot of the IPC functions always returned `status::ok`, so they were changed to return void
- some extra reserving of vectors and moving
- unnecessary consts in the pull manager that would prevent moves, etc.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
#57626) The test times out frequently in CI. Before this change, the test took `~40s` to run on my laptop. After the change, the test took `~15s` to run on my laptop. There also seems to be hanging related to in-order execution semantics, so for now flipping to `allow_out_of_order_exection=True`. @dayshah will add the `False` variant when he fixes the underlying issue. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This PR refactors the `TaskExecutionEvent` proto in two ways:
- Rename the file to `events_task_lifecycle_event.proto`
- Refactor the task_state from a map to an array of TaskState and
timestamp. Also rename the field to `state_transitions` for consistency.
This PR depends on the upstream to update their logic to consume this
new schema.
Test:
- CI
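To make the schema change concrete, here is an illustration (written as Python dataclasses rather than protobuf) of the before/after shapes described above; names follow the commit message, not the exact generated proto classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Before: TaskExecutionEvent carried a map keyed by task state, which cannot
# represent revisiting a state and does not preserve transition order.
@dataclass
class TaskExecutionEventShape:
    task_state: Dict[str, int] = field(default_factory=dict)  # state -> timestamp


# After: TaskLifecycleEvent carries an ordered list of transitions.
@dataclass
class StateTransition:
    state: str
    timestamp: int


@dataclass
class TaskLifecycleEventShape:
    state_transitions: List[StateTransition] = field(default_factory=list)
```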
---
> [!NOTE]
> Renames task execution event to task lifecycle event and changes its
schema from a state map to an ordered state_transitions list, updating
core, GCS, dashboard, builds, tests, and docs.
>
> - **Proto/API changes (breaking)**
> - Rename `TaskExecutionEvent` → `TaskLifecycleEvent` and update
`RayEvent.EventType` (`TASK_EXECUTION_EVENT` → `TASK_LIFECYCLE_EVENT`).
> - Replace `task_state` map with `state_transitions` (list of `{state,
timestamp}`) in `events_task_lifecycle_event.proto`.
> - Update `events_base_event.proto` field from `task_execution_event` →
`task_lifecycle_event` and imports/BUILD deps accordingly.
> - **Core worker**
> - Update buffer/conversion logic in
`src/ray/core_worker/task_event_buffer.{h,cc}` to populate/emit
`TaskLifecycleEvent` with `state_transitions`.
> - **GCS**
> - Update `GcsRayEventConverter` to consume `TASK_LIFECYCLE_EVENT` and
convert `state_transitions` to `state_ts_ns`.
> - **Dashboard/Aggregator**
> - Switch exposable type defaults/env to `TASK_LIFECYCLE_EVENT` in
`python/.../aggregator_agent.py`.
> - **Tests**
> - Adjust unit tests to new event/type and schema across core worker,
GCS, and dashboard.
> - **Docs**
> - Update event export guide references to new lifecycle event proto.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 61507e8.</sup>
Signed-off-by: Cuong Nguyen <can@anyscale.com>
- removing the capability enum as it is not being used; for more details: #56707 (comment)

---------

Signed-off-by: harshit <harshit@anyscale.com>
… default (#57623)

Previously we were using `DeprecationWarning` which is silenced by default. Now this is printed:

```
>>> ray.init(local_mode=True)
/Users/eoakes/code/ray/python/ray/_private/client_mode_hook.py:104: FutureWarning: `local_mode` is an experimental feature that is no longer maintained and will be removed in the near future. For debugging consider using the Ray distributed debugger.
  return func(*args, **kwargs)
```

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
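The behavior relied on here is standard CPython: the default warning filters ignore `DeprecationWarning` unless it is attributed to code in `__main__`, while `FutureWarning` is always shown. A small standalone illustration (not Ray code):

```python
import warnings


def old_api_deprecation():
    # With stacklevel=2 the warning is attributed to the caller. If the
    # caller lives in a library module (not __main__), the default filters
    # silence this warning entirely.
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)


def old_api_future():
    # FutureWarning is not silenced by the default filters, so end users see
    # it no matter where the call originates.
    warnings.warn(
        "`local_mode` is an experimental feature that is no longer maintained",
        FutureWarning,
        stacklevel=2,
    )
```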
- adding a new note about using the filesystem as a broker in Celery

---------

Signed-off-by: harshit <harshit@anyscale.com>
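For context on the note being added: Celery (through Kombu) ships a filesystem transport that exchanges messages via shared directories instead of a broker server, which can be handy for simple single-host or test setups. A minimal sketch, assuming the folder already exists:

```python
from celery import Celery

app = Celery("tasks", broker="filesystem://")
app.conf.broker_transport_options = {
    # Messages are written to and read from these folders; pointing both at
    # the same directory gives a simple local queue with no Redis/RabbitMQ.
    "data_folder_in": "/tmp/celery-broker",
    "data_folder_out": "/tmp/celery-broker",
}


@app.task
def add(x, y):
    return x + y
```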
## Description

Improved the Ray pull request template to make it less overwhelming for contributors while giving maintainers better information for reviews and release notes. The new template has clearer sections and organized checklists that are much easier to fill out. This should encourage more contributions while making the review process smoother and release note generation more straightforward.

## Related issues

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**

- [ ] Yes ⚠️
- [x] No

**Testing:**

- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**

- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**

- [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

---------

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…epseek support) (#56906) Signed-off-by: Jiang Wu <JWu@cclgroup.com> Signed-off-by: Jiang Wu <jwu@cclgroup.com> Co-authored-by: Nikhil G <nrghosh@users.noreply.github.com>
This PR refactors the operator metrics logging tests in `test_stats.py` to improve clarity, reliability, and maintainability.

- Replaced `test_op_metrics_logging` and `test_op_state_logging` with a single, more focused test: `test_executor_logs_metrics_on_operator_completion`
- Uses pytest's `caplog` fixture instead of mocking the logger (more idiomatic)
- Tests the core behavior (operator completion metrics logged exactly once) without depending on exact log message formatting
- Eliminates reliance on helper functions and complex string matching
- More descriptive test name following unit testing best practices
- Reduced test code complexity while maintaining coverage of critical logging behavior

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
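As a generic illustration of the `caplog`-based pattern described above (not the actual test from `test_stats.py`; the logger name and log message are made up):

```python
import logging


def finish_operator(logger: logging.Logger) -> None:
    # Stand-in for the executor behavior under test: metrics are logged
    # exactly once when an operator completes.
    logger.info("Operator InputDataBuffer completed. Metrics: {...}")


def test_executor_logs_metrics_on_operator_completion(caplog):
    logger_name = "ray.data._internal.execution"  # illustrative name
    with caplog.at_level(logging.INFO, logger=logger_name):
        finish_operator(logging.getLogger(logger_name))

    completion_records = [
        r for r in caplog.records if "completed" in r.getMessage()
    ]
    assert len(completion_records) == 1  # logged exactly once
```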
## Why are these changes needed?

as titled

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ray into add-external-scaler-enabled
Signed-off-by: harshit <harshit@anyscale.com>
Creating a new PR for the same change.

New PR link: #57727
- adding docs for the POST API; more details here: https://docs.google.com/document/d/1KtMUDz1O3koihG6eh-QcUqudZjNAX3NsqqOMYh3BoWA/edit?tab=t.0#heading=h.2vf4s2d7ca46
- also making changes for external scaler enabled in the existing Serve application docs, to be merged after #57554

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
This is being done as part of the custom autoscaling story.