
Conversation

@harshit-anyscale
Contributor

  • adding an external scaler enabled flag in the application config, which dictates whether external scalers are allowed to update the number of replicas for an application.

This is being done as part of the custom autoscaling story.
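
For illustration only, a minimal sketch of where such a flag could sit in an application config; the exact field name and placement are taken from this PR's description, everything else here is an assumption, not the final schema.

```python
# Hypothetical application config shape with the new flag (assumed placement).
app_config = {
    "name": "my_app",
    "import_path": "my_module:app",
    # If True, an external scaler may update the replica count for this
    # application's deployments; if False, such updates are rejected.
    "external_scaler_enabled": True,
    "deployments": [{"name": "MyDeployment"}],
}
```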

Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner October 8, 2025 11:31
@harshit-anyscale harshit-anyscale self-assigned this Oct 8, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an external_scaler_enabled flag, which is a great addition for custom autoscaling. The implementation is mostly solid, but I've found a few critical and high-severity issues related to correctness that should be addressed. Specifically, there are issues with the order of operations in the controller, a bug in how the new flag is passed in application_state.py, and incorrect boolean logic in deployment_info.py. I've also included a medium-severity suggestion to improve code style. Please review my comments for details and suggested fixes.

@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Oct 8, 2025
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Oct 8, 2025
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
harshit-anyscale and others added 3 commits October 14, 2025 17:49
Signed-off-by: harshit <harshit@anyscale.com>
and allow a test to match one of the filters in the same attribute
instead of reporting a mismatch if any of the filters fail

Signed-off-by: kevin <kevin@anyscale.com>
aslonnie and others added 26 commits October 14, 2025 18:04
…57663)

also fixes typo in build file

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
… Troubleshooting Guide (#55236)

## Why are these changes needed?
Some users may not know how to configure the `ReconcileConcurrency` in
kuberay.

Docs link:
https://anyscale-ray--55236.com.readthedocs.build/en/55236/cluster/kubernetes/troubleshooting/troubleshooting.html#how-to-configure-reconcile-concurrency-when-there-are-large-mount-of-crs

ray-project/kuberay#3909

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
…ses. (#57269)

This PR stacks on #57244.

For more details about the resource isolation project see
#54703.

In the previous ray cgroup hierarchy, all processes that were in the
path `--cgroup-path` were moved into the system cgroup. This changes the
hierarchy to now have a separate cgroup for all non-ray processes.

The new cgroup hierarchy looks like
```
      cgroup_path (e.g. /sys/fs/cgroup)
            |
    ray-node_<node_id>
    |                 |
  system             user
    |               |    |
  leaf        workers  non-ray
```

The cgroups contain the following processes
* system/leaf (all ray non-worker processes e.g. raylet,
runtime_env_agent, gcs_server, ...)
* user/workers (all ray worker processes)
* user/non-ray (all non-ray processes migrated from cgroup_path).

Note: If you're running ray inside a container, all non-ray processes
running in the container will be migrated to `user/non-ray`

The following controllers will be enabled
* cgroup_path (cpu, memory)
* ray-node_<node_id> (cpu, memory)
* system (memory)

The following constraints are applied
* system (cpu.weight, memory.min)
* user (cpu.weight)
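
As a quick sanity check of the hierarchy above, a process's cgroup can be read from `/proc`; a minimal sketch, assuming cgroup v2 and the layout described in this PR (this helper is illustrative, not part of the change):

```python
from pathlib import Path

def cgroup_of(pid: int) -> str:
    # /proc/<pid>/cgroup on cgroup v2 has a single line like
    # "0::/ray-node_<node_id>/user/workers"; return the path portion.
    line = Path(f"/proc/{pid}/cgroup").read_text().strip().splitlines()[0]
    return line.split("::", 1)[1]

# Example: inside a Ray worker this should end with "user/workers", while a
# non-ray process started in the same container should end with "user/non-ray".
```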

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
… to fix over-provisioning (#57130)

This PR is related to
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864)
The v1 autoscaler monitor currently pulls metrics from two
different modules in GCS:

- **`GcsResourceManager`** (v1, legacy): manages `node_resource_usages_`
and updates it at two different intervals (`UpdateNodeResourceUsage`
every 0.1s, `UpdateResourceLoads` every 1s).
- **`GcsAutoscalerStateManager`** (v2): manages `node_resource_info_`
and updates it via `UpdateResourceLoadAndUsage`. This module is already
the source for the v2 autoscaler.

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| current cluster resources | RaySyncer | `GcsResourceManager::UpdateNodeResourceUsage` | 100ms (`raylet_report_resources_period_milliseconds`) | [gcs_resource_manager.cc#L170](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_resource_manager.cc#L170) |
| current pending resources | GcsServer | `GcsResourceManager::UpdateResourceLoads` | 1s (`gcs_pull_resource_loads_period_milliseconds`) | [gcs_server.cc#L422](https://github.com/ray-project/ray/blob/main/src/ray/gcs/gcs_server.cc#L422) |

Because these two modules update asynchronously, the autoscaler can end
up seeing inconsistent resource states. That causes a race condition
where extra nodes may be launched before the updated availability
actually shows up. In practice, this means clusters can become
over-provisioned even though the demand was already satisfied.

In the long run, the right fix is to fully switch the v1 autoscaler over
to GcsAutoscalerStateManager::HandleGetClusterResourceState, just like
v2 already does. But since v1 will eventually be deprecated, this PR
takes a practical interim step: it merges the necessary info from both
GcsResourceManager::HandleGetAllResourceUsage and
GcsAutoscalerStateManager::HandleGetClusterResourceState in a hybrid
approach.

This keeps v1 correct without big changes, while still leaving the path
open for a clean migration to v2 later on.

## Details

This PR follows the fix suggested by @rueian in #52864 by switching the
v1 autoscaler's node state source from
`GcsResourceManager::HandleGetAllResourceUsage` to
`GcsAutoscalerStateManager::HandleGetClusterResourceState`.

Root Cause: The v1 autoscaler was previously getting data from two
asynchronous update cycles:

- Node resources: updated every ~100ms via `UpdateNodeResourceUsage`
- Resource demands: updated every ~1s via `UpdateResourceLoads`

This created a race condition where newly allocated resources would be
visible before demand metrics updated, causing the autoscaler to
incorrectly perceive unmet demand and launch extra nodes.

The Fix: By using v2's `HandleGetClusterResourceState` for node
iteration, both current resources and pending demands now come from the
same consistent snapshot (same tick), so the extra-node race condition
goes away.

## Proposed Changes in update_load_metrics()

This PR updates how the v1 autoscaler collects cluster metrics.

Most node state information is now taken from **v2
(`GcsAutoscalerStateManager::HandleGetClusterResourceState`)**, while
certain fields still rely on **v1
(`GcsResourceManager::HandleGetAllResourceUsage`)** because v2 doesn't
have an equivalent yet.

| Field | Source (before, v1) | Source (after) | Change? | Notes |
| --- | --- | --- | --- | --- |
| Node states (id, ip, resources, idle duration) | [gcs.proto#L526-L527](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/gcs.proto#L526-L527) (`resources_batch_data.batch`) | [autoscaler.proto#L206-L212](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/protobuf/autoscaler.proto#L206-L212) (`cluster_resource_state.node_states`) | O | Now aligned with v2. Verified no regressions in tests. |
| waiting_bundles / infeasible_bundles | `resource_load_by_shape` | same as before | X | v2 does not separate ready vs infeasible requests. Still needed for metrics/debugging. |
| pending_placement_groups | `placement_group_load` | same as before | X | No validated equivalent in v2 yet. May migrate later. |
| cluster_full | response flag (`cluster_full_of_actors_detected`) | same as before | X | No replacement in v2 fields, so kept as is. |
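
To make the hybrid read concrete, a rough sketch of the merge (plain dicts standing in for the proto replies; the field names follow the table above, everything else is illustrative, not the actual monitor code):

```python
def build_load_metrics(v2_state: dict, v1_usage: dict) -> dict:
    """Combine one v2 snapshot with the v1-only fields (illustrative only)."""
    return {
        # Node states (id, ip, resources, idle duration) now come from the
        # v2 snapshot, so resources and demands reflect the same tick.
        "nodes": v2_state["node_states"],
        # Fields with no v2 equivalent yet are still read from the legacy reply.
        "waiting_and_infeasible_bundles": v1_usage["resource_load_by_shape"],
        "pending_placement_groups": v1_usage["placement_group_load"],
        "cluster_full": v1_usage["cluster_full_of_actors_detected"],
    }
```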

### Additional Notes

- This hybrid approach addresses the race condition while still using
legacy fields where v2 has no equivalent.
- All existing autoscaler monitor tests still pass, which shows that the
change is backward-compatible and does not break existing behavior.

## Changed Behavior (Observed)

(Autoscaler config & serving code are same as this
[https://github.com/ray-project/ray/issues/52864](https://github.com/ray-project/ray/issues/52864))

After switching to v2 autoscaler state (cluster resource), the issue no
longer occurs:

- Even with `gcs_pull_resource_loads_period_milliseconds=20000`, Node
Provider only launches a single `ray.worker.4090.standard` node. (No
extra requests for additional nodes are observed.)


[debug.log](https://github.com/user-attachments/files/22659163/debug.log)

## Related issue number

Closes #52864 

Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
towards fixing test_in_docker for windows

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
as cluster envs are anyscale specific concepts

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
out of `util.py`. also adding its own `py_library`

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

I've encountered an issue where Ray sends SIGKILL to child processes
launched by a Ray actor (the grandchild will not receive the signal). As a
result, the subprocess cannot catch the signal to gracefully clean up
its child processes, so the grandchild processes of the actor leak.
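
Standalone sketch of the failure mode, not Ray code: the cleanup handler below only runs for catchable signals such as SIGTERM; if the intermediate process is sent SIGKILL instead, the handler never runs and the grandchild keeps running.

```python
import signal
import subprocess
import sys

# The "grandchild": a process spawned by the actor's child process.
grandchild = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])

def cleanup(signum, frame):
    # Runs only for catchable signals; SIGKILL bypasses this entirely,
    # which is why the grandchild leaks in the scenario described above.
    grandchild.terminate()
    grandchild.wait()
    sys.exit(0)

signal.signal(signal.SIGTERM, cleanup)
signal.pause()  # wait here until a signal arrives (Unix only)
```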

I'm glad to see #56476 by
@codope, and I also built a similar solution myself. This PR adds the
case I encountered.

@codope why not enable this feature by default?

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
into `anyscale_job_runner`. it is only used in `anyscale_job_runner` now

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
it is only used in glue.py

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
should all be using hermetic python with python 3.8 or above now

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
make it so that `test_in_docker` does not depend on the entire `ray_release`
library, but only on the python files required for the test db to work.
This removes the dependency on the `cryptography` library from `ray_ci`,
so that windows wheels can be built and windows tests can run again.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ystem reserved resources (#57653)

Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: israbbani <israbbani@gmail.com>
Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: harshit <harshit@anyscale.com>
Cleaning out plasma, the plasma client, and its neighbors.

The plasma client had a pimpl implementation, even though we didn't
really need anything pimpl would give us here. So this removes the
separate impl class and keeps just the plasma client and its interface.
One note: the client needs `shared_from_this`, and the old plasma client
always held a shared ptr to the impl, so the raylet had to be refactored
to hold a shared ptr to the plasma client so we can keep using
`shared_from_this`.

Other cleanup:
- a lot of the ipc functions always returned status::ok so changed to
void
- some extra reserving of vectors and moving
- unnecessary consts in pull manager that would prevent moves
etc.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
#57626)

The test times out frequently in CI.

Before this change, the test took `~40s` to run on my laptop. After the
change, the test took `~15s` to run on my laptop.

There also seems to be hanging related to in-order execution semantics,
so for now flipping to `allow_out_of_order_execution=True`. @dayshah will
add the `False` variant when he fixes the underlying issue.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This PR refactors the `TaskExecutionEvent` proto in two ways:
- Rename the file to `events_task_lifecycle_event.proto`
- Refactor the task_state from a map to an array of TaskState and
timestamp. Also rename the field to `state_transitions` for consistency.

This PR depends on the upstream to update their logic to consume this
new schema.
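
For intuition, the shape change could be pictured like this (plain Python dicts standing in for the proto messages; the state names and timestamps are only examples):

```python
# Before: one timestamp per state, keyed by state (no ordering, no repeats).
task_execution_event = {
    "task_state": {"PENDING_NODE_ASSIGNMENT": 1700000000001, "RUNNING": 1700000000005},
}

# After: an ordered list of transitions, which preserves ordering and allows
# the same state to appear more than once.
task_lifecycle_event = {
    "state_transitions": [
        {"state": "PENDING_NODE_ASSIGNMENT", "timestamp": 1700000000001},
        {"state": "RUNNING", "timestamp": 1700000000005},
    ],
}
```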

Test:
- CI

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Renames task execution event to task lifecycle event and changes its
schema from a state map to an ordered state_transitions list, updating
core, GCS, dashboard, builds, tests, and docs.
> 
> - **Proto/API changes (breaking)**
> - Rename `TaskExecutionEvent` → `TaskLifecycleEvent` and update
`RayEvent.EventType` (`TASK_EXECUTION_EVENT` → `TASK_LIFECYCLE_EVENT`).
> - Replace `task_state` map with `state_transitions` (list of `{state,
timestamp}`) in `events_task_lifecycle_event.proto`.
> - Update `events_base_event.proto` field from `task_execution_event` →
`task_lifecycle_event` and imports/BUILD deps accordingly.
> - **Core worker**
> - Update buffer/conversion logic in
`src/ray/core_worker/task_event_buffer.{h,cc}` to populate/emit
`TaskLifecycleEvent` with `state_transitions`.
> - **GCS**
> - Update `GcsRayEventConverter` to consume `TASK_LIFECYCLE_EVENT` and
convert `state_transitions` to `state_ts_ns`.
> - **Dashboard/Aggregator**
> - Switch exposable type defaults/env to `TASK_LIFECYCLE_EVENT` in
`python/.../aggregator_agent.py`.
> - **Tests**
> - Adjust unit tests to new event/type and schema across core worker,
GCS, and dashboard.
> - **Docs**
>   - Update event export guide references to new lifecycle event proto.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
61507e8. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Signed-off-by: Cuong Nguyen <can@anyscale.com>
- removing the capability enum as it is not being used; for more
details:
#56707 (comment)

---------

Signed-off-by: harshit <harshit@anyscale.com>
… default (#57623)

Previously we were using `DeprecationWarning` which is silenced by
default. Now this is printed:
```
>>> ray.init(local_mode=True)
/Users/eoakes/code/ray/python/ray/_private/client_mode_hook.py:104: FutureWarning: `local_mode` is an experimental feature that is no longer maintained and will be removed in the near future. For debugging consider using the Ray distributed debugger.
  return func(*args, **kwargs)
```
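
The difference comes from Python's default warning filters rather than anything Ray-specific; a quick way to inspect them (standard library only):

```python
import warnings

# The default filters ignore DeprecationWarning unless it is triggered from
# __main__, so a warning raised inside a library like Ray is silenced.
# FutureWarning has no such default filter, so users always see it.
for f in warnings.filters:
    print(f)
```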

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
- adding a new note about using the filesystem as a broker in Celery

---------

Signed-off-by: harshit <harshit@anyscale.com>
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

<!-- Briefly describe what this PR accomplishes and why it's needed -->

Improved the Ray pull request template to make it less overwhelming for
contributors while giving maintainers better information for reviews and
release notes. The new template has clearer sections and organized
checklists that are much easier to fill out. This should encourage more
contributions while making the review process smoother and release note
generation more straightforward.

## Related issues

<!-- Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234" -->

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…epseek support) (#56906)

Signed-off-by: Jiang Wu <JWu@cclgroup.com>
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Co-authored-by: Nikhil G <nrghosh@users.noreply.github.com>
This PR refactors the operator metrics logging tests in `test_stats.py`
to improve clarity, reliability, and maintainability.

- Replaced `test_op_metrics_logging` and `test_op_state_logging` with a
single, more focused test:
`test_executor_logs_metrics_on_operator_completion`
- Uses pytest's `caplog` fixture instead of mocking the logger (more
idiomatic; see the sketch after this list)
- Tests the core behavior (operator completion metrics logged exactly
once) without depending on exact log message formatting
- Eliminates reliance on helper functions and complex string matching
- More descriptive test name following unit testing best practices
- Reduced test code complexity while maintaining coverage of critical
logging behavior
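
A minimal sketch of the `caplog` pattern, with illustrative names only (the real test targets Ray Data's executor logger in `test_stats.py`):

```python
import logging

logger = logging.getLogger("illustrative_executor")

def finish_operator():
    # Stand-in for the executor logging an operator's metrics on completion.
    logger.info("Operator completed. Metrics: ...")

def test_executor_logs_metrics_on_operator_completion(caplog):
    with caplog.at_level(logging.INFO, logger="illustrative_executor"):
        finish_operator()
    completion_records = [r for r in caplog.records if "completed" in r.getMessage()]
    # Logged exactly once, without depending on exact message formatting.
    assert len(completion_records) == 1
```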

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
as titled
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale
Contributor Author

Creating a new PR for the same change.

@harshit-anyscale
Contributor Author

new PR link: #57727

abrarsheikh pushed a commit that referenced this pull request Nov 24, 2025
- adding docs for POST API, here are more details:
https://docs.google.com/document/d/1KtMUDz1O3koihG6eh-QcUqudZjNAX3NsqqOMYh3BoWA/edit?tab=t.0#heading=h.2vf4s2d7ca46

- also, making changes for the external scaler enabled flag in the existing
Serve application docs

to be merged after #57554

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
