
Conversation

Contributor

@dragongu dragongu commented Nov 18, 2025

Summary

Add support for a configurable upscaling step size in the actor pool autoscaler. This enables rapid scale-up and more efficient resource utilization by letting the autoscaler add multiple actors at once instead of one at a time.

Description

Background

Currently, the actor pool autoscaler scales up actors one at a time, which can be slow in certain scenarios:

  1. Slow actor startup: When actor initialization is complex, actors can remain in the pending state for extended periods, and the autoscaler skips scaling whenever it sees pending actors, stalling further scale-up.

  2. Elastic cluster with unstable resources: In environments where available resources are uncertain, users often configure large concurrency ranges (e.g., (10, 1000)) for map_batches, as in the sketch below. In these cases, rapid startup and scaling are critical to utilizing available resources efficiently.
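As a concrete illustration of the second scenario, a job on an elastic cluster might look like the sketch below. `SlowInitUDF` is a hypothetical callable class invented for this example, not code from this PR:

```python
import time

import ray


class SlowInitUDF:
    """Hypothetical UDF whose heavy __init__ keeps new actors pending."""

    def __init__(self):
        time.sleep(30)  # simulate expensive setup (model loading, connections, ...)

    def __call__(self, batch):
        return batch


ds = ray.data.range(100_000).map_batches(
    SlowInitUDF,
    # Wide autoscaling range: start near 10 actors and grow toward 1000
    # as cluster resources become available.
    concurrency=(10, 1000),
    batch_size=64,
)
```

With a one-actor-at-a-time policy, reaching the upper end of this range can take a long time; a larger upscaling step shortens that ramp-up.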

Solution

This PR adds a configurable upscaling step size to the actor pool autoscaler. Instead of always scaling up by one actor at a time, the autoscaler can now add multiple actors per scaling decision based on utilization metrics, while respecting resource constraints.
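The capping logic can be pictured as follows. This is a minimal sketch under assumed names; `max_upscaling_delta`, `budget_limit`, and the proportional-growth formula are illustrative, not the PR's actual identifiers or math:

```python
import math


def derive_upscaling_delta(
    utilization: float,
    upscaling_threshold: float,
    current_size: int,
    max_pool_size: int,
    max_upscaling_delta: int,
    budget_limit: int,
) -> int:
    """Compute a utilization-driven scale-up delta, then clamp it."""
    if utilization < upscaling_threshold:
        return 0  # below threshold: no scale-up
    # Grow proportionally to how far utilization exceeds the threshold,
    # but always by at least one actor.
    desired = max(1, math.ceil(current_size * (utilization / upscaling_threshold - 1)))
    # Cap by the configured step size, the remaining resource budget,
    # and the pool's maximum size.
    return max(0, min(desired, max_upscaling_delta, budget_limit,
                      max_pool_size - current_size))
```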

Related issues

@dragongu dragongu requested a review from a team as a code owner November 18, 2025 02:09
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a configurable upscaling step size for the actor pool autoscaler, which is a great improvement for scenarios requiring rapid scaling. The implementation is solid: it calculates a desired scaling delta based on utilization and then caps it with several limits, namely the resource budget, the newly configured max delta, and the pool's max size. The new configuration is well documented, and the tests have been updated to cover the new logic. I have one suggestion: add a safeguard against a potential ZeroDivisionError if the upscaling threshold is misconfigured. Overall, this is a well-executed enhancement.
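A minimal sketch of the safeguard being suggested, assuming the delta computation divides by the configured threshold somewhere (names are illustrative, not the PR's actual code):

```python
def safe_scaling_ratio(utilization: float, upscaling_threshold: float) -> float:
    """Validate the threshold before dividing, instead of risking ZeroDivisionError."""
    if upscaling_threshold <= 0:
        raise ValueError(
            f"upscaling threshold must be positive, got {upscaling_threshold!r}"
        )
    return utilization / upscaling_threshold
```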

@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from ff5e52f to 80b0d68 Compare November 18, 2025 02:46
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from 80b0d68 to 1deec25 Compare November 18, 2025 02:59
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 18, 2025
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from 1deec25 to 191de57 Compare November 19, 2025 13:59

@cursor cursor bot left a comment


Bug: Existing test expectations incompatible with new reason message format

The test expects an exact match on the reason message format "utilization of 1.5 >= 1.0", but the code now generates a longer, more detailed format including plan_delta, max_scale_up, max_upscaling_delta, and final_delta. The test assertions checking exact reason strings will fail because they weren't updated to match the new message format. This causes multiple test cases in test_actor_pool_scaling() to fail (e.g., lines 91-95 and lines 162-167).

python/ray/data/tests/test_autoscaler.py#L90-L95

assert autoscaler._derive_target_scaling_config(
    actor_pool=actor_pool,
    op=op,
    op_state=op_state,
) == ActorPoolScalingRequest(delta=delta, force=force, reason=expected_reason)

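One possible way to make these assertions resilient, assuming the new reason string still begins with the old utilization fragment (a sketch, not the fix actually applied in the PR):

```python
request = autoscaler._derive_target_scaling_config(
    actor_pool=actor_pool,
    op=op,
    op_state=op_state,
)
assert request.delta == delta
assert request.force == force
# Match only the stable prefix so appended diagnostics (plan_delta,
# max_upscaling_delta, final_delta, ...) don't break the test.
assert request.reason.startswith(expected_reason)
```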


@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch 2 times, most recently from 29c8a16 to 7939eed Compare November 19, 2025 14:04
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch 2 times, most recently from 251ec34 to 38b5207 Compare November 19, 2025 16:14
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from 38b5207 to cae9cca Compare November 19, 2025 23:38
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from cae9cca to b8eb4dd Compare November 20, 2025 01:29
…r pool scaling

Signed-off-by: dragongu <andrewgu@vip.qq.com>
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from b8eb4dd to 0b594c8 Compare November 20, 2025 03:04
@bveeramani bveeramani enabled auto-merge (squash) November 20, 2025 22:36
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 20, 2025
@bveeramani bveeramani merged commit 8147683 into ray-project:master Nov 20, 2025
8 checks passed
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
…r pool scaling (ray-project#58726)

Signed-off-by: dragongu <andrewgu@vip.qq.com>