[Data] Optimize autoscaler to support configurable step size for actor pool scaling #58726
Code Review
This pull request introduces a configurable upscaling step size for the actor pool autoscaler, which is a great improvement for scenarios requiring rapid scaling. The implementation is solid, calculating a desired scaling delta based on utilization and then capping it with various limits like resource budget, the newly configured max delta, and the pool's max size. The new configuration is well-documented, and the tests have been updated to cover the new logic. I have one suggestion to add a safeguard against a potential ZeroDivisionError if the upscaling threshold is misconfigured. Overall, this is a well-executed enhancement.
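The safeguard suggested in the review can be sketched as a small validation step. This is a minimal illustration, not the code from the PR: the helper name `safe_utilization_ratio` and its parameters are hypothetical, and it only assumes that the scaling delta is derived by dividing utilization by the upscaling threshold.

```python
def safe_utilization_ratio(utilization: float, upscaling_threshold: float) -> float:
    """Return utilization / threshold, rejecting a misconfigured threshold.

    Hypothetical helper: the name and signature are assumptions made for
    illustration and are not taken from the Ray code base.
    """
    if upscaling_threshold <= 0:
        # Fail with a clear message instead of a bare ZeroDivisionError.
        raise ValueError(
            f"upscaling_threshold must be positive, got {upscaling_threshold}"
        )
    return utilization / upscaling_threshold
```

Raising early with a descriptive message is one option; clamping the threshold to a small positive value would be another, at the cost of hiding the misconfiguration.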
Bug: Existing test expectations incompatible with new reason message format
The test expects an exact match on the reason message format "utilization of 1.5 >= 1.0", but the code now generates a longer, more detailed format including plan_delta, max_scale_up, max_upscaling_delta, and final_delta. The test assertions checking exact reason strings will fail because they weren't updated to match the new message format. This causes multiple test cases in test_actor_pool_scaling() to fail (e.g., lines 91-95 and lines 162-167).
python/ray/data/tests/test_autoscaler.py#L90-L95 (lines 90 to 95 in 191de57):
    assert autoscaler._derive_target_scaling_config(
        actor_pool=actor_pool,
        op=op,
        op_state=op_state,
    ) == ActorPoolScalingRequest(delta=delta, force=force, reason=expected_reason)
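One way to make such assertions less brittle, besides updating the expected strings, is to compare the numeric fields exactly and only the leading part of the reason. The sketch below is an assumption-laden illustration: `ScalingRequest` is a stand-in dataclass with the three fields visible in the assertion above, `assert_request_matches` is a hypothetical helper, and the longer reason string in the usage example is invented to show the idea, not copied from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRequest:
    """Stand-in for the request object compared in the test (hypothetical)."""

    delta: int
    force: bool
    reason: str


def assert_request_matches(
    actual: ScalingRequest, *, delta: int, force: bool, reason_prefix: str
) -> None:
    """Check delta and force exactly, but only the start of the reason string."""
    assert actual.delta == delta, f"delta {actual.delta} != {delta}"
    assert actual.force == force, f"force {actual.force} != {force}"
    assert actual.reason.startswith(reason_prefix), actual.reason


# Usage with an invented, more detailed reason message.
request = ScalingRequest(
    delta=2,
    force=False,
    reason="utilization of 1.5 >= 1.0 (plan_delta=5, final_delta=2)",
)
assert_request_matches(
    request, delta=2, force=False, reason_prefix="utilization of 1.5 >= 1.0"
)
```

The trade-off is that a prefix check no longer verifies the extra detail in the message, so exact-match expectations may still be preferable where the full text matters.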
…r pool scaling Signed-off-by: dragongu <andrewgu@vip.qq.com>
Summary
Add support for configurable upscaling step size in the actor pool autoscaler. This enables rapid scale-up and efficient resource utilization by allowing the autoscaler to scale up multiple actors at once, instead of scaling up one actor at a time.
Description
Background
Currently, the actor pool autoscaler scales up actors one at a time, which can be slow in certain scenarios:
1. Slow actor startup: When actor initialization logic is complex, actors may remain in pending state for extended periods. The autoscaler skips scaling when it encounters pending actors, preventing further scaling.
2. Elastic cluster with unstable resources: In environments where available resources are uncertain, users often configure large concurrency ranges (e.g., (10, 1000)) for map_batches. In these cases, rapid startup and scaling are critical to utilize available resources efficiently.
Solution
This PR adds support for configurable upscaling step size in the actor pool autoscaler. Instead of always scaling up by 1 actor at a time, the autoscaler can now scale up multiple actors based on utilization metrics, while respecting resource constraints.
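The capped step computation described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the function `compute_upscale_delta` and the parameter `budget_actors` are hypothetical, the formula for `plan_delta` is an assumed proportional rule, and only the names `plan_delta`, `max_scale_up`, `max_upscaling_delta`, and `final_delta` are borrowed from the reason message mentioned in the review thread.

```python
import math


def compute_upscale_delta(
    current_size: int,
    utilization: float,
    upscaling_threshold: float,
    max_upscaling_delta: int,
    max_size: int,
    budget_actors: int,
) -> int:
    """Return how many actors to add in one autoscaling step.

    Hypothetical helper. Assumes upscaling_threshold has already been
    validated as positive by the caller.
    """
    if utilization < upscaling_threshold:
        return 0  # Not busy enough to scale up.

    # Assumed rule: grow the pool until utilization would drop back to the
    # threshold, and always request at least one actor once it is reached.
    plan_delta = math.ceil(current_size * (utilization / upscaling_threshold - 1))
    plan_delta = max(1, plan_delta)

    # Headroom left before the pool reaches its configured maximum size.
    max_scale_up = max_size - current_size

    # Apply every cap: configured step size, pool headroom, resource budget.
    final_delta = min(plan_delta, max_upscaling_delta, max_scale_up, budget_actors)
    return max(0, final_delta)


# A pool of 10 actors at 1.5x the threshold wants 5 more, capped to a step of 4.
print(compute_upscale_delta(10, 1.5, 1.0, 4, 1000, 100))  # prints 4
```

With `max_upscaling_delta` set to 1, this sketch falls back to adding one actor per step, which matches the previous behavior described in the background.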
Related issues