
Conversation

Contributor

@dragongu dragongu commented Nov 18, 2025

Summary

Add support for a configurable upscaling step size in the actor pool autoscaler. This enables rapid scale-up and more efficient resource utilization by letting the autoscaler add multiple actors at once instead of one at a time.

Description

Background

Currently, the actor pool autoscaler scales up actors one at a time, which can be slow in certain scenarios:

  1. Slow actor startup: When actor initialization is complex, actors can remain in the pending state for extended periods, and the autoscaler skips scaling whenever it sees pending actors, stalling further scale-up.

  2. Elastic cluster with unstable resources: In environments where available resources are uncertain, users often configure large concurrency ranges (e.g., (10, 1000)) for map_batches, as in the sketch below. In these cases, rapid startup and scaling are critical to utilizing available resources efficiently.
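As a concrete illustration of the second scenario, a job on an elastic cluster might look like the sketch below. `SlowInitUDF` is a hypothetical callable class invented for this example, not code from this PR:

```python
import time

import ray


class SlowInitUDF:
    """Hypothetical UDF whose heavy __init__ keeps new actors pending."""

    def __init__(self):
        time.sleep(30)  # simulate expensive setup (model loading, connections, ...)

    def __call__(self, batch):
        return batch


ds = ray.data.range(100_000).map_batches(
    SlowInitUDF,
    # Wide autoscaling range: start near 10 actors and grow toward 1000
    # as cluster resources become available.
    concurrency=(10, 1000),
    batch_size=64,
)
```

With a one-actor-at-a-time policy, reaching the upper end of this range can take a long time; a larger upscaling step shortens that ramp-up.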

Solution

This PR adds a configurable upscaling step size to the actor pool autoscaler. Instead of always scaling up by one actor at a time, the autoscaler can now add multiple actors per scaling decision based on utilization metrics, while respecting resource constraints.
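The capping logic can be pictured as follows. This is a minimal sketch under assumed names; `max_upscaling_delta`, `budget_limit`, and the proportional-growth formula are illustrative, not the PR's actual identifiers or math:

```python
import math


def derive_upscaling_delta(
    utilization: float,
    upscaling_threshold: float,
    current_size: int,
    max_pool_size: int,
    max_upscaling_delta: int,
    budget_limit: int,
) -> int:
    """Compute a utilization-driven scale-up delta, then clamp it."""
    if utilization < upscaling_threshold:
        return 0  # below threshold: no scale-up
    # Grow proportionally to how far utilization exceeds the threshold,
    # but always by at least one actor.
    desired = max(1, math.ceil(current_size * (utilization / upscaling_threshold - 1)))
    # Cap by the configured step size, the remaining resource budget,
    # and the pool's maximum size.
    return max(0, min(desired, max_upscaling_delta, budget_limit,
                      max_pool_size - current_size))
```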

Related issues

@dragongu dragongu requested a review from a team as a code owner November 18, 2025 02:09
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a configurable upscaling step size for the actor pool autoscaler, which is a great improvement for scenarios requiring rapid scaling. The implementation is solid: it calculates a desired scaling delta based on utilization and then caps it with several limits, namely the resource budget, the newly configured max delta, and the pool's max size. The new configuration is well documented, and the tests have been updated to cover the new logic. I have one suggestion: add a safeguard against a potential ZeroDivisionError if the upscaling threshold is misconfigured. Overall, this is a well-executed enhancement.
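A minimal sketch of the safeguard being suggested, assuming the delta computation divides by the configured threshold somewhere (names are illustrative, not the PR's actual code):

```python
def safe_scaling_ratio(utilization: float, upscaling_threshold: float) -> float:
    """Validate the threshold before dividing, instead of risking ZeroDivisionError."""
    if upscaling_threshold <= 0:
        raise ValueError(
            f"upscaling threshold must be positive, got {upscaling_threshold!r}"
        )
    return utilization / upscaling_threshold
```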

@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from ff5e52f to 80b0d68 Compare November 18, 2025 02:46
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from 80b0d68 to 1deec25 Compare November 18, 2025 02:59
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 18, 2025
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from 1deec25 to 191de57 Compare November 19, 2025 13:59

@cursor cursor bot left a comment


Bug: Existing test expectations incompatible with new reason message format

The test expects an exact match on the reason message format "utilization of 1.5 >= 1.0", but the code now generates a longer, more detailed format including plan_delta, max_scale_up, max_upscaling_delta, and final_delta. The test assertions checking exact reason strings will fail because they weren't updated to match the new message format. This causes multiple test cases in test_actor_pool_scaling() to fail (e.g., lines 91-95 and lines 162-167).

python/ray/data/tests/test_autoscaler.py#L90-L95

assert autoscaler._derive_target_scaling_config(
    actor_pool=actor_pool,
    op=op,
    op_state=op_state,
) == ActorPoolScalingRequest(delta=delta, force=force, reason=expected_reason)

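One possible way to make these assertions resilient, assuming the new reason string still begins with the old utilization fragment (a sketch, not the fix actually applied in the PR):

```python
request = autoscaler._derive_target_scaling_config(
    actor_pool=actor_pool,
    op=op,
    op_state=op_state,
)
assert request.delta == delta
assert request.force == force
# Match only the stable prefix so appended diagnostics (plan_delta,
# max_upscaling_delta, final_delta, ...) don't break the test.
assert request.reason.startswith(expected_reason)
```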


@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch 2 times, most recently from 29c8a16 to 7939eed Compare November 19, 2025 14:04
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch 2 times, most recently from 251ec34 to 38b5207 Compare November 19, 2025 16:14
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from 38b5207 to cae9cca Compare November 19, 2025 23:38
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from cae9cca to b8eb4dd Compare November 20, 2025 01:29
…r pool scaling

Signed-off-by: dragongu <andrewgu@vip.qq.com>
@dragongu dragongu force-pushed the feature/support_autoscaling_with_step branch from b8eb4dd to 0b594c8 Compare November 20, 2025 03:04
@bveeramani bveeramani enabled auto-merge (squash) November 20, 2025 22:36
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 20, 2025
@bveeramani bveeramani merged commit 8147683 into ray-project:master Nov 20, 2025
8 checks passed
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
…r pool scaling (ray-project#58726)

Signed-off-by: dragongu <andrewgu@vip.qq.com>