Conversation

@huleilei
Contributor

This commit implements the dynamic scaling down (scale-in) functionality for `RaySwordfishActor` to release idle resources.

Key changes:

  • Implement `retire_idle_ray_workers` in `RayWorkerManager` to identify and release idle workers.
  • Add `pending_release_blacklist` to track retiring workers and prevent them from being reused or causing "worker died" errors (a sketch of this mechanism follows this description).
  • Move scale-down cooldown logic to `RayWorkerManager` to prevent frequent scale-down operations.
  • Optimize `retire_idle_ray_workers` to reduce lock contention by releasing the lock before performing Ray/Python operations.
  • Update `try_autoscale` in `flotilla.py` to support empty resource requests, enabling Ray to scale down resources.
  • Fix unit tests in `src/daft-distributed/src/scheduling/worker.rs` and ensure compatibility with the scheduler loop.

This addresses the issue where `udfActor` could not dynamically scale down and prevents "worker died" errors during graceful shutdown.
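
For illustration, here is a minimal Rust sketch of how a pending-release blacklist with a TTL could look; the `PendingReleaseBlacklist` type, its fields, and `is_excluded` are placeholders for this example, not the actual definitions in `RayWorkerManager`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative worker identifier; the real type lives in daft-distributed.
type WorkerId = String;

/// Tracks workers that are being retired so they are neither handed new tasks
/// nor treated as unexpectedly dead ("worker died") when Ray removes them.
struct PendingReleaseBlacklist {
    ttl: Duration,
    entries: HashMap<WorkerId, Instant>,
}

impl PendingReleaseBlacklist {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Record a worker as pending release at the current time.
    fn add(&mut self, worker_id: WorkerId) {
        self.entries.insert(worker_id, Instant::now());
    }

    /// Prune entries older than the TTL, then report whether the worker is
    /// still excluded from scheduling and respawn bookkeeping.
    fn is_excluded(&mut self, worker_id: &WorkerId) -> bool {
        let ttl = self.ttl;
        self.entries.retain(|_, retired_at| retired_at.elapsed() < ttl);
        self.entries.contains_key(worker_id)
    }
}

fn main() {
    // Assumed TTL of 120s, matching the default described in this PR.
    let mut blacklist = PendingReleaseBlacklist::new(Duration::from_secs(120));
    blacklist.add("worker-node-1".to_string());
    assert!(blacklist.is_excluded(&"worker-node-1".to_string()));
}
```

Pruning expired entries on lookup keeps the structure self-cleaning without needing a background task.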

@greptile-apps
Contributor

greptile-apps bot commented Dec 31, 2025

Greptile Summary

This PR implements dynamic scale-in functionality for RaySwordfishActor to automatically release idle Ray workers and reduce resource consumption when cluster capacity is no longer needed.

Key Changes:

  • Added `retire_idle_ray_workers` method to `RayWorkerManager` that identifies idle workers (based on configurable idle threshold) and gracefully releases them
  • Introduced `pending_release_blacklist` mechanism to prevent Ray from immediately respawning workers that are being retired, with a configurable TTL (default 120s)
  • Added `ActorState` enum and idle duration tracking to `RaySwordfishWorker` for state management (a state-tracking sketch follows this list)
  • Integrated downscale logic into the scheduler loop to automatically retire idle workers while maintaining a minimum survivor count (default 1)
  • Updated `try_autoscale` to support empty resource requests and call `clear_autoscaling_requests()` to signal Ray that resources can be scaled down
  • Added comprehensive test coverage for the new functionality
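
As a rough illustration of the state tracking described above, the sketch below models an idle timer keyed off a lifecycle enum. The variant names and the `WorkerIdleTracker` helper are assumptions made for this example; the actual `ActorState` and `RaySwordfishWorker` fields in worker.rs may differ.

```rust
use std::time::{Duration, Instant};

/// Illustrative lifecycle states; only Active/Idle are exercised below.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ActorState {
    Active,
    Idle,
    Releasing,
    Released,
}

/// Hypothetical helper that tracks how long a worker has been idle.
struct WorkerIdleTracker {
    state: ActorState,
    idle_since: Option<Instant>,
}

impl WorkerIdleTracker {
    fn new() -> Self {
        Self { state: ActorState::Active, idle_since: None }
    }

    /// Called whenever the number of in-flight tasks changes.
    fn on_active_tasks(&mut self, active_tasks: usize) {
        if active_tasks == 0 {
            if self.state == ActorState::Active {
                self.state = ActorState::Idle;
                self.idle_since = Some(Instant::now());
            }
        } else {
            self.state = ActorState::Active;
            self.idle_since = None;
        }
    }

    /// How long the worker has been continuously idle, if it is idle at all.
    fn idle_duration(&self) -> Option<Duration> {
        self.idle_since.map(|t| t.elapsed())
    }

    /// A worker qualifies for retirement once idle past the threshold.
    fn is_retirable(&self, threshold: Duration) -> bool {
        self.idle_duration().map_or(false, |d| d >= threshold)
    }
}

fn main() {
    let mut tracker = WorkerIdleTracker::new();
    tracker.on_active_tasks(0); // the worker just finished its last task
    println!("retirable: {}", tracker.is_retirable(Duration::from_secs(60)));
}
```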

Configuration (a parsing sketch follows this list):

  • `DAFT_AUTOSCALING_DOWNSCALE_ENABLED`: Enable/disable downscaling (default: true)
  • `DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS`: Idle threshold before retirement (default: 60s)
  • `DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS`: Minimum workers to keep (default: 1)
  • `DAFT_AUTOSCALING_PENDING_RELEASE_EXCLUDE_SECONDS`: Blacklist TTL (default: 120s)
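
A minimal sketch of how these variables and their defaults could be wired up, assuming plain `std::env` parsing with fallback values; the `env_or` helper and `DownscaleConfig` struct are illustrative, not the actual code in scheduler_actor.rs.

```rust
use std::env;
use std::time::Duration;

/// Read an environment variable, falling back to a default when it is unset
/// or fails to parse.
fn env_or<T: std::str::FromStr>(name: &str, default: T) -> T {
    env::var(name)
        .ok()
        .and_then(|v| v.parse::<T>().ok())
        .unwrap_or(default)
}

/// Hypothetical bundle of downscaling settings with the defaults listed above.
struct DownscaleConfig {
    enabled: bool,
    idle_threshold: Duration,
    min_survivor_workers: usize,
    pending_release_ttl: Duration,
}

impl DownscaleConfig {
    fn from_env() -> Self {
        Self {
            enabled: env_or("DAFT_AUTOSCALING_DOWNSCALE_ENABLED", true),
            idle_threshold: Duration::from_secs(env_or(
                "DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS",
                60,
            )),
            min_survivor_workers: env_or("DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS", 1),
            pending_release_ttl: Duration::from_secs(env_or(
                "DAFT_AUTOSCALING_PENDING_RELEASE_EXCLUDE_SECONDS",
                120,
            )),
        }
    }
}

fn main() {
    let config = DownscaleConfig::from_env();
    println!(
        "enabled: {}, idle threshold: {:?}, min survivors: {}, blacklist ttl: {:?}",
        config.enabled, config.idle_threshold, config.min_survivor_workers, config.pending_release_ttl
    );
}
```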

Issues Found:

  • Lock contention in `retire_idle_ray_workers`: The state mutex is held during Python GIL operations, which can block other critical operations like task submission

Confidence Score: 3/5

  • This PR implements important functionality but has a critical lock contention issue that could impact performance under load
  • The implementation is generally sound with good test coverage, but the mutex lock is held during Python operations in retire_idle_ray_workers, which violates the locking order principle mentioned in the custom rules and can cause significant performance degradation. The blacklist mechanism is well-designed, and the integration with the scheduler loop is clean. Once the lock contention issue is resolved, this would be safe to merge.
  • src/daft-distributed/src/python/ray/worker_manager.rs requires attention to fix the lock contention issue in retire_idle_ray_workers

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/daft-distributed/src/python/ray/worker_manager.rs | Implements `retire_idle_ray_workers` with the `pending_release_blacklist` mechanism; potential lock contention during worker release |
| src/daft-distributed/src/python/ray/worker.rs | Adds `ActorState` tracking and idle duration calculation; clean implementation with proper state transitions |
| src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs | Integrates downscale logic into the scheduler loop; reads configuration from environment variables without documenting their defaults in code comments |

Sequence Diagram

sequenceDiagram
    participant S as SchedulerActor
    participant WM as RayWorkerManager
    participant W as RaySwordfishWorker
    participant R as Ray
    participant F as Flotilla

    Note over S: Scheduler loop iteration
    S->>WM: worker_snapshots()
    WM->>WM: refresh_workers()
    WM->>F: start_ray_workers(existing_ids)
    F->>R: ray.nodes()
    R-->>F: node list
    F-->>WM: RaySwordfishWorker instances
    WM-->>S: worker snapshots

    S->>S: schedule_tasks()
    S->>S: get_autoscaling_request()
    
    alt Scale up needed
        S->>WM: try_autoscale(bundles)
        WM->>WM: Clear pending_release_blacklist
        WM->>F: try_autoscale(bundles)
        F->>R: request_resources(bundles)
    else No scale up, check scale down
        S->>S: Count idle workers
        alt idle workers > min_survivor_workers
            S->>WM: retire_idle_ray_workers(num_to_retire, false)
            WM->>WM: Lock state mutex
            WM->>WM: Identify idle workers (idle_duration >= threshold)
            WM->>WM: Remove workers from ray_workers
            WM->>WM: Add to pending_release_blacklist
            WM->>WM: Unlock state mutex
            WM->>W: release(py)
            W->>W: Check active tasks == 0
            W->>W: set_state(Releasing)
            W->>R: shutdown()
            W->>W: set_state(Released)
            WM->>F: clear_autoscaling_requests()
            F->>R: request_resources([])
        end
    end

    Note over S: Job completion
    S->>WM: retire_idle_ray_workers(all_workers, true)
    WM->>F: clear_autoscaling_requests()
    F->>R: request_resources([])

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (3)

  1. src/daft-distributed/src/python/ray/worker_manager.rs, line 287-340 (link)

    logic: Holding the mutex lock while calling worker.release(py) and Python operations can cause significant lock contention. The state mutex is held from line 288 through line 340, during which Python GIL operations occur (lines 334-340). This blocks other operations like submit_tasks_to_workers unnecessarily.

    Consider releasing the lock before performing the Python operations (a hedged sketch of this pattern appears after this comment list).

  2. src/daft-distributed/src/python/ray/worker.rs, line 138-146 (link)

    style: The `release` method silently returns early when there are in-flight tasks, without setting state or logging. This could lead to confusion during debugging.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  3. src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs, line 138-144 (link)

    style: Environment variable parsing with defaults lacks documentation. Consider adding comments explaining these configuration options and their defaults.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
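
For reference, here is a hedged sketch of the lock-scoping pattern suggested in comment 1: retirable workers are identified and removed from the map while the mutex is held, and the potentially slow release calls happen only after the guard is dropped. The `ManagerState`, `Worker`, and `release_worker_via_python` names are placeholders, not the actual worker_manager.rs API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type WorkerId = String;

/// Placeholder for the worker handle held by the manager.
struct Worker {
    id: WorkerId,
    idle: bool,
}

struct ManagerState {
    ray_workers: HashMap<WorkerId, Worker>,
}

/// Stand-in for the Python/GIL call that actually shuts the Ray actor down.
fn release_worker_via_python(worker: &Worker) {
    println!("releasing {}", worker.id);
}

/// Identify and remove retirable workers while holding the lock, but perform
/// the release calls only after the guard is dropped, so task submission and
/// other state access are not blocked in the meantime.
fn retire_idle_workers(state: &Mutex<ManagerState>, num_to_retire: usize) {
    let to_release: Vec<Worker> = {
        let mut guard = state.lock().unwrap();
        let ids: Vec<WorkerId> = guard
            .ray_workers
            .values()
            .filter(|w| w.idle)
            .take(num_to_retire)
            .map(|w| w.id.clone())
            .collect();
        ids.iter()
            .filter_map(|id| guard.ray_workers.remove(id))
            .collect()
    }; // the mutex guard is dropped here

    for worker in &to_release {
        release_worker_via_python(worker);
    }
}

fn main() {
    let state = Mutex::new(ManagerState {
        ray_workers: HashMap::from([
            ("w1".to_string(), Worker { id: "w1".to_string(), idle: true }),
            ("w2".to_string(), Worker { id: "w2".to_string(), idle: false }),
        ]),
    });
    retire_idle_workers(&state, 1);
}
```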

8 files reviewed, 3 comments


@codecov

codecov bot commented Dec 31, 2025

Codecov Report

❌ Patch coverage is 51.10410% with 155 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.30%. Comparing base (29ffd49) to head (e8b7527).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../daft-distributed/src/python/ray/worker_manager.rs | 0.00% | 109 Missing ⚠️ |
| src/daft-distributed/src/python/ray/worker.rs | 0.00% | 34 Missing ⚠️ |
| ...aft-distributed/src/scheduling/scheduler/linear.rs | 0.00% | 6 Missing ⚠️ |
| daft/runners/flotilla.py | 20.00% | 4 Missing ⚠️ |
| ...ibuted/src/scheduling/scheduler/scheduler_actor.rs | 95.83% | 1 Missing ⚠️ |
| src/daft-distributed/src/scheduling/worker.rs | 99.24% | 1 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5903      +/-   ##
==========================================
- Coverage   72.37%   72.30%   -0.07%     
==========================================
  Files         965      965              
  Lines      125733   126029     +296     
==========================================
+ Hits        90996    91130     +134     
- Misses      34737    34899     +162     
| Files with missing lines | Coverage Δ |
| --- | --- |
| ...ft-distributed/src/scheduling/scheduler/default.rs | 88.99% <100.00%> (+0.02%) ⬆️ |
| ...c/daft-distributed/src/scheduling/scheduler/mod.rs | 88.04% <ø> (ø) |
| ...ibuted/src/scheduling/scheduler/scheduler_actor.rs | 90.25% <95.83%> (+0.30%) ⬆️ |
| src/daft-distributed/src/scheduling/worker.rs | 86.74% <99.24%> (+12.50%) ⬆️ |
| daft/runners/flotilla.py | 46.85% <20.00%> (-0.79%) ⬇️ |
| ...aft-distributed/src/scheduling/scheduler/linear.rs | 87.50% <0.00%> (-2.25%) ⬇️ |
| src/daft-distributed/src/python/ray/worker.rs | 0.00% <0.00%> (ø) |
| .../daft-distributed/src/python/ray/worker_manager.rs | 0.00% <0.00%> (ø) |

... and 6 files with indirect coverage changes

