Implement Adaptive Scaling for Distributed Task Scheduling #48536

Open
@VadisettyRahul

Description

Ray currently relies on static configurations for task scheduling, which limits efficiency when workloads change dynamically. Adding adaptive scaling would allow clusters to expand or contract automatically based on resource demand, improving both utilization and response times.

Proposed Solution:

1. Monitor Resource Usage:

  • Add a monitoring module to track CPU, GPU, and memory usage across nodes.
  • Use Ray's existing metrics API to track real-time usage statistics and resource availability.
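The monitoring step could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: `utilization` is a hypothetical helper, and the sample dictionaries are hard-coded stand-ins for what `ray.cluster_resources()` and `ray.available_resources()` return on a live cluster. Treating a resource that is missing from the available map as fully used is an assumption of this sketch.

```python
from typing import Dict


def utilization(total: Dict[str, float], available: Dict[str, float]) -> Dict[str, float]:
    """Return the fraction of each resource currently in use (0.0 to 1.0).

    Hypothetical helper for illustration; not part of Ray.
    """
    usage: Dict[str, float] = {}
    for name, capacity in total.items():
        if capacity <= 0:
            continue  # skip zero-capacity resources to avoid dividing by zero
        # Assumption: a resource absent from `available` has nothing free.
        free = available.get(name, 0.0)
        usage[name] = min(1.0, max(0.0, (capacity - free) / capacity))
    return usage


# Hard-coded sample snapshot. On a live cluster these two dicts would come
# from ray.cluster_resources() and ray.available_resources().
total = {"CPU": 16.0, "GPU": 2.0, "memory": 64e9}
available = {"CPU": 4.0, "GPU": 2.0, "memory": 16e9}

snapshot = utilization(total, available)
print(snapshot)
```

Keeping the calculation separate from the Ray calls makes it easy to unit test without starting a cluster.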

2. Implement Auto-Scaling Logic:

  • Develop scaling logic that activates when usage exceeds or drops below pre-defined thresholds.
  • Add configuration options to allow users to set upper and lower limits for scaling.
  • Use Ray’s autoscaler as a foundation, modifying it to support adaptive responses to real-time metrics.
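A threshold-based policy along these lines might look like the sketch below. Everything here is an assumption made for illustration: `ScalingConfig`, its field names and defaults, and `target_node_count` are hypothetical and do not correspond to existing Ray configuration options. The policy moves one node at a time and clamps the result to the configured bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    # Hypothetical option names and defaults, not existing Ray settings.
    upper_threshold: float = 0.80  # scale up when utilization is above this
    lower_threshold: float = 0.30  # scale down when utilization is below this
    min_nodes: int = 1
    max_nodes: int = 10


def target_node_count(current_nodes: int, utilization: float, cfg: ScalingConfig) -> int:
    """Return the desired node count for one scaling step."""
    if utilization > cfg.upper_threshold:
        desired = current_nodes + 1
    elif utilization < cfg.lower_threshold:
        desired = current_nodes - 1
    else:
        desired = current_nodes  # inside the band: hold steady
    # Clamp to the user-configured cluster size limits.
    return max(cfg.min_nodes, min(cfg.max_nodes, desired))


cfg = ScalingConfig()
print(target_node_count(4, 0.90, cfg))  # above the upper threshold
print(target_node_count(4, 0.10, cfg))  # below the lower threshold
print(target_node_count(4, 0.50, cfg))  # inside the band
```

The gap between the two thresholds acts as a dead band, so small fluctuations in utilization do not cause the cluster to oscillate. On a live cluster, the resulting target would still need to be handed to the autoscaler; that wiring is outside the scope of this sketch.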

3. Dynamic Task Assignment:

  • Adjust task allocation dynamically based on resource availability, optimizing performance and load balancing.
  • Allow the scheduler to prefer nodes with greater availability or lower load when placing tasks, to minimize latency.
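One simple way to express availability-aware placement is a greedy pass that sends each task to the node with the most free capacity. The sketch below assumes CPU is the only resource considered; `pick_node`, `assign`, and the node and task names are hypothetical, and this is not how Ray's scheduler is implemented.

```python
from typing import Dict, List, Optional, Tuple


def pick_node(free_cpus: Dict[str, float], required_cpus: float) -> Optional[str]:
    """Return the node with the most free CPUs that can fit the task, or None."""
    candidates = {node: free for node, free in free_cpus.items() if free >= required_cpus}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def assign(tasks: List[Tuple[str, float]], free_cpus: Dict[str, float]) -> Dict[str, Optional[str]]:
    """Greedily place tasks in order, updating free capacity after each placement."""
    remaining = dict(free_cpus)  # copy so the caller's snapshot is not mutated
    placement: Dict[str, Optional[str]] = {}
    for task_id, cpus in tasks:
        node = pick_node(remaining, cpus)
        placement[task_id] = node  # None means the task cannot be placed right now
        if node is not None:
            remaining[node] -= cpus
    return placement


free = {"node-a": 4.0, "node-b": 8.0}
tasks = [("t1", 2.0), ("t2", 4.0), ("t3", 4.0), ("t4", 8.0)]
placement = assign(tasks, free)
print(placement)
```

In this example `t4` is left unplaced because no node has 8 free CPUs after the earlier placements; an unplaced task like this is the kind of signal that could feed back into the scaling logic as unmet demand.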

4. Testing & Validation:

  • Design unit tests for threshold-based scaling, ensuring tasks are allocated efficiently.
  • Perform integration tests on clusters of varying sizes to confirm adaptive scaling functionality.
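Unit tests for the threshold behaviour could start from something like the following. The policy under test, `scale_decision`, is a hypothetical stand-in defined inline so the example runs on its own; the threshold values are illustrative assumptions.

```python
import unittest


def scale_decision(utilization: float, lower: float = 0.30, upper: float = 0.80) -> str:
    """Hypothetical policy under test: returns 'up', 'down', or 'hold'."""
    if utilization > upper:
        return "up"
    if utilization < lower:
        return "down"
    return "hold"


class ThresholdScalingTest(unittest.TestCase):
    def test_scales_up_above_upper_threshold(self):
        self.assertEqual(scale_decision(0.95), "up")

    def test_scales_down_below_lower_threshold(self):
        self.assertEqual(scale_decision(0.05), "down")

    def test_holds_inside_band(self):
        self.assertEqual(scale_decision(0.50), "hold")

    def test_holds_exactly_at_thresholds(self):
        # Boundaries are exclusive in this sketch, so both edges hold.
        self.assertEqual(scale_decision(0.30), "hold")
        self.assertEqual(scale_decision(0.80), "hold")


def run_tests() -> unittest.TestResult:
    """Run the suite programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ThresholdScalingTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Testing the exact boundary values is worth doing explicitly, since off-by-one comparisons at the thresholds are a common source of unexpected scaling.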

Expected Outcome: This feature would enable clusters to dynamically respond to changing loads, improving resource efficiency and overall task execution speed.

Use case

No response

Metadata

Labels

  • P1: Issue that should be fixed within a few weeks
  • core: Issues that should be addressed in Ray Core
  • enhancement: Request for new feature and/or capability
