Description
Ray currently relies on static configurations for task scheduling, which limits efficiency when workloads change dynamically. Adaptive scaling would let clusters expand or contract automatically based on resource demand, improving both utilization and response times.
Proposed Solution:
1. Monitor Resource Usage:
- Add a monitoring module to track CPU, GPU, and memory usage across nodes.
- Use Ray's existing metrics API to track real-time usage statistics and resource availability.
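A minimal sketch of the utilization calculation this step would need. It assumes the inputs are dictionaries shaped like those returned by `ray.cluster_resources()` and `ray.available_resources()`; the helper name `utilization` and the sample numbers are hypothetical.

```python
from typing import Dict


def utilization(total: Dict[str, float], available: Dict[str, float]) -> Dict[str, float]:
    """Return the used fraction (0.0 to 1.0) of each resource listed in `total`."""
    usage = {}
    for name, capacity in total.items():
        if capacity <= 0:
            continue  # skip zero-capacity resources to avoid dividing by zero
        # Assumption: a resource missing from `available` is treated as fully used.
        free = available.get(name, 0.0)
        usage[name] = (capacity - free) / capacity
    return usage


# Hypothetical snapshot; in a live cluster these dicts would come from
# ray.cluster_resources() and ray.available_resources().
total = {"CPU": 16.0, "GPU": 2.0, "memory": 64e9}
available = {"CPU": 4.0, "memory": 16e9}
print(utilization(total, available))
```

Keeping the calculation independent of Ray itself makes it easy to unit test with fixed snapshots.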
2. Implement Auto-Scaling Logic:
- Develop scaling logic that triggers when usage rises above or falls below predefined thresholds.
- Add configuration options to allow users to set upper and lower limits for scaling.
- Use Ray’s autoscaler as a foundation, modifying it to support adaptive responses to real-time metrics.
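The threshold logic could look roughly like the sketch below. This is not how Ray's autoscaler is implemented; `ScalingPolicy`, its field names, the default values, and the one-node step size are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # Hypothetical user-facing options; names and defaults are illustrative only.
    upper: float = 0.80   # scale up when usage exceeds this fraction
    lower: float = 0.30   # scale down when usage drops below this fraction
    min_nodes: int = 1    # lower limit for cluster size
    max_nodes: int = 10   # upper limit for cluster size


def target_node_count(current_nodes: int, usage: float, policy: ScalingPolicy) -> int:
    """Return the desired node count for the observed usage fraction."""
    if usage > policy.upper:
        desired = current_nodes + 1
    elif usage < policy.lower:
        desired = current_nodes - 1
    else:
        desired = current_nodes
    # Clamp to the user-configured limits.
    return max(policy.min_nodes, min(policy.max_nodes, desired))


policy = ScalingPolicy()
print(target_node_count(4, 0.92, policy))  # above the upper threshold
print(target_node_count(4, 0.10, policy))  # below the lower threshold
```

Leaving a gap between `lower` and `upper` keeps the cluster from oscillating when usage hovers near a single cutoff.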
3. Dynamic Task Assignment:
- Adjust task allocation dynamically based on resource availability, optimizing performance and load balancing.
- Allow tasks to prioritize nodes with greater availability or lower load to minimize latency.
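One way to express the "prefer the least-loaded node" idea is sketched below. The function `pick_node`, the `max_load` cutoff, and the node names are hypothetical; how per-node load is collected and how the choice is handed to Ray's scheduler are left open.

```python
from typing import Dict, Optional


def pick_node(node_load: Dict[str, float], max_load: float = 0.9) -> Optional[str]:
    """Return the node with the lowest load below `max_load`, or None if all are saturated."""
    candidates = {node: load for node, load in node_load.items() if load < max_load}
    if not candidates:
        return None  # every node is at or above the cutoff; caller may queue or scale up
    return min(candidates, key=candidates.get)


# Hypothetical per-node load fractions.
loads = {"node-a": 0.85, "node-b": 0.40, "node-c": 0.95}
print(pick_node(loads))
```

Returning `None` when every node is saturated gives the caller a natural signal to trigger a scale-up instead of overloading a node.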
4. Testing & Validation:
- Design unit tests for threshold-based scaling that verify scale-up and scale-down trigger at the configured thresholds and that tasks are allocated as expected.
- Perform integration tests on clusters of varying sizes to confirm adaptive scaling functionality.
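The unit tests for threshold behavior might start from something like this pytest-style sketch. `scale_decision` is a hypothetical stand-in for the real scaling logic, and the threshold values are assumptions.

```python
def scale_decision(usage: float, upper: float = 0.8, lower: float = 0.3) -> str:
    """Hypothetical stand-in for the scaling logic under test."""
    if usage > upper:
        return "up"
    if usage < lower:
        return "down"
    return "hold"


def test_scales_up_above_upper_threshold():
    assert scale_decision(0.95) == "up"


def test_scales_down_below_lower_threshold():
    assert scale_decision(0.10) == "down"


def test_holds_between_and_at_thresholds():
    # Values exactly on a threshold should not trigger scaling.
    assert scale_decision(0.5) == "hold"
    assert scale_decision(0.8) == "hold"
    assert scale_decision(0.3) == "hold"
```

Boundary cases are worth covering explicitly, since off-by-one comparisons at the thresholds are an easy source of flapping.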
Expected Outcome: This feature would enable clusters to dynamically respond to changing loads, improving resource efficiency and overall task execution speed.
Use case
No response