description |
---|
#deep_learning_training_workloads #resource_scheduler #homogeneous_cluster |
Presented at OSDI 2022.
Authors: Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni (Microsoft Research), Vijay Chidambaram (UT-Austin & VMware Research).
Code: https://github.com/msr-fiddle/synergy
This paper presents Synergy, a scheduler for DNN training jobs that accounts for each job's sensitivity to its CPU and memory allocations alongside its GPU demand.
It does not consider fractional GPU allocations (no GPU sharing).
- It proposes two algorithms to enable multi-dimensional bin-packing.
- Synergy-Opt
- Find approximate solutions using ILP formulation (typical Microsoft style...).
- Computationally expensive.
- Synergy-Tune
- Sort the pending jobs by GPU demand, then CPU demand, then memory demand.
- Place each job on the server whose free resources are just enough to fit the job's demand vector (tightest fit).
- The GPU demand is fixed, but the auxiliary resource allocations (CPU, memory) are fungible.
- Within 10% of the optimal value (Synergy-Opt).
- Both are compared against a naive greedy baseline, Synergy-Greedy.
- First-fit. Place the job on the server that can satisfy the job's demands in all dimensions.
- Problems
- Results in GPU fragmentation as jobs exhaust a server's auxiliary resources while GPUs sit idle.
- Hurts fairness: a job can be skipped over for a long time if no server in the cluster can satisfy its demands.
- (In my view, this is a poor baseline...)
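To make the Synergy-Tune heuristic above concrete, here is a minimal sketch in Python. All names (`Server`, `Job`, `place`, `schedule`) are mine, not from the paper's code; the fallback from a job's preferred CPU/memory demand to the GPU-proportional fair share is my reading of how the fungible auxiliary allocations work.

```python
# Hypothetical sketch of Synergy-Tune-style placement (not the paper's code):
# sort pending jobs by GPU, then CPU, then memory demand, and place each on
# the feasible server with the least free resources (tightest fit). GPU demand
# is fixed; CPU/memory are fungible and fall back to the fair share if needed.
from dataclasses import dataclass

@dataclass
class Server:
    free_gpu: int
    free_cpu: int
    free_mem: float  # GB

@dataclass
class Job:
    gpus: int    # fixed demand
    cpus: int    # preferred (fungible) demand
    mem: float   # preferred (fungible) demand, GB

# Fair-share ratios from the testbed nodes: 24 cores / 8 GPUs, 500 GB / 8 GPUs
FAIR_CPU_PER_GPU = 3
FAIR_MEM_PER_GPU = 62.5

def fits(s, gpus, cpus, mem):
    return s.free_gpu >= gpus and s.free_cpu >= cpus and s.free_mem >= mem

def place(job, servers):
    """Try the preferred demand first, then fall back to fair-share CPU/memory."""
    for cpus, mem in [(job.cpus, job.mem),
                      (job.gpus * FAIR_CPU_PER_GPU, job.gpus * FAIR_MEM_PER_GPU)]:
        candidates = [s for s in servers if fits(s, job.gpus, cpus, mem)]
        if candidates:
            # Tightest fit: least free resources among the feasible servers.
            best = min(candidates, key=lambda s: (s.free_gpu, s.free_cpu, s.free_mem))
            best.free_gpu -= job.gpus
            best.free_cpu -= cpus
            best.free_mem -= mem
            return best
    return None  # no feasible server; job stays pending

def schedule(jobs, servers):
    # Sort pending jobs by GPU, then CPU, then memory demand (largest first).
    for job in sorted(jobs, key=lambda j: (j.gpus, j.cpus, j.mem), reverse=True):
        place(job, servers)
```

The tightest-fit choice is what distinguishes this from the first-fit Synergy-Greedy baseline: packing jobs onto the most-loaded feasible server preserves large contiguous free blocks elsewhere and reduces GPU fragmentation.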
A prototype of Synergy and an associated event-driven simulator are implemented in Python.
- Testbed
- 4-node cluster; each node has 500GB DRAM, 24 CPU cores, and 8 V100 GPUs.
- Simulation
- Consider two clusters:
- 16-node cluster (same node configuration as above)
- 64-node cluster (same node configuration as above)
- Assume a CPU:GPU ratio of 3 and fair-share memory of 62.5GB per GPU.
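The fair-share numbers follow directly from the node configuration; a quick sanity check:

```python
# Fair share per GPU on the 24-core, 500GB, 8-GPU nodes used in the setup.
cpus, mem_gb, gpus = 24, 500, 8
cpu_per_gpu = cpus // gpus    # CPU:GPU ratio of 3
mem_per_gpu = mem_gb / gpus   # 62.5GB of fair-share memory per GPU
```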