MLFS is a machine learning job feature based job scheduler for machine learning clusters running both data parallelism and model parallelism machine learning jobs. The corresponding research work is published in CoNEXT 2020. Please click here for the detail of the paper.
- Install prerequisites (tested with Ubuntu 16.04, Tensorflow v1.8.0)
pip3 install -r (all the required softwares)
Required software: tensorflow==1.8.0, opencv-python==3.4.0.12, tqdm==4.19.6, pandas==0.22.0, matplotlib==2.2.0
numpy==1.16.2, scikit-learn==0.19.1, Python3==python 3.7.3
- To train a new model, put training data in
MLFS
, then insim/
runpython RLmodel.py
and then run
python train_RL_model.py
The reward signal and meta-setting of video can be modified in RLmodel.py
.
- To test the trained model in simulated environment, first copy over the model to
test/models
and modify theNN_MODEL
field oftest/train_RL_model.py
, and then intest/
runpython evaluator.py
and then run
python test.py
- To run real-world experiments, distribute the whole folder to each physical machine within the cluster. Then, copy the trained RL model to
MLFS
and modify theNN_MODEL
filed oftest/train_RL_model.py
. Next, update the IP addresses incluster.py
andserver.py
. Finally, intest
run
python real-test.py
The results will be saved to test/results
folder.
We thank the valuable suggestions from the anonymous reviewers and the shepherd. This research was supported in part by U.S. NSF grants, Microsoft Research Faculty Fellowship, and AWS Machine Learning Research Awards.