Dynamic resources changes for multi-dimensional parallelism training

MachineLearningSystem/24SOSP-tenplex

 
 

Tenplex

Tenplex is a state management library for DL systems that enables jobs to change their parallelism dynamically after the GPU allocation changes at runtime.

You can find the Tenplex paper at https://arxiv.org/abs/2312.05181

About

Tenplex lets you train a model with multi-dimensional parallelism (tensor, data, and pipeline parallelism) independently of the underlying resources. That means you can change the resources during training without affecting convergence.

When to use Tenplex?

  • Elasticity, e.g. spot instances
  • Redeployment, e.g. preemption
  • Failure recovery, e.g. GPU failure

The prototype is implemented with Megatron-LM, which we use to obtain the parallelization configuration for a given set of resources.
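
To make "parallelization configuration for a given set of resources" concrete, the sketch below enumerates the ways a GPU count can be split into tensor, pipeline, and data parallel degrees. It is an illustration only and not part of Tenplex or Megatron-LM: the names ParallelConfig and candidate_configs are hypothetical, and the max_tensor cap (tensor parallelism kept within one 8-GPU node) is an assumption.

```python
from typing import List, NamedTuple


class ParallelConfig(NamedTuple):
    """Hypothetical container for one multi-dimensional parallelism layout."""
    tensor: int    # tensor-parallel degree
    pipeline: int  # pipeline-parallel degree
    data: int      # data-parallel degree


def candidate_configs(num_gpus: int, max_tensor: int = 8) -> List[ParallelConfig]:
    """List every (tensor, pipeline, data) triple whose product is num_gpus.

    max_tensor is an assumed cap on the tensor-parallel degree; it is not a
    Tenplex setting.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be positive")
    configs = []
    for tensor in range(1, min(num_gpus, max_tensor) + 1):
        if num_gpus % tensor:
            continue  # tensor degree must divide the GPU count
        rest = num_gpus // tensor
        for pipeline in range(1, rest + 1):
            if rest % pipeline:
                continue  # pipeline degree must divide the remaining GPUs
            configs.append(ParallelConfig(tensor, pipeline, rest // pipeline))
    return configs


if __name__ == "__main__":
    # Example: the allocation shrinks from 16 to 8 GPUs at runtime.
    for gpus in (16, 8):
        print(gpus, candidate_configs(gpus))
```

When the allocation changes, the set of valid layouts changes with it, so the job generally has to move to a different layout and its state has to be repartitioned to match. This sketch only lists the candidates; it does not choose among them or move any state.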

Install

Prerequisites

Install tenplex-run

git clone https://github.com/kungfu-team/tenplex
cd tenplex
make install

Install Tensor Store (mlfs)

echo "deb https://europe-west2-apt.pkg.dev/projects/tenplex tenplex main" | sudo tee /etc/apt/sources.list.d/tenplex.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/packages-cloud-google-apt.gpg >/dev/null
sudo apt update
sudo apt install -y mlfs

Examples

Examples are in the benchmark directory. For instance, to run the dynamic resources benchmark, execute ./run.sh from within benchmark/dynamic_resources.

Citation

If you use Tenplex for your research, please cite our paper:

@inproceedings{wagenlander2024tenplex,
  title={Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections},
  author={Marcel Wagenlander and Guo Li and Bo Zhao and Luo Mai and Peter Pietzuch},
  booktitle={Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles},
  year={2024}
}
