AlibabaPAI/DAPPLE

DAPPLE: An Efficient Pipelined Data Parallel Approach for Large Models Training

DAPPLE is a distributed training framework that combines pipeline parallelism and data parallelism to address the scheduling and planning challenges of synchronous training. It consists of a profiler, a planner, and a runtime system. The profiler takes a user's DNN model as input and profiles the execution time, activation size, and parameter size of each layer. Sample profiling results for some models are provided in the profiling_results directory. Taking the profiling results as input, the DAPPLE planner generates an optimized hybrid parallelization plan for a given global batch size; that batch is then split into multiple micro-batches and scheduled for execution by the DAPPLE runtime.
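To make the terms above concrete, here is a minimal Python sketch of the kind of per-layer record a profiler might emit and of how a global batch could be divided into micro-batches. It is an illustration only, not DAPPLE's actual data format or scheduling logic: the `LayerProfile` fields, the `split_into_micro_batches` helper, and the assumption that the global batch is divided evenly across data-parallel replicas are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerProfile:
    """Hypothetical per-layer profiling record (field names are illustrative)."""
    name: str
    forward_ms: float      # measured forward execution time
    backward_ms: float     # measured backward execution time
    activation_bytes: int  # size of the layer's output activations
    parameter_bytes: int   # size of the layer's parameters


def split_into_micro_batches(global_batch_size: int,
                             replicas: int,
                             micro_batch_size: int) -> List[int]:
    """Return the micro-batch sizes handled by one data-parallel replica.

    Assumes (for illustration) that the global batch is divided evenly
    across replicas; a trailing smaller micro-batch absorbs any remainder.
    """
    if min(global_batch_size, replicas, micro_batch_size) <= 0:
        raise ValueError("all arguments must be positive")
    if global_batch_size % replicas != 0:
        raise ValueError("global batch size must be divisible by replicas")

    per_replica = global_batch_size // replicas
    full, remainder = divmod(per_replica, micro_batch_size)
    sizes = [micro_batch_size] * full
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    layer = LayerProfile("layer0", 1.5, 3.0, 4096, 8192)
    print(layer)
    # A global batch of 128 on 2 replicas with micro-batches of 8 samples.
    print(split_into_micro_batches(128, 2, 8))
```

In this sketch each replica processes its share of the global batch one micro-batch at a time, which is what lets a pipeline keep several stages busy at once. The actual split is chosen by the planner.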

This repository contains the source code behind DAPPLE's planning results on five typical models: VGG19, AmoebaNet, BERT, GNMT, and XLNet.

Running the DAPPLE experiments

DAPPLE Planner

All the planner-related experiments can be reproduced on any machine, regardless of the environment. We've provided a detailed how-to in PLANNER_REPRODUCTION.md.

DAPPLE Runtime

For details, see the run.sh launch script provided for each model.

Using the Planner

Install from PyPI as a Python 3 package

PyPI: https://pypi.org/project/HPGO/

pip3 install HPGO

Build from source

rustup default nightly
cargo build --release
maturin build --release
pip3 install xxx.whl

Example Usage of Python API

# Import HPGO Python API
import HPGO
# Construct the Conductor object
# conductor_from_torch_graph_and_seps(profile_filename, profile_batch_size, global_batch_size, devices)
conductor = HPGO.conductor_from_torch_graph_and_seps("./profiling_results/xlnet-36-pbs-1.txt", 1, 128, [8, 16])
result = conductor.py_orchestrate()
print(result)

License

The DAPPLE Planner is open-sourced under the terms of the BSD-3-Clause license; details can be found in the src/LICENSE.md file.

The file src/input/torch_graph_py.rs contains Python source code from PipeDream, which is licensed under the MIT License.