Monarch is a framework-agnostic storage tiering middleware for single-node deep learning training at HPC centers. It enables DL frameworks to transparently leverage the local storage devices of compute nodes, even for datasets that do not fit entirely on such resources. Check the Monarch paper for more details.
Monarch accelerates DL training, reduces I/O variability, and alleviates the I/O pressure at shared storage systems.
Monarch mediates dataset read requests between DL frameworks (e.g., TensorFlow, PyTorch) and HPC storage resources (local storage, Lustre), while providing a data placement strategy fine-tuned for the I/O patterns of DL training workloads. Namely, data placement is done as a background task to avoid adding extra latency on the critical I/O path of DL frameworks.
Monarch prefetches content from large files stored at the Parallel File System (Lustre) to faster storage mediums, which not only promotes the use of faster storage resources but also avoids unnecessary accesses to the PFS. Combined, these contributions i) accelerate DL training, ii) reduce I/O variability, and iii) diminish I/O pressure at the PFS.
Please cite the following paper if you use Monarch:
Accelerating Deep Learning Training Through Transparent Storage Tiering. Marco Dantas, Diogo Leitão, Peter Cui, Ricardo Macedo, Xinlian Liu, Weijia Xu, João Paulo. 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2022).
@inproceedings{Dantas2022Monarch,
title = {{Accelerating Deep Learning Training Through Transparent Storage Tiering}},
author = {Marco Dantas and Diogo Leitão and Peter Cui and Ricardo Macedo and Xinlian Liu and Weijia Xu and João Paulo},
booktitle = {{22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing}},
year = {2022},
publisher = {{IEEE}}
}
This tutorial will guide you through setting up and using Monarch.
The repository is organized as follows:
- pastor: C++ source files, library dependencies, and Python bindings to build the software.
- configurations/frontera: predefined configurations to run on the Frontera system.
- common: common resources, for example a script to run the controller.
- pytorch/scripts: the shell scripts needed to test the software using PyTorch.
- tensorflow/scripts: the shell scripts needed to test the software using TensorFlow.
- tensorflow/resources/imagenet: the scripts needed for TensorFlow's dataset generation.
- integration (deprecated): a submodule pointing to the TensorFlow integration repository.
Monarch is written in C++14 and was built and tested with g++ 8.3.0 and cmake 3.17. The core library depends on abseil-cpp v20210324.2 and yaml-cpp v0.6.3.
Make sure to define the INSTALL_DIR variable with the full path of the installation directory for Monarch's dependencies.
$ export INSTALL_DIR=~/mydir
# Export gcc/g++ version on Frontera system
$ export CC=/opt/apps/gcc/8.3.0/bin/gcc
$ export CXX=/opt/apps/gcc/8.3.0/bin/g++
$ git clone git@github.com:abseil/abseil-cpp.git
$ cd abseil-cpp
$ git checkout 20210324.2
$ mkdir build && cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALL_DIR -DCMAKE_POSITION_INDEPENDENT_CODE=TRUE
$ cmake --build . --target install
$ git clone git@github.com:jbeder/yaml-cpp.git
$ cd yaml-cpp
$ git checkout yaml-cpp-0.6.3
$ mkdir build && cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALL_DIR -DCMAKE_POSITION_INDEPENDENT_CODE=TRUE
$ make install
$ git clone git@github.com:dsrhaslab/monarch.git
$ cd monarch/pastor
$ mkdir build; cd build
$ cmake ..; make
$ export MONARCH_DIR=$(pwd)/libmonarch.so
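As a quick sanity check (assuming the build succeeded and produced the shared library), the exported path should resolve to the library file:
$ ls $MONARCH_DIR   # should print .../monarch/pastor/build/libmonarch.so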
To use Monarch, set the environment variable MONARCH_CONFIGS_PATH to the full path of an existing configuration file, then run your application with the Monarch library preloaded via LD_PRELOAD.
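A minimal sketch of the full workflow (the configuration path and executable name are illustrative):
$ export MONARCH_CONFIGS_PATH=/path/to/configuration.yaml
$ LD_PRELOAD=$MONARCH_DIR ./executable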
Basic configurations can be found in the configurations directory.
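The example below shows a basic configuration; in this example, the first hierarchy entry appears to act as the faster staging tier (note its max_storage_size cap) and the second as the source dataset, with all paths being site-specific: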
---
data_plane:
  hierarchical_stage:
    shared_tpool_size: "1"
    hierarchy:
      - type: "file_system"
        subtype: "posix"
        max_storage_size: "1000028234496"
        block_size: "max"
        prefix: "/home/dantas/datasets/staged_hymenoptera_data/"
      - type: "file_system"
        subtype: "posix"
        block_size: "max"
        prefix: "/home/dantas/datasets/hymenoptera_data/"
  handlers:
    control_policy: "solo_placement"
    async_placement: true
    dedicated_thread_pool: true
data_governance:
  metadata_options:
    shared_file_descriptors: true
  metadata_container_service:
    prefix: "train"
workspace: "/home/dantas/monarch_output"
debug_logs: true
profiler:
  active: true
  collect_frequency: "5"
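In this reading, max_storage_size caps the staging tier at roughly 1 TB (the value is in bytes), async_placement and dedicated_thread_pool keep data placement off the critical I/O path (matching the background placement strategy described above), and the profiler collects statistics periodically (collect_frequency is assumed to be in seconds).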
This repository contains the scripts used to evaluate Monarch's performance on the Frontera supercomputer.
In all shell scripts, the variable WORKSPACE needs to be defined beforehand. Set it to the absolute path of a valid directory, and make sure to place this repository inside that same directory.
The variable MONARCH_CONFIGS_PATH needs to be changed in the training scripts. The same applies to any other variable defined in train.sh, such as MONARCH_DIR and VENV_DIR, among others. MONARCH_DIR should point to .../monarch/pastor/build/libmonarch.so.
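A hypothetical example of the expected layout and variables (all values are illustrative; in practice they are edited directly in train.sh):
$ export WORKSPACE=/absolute/path/to/workspace
$ cd $WORKSPACE && git clone git@github.com:dsrhaslab/monarch.git
# In train.sh, set for example:
#   MONARCH_DIR=$WORKSPACE/monarch/pastor/build/libmonarch.so
#   MONARCH_CONFIGS_PATH=$WORKSPACE/monarch/configurations/frontera/<config>.yaml
#   VENV_DIR=$WORKSPACE/venv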
We thank the Texas Advanced Computing Center (TACC) for providing access to the computational resources of Frontera. Work realized within the scope of the project BigHPC (POCI-01-0247-FEDER-045924), funded by the European Regional Development Fund through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020 Programme) and by National Funds through the Portuguese Foundation for Science and Technology, I.P., within the scope of the UT Austin Portugal Program (project PAStor, UTA-EXPL/CA/0075/2019) and PhD Fellowship SFRH/BD/146059/2019.