DeepSpeed Autotuning

This example demonstrates how to use the DeepSpeed Autotune (dsat) feature with two parallel examples, one written as a DeepSpeedTrial class and the other written using Core API. The relevant code can be found under deepspeed_trial/ and core_api/, respectively. Each example trains a torchvision model on randomly generated ImageNet-like data (for speed and simplicity).
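Because the data is synthetic, the dataset code can stay very small. The following is a minimal sketch of what randomly generated ImageNet-like data might look like; the class name and sizes are illustrative placeholders, not the example's actual implementation.

```python
import torch
from torch.utils.data import Dataset


class RandomImageNetDataset(Dataset):
    """Illustrative stand-in for the examples' synthetic data: random
    ImageNet-shaped images paired with random labels. The real dataset
    class in this repo may be structured differently."""

    def __init__(self, num_samples: int = 1024, num_classes: int = 1000):
        self.num_samples = num_samples
        self.num_classes = num_classes

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.randn(3, 224, 224)                # ImageNet-like image
        label = torch.randint(0, self.num_classes, ())  # random class label
        return image, label
```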

Files

The two subdirectories closely mirror each other.

Both contain identical ds_config.json files, which use a simple ZeRO stage-1 DeepSpeed (DS) configuration. They also contain nearly identical deepspeed.yaml Determined configuration files: core_api/deepspeed.yaml differs from deepspeed_trial/deepspeed.yaml only in its entrypoint and in the inclusion of parameters which control the metric-reporting and checkpointing frequencies.
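For orientation, a ZeRO stage-1 DS configuration is quite small. The sketch below is illustrative only; the exact fields and values in the examples' ds_config.json (batch size, optimizer, precision settings) may differ.

```json
{
  "train_batch_size": 256,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
```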

Model code can be found in the following files:

  • deepspeed_trial/model_def.py contains the DeepSpeedTrial subclass. The only dsat-specific code in this file is the dsat.get_ds_config_from_hparams helper function.
  • core_api/script.py contains a bare-bones training loop written with Core API. The script handles preemption, metric-reporting, and checkpointing. In addition to the dsat.get_ds_config_from_hparams helper function, the forward and backward steps are wrapped in the dsat.dsat_reporting_context context manager (a minimal sketch follows this list).
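To make the two dsat touchpoints concrete, here is a minimal sketch of a Core API training loop in the spirit of core_api/script.py. It is illustrative only: build_model, build_dataset, and criterion are hypothetical placeholders, the dsat_reporting_context usage is assumed from the Determined documentation, and the real script additionally handles preemption and checkpointing, which are elided here.

```python
import deepspeed
import determined as det
from determined.pytorch import dsat


def train(core_context: det.core.Context, hparams: dict) -> None:
    # Merge the dsat-chosen hyperparameters into the base DS config.
    ds_config = dsat.get_ds_config_from_hparams(hparams)

    model = build_model()  # hypothetical helper
    model_engine, _, train_loader, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        training_data=build_dataset(),  # hypothetical helper
        config=ds_config,
    )

    for op in core_context.searcher.operations():
        for inputs, labels in train_loader:
            # dsat gathers the profiling information it needs while the
            # forward and backward steps run inside this context manager
            # (usage assumed from the Determined docs).
            with dsat.dsat_reporting_context(core_context, op):
                outputs = model_engine(inputs)
                loss = criterion(outputs, labels)  # criterion assumed defined
                model_engine.backward(loss)
                model_engine.step()
```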

The deepspeed.yaml files define standard single-Trial experiments which can be run in the usual way by calling

det experiment create deepspeed.yaml .

after cd-ing into the relevant directory. The code path which utilizes dsat is described in the following section.

Basic Usage

There are three available search methods for DeepSpeed Autotune:

  • asha: uses the ASHA algorithm to adaptively search over randomly selected DeepSpeed configurations.

  • binary: tunes the optimal batch size for a handful of randomly generated DeepSpeed configurations via binary search.

  • random: performs a search over randomly generated DeepSpeed configurations with aggressive early-stopping criteria based on domain knowledge of DeepSpeed and the search history.

After cd-ing into either of the two subdirectories above, an asha dsat experiment can be launched with the following command, for instance:

python3 -m determined.pytorch.dsat asha deepspeed.yaml .

Similar commands are available for binary and random. The full options for each dsat search method can be viewed by running python3 -m determined.pytorch.dsat asha --help, and similarly for the other search methods.

See the documentation for more on the available DeepSpeed Autotuning options.