You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Updates to support Accelerate and multigpu training (#37)
Summary:
## Changes:
- Added Accelerate Library and refactored experiment.py to use it
- Needed to move `init_optimizer` and `ExperimentConfig` to a separate file to be compatible with submitit/hydra
- Needed to make some modifications to data loaders etc to work well with the accelerate ddp wrappers
- Loading/saving checkpoints incorporates an unwrapping step so remove the ddp wrapped model
## Tests
Tested with both `torchrun` and `submitit/hydra` on two gpus locally. Here are the commands:
**Torchrun**
Modules loaded:
```sh
1) anaconda3/2021.05 2) cuda/11.3 3) NCCL/2.9.8-3-cuda.11.3 4) gcc/5.2.0. (but unload gcc when using submit)
```
```sh
torchrun --nnodes=1 --nproc_per_node=2 experiment.py --config-path ./configs --config-name repro_singleseq_nerf_test
```
**Submitit/Hydra Local test**
```sh
~/pytorch3d/projects/implicitron_trainer$ HYDRA_FULL_ERROR=1 python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs hydra/launcher=submitit_local hydra.launcher.gpus_per_node=2 hydra.launcher.tasks_per_node=2 hydra.launcher.nodes=1
```
**Submitit/Hydra distributed test**
```sh
~/implicitron/pytorch3d$ python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs hydra/launcher=submitit_slurm hydra.launcher.gpus_per_node=8 hydra.launcher.tasks_per_node=8 hydra.launcher.nodes=1 hydra.launcher.partition=learnlab hydra.launcher.timeout_min=4320
```
## TODOS:
- Fix distributed evaluation: currently this doesn't work as the input format to the evaluation function is not suitable for gathering across gpus (needs to be nested list/tuple/dicts of objects that satisfy `is_torch_tensor`) and currently `frame_data` contains `Cameras` type.
- Refactor the `accelerator` object to be accessible by all functions instead of needing to pass it around everywhere? Maybe have a `Trainer` class and add it as a method?
- Update readme with installation instructions for accelerate and also commands for running jobs with torchrun and submitit/hydra
X-link: fairinternal/pytorch3d#37
Reviewed By: davnov134, kjchalup
Differential Revision: D37543870
Pulled By: bottler
fbshipit-source-id: be9eb4e91244d4fe3740d87dafec622ae1e0cf76
0 commit comments