
Can create_multi_gpu_supervised_trainer inherit from monai's SupervisedTrainer instead of Ignite's create_supervised_trainer? #5910

Closed
chezhia opened this issue Jan 28, 2023 · 6 comments



chezhia commented Jan 28, 2023

Is your feature request related to a problem? Please describe.
I was trying to use multi_gpu_supervised_trainer in my existing workflow, which currently uses MONAI's SupervisedTrainer, but the switch was not straightforward because create_multi_gpu_supervised_trainer derives directly from Ignite's trainer rather than MONAI's SupervisedTrainer. There are also no proper tutorials explaining how to use this multi-GPU class in more realistic workflows.

Describe the solution you'd like
Ideally, I'd expect the SupervisedTrainer to support multi-GPU workloads without needing a separate class.

Describe alternatives you've considered
I am considering other PyTorch wrappers like Lightning or Catalyst. The tutorials seem to use different approaches for multi-GPU workloads, and I'd really like to see a default approach for this. Since MONAI uses Ignite as the base class for its trainer implementation, I assumed that was the default approach, but it's not clear.

Additional context
What is the preferred approach in MONAI for multi-GPU training?


Nic-Ma commented Feb 21, 2023

Hi @chezhia ,

For multi-gpu training with SupervisedTrainer, please refer to this tutorial example:
https://github.com/Project-MONAI/tutorials/blob/main/acceleration/distributed_training/unet_training_workflows.py
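
For anyone finding this issue later, here is a minimal sketch of the pattern that tutorial follows: wrap the model in `DistributedDataParallel` and pass it to a regular `SupervisedTrainer`. The names `my_network`, `my_train_loader`, and `my_loss` are hypothetical placeholders for the real model, data loader, and loss, and the sketch assumes one process per GPU launched with `torchrun` (which sets `LOCAL_RANK`):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

from monai.engines import SupervisedTrainer

# one process per GPU; torchrun provides LOCAL_RANK for each process
dist.init_process_group(backend="nccl", init_method="env://")
device = torch.device(f"cuda:{int(os.environ.get('LOCAL_RANK', 0))}")
torch.cuda.set_device(device)

net = my_network().to(device)                      # hypothetical model factory
net = DistributedDataParallel(net, device_ids=[device])

trainer = SupervisedTrainer(
    device=device,
    max_epochs=5,
    train_data_loader=my_train_loader,             # hypothetical DataLoader built with a DistributedSampler
    network=net,
    optimizer=torch.optim.Adam(net.parameters(), 1e-3),
    loss_function=my_loss,                         # hypothetical loss callable
    amp=False,
)
trainer.run()
dist.destroy_process_group()
```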
create_multi_gpu_supervised_trainer is slightly out of date.
CC @ericspod What do you think, or what is the plan for, our previous create_multi_gpu_supervised_trainer API?

Thanks.


chezhia commented Feb 21, 2023

Thanks for the response. I ended up creating a version of the multi-GPU trainer that is derived from SupervisedTrainer, and it works for my use cases, but I think there is room for improvement:

```python
from __future__ import annotations

from typing import Any, Callable, Sequence

import torch
from ignite.engine import Engine
from ignite.metrics import Metric
from torch.nn.parallel import DataParallel, DistributedDataParallel
from torch.optim import Optimizer

from monai.engines import SupervisedTrainer
# get_devices_spec lives in monai.engines.utils in recent MONAI versions; adjust the import if needed
from monai.engines.utils import get_devices_spec
from monai.inferers import Inferer
from monai.transforms import Transform


def SupervisedTrainer_multi_gpu(
    max_epochs: int,
    train_data_loader,
    network: torch.nn.Module,
    optimizer: Optimizer,
    loss_function: Callable,
    device: Sequence[str | torch.device] | None = None,
    epoch_length: int | None = None,
    non_blocking: bool = False,
    iteration_update: Callable[[Engine, Any], Any] | None = None,
    inferer: Inferer | None = None,
    postprocessing: Transform | None = None,
    key_train_metric: dict[str, Metric] | None = None,
    additional_metrics: dict[str, Metric] | None = None,
    train_handlers: Sequence | None = None,
    amp: bool = False,
    distributed: bool = False,
) -> SupervisedTrainer:
    # resolve the device list: use the given devices, otherwise detect all available ones
    devices_ = device
    if not device:
        devices_ = get_devices_spec(device)

    net = network

    if distributed:
        # distributed training runs one process per device, so only a single device is allowed here
        if len(devices_) > 1:
            raise ValueError(
                f"for distributed training, `devices` must contain only 1 GPU or CPU, but got {devices_}."
            )
        net = DistributedDataParallel(net, device_ids=devices_)
    elif len(devices_) > 1:
        # single process with multiple GPUs: fall back to DataParallel
        net = DataParallel(net, device_ids=devices_)

    # build a standard MONAI SupervisedTrainer around the (possibly wrapped) network
    return SupervisedTrainer(
        device=devices_[0],
        network=net,
        optimizer=optimizer,
        loss_function=loss_function,
        max_epochs=max_epochs,
        train_data_loader=train_data_loader,
        epoch_length=epoch_length,
        non_blocking=non_blocking,
        iteration_update=iteration_update,
        inferer=inferer,
        postprocessing=postprocessing,
        key_train_metric=key_train_metric,
        additional_metrics=additional_metrics,
        train_handlers=train_handlers,
        amp=amp,
    )
```
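
A usage sketch (with hypothetical `net`, `loader`, and `loss` objects), running data-parallel training over all visible GPUs from a notebook:

```python
# hypothetical objects: net (torch.nn.Module), loader (DataLoader), loss (callable)
trainer = SupervisedTrainer_multi_gpu(
    max_epochs=10,
    train_data_loader=loader,
    network=net,
    optimizer=torch.optim.Adam(net.parameters(), 1e-4),
    loss_function=loss,
    device=None,        # auto-detect all visible devices via get_devices_spec
    distributed=False,  # use DataParallel rather than DistributedDataParallel
)
trainer.run()
```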

ericspod commented

I honestly don't use these routines myself; for multi-GPU I've been using bundles, which have a variation of this script to essentially monkey-patch in the needed components. Outside of bundles, perhaps we should deprecate these routines and add the functionality to the trainer/evaluator classes. CC @wyli


chezhia commented Feb 21, 2023

Most of the routines I have looked at are wrappers for .py scripts and do not work in a Jupyter notebook environment. The solution I posted works in notebooks, and I'd like something similar in the official release, without having to worry about local rank, world size, etc.


Nic-Ma commented Feb 22, 2023

Hi @ericspod @chezhia, thanks for the discussion. I created a ticket to deprecate them: #6041

Thanks.


wyli commented Sep 22, 2023

create_multigpu_supervised_trainer is now deprecated and removed in favor of SupervisedTrainer: #7019

wyli closed this as completed Sep 22, 2023