Distribution across multi-gpu nodes #128
Comments
@SeanNaren Not sure if this solves your problem. Distributed training using DistributedDataParallel across multi-gpu nodes can be achieved by setting up the distributed environment variables in each pod.
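As a minimal sketch of that setup (one process per replica for simplicity), assuming MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already injected into the pod's environment, the training script only needs to initialize the process group from the environment:

```python
# Minimal sketch: one process per replica, rendezvous driven entirely by
# environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).
# The model and training loop are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # "env://" tells torch.distributed to read the rendezvous settings
    # from the environment variables listed above.
    dist.init_process_group(backend="nccl", init_method="env://")

    model = torch.nn.Linear(10, 10).cuda()            # placeholder model
    model = DDP(model, device_ids=[torch.cuda.current_device()])
    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```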
@jwwandy Seems like for now this is a good enough solution, and it is working with a modified version of the launch script! Once I've migrated the changes to a public branch I'll close this ticket with a link to the fixes.
@jwwandy how do you set the IP address of the master pod?
@hyperparameters It has been a while since I last worked on this. From my experience, the pytorch-operator sets up the environment variables for you. The code where the operator sets the env vars is here: pytorch-operator/pkg/controller.v1/pytorch/pod.go, lines 255 to 280 at 61fefa8.
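As a quick sanity check (just a sketch, not operator code), you can print these variables inside a worker pod to see what the operator injected; the names below follow the usual torch.distributed conventions:

```python
# Print the rendezvous variables the operator is expected to inject into
# each pod (names follow torch.distributed conventions; adjust if your
# operator version uses different ones).
import os

for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"):
    print(f"{var}={os.environ.get(var, '<not set>')}")
```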
@jwwandy thank you for the quick response. Yes, it worked without needing to set the variables explicitly.
@jwwandy do you have sample code showing how you achieved this with the pytorch-operator? Currently struggling with this problem.
Thanks for the work on this! This is somewhat tied to #30, but I'm used to using DistributedDataParallel with a script similar to this to get multi-gpu speed/performance over the DataParallel wrapper!
I've started using Kubeflow for single-GPU nodes, but I'm curious whether there is any way I could use two separate 8-GPU nodes to train while using DistributedDataParallel locally on each 8-GPU node (a sketch of one possible setup follows below). Anything I can do to help include this?
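For illustration, here is a hedged sketch of that setup: one process per GPU on each node, with the global rank derived from the node rank. It assumes RANK and WORLD_SIZE hold the node index and node count as set up by the operator (verify against the env vars your operator actually injects); the model and training loop are placeholders:

```python
# Sketch: spawn one process per GPU on this node and build the global rank
# from (node rank, local rank). Assumes RANK / WORLD_SIZE describe the node,
# not individual processes, and that MASTER_ADDR / MASTER_PORT are in the env.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(local_rank, node_rank, num_nodes, gpus_per_node):
    torch.cuda.set_device(local_rank)
    global_rank = node_rank * gpus_per_node + local_rank
    world_size = num_nodes * gpus_per_node
    dist.init_process_group(
        backend="nccl",
        init_method="env://",        # MASTER_ADDR / MASTER_PORT come from the pod env
        rank=global_rank,
        world_size=world_size,
    )

    model = torch.nn.Linear(10, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # ... per-process training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    node_rank = int(os.environ.get("RANK", "0"))       # node index (assumption)
    num_nodes = int(os.environ.get("WORLD_SIZE", "1")) # number of nodes (assumption)
    gpus_per_node = torch.cuda.device_count()
    mp.spawn(worker, args=(node_rank, num_nodes, gpus_per_node), nprocs=gpus_per_node)
```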