Data Parallel issue with types.MethodType #112
Thanks for reporting this issue.
I actually think the original way of adding unbound methods (features, logits, forward) to the class is the way to go, since an unbound method always binds correctly to whichever instance calls it (that is what unbound methods exist for, right?), so the weights, devices, and any other instance state will be correct.
That way the modify_xxx functions can use the unbound-method approach and get correct data_parallel behavior during training, without suffering from the multi-instantiation issue with the same model, or more specifically with the VGG model.
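A minimal sketch of that class-level approach, assuming a ResNet-style model; the function bodies and the `last_linear` renaming follow the spirit of `modify_resnets` but are illustrative rather than the library's actual implementation:

```python
import torchvision.models as models

def features(self, input):
    # Illustrative ResNet trunk: everything up to the last conv block.
    x = self.conv1(input)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)
    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)
    return x

def logits(self, features):
    x = self.avgpool(features)
    x = x.view(x.size(0), -1)
    return self.last_linear(x)

def forward(self, input):
    return self.logits(self.features(input))

def modify_resnets(model):
    model.last_linear = model.fc
    # Attach the plain functions to the *class*, not the instance:
    # Python then binds `self` at call time, so each DataParallel
    # replica runs the method on itself (its own device and weights)
    # instead of on the one instance the method was created from.
    # Caveat: this mutates torchvision's ResNet class itself, which is
    # the multi-instantiation trade-off discussed in #71.
    model.__class__.features = features
    model.__class__.logits = logits
    model.__class__.forward = forward
    return model

model = modify_resnets(models.resnet18())
```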
P.S. I also suggest making modify_vgg consistent with the other models, such as resnet, so that the features method outputs a 4-D tensor (e.g. batch x channels x 7 x 7) as the other models do, rather than one that has already been reshaped and transformed by the two fully-connected layers. That way the modified torchvision VGG model can be used interchangeably with the other models as the feature extractor of a user-defined model.
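As an illustration of that suggestion (not the library's actual code), a hypothetical split in which features returns the raw 4-D convolutional output and the reshape plus fully-connected layers move into logits; the attribute names (`_features`, `linear0`, `relu0`, ...) are assumptions about how the VGG classifier could be exposed:

```python
import torchvision.models as models

def features(self, input):
    # 4-D conv-trunk output (batch x channels x 7 x 7 for 224x224 input),
    # consistent with the resnet-style models.
    return self._features(input)

def logits(self, features):
    # The flattening and the fully-connected layers live here instead.
    x = features.view(features.size(0), -1)
    x = self.dropout0(self.relu0(self.linear0(x)))
    x = self.dropout1(self.relu1(self.linear1(x)))
    return self.last_linear(x)

def forward(self, input):
    return self.logits(self.features(input))

def modify_vgg(model):
    # Re-expose the pieces of torchvision's VGG under the names used above.
    model._features = model.features
    del model.features
    model.linear0 = model.classifier[0]
    model.relu0 = model.classifier[1]
    model.dropout0 = model.classifier[2]
    model.linear1 = model.classifier[3]
    model.relu1 = model.classifier[4]
    model.dropout1 = model.classifier[5]
    model.last_linear = model.classifier[6]
    del model.classifier
    # Attached at class level, as discussed above (this also mutates
    # torchvision's VGG class itself).
    model.__class__.features = features
    model.__class__.logits = logits
    model.__class__.forward = forward
    return model

model = modify_vgg(models.vgg16())
```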
I made a simple fix with those models rewritten. Please check it out: #145
I found that when using nn.DataParallel together with this library, the forward pass fails on multiple GPUs. When a torchvision network is modified (e.g. in the modify_resnets function), types.MethodType binds the method to the model instance on GPU 0, so when forward is called on GPU 1 the model and the input end up on different GPUs, which leads to errors.
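For reference, a minimal self-contained sketch of the failure mode described above (the `Net` module, shapes, and device ids are made up for illustration):

```python
import types

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

def forward(self, input):
    # `self` is whatever instance the function was bound to, not
    # necessarily the replica that DataParallel is currently running.
    return self.fc(input)

model = Net().cuda()
# Instance binding: the bound method's __self__ is frozen to this
# particular object, whose parameters live on GPU 0.
model.forward = types.MethodType(forward, model)

parallel = nn.DataParallel(model, device_ids=[0, 1])
x = torch.randn(16, 8).cuda()
# DataParallel copies the instance attribute into each replica, so the
# replica on GPU 1 still calls the method bound to the GPU-0 model while
# its input chunk has been scattered to GPU 1 -> device mismatch error.
out = parallel(x)
```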
Binding the functions to the class instead of the instance, as in the original code, seems to solve this issue, but it may run into the other problem described in #71. Is there any way to fix this issue without introducing another?