Add DeepLabV3+ Support #2689
Conversation
Lint failed here
This has become a required CI check, I believe. It's OK, most PRs fail linting at some point. Don't worry.
My apologies, I've reformatted the files based on the log.
Codecov Report
@@            Coverage Diff             @@
##           master    #2689      +/-   ##
==========================================
+ Coverage   73.39%   73.40%   +0.01%
==========================================
  Files          99       99
  Lines        8825     8830       +5
  Branches     1391     1392       +1
==========================================
+ Hits         6477     6482       +5
+ Misses       1929     1920       -9
- Partials      419      428       +9
Continue to review full report at Codecov.
I apologize for the few mistakes I found at the last minute. Everything seems to be stable now.
@ali-nsua thanks for the PR!
Concerning DeepLabV3+, I may be wrong, but the backbone of the model is a modified Xception according to the paper. We probably won't get the paper's performance with ResNet.
Hi @vfdev-5
Hi, thanks for the PR! As @vfdev-5 mentioned, having pre-trained weights for the model (that reproduce the reported results within a reasonable tolerance) is a must before we can add it to torchvision. From a quick look at the paper for ResNet101 (Table 3), I had the impression that the new decoder improved results by ~1.5 points, and that most of the reported mIoU improvements come from using Xception + test-time augmentation. I would be willing to consider adding DeepLabV3+ with a ResNet101 backbone if we manage to retrain it and match results (which might be a fairly involved task), i.e., get around a 1.5 mIoU improvement on top of DeepLabV3 ResNet101 from torchvision. About adding Xception to torchvision, that's a separate discussion and I would prefer to have it in a different issue. IIRC it wasn't very good at transfer learning (or on other tasks like detection), but I might be confusing it with something else (as it seemed to be used successfully here).
Hi,
@ali-nsua when training the model, please try using the reference training scripts in https://github.com/pytorch/vision/tree/master/references/segmentation so that we have a single entry-point for training everything |
Sure thing.
@fmassa Just a thought: it could be helpful to add an argument to the training script that allows the segmentation datasets to be downloaded if they're not already cached locally, that is, if they are publicly available. Just let me know if any of these are okay, and I can create a separate PR for those.
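On the download argument, here is a minimal sketch of what that could look like, assuming the download support that already exists in torchvision.datasets (the get_voc helper name is only illustrative, not part of the PR or the reference scripts):

import torchvision

def get_voc(root, image_set, transforms, download=False):
    # VOCSegmentation accepts a download flag and can fetch the data itself
    return torchvision.datasets.VOCSegmentation(
        root, year="2012", image_set=image_set,
        transforms=transforms, download=download,
    )

A --download command-line flag in the training script could then simply be forwarded to this argument for the datasets that support it.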
@fmassa, @vfdev-5, @oke-aditya P.S. I also managed to fix the output strides at 16 and 8 for training and evaluation, respectively, since those were reported to perform best in the paper. I would be very grateful for any comments or suggestions. -- Slight problem: the pretrained torchvision models are all based on COCO, while the paper reports training on VOC. I guess I should try COCO as well, then evaluate on VOC?
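On the output-stride point, torchvision's ResNet exposes this through replace_stride_with_dilation (this is also how the existing DeepLabV3 backbones are built); the exact settings below are a sketch based on the paper, not necessarily what the PR does:

import torchvision

# Output stride 16 (training): dilate only the last ResNet stage.
backbone_os16 = torchvision.models.resnet101(replace_stride_with_dilation=[False, False, True])

# Output stride 8 (evaluation): dilate the last two stages.
backbone_os8 = torchvision.models.resnet101(replace_stride_with_dilation=[False, True, True])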
@ali-nsua thanks for all the work here!
That's the tricky part about contributing models, as they generally require a lot of computing power. Over the next 6 months or so we will have more bandwidth to contribute more models ourselves as well.
The modifications of the reference scripts would need to be submitted as well somehow, so that we can ensure reproducibility, but let's leave this for another discussion.
Nice! From the paper it seems to still be ~3 points behind the reported numbers, but it's getting there!
The pretrained models in torchvision are indeed based on COCO, and for consistency it would be good if we could keep the same thing here. Also, many of the segmentation papers actually pre-train on COCO (and then finetune on VOC, see section 4.3 in the paper), so providing the models pre-trained on COCO actually has a lot of value as well.
@fmassa Thank you for your comments. I couldn't agree more that pretraining on COCO would be the way to go, but unfortunately I do not have access to COCO, and I believe it is not publicly available. I will definitely find a way to train it once I find a copy. To sum up, I'll try to find a copy of COCO, train DLV3+ on that, and see where we can get. Thanks again for your time.
Great work @ali-nsua
Hi, I wonder whether I can get the segmentation for only the top-1 prediction in an image? At the moment it displays the segmentations for all classes present in the image.
Hi,
with torch.no_grad():
    output = model(x)['out']  # shape: (batch_size, n_classes, H, W)
    output_predictions = output.argmax(1)  # per-pixel class indices, shape: (batch_size, H, W)
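If the goal is to keep only the single most prominent class (the original question), a minimal sketch along these lines might help; keep_top_class is a hypothetical helper, and picking the most frequent non-background class is just one possible heuristic, not something from the PR:

import torch

def keep_top_class(pred, background=0):
    # pred: (H, W) tensor of per-pixel class indices, e.g. output_predictions[0] from above
    counts = torch.bincount(pred.flatten())
    counts[background] = 0                 # ignore background when ranking classes
    top_class = int(counts.argmax())       # most frequent foreground class
    # reset every pixel that does not belong to the top class to background
    return torch.where(pred == top_class, pred, torch.full_like(pred, background))

# Usage: mask = keep_top_class(output_predictions[0])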
Hi,
Thanks! Unfortunately I didn't understand that last part. Could you please elaborate?
Hi @ali-nsua! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Here is DeepLabV3+ from detectron2. Maybe it will help you in some way.
@ali-nsua Just happened to find this PR; some thoughts from my own experience. To sum up, it is very hard to precisely reproduce absolute results in this field, especially from an ablation study, so relative improvements are a much more sensible target. Aiming for a ~1.5 relative improvement here therefore seems like a good choice:
When I train the DeepLabV3 from torchvision on VOC with no testing tricks at 321x321, I can already get 78.11 average performance here, and 78.7 mIoU is also reported by mmsegmentation for 512x512 inputs. So with these scripts the Table 3 results in the paper are not directly comparable; you probably need a DeepLabV3 baseline from your own training script first, and then attain a ~1.5 improvement upon it. Alternatively, since your script is similar to the reference code, you could refer to a log of VOC performance from prior torchvision implementations, but I can't seem to find results other than COCO yet. One additional heads-up: it might be impossible to get that ~1.5 with ResNet-101 if you also look here. It seems the V3+ decoder does not work on ResNet as well as expected.
@voldemortX Thank you for your comments. Unfortunately, I got caught up with a lot of work and didn't have any reliable compute resources available; I was just renting from the cloud. Based on your comments, I guess I could try this again with ResNet-50 and look for a set of hyperparameters, augmentations, and the like that would potentially work.
Totally understand the bandwidth problem. Good work with this PR! If I come by some V3+ results someday that can help you, I'll just post them here.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
I added DeepLabV3+ (with a ResNet backbone) to the segmentation models, since the "decoder" only needed a few changes, like using low-level features from layer1 in ResNet (a rough sketch of the idea is included below). I've tested it with both ResNet50 and ResNet101 backbones and it seems to work, but I haven't yet had the chance to train it fully and verify that it reproduces the results from the paper.
Any thoughts and comments on the changes are greatly appreciated.
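For reference, here is a rough, hedged sketch of the decoder idea described above: fuse the ASPP output with the low-level layer1 features, as in the DeepLabV3+ paper. The module name and channel counts (48 for the low-level projection, 256 out of ASPP) are assumptions taken from the paper, not the PR's exact code:

import torch
from torch import nn
from torch.nn import functional as F

class DeepLabV3PlusDecoder(nn.Module):
    # Illustrative DeepLabV3+ decoder: combine ASPP features with low-level features.
    def __init__(self, low_level_channels=256, aspp_channels=256, num_classes=21):
        super().__init__()
        # 1x1 conv reduces the low-level (layer1) features to 48 channels, as in the paper
        self.project = nn.Sequential(
            nn.Conv2d(low_level_channels, 48, 1, bias=False),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True),
        )
        # 3x3 conv after concatenation, then a 1x1 classifier
        self.fuse = nn.Sequential(
            nn.Conv2d(aspp_channels + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, aspp_out, low_level):
        low_level = self.project(low_level)
        # Upsample the ASPP output to the low-level feature resolution before concatenating
        aspp_out = F.interpolate(aspp_out, size=low_level.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([aspp_out, low_level], dim=1))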