
Add DeepLabV3+ Support #2689

Closed

Conversation

@alihassanijr

I added DeepLabV3+ (ResNet backbone) to the segmentation models, since the "decoder" only needed a few changes, like using low-level features from layer1 in resnet. I've tested it on both ResNet50 and ResNet101 backbones, and it seems to work, but I haven't had the chance to train it fully and verify it can reproduce the results from the paper yet.
Any thoughts and comments on the changes are greatly appreciated.
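For context, the gist of the decoder change is roughly the following (a minimal sketch, not the exact code in this PR; class and argument names are illustrative):

import torch
from torch import nn
from torch.nn import functional as F

class DeepLabV3PlusDecoder(nn.Module):
    # Fuses the ASPP output with low-level features (e.g. from resnet layer1),
    # following the decoder described in the DeepLabV3+ paper.
    def __init__(self, aspp, low_level_channels=256, num_classes=21):
        super().__init__()
        self.aspp = aspp  # produces 256-channel features at output stride 16
        self.project = nn.Sequential(  # reduce low-level features to 48 channels
            nn.Conv2d(low_level_channels, 48, 1, bias=False),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(  # the paper uses a few 3x3 convs here
            nn.Conv2d(256 + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, high_level, low_level):
        x = self.aspp(high_level)
        x = F.interpolate(x, size=low_level.shape[-2:], mode='bilinear', align_corners=False)
        return self.classifier(torch.cat([x, self.project(low_level)], dim=1))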

@oke-aditya
Contributor

oke-aditya commented Sep 19, 2020

Lint failed here

./torchvision/models/segmentation/_utils.py:49:36: E128 continuation line under-indented for visual indent
./torchvision/models/segmentation/_utils.py:50:36: E128 continuation line under-indented for visual indent
./torchvision/models/segmentation/_utils.py:51:36: E128 continuation line under-indented for visual indent
./torchvision/models/segmentation/_utils.py:52:36: E128 continuation line under-indented for visual indent
./torchvision/models/segmentation/_utils.py:53:36: E128 continuation line under-indented for visual indent
./torchvision/models/segmentation/_utils.py:54:36: E128 continuation line under-indented for visual indent
./torchvision/models/segmentation/_utils.py:55:36: E128 continuation line under-indented for visual indent
./torchvision/models/segmentation/_utils.py:56:36: E128 continuation line under-indented for visual indent
./torchvision/models/segmentation/_utils.py:62:1: W293 blank line contains whitespace
./torchvision/models/segmentation/_utils.py:72:1: W391 blank line at end of file
./torchvision/models/segmentation/_utils.py:72:1: W293 blank line contains whitespace
./torchvision/models/segmentation/segmentation.py:8:121: E501 line too long (141 > 120 characters)
./torchvision/models/segmentation/deeplabv3.py:12:1: E303 too many blank lines (3)

This has become a required CI check, I believe.

It's okay; most PRs fail linting at first, so don't worry.
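For reference, E128 just means a continuation line is not aligned with the opening bracket, e.g. (some_function here is only a placeholder):

# flagged by E128: continuation line under-indented for visual indent
result = some_function(first_argument,
    second_argument)

# fixed: continuation aligned with the opening parenthesis
result = some_function(first_argument,
                       second_argument)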

@alihassanijr
Author

My apologies, I've reformatted the files based on the log.

@codecov

codecov bot commented Sep 19, 2020

Codecov Report

Merging #2689 (cc0f598) into master (78159d6) will increase coverage by 0.01%.
The diff coverage is 98.24%.


@@            Coverage Diff             @@
##           master    #2689      +/-   ##
==========================================
+ Coverage   73.39%   73.40%   +0.01%     
==========================================
  Files          99       99              
  Lines        8825     8830       +5     
  Branches     1391     1392       +1     
==========================================
+ Hits         6477     6482       +5     
+ Misses       1929     1920       -9     
- Partials      419      428       +9     
Impacted Files Coverage Δ
torchvision/models/segmentation/deeplabv3.py 98.80% <97.82%> (-1.20%) ⬇️
torchvision/models/segmentation/_utils.py 80.76% <100.00%> (+0.76%) ⬆️
torchvision/models/segmentation/segmentation.py 71.42% <100.00%> (+3.98%) ⬆️
torchvision/ops/feature_pyramid_network.py 91.20% <0.00%> (-3.30%) ⬇️
torchvision/models/detection/retinanet.py 72.98% <0.00%> (-2.90%) ⬇️
torchvision/models/detection/anchor_utils.py 92.10% <0.00%> (-2.70%) ⬇️
torchvision/transforms/functional_pil.py 66.19% <0.00%> (-1.88%) ⬇️
torchvision/ops/deform_conv.py 70.96% <0.00%> (-1.34%) ⬇️
torchvision/models/detection/backbone_utils.py 94.28% <0.00%> (-1.27%) ⬇️
torchvision/ops/poolers.py 97.05% <0.00%> (-1.02%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 78159d6...a1b40a7.

@alihassanijr
Author

I apologize for the few mistakes I found at the last minute. Everything seems to be stable now.

@vfdev-5
Collaborator

vfdev-5 commented Sep 21, 2020

@ali-nsua thanks for the PR!
Adding a new model to torchvision can be a bit difficult due to several points:

  • usefulness of the model (popularity)
  • pretrained weights
  • implementation details

Concerning DeepLabV3+, I may be wrong, but according to the paper the backbone of the model is a modified Xception. We probably won't get the paper's performance with ResNet.
Maybe we can start a discussion about introducing Xception to torchvision and then build DeepLabV3+ on top of it.
What do you think @fmassa ?

@alihassanijr
Author

Hi @vfdev-5
Thank you for your response. You are right, one of the main points of the paper is using Xception as the encoder backbone. However, they did mention that a ResNet101 could also be used.
I could try implementing Xception as well and adding it as an option for the backbone.
Thanks again, and I look forward to your and everyone else's feedback.

@fmassa
Member

fmassa commented Sep 22, 2020

Hi,

Thanks for the PR!

As @vfdev-5 mentioned, having pre-trained weights for the model (that reproduce the reported results within a reasonable tolerance) is a must before we can add it to torchvision.

From a quick look at the paper for ResNet101 (Table 3), I had the impression that the new decoder improved results by ~1.5 points, and that most of the reported mIoU improvements come from using Xception + test-time augmentation.

I would be willing to consider adding DeepLabV3+ with ResNet101 backbone if we manage to retrain it and match results (which might be a fairly involved task), i.e., get around 1.5 mIoU improvement on top of DeepLabV3 ResNet101 from torchvision.

About adding Xception to torchvision: that's a separate discussion and I would prefer to have it in a different issue. IIRC it wasn't very good at transfer learning (or on other tasks like detection), but I might be confusing it with something else (as it seemed to be successfully used here).

@alihassanijr
Author


Hi,
Thank you so much for your input. I'll try to retrain it using ResNet101 and see how it'll do. I'll add updates here.

@fmassa
Member

fmassa commented Sep 22, 2020

@ali-nsua when training the model, please try using the reference training scripts in https://github.com/pytorch/vision/tree/master/references/segmentation so that we have a single entry-point for training everything

@alihassanijr
Author

Sure thing.

@alihassanijr
Author

alihassanijr commented Sep 23, 2020

@fmassa
I noticed something when I was running the reference training script: the ToTensor transform was raising a warning about a numpy array not being writable. I looked around and fixed it on my clone by switching from np.asarray to np.array. I could push that to this PR as well if you think it would help, or maybe I'm doing something wrong.
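For illustration, the fix amounts to something like this (a minimal sketch; to_tensor is just a stand-in for wherever the conversion happens in the reference transforms):

import numpy as np
import torch

def to_tensor(pic):
    # np.asarray can return a read-only view of the PIL image buffer, which
    # makes torch.from_numpy warn about a non-writable array;
    # np.array makes a writable copy instead.
    img = np.array(pic, dtype=np.uint8)  # was: np.asarray(pic, dtype=np.uint8)
    return torch.from_numpy(img)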

Just a thought: it could be helpful to add an argument to the training script that downloads the segmentation datasets when they're not already cached locally, at least for the ones that are publicly available.

Just let me know if either of these is okay, and I can create a separate PR for it.

@alihassanijr
Author

alihassanijr commented Sep 27, 2020

@fmassa, @vfdev-5, @oke-aditya
Running the model on VOC actually helped me a great deal in debugging my implementation. However, with only a limited, underpowered GPU, I was only able to train it twice, reaching about 75.0 mIoU on the VOC val set. Of course, I had to modify the reference scripts in two key areas: (1) I set the classifier learning rate to 10 times the base lr, just like the aux classifier, and (2) I changed the crop size from 480 to 513, as reported in the paper.
I'll try tuning it again and will hopefully be back with better results, closer to the ones reported in the paper, next week.
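Concretely, the lr change amounts to roughly this in the reference train.py (a sketch from memory; model and args are the objects the script already builds, and exact names may differ):

import torch

params_to_optimize = [
    {"params": [p for p in model.backbone.parameters() if p.requires_grad]},
    # main classifier at 10x the base lr, matching the aux classifier below
    {"params": [p for p in model.classifier.parameters() if p.requires_grad], "lr": args.lr * 10},
]
if args.aux_loss:
    params = [p for p in model.aux_classifier.parameters() if p.requires_grad]
    params_to_optimize.append({"params": params, "lr": args.lr * 10})
optimizer = torch.optim.SGD(
    params_to_optimize, lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay)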

P.S. I also fixed the output strides to 16 and 8 for training and evaluation respectively, since those proved to perform best in the paper.
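For reference, the output stride is controlled via dilation in the torchvision ResNet; roughly:

from torchvision.models import resnet

# output stride 16 (training): replace stride with dilation in the last stage only
backbone = resnet.resnet101(pretrained=True, replace_stride_with_dilation=[False, False, True])
# output stride 8 (evaluation): dilate the last two stages
backbone = resnet.resnet101(pretrained=True, replace_stride_with_dilation=[False, True, True])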

I would be very grateful for any comments or suggestions.

-- Slight problem: the pretrained torchvision models are all based on COCO, while the paper reports training on VOC. I guess I should try COCO as well, then evaluate on VOC?

@fmassa
Member

fmassa commented Sep 30, 2020

@ali-nsua thanks for all the work here!

However, with only a limited, underpowered GPU

That's the tricky part about contributing models, as they generally require a lot of computing power. Over the next 6 months or so we will have more bandwidth to contribute more models ourselves as well.

Of course, I had to modify the reference scripts in two key areas

The modifications of the reference scripts would need to be submitted as well somehow, so that we can ensure reproducibility, but let's leave this for another discussion.

I was only able to train it twice, reaching about 75.0 mIoU on the VOC val set.

Nice! From the paper it seems to be still ~3 points behind the reported numbers, but it's getting there!

Slight problem: the pretrained torchvision models are all based on COCO, while the paper reports training on VOC. I guess I should try COCO as well, then evaluate on VOC?

The pretrained models in torchvision are indeed based on COCO, and for consistency it would be good if we could keep the same things here. Also, many of the segmentation papers actually pre-train on COCO (and then finetune on VOC, see section 4.3 in the paper), so providing the models pre-trained on COCO actually has a lot of value as well.

@alihassanijr
Author

@fmassa Thank you for your comments. I couldn't agree more that pretraining on COCO would be the way to go, but I unfortunately do not have access to COCO, and I believe it is not publicly available. I will definitely find a way to train it once I find a copy.
About the paper: I could be wrong, but I think they pretrained the model on COCO only with the Xception backbone. With the ResNet101 backbone, they haven't mentioned anything other than the backbone being pretrained on ImageNet. However, I found that the COCO-pretrained DeepLabV3 which is already available does very well on the VOC val set without any specific fine-tuning. One can just run the reference script with --test-only --pretrained --dataset voc to check that.
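For example, something along these lines (flag names as in the reference scripts; deeplabv3_resnet101 is the torchvision model name):

python train.py --dataset voc --model deeplabv3_resnet101 --pretrained --test-only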

To sum up, I'll try to find a copy of COCO and train DeepLabV3+ on it, and see where we can get.

Thanks again for your time.

@oke-aditya
Contributor

Great work @ali-nsua
Here is the COCO dataset 👍

@giangnguyen2412

Hi, I wonder whether I can get the segmentation for only the top-1 prediction in an image? At the moment, it displays the segmentations for all classes present in the image.

@alihassanijr
Author

Hi,
Sorry for the delay, I was unavailable for a few days.
Sure, just try this:

import torch

with torch.no_grad():
    output = model(x)['out']  # output has shape (batch_size, n_classes, H, W)
output_predictions = output.argmax(1)  # per-pixel class map of shape (batch_size, H, W)
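(Here model is assumed to be one of the torchvision segmentation models in eval mode, e.g. torchvision.models.segmentation.deeplabv3_resnet101(pretrained=True).eval(), and x a normalized (batch_size, 3, H, W) input batch.)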

@alihassanijr
Author

Hi,
I'm terribly sorry, but due to limited compute resources I was only able to train the network once, and unfortunately it did not quite reach the expected results. Is there any time limit on PRs?

@oke-aditya
Contributor

oke-aditya commented Nov 11, 2020

I don't think so. E.g., #1697 was the last model added, and it took time as well.

P.S. You might need to sign the CLA; please have a look.

I guess DeepLabV3+ was added to Detectron2 recently (in the 0.3 release). You can have a look there as well 👍

@alihassanijr
Author

Thanks!
I just signed CLA.

I didn't understand that last part, unfortunately. Could you please elaborate?

@facebook-github-bot

Hi @ali-nsua!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@oke-aditya
Contributor

Here is DeepLabV3+ from detectron2.

Maybe it can help you in some way.

@voldemortX
Contributor

@ali-nsua Just happened to find this PR, some thoughts from my own experience:
Training on VOC is fast and single card trainable so its very suitable for getting PoCs. But if you want to achieve a ~1.5 improvement upon DeepLabV3 looking at an ablation study table, then probably training the already implemented DeepLabV3 from torchvision on VOC is needed, since different implementations (augmentation, size, testing scheme, learning rate schedule, training epochs, regularizations, OS, batch size) make very different performance on mean IoU, especially the PASCAL VOC dataset. And the ablation studies usually are not detailed and maybe imperfect in implementations.

To sum up, it is very hard to precisely reproduce the absolute results in this field, especially from an ablation study; relative improvements are much more sensible. So I guess the ~1.5 relative improvement is the good choice here:

I would be willing to consider adding DeepLabV3+ with ResNet101 backbone if we manage to retrain it and match results (which might be a fairly involved task), i.e., get around 1.5 mIoU improvement on top of DeepLabV3 ResNet101 from torchvision.

When I train the DeepLabV3 from torchvision on VOC with no testing tricks at 321x321, I already get 78.11 average performance, and 78.7 mIoU is also reported by mmsegmentation for 512x512 inputs. So, for instance, with these scripts the Table 3 results in the paper are not directly referenceable; you probably need a DeepLabV3 baseline from your own training script first, and then attain a ~1.5 improvement upon it. Alternatively, since your script is similar to the reference code, you could refer to a log of VOC performance from prior torchvision runs, but I can't seem to find results other than COCO yet.

One additional heads-up: it might be impossible to get that ~1.5 with ResNet-101 if you also look here. It seems the V3+ decoder does not work as well on ResNet as expected.

@alihassanijr
Author

@voldemortX Thank you for your comments. Unfortunately, I got caught up with a lot of work, and I didn't have any reliable compute available; I was just renting off the cloud.

Based on your comments, I guess I could try this again with ResNet-50 and try to find a set of hyperparameters, augmentations, and the like that would potentially work.

@voldemortX
Contributor

Totally understand the bandwidth problem. Good work with this PR! If I come across some V3+ results someday that could help you, I'll post them here.

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

