Add SSD architecture with VGG16 backbone #3403
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #3403      +/-   ##
==========================================
- Coverage   79.73%   78.34%    -1.39%
==========================================
  Files         105      106        +1
  Lines        9818    10003      +185
  Branches     1579     1614       +35
==========================================
+ Hits         7828     7837        +9
- Misses       1513     1688      +175
- Partials      477      478        +1

Continue to review full report at Codecov.
- Skeleton for Default Boxes generator class
- Dynamic estimation of configuration when possible
- Addition of types
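As context for the default-boxes item above, a minimal sketch of how the generator introduced in this PR can be instantiated. The module path and arguments below reflect what was merged into torchvision (one aspect-ratio list per feature map, SSD300-style scales and steps); treat the exact values as illustrative, not prescriptive:

import torch
from torchvision.models.detection.anchor_utils import DefaultBoxGenerator

# Each inner list gives the extra aspect ratios for one feature map;
# ratio 1 (plus its additional scale) is always included, as in the paper.
anchor_generator = DefaultBoxGenerator(
    aspect_ratios=[[2], [2, 3], [2, 3], [2, 3], [2], [2]],
    scales=[0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05],
    steps=[8, 16, 32, 64, 100, 300],
)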
@datumbox @fmassa, following the discussion on #3611 (comment): should we start introducing every attribute, function, method and class here with a leading underscore (apart from those we want to explicitly expose)? I would go as far as to also introduce new file names with underscores, so that the public API is very clear: every object that isn't explicitly exposed in an `__init__` file is private. It takes a bit of self-discipline, but it can be very helpful to us in the long run, to avoid issues like the one in #3611 where we unfortunately can't make a seemingly harmless change. Leading underscores also have somewhat of a self-documenting flavour, which is helpful when reading / reviewing code.
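For illustration, a hypothetical sketch of that convention (all names below are invented, not code from this PR):

# In a private module such as _box_utils.py:

def _estimate_scales(num_feature_maps):    # leading underscore: internal helper,
    return [0.1 * (i + 1) for i in range(num_feature_maps)]  # free to change later

class DefaultBoxGenerator:                 # the one class intended for public use
    def __init__(self, aspect_ratios):
        self.aspect_ratios = aspect_ratios

# The package-level __init__.py would then re-export only the public names:
#     from ._box_utils import DefaultBoxGenerator
#     __all__ = ["DefaultBoxGenerator"]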
Change the default parameter of SSD to train the full VGG16 and remove the catch of exception for eval only.
Looks great to me, thanks a lot for all your work Vasilis!
I've only one minor comment regarding the doc, otherwise good to merge!
@@ -65,14 +73,16 @@ class GeneralizedRCNNTransform(nn.Module):
     It returns a ImageList for the inputs, and a List[Dict[Tensor]] for the targets
     """

-    def __init__(self, min_size, max_size, image_mean, image_std):
+    def __init__(self, min_size, max_size, image_mean, image_std, size_divisible=32, fixed_size=None):
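A sketch of how the new parameters might be used. GeneralizedRCNNTransform is an internal class, so treat this as illustrative; the mean/std values are placeholders, not necessarily what SSD ships with:

import torch
from torchvision.models.detection.transform import GeneralizedRCNNTransform

# fixed_size pins every image to exactly 300x300, as SSD expects;
# min_size/max_size are effectively bypassed when fixed_size is set.
transform = GeneralizedRCNNTransform(
    min_size=300, max_size=300,
    image_mean=[0.48235, 0.45882, 0.40784],  # illustrative values
    image_std=[1.0, 1.0, 1.0],
    size_divisible=1,   # no padding to a stride multiple; keep exactly 300x300
    fixed_size=(300, 300),
)

images, targets = transform([torch.rand(3, 480, 640)], None)
print(images.tensors.shape)  # expected: torch.Size([1, 3, 300, 300])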
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I see this in the future is that we will have different transforms for different models:
SSDTransform(...)
GeneralizedRCNNTransform(...)
DETRTransform(...)
and the way to avoid too much code duplication will be by having nice abstractions for the joint transforms, so that each one of those will be able to be easily implemented. Something like
GeneralizedRCNNTransform = Compose(ResizeWithMinSize(...), RandomFlip(...), Normalize(...))
But we are not there yet.
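A minimal sketch of that idea, assuming hypothetical joint-transform classes (ResizeWithMinSize, RandomFlip, Normalize are placeholders, not existing torchvision APIs); the key point is that each transform takes and returns an (image, target) pair so boxes stay in sync with the image:

class Compose:
    """Chains joint transforms that each take and return (image, target)."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image, target):
        for t in self.transforms:
            image, target = t(image, target)
        return image, target

# A model-specific pipeline would then be just a composition:
# GeneralizedRCNNTransform = Compose([
#     ResizeWithMinSize(min_size=800, max_size=1333),
#     RandomFlip(p=0.5),
#     Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
# ])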
boxes = target["boxes"][is_within_crop_area] | ||
ious = torchvision.ops.boxes.box_iou(boxes, torch.tensor([[left, top, right, bottom]], | ||
dtype=boxes.dtype, device=boxes.device)) | ||
if ious.max() < min_jaccard_overlap: | ||
continue |
I double-checked the logic and it seems good to me.
For the future, we might be able to avoid some of the excessive continue statements by more carefully selecting the sampling.
For example, in the first block we can sample the aspect ratio in log-scale so that the aspect ratio will be correct from the beginning, and then sample one value for the scale.
The same can be done for the crop (sampling so that none of the values are zero after rounding).
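A sketch of the log-scale idea (a hypothetical refactor, not code from this PR): sampling log(aspect_ratio) uniformly guarantees the ratio constraint by construction, so no rejection via continue is needed for it:

import math
import torch

# Sample log(r) uniformly in [log(0.5), log(2.0)] so the crop's aspect
# ratio r is valid by construction instead of rejection-sampled.
log_r = torch.empty(1).uniform_(math.log(0.5), math.log(2.0)).item()
r = math.exp(log_r)

# One scale sample s then determines both sides: w/h == r and w*h == s**2 * W * H.
s = torch.empty(1).uniform_(0.3, 1.0).item()
W, H = 640, 480  # hypothetical image size
area = (s ** 2) * W * H
w = int(round(math.sqrt(area * r)))
h = int(round(math.sqrt(area / r)))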
Yeap, this can definitely be improved. I've implemented it exactly as originally described, cross-referencing it with the original implementation to be sure, since many similar implementations online are buggy. I would not touch this until there are proper unit-tests in place to ensure we maintain the same behaviour, as this transform was crucial for hitting the accuracy reported in the paper.
Summary:
* Early skeleton of API.
* Adding MultiFeatureMap and vgg16 backbone.
* Making vgg16 backbone same as paper.
* Making code generic to support all vggs.
* Moving vgg's extra layers a separate class + L2 scaling.
* Adding header vgg layers.
* Fix maxpool patching.
* Refactoring code to allow for support of different backbones & sizes:
  - Skeleton for Default Boxes generator class
  - Dynamic estimation of configuration when possible
  - Addition of types
* Complete the implementation of DefaultBox generator.
* Replace randn with empty.
* Minor refactoring
* Making clamping between 0 and 1 optional.
* Change xywh to xyxy encoding.
* Adding parameters and reusing objects in constructor.
* Temporarily inherit from Retina to avoid dup code.
* Implement forward methods + temp workarounds to inherit from retina.
* Inherit more methods from retinanet.
* Fix type error.
* Add Regression loss.
* Fixing JIT issues.
* Change JIT workaround to minimize new code.
* Fixing initialization bug.
* Add classification loss.
* Update todos.
* Add weight loading support.
* Support SSD512.
* Change kernel_size to get output size 1x1
* Add xavier init and refactoring.
* Adding unit-tests and fixing JIT issues.
* Add a test for dbox generator.
* Remove unnecessary import.
* Workaround on GeneralizedRCNNTransform to support fixed size input.
* Remove unnecessary random calls from the test.
* Remove more rand calls from the test.
* change mapping and handling of empty labels
* Fix JIT warnings.
* Speed up loss.
* Convert 0-1 dboxes to original size.
* Fix warning.
* Fix tests.
* Update comments.
* Fixing minor bugs.
* Introduce a custom DBoxMatcher.
* Minor refactoring
* Move extra layer definition inside feature extractor.
* handle no bias on init.
* Remove fixed image size limitation
* Change initialization values for bias of classification head.
* Refactoring and update test file.
* Adding ResNet backbone.
* Minor refactoring.
* Remove inheritance of retina and general refactoring.
* SSD should fix the input size.
* Fixing messages and comments.
* Silently ignoring exception if test-only.
* Update comments.
* Update regression loss.
* Restore Xavier init everywhere, update the negative sampling method, change the clipping approach.
* Fixing tests.
* Refactor to move the losses from the Head to the SSD.
* Removing resnet50 ssd version.
* Adding support for best performing backbone and its config.
* Refactor and clean up the API.
* Fix lint
* Update todos and comments.
* Adding RandomHorizontalFlip and RandomIoUCrop transforms.
* Adding necessary checks to our tranforms.
* Adding RandomZoomOut.
* Adding RandomPhotometricDistort.
* Moving Detection transforms to references.
* Update presets
* fix lint
* leave compose and object
* Adding scaling for completeness.
* Adding params in the repr
* Remove unnecessary import.
* minor refactoring
* Remove unnecessary call.
* Give better names to DBox* classes
* Port num_anchors estimation in generator
* Remove rescaling and fix presets
* Add the ability to pass a custom head and refactoring.
* fix lint
* Fix unit-test
* Update todos.
* Change mean values.
* Change the default parameter of SSD to train the full VGG16 and remove the catch of exception for eval only.
* Adding documentation
* Adding weights and updating readmes.
* Update the model weights with a more performing model.
* Adding doc for head.
* Restore import.

Reviewed By: NicolasHug

Differential Revision: D28169152

fbshipit-source-id: cec34141fad09538e0a29c6eb7834b24e2d8528e
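A quick usage sketch of the model this PR adds; the ssd300_vgg16 entry point matches what was merged into torchvision.models.detection, and the pretrained flag assumes the released weights are available:

import torch
import torchvision

# Build SSD300 with a VGG16 backbone, loading the COCO-trained weights.
model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

# Inference takes a list of 3xHxW float tensors in [0, 1]; the internal
# transform resizes everything to the fixed 300x300 input from the paper.
images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    predictions = model(images)

# Each prediction is a dict with "boxes", "labels" and "scores".
print(predictions[0]["boxes"].shape, predictions[0]["scores"][:5])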
@datumbox @oke-aditya hi, a question about the augmentation (https://github.com/pytorch/vision/blob/main/references/detection/transforms.py) of SSD:
since the operation of …
Resolves #440 and partially resolves #1422
This PR implements SSD with VGG16 backbone as described in the original paper.
Trained using the code committed at a167edc. The current best pre-trained model was trained with (using latest git hash):
Submitted batch job 40773612
Accuracy metrics:
Validated with:
Speed benchmark:
0.83 sec per image on CPU
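For reference, a hedged sketch of how such a CPU timing number could be reproduced (the warm-up-then-average methodology is assumed, not taken from the PR):

import time
import torch
import torchvision

model = torchvision.models.detection.ssd300_vgg16(pretrained=True).eval()
image = [torch.rand(3, 480, 640)]

with torch.no_grad():
    model(image)  # warm-up pass, excluded from the measurement
    start = time.perf_counter()
    n = 10
    for _ in range(n):
        model(image)
print(f"{(time.perf_counter() - start) / n:.2f} sec per image on CPU")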