An example on learnable image resizing #455
(Another) excellent example @sayakpaul
I left a few suggestions. PTAL. Thank you 👍 👍 👍
A couple more (hopefully last) ideas for improvement of your awesome example to help new users:
```diff
- Description: How to optimally learn representations of images for a given resolution.
+ Description: Customized pre-processing with a learned resizer for optimal image representation learning for a given resolution.
```
Anyway, these are just some ideas. Hope this helps.
IMO this would overcomplicate things, so I would just keep things as they are for now. I think I have made an effort to describe why someone would be interested in the learning task in the first place. On the other hand, the description you provided would break the length check for descriptions. So, keeping that in mind (and to keep the description sensible), I would prefer the current one.
Thanks for the PR! I have a few questions about the underlying paper, curious to hear your explanations.
examples/vision/learnable_resizer.py
| Model                  | Number of parameters (Million) | Top-1 accuracy |
|:----------------------:|:------------------------------:|:--------------:|
| With learnable resizer | 7.051717                       | 52.02          |
The resizer is a residual convnet. Isn't the improved accuracy simply a consequence of adding more layers (representational power) to the model, compared to the baseline? Maybe this is explained in the paper (didn't have time to read it). How do we argue the scientific interest of these results?
This is a great question. So, to show it empirically here's what the authors have done:
- Take a model pre-trained at some size, say (224 x 224).
- Now, first, use it to infer predictions on images resized to a lower resolution. Record the performance.
- For the second experiment, plug in the resizer module at the top of the pre-trained model and warm-start the training. Record performance.
Now, they argue that the second option is better because it helps the model learn how to adjust the representations with respect to the given resolution. I agree this might partly be due to the increase in the number of parameters (which I have mentioned too), so a few more systematic experiments would have been needed to firmly establish the claim. Experiments such as analyzing the channel-similarity index and cross-channel interaction, or visualizing whether the network's focus improves with the resizer (Grad-CAM could be used for this), and so on.
That makes sense! Is this explained in the example sufficiently clearly?
I can add these in the example briefly if you'd like.
I added the following to clarify:
Now, a question worth asking here is - _isn't the improved accuracy simply a consequence
of adding more layers (the resizer is a mini network after all) to the model, compared to
the baseline?_
To show that it is not the case, the authors conduct the following experiment:
* Take a model pre-trained at some size, say (224 x 224).
* Now, first, use it to infer predictions on images resized to a lower resolution. Record
the performance.
* For the second experiment, plug in the resizer module at the top of the pre-trained
model and warm-start the training. Record the performance.
Now, the authors argue that using the second option is better because it helps the model
learn how to adjust the representations better with respect to the given resolution.
Since the results are purely empirical, a few more experiments, such as analyzing the
cross-channel interaction, would have been even better. It is worth noting that elements
like [Squeeze and Excitation (SE) blocks](https://arxiv.org/abs/1709.01507) and [Global Context (GC) blocks](https://arxiv.org/pdf/1904.11492) also add a few
parameters to an existing network but they are known to help a network process
information in systematic ways to improve the overall performance.
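To make the warm-start wiring concrete, here is a minimal Keras sketch of the second experiment. Everything specific here is an assumption for illustration: MobileNetV2, the 128 x 128 input size, and `layers.Resizing` (a non-learnable stand-in for the paper's small residual-convnet resizer) are not the example's actual choices; they only show how a resizer module is plugged in front of a pre-trained backbone before fine-tuning.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical backbone, imagined as pre-trained at 224 x 224.
# `weights=None` keeps this sketch self-contained; in the actual
# experiment the backbone would carry its pre-trained weights.
backbone = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), weights=None, classes=10
)

# Warm start: low-resolution inputs pass through a resizer module
# before hitting the backbone, and the whole thing is fine-tuned.
# `layers.Resizing` is only a stand-in for the learnable resizer.
inputs = keras.Input(shape=(128, 128, 3))
x = layers.Resizing(224, 224, interpolation="bilinear")(inputs)
outputs = backbone(x)
warm_start_model = keras.Model(inputs, outputs)

warm_start_model.compile(
    optimizer="sgd", loss="sparse_categorical_crossentropy"
)
```

Swapping the `Resizing` layer for the learnable resizer model turns this into the paper's second experiment.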
@fchollet I have made the requested changes and also provided my explanations to your questions. PTAL when you have a moment.
I think the backbone needs a change!
Please check `input_shape`! Correct me if I am wrong.
| Model                         | Number of parameters (Million) | Top-1 accuracy |
|:-----------------------------:|:------------------------------:|:--------------:|
| With the learnable resizer    | 7.051717                       | 67.67%         |
| Without the learnable resizer | 7.039554                       | 60.19%         |
These results are pretty convincing, the increase in parameter count is small and the increase in accuracy is solid!
Yes, exactly. I ran the experiments three times, and each time I got better performance with the resizer. To make the comparison reproducible, I trained both models from the same initial random weights (I have mentioned this in the example).
@fchollet I have added an extended explanation with regard to the results. PTAL and let me know if I am good to add the rest of the files.
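The same-initial-weights trick can be sketched without any framework. `TinyModel` below is a hypothetical stand-in for a Keras model's `get_weights()`/`set_weights()` pair; the point is just the pattern of snapshotting the initialization once and restoring it before each training run, so that any accuracy gap comes from the architecture rather than a lucky draw:

```python
import copy
import random

class TinyModel:
    """Hypothetical stand-in for a Keras model: a flat list of
    weights plus get_weights/set_weights, just enough to show the
    snapshot-and-restore pattern used for a fair comparison."""

    def __init__(self):
        self.weights = [random.gauss(0, 1) for _ in range(4)]

    def get_weights(self):
        return copy.deepcopy(self.weights)

    def set_weights(self, weights):
        self.weights = copy.deepcopy(weights)

random.seed(0)
baseline = TinyModel()
snapshot = baseline.get_weights()  # analogous to model.get_weights()

resizer_variant = TinyModel()          # would otherwise re-initialize randomly
resizer_variant.set_weights(snapshot)  # force an identical starting point

# Both models now begin training from exactly the same weights.
```

In the real example only the shared parts of the two architectures can be seeded this way, since the resizer adds layers the baseline lacks.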
Check these,
Looks great! Please add the generated files.
Done @fchollet. Over to you.
Thank you again for the great example! 👍
@sayakpaul

```python
num_res_blocks = 4

def res_block(x):
    inputs = x
    x = conv_block(x, 16, 3, 1)
    x = conv_block(x, 16, 3, 1, activation=None)
    return layers.Add()([inputs, x])

# Residual passes.
for _ in range(num_res_blocks):  # <-------
    x = res_block(bottleneck)
```
@innat I am unable to understand what you mean because I don't immediately see a problem here:

```python
# Residual passes.
for _ in range(num_res_blocks):  # <-------
    x = res_block(bottleneck)
```

Are you saying it should be something like the following?

```python
for _ in range(num_res_blocks):
    x = res_block(bottleneck)
    bottleneck = x
```

If so, feel free to push a fix, I'd appreciate it.
@sayakpaul

```python
def residual_block(x):
    shortcut = x

    def conv_bn_leaky(inputs, filters, kernel_size, strides):
        x = layers.Conv2D(filters, kernel_size, strides=strides,
                          use_bias=False, padding='same')(inputs)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        return x

    def conv_bn(inputs, filters, kernel_size, strides):
        x = layers.Conv2D(filters, kernel_size, strides, padding='same')(inputs)
        x = layers.BatchNormalization()(x)
        return x

    x = conv_bn_leaky(x, 16, 3, 1)
    x = conv_bn(x, 16, 3, 1)
    x = layers.add([shortcut, x])
    return x

...

# Intermediate resizing as a bottleneck.
bottleneck = layers.Resizing(
    *TARGET_SIZE, interpolation=interpolation
)(x)

# Residual passes.
# for _ in range(num_res_blocks):
#     x = res_block(bottleneck)
x = residual_block(bottleneck)
for i in range(1, num_res_blocks):
    x = residual_block(x)

# Projection.
x = layers.Conv2D(
    filters=filters, kernel_size=3, strides=1, padding="same", use_bias=False
)(x)

...
```
Got it. But that introduces more changes; if we tackle the variable locally, the changes are minimized. In any case, feel free to push a fix.
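To make the bug concrete, here is a tiny framework-free sketch; an increment stands in for a residual block so we can count how many blocks' worth of computation actually survives. The original loop feeds `bottleneck` in every iteration, so only one effective block is applied, while the chained version compounds all four:

```python
# Stand-in for a residual block: each application should transform its
# input, so we use a simple increment to count effective applications.
def res_block(x):
    return x + 1

num_res_blocks = 4
bottleneck = 0  # stand-in for the resized feature map

# Buggy version: every iteration restarts from `bottleneck`,
# so only ONE block's worth of computation survives.
for _ in range(num_res_blocks):
    x_buggy = res_block(bottleneck)

# Fixed version: chain each block's output into the next,
# so all four blocks apply.
x_fixed = bottleneck
for _ in range(num_res_blocks):
    x_fixed = res_block(x_fixed)

print(x_buggy, x_fixed)  # → 1 4
```

The `bottleneck = x` variant discussed above is the local form of the same fix: it re-points the loop variable so the next iteration consumes the previous block's output.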
Adaptive image resizing has been in use for some time, either in the form of progressive resizing or in more explicitly adaptive forms such as in EfficientNetV2.
This example, however, focuses on the question investigated here: "how to optimally learn representations for a given image resolution?" The improvements are quite nice, and I hope it will be useful for the community for improving their vision models.