An example on learnable image resizing #455
(Another) excellent example @sayakpaul
I left a few suggestions. PTAL. Thank you 👍 👍 👍
A couple more (hopefully last) ideas for improvement of your awesome example to help new users:
```diff
- Description: How to optimally learn representations of images for a given resolution.
+ Description: Customized pre-processing with a learned resizer for optimal image representation learning for a given resolution.
```
Anyway, these are just some ideas. Hope this helps.
IMO this would overcomplicate things, so I would just keep things as they are for now. I think I have made an effort to describe why someone would be interested in the learning task in the first place. On the other hand, the description you provided would break the length check for descriptions. So, keeping that in mind (and to keep the description sensible), I would prefer the current one.
Thanks for the PR! I have a few questions about the underlying paper, curious to hear your explanations.
examples/vision/learnable_resizer.py
| Model                  | Number of parameters (Million) | Top-1 accuracy |
|:----------------------:|:------------------------------:|:--------------:|
| With learnable resizer | 7.051717                       | 52.02          |
The resizer is a residual convnet. Isn't the improved accuracy simply a consequence of adding more layers (representational power) to the model, compared to the baseline? Maybe this is explained in the paper (didn't have time to read it). How do we argue the scientific interest of these results?
This is a great question. So, to show it empirically here's what the authors have done:
- Take a model pre-trained at some size, say (224 x 224).
- Now, first, use it to infer predictions on images resized to a lower resolution. Record the performance.
- For the second experiment, plug in the resizer module at the top of the pre-trained model and warm-start the training. Record performance.
Now, they argue that the second option is better because it helps the model learn how to adjust the representations with respect to the given resolution. I agree this might partly be due to the increase in the number of parameters (which I have mentioned too), so a few more systematic experiments would have been needed to firmly establish the claim. Experiments such as analyzing the channel-similarity index and cross-channel interaction, or visualizing whether the network's focus improves with the resizer (Grad-CAM could be used for this), and so on.
That makes sense! Is this explained in the example sufficiently clearly?
I can add these in the example briefly if you'd like.
I added the following to clarify:
Now, a question worth asking here is - _isn't the improved accuracy simply a consequence
of adding more layers (the resizer is a mini network after all) to the model, compared to
the baseline?_
To show that it is not the case, the authors conduct the following experiment:
* Take a model pre-trained at some size, say (224 x 224).
* Now, first, use it to infer predictions on images resized to a lower resolution. Record
the performance.
* For the second experiment, plug in the resizer module at the top of the pre-trained
model and warm-start the training. Record the performance.
Now, the authors argue that using the second option is better because it helps the model
learn how to adjust the representations better with respect to the given resolution.
Since the results are purely empirical, a few more experiments, such as analyzing the
cross-channel interaction, would have been even better. It is worth noting that elements
like [Squeeze and Excitation (SE) blocks](https://arxiv.org/abs/1709.01507) and [Global Context (GC) blocks](https://arxiv.org/pdf/1904.11492) also add a few
parameters to an existing network but they are known to help a network process
information in systematic ways to improve the overall performance.
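To make the warm-start wiring concrete, here is a minimal Keras sketch of the second experiment. Everything specific here is an assumption for illustration: MobileNetV2, the 128 x 128 input size, and `layers.Resizing` (a non-learnable stand-in for the paper's small residual-convnet resizer) are not the example's actual choices; they only show how a resizer module is plugged in front of a pre-trained backbone before fine-tuning.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical backbone, imagined as pre-trained at 224 x 224.
# `weights=None` keeps this sketch self-contained; in the actual
# experiment the backbone would carry its pre-trained weights.
backbone = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), weights=None, classes=10
)

# Warm start: low-resolution inputs pass through a resizer module
# before hitting the backbone, and the whole thing is fine-tuned.
# `layers.Resizing` is only a stand-in for the learnable resizer.
inputs = keras.Input(shape=(128, 128, 3))
x = layers.Resizing(224, 224, interpolation="bilinear")(inputs)
outputs = backbone(x)
warm_start_model = keras.Model(inputs, outputs)

warm_start_model.compile(
    optimizer="sgd", loss="sparse_categorical_crossentropy"
)
```

Swapping the `Resizing` layer for the learnable resizer model turns this into the paper's second experiment.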
@fchollet I have made the requested changes and also provided my explanations to your questions. PTAL when you have a moment.
I think the backbone needs a change!
Please check `input_shape`! Correct me if I am wrong.
| Model                         | Number of parameters (Million) | Top-1 accuracy |
|:-----------------------------:|:------------------------------:|:--------------:|
| With the learnable resizer    | 7.051717                       | 67.67%         |
| Without the learnable resizer | 7.039554                       | 60.19%         |
These results are pretty convincing, the increase in parameter count is small and the increase in accuracy is solid!
Yes, exactly. I ran the experiments three times, and each time I got better performance with the resizer. To make the comparison reproducible, I trained both models from the same initial random weights (I have mentioned this in the example).
@fchollet I have added an extended explanation with regard to the results. PTAL and let me know if I am good to add the rest of the files.
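The same-initial-weights trick can be sketched without any framework. `TinyModel` below is a hypothetical stand-in for a Keras model's `get_weights()`/`set_weights()` pair; the point is just the pattern of snapshotting the initialization once and restoring it before each training run, so that any accuracy gap comes from the architecture rather than a lucky draw:

```python
import copy
import random

class TinyModel:
    """Hypothetical stand-in for a Keras model: a flat list of
    weights plus get_weights/set_weights, just enough to show the
    snapshot-and-restore pattern used for a fair comparison."""

    def __init__(self):
        self.weights = [random.gauss(0, 1) for _ in range(4)]

    def get_weights(self):
        return copy.deepcopy(self.weights)

    def set_weights(self, weights):
        self.weights = copy.deepcopy(weights)

random.seed(0)
baseline = TinyModel()
snapshot = baseline.get_weights()  # analogous to model.get_weights()

resizer_variant = TinyModel()          # would otherwise re-initialize randomly
resizer_variant.set_weights(snapshot)  # force an identical starting point

# Both models now begin training from exactly the same weights.
```

In the real example only the shared parts of the two architectures can be seeded this way, since the resizer adds layers the baseline lacks.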
Check these,
Looks great! Please add the generated files.
Done @fchollet. Over to you.
Thank you again for the great example! 👍
@sayakpaul

```python
num_res_blocks = 4

def res_block(x):
    inputs = x
    x = conv_block(x, 16, 3, 1)
    x = conv_block(x, 16, 3, 1, activation=None)
    return layers.Add()([inputs, x])

# Residual passes.
for _ in range(num_res_blocks):  # <-------
    x = res_block(bottleneck)
```
@innat I am unable to understand what you mean because I don't immediately see a problem here:

```python
# Residual passes.
for _ in range(num_res_blocks):  # <-------
    x = res_block(bottleneck)
```

Are you saying it should be something like the following?

```python
for _ in range(num_res_blocks):
    x = res_block(bottleneck)
    bottleneck = x
```

If so, feel free to push a fix, I'd appreciate it.
@sayakpaul

```python
def residual_block(x):
    shortcut = x

    def conv_bn_leaky(inputs, filters, kernel_size, strides):
        x = layers.Conv2D(filters, kernel_size, strides=strides,
                          use_bias=False, padding='same')(inputs)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        return x

    def conv_bn(inputs, filters, kernel_size, strides):
        x = layers.Conv2D(filters, kernel_size, strides, padding='same')(inputs)
        x = layers.BatchNormalization()(x)
        return x

    x = conv_bn_leaky(x, 16, 3, 1)
    x = conv_bn(x, 16, 3, 1)
    x = layers.add([shortcut, x])
    return x

...

# Intermediate resizing as a bottleneck.
bottleneck = layers.Resizing(
    *TARGET_SIZE, interpolation=interpolation
)(x)

# Residual passes.
# for _ in range(num_res_blocks):
#     x = res_block(bottleneck)
x = residual_block(bottleneck)
for i in range(1, num_res_blocks):
    x = residual_block(x)

# Projection.
x = layers.Conv2D(
    filters=filters, kernel_size=3, strides=1, padding="same", use_bias=False
)(x)

...
```
Got it. But that introduces more changes; if we tackle the variable locally, the changes are minimized. In any case, feel free to push a fix.
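To make the bug concrete, here is a tiny framework-free sketch; an increment stands in for a residual block so we can count how many blocks' worth of computation actually survives. The original loop feeds `bottleneck` in every iteration, so only one effective block is applied, while the chained version compounds all four:

```python
# Stand-in for a residual block: each application should transform its
# input, so we use a simple increment to count effective applications.
def res_block(x):
    return x + 1

num_res_blocks = 4
bottleneck = 0  # stand-in for the resized feature map

# Buggy version: every iteration restarts from `bottleneck`,
# so only ONE block's worth of computation survives.
for _ in range(num_res_blocks):
    x_buggy = res_block(bottleneck)

# Fixed version: chain each block's output into the next,
# so all four blocks apply.
x_fixed = bottleneck
for _ in range(num_res_blocks):
    x_fixed = res_block(x_fixed)

print(x_buggy, x_fixed)  # → 1 4
```

The `bottleneck = x` variant discussed above is the local form of the same fix: it re-points the loop variable so the next iteration consumes the previous block's output.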
Adaptive image resizing has been in use for some time, either in the form of progressive resizing or in more explicitly adaptive forms such as in EfficientNetV2.
This example, however, focuses on the question investigated here: "how to optimally learn representations for a given image resolution?" The improvements are quite nice, and I hope it will be useful for the community for improving their vision models.