Conversation

@sayakpaul (Contributor)

Adaptive image resizing has been in use for some time, either in the form of progressive resizing or in more explicit adaptive forms such as in EfficientNetV2.

This example, however, focuses on the question investigated here: "how to optimally learn representations for a given image resolution?" The improvements are quite nice, and I hope it will be useful for the community to improve their vision models.

google-cla bot added the cla: yes label on Apr 30, 2021
@8bitmp3 (Contributor) left a comment


(Another) excellent example @sayakpaul

I left a few suggestions. PTAL. Thank you 👍 👍 👍

@8bitmp3 (Contributor) commented May 2, 2021

A couple more (hopefully last) ideas for improvement of your awesome example to help new users:

  • Since the example is technically about the cool idea of learned resizers (instead of traditional ones), maybe we could mention that explicitly in the description for less informed users and to help with search. (Keeping in mind that the title already mentions "learning to resize".)

    For example:

- Description: How to optimally learn representations of images for a given resolution.
+ Description: Customized pre-processing with a learned resizer for optimal image representation learning for a given resolution.

Anyway, these are just some ideas. Hope this helps.

@sayakpaul (Contributor, Author)

IMO this would overcomplicate things, so I would just keep things as they are for now. I think I have made an effort to describe why someone would be interested in this learning task in the first place. On the other hand, the description you provided would break the length check for descriptions. So, keeping that in mind (and to keep a description that makes sense), I would prefer the current one.

@fchollet (Contributor) left a comment


Thanks for the PR! I have a few questions about the underlying paper, curious to hear your explanations.


| Model | Number of parameters (Million) | Top-1 accuracy |
|:-------------------------: |:-------------------------------: |:--------------: |
| With learnable resizer | 7.051717 | 52.02 |
@fchollet (Contributor)


The resizer is a residual convnet. Isn't the improved accuracy simply a consequence of adding more layers (representational power) to the model, compared to the baseline? Maybe this is explained in the paper (didn't have time to read it). How do we argue the scientific interest of these results?

@sayakpaul (Contributor, Author) commented May 7, 2021


This is a great question. To show it empirically, here's what the authors have done:

  • Take a pre-trained model trained at some size, say (224 x 224).
  • Now, first, use it to infer predictions on images resized to a lower resolution. Record the performance.
  • For the second experiment, plug in the resizer module at the top of the pre-trained model and warm-start the training. Record the performance.

Now, they argue that using the second option is better because it helps the model learn how to adjust the representations better with respect to the given resolution. I agree the improvement might be because of an increase in the number of parameters (which I have mentioned too); a few more systematic experiments would have been better to establish this claim, such as analyzing the channel-similarity index and cross-channel interaction, visualizing whether the focus of the network gets better with the resizer (Grad-CAM could be used to do this), and so on.
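To make the second setup concrete, here is a minimal sketch of placing a learnable resizer in front of a pre-trained backbone and warm-starting from the pre-trained weights. The backbone choice, the input/target sizes, the binary classification head, and the deliberately tiny resizer are illustrative assumptions, not the example's actual code:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

INP_SIZE = (128, 128)     # resolution at which images are fed (assumed)
TARGET_SIZE = (224, 224)  # resolution the backbone was pre-trained on (assumed)


def tiny_learnable_resizer():
    # Deliberately simplified stand-in for the residual resizer network.
    inputs = keras.Input(shape=(*INP_SIZE, 3))
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.Resizing(*TARGET_SIZE)(x)
    x = layers.Conv2D(3, 3, padding="same")(x)
    return keras.Model(inputs, x, name="learnable_resizer")


# Warm start: the backbone keeps its pre-trained weights and is fine-tuned
# jointly with the resizer.
backbone = keras.applications.DenseNet121(
    weights="imagenet", include_top=False, pooling="avg"
)

inputs = keras.Input(shape=(*INP_SIZE, 3))
x = tiny_learnable_resizer()(inputs)  # learns how to resize to TARGET_SIZE
x = backbone(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # assumed binary task

model = keras.Model(inputs, outputs)
model.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])
```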

@fchollet (Contributor)


That makes sense! Is this explained in the example sufficiently clearly?

@sayakpaul (Contributor, Author)


I can briefly add these to the example if you'd like.

@sayakpaul (Contributor, Author)


I added the following to clarify:

Now, a question worth asking here is: _isn't the improved accuracy simply a consequence
of adding more layers (the resizer is a mini network after all) to the model, compared to
the baseline?_

To show that this is not the case, the authors conduct the following experiment:

* Take a pre-trained model trained at some size, say (224 x 224).

* Now, first, use it to infer predictions on images resized to a lower resolution. Record
the performance.

* For the second experiment, plug in the resizer module at the top of the pre-trained
model and warm-start the training. Record the performance.

Now, the authors argue that using the second option is better because it helps the model
learn how to adjust the representations better with respect to the given resolution.
Since the results are purely empirical, a few more experiments, such as analyzing the
cross-channel interaction, would have been even better. It is worth noting that elements
like [Squeeze and Excitation (SE) blocks](https://arxiv.org/abs/1709.01507) and [Global Context (GC) blocks](https://arxiv.org/pdf/1904.11492) also add a few
parameters to an existing network, but they are known to help a network process
information in systematic ways to improve the overall performance.
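As a point of reference, an SE block is cheap in parameters because it only adds two small Dense layers per block. Here is a minimal, hypothetical sketch (not code from the example) in the same Keras style:

```python
from tensorflow.keras import layers


def se_block(x, ratio=16):
    # Squeeze: summarize each channel with global average pooling.
    filters = x.shape[-1]
    se = layers.GlobalAveragePooling2D()(x)
    # Excitation: two tiny Dense layers produce per-channel weights.
    se = layers.Dense(filters // ratio, activation="relu")(se)
    se = layers.Dense(filters, activation="sigmoid")(se)
    se = layers.Reshape((1, 1, filters))(se)
    # Re-weight the input feature map channel-wise.
    return layers.Multiply()([x, se])
```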

@sayakpaul (Contributor, Author)

@fchollet I have made the requested changes and also provided my explanations to your questions. PTAL when you have a moment.

@s-mrb (Contributor) left a comment


I think the backbone needs a change!

@s-mrb (Contributor) left a comment


Please check input_shape!
Correct me if I am wrong.

| Model | Number of parameters (Million) | Top-1 accuracy |
|:-------------------------: |:-------------------------------: |:--------------: |
| With the learnable resizer | 7.051717 | 67.67% |
| Without the learnable resizer | 7.039554 | 60.19% |
Contributor


These results are pretty convincing: the increase in parameter count is small and the increase in accuracy is solid!

@sayakpaul (Contributor, Author)


Yes, exactly. I ran the experiments about three times, and each time I got better performance with the resizer. To be absolutely sure and to keep things reproducible, I trained the models with the same initial random weights (I have mentioned this in the example).
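For illustration, one hypothetical way to reuse the same initial random weights across runs (the model below is only a placeholder, not necessarily what the example does):

```python
from tensorflow import keras

# Build the model once and snapshot its freshly initialized weights.
model = keras.applications.DenseNet121(weights=None, classes=2)
model.save_weights("initial_weights.h5")

# ...later, before a repeat run that should start from the same initialization:
model.load_weights("initial_weights.h5")
```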

@sayakpaul (Contributor, Author)

@fchollet I have added an extended explanation with regard to the results. PTAL and let me know if I am good to add the rest of the files.

@s-mrb (Contributor) left a comment


Check these,

@fchollet (Contributor)

Looks great! Please add the generated files.

@sayakpaul (Contributor, Author)

Done @fchollet. Over to you.

@fchollet (Contributor) left a comment


Thank you again for the great example! 👍

fchollet merged commit 562d0fe into keras-team:master on May 11, 2021
@innat (Contributor) commented Oct 22, 2021

@sayakpaul
I think it doesn't connect more than one residual block.

```python
num_res_blocks = 4

def res_block(x):
    inputs = x
    x = conv_block(x, 16, 3, 1)
    x = conv_block(x, 16, 3, 1, activation=None)
    return layers.Add()([inputs, x])

# Residual passes.
for _ in range(num_res_blocks):  # <-------
    x = res_block(bottleneck)
```

@sayakpaul (Contributor, Author)

@innat I am unable to understand what you mean because I don't immediately see a problem here:

```python
# Residual passes.
for _ in range(num_res_blocks):  # <-------
    x = res_block(bottleneck)
```

Are you saying it should be something like the following?

```python
for _ in range(num_res_blocks):
    x = res_block(bottleneck)
    bottleneck = x
```

If so, feel free to push a fix, I'd appreciate it.

@innat (Contributor) commented Oct 22, 2021

@sayakpaul
Actually, if we pass num_res_blocks=2 (i.e. more than one res block), it won't add new res blocks; the loop keeps feeding the same bottleneck into each block, so only one block ends up connected. Apart from that, all is OK. I think, instead of conv_block and res_block (two functions in the code), we could also do something like the following:

```python
def residual_block(x):
    shortcut = x

    def conv_bn_leaky(inputs, filters, kernel_size, strides):
        x = layers.Conv2D(filters, kernel_size, strides=strides,
                          use_bias=False, padding='same')(inputs)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        return x

    def conv_bn(inputs, filters, kernel_size, strides):
        x = layers.Conv2D(filters, kernel_size, strides, padding='same')(inputs)
        x = layers.BatchNormalization()(x)
        return x

    x = conv_bn_leaky(x, 16, 3, 1)
    x = conv_bn(x, 16, 3, 1)
    x = layers.add([shortcut, x])
    return x

...

# Intermediate resizing as a bottleneck.
bottleneck = layers.Resizing(
    *TARGET_SIZE, interpolation=interpolation
)(x)

# Residual passes.
# for _ in range(num_res_blocks):
#     x = res_block(bottleneck)

# Residual passes.
x = residual_block(bottleneck)
for i in range(1, num_res_blocks):
    x = residual_block(x)

# Projection.
x = layers.Conv2D(
    filters=filters, kernel_size=3, strides=1, padding="same", use_bias=False
)(x)

...
```

@sayakpaul (Contributor, Author)

Got it. But that introduces more changes. If we handle the variable locally, the changes are minimized.

But in any case, feel free to push a fix.
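For completeness, a minimal local fix along those lines could look like this (a sketch reusing the `res_block`, `bottleneck`, and `num_res_blocks` names from the snippets above):

```python
# Chain the residual blocks: each block consumes the previous block's output.
x = bottleneck
for _ in range(num_res_blocks):
    x = res_block(x)
```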
