[Discussion] What to improve on ConvNeXt? #1093
Replies: 2 comments
-
From my experience training on ImageNet and designing architectures, there are several things that could be beneficial for speed.
By the way, I'm open to collaboration. I have ideas, code, and 100+ experiments done on ImageNet so far, but I've been short on GPUs recently and had to pause the research. Feel free to email me at …
-
Hi @bonlime, let's keep the conversation here for now; I think it will be beneficial, since there may already be people working on the same thing, or people with better suggestions on how to move forward. I hope we don't end up wasting GPU resources on duplicate ideas. I will get back to this thread over the weekend when I am free. I am very interested in reading the papers you mentioned. GPU availability should not be a problem.
-
Hi all,
In reference to the ConvNeXt paper, what do you think we can improve?
It would be great if we could keep adding improvements that have already been proven in other CNN/transformer research papers to improve accuracy or reduce the model's inference time.
An obvious one would be to incorporate an auxiliary attention layer such as the Convolutional Block Attention Module (CBAM); a minimal sketch is given below. Another one that comes to mind is editing some blocks to use parallel convolutions with repeated fusion (as in HRNet).
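To make the CBAM suggestion concrete, here is a minimal PyTorch sketch of a CBAM-style block: channel attention from avg- and max-pooled descriptors through a shared MLP, followed by spatial attention over pooled channel maps. The placement inside a ConvNeXt block and the default hyperparameters (`reduction=16`, `kernel_size=7`) are assumptions for illustration, not part of the ConvNeXt codebase.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both the avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Concatenate channel-wise average and max maps, then convolve.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.channel(x)  # re-weight channels
        x = x * self.spatial(x)  # re-weight spatial positions
        return x


# Hypothetical usage on a ConvNeXt stage-1 feature map (N, 96, 56, 56):
# x = torch.randn(2, 96, 56, 56)
# out = CBAM(96)(x)
```

One design question to settle experimentally would be where to attach it, e.g. after the pointwise convolutions inside each block versus once per stage, since the extra pooling and sigmoid ops are cheap but not free.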
Hi @rwightman, I am interested in helping out with training if there are any good ideas for improving the model. I can contribute the trained weights once training is done.
It would be great to consolidate more proven ideas so that a "modern" CNN can be properly benchmarked against a vision transformer.