Description
Paper
Link: https://arxiv.org/abs/2007.04242
Year: 2020
Summary
- Proposes dynamic group convolution (DGC), which adaptively selects, for each individual sample and on the fly, which subset of input channels is connected within each group.
- Introduces a tiny auxiliary feature selector for each group that decides which input channels to connect, based on the activations of all input channels.
- Multiple groups can adaptively capture abundant and complementary visual/semantic features for each input image.
- Retains computational efficiency similar to that of conventional group convolution.
Contributions and Distinctions from Previous Works
- Existing group convolutions have two key disadvantages: 1) they weaken the representation capability of normal convolution by introducing sparse neuron connections, and their performance degrades, especially on difficult samples; 2) their neuron connection routines are fixed, regardless of the specific properties of individual inputs.
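
To make the "fixed connection" point concrete, here is a minimal sketch of how a conventional group convolution wires channels. The helper names `group_conv_connections` and `conv_weight_count` are hypothetical and introduced only for illustration; the layout assumed (contiguous, equally sized channel blocks per group) is the usual convention, not something taken from the paper.

```python
def group_conv_connections(in_channels, out_channels, groups):
    """Map each output channel to the input channels it reads.

    Assumes the usual layout: channels are split into contiguous,
    equally sized blocks, one block per group.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups
    connections = {}
    for out_ch in range(out_channels):
        group = out_ch // out_per_group
        start = group * in_per_group
        # The same input block is used for every sample: nothing here depends on the data.
        connections[out_ch] = list(range(start, start + in_per_group))
    return connections


def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a (group) convolution, ignoring biases."""
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


conns = group_conv_connections(in_channels=8, out_channels=4, groups=2)
print(conns[0])  # [0, 1, 2, 3]
print(conns[3])  # [4, 5, 6, 7]
print(conv_weight_count(8, 4, 3), conv_weight_count(8, 4, 3, groups=2))  # 288 144
```

The saving in weights (288 vs. 144 here) comes at the price of each output channel never seeing half of the input channels, whatever the input looks like.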
Methods
- Automatically selects the most important input channels conditioned on the input image.
- The output channels are split into multiple groups, each generated by an auxiliary head equipped with an input channel selector that decides which part of the input channels is used for the convolution (see the blue and green areas).
- The input channel selector in each head uses a gating strategy to dynamically pick the most important subset of input channels according to importance scores produced by a saliency generator. A normal convolution is then applied to the selected subset to produce the head's output channels. Finally, the output channels from the different heads are concatenated and shuffled, then passed to a BN layer and a non-linear activation layer.
- The saliency generator assigns each input channel a score representing its importance. Each head has its own saliency generator, which encourages different heads to use different subsets of the input channels and thus achieve diversified feature representations.
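
The pipeline described in the bullets above (score, gate, convolve, concatenate, shuffle) can be sketched roughly as follows in NumPy. This is a hedged illustration, not the authors' implementation. Several details are assumptions made for the sketch: the saliency generator is a global average pool followed by a two-layer bottleneck MLP with ReLU, the gate keeps a fixed number `keep` of top-scoring channels and rescales them by their scores, the convolution is 1x1, and the BN and activation layers are omitted. All function and parameter names are hypothetical.

```python
import numpy as np


def saliency_scores(x, w1, w2):
    """Score every input channel from the activations of all channels.

    x: (C, H, W) feature map of one sample.
    w1: (C // r, C) and w2: (C, C // r) are bottleneck MLP weights (assumed design).
    """
    pooled = x.mean(axis=(1, 2))           # (C,) global average pool
    hidden = np.maximum(w1 @ pooled, 0.0)  # ReLU
    return np.maximum(w2 @ hidden, 0.0)    # (C,) non-negative importance scores


def dgc_head(x, scores, weight, keep):
    """One head: gate to the `keep` highest-scoring channels, then a 1x1 conv."""
    selected = np.sort(np.argsort(-scores, kind="stable")[:keep])
    gated = x[selected] * scores[selected][:, None, None]  # rescale kept channels
    # Only the weight columns of the selected input channels take part.
    out = np.einsum("oc,chw->ohw", weight[:, selected], gated)
    return out, selected


def channel_shuffle(x, heads):
    """Interleave channels so consecutive outputs come from different heads."""
    c, h, w = x.shape
    return x.reshape(heads, c // heads, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)


def dynamic_group_conv(x, params, keep):
    """params holds one (w1, w2, conv_weight) tuple per head."""
    outs, picks = [], []
    for w1, w2, weight in params:
        scores = saliency_scores(x, w1, w2)  # each head has its own generator
        out, selected = dgc_head(x, scores, weight, keep)
        outs.append(out)
        picks.append(selected)
    y = channel_shuffle(np.concatenate(outs, axis=0), heads=len(params))
    return y, picks


rng = np.random.default_rng(0)
C, C_out, heads, keep, r = 8, 8, 4, 2, 2
params = [
    (rng.standard_normal((C // r, C)),
     rng.standard_normal((C, C // r)),
     rng.standard_normal((C_out // heads, C)))
    for _ in range(heads)
]
x = rng.standard_normal((C, 5, 5))
y, picks = dynamic_group_conv(x, params, keep)
print(y.shape)  # (8, 5, 5)
```

Because each head scores the channels with its own weights, `picks` will generally differ between heads and between inputs, which is the sample-dependent connectivity the notes describe. Each head still convolves over only `keep` input channels, so the per-head convolution cost matches a group convolution whose groups have `keep` input channels; the saliency generators add a small extra cost on top.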