0. Article Information and Links
- Paper link: https://seanseattle.github.io/SMIS/
- Release date: YYYY/MM/DD
- Number of citations (as of 2020/MM/DD):
- Implementation code: https://github.com/Seanseattle/SMIS
- Supplemental links (e.g. results):
- Publication: CVPR 2020
1. What do the authors try to accomplish?
A general framework that uses segmentation maps to semantically edit an image (like GauGAN), BUT it makes only localized edits that preserve the rest of the image: the edit traverses the latent space of just the isolated label of the segmentation map.
2. What's great compared to previous research?
Introduces new applications:
- Appearance mixture (new!)
- Semantic Manipulation (old, previously seen in GauGAN)
- Style Morphing (new!)
3. Where are the key elements of the technology and method?
Note this paper builds on top of the GauGAN paper (link to my notes).
Architecture Overview: GroupDNet
A standard ConvNet entangles feature maps across semantic classes, which would prevent localized editing. Using grouped convolutions keeps each class's features independent.
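A quick PyTorch sketch of this idea (my own toy example; the channel counts are made up, not from the SMIS code):

```python
import torch
import torch.nn as nn

C = 8        # number of semantic classes (illustrative)
ch = 16      # feature channels allotted to each class (illustrative)

# Standard conv: every output channel mixes information from ALL classes.
standard = nn.Conv2d(C * ch, C * ch, kernel_size=3, padding=1)

# Grouped conv with groups=C: class i's channels only ever see class i's channels,
# so editing one class's features cannot disturb the others.
grouped = nn.Conv2d(C * ch, C * ch, kernel_size=3, padding=1, groups=C)

x = torch.randn(1, C * ch, 64, 64)
print(standard(x).shape, grouped(x).shape)  # same shapes, but the grouped output stays class-wise
```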
Class-specific latent code
The latent code is broken into class-specific latent codes: one code per label in the segmentation map, e.g. 19 clothing labels --> 19 latent codes. (This seems to imply that a separate model is needed for each type of semantic map.)
The latent code is produced by passing the input image through the encoder's layers.
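A toy illustration of what "one code per label" buys you (my own sketch; the dimensions and the class index are hypothetical):

```python
import torch

C, d = 19, 8               # e.g. 19 clothing labels, an 8-dim code per class (d is illustrative)
z = torch.randn(C, d)      # full latent code viewed as C class-specific codes

# Resampling only one class's code (say, index 5) should change only that
# region of the synthesized image and leave everything else intact.
z_edit = z.clone()
z_edit[5] = torch.randn(d)
```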

The input image is split into C semantically segmented parts. The encoder uses grouped convolutions with C groups, so each group effectively operates only on its own segment of the image.
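A rough sketch of how that could look (assumed details, not the released code): the image is masked by each label of the segmentation map, the masked copies are concatenated channel-wise, and grouped convolutions then process each class's slice independently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 19
image = torch.randn(1, 3, 128, 128)        # RGB input
seg = torch.randint(0, C, (1, 128, 128))   # per-pixel class labels

# One masked RGB copy of the image per class, concatenated channel-wise -> (1, 3*C, H, W)
masks = F.one_hot(seg, C).permute(0, 3, 1, 2).float()                   # (1, C, H, W)
parts = (image.unsqueeze(1) * masks.unsqueeze(2)).reshape(1, 3 * C, 128, 128)

# Grouped convolutions with groups=C: each class's slice is encoded independently.
encoder = nn.Sequential(
    nn.Conv2d(3 * C, 16 * C, 3, stride=2, padding=1, groups=C),
    nn.ReLU(inplace=True),
    nn.Conv2d(16 * C, 8 * C, 3, stride=2, padding=1, groups=C),
)
features = encoder(parts)                  # (1, 8*C, 32, 32), still class-separated
```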
The latent code is pushed toward a standard Gaussian N(0,1) with a KL-divergence term. This Gaussian-like latent code lets the user cleanly walk the latent space for each class.
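A hedged sketch of this VAE-style constraint (shapes are illustrative): per-class means and log-variances are regularized toward N(0,1), so at test time each class's code can simply be sampled from, or walked through, a standard Gaussian.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over code dims, averaged over batch and classes
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))

mu = torch.randn(4, 19, 8)                               # (batch, classes, code dim)
logvar = torch.randn(4, 19, 8)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
loss_kl = kl_to_standard_normal(mu, logvar)
```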
Modify SPADE normalization to work for GroupConvs
Replace SPADE's convolutions with grouped convolutions; the result is called Conditional Group Normalization (CG-Norm).
The Conditional Group Block (CG-Block) is akin to SPADE's ResBlk variant, but uses the proposed CG-Norm instead.
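A minimal sketch of CG-Norm (my simplification, not the released code; SPADE's shared hidden layer is omitted and channel counts are made up): like SPADE, the segmentation map predicts a per-pixel scale and shift, but the predicting convolutions are grouped so each class only modulates its own feature channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CGNorm(nn.Module):
    def __init__(self, channels, seg_channels, groups):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)  # parameter-free normalization
        # Grouped convs: class i's segmentation channels only modulate class i's features.
        self.gamma = nn.Conv2d(seg_channels, channels, 3, padding=1, groups=groups)
        self.beta = nn.Conv2d(seg_channels, channels, 3, padding=1, groups=groups)

    def forward(self, x, seg):
        seg = F.interpolate(seg, size=x.shape[2:], mode='nearest')
        return self.norm(x) * (1 + self.gamma(seg)) + self.beta(seg)

C = 19
norm = CGNorm(channels=8 * C, seg_channels=C, groups=C)
x = torch.randn(2, 8 * C, 32, 32)
seg = torch.rand(2, C, 64, 64)   # one-hot-like segmentation map
y = norm(x, seg)                 # (2, 8*C, 32, 32)
```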
Loss
- L_GAN is the hinge version of the GAN loss
- L_FM is the feature-matching loss between the real and synthesized images, using features extracted from multiple layers of the discriminator
- L_P is the VGG perceptual loss
- L_KL is the KL divergence of the latent code from a standard Gaussian N(0,1) (all four terms are combined in the sketch below)
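A sketch of how these terms might be combined into one generator objective (my own code; the weights are placeholders, not the paper's coefficients):

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake, feats_real, feats_fake, vgg_real, vgg_fake, mu, logvar,
                   lambda_fm=10.0, lambda_p=10.0, lambda_kl=0.05):
    # Hinge GAN loss for the generator: push discriminator scores on fakes upward.
    loss_gan = -d_fake.mean()
    # Feature matching: L1 between discriminator features of real and synthesized images.
    loss_fm = sum(F.l1_loss(f, r) for f, r in zip(feats_fake, feats_real))
    # Perceptual loss: L1 between VGG features of real and synthesized images.
    loss_p = sum(F.l1_loss(f, r) for f, r in zip(vgg_fake, vgg_real))
    # KL divergence of the encoded latent code from N(0, 1).
    loss_kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return loss_gan + lambda_fm * loss_fm + lambda_p * loss_p + lambda_kl * loss_kl
```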
4. How did you verify that it works?
5. Things to discuss? (e.g. weaknesses, potential for future work, relation to other work)
- The authors note that this architecture is very SLOW.
- Possible future work: vary the shape in addition to the texture.





