0. Article Information and Links
- Paper link: https://seanseattle.github.io/SMIS/
- Release date: YYYY/MM/DD
- Number of citations (as of 2020/MM/DD):
- Implementation code: https://github.com/Seanseattle/SMIS
- Supplemental links (e.g. results):
- Publication: CVPR 2020
1. What do the authors try to accomplish?
A general framework that uses segmentation maps to semantically edit an image (like GauGAN), BUT it makes only localized edits that preserve the rest of the image: the edit traverses the latent space of just the isolated label of the segmentation map.
2. What's great compared to previous research?
Introduces new applications:
- Appearance mixture (new!)
- Semantic Manipulation (old, previously seen in GauGAN)
- Style Morphing (new!)
3. Where are the key elements of the technology and method?
Note this paper builds on top of the GauGAN paper (link to my notes).
Architecture Overview: GroupDNet
A standard ConvNet entangles feature maps across semantic classes, which would prevent localized editing. Using grouped convolutions keeps each class's features independent.
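A quick PyTorch sketch of this idea (my own toy example; the channel counts are made up, not from the SMIS code):

```python
import torch
import torch.nn as nn

C = 8        # number of semantic classes (illustrative)
ch = 16      # feature channels allotted to each class (illustrative)

# Standard conv: every output channel mixes information from ALL classes.
standard = nn.Conv2d(C * ch, C * ch, kernel_size=3, padding=1)

# Grouped conv with groups=C: class i's channels only ever see class i's channels,
# so editing one class's features cannot disturb the others.
grouped = nn.Conv2d(C * ch, C * ch, kernel_size=3, padding=1, groups=C)

x = torch.randn(1, C * ch, 64, 64)
print(standard(x).shape, grouped(x).shape)  # same shapes, but the grouped output stays class-wise
```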
Class-specific latent code
The latent code is broken into class-specific latent codes: one code per label in the segmentation map, e.g. 19 clothing labels --> 19 latent codes. (This seems to imply that a separate model is needed for each type of semantic map.)
The latent code is produced by passing the input image through the encoder's layers.
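A toy illustration of what "one code per label" buys you (my own sketch; the dimensions and the class index are hypothetical):

```python
import torch

C, d = 19, 8               # e.g. 19 clothing labels, an 8-dim code per class (d is illustrative)
z = torch.randn(C, d)      # full latent code viewed as C class-specific codes

# Resampling only one class's code (say, index 5) should change only that
# region of the synthesized image and leave everything else intact.
z_edit = z.clone()
z_edit[5] = torch.randn(d)
```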

The input image is split into C semantically segmented parts. The encoder uses grouped convolutions with C groups, so each group effectively operates only on its own segment of the image.
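A rough sketch of how that could look (assumed details, not the released code): the image is masked by each label of the segmentation map, the masked copies are concatenated channel-wise, and grouped convolutions then process each class's slice independently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 19
image = torch.randn(1, 3, 128, 128)        # RGB input
seg = torch.randint(0, C, (1, 128, 128))   # per-pixel class labels

# One masked RGB copy of the image per class, concatenated channel-wise -> (1, 3*C, H, W)
masks = F.one_hot(seg, C).permute(0, 3, 1, 2).float()                   # (1, C, H, W)
parts = (image.unsqueeze(1) * masks.unsqueeze(2)).reshape(1, 3 * C, 128, 128)

# Grouped convolutions with groups=C: each class's slice is encoded independently.
encoder = nn.Sequential(
    nn.Conv2d(3 * C, 16 * C, 3, stride=2, padding=1, groups=C),
    nn.ReLU(inplace=True),
    nn.Conv2d(16 * C, 8 * C, 3, stride=2, padding=1, groups=C),
)
features = encoder(parts)                  # (1, 8*C, 32, 32), still class-separated
```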
The latent code is pushed toward a standard Gaussian N(0,1) with a KL-divergence term. This Gaussian-like latent code lets the user cleanly walk the latent space for each class.
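A hedged sketch of this VAE-style constraint (shapes are illustrative): per-class means and log-variances are regularized toward N(0,1), so at test time each class's code can simply be sampled from, or walked through, a standard Gaussian.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over code dims, averaged over batch and classes
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))

mu = torch.randn(4, 19, 8)                               # (batch, classes, code dim)
logvar = torch.randn(4, 19, 8)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
loss_kl = kl_to_standard_normal(mu, logvar)
```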
Modify SPADE normalization to work for GroupConvs
Replace SPADE's convolutions with grouped convolutions; the result is called Conditional Group Normalization (CG-Norm).
The Conditional Group Block (CG-Block) is akin to SPADE's ResBlk variant, but uses the proposed CG-Norm instead.
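A minimal sketch of CG-Norm (my simplification, not the released code; SPADE's shared hidden layer is omitted and channel counts are made up): like SPADE, the segmentation map predicts a per-pixel scale and shift, but the predicting convolutions are grouped so each class only modulates its own feature channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CGNorm(nn.Module):
    def __init__(self, channels, seg_channels, groups):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)  # parameter-free normalization
        # Grouped convs: class i's segmentation channels only modulate class i's features.
        self.gamma = nn.Conv2d(seg_channels, channels, 3, padding=1, groups=groups)
        self.beta = nn.Conv2d(seg_channels, channels, 3, padding=1, groups=groups)

    def forward(self, x, seg):
        seg = F.interpolate(seg, size=x.shape[2:], mode='nearest')
        return self.norm(x) * (1 + self.gamma(seg)) + self.beta(seg)

C = 19
norm = CGNorm(channels=8 * C, seg_channels=C, groups=C)
x = torch.randn(2, 8 * C, 32, 32)
seg = torch.rand(2, C, 64, 64)   # one-hot-like segmentation map
y = norm(x, seg)                 # (2, 8*C, 32, 32)
```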
Loss
- L_GAN is the hinge version of the GAN loss
- L_FM is the feature-matching loss between the real and synthesized images, using features extracted from multiple layers of the discriminator
- L_P is the VGG perceptual loss
- L_KL is the KL divergence of the latent code from a standard Gaussian N(0,1) (all four terms are combined in the sketch below)
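A sketch of how these terms might be combined into one generator objective (my own code; the weights are placeholders, not the paper's coefficients):

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake, feats_real, feats_fake, vgg_real, vgg_fake, mu, logvar,
                   lambda_fm=10.0, lambda_p=10.0, lambda_kl=0.05):
    # Hinge GAN loss for the generator: push discriminator scores on fakes upward.
    loss_gan = -d_fake.mean()
    # Feature matching: L1 between discriminator features of real and synthesized images.
    loss_fm = sum(F.l1_loss(f, r) for f, r in zip(feats_fake, feats_real))
    # Perceptual loss: L1 between VGG features of real and synthesized images.
    loss_p = sum(F.l1_loss(f, r) for f, r in zip(vgg_fake, vgg_real))
    # KL divergence of the encoded latent code from N(0, 1).
    loss_kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return loss_gan + lambda_fm * loss_fm + lambda_p * loss_p + lambda_kl * loss_kl
```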
4. How did you verify that it works?
5. Things to discuss? (e.g. weaknesses, potential for future work, relation to other work)
- The authors note that this architecture is very SLOW.
- Possible future work: vary the shape in addition to the texture.





