Skip to content

Add VAEImageDecoder for StableDiffusionV3 #1796

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 28, 2024

Conversation

james77777778
Copy link
Collaborator

Numerics check:
https://colab.research.google.com/drive/1YsWvZ0NBINDgdqipsldso1Y1NKJUDkUf?usp=sharing

Future works:

  • Implement CLIPPreprocessor
  • Wrap CLIPTextEncoder and T5XXLTextEncoder for the use in StableDiffusionV3
  • Implement VAEImageDecoder
  • Implement MMDiT
  • Implement StableDiffusionV3 (inference model)

@divyashreepathihalli @mattdangerw @SamanehSaadat

Copy link
Collaborator

@divyashreepathihalli divyashreepathihalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!! LGTM!

@divyashreepathihalli divyashreepathihalli added the kokoro:force-run Runs Tests on GPU label Aug 26, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Aug 26, 2024
Copy link
Member

@mattdangerw mattdangerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! minor nits and a few design notes I don't think we need to solve in this PR

from keras_nlp.src.utils.keras_utils import standardize_data_format


class VAEImageDecoder(Backbone):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that with pali gemma, our "backbone" contains all the weights needed from a pre-trained model. So in that case the image encoder and text decoder collectively form a single backbone class.

We should discuss the high level flows that we want as we go, but our current approach is...

  • StableDiffusionBackbone should contain all the pretrained weights for using the entire model without a specific task setup. This can come from stitching other backbones/sub models together. No preprocessing.
  • StableDiffusion[TaskName] would wrap the backbone with a setup for a particular task. Preprocessing included. Ideally allowing both find-tuning and inference, but that would depend on the task at hand. For stable diffusion the main task is definitely text to image, though I'm not sure what we should call that. StableDiffusionImageGenerator is kinda long.

Copy link
Collaborator Author

@james77777778 james77777778 Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure of how we wanted to assemble these encoders and the decoder, so I made them as a Backbone first.

We should discuss the high level flows that we want as we go, but our current approach is...

Got it. Will make the encoders and decoder as a keras.Model to follow that pattern.

I think the task name, ImageGenerator, is a bit ambiguous. Maybe we should call it TextToImage instead?
It is also possible to use SD3 for ImageToImage and Inpaint tasks.
https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion_3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextToImage sounds fine to me. Shorter.

Got it. Will make the encoders and decoder as a keras.Model to follow that pattern.

I suspect we still have more to figure out here. For these big "composite models" with lots of sub-components, it would be good if we allowed loading sub models individually some how. E.g. load the text encoder of a T5 model, or just the image encoder of PaliGemma. That's a valid use case, that fit's with the flexibility we'd like to shoot for, and we don't support it today. But a probably for another PR I think.

@divyashreepathihalli divyashreepathihalli added the kokoro:force-run Runs Tests on GPU label Aug 27, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Aug 27, 2024
@mattdangerw mattdangerw merged commit 536474a into keras-team:keras-hub Aug 28, 2024
10 checks passed
@james77777778 james77777778 deleted the add-vae-decoder branch August 29, 2024 02:26
mattdangerw pushed a commit to mattdangerw/keras-hub that referenced this pull request Sep 10, 2024
* Add `VAEImageDecoder` for StableDiffusionV3

* Use `keras.Model` for `VAEImageDecoder` and follows the coding style in `VAEAttention`
mattdangerw pushed a commit that referenced this pull request Sep 11, 2024
* Add `VAEImageDecoder` for StableDiffusionV3

* Use `keras.Model` for `VAEImageDecoder` and follows the coding style in `VAEAttention`
mattdangerw pushed a commit that referenced this pull request Sep 13, 2024
* Add `VAEImageDecoder` for StableDiffusionV3

* Use `keras.Model` for `VAEImageDecoder` and follows the coding style in `VAEAttention`
mattdangerw pushed a commit that referenced this pull request Sep 17, 2024
* Add `VAEImageDecoder` for StableDiffusionV3

* Use `keras.Model` for `VAEImageDecoder` and follows the coding style in `VAEAttention`
divyashreepathihalli added a commit that referenced this pull request Sep 25, 2024
* Add VGG16 backbone (#1737)

* Agg Vgg16 backbone

* update names

* update tests

* update test

* add image classifier

* incorporate review comments

* Update test case

* update backbone test

* add image classifier

* classifier cleanup

* code reformat

* add vgg16 image classifier

* make vgg generic

* update doc string

* update docstring

* add classifier test

* update tests

* update docstring

* address review comments

* code reformat

* update the configs

* address review comments

* fix task saved model test

* update init

* code reformatted

* Add `ResNetBackbone` and `ResNetImageClassifier` (#1765)

* Add ResNetV1 and ResNetV2

* Address comments

* Add CSP DarkNet backbone and classifier (#1774)

* Add CSP DarkNet

* Add CSP DarkNet

* snake_case function names

* change use_depthwise to block_type

* Add `FeaturePyramidBackbone` and port weights from `timm` for `ResNetBackbone` (#1769)

* Add FeaturePyramidBackbone and update ResNetBackbone

* Simplify the implementation

* Fix CI

* Make ResNetBackbone compatible with timm and add FeaturePyramidBackbone

* Add conversion implementation

* Update docstrings

* Address comments

* Add DenseNet (#1775)

* Add DenseNet

* fix testcase

* address comments

* nit

* fix lint errors

* move description

* Add ViTDetBackbone (#1776)

* add vit det vit_det_backbone

* update docstring

* code reformat

* fix tests

* address review comments

* bump year on all files

* address review comments

* rename backbone

* fix tests

* change back to ViT

* address review comments

* update image shape

* Add Mix transformer (#1780)

* Add MixTransformer

* fix testcase

* test changes and comments

* lint fix

* update config list

* modify testcase for 2 layers

* update input_image_shape -> image_shape (#1785)

* update input_image_shape -> image_shape

* update docstring example

* code reformat

* update tests

* Create __init__.py (#1788)

add missing __init__ file to vit_det

* Hack package build script to rename to keras-hub (#1793)

This is a temporary way to test out the keras-hub branch.
- Does a global rename of all symbols during package build.
- Registers the "old" name on symbol export for saving compat.
- Adds a github action to publish every commit to keras-hub as
  a new package.
- Removes our descriptions on PyPI temporarily, until we want
  to message this more broadly.

* Add CLIP and T5XXL for StableDiffusionV3 (#1790)

* Add `CLIPTokenizer`, `T5XXLTokenizer`, `CLIPTextEncoder` and `T5XXLTextEncoder`.

* Make CLIPTextEncoder as Backbone

* Add `T5XXLPreprocessor` and remove `T5XXLTokenizer`

Add `CLIPPreprocessor`

* Use `tf = None` at the top

* Replace manual implementation of `CLIPAttention` with `MultiHeadAttention`

* Add Bounding Box Utils (#1791)

* Bounding box utils

* - Correct test cases

* - Remove hard tensorflow dtype

* - fix api gen

* - Fix import for test cases
- Use setup for converters test case

* - fix api_gen issue

* - FIx api gen

* - Fix api gen error

* - Correct test cases as per new api changes

* mobilenet_v3 added in keras-nlp (#1782)

* mobilenet_v3 added in keras-nlp

* minor bug fixed in mobilenet_v3_backbone

* formatting corrected

* refactoring backbone

* correct_pad_downsample method added

* refactoring backbone

* parameters updated

* Testcaseupdated, expected output shape corrected

* code formatted with black

* testcase updated

* refactoring and description added

* comments updated

* added mobilenet v1 and v2

* merge conflict resolved

* version arg removed, and config options added

* input_shape changed to image_shape in arg

* config updated

* input shape corrected

* comments resolved

* activation function format changed

* minor bug fixed

* minor bug fixed

* added vision_backbone_test

* channel_first bug resolved

* channel_first cases working

* comments  resolved

* formatting fixed

* refactoring

---------

Co-authored-by: ushareng <usha.rengaraju@gmail.com>

* Pkgoogle/efficient net migration (#1778)

* migrating efficientnet models to keras-hub

* merging changes from other sources

* autoformatting pass

* initial consolidation of efficientnet_backbone

* most updates and removing separate implementation

* cleanup, autoformatting, keras generalization

* removed layer examples outside of effiicient net

* many, mainly documentation changes, small test fixes

* Add the ResNet_vd backbone (#1766)

* Add ResNet_vd to ResNet backbone

* Addressed requested parameter changes

* Fixed tests and updated comments

* Added new parameters to docstring

* Add `VAEImageDecoder` for StableDiffusionV3 (#1796)

* Add `VAEImageDecoder` for StableDiffusionV3

* Use `keras.Model` for `VAEImageDecoder` and follows the coding style in `VAEAttention`

* Replace `Backbone` with `keras.Model` in `CLIPTextEncoder` and `T5XXLTextEncoder` (#1802)

* Add pyramid output for densenet, cspDarknet (#1801)

* add pyramid outputs

* fix testcase

* format fix

* make common testcase for pyramid outputs

* change default shape

* simplify testcase

* test case change and add channel axis

* Add `MMDiT` for StableDiffusionV3 (#1806)

* Add `MMDiT`

* Update

* Update

* Update implementation

* Add remaining bbox utils (#1804)

* - Add formats, iou, utils for bounding box

* - Add `AnchorGenerator`, `BoxMatcher` and `NonMaxSupression` layers

* - Remove scope_name  not required.

* use default keras name scope

* - Correct format error

* - Remove layers as of now and keep them at model level till keras core supports them

* - Correct api_gen

* Fix timm conversion for rersnet (#1814)

* Add `StableDiffusion3`

* Fix `_normalize_inputs`

* Separate CLIP encoders from SD3 backbone.

* Simplify `text_to_image` function.

* Address comments

* Minor update and add docstrings.

* Add VGG16 backbone (#1737)

* Agg Vgg16 backbone

* update names

* update tests

* update test

* add image classifier

* incorporate review comments

* Update test case

* update backbone test

* add image classifier

* classifier cleanup

* code reformat

* add vgg16 image classifier

* make vgg generic

* update doc string

* update docstring

* add classifier test

* update tests

* update docstring

* address review comments

* code reformat

* update the configs

* address review comments

* fix task saved model test

* update init

* code reformatted

* Add `ResNetBackbone` and `ResNetImageClassifier` (#1765)

* Add ResNetV1 and ResNetV2

* Address comments

* Add CSP DarkNet backbone and classifier (#1774)

* Add CSP DarkNet

* Add CSP DarkNet

* snake_case function names

* change use_depthwise to block_type

* Add `FeaturePyramidBackbone` and port weights from `timm` for `ResNetBackbone` (#1769)

* Add FeaturePyramidBackbone and update ResNetBackbone

* Simplify the implementation

* Fix CI

* Make ResNetBackbone compatible with timm and add FeaturePyramidBackbone

* Add conversion implementation

* Update docstrings

* Address comments

* Add DenseNet (#1775)

* Add DenseNet

* fix testcase

* address comments

* nit

* fix lint errors

* move description

* Add ViTDetBackbone (#1776)

* add vit det vit_det_backbone

* update docstring

* code reformat

* fix tests

* address review comments

* bump year on all files

* address review comments

* rename backbone

* fix tests

* change back to ViT

* address review comments

* update image shape

* Add Mix transformer (#1780)

* Add MixTransformer

* fix testcase

* test changes and comments

* lint fix

* update config list

* modify testcase for 2 layers

* update input_image_shape -> image_shape (#1785)

* update input_image_shape -> image_shape

* update docstring example

* code reformat

* update tests

* Create __init__.py (#1788)

add missing __init__ file to vit_det

* Hack package build script to rename to keras-hub (#1793)

This is a temporary way to test out the keras-hub branch.
- Does a global rename of all symbols during package build.
- Registers the "old" name on symbol export for saving compat.
- Adds a github action to publish every commit to keras-hub as
  a new package.
- Removes our descriptions on PyPI temporarily, until we want
  to message this more broadly.

* Add CLIP and T5XXL for StableDiffusionV3 (#1790)

* Add `CLIPTokenizer`, `T5XXLTokenizer`, `CLIPTextEncoder` and `T5XXLTextEncoder`.

* Make CLIPTextEncoder as Backbone

* Add `T5XXLPreprocessor` and remove `T5XXLTokenizer`

Add `CLIPPreprocessor`

* Use `tf = None` at the top

* Replace manual implementation of `CLIPAttention` with `MultiHeadAttention`

* Add Bounding Box Utils (#1791)

* Bounding box utils

* - Correct test cases

* - Remove hard tensorflow dtype

* - fix api gen

* - Fix import for test cases
- Use setup for converters test case

* - fix api_gen issue

* - FIx api gen

* - Fix api gen error

* - Correct test cases as per new api changes

* mobilenet_v3 added in keras-nlp (#1782)

* mobilenet_v3 added in keras-nlp

* minor bug fixed in mobilenet_v3_backbone

* formatting corrected

* refactoring backbone

* correct_pad_downsample method added

* refactoring backbone

* parameters updated

* Testcaseupdated, expected output shape corrected

* code formatted with black

* testcase updated

* refactoring and description added

* comments updated

* added mobilenet v1 and v2

* merge conflict resolved

* version arg removed, and config options added

* input_shape changed to image_shape in arg

* config updated

* input shape corrected

* comments resolved

* activation function format changed

* minor bug fixed

* minor bug fixed

* added vision_backbone_test

* channel_first bug resolved

* channel_first cases working

* comments  resolved

* formatting fixed

* refactoring

---------

Co-authored-by: ushareng <usha.rengaraju@gmail.com>

* Pkgoogle/efficient net migration (#1778)

* migrating efficientnet models to keras-hub

* merging changes from other sources

* autoformatting pass

* initial consolidation of efficientnet_backbone

* most updates and removing separate implementation

* cleanup, autoformatting, keras generalization

* removed layer examples outside of effiicient net

* many, mainly documentation changes, small test fixes

* Add the ResNet_vd backbone (#1766)

* Add ResNet_vd to ResNet backbone

* Addressed requested parameter changes

* Fixed tests and updated comments

* Added new parameters to docstring

* Add `VAEImageDecoder` for StableDiffusionV3 (#1796)

* Add `VAEImageDecoder` for StableDiffusionV3

* Use `keras.Model` for `VAEImageDecoder` and follows the coding style in `VAEAttention`

* Replace `Backbone` with `keras.Model` in `CLIPTextEncoder` and `T5XXLTextEncoder` (#1802)

* Add pyramid output for densenet, cspDarknet (#1801)

* add pyramid outputs

* fix testcase

* format fix

* make common testcase for pyramid outputs

* change default shape

* simplify testcase

* test case change and add channel axis

* Add `MMDiT` for StableDiffusionV3 (#1806)

* Add `MMDiT`

* Update

* Update

* Update implementation

* Add remaining bbox utils (#1804)

* - Add formats, iou, utils for bounding box

* - Add `AnchorGenerator`, `BoxMatcher` and `NonMaxSupression` layers

* - Remove scope_name  not required.

* use default keras name scope

* - Correct format error

* - Remove layers as of now and keep them at model level till keras core supports them

* - Correct api_gen

* Fix timm conversion for rersnet (#1814)

* Fix

* Update

* Rename to diffuser and decoder

* Define functional model

* Merge from upstream/master

* Delete old SD3

* Fix copyright

* Rename to keras_hub

* Address comments

* Update

* Fix CI

* Fix bugs occurred in keras3.1

---------

Co-authored-by: Divyashree Sreepathihalli <divyashreepathihalli@gmail.com>
Co-authored-by: Sachin Prasad <sachinprasad@google.com>
Co-authored-by: Matt Watson <1389937+mattdangerw@users.noreply.github.com>
Co-authored-by: Siva Sravana Kumar Neeli <113718461+sineeli@users.noreply.github.com>
Co-authored-by: Usha Rengaraju <34335028+ushareng@users.noreply.github.com>
Co-authored-by: ushareng <usha.rengaraju@gmail.com>
Co-authored-by: pkgoogle <132095473+pkgoogle@users.noreply.github.com>
Co-authored-by: gowthamkpr <47574994+gowthamkpr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants