[ViT] Vision Transformer (ViT) backbone, layers, and image classifier #1989
Conversation
This looks great! Very nice work. Just a couple comments.
@@ -137,7 +139,10 @@ def __init__(
# === Functional Model ===
inputs = self.backbone.input
x = self.backbone(inputs)
x = self.pooler(x)
if pooling == "token":  # used for Vision Transformer (ViT)
"token" feels like a bit a weird name here, especially when compared to "avg"
or "max"
. Maybe "first"
?
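For reference, a tiny sketch (assuming a standard `(batch, seq, dim)` ViT output and the Keras 3 `ops` API) of what each pooling name would select:

```python
from keras import ops

# Hypothetical (batch, tokens, hidden_dim) ViT output.
x = ops.ones((2, 197, 768))

first = x[:, 0]            # "token" / "first": the class token, (2, 768)
avg = ops.mean(x, axis=1)  # "avg": mean over the sequence, (2, 768)
mx = ops.max(x, axis=1)    # "max": max over the sequence, (2, 768)
```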
Actually wouldn't this also break for other classifier types? I think this "token" pooling would fail to actually pool over a 2d output from most backbones, and similarly global avg 2d pooling would fail to pool correctly for a ViT backbone, right (since it's a 1d sequence after patching)? Instead we should subclass here, and not let pooling be configurable for ViT. See https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/vgg/vgg_image_classifier.py as an example of this
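A minimal sketch of that subclassing approach, modeled on the linked `vgg_image_classifier.py` pattern; the class name and constructor arguments here are illustrative, not the final PR API:

```python
import keras
from keras import layers

class ViTImageClassifier(keras.Model):
    """Sketch: ViT-specific classifier with fixed (non-configurable) pooling."""

    def __init__(self, backbone, num_classes, **kwargs):
        # Build the functional graph up front, as the VGG classifier does.
        inputs = backbone.input
        x = backbone(inputs)  # (batch, num_patches + 1, hidden_dim)
        x = x[:, 0]           # always pool via the class token for ViT
        outputs = layers.Dense(num_classes)(x)
        super().__init__(inputs=inputs, outputs=outputs, **kwargs)
        self.backbone = backbone
        self.num_classes = num_classes
```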
Oh yes, I was thinking earlier to subclass and write a totally new one. Thanks for pointing that out; I will make the required changes.
Also, from Hugging Face I observed that there is one more dense layer when the model is not used for image classification, which they call the pooler layer; it is just a dense layer (projecting to the same hidden dimension) with a tanh activation.
Should we include this? If we only consider image classification, this layer wouldn't be present.
Image Classification: https://github.com/huggingface/transformers/blob/91b8ab18b778ae9e2f8191866e018cd1dc7097be/src/transformers/models/vit/modeling_vit.py#L823C37-L823C54
Any thoughts?
The original JAX code calls it representation size: https://github.com/google-research/vision_transformer/blob/c6de1e5378c9831a8477feb30994971bdc409e46/vit_jax/models_vit.py#L296C13-L296C32
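A sketch of that optional layer, assuming the HF pooler / JAX `representation_size` behavior (a same-width dense projection plus `tanh`, applied before the classification head); all names and sizes here are illustrative:

```python
import keras

hidden_dim = 768
representation_size = 768  # None would skip the extra layer entirely
num_classes = 1000

cls_token = keras.Input(shape=(hidden_dim,))
x = cls_token
if representation_size is not None:
    # The "pooler" / representation layer: Dense + tanh.
    x = keras.layers.Dense(representation_size, activation="tanh")(x)
logits = keras.layers.Dense(num_classes)(x)
head = keras.Model(cls_token, logits)
```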
Looking good, but let's fix the broken pooling configurations.
Looks good! Just some comments on the classifier API
`"token_unpooled"`: Ouputs directly tokens from `ViTBackbone` | ||
representation_size: Optional dimensionality of the intermediate | ||
representation layer before the final classification layer. | ||
If `None`, the output of the transformer is directly used." |
trailing quote mark?
Done
self.preprocessor = preprocessor

if representation_size is not None:
    self.representation_layer = keras.layers.Dense(
maybe call this `intermediate_dim`? Fits with other places where we have an arg for the middle size of a two-layer MLP.
Done
elif pooling == "gap":
    ndim = len(ops.shape(x))
    x = ops.mean(x, axis=list(range(1, ndim - 1)))  # (1,) or (1, 2)
elif pooling == "token_unpooled":
Will this change the output shape? Output of an image classifier should be (batch_size, num_classes). How is this expected to be used?
Oh yeah, this part is not required; the backbone will serve the purpose if users want to use it for some other task rather than image classification.
It is used in the JAX code since they have a single network. Thanks, Matt!
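To illustrate the point (with a stand-in backbone that just mimics a ViT's token-sequence output shape; the real `ViTBackbone` would be used in practice), unpooled features come straight from the backbone, while the classifier keeps its `(batch_size, num_classes)` output:

```python
import keras

hidden_dim = 768

# Stand-in for ViTBackbone: patchify 224x224 images into a token sequence.
inputs = keras.Input(shape=(224, 224, 3))
x = keras.layers.Conv2D(hidden_dim, 16, strides=16)(inputs)  # 14x14 patches
x = keras.layers.Reshape((-1, hidden_dim))(x)                # (batch, 196, 768)
backbone = keras.Model(inputs, x)

tokens = backbone(keras.random.uniform((2, 224, 224, 3)))
print(tokens.shape)  # (2, 196, 768): sequence features for non-classification tasks
```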
Done
looks good to me! will merge once green
This PR introduces a Vision Transformer (ViT) implementation