The Vision Transformer (ViT) is a neural network architecture used for computer vision tasks such as classification, segmentation, and object detection. In contrast to CNNs, transformers do not contain any convolutional blocks (if we do not count the patch embedding).
ViT adapts the original Transformer architecture, which was designed for language tasks (sentiment analysis, text prediction, summarization, and so on). Transformers were first introduced in the "Attention Is All You Need" paper.
The patch embedding layer is used to create patches from our images. After this embedding we add a positional encoding, which represents the position of each patch in the original image. We also add a class token, an additional embedding that we will use for our class prediction.
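To make this concrete, here is a minimal PyTorch sketch of a patch embedding layer with a class token and learnable positional embeddings. The class name, the default sizes (224x224 images, 16x16 patches, embedding dimension 768), and the use of a strided Conv2d to extract and project patches in one step are illustrative choices of mine, not code from any particular implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches, projects each patch to an
    embedding vector, prepends a class token, and adds positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Conv2d with kernel_size == stride == patch_size extracts each patch
        # and applies the linear projection in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable class token, prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable positional embedding: one per patch plus one for the class token.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, num_patches, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the class token
        return x + self.pos_embed            # add positional encoding
```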
Unlike the original Transformer network, we use only the encoder blocks (similar to BERT models). Each encoder block consists of a multi-head attention block, an MLP (Multi-Layer Perceptron) block, and Layer Normalization. We can stack multiple encoder blocks on top of each other. Finally, a linear classification head takes the class token embedding and returns the logits of our prediction.
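Below is a minimal sketch of one encoder block and the classification head, again in PyTorch and using torch.nn.MultiheadAttention. The pre-norm layout, the default sizes (embedding dimension 768, 12 heads, 12 stacked blocks), and the class names are assumptions made for illustration, not the exact implementation from the paper.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: LayerNorm -> multi-head self-attention ->
    residual connection, then LayerNorm -> MLP -> residual connection."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x):                    # x: (batch, seq_len, embed_dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)     # self-attention over the patch sequence
        x = x + attn_out                     # residual connection
        x = x + self.mlp(self.norm2(x))      # MLP block with residual connection
        return x

class ViTClassifier(nn.Module):
    """Stacks several encoder blocks and predicts logits from the class token."""
    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        self.blocks = nn.Sequential(
            *[EncoderBlock(embed_dim, num_heads) for _ in range(depth)]
        )
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                    # x: output of the patch embedding layer
        x = self.blocks(x)
        x = self.norm(x)
        return self.head(x[:, 0])            # logits computed from the class token only
```

Chaining the two sketches together, `ViTClassifier()(PatchEmbedding()(images))` maps a batch of images of shape (batch, 3, 224, 224) to logits of shape (batch, num_classes).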