Vision Transformer implementation

The Vision Transformer (ViT) is a neural network architecture used for computer vision tasks such as classification, segmentation, and object detection. In contrast to CNNs, it contains no convolutional blocks (if we do not count the patch embedding, which can be implemented as a strided convolution).

ViT adapts the original Transformer architecture, which was designed for language tasks (sentiment analysis, text prediction, summarization, ...). Transformers were first introduced in the paper "Attention Is All You Need".

Parts of ViT

Patch embedding

This layer splits the image into fixed-size patches and projects each patch into an embedding vector. A positional encoding is then added to each embedding, representing the position of the patch in the original image.

We also prepend a class token, an additional learnable embedding whose output we will use for the class prediction, as in the sketch below.
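
A minimal PyTorch sketch of a patch embedding layer, not necessarily this repository's implementation: the class name, default sizes (ViT-Base-like, 224x224 images, 16x16 patches, 768-dim embeddings), and the use of a strided convolution to extract patches are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project them, and add class token + positions."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution extracts and projects all patches in one step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable class token, prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable positional encoding for the class token + all patches.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):
        B = x.shape[0]
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend class token
        return x + self.pos_embed
```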

Transformer encoder

Unlike the original Transformer, which has both an encoder and a decoder, we use only the encoder blocks (similar to BERT models). Each encoder block consists of a multi-head self-attention layer and an MLP (Multi-Layer Perceptron) block, each preceded by Layer Normalization and wrapped in a residual connection. We can stack multiple of these blocks together.
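
A minimal sketch of one pre-norm encoder block, again an illustrative assumption rather than the repo's exact code; `nn.MultiheadAttention` stands in for whatever attention implementation the repository uses, and the MLP hidden size follows the common 4x expansion ratio.

```python
class TransformerEncoderBlock(nn.Module):
    """Pre-norm block: LayerNorm -> MHSA -> residual, LayerNorm -> MLP -> residual."""

    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x):
        # Multi-head self-attention with a residual connection.
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        # MLP with a residual connection.
        x = x + self.mlp(self.norm2(x))
        return x
```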

Classifier

A linear layer that takes the class token embedding from the final encoder block and returns the logits of the prediction.
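
Putting the three parts together, a hypothetical end-to-end sketch built from the illustrative classes above (the class name `ViTClassifier` and the depth/width defaults are assumptions, not this repository's API):

```python
class ViTClassifier(nn.Module):
    """Patch embedding -> stacked encoder blocks -> linear head on the class token."""

    def __init__(self, num_classes=1000, depth=12, embed_dim=768, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        self.blocks = nn.Sequential(
            *[TransformerEncoderBlock(embed_dim, num_heads) for _ in range(depth)]
        )
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.blocks(self.patch_embed(x))
        cls = self.norm(x)[:, 0]   # keep only the class token
        return self.head(cls)      # logits

# Usage: two random 224x224 RGB images -> logits over 10 classes.
logits = ViTClassifier(num_classes=10)(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```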
