# Vision Transformer (ViT): Concepts and Math

Welcome to the Vision Transformer (ViT) project! In this repository, we delve into the concepts and mathematical foundations of the Vision Transformer architecture, a groundbreaking model for image classification tasks.

## Overview

The Vision Transformer (ViT) is a powerful deep learning architecture that applies the transformer model, originally designed for natural language processing tasks, to computer vision tasks such as image classification. ViT revolutionizes image processing by replacing traditional convolutional layers with self-attention mechanisms.

## How ViT Works: Concepts and Math

### Self-Attention Mechanism

At the heart of ViT lies the self-attention mechanism, which captures relationships between different elements of a sequence. For images, the input is divided into fixed-size patches. Self-attention computes attention scores between all pairs of patches and then combines them to generate context-aware representations.

### Patch Embedding

The input image is divided into non-overlapping patches, which are linearly embedded to form a sequence of 1D vectors. These patch embeddings retain spatial information and are passed through the transformer encoder.

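As a concrete illustration, here is a minimal sketch of patch embedding, assuming PyTorch and illustrative sizes (224×224 RGB images, 16×16 patches, a 768-dimensional embedding) that this document does not prescribe. A strided convolution is used because it is equivalent to flattening each patch and applying one shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # kernel_size == stride == patch_size: each patch is projected independently.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (batch, 3, 224, 224)
        x = self.proj(x)                      # (batch, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)   # (batch, 196, embed_dim)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```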
### Positional Encodings

Since transformers lack inherent positional information, positional encodings are added to the patch embeddings to convey the order of patches. These encodings provide crucial spatial context to the self-attention mechanism.

### Transformer Encoder

The transformer encoder processes the patch embeddings using multiple layers of self-attention and feedforward neural networks. Self-attention captures global and local relationships, while feedforward layers contribute non-linearity.

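As a rough sketch of such a stack, the snippet below leans on PyTorch's built-in `nn.TransformerEncoder`; the depth, head count, and feedforward width are illustrative assumptions, and ViT implementations differ in details such as normalization placement.

```python
import torch
import torch.nn as nn

embed_dim, depth, num_heads = 768, 12, 12   # illustrative values, not prescribed here

# Each layer combines multi-head self-attention with a feedforward block,
# wrapped in residual connections and layer normalization.
layer = nn.TransformerEncoderLayer(
    d_model=embed_dim,
    nhead=num_heads,
    dim_feedforward=4 * embed_dim,
    activation="gelu",
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=depth)

tokens = torch.randn(1, 196, embed_dim)     # sequence of 196 patch embeddings
encoded = encoder(tokens)                   # same shape: (1, 196, 768)
```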
### Classification Head

After encoding, the resulting sequence is fed into a classification head. This head typically consists of a few fully connected layers, culminating in softmax activations to predict class probabilities.

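A minimal sketch of such a head, assuming PyTorch, 768-dimensional tokens, 1000 classes, and mean pooling over the encoded sequence; the original ViT instead prepends a learnable class token and classifies from its output, a detail not covered above.

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 1000          # illustrative sizes

head = nn.Sequential(
    nn.LayerNorm(embed_dim),
    nn.Linear(embed_dim, num_classes),      # fully connected layer(s)
)

encoded = torch.randn(1, 196, embed_dim)    # output of the transformer encoder
pooled = encoded.mean(dim=1)                # one vector summarizing the sequence
logits = head(pooled)                       # (1, 1000)
probs = torch.softmax(logits, dim=-1)       # softmax gives class probabilities
```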
## Mathematical Formulation

### Self-Attention

For a query `Q`, key `K`, and value `V`, the scaled dot-product attention is computed as:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) * V
```

where `d_k` is the dimension of the query/key vectors.

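The formula translates almost line for line into code. Here is a minimal sketch, assuming PyTorch; the token count and `d_k` used in the example are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # pairwise query-key similarities
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # weighted combination of values

Q = K = V = torch.randn(196, 64)                # e.g. 196 patch tokens with d_k = 64
out = scaled_dot_product_attention(Q, K, V)     # (196, 64)
```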
### Positional Encodings

Positional encodings are added to the patch embeddings as sinusoidal functions:

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

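These two equations can be implemented directly; the sketch below assumes PyTorch, and the sequence length and model dimension are illustrative.

```python
import torch

def sinusoidal_positional_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (num_positions, 1)
    even_idx = torch.arange(0, d_model, 2, dtype=torch.float32)               # the 2i in the formula
    div_term = torch.pow(torch.tensor(10000.0), even_idx / d_model)           # 10000^(2i/d_model)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(num_positions=196, d_model=768)  # added to the patch embeddings
```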
### Multi-Head Self-Attention

In multi-head self-attention, the output is obtained by concatenating the outputs of multiple scaled dot-product attentions, followed by linear transformations.

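The sketch below spells this out by hand in PyTorch (an equivalent building block also ships as `torch.nn.MultiheadAttention`); the embedding size and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Run several scaled dot-product attentions in parallel, concatenate, then project."""

    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)    # joint projection to Q, K, V
        self.out_proj = nn.Linear(embed_dim, embed_dim)   # final linear transformation

    def forward(self, x):                                 # x: (batch, tokens, embed_dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, heads, N, head_dim)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = scores.softmax(dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(B, N, D)   # concatenate the heads
        return self.out_proj(out)

y = MultiHeadSelfAttention()(torch.randn(1, 196, 768))   # (1, 196, 768)
```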
### Feedforward Neural Network

A feedforward neural network consists of two linear transformations with a non-linear activation function in between:

```
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
```

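A minimal PyTorch sketch of this block follows; the 4× hidden width is a common convention rather than something stated above, and `max(0, ·)` corresponds to a ReLU (ViT implementations often substitute GELU).

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every token independently."""

    def __init__(self, embed_dim=768, hidden_dim=3072):   # hidden_dim = 4 * embed_dim by convention
        super().__init__()
        self.fc1 = nn.Linear(embed_dim, hidden_dim)        # x W1 + b1
        self.act = nn.ReLU()                               # max(0, ·)
        self.fc2 = nn.Linear(hidden_dim, embed_dim)        # (·) W2 + b2

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

y = FeedForward()(torch.randn(1, 196, 768))   # (1, 196, 768)
```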
## Getting Started

To understand ViT's concepts and mathematics, follow these steps:

1. Clone this repository:

## Contributing

Contributions to this project are welcome! If you have explanations, derivations, or code examples related to ViT's concepts and math, feel free to open an issue or submit a pull request. Please follow the contribution guidelines outlined in the repository.

## License

This project is licensed under the MIT License.

This repository aims to provide a comprehensive understanding of the concepts and mathematics behind the Vision Transformer (ViT) architecture. By exploring the provided resources, you can deepen your insights into ViT's mechanisms and applications.

For further exploration and technical details, refer to the original ViT paper and associated research.

Author: Kai Kleinbard
Date: August 18, 2023