# Vision Transformer (ViT): Concepts and Math

Welcome to the Vision Transformer (ViT) project! In this repository, we delve into the concepts and mathematical foundations of the Vision Transformer architecture, a groundbreaking model for image classification tasks.

## Overview

The Vision Transformer (ViT) is a powerful deep learning architecture that applies the transformer model, originally designed for natural language processing tasks, to computer vision tasks such as image classification. ViT revolutionizes image processing by replacing traditional convolutional layers with self-attention mechanisms.

## How ViT Works: Concepts and Math

### Self-Attention Mechanism

At the heart of ViT lies the self-attention mechanism, which captures relationships between different elements of a sequence. For images, the input is divided into fixed-size patches. Self-attention computes attention scores between all pairs of patches and then combines them to generate context-aware representations.

### Patch Embedding

The input image is divided into non-overlapping patches. Each patch is flattened and linearly projected into a fixed-dimensional embedding, yielding a sequence of 1D vectors that is passed to the transformer encoder (spatial order is supplied by the positional encodings described next).
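As a concrete illustration, here is a minimal PyTorch sketch of patch embedding. The sizes (224×224 images, 16×16 patches, 768-dimensional embeddings) are the common ViT-Base defaults, assumed here for illustration rather than taken from this repository:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to slicing
        # non-overlapping patches and applying one shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, 768)
```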

### Positional Encodings

Since transformers lack inherent positional information, positional encodings are added to the patch embeddings to convey the order of patches. These encodings provide crucial spatial context to the self-attention mechanism.

### Transformer Encoder

The transformer encoder processes the patch embeddings using multiple layers of self-attention and feedforward neural networks. Self-attention captures global and local relationships, while feedforward layers contribute non-linearity.
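To make the layer structure concrete, here is a sketch of one pre-norm encoder block, with `torch.nn.MultiheadAttention` standing in for the attention described above; the dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: self-attention + MLP, each wrapped in a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),                          # ViT uses GELU in its MLP blocks
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                       # x: (B, num_patches, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual over attention
        return x + self.mlp(self.norm2(x))                 # residual over MLP
```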

### Classification Head

After encoding, a summary of the sequence (in ViT, the prepended [CLS] token) is passed to a classification head. This head typically consists of one or more fully connected layers, culminating in a softmax that yields class probabilities.
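A minimal sketch of such a head, assuming (as in the ViT paper) that a learnable [CLS] token was prepended to the patch sequence before encoding; in training code the softmax is usually folded into the cross-entropy loss:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Map the encoded [CLS] token to class probabilities."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):          # tokens: (B, 1 + num_patches, dim)
        cls = self.norm(tokens[:, 0])   # take the prepended [CLS] token
        logits = self.fc(cls)
        return logits.softmax(dim=-1)   # class probabilities
```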

## Mathematical Formulation

### Self-Attention

For a query `Q`, key `K`, and value `V`, the scaled dot-product attention is computed as:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) * V
```

Where `d_k` is the dimension of the query/key vectors.
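The formula translates almost line for line into code. A plain-PyTorch sketch:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., n_q, n_k)
    weights = scores.softmax(dim=-1)                   # each row sums to 1
    return weights @ V                                 # (..., n_q, d_v)
```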

### Positional Encodings

Positional encodings are added to the patch embeddings. A standard choice, inherited from the original Transformer, is the fixed sinusoidal form below (ViT itself typically learns its position embeddings instead):

```
PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
```
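Both formulas can be vectorized in a few lines; this sketch assumes an even `d_model`:

```python
import torch

def sinusoidal_encoding(num_positions, d_model):
    """Build the (num_positions, d_model) sinusoidal positional-encoding table."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (P, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                 # even dims
    angles = pos / (10000 ** (i / d_model))                              # (P, d_model/2)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angles)   # PE(pos, 2i+1)
    return pe
```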

### Multi-Head Self-Attention

In multi-head self-attention, the input is projected into several independent query/key/value sets ("heads"), scaled dot-product attention runs in each head in parallel, and the head outputs are concatenated and passed through a final linear transformation.
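A self-contained sketch, with ViT-Base-style dimensions (12 heads of width 64) assumed for illustration:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Run several scaled dot-product attentions in parallel, then recombine."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # fused Q, K, V projections
        self.out = nn.Linear(dim, dim)       # final linear after concatenation

    def forward(self, x):                    # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, num_heads, N, head_dim) so each head attends independently
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        ctx = scores.softmax(dim=-1) @ v             # (B, num_heads, N, head_dim)
        ctx = ctx.transpose(1, 2).reshape(B, N, -1)  # concatenate the heads
        return self.out(ctx)
```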

### Feedforward Neural Network

A feedforward neural network consists of two linear transformations with a non-linear activation function in between:

```
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
```
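A direct transcription of this formula in PyTorch; the 4× hidden width is the conventional choice, and ViT in practice swaps the ReLU (`max(0, ·)`) for GELU:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2."""
    def __init__(self, dim=768, hidden_dim=3072):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)  # x W1 + b1
        self.fc2 = nn.Linear(hidden_dim, dim)  # (...) W2 + b2
        self.act = nn.ReLU()                   # max(0, ·); ViT itself uses GELU

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```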

## Getting Started

To understand ViT's concepts and mathematics, follow these steps:

1. Clone this repository.

## Contributing

Contributions to this project are welcome! If you have explanations, derivations, or code examples related to ViT's concepts and math, feel free to open an issue or submit a pull request. Please follow the contribution guidelines outlined in the repository.

## License

This project is licensed under the MIT License.

This repository aims to provide a comprehensive understanding of the concepts and mathematics behind the Vision Transformer (ViT) architecture. By exploring the provided resources, you can deepen your insights into ViT's mechanisms and applications.

For further exploration and technical details, refer to the original ViT paper, Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ICLR 2021), and the associated research.

Author: Kai Kleinbard
Date: August 18, 2023
