
Vision Transformer with Local Attention πŸš€

πŸ” A self-implementation of Vision Transformer (ViT) with Local Attention Mechanism.

🌟 Features

βœ… Patch Embedding with Learnable Tokens
βœ… Multi-Head Global and Local Self-Attention
βœ… Mix Global and Local Self-Attention Mechanism within the same model
βœ… Add CLS Token to the local attention window of each patch (see the sketch after this list)
βœ… Transformer Encoder with Configurable Layers
βœ… Customizable Vision Transformer for Various Image Sizes
βœ… Designed for GPU Acceleration πŸš€
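
To make the local-window idea concrete, here is a minimal, mask-based sketch of the attention pattern (my own illustration, not the code in ViT.py): each patch on the grid attends only to patches within a centered window, while the CLS token stays visible to, and from, every window. A mask like this only shows which pairs interact; an efficient implementation would restrict the computation itself rather than masking a full attention matrix.

import torch
import torch.nn as nn

def local_attn_mask(side, window_size, add_cls_token=True):
    # Boolean mask (True = blocked) for a side x side grid of patches:
    # each patch may attend only within a centered window_size x window_size
    # neighborhood, while the CLS token attends to, and is seen by, everything.
    coords = torch.stack(torch.meshgrid(
        torch.arange(side), torch.arange(side), indexing="ij"), dim=-1).view(-1, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(dim=-1)
    blocked = dist > window_size // 2            # (N, N) Chebyshev distance test
    if add_cls_token:
        n = blocked.shape[0]
        full = torch.zeros(n + 1, n + 1, dtype=torch.bool)
        full[1:, 1:] = blocked                   # row/column 0 (CLS) stays open
        blocked = full
    return blocked

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
tokens = torch.randn(2, 1 + 8 * 8, 128)          # CLS token + an 8x8 patch grid
mask = local_attn_mask(side=8, window_size=3)    # (65, 65), broadcast over heads
out, _ = attn(tokens, tokens, tokens, attn_mask=mask)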

πŸ“ Note

❗ Important: Using the Local Attention Mechanism in a Vision Transformer makes the computations much faster and more efficient, but it does not reduce the complexity (number of parameters) of the model.
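
A rough way to see this: with embedding dimension D, the attention projections keep the same ~4 Γ— DΒ² weights no matter how attention is restricted; only the number of query-key interactions shrinks. A toy calculation, with numbers chosen to match the Quick Start below:

D, N, w = 128, 64, 3        # embed dim, 8x8 = 64 patches, 3x3 local window
params = 4 * D * D          # Wq, Wk, Wv, Wo -- identical for global and local
global_pairs = N * N        # 4096 query-key scores per head
local_pairs = N * w * w     # 576 scores, ~7x fewer, with the same parameters
print(params, global_pairs, local_pairs)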

πŸ“‚ Repository Structure

πŸ“ ViT
 β”œβ”€β”€ ViT.py               # Main implementation of ViT with Local Attention
 β”œβ”€β”€ README.md            # You are here! πŸ“œ
 β”œβ”€β”€ examples/            # Example scripts (Coming Soon)  
 β”œβ”€β”€ experiments/         # Training and testing experiments (Coming Soon)  

⚑ Quick Start

from ViT import VisionTransformer

# window_size gives one entry per layer (num_layers=6 here); a None entry
# presumably makes that layer fall back to global attention, matching the
# "mix global and local attention" feature above.
model = VisionTransformer(
    img_size=32, patch_size=4, in_channels=3, num_classes=10,
    embed_dim=128, num_heads=4, num_layers=6, mlp_dim=256,
    dropout=0.1, window_size=[3, 3, 5, 5, None, None], add_cls_token=True
)
print(model)
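
A quick smoke test of the configured model, assuming the usual images-in, logits-out forward signature (check ViT.py for the exact interface):

import torch

images = torch.randn(8, 3, 32, 32)   # a batch of CIFAR-sized images
logits = model(images)               # expected shape: (8, 10) == (batch, num_classes)
print(logits.shape)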

πŸ“œ License & Usage Notice

❗ Important: This repository is currently not licensed.
If you use any part of this code, please give credit and notify me via GitHub Issues or email.
πŸ’¬ Friendly Note: If you're just experimenting, learning, or using this code for research, there's no need for any formalities when contacting me!
Feel free to explore and have fun with it. A formal open-source license will be added later.

πŸš€ Future Work

  • πŸ“Œ Optimize the 'cls_token_in_every_window' case
  • πŸ“Œ Add training and fine-tuning scripts
  • πŸ“Œ Implement efficient local attention techniques
  • πŸ“Œ Implement variable sized cls token (n patches)
  • πŸ“Œ Release benchmark results

πŸ’¬ Contributing

πŸ’‘ Suggestions and improvements are welcome! Feel free to open an issue or contribute via pull requests.

πŸ‘¨β€πŸŽ“ About Me

Hi! I'm a college student teaching myself deep learning and AI. I truly appreciate any kind of feedback, reviews, or suggestions, as they help me grow! 😊

πŸ“Œ You can learn more about me on my GitHub: GitHub Profile
