A from-scratch implementation of the Vision Transformer (ViT) with a local attention mechanism.

- Patch embedding with learnable tokens
- Multi-head global and local self-attention
- Mix of global and local self-attention within the same model
- Optional CLS token added to the local attention window of each patch
- Transformer encoder with a configurable number of layers
- Customizable for various image sizes
- Designed for GPU acceleration
Important: Using the local attention mechanism makes the attention computation much faster and more efficient, but it does not reduce the model's complexity (number of parameters).
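To make this concrete, below is a minimal single-head sketch of local self-attention over non-overlapping windows, with an optional CLS token copied into every window so each patch keeps a line to global context. The function name `windowed_self_attention` and the exact windowing scheme are illustrative assumptions, not the code in `ViT.py` (which uses multi-head attention with learned projections). Notice that restricting attention to windows changes which tokens interact but introduces no new weights, which is why the parameter count stays the same.

```python
import torch
import torch.nn.functional as F

def windowed_self_attention(x, window_size, cls_token=None):
    """Toy single-head local self-attention over non-overlapping windows.

    x:         (B, H, W, C) grid of patch embeddings.
    cls_token: optional (B, 1, C) token prepended to every window so each
               window can also read shared global context.
    """
    B, H, W, C = x.shape
    ws = window_size
    assert H % ws == 0 and W % ws == 0, "grid must divide evenly into windows"

    # Partition the grid into (ws x ws) windows -> (B * num_windows, ws*ws, C)
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

    if cls_token is not None:
        n_win = windows.shape[0] // B
        cls = cls_token.repeat_interleave(n_win, dim=0)  # one copy per window
        windows = torch.cat([cls, windows], dim=1)       # (.., 1 + ws*ws, C)

    # Plain scaled dot-product attention inside each window; no projection
    # matrices here, so the windowing itself adds zero parameters.
    attn = F.softmax(windows @ windows.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ windows

    if cls_token is not None:
        out = out[:, 1:, :]  # drop the CLS copies, keep only patch outputs

    # Reverse the window partition back to (B, H, W, C)
    out = out.reshape(B, H // ws, W // ws, ws, ws, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Example: an 8x8 grid of 32-dim patch tokens, 4x4 windows, plus a CLS token
out = windowed_self_attention(torch.randn(2, 8, 8, 32), 4, torch.randn(2, 1, 32))
print(out.shape)  # torch.Size([2, 8, 8, 32])
```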
```
ViT
├── ViT.py         # Main implementation of ViT with Local Attention
├── README.md      # You are here!
├── examples/      # Example scripts (Coming Soon)
└── experiments/   # Training and testing experiments (Coming Soon)
```
```python
from ViT import VisionTransformer

model = VisionTransformer(
    img_size=32, patch_size=4, in_channels=3, num_classes=10,
    embed_dim=128, num_heads=4, num_layers=6, mlp_dim=256,
    dropout=0.1, window_size=[3, 3, 5, 5, None, None], add_cls_token=True
)
print(model)
```
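As a quick smoke test, and assuming the model follows the usual ViT forward signature (a batch of images in, class logits out) with one `window_size` entry per encoder layer (`None` presumably meaning global attention in that layer):

```python
import torch

x = torch.randn(2, 3, 32, 32)  # two 32x32 RGB images, matching img_size/in_channels
logits = model(x)
print(logits.shape)            # expected: torch.Size([2, 10]) for num_classes=10
```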
Important: This repository is currently unlicensed. If you use any part of this code, please give credit and notify me via GitHub Issues or email.

Friendly note: If you're just experimenting, learning, or using this code for research, there's no need for formalities when contacting me! Feel free to explore and have fun with it. A formal open-source license will be added later.
- Optimize the `cls_token_in_every_window` case
- Add training and fine-tuning scripts
- Implement efficient local attention techniques
- Implement a variable-sized CLS token (n patches)
- Release benchmark results
Suggestions and improvements are welcome! Feel free to open an issue or contribute via pull requests.
Hi! I'm a college student teaching myself deep learning and AI. I truly appreciate any kind of feedback, reviews, or suggestions, as they help me grow!
You can learn more about me on my GitHub: GitHub Profile