- Contains implementations of prominent ViT architectures broken down into modular components like encoder, attention mechanism, and decoder.
- Makes it easy to develop custom models by composing components of different architectures.
git clone https://github.com/SforAiDl/vformer.git
cd vformer/
python setup.py install
- Vanilla ViT
- Swin Transformer
- Pyramid Vision Transformer
- CrossViT
- Compact Vision Transformer
- Compact Convolutional Transformer
- Visformer
- Vision Transformers for Dense Prediction
- CvT
- ConViT
- ViViT
To instantiate and use a Swin Transformer model:
import torch
from vformer.models.classification import SwinTransformer

image = torch.randn(1, 3, 224, 224)  # Example data: one 3-channel 224x224 image
model = SwinTransformer(
    img_size=224,
    patch_size=4,
    in_channels=3,
    n_classes=10,
    embed_dim=96,
    depths=[2, 2, 6, 2],
    num_heads=[3, 6, 12, 24],
    window_size=7,
    drop_rate=0.2,
)
logits = model(image)  # Forward pass through the classifier
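As a rough guide to how the arguments above relate to one another, the sketch below works through the per-stage shapes in plain Python. It assumes VFormer follows the standard Swin layout, where patch merging halves the feature-map side and doubles the channel width between stages. `swin_stage_shapes` is a hypothetical helper written for illustration and is not part of VFormer.

```python
def swin_stage_shapes(img_size, patch_size, embed_dim, depths, num_heads, window_size):
    """Return one dict of shape information per Swin stage.

    Assumption: standard Swin layout (side halves, channels double per stage).
    This is an illustrative helper, not a VFormer function.
    """
    side = img_size // patch_size  # side length of the patch grid at stage 1
    dim = embed_dim                # channel width at stage 1
    stages = []
    for depth, heads in zip(depths, num_heads):
        if side % window_size != 0:
            raise ValueError(f"grid side {side} is not divisible by window_size {window_size}")
        if dim % heads != 0:
            raise ValueError(f"dim {dim} is not divisible by num_heads {heads}")
        stages.append(
            {
                "resolution": side,
                "dim": dim,
                "depth": depth,
                "heads": heads,
                "head_dim": dim // heads,
                "windows": (side // window_size) ** 2,
            }
        )
        side //= 2  # patch merging halves the grid side
        dim *= 2    # and doubles the channel width
    return stages


stages = swin_stage_shapes(
    img_size=224,
    patch_size=4,
    embed_dim=96,
    depths=[2, 2, 6, 2],
    num_heads=[3, 6, 12, 24],
    window_size=7,
)
for stage in stages:
    print(stage)
```

Under these assumptions the configuration is internally consistent: every stage's grid side is a multiple of `window_size`, and every stage's width divides evenly among its heads.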
VFormer has a modular design and allows for easy experimentation using blocks/modules of different architectures. For example, if desired, you can use just the encoder or the windowed attention layer of the Swin Transformer model.
from vformer.attention import WindowAttention

window_attn = WindowAttention(
    dim=128,
    window_size=7,
    num_heads=2,
    # any further keyword arguments go here
)

from vformer.encoder import SwinEncoder

swin_encoder = SwinEncoder(
    dim=128,
    input_resolution=(224, 224),
    depth=2,
    num_heads=2,
    window_size=7,
    # any further keyword arguments go here
)
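To give a feel for what `window_size` buys, the plain-Python sketch below counts the query-key pairs scored by one attention layer, with and without windowing. It treats `input_resolution=(224, 224)` as the token grid, which is an assumption about how the encoder interprets that argument. `attention_pairs` is a hypothetical helper and not part of VFormer.

```python
def attention_pairs(height, width, window_size=None):
    """Count query-key pairs scored by one attention layer, per head.

    With window_size=None every token attends to every token (global attention).
    Otherwise tokens attend only within non-overlapping square windows.
    Illustrative helper, not a VFormer function.
    """
    tokens = height * width
    if window_size is None:
        return tokens * tokens
    if height % window_size != 0 or width % window_size != 0:
        raise ValueError("height and width must be multiples of window_size")
    num_windows = (height // window_size) * (width // window_size)
    tokens_per_window = window_size * window_size
    return num_windows * tokens_per_window * tokens_per_window


global_pairs = attention_pairs(224, 224)
windowed_pairs = attention_pairs(224, 224, window_size=7)
print(global_pairs, windowed_pairs, global_pairs // windowed_pairs)
```

The saving factor equals the number of windows, so it grows with the resolution while the per-window cost stays fixed.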