swiss-ai/benchmark-image-tokenzier

Benchmark: Discrete Image Tokenizers

| Model | Approach | Token Type | Training Resolution | Inference Resolution | # Tokens per Image | Codebook Size | Training Data Augmented | Image Understanding | Image Generation | Pretraining Data |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-MagVit2 | VQ-VAE + MLM | Spatial (2D grid) | 256×256 | Flexible (e.g., 256×256) | 16×16 compression | 262,144 | Unknown | | | ImageNet 2012 |
| Emu3-VisionTokenizer | VQ-GAN (MoVQGAN) | Spatial (2D grid) | ≥ 512×512 | Flexible (e.g., 512×512) | 8×8 compression | 32,768 | Unknown | | | laion-high-resolution |
| Cosmos | VQ-AE (discrete) | Spatial (2D grid) | Flexible (256px to 4K) | Original | 16×16 or 8×8 compression | 64,000 | Unknown | | | Video: driving (11%), hand motion and object manipulation (16%), human motion and activity (10%), spatial awareness and navigation (16%), first-person point of view (8%), nature dynamics (20%), dynamic camera movements (8%), synthetically rendered (4%), others (7%) |
| FlowMo Hi | Diffusion autoencoder (Transformer-based) | Sequential (1D latent) | 256×256 | 256×256 | 1,024 | 16,384 | Unknown | | | ImageNet 2012 |
| TiTok | 1D VQ-VAE (Transformer-based) | Sequential (1D latent) | 256×256, 512×512 | 256×256, 512×512 | 256 | 4,096 | Unknown | | | ImageNet |
| Selftok | Diffusion-based AR prior | Sequential (autoregressive prior) | 256×256 | 256×256 | 512 / 1,024 / 1,536 | 32,768 | Unknown | | | DataComp: 25.45%, LAION-2B En: 25.36%, LAION-2B Multi: 24.26%, COYO-700M: 12.96%, in-house T2I: 7.98%, in-house text: 4.00% |
| UniTok | VQ-VAE | Sequential (1D latent) | 256×256 | Flexible | Flexible (8 × 256 for 256×256) | 8 × 16,000 | Unknown | | | DataComp-1B |
| DetailFlow | Autoregressive | Sequential (AR coarse-to-fine, next-detail prediction) | 256×256 | 256×256 | 128 / 256 / 512 | 8,192 | Unknown | | | ImageNet-1K |
| TokenFlow | VQ-VAE (Transformer-based) | Spatial (2D latent, next-scale prediction) | 256×256 / 384×384 | 256×256 / 384×384 | 16×16 / 27×27 | 32,768 | Unknown | | | LAION and COYO-700M (no OCR data) |
| VILA-U | RQ-VAE | Spatial (2D latent) | 256×256 | 256×256 | 16×16×4 | 16,384 | Unknown | | | COYO-700M |
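
Some rows list a compression factor (e.g., "16×16 compression") rather than a token count. As a rough sketch, assuming that entry means each image side is downsampled by the stated factor, the token count for a given resolution can be derived as below. The helper names `spatial_token_count` and `bits_per_image` are hypothetical and not part of this repository; the bit figure is only an upper bound that treats every codebook entry as equally likely.

```python
import math


def spatial_token_count(height: int, width: int, factor: int) -> int:
    """Number of tokens in a 2D grid when each side is downsampled by `factor`.

    Assumption: "16x16 compression" means a per-side factor of 16.
    """
    if height % factor or width % factor:
        raise ValueError("resolution must be divisible by the compression factor")
    return (height // factor) * (width // factor)


def bits_per_image(num_tokens: int, codebook_size: int) -> float:
    """Upper bound on bits per image: each token carries at most log2(codebook_size) bits."""
    return num_tokens * math.log2(codebook_size)


# Inputs taken from the table rows.
magvit2_tokens = spatial_token_count(256, 256, 16)  # Open-MagVit2 at 256x256
emu3_tokens = spatial_token_count(512, 512, 8)      # Emu3-VisionTokenizer at 512x512

print(magvit2_tokens, bits_per_image(magvit2_tokens, 262_144))
print(emu3_tokens, bits_per_image(emu3_tokens, 32_768))
print(256, bits_per_image(256, 4_096))  # TiTok: 256 tokens, codebook of 4,096
```

Comparing the bit budgets alongside the raw token counts can make rows with very different codebook sizes easier to compare, though it says nothing about reconstruction quality.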
