Mohammadreza Salehi, Shashanka Venkataramanan, Ioana Simion, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano
- Introduction
- GPU Requirements
- Environment Setup
- Loading pretrained models
- Training with MoSiC
- Evaluation
- Dataset Preparation
- Visualizations
- Citation
- License
MoSiC is a motion-guided self-supervised learning framework that learns dense visual representations from unlabeled videos. Motivated by the idea that "things that move together belong together", MoSiC tracks points across frames and clusters them using optimal transport to ensure features remain spatiotemporally consistent—even through occlusions and motion. By propagating cluster assignments along motion trajectories, it enforces object permanence and temporal coherence without requiring labels. Applied on top of strong image-pretrained models like DINOv2, MoSiC yields 1–6% gains across four dense prediction benchmarks, setting a new state-of-the-art on both video and image tasks.
Key features of MoSiC include:
- Enhancing dense features of pretrained vision models through video-based fine-tuning
- Enforcing temporal semantic consistency via self-supervised clustering of motion trajectories in unlabeled videos
- Improving representations across diverse backbones, including EVA-CLIP, DINO, and DINOv2(-R)
- Requiring only 1.6 GPU-hours on YTVOS using 8×A6000 GPUs
- Achieving state-of-the-art improvements (1–6%) across multiple image and video benchmarks
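To make the mechanism concrete, below is a minimal, illustrative sketch of the core idea (not the repository's implementation): patch features gathered along point trajectories are softly assigned to prototypes with Sinkhorn-Knopp optimal transport, and each trajectory's assignments are averaged over time and used as the target for every one of its frames. All function names, shapes, and loss details here are assumptions.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    """Balanced soft assignment of N features to K prototypes (SwAV-style Sinkhorn-Knopp)."""
    Q = torch.exp(scores / eps).T          # (K, N)
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True)    # normalize prototype marginals
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)    # normalize sample marginals
        Q /= N
    return (Q * N).T                       # (N, K), each row sums to ~1


def trajectory_consistency_loss(features, prototypes, track_ids, temperature=0.1):
    """features: (N, D) L2-normalized patch features sampled along tracked points,
    prototypes: (K, D) L2-normalized cluster centers,
    track_ids: (N,) long tensor, same id for every time step of one trajectory."""
    scores = features @ prototypes.T                 # (N, K) cosine similarities
    q = sinkhorn(scores)                             # balanced soft cluster assignments
    # Propagate along motion: average each trajectory's assignments over time and
    # use the result as the target for every frame of that trajectory.
    num_tracks = int(track_ids.max()) + 1
    track_q = q.new_zeros(num_tracks, q.shape[1]).index_add_(0, track_ids, q)
    track_q = track_q / track_q.sum(dim=1, keepdim=True).clamp_min(1e-8)
    targets = track_q[track_ids]                     # (N, K) temporally shared targets
    log_p = F.log_softmax(scores / temperature, dim=1)
    return -(targets * log_p).sum(dim=1).mean()      # cross-entropy to the shared targets
```

In the actual training setup, the number of prototypes and the teacher branch correspond to the `--num_prototypes` and `--use_EMA_teacher` flags described in the training section below.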
MoSiC is optimized for efficient training. While our experiments use 8×A6000 GPUs to enable larger batch sizes and better performance, training on a single A6000 is also possible with smaller batches. Larger batch sizes were found to consistently improve performance.
We recommend using conda to install the dependencies for MoSiC. If you haven't installed conda yet, you can find the instructions here.
The setup steps are:
1 – Create a new environment from the provided YAML file:

```bash
conda env create -f mosic_environment.yml
```

2 – Activate the environment:

```bash
conda activate MoSiC
```
We provide MoSiC checkpoints for multiple backbones: DINO, DINOv2, EVA-CLIP, and DINOv2R. You can download the relevant checkpoint here.
To use MoSiC embeddings for downstream dense prediction tasks, install `timm` and `torch`, then run the following (example shown for ViT-S/16):
```python
import torch
from timm.models.vision_transformer import vit_small_patch16_224

path_to_checkpoint = "<your path to downloaded MoSiC ckpt>"
model = vit_small_patch16_224()
state_dict = torch.load(path_to_checkpoint, map_location="cpu")
# Adjust key names if needed
model.load_state_dict(state_dict, strict=False)

# Extract semantically rich patch embeddings (16x16 patches)
batch = torch.randn(1, 3, 224, 224)  # replace with your own preprocessed image batch
features = model.forward_features(batch)
```
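For dense prediction heads, the returned tokens are typically reshaped into a 2D feature map. A small follow-up sketch, assuming a 224×224 input and a timm version whose `forward_features` returns the full token sequence with the `[CLS]` token at index 0:

```python
# Drop the [CLS] token and turn patch tokens into a (B, C, H, W) feature map.
B, N, C = features.shape             # e.g., (B, 1 + 14*14, 384) for ViT-S/16 at 224x224
patch_tokens = features[:, 1:, :]    # remove the [CLS] token
h = w = int((N - 1) ** 0.5)          # 14x14 grid for 16x16 patches
feature_map = patch_tokens.transpose(1, 2).reshape(B, C, h, w)
```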
To train MoSiC, simply run:

```bash
./train.sh
```
The training script uses the following key parameters:
- `--batch_size 64`: Number of samples processed per GPU
- `--frame_sampling_mode regular`: Uses regular interval sampling for video frames
- `--regular_step 6`: Samples every 6th frame from the video (illustrated in the sketch after this list)
- `--num_clip_frames 12`: Number of frames to process in each video clip
- `--num_clips 1`: Number of clips to sample from each video
- `--num_epochs 8`: Total number of training epochs
- `--num_prototypes 100`: Number of clusters for the optimal transport clustering
- `--feature_upsampling nearest`: Uses nearest neighbor upsampling for features
- `--num_workers 8`: Number of data loading workers per GPU
- `--model_type dinov2-s`: Uses DINOv2-small as the backbone model
- `--dataset ytvos`: Trains on the YouTube-VOS dataset
- `--mask_ratio 0`: No masking applied to the input
- `--grid_size 16`: Size of the feature grid (16x16 patches)
- `--crop_scale 0.4`: Random crop scale for data augmentation
- `--wandb_mode online`: Enables online logging to Weights & Biases
- `--use_EMA_teacher True`: Uses an Exponential Moving Average for the teacher model
- `--teacher_feature_upsampling nearest`: Uses nearest neighbor upsampling for teacher features
- `--save_dir`: Directory to save the trained model checkpoints
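As an illustration of how `--frame_sampling_mode regular`, `--regular_step`, and `--num_clip_frames` interact, here is an assumed sketch of the sampling behaviour (not the repository's data loader):

```python
def sample_regular_clip(num_video_frames: int, regular_step: int = 6, num_clip_frames: int = 12):
    """Take every `regular_step`-th frame index until `num_clip_frames` indices are collected."""
    return list(range(0, num_video_frames, regular_step))[:num_clip_frames]

# A 100-frame video yields frame indices [0, 6, 12, ..., 66].
print(sample_regular_clip(100))
```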
The script is configured to use 8 GPUs by default. For single-GPU training, modify the `CUDA_VISIBLE_DEVICES` and `--nproc_per_node` parameters accordingly.
For standard evaluation tasks including linear evaluation, overclustering, and unsupervised object segmentation, you can use our provided checkpoint with the evaluation scripts from the NeCo repository. Simply download our checkpoint, load it into their evaluation framework, and run their standard evaluation protocols to reproduce our reported results.
For visual in-context learning evaluation, we use a modified version of open-hummingbird-eval that we adapted for multi-GPU evaluation. The `hummingbird_eval.sh` script supports the following key parameters:
- `--model`: Model type (e.g., mosic_dinov2-l)
- `--input-size`: Input image size (default: 518)
- `--batch-size`: Batch size per GPU (default: 24)
- `--embeddings-size`: Size of embeddings (default: 1024)
- `--patch-size`: Size of image patches (default: 14)
- `--memory-size`: Memory size for processing (default: 10240000)
- `--num-workers`: Number of data loading workers (default: 2)
- `--dataset`: Dataset name (e.g., ade20k)
- `--data-dir`: Path to dataset directory
- `--train-split`: Training split file name
The script is configured to use 3 GPUs by default (`--nproc_per_node=3`). You can modify the number of GPUs and other parameters as needed for your setup. For evaluating on dataset fractions, the train splits can be downloaded from this Google Drive folder.
Example usage:

```bash
./hummingbird_eval.sh
```
The evaluation results will be saved in the `hb/` directory with the naming format `hummingbird_MoSiC_dinov2-l_ade20k{split_name}.log`.
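Conceptually, this in-context evaluation fills a memory bank of patch features from the training split (its size is bounded by `--memory-size`) and labels each test patch by retrieving its nearest neighbours from that bank. Below is a rough sketch of such a retrieval step; all names, shapes, and hyperparameters are illustrative assumptions rather than the open-hummingbird-eval implementation.

```python
import torch
import torch.nn.functional as F


def knn_label_transfer(test_feats, bank_feats, bank_labels, num_classes, k=30, temperature=0.02):
    """test_feats: (M, D), bank_feats: (N, D), bank_labels: (N,) patch-level class ids.
    Returns (M,) predicted class ids via temperature-weighted k-NN voting."""
    test_feats = F.normalize(test_feats, dim=1)
    bank_feats = F.normalize(bank_feats, dim=1)
    sims = test_feats @ bank_feats.T                    # (M, N) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)           # k nearest bank patches per query
    weights = (topk_sims / temperature).softmax(dim=1)  # (M, k) soft votes
    votes = torch.zeros(test_feats.shape[0], num_classes)
    votes.scatter_add_(1, bank_labels[topk_idx], weights)
    return votes.argmax(dim=1)
```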
Our model uses the same dataset structure as described in the Timetuning dataset documentation. Please follow the guidelines there to properly format your datasets for use with our model.
The figure shows MoSiC's in-context scene understanding capabilities on Pascal VOC. By fine-tuning DINOv2's dense representations on unlabeled videos, MoSiC produces precise segmentation boundaries and accurate object identification.
If you find this repository useful, please consider giving a star ⭐ and citation:
```bibtex
@inproceedings{salehi2025mosic,
  title={MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning},
  author={Salehi, Mohammadreza and Venkataramanan, Shashanka and Simion, Ioana and Gavves, Efstratios and Snoek, Cees GM and Asano, Yuki M},
  booktitle={International Conference on Computer Vision},
  year={2025}
}
```
This project is licensed under the Apache License 2.0.