This is a framework for training multimodal vision-language-action (VLA) models for robotics in JAX. It currently supports PaliGemma as the base model, with more base models to be added in the future.
We develop with uv, but other environment managers should work fine. To install the dependencies, run:
```bash
uv venv
uv sync
```
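
To confirm the environment is set up and JAX can see your accelerators, a quick sanity check can help (a minimal sketch; the file name is arbitrary, and it assumes you run it inside the synced environment, e.g. via `uv run`):

```python
# check_env.py -- hypothetical helper; verifies that JAX finds your devices.
import jax

print(jax.__version__)
print(jax.devices())  # Should list GPU/TPU devices, or fall back to CPU.
```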
To train a model, run:
```bash
python -m palivla.train --config_file palivla/configs/bridge_config.py
```
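
Since this codebase derives from big_vision, configs are Python files that expose a `get_config()` returning an `ml_collections.ConfigDict`. A minimal sketch of what a custom config could look like (the field names below are illustrative assumptions, not the actual `bridge_config.py` schema):

```python
# my_config.py -- illustrative only; field names are assumptions, not the
# actual bridge_config.py schema.
import ml_collections


def get_config():
    config = ml_collections.ConfigDict()

    # Base model selection (assumed field).
    config.model_name = "paligemma-3b"

    # Optimization hyperparameters (assumed fields).
    config.batch_size = 128
    config.learning_rate = 1e-4
    config.num_train_steps = 100_000

    return config
```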
This repository is (for now) a fork of big_vision.
If you use PaliVLA in your own project, please cite this repository:
```bibtex
@misc{palivla,
    author = {Kyle Stachowicz},
    title = {PaliVLA},
    year = {2024},
    url = {https://github.com/kylestach/bigvision-palivla},
    note = {GitHub repository}
}
```