Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding
Joshua Jones, Oier Mees, Carmelo Sferrazza, Kyle Stachowicz, Pieter Abbeel, Sergey Levine
This repo contains code to Fuse heterogeneous Sensory (FuSE) data, such as touch sensing or audio, into generalist robot policies via language grounding. We release both a dataset of 26,866 robot trajectories collected with heterogeneous sensory modalities and checkpoints for our two main models: Octo, a large diffusion-based transformer model, and a 3B VLA based on PaliGemma. Our code is built on top of the Octo and PaliVLA codebases.
Install PaliVLA:
cd palivla_digit
uv venv
source .venv/bin/activate
uv sync --extra gpu  # use --extra tpu on TPU hosts
uv pip install -e ../octo_digit --no-deps
uv pip install -e ../bridge_with_digit/widowx_envs
uv pip install -e .
Install Octo:
cd octo_digit
uv venv
source .venv/bin/activate
uv sync --extra gpu  # use --extra tpu on TPU hosts
uv pip install -e ../bridge_with_digit/widowx_envs
uv pip install -e .
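After either install, it can be worth confirming that the environment sees the intended accelerator before launching a long run. The following is a minimal sketch, assuming the active virtual environment provides JAX (Octo is JAX-based); the helper name `list_accelerators` is hypothetical and not part of this repo.

```python
def list_accelerators() -> list[str]:
    """Return the devices JAX can see, or an empty list if JAX is not importable."""
    try:
        import jax  # assumption: installed by `uv sync` in the active environment
    except ImportError:
        return []
    # Each device exposes its backend name (cpu, gpu, tpu) and an integer id.
    return [f"{device.platform}:{device.id}" for device in jax.devices()]


if __name__ == "__main__":
    devices = list_accelerators()
    if not devices:
        print("JAX is not importable in this environment.")
    else:
        print("Visible devices:", ", ".join(devices))
```

If only `cpu` entries are listed on a GPU or TPU machine, the wrong `--extra` was probably synced.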
We provide a dataset containing 26,866 trajectories collected on a WidowX robot at the RAIL lab @ UC Berkeley, USA. It contains visual, tactile, audio, and action data collected across several environments, annotated with natural language. You can download the dataset from HuggingFace.
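One way to fetch the dataset programmatically is through the `huggingface_hub` package. This is a sketch under stated assumptions: `DATASET_REPO_ID` is a placeholder that must be replaced with the repository ID from the HuggingFace dataset page, and the helper names and the `data/fuse` target directory are hypothetical.

```python
from pathlib import Path

# Placeholder, not a real repository ID: replace with the ID shown on the dataset page.
DATASET_REPO_ID = "<org>/<dataset-name>"


def download_kwargs(repo_id: str, local_dir: str) -> dict:
    """Build the keyword arguments for huggingface_hub.snapshot_download."""
    if "<" in repo_id or ">" in repo_id:
        raise ValueError("Replace the placeholder with the real dataset repository ID.")
    return {
        "repo_id": repo_id,
        "repo_type": "dataset",  # the default repo_type is "model"
        "local_dir": str(Path(local_dir)),
    }


def download_dataset(repo_id: str, local_dir: str = "data/fuse") -> str:
    """Download every file in the dataset repository and return the local path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(repo_id, local_dir))
```

Calling `download_dataset(DATASET_REPO_ID)` with the placeholder left in place raises a `ValueError` rather than attempting a network request.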
For Octo:
python octo_digit/scripts/finetune_fuse.py --config=scripts/configs/fuse_config.py
For PaliVLA:
python palivla_digit/palivla/train_fuse.py --config=palivla_digit/palivla/configs/fuse_config.py
Install bridge_with_digit on the robot controller, and start the action server.
Download the pretrained models from the HuggingFace model hub.
For Octo:
python octo_digit/eval/fuse_eval.py --checkpoint_weights_path=ckpt.pth
For PaliVLA:
python palivla_digit/eval_palivla.py --checkpoint_dir=ckpt.pth
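When switching between the two models during evaluation, a small launcher can help keep the script paths and flag names straight. The sketch below only reuses the commands listed above; the `EVAL_COMMANDS` table and `build_eval_command` helper are hypothetical and not part of this repo.

```python
import shlex

# Script paths and flag names are copied from the commands above;
# only the checkpoint path is substituted.
EVAL_COMMANDS = {
    "octo": ["python", "octo_digit/eval/fuse_eval.py", "--checkpoint_weights_path={ckpt}"],
    "palivla": ["python", "palivla_digit/eval_palivla.py", "--checkpoint_dir={ckpt}"],
}


def build_eval_command(model: str, ckpt: str) -> list[str]:
    """Return the argv list for evaluating `model` with the checkpoint at `ckpt`."""
    if model not in EVAL_COMMANDS:
        raise ValueError(f"Unknown model {model!r}; expected one of {sorted(EVAL_COMMANDS)}")
    return [part.format(ckpt=ckpt) for part in EVAL_COMMANDS[model]]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_eval_command("octo", "ckpt.pth")))
```

The returned list can be passed directly to `subprocess.run(..., check=True)` from the repository root once the action server is running.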
This project is licensed under the MIT License - see the LICENSE file for details. PaliVLA is licensed under the Apache 2.0 License - see the LICENSE file for details.
@article{jones2025fuse,
title={Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding},
author={Jones, Joshua and Mees, Oier and Sferrazza, Carmelo and Stachowicz, Kyle and Abbeel, Pieter and Levine, Sergey},
journal={arXiv preprint arXiv:2501.04693},
year={2025}
}