News
[2024.11.10] LAPA has won the CoRL 2024 LangRob Workshop Best Paper Award (among 75 accepted papers)! 🥳
LAPA
- Unsupervised approach for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
- Outperforms the current state-of-the-art VLA model trained with ground-truth actions, establishing a new SOTA VLA model.
- Achieves over 30x greater pretraining efficiency compared to conventional VLA pretraining.
conda create -n lapa python=3.10 -y
conda activate lapa
git clone https://github.com/LatentActionPretraining/LAPA.git
cd LAPA
pip install -r requirements.txt
mkdir lapa_checkpoints && cd lapa_checkpoints
Next, download the model checkpoint from the Hugging Face repository. Download the following three files into the lapa_checkpoints directory:
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/tokenizer.model
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/vqgan
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/params
To run the LAPA checkpoint pretrained on the Open-X Embodiment dataset, run the following command:
cd ..
python -m latent_pretraining.inference
This will generate the latent action conditioned on the input image and the natural language instruction.
You can change the input image and the instruction to a custom instance. Note that the output space is the latent action space, not the real robot action space.
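As a rough illustration of what "latent action space" means here: the model outputs a short sequence of discrete latent codes rather than continuous robot actions, so the size of that space is the codebook size raised to the latent sequence length. The numbers below are placeholders, not LAPA's actual configuration.

codebook_size = 8        # placeholder: number of discrete codes per latent token (not LAPA's actual value)
sequence_length = 4      # placeholder: number of latent tokens per action (not LAPA's actual value)
num_latent_actions = codebook_size ** sequence_length
print(f"latent action space size: {num_latent_actions}")  # 4096 with these placeholder values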
For fine-tuning LAPA on real-world trajectories, you first have to preprocess the dataset to discretize the action space. We assume a JSON file (passed via --input_path) in which each row has the following format:
{
  "id": "data/finetune_data/episode_0/step_0",
  "image": "data/finetune_data/episode_0/step_0.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWhat action should the robot take to `pick up the milk and put it in the sink`"
    },
    {
      "from": "gpt",
      "raw_actions": [
        0.0004934221506118809,
        -0.00011252239346504211,
        -0.001941084861755371,
        0.013634951062806884,
        0.013678191591275368,
        -0.004913635449167675,
        0.0
      ],
      "states": {
        "eef_pos": [
          0.24725835025310516,
          -0.022094586864113808,
          0.9283081889152527
        ],
        "eef_euler": [
          3.1202197128587876,
          -0.7113159765223936,
          -0.10937155062330725
        ],
        "gripper_state": 0.0
      }
    }
  ]
}
where finetune_data contains the images of the fine-tuning trajectories.
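For reference, a row in this format can be assembled from your own trajectory data roughly as follows. This is a minimal sketch: the episode/step naming and the 7-dimensional action layout follow the example above, while the output filename and the choice to write one JSON list (rather than one object per line) are assumptions about what --input_path expects.

import json

def make_row(episode, step, instruction, raw_action, eef_pos, eef_euler, gripper_state):
    # Mirrors the row format shown above; paths are relative to the repository root.
    base = f"data/finetune_data/episode_{episode}/step_{step}"
    return {
        "id": base,
        "image": base + ".jpg",
        "conversations": [
            {
                "from": "human",
                "value": f"<image>\nWhat action should the robot take to `{instruction}`",
            },
            {
                "from": "gpt",
                "raw_actions": list(raw_action),      # 7-D action as in the example above
                "states": {
                    "eef_pos": list(eef_pos),         # end-effector position (x, y, z)
                    "eef_euler": list(eef_euler),     # end-effector orientation (roll, pitch, yaw)
                    "gripper_state": gripper_state,
                },
            },
        ],
    }

# Values abbreviated from the example above; "my_finetune_input.json" is an illustrative filename.
rows = [make_row(0, 0, "pick up the milk and put it in the sink",
                 [0.0005, -0.0001, -0.0019, 0.0136, 0.0137, -0.0049, 0.0],
                 [0.2473, -0.0221, 0.9283], [3.1202, -0.7113, -0.1094], 0.0)]
with open("my_finetune_input.json", "w") as f:
    json.dump(rows, f, indent=2)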
Run the following commands to preprocess the fine-tuning dataset and fine-tune LAPA.
python data/finetune_preprocess.py --input_path "/path_to_json_file" --output_filename "data/real_finetune.jsonl" --csv_filename "data/real_finetune.csv"
./scripts/finetune_real.sh
We ran the experiments with four 80GB A100 GPUs. To change the number of GPUs being used, set the second index of --mesh_dim in the script to the number of GPUs.
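For intuition, the preprocessing step discretizes the continuous action space and writes per-dimension scaling information to the CSV file passed via --csv_filename. The sketch below is only illustrative: it assumes uniform binning between the per-dimension minimum and maximum, and the actual logic in data/finetune_preprocess.py may differ.

import numpy as np

def discretize_actions(raw_actions, num_bins=256):
    # raw_actions: (N, 7) array of continuous actions collected from the fine-tuning trajectories.
    # Returns integer bin indices plus the per-dimension ranges that a CSV such as
    # data/real_finetune.csv could record for mapping predictions back to continuous actions.
    raw_actions = np.asarray(raw_actions, dtype=np.float32)
    lows, highs = raw_actions.min(axis=0), raw_actions.max(axis=0)
    scaled = (raw_actions - lows) / np.maximum(highs - lows, 1e-8)
    bins = np.clip(np.round(scaled * (num_bins - 1)), 0, num_bins - 1).astype(np.int32)
    return bins, lows, highs

def undiscretize_actions(bins, lows, highs, num_bins=256):
    # Inverse mapping applied at deployment time using the stored per-dimension ranges.
    return lows + bins.astype(np.float32) / (num_bins - 1) * (highs - lows)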
For fine-tuning on SIMPLER rollout trajectories (100 trajectories), run the following command:
./scripts/finetune_simpler.sh
After fine-tuning, run the following command to deploy the model:
python -m latent_pretraining.deploy --load_checkpoint "params::/path_to_the_finetuned_ckpt" --action_scale_file "data/real_finetune.csv"
where load_checkpoint points to the fine-tuned checkpoint and action_scale_file points to the CSV file constructed during data preprocessing of the fine-tuning dataset.
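Once deployed, a client could query the model roughly as sketched below. This is a hypothetical example: the host, port, route, and payload keys are assumptions modeled on OpenVLA-style deployment servers (which the deployment code references), not a documented LAPA API; check latent_pretraining/deploy.py for the actual interface.

import numpy as np
import requests

# Hypothetical client; endpoint and payload format are assumptions, as noted above.
image = np.zeros((256, 256, 3), dtype=np.uint8)  # replace with a real camera frame
payload = {
    "image": image.tolist(),
    "instruction": "pick up the milk and put it in the sink",
}
action = requests.post("http://127.0.0.1:8000/act", json=payload).json()
print(action)  # expected: a continuous robot action recovered via the action scale CSV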
We provide the code for latent action quantization pretraining.
conda create -n laq python=3.10 -y
conda activate laq
cd laq
pip install -e .
accelerate launch train_sthv2.py
Note that the current data loader code is based on the Something-Something V2 dataset structure, where the directory consists of multiple trajectories and each trajectory contains multiple images. To train on a custom dataset, either match this data structure or modify the existing data loading code.
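To illustrate that layout, here is a minimal PyTorch-style dataset sketch under the assumption of one subdirectory of frames per trajectory; the directory naming, frame offset, and transforms are illustrative rather than taken from the repository's loader.

import os
from PIL import Image
from torch.utils.data import Dataset

class TrajectoryFrameDataset(Dataset):
    # Minimal sketch of the Something-Something-V2-style layout the loader expects:
    #   root/<trajectory_id>/<frame_0000.jpg, frame_0001.jpg, ...>
    # Yields (current_frame, future_frame) pairs, the kind of input a latent action
    # quantization model consumes.
    def __init__(self, root, frame_offset=1, transform=None):
        self.pairs = []
        for traj in sorted(os.listdir(root)):
            frames = sorted(os.listdir(os.path.join(root, traj)))
            for i in range(len(frames) - frame_offset):
                self.pairs.append((os.path.join(root, traj, frames[i]),
                                   os.path.join(root, traj, frames[i + frame_offset])))
        self.transform = transform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        cur_path, nxt_path = self.pairs[idx]
        cur = Image.open(cur_path).convert("RGB")
        nxt = Image.open(nxt_path).convert("RGB")
        if self.transform is not None:
            cur, nxt = self.transform(cur), self.transform(nxt)
        return cur, nxt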
After training, you can use the trained quantization model as an inverse dynamics model to obtain latent actions for training data.
python inference_sthv2.py
Pass arguments consistent with the ones used for training. The input_file argument should point to a JSONL file that contains id, image, and instruction keys as metadata, plus a vision key holding the output of the VQGAN model, i.e., 256 discrete image tokens.
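For reference, one line of such a JSONL file could be written as sketched below; only the id, image, instruction, and vision keys come from the description above, while the token values and the output filename are placeholders.

import json

# One JSON object per line. The 256 integers under "vision" stand in for the discrete image
# tokens produced by the VQGAN; the token values and "laq_input.jsonl" are placeholders.
entry = {
    "id": "data/finetune_data/episode_0/step_0",
    "image": "data/finetune_data/episode_0/step_0.jpg",
    "instruction": "pick up the milk and put it in the sink",
    "vision": [0] * 256,
}
with open("laq_input.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")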
We provide the code to do latent pretraining from a pretrained LWM checkpoint. First, download the LWM-Chat-1M-Jax model into the lwm_checkpoints directory. Then, download the pretraining dataset from this link into the data directory. Run the following command for latent pretraining:
./scripts/latent_pretrain_openx.sh
We ran the experiments on 8 H100 GPUs for 34 hours. We have empirically observed that 70K steps with a batch size of 256 are enough to get decent performance on downstream tasks after fine-tuning.
For reproducible simulation evaluation, we release the setup that we tested with. First, install the packages required for our latent pretraining and SIMPLER following the installation guide.
The inference script is provided in scripts/lapa_bridge.sh.
The codebase is based on Large-World-Model repository. For latent action quantization, we referred to Phenaki code. For deployment code, we referred to the OpenVLA code. For the SIMPLER evaluation code, we referred to the SIMPLER repository.
If you use this codebase, or otherwise find our work valuable, please cite:
@article{ye2024latent,
title={Latent Action Pretraining from Videos},
author={Ye, Seonghyeon and Jang, Joel and Jeon, Byeongguk and Joo, Sejune and Yang, Jianwei and Peng, Baolin and Mandlekar, Ajay and Tan, Reuben and Chao, Yu-Wei and Lin, Bill Yuchen and others},
journal={arXiv preprint arXiv:2410.11758},
year={2024}
}
If you have additional questions, feel free to send an email to latentactionpretraining@gmail.com.
LAPA's code and model weights are released under the MIT License.