CLIP-RT

[Project Page] [Paper] [Citations]

CLIP-RT (CLIP-based Robotics Transformer) is a vision-language-action (VLA) model for generalist manipulation policies. We seamlessly extend OpenAI's CLIP to robot learning: given an image and a natural language instruction, CLIP-RT learns to predict the robotic action specified in natural language. We found that CLIP-RT effectively learns end-to-end policies for novel robotic manipulation tasks.

Approach

[Figure: overview of the CLIP-RT approach]

Usage

CLIP-RT is built on OpenCLIP, an open-source implementation of CLIP, so you can swap in CLIP models with different configurations in a plug-and-play manner. In our project, we used PyTorch v2.3.1 and open_clip_torch v2.26.1. For more details, please consult the open_clip directory.
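For example, once open_clip_torch is installed (see the commands below), the available architectures and pretrained checkpoint tags can be listed directly. This is a minimal sketch of that plug-and-play interface; any OpenCLIP-compatible configuration can be substituted for the one used in this README.

import open_clip

# list the model architectures and (architecture, checkpoint) pairs that OpenCLIP ships with
print(open_clip.list_models())
print(open_clip.list_pretrained())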

python3 -m venv clip-rt
source clip-rt/bin/activate
pip install -U pip
pip install open_clip_torch

import json
import torch
import open_clip
import numpy as np
from PIL import Image

model_name = 'ViT-H-14-378-quickgelu'
model_path = 'clip-rt-finetuned.pt'
prompt = "what motion should the robot arm perform to complete the instruction '{}'?"
lookup_table = json.load(open("docs/language_to_action.json"))
action_classes = list(lookup_table.keys()) # ["lower arm by 5cm", "rotate the gripper..."]

model, _, preprocess = open_clip.create_model_and_transforms(model_name=model_name, pretrained=model_path)
model.eval()  # model in train mode by default, impacts some models with BatchNorm or stochastic depth active
tokenizer = open_clip.get_tokenizer(model_name)

image = preprocess(Image.open("docs/example.png")).unsqueeze(0)
inst = tokenizer(prompt.format("close the laptop"))
actions = tokenizer(action_classes)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    inst_features = model.encode_text(inst)
    context_features = image_features + inst_features
    action_features = model.encode_text(actions)

    context_features /= context_features.norm(dim=-1, keepdim=True)
    action_features /= action_features.norm(dim=-1, keepdim=True)
    action_probs = (context_features @ action_features.T).sigmoid() # [.92, .01, ...]

pred = np.argmax(action_probs.squeeze(0).numpy())
pred = action_classes[pred] # language action
pred = lookup_table[pred]   # low-level action 
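The snippet above produces a single prediction. Purely as a rough sketch of how CLIP-RT could drive a robot in closed loop, the same logic can be wrapped in a helper and queried at every control step; get_camera_image(), execute(), and max_steps below are hypothetical placeholders for your own camera interface, robot interface, and episode length, and are not part of this repository.

def predict_low_level_action(image_pil, instruction):
    # encode the current observation together with the prompted instruction
    image = preprocess(image_pil).unsqueeze(0)
    inst = tokenizer(prompt.format(instruction))
    with torch.no_grad():
        image_features = model.encode_image(image)
        inst_features = model.encode_text(inst)
        context_features = image_features + inst_features
        action_features = model.encode_text(actions)

        context_features /= context_features.norm(dim=-1, keepdim=True)
        action_features /= action_features.norm(dim=-1, keepdim=True)
        action_probs = (context_features @ action_features.T).sigmoid()
    language_action = action_classes[action_probs.squeeze(0).argmax().item()]
    return lookup_table[language_action]

# hypothetical control loop: get_camera_image() and execute() stand in for your
# own camera and robot interfaces
# for _ in range(max_steps):
#     execute(predict_low_level_action(get_camera_image(), "close the laptop"))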

Pretrained Models

We provide two pretrained models:

| Model | Training Data | Link |
| --- | --- | --- |
| CLIP-RT (pretrained) | Open X-Embodiment data | Download |
| CLIP-RT (fine-tuned) | Open X-Embodiment data + in-domain data | Download |

Training CLIP-RT

Install

You can then install OpenCLIP for training with pip install 'open_clip_torch[training]'.

Pretraining

We pretrain CLIP-RT on the Open X-Embodiment dataset curated by OpenVLA. Since this dataset does not contain natural language supervision for robot learning, we extract the supervision from the low-level actions and save the results in the WebDataset format (an illustrative sketch of this packing step follows step 2 below):

  1. Download Open X-Embodiment data (see OpenVLA)

  2. Preprocess for pretraining

cd oxe_data_preprocess
python preprocess.py
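The conversion itself is implemented in preprocess.py. Purely as an illustration of the target format, samples could be packed into WebDataset tar shards with the webdataset package roughly as follows; the field names and the samples iterable are placeholders and not necessarily what preprocess.py emits.

import webdataset as wds

# illustrative sketch: pack (image, instruction, language supervision) samples
# into tar shards of at most 10,000 samples each
with wds.ShardWriter("oxe-%06d.tar", maxcount=10000) as sink:
    for idx, sample in enumerate(samples):  # `samples` is a placeholder iterable
        sink.write({
            "__key__": f"sample{idx:09d}",
            "jpg": sample["image_bytes"],              # raw JPEG bytes
            "txt": sample["instruction"],              # natural language instruction
            "supervision.txt": sample["supervision"],  # e.g., "move the arm forward by 1cm"
        })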
  3. Train CLIP-RT. If you want to change configurations, please see the shell script below.
cd open_clip/src
./scripts/train.sh

Fine-tuning on in-domain data

  1. Preprocess for fine-tuning

OpenCLIP supports either CSV files or WebDataset for training. We construct the CSV file as follows:

import csv

with open(csv_path, 'w', newline='') as f:
    csv_out = csv.writer(f, delimiter=',')
    csv_out.writerow(['filepath', 'caption', 'supervision', 'label'])

    # we assume each sample is a dict with four fields
    for sample in samples:
        item = []

        # path to the raw image
        item.append(sample['image_path'])

        # natural language instruction
        prompt = "what motion should the robot arm perform to complete the instruction '{}'?"
        item.append(prompt.format(sample['instruction']))

        # natural language supervision (e.g., move the arm forward by 1cm)
        item.append(sample['supervision'])

        # label for the natural language supervision.
        # this can be any integer; just make sure that supervisions sharing
        # the same low-level action get the same label.
        item.append(sample['label'])
        csv_out.writerow(item)
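With the column layout above, the resulting CSV might begin like this; the file names, supervisions, and labels are illustrative, and the two rows share label 3 because their supervisions describe the same low-level action.

filepath,caption,supervision,label
docs/example.png,what motion should the robot arm perform to complete the instruction 'close the laptop'?,lower arm by 5cm,3
docs/example2.png,what motion should the robot arm perform to complete the instruction 'close the laptop'?,move the arm down by 5cm,3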

Please check open_clip/src/training/data.py to see how CLIP-RT loads data.

  2. Fine-tune CLIP-RT.
cd open_clip/src
./scripts/finetune.sh

Acknowledgements

We use OpenCLIP for model implementation and OpenVLA for data preprocessing. Thanks!

Citing

If you found this repository useful, please consider citing:

@article{kang2024cliprt,
  title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
  author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2411.00508},
  year={2024}
}