A Vision-Language Model for Robotic Understanding and Interaction
Installation | Quick Start | Documentation | Troubleshooting
- Overview
- Features
- Requirements
- Installation
- Configuration
- Quick Start
- Documentation
- Local Model Weights
- Project Structure
- Troubleshooting
- Frequently Asked Questions
- Contributing
- License
RoboBrain 2.0 is a state-of-the-art vision-language model designed for robotic perception and interaction tasks. This repository provides a modular Python implementation with support for:
- Visual question answering
- Object grounding (bounding box detection)
- Affordance prediction
- Trajectory generation
- Pointing tasks
- Multi-turn conversational memory
The implementation supports both cloud-based model inference via Hugging Face and local weight deployment for offline operation.
| Feature | Description |
|---|---|
| Multi-Task Support | General QA, grounding, affordance, trajectory, and pointing |
| Multi-Turn Memory | Maintains conversation context across multiple queries |
| Local Inference | Run entirely offline with pre-downloaded weights |
| Interactive CLI | Command-line chat interface for rapid prototyping |
| Modular Design | Separate scripts for each task type |
| Conversation Persistence | Save and load conversations as JSON files |
| Thinking Mode | Optional chain-of-thought reasoning display |
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 8 GB | 16 GB+ |
| System RAM | 16 GB | 32 GB |
| Storage | 20 GB | 50 GB |
- Python 3.10 or higher
- CUDA 11.8 or higher (for GPU acceleration)
- Git
Core dependencies are managed via requirements.txt:
transformers>=4.42.0
accelerate>=0.30.0
torch>=2.1.0
timm>=0.9.16
pillow
sentencepiece
bitsandbytes
matplotlib
huggingface_hub
python-dotenv
qwen-vl-utils>=0.0.8
This method ensures reproducible environments across different systems.
# Clone the repository
git clone https://github.com/YasiruDEX/Robobrain-2.0.git
cd Robobrain-2.0
# Create and activate the conda environment
conda env create -f environment.yml --force
conda activate robobrain2-env
# Install remaining pip dependencies
python -m pip install -r requirements.txt

Alternatively, use the provided setup script:
chmod +x scripts/setup_conda_env.sh
./scripts/setup_conda_env.sh

For systems without conda:
# Clone the repository
git clone https://github.com/YasiruDEX/Robobrain-2.0.git
cd Robobrain-2.0
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install dependencies
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Verify the installation:
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"Expected output:
PyTorch: 2.x.x
CUDA: True
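To also confirm that the GPU meets the VRAM requirement listed above, the following quick check can be used. It is a small illustrative snippet based on standard PyTorch calls, not part of the repository's scripts:

```python
# Illustrative check: report the detected GPU and its total VRAM.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected; inference will fall back to CPU (very slow).")
```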
The model requires authentication for first-time download from Hugging Face.
- Create a Hugging Face account at huggingface.co
- Generate an access token at huggingface.co/settings/tokens
- Accept the model license at BAAI/RoboBrain2.0-3B
- Configure the token:
cp .env.example .env

Edit .env and add your token:
HF_TOKEN=hf_your_token_here
Security Note: Never commit .env files containing tokens. The .gitignore is configured to exclude this file.
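For reference, a minimal sketch of how the token can be picked up at runtime, assuming the scripts read `HF_TOKEN` from `.env` via `python-dotenv` (both listed in requirements.txt); the exact loading logic lives in the repository's own scripts:

```python
# Minimal sketch: read HF_TOKEN from .env and authenticate with the Hugging Face Hub.
import os
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()                    # loads variables from .env in the current directory
token = os.getenv("HF_TOKEN")
if token:
    login(token=token)           # authenticates this session for gated model downloads
else:
    print("HF_TOKEN is not set; downloading the gated model will fail with a 403 error.")
```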
Run the general QA script to verify installation:
python scripts/run_general_qa.py

Expected output:
Loading RoboBrain 2.0 model...
Model loaded successfully!
Processing image: RoboBrain2.0_lib/assets/demo/navigation.jpg
Question: What do you see in this image?
Answer: [Model response]
# Visual Question Answering
python scripts/run_general_qa.py
# Object Grounding
python scripts/run_visual_grounding.py
# Affordance Prediction
python scripts/run_affordance.py
# Trajectory Generation
python scripts/run_trajectory.py
# Pointing Tasks
python scripts/run_pointing.py

Results are saved to the results/ directory.
Each script in the scripts/ directory handles a specific inference task:
| Script | Task | Output |
|---|---|---|
| `run_general_qa.py` | Visual question answering | Text response |
| `run_visual_grounding.py` | Object detection | Bounding box `[x1, y1, x2, y2]` |
| `run_affordance.py` | Action affordance | Affordance map |
| `run_trajectory.py` | Motion trajectory | Trajectory points |
| `run_pointing.py` | Object pointing | Point coordinates `[(x, y), ...]` |
The multi-turn conversation system enables context-aware interactions where the model maintains memory of previous exchanges.
ConversationMemory
|
+-- Turn (dataclass)
| - question: str
| - answer: str
| - task: str
| - timestamp: datetime
| - image: Optional[str]
|
+-- Methods
- add_turn()
- get_context_prompt()
- save() / load()
- get_conversation_summary()
MultiTurnInference
|
+-- model: UnifiedInference
+-- memory: ConversationMemory
+-- Methods
- ask()
- ground()
- get_affordance()
- get_trajectory()
- point_at()
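The structure above can be read as the following simplified sketch; the field names follow the diagram, but the method bodies are illustrative assumptions rather than the repository's actual implementation:

```python
# Simplified sketch of the conversation memory; illustrative only.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class Turn:
    question: str
    answer: str
    task: str
    timestamp: datetime = field(default_factory=datetime.now)
    image: Optional[str] = None

class ConversationMemory:
    def __init__(self, max_turns: int = 10):
        self.max_turns = max_turns
        self.turns: List[Turn] = []

    def add_turn(self, turn: Turn) -> None:
        # Keep only the most recent max_turns exchanges.
        self.turns.append(turn)
        self.turns = self.turns[-self.max_turns:]

    def get_context_prompt(self) -> str:
        # Flatten prior Q/A pairs into a prefix injected before the next query.
        return "\n".join(f"Q: {t.question}\nA: {t.answer}" for t in self.turns)
```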
Launch the interactive chat:
python scripts/interactive_chat.py

With a specific image:

python scripts/interactive_chat.py --image /path/to/image.jpg

| Command | Arguments | Description |
|---|---|---|
| `/image` | `<path>` | Set image for analysis |
| `/task` | `<type>` | Switch task type |
| `/history` | - | Display conversation history |
| `/clear` | - | Clear conversation memory |
| `/save` | `<file>` | Save conversation to JSON |
| `/load` | `<file>` | Load previous conversation |
| `/thinking` | `on`/`off` | Toggle chain-of-thought display |
| `/context` | `on`/`off` | Toggle context injection |
| `/help` | - | Display command reference |
| `/quit` | - | Exit application |
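An illustrative session using the commands above (free-form lines are sent to the model as prompts; model answers omitted):

```text
> /image RoboBrain2.0_lib/assets/demo/navigation.jpg
> /task grounding
> find the nearest doorway
> /thinking on
> /save conversations/session_001.json
> /quit
```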
| Task | Description | Output Format |
|---|---|---|
| `general` | Visual question answering | Natural language |
| `grounding` | Object bounding box | `[x1, y1, x2, y2]` |
| `affordance` | Action affordance map | Coordinate list |
| `trajectory` | Motion path | Trajectory points |
| `pointing` | Object pointing | `[(x, y), ...]` |
from scripts.utils import get_model
from scripts.conversation_memory import MultiTurnInference
# Initialize model
model, repo_dir = get_model()
# Create chat instance
chat = MultiTurnInference(model, repo_dir)
# Set image
chat.set_image("path/to/image.jpg")
# Query the model
response = chat.ask("What objects are visible?")
print(response["answer"])

# First query
r1 = chat.ask("What animals are in this image?")
print(r1["answer"]) # "I can see two cats."
# Follow-up query (uses conversation context)
r2 = chat.ask("What are they doing?")
print(r2["answer"]) # "They are sleeping on the couch."
# Switch to pointing task
r3 = chat.point_at("the cats")
print(r3["answer"])  # "[(395, 186), (175, 170)]"

# Save conversation
chat.save_conversation("conversations/session_001.json")
# Load in new session
new_chat = MultiTurnInference(model, repo_dir)
new_chat.load_conversation("conversations/session_001.json")
# Continue conversation
r4 = new_chat.ask("What color are the cats?")

MultiTurnInference Class
| Method | Parameters | Returns | Description |
|---|---|---|---|
| `set_image` | `path: str` | `None` | Set current image |
| `ask` | `prompt: str, task: str, enable_thinking: bool` | `dict` | Query with context |
| `ground` | `description: str` | `dict` | Get bounding box |
| `get_affordance` | `action: str` | `dict` | Get affordance map |
| `get_trajectory` | `action: str` | `dict` | Get trajectory |
| `point_at` | `description: str` | `dict` | Get point coordinates |
| `reset` | - | `None` | Clear memory |
| `save_conversation` | `path: str` | `None` | Save to JSON |
| `load_conversation` | `path: str` | `None` | Load from JSON |
| `show_history` | - | `None` | Print history |
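As an illustration of the `ask` signature above (the task name follows the task-type table; verify the exact keyword arguments against scripts/conversation_memory.py):

```python
# Hypothetical call combining a task type with chain-of-thought display.
r = chat.ask(
    "Where should the gripper grasp the mug?",
    task="affordance",
    enable_thinking=True,
)
print(r["thinking"])  # chain-of-thought, when enabled
print(r["answer"])    # affordance coordinates
```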
Response Dictionary
{
"answer": str, # Model response
"turn_number": int, # Current turn index
"context_used": bool, # Whether context was injected
"thinking": Optional[str] # Chain-of-thought (if enabled)
}

For offline operation, download and place model weights in the weights/ directory:
weights/
├── config.json
├── generation_config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json
Using Hugging Face CLI:
huggingface-cli download BAAI/RoboBrain2.0-3B --local-dir weights/

Using Python:
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="BAAI/RoboBrain2.0-3B",
local_dir="weights/",
token="hf_your_token"
)

The scripts automatically detect and use local weights when available.
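The detection logic can be pictured roughly as follows; this is an illustrative sketch with a hypothetical helper name, not the exact code in scripts/utils.py:

```python
# Illustrative fallback: prefer local weights/ if a snapshot is present, else download.
from pathlib import Path
from huggingface_hub import snapshot_download

def resolve_model_path(local_dir: str = "weights", repo_id: str = "BAAI/RoboBrain2.0-3B") -> str:
    local = Path(local_dir)
    if (local / "config.json").exists() and any(local.glob("*.safetensors")):
        return str(local)                      # offline: use pre-downloaded weights
    return snapshot_download(repo_id=repo_id)  # online: download or reuse the HF cache
```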
Robobrain-2.0/
├── Notebooks/
│ ├── robobrain2-quick-test.ipynb # Quick test notebook
│ └── multi_turn_conversation.ipynb # Multi-turn demo notebook
├── RoboBrain2.0_lib/ # Cloned RoboBrain repository
│ └── assets/demo/ # Demo images
├── conversations/ # Saved conversations
├── docs/
│ └── images/ # Documentation images
├── results/ # Output visualizations
├── scripts/
│ ├── conversation_memory.py # Multi-turn memory system
│ ├── interactive_chat.py # CLI chat interface
│ ├── run_affordance.py # Affordance task script
│ ├── run_general_qa.py # QA task script
│ ├── run_pointing.py # Pointing task script
│ ├── run_trajectory.py # Trajectory task script
│ ├── run_visual_grounding.py # Grounding task script
│ ├── setup_conda_env.sh # Environment setup script
│ ├── test_multi_turn.py # Multi-turn test suite
│ └── utils.py # Shared utilities
├── weights/ # Local model weights (optional)
├── .env.example # Environment template
├── .gitignore
├── environment.yml # Conda environment spec
├── README.md
└── requirements.txt # Python dependencies
Symptom:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X MiB
Solutions:
- Close other GPU-intensive applications
- Reduce batch size (if applicable)
- Enable model offloading and clear the CUDA cache: `model.to("cpu")` followed by `torch.cuda.empty_cache()` (the sketch below this list shows how to inspect GPU memory usage)
- Use a GPU with more VRAM (16 GB+ recommended)
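The following diagnostic snippet reports how much GPU memory is actually in use; it relies on standard PyTorch calls and is not repository-specific. If the loading code supports it, loading in half precision (`torch_dtype=torch.float16`) with `device_map="auto"` is another common mitigation:

```python
# Diagnostic helper: report current GPU memory usage (standard PyTorch calls).
import torch

torch.cuda.empty_cache()  # release cached blocks PyTorch is holding
total = torch.cuda.get_device_properties(0).total_memory
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
print(f"Total:     {total / 1024**3:.2f} GB")
```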
Symptom:
ModuleNotFoundError: No module named 'qwen_vl_utils'
Solution:
pip install "qwen-vl-utils>=0.0.8"

Symptom:
ModuleNotFoundError: No module named 'decorator'
Solution:
pip install decorator pygments

Symptom:
huggingface_hub.utils._errors.GatedRepoError: 403 Client Error
Solutions:
- Verify the token is set correctly in .env
- Accept the model license at BAAI/RoboBrain2.0-3B
- Regenerate token if expired
- Use local weights instead (see Local Model Weights)
Symptom:
The kernel failed to start due to the missing module 'X'
Solution:
conda activate robobrain2-env
pip install pygments decorator ipykernel
python -m ipykernel install --user --name robobrain2-env

Then select the robobrain2-env kernel in Jupyter.
Symptom:
fatal: destination path 'RoboBrain2.0_lib' already exists
Solution:
rm -rf RoboBrain2.0_lib
# Run the script again

Symptom:
FileNotFoundError: [Errno 2] No such file or directory: 'path/to/image.jpg'
Solution:
Verify the image path exists and is accessible:
import os
print(os.path.exists("path/to/image.jpg"))

Use absolute paths when possible:
from pathlib import Path
image_path = Path(__file__).parent / "assets" / "demo" / "image.jpg"

Symptom:
RuntimeError: CUDA error: no kernel image is available for execution
Solution:
Reinstall PyTorch with the correct CUDA version:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Q: What GPU is required to run RoboBrain 2.0?
A: A GPU with at least 8GB VRAM is required. For optimal performance, 16GB+ is recommended. The model has been tested on NVIDIA RTX 3080, RTX 4090, A100, and V100 GPUs.
Q: Can I run RoboBrain 2.0 on CPU only?
A: While technically possible, CPU inference is extremely slow and not recommended for practical use. Expect inference times of several minutes per query.
Q: What image formats are supported?
A: JPEG, PNG, BMP, and WebP formats are supported. Both local file paths and HTTP/HTTPS URLs are accepted.
Q: How do I update to the latest version?
A:
git pull origin main
pip install -r requirements.txt --upgrade

Q: How many turns can the conversation memory hold?
A: By default, the last 10 turns are retained. This can be configured:
chat = MultiTurnInference(model, repo_dir)
chat.memory.max_turns = 20  # Increase to 20 turns

Q: Does multi-turn conversation affect inference speed?
A: Minimally. The context prompt adds approximately 5-10% overhead to inference time. The benefit of contextual understanding typically outweighs this cost.
Q: Can I disable conversation context for specific queries?
A: Yes. Either disable globally or per-query:
# Disable globally
chat.use_context = False
# Or use single-turn inference directly
from scripts.utils import get_model
model, repo_dir = get_model()
# Use the model directly without the MultiTurnInference wrapper

Q: Are conversations persisted across sessions?
A: Not automatically. Use save_conversation() before exiting and load_conversation() when resuming.
Q: How large are the model weights?
A: Approximately 6GB for the 3B parameter model.
Q: Can I use quantized models?
A: The current implementation uses full precision. Quantization support (INT8, INT4) is planned for future releases.
Q: Where can I find model documentation?
A: See the official repository at FlagOpen/RoboBrain2.0.
Q: Can I use RoboBrain 2.0 in a web application?
A: Yes. Wrap the API in a REST endpoint using Flask or FastAPI:
from flask import Flask, request, jsonify
from scripts.utils import get_model
from scripts.conversation_memory import MultiTurnInference
app = Flask(__name__)
model, repo_dir = get_model()
chat = MultiTurnInference(model, repo_dir)
@app.route("/query", methods=["POST"])
def query():
    data = request.json
    response = chat.ask(data["prompt"])
    return jsonify(response)
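A quick client-side check of the endpoint above (assumes the `requests` package is installed and the Flask app is running locally on the default port 5000):

```python
# Hypothetical client call against the /query endpoint sketched above.
import requests

resp = requests.post(
    "http://localhost:5000/query",
    json={"prompt": "What objects are visible?"},
)
print(resp.json()["answer"])
```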
Q: Is there a Docker container available?
A: Not currently. A Dockerfile is planned for future releases.
Run the test suite to verify functionality:
# Multi-turn memory tests
python scripts/test_multi_turn.py
# Quick verification
python -c "from scripts.utils import get_model; print('Import successful')"| Test | Description | Command |
|---|---|---|
| Unit Tests | Memory system | python scripts/test_multi_turn.py |
| Integration | Full pipeline | python scripts/run_general_qa.py |
| Import | Dependencies | python -c "import scripts.utils" |
Contributions are welcome. Please follow these guidelines:
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit your changes: `git commit -m "Add your feature"`
- Push to the branch: `git push origin feature/your-feature`
- Open a Pull Request
- Follow PEP 8 guidelines
- Add docstrings for public functions
- Include type hints where applicable
- Write tests for new functionality
This project is licensed under the MIT License. See LICENSE for details.
The RoboBrain 2.0 model weights are subject to the BAAI license.
- BAAI for the RoboBrain 2.0 model
- FlagOpen for the original implementation
- Hugging Face for model hosting
