TRACE: Textual Reasoning for Affordance Coordinate Extraction

A Vision-Language Model that enhances spatial affordance prediction through explicit textual Chain of Reasoning (CoR)


🧠 Chain of Reasoning • 🎯 Spatial Precision • 🤖 Vision-Language Model • 👁️ Attention Visualization


πŸ” Overview of TRACE's Reasoning Process

TRACE Reasoning Process

🔄 TRACE's multi-step reasoning pipeline: Given an image and a natural language instruction, the system determines the goal subtype, establishes relevant reference surfaces, defines target areas through explicit reasoning, and generates precise normalized coordinates.


📖 Introduction

TRACE (Textual Reasoning for Affordance Coordinate Extraction) is an enhanced Vision-Language Model that predicts image keypoint affordances by integrating an explicit textual Chain of Reasoning (CoR) into the spatial affordance prediction process. Building upon the RoboPoint framework, TRACE teaches models not only to predict precise spatial coordinates but also to articulate the reasoning behind their predictions.

✨ Key Innovations

🧠 Textual Chain of Reasoning

Unlike visual CoT methods that generate intermediate images, TRACE uses lightweight textual reasoning that leverages the VLM's native linguistic capabilities

📊 Enhanced Dataset

200,000 training samples with programmatically generated explicit reasoning steps

🎯 Improved Performance

Achieves 48.1% accuracy on the challenging Where2Place benchmark, a 9.6% relative improvement over the original RoboPoint model

πŸ” Interpretable Predictions

Provides clear rationales for why specific spatial locations are selected

💡 Key Insight: The approach addresses the critical gap between high-level reasoning and the precise, low-level spatial understanding required for physical manipulation. As shown in the figure above, our model follows a multi-step reasoning process that first determines the goal subtype, establishes relevant reference surfaces, defines target areas, and finally generates normalized coordinates.


βš™οΈ Install

./environment_setup.sh

or follow the instructions below in order.

conda create -n trace python=3.10 -y
conda activate trace

pip install --upgrade pip  # enable PEP 660 support

# optional: skip this if you prefer the system's built-in nvcc
conda install -c nvidia cuda=12.1 -y

pip install -e .

# optional: only needed if you want to train the model
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

📦 Data Samples

🎯 50 Representative Samples Available for Review

We provide 50 carefully selected samples from our TRACE dataset for review purposes:

🎯 25 Object Reference Samples

Demonstrating spatial reasoning for object-relative positioning

🌟 25 Free Space Reference Samples

Showing vacant area identification with reasoning chains

πŸ“ Location: data_samples/sample_50_cor_data.zip

📋 Contents:

  • ✅ Input images and natural language instructions
  • 🧠 Complete Chain of Reasoning (CoR) explanations
  • 🎯 Ground-truth coordinate annotations
  • 📊 Comparison with baseline methods

💡 These samples illustrate the key contribution of our work: explicit textual reasoning that justifies spatial coordinate predictions.

🚀 Full Dataset Availability


The complete TRACE reasoning dataset (200,000 samples) is now publicly available on 🤗 HuggingFace: jink-ucla/TRACE

  • ✅ Complete training and evaluation splits
  • 🧠 All Chain of Reasoning annotations
  • 📖 Detailed dataset documentation and usage examples
  • 📊 Comparison baselines and evaluation metrics

πŸ—‚οΈ Dataset Construction

The TRACE dataset consists of 200,000 training samples created by enhancing the RoboPoint data generation pipeline. The dataset is composed of two data sources:

  • 100,000 novel reasoning-augmented samples with explicit textual Chain of Reasoning (CoR)
  • 100,000 standard visual instruction-tuning samples from LVIS and VQA datasets

The key innovation is the programmatic generation of explicit textual reasoning steps using the Gemini API, which breaks down the spatial reasoning process into interpretable steps.

Each data sample includes:

  • Input image and natural language instruction
  • Multi-step textual reasoning process
  • Final normalized 2D coordinates: {(x_i, y_i) | x_i, y_i ∈ [0, 1]}

Example reasoning structure:

  1. Goal Subtype Identification: Determine if the task involves placement affordance, reference object identification, etc.
  2. Reference Surface Establishment: Identify the relevant surface or area in the image
  3. Target Area Definition: Define the specific region based on the instruction
  4. Coordinate Generation: Output precise normalized coordinates
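
To make the record structure concrete, here is a hedged sketch of what one reasoning-augmented sample might look like, along with the mapping from normalized coordinates back to pixels. The field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical TRACE-style sample; field names and values are illustrative,
# not the dataset's actual schema.
sample = {
    "instruction": "Place the mug in the free space to the left of the plate.",
    "reasoning": [
        "Goal subtype: free-space placement affordance.",
        "Reference surface: the table top supporting the plate.",
        "Target area: the vacant region immediately left of the plate.",
        "Coordinates: normalized points inside the target area.",
    ],
    "points": [(0.32, 0.61), (0.35, 0.64)],  # (x, y) with x, y in [0, 1]
}

def to_pixels(points, width, height):
    """Map normalized (x, y) in [0, 1] to integer pixel coordinates."""
    return [(round(x * (width - 1)), round(y * (height - 1))) for x, y in points]

print(to_pixels(sample["points"], width=640, height=480))  # [(204, 292), (224, 307)]
```

Keeping coordinates normalized makes the annotations resolution-independent; denormalization happens only when points are drawn or executed.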

πŸ‹οΈ Training Configuration

Model Architecture:

  • Base LLM: Vicuna-v1.5-13B
  • Vision Encoder: CLIP-ViT-Large-Patch14-336 (penultimate layer features)
  • Projector: 2-layer MLP with GELU activation
  • Optimization: Flash Attention 2 for memory efficiency
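
The projector above can be sketched in a few lines. This is a minimal NumPy illustration of a 2-layer MLP with GELU (tanh approximation), not the repo's implementation; the real feature dimensions (roughly 1024-d CLIP ViT-L/14 patch features projected to the LLM's hidden size) are assumptions, and tiny dimensions are used here so the example runs instantly:

```python
import numpy as np

# GELU activation, tanh approximation.
def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """2-layer MLP projecting visual tokens into the LLM embedding space."""
    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (d_in, d_out))
        self.b1 = np.zeros(d_out)
        self.w2 = rng.normal(0.0, 0.02, (d_out, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, tokens):  # tokens: (num_patches, d_in)
        return gelu(tokens @ self.w1 + self.b1) @ self.w2 + self.b2

# A 336px image with 14px patches yields 24 x 24 = 576 visual tokens.
proj = MLPProjector(d_in=8, d_out=16)
print(proj(np.zeros((576, 8))).shape)  # (576, 16)
```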

Training Setup:

For 13B Model (Main Results):

  • Optimization Method: Full fine-tuning (FFT) for maximum performance
  • Optimizer: AdamW with learning rate 2×10⁻⁶
  • Scheduler: Cosine annealing with 3% warmup
  • Duration: 1 epoch on TRACE dataset

For 7B Model (Ablations & Analysis):

  • Optimization Method: Low-Rank Adaptation (LoRA) with rank r=128, α=256
  • Optimizer: AdamW with learning rate 2×10⁻⁶
  • Scheduler: Cosine annealing with 3% warmup
  • Precision: bfloat16 mixed-precision with gradient checkpointing
  • Duration: 1 epoch on TRACE dataset
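
The LoRA setup above amounts to adapting each frozen weight W as W' = W + (α/r)·BA, where A and B are small trainable matrices; with r=128 and α=256 the scaling factor is 2.0. A minimal sketch of that update (tiny dimensions for illustration; a real 7B attention projection is on the order of 4096×4096):

```python
import numpy as np

# LoRA update: W' = W + (alpha / r) * B @ A, with A (r, d_in), B (d_out, r).
# With r=128 and alpha=256 as used for the 7B model, alpha/r = 2.0.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16  # small demo values, same alpha/r ratio

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

W_adapted = W + (alpha / r) * (B @ A)

# Because B starts at zero, the adapted layer matches the frozen one
# exactly until training updates A and B.
print(np.allclose(W_adapted, W))  # True
```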

Data Processing Optimizations:

  • Lazy preprocessing for memory efficiency
  • Square aspect ratio padding for uniform input
  • Grouping by modality length to minimize padding
  • 12 dataloader workers to prevent bottlenecks
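
The length-grouping trick in the list above can be sketched simply: sorting samples by sequence length before chunking keeps each batch's lengths similar, so little compute is wasted on padding. Function and variable names here are illustrative, not the repo's actual sampler:

```python
# Length-grouped batching sketch: sort sample indices by length, then chunk.
def length_grouped_batches(lengths, batch_size):
    """Return batches of sample indices, grouped by ascending length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

lengths = [512, 32, 480, 40, 36, 500]
for batch in length_grouped_batches(lengths, batch_size=3):
    # Each batch is padded only to its own longest member.
    print(batch, "padded to", max(lengths[i] for i in batch))
# [1, 4, 3] padded to 40
# [2, 5, 0] padded to 512
```

Without grouping, a short 32-token sample could land next to a 512-token one and be padded 16x over its real length.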

🎯 Model Weights


Available Models

| Model | Type | Training | Size | Performance | Download |
|---|---|---|---|---|---|
| TRACE-13B | Fine-tuned | Full Fine-tuning | 13B | 48.1% W2P | Coming Soon |
| TRACE-7B | Fine-tuned | LoRA (r=128) | 7B | Used for analysis | Coming Soon |
| Vicuna-v1.5-13B | Base Model | Pre-trained | 13B | Required for TRACE-13B | Coming Soon |
| Vicuna-v1.5-7B | Base Model | Pre-trained | 7B | Required for TRACE-7B | Coming Soon |

💡 Note: Model weights will be made publicly available soon. Links will be updated in the table above.

📈 Evaluation

πŸ† Benchmarks and Results


We evaluate TRACE on challenging spatial affordance prediction benchmarks:

| Model | RoboRefIt | Where2Place (W2P) | W2P (hard) |
|---|---|---|---|
| 🎯 RoboPoint(FFT)+TRACE | 🏆 42.9% ± 0.8 | 🏆 48.1% ± 0.1 | 🏆 55.0% ± 3.5 |
| RoboPoint(FFT) | 41.7% ± 0.6 | 43.9% ± 0.6 | 46.9% ± 4.2 |
| 🎯 RoboPoint(LoRA)+TRACE | 🏆 48.1% ± 2.8 | 🏆 43.7% ± 4.1 | 🏆 41.2% ± 7.3 |
| RoboPoint(LoRA) | 40.6% ± 3.0 | 36.1% ± 1.3 | 30.7% ± 0.2 |
| SpaceLLaVA | 20.0% ± 0.5 | 15.0% ± 1.6 | 13.6% ± 2.1 |
| GPT-4o | 6.5% ± 0.8 | 18.7% ± 2.6 | 17.8% ± 4.8 |
| Gemini | 5.2% ± 0.1 | 7.8% ± 0.2 | 6.6% ± 0.2 |

🎯 Key Results:

📊 Quantitative Improvements

  • 9.6% relative improvement over the RoboPoint baseline on Where2Place (FFT setting: 43.9% → 48.1%)
  • Statistically significant improvement on the W2P benchmark (p = 0.022 < 0.05)
  • 34.2% relative gain over the baseline on the challenging W2P (hard) subset (LoRA setting: 30.7% → 41.2%)

🌟 Qualitative Strengths

  • Consistent improvements across all benchmark categories
  • Particularly strong performance on challenging unseen relation types (W2P hard)
  • Dose-dependent relationship between CoR data quantity and performance
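
The relative-gain figures quoted above follow directly from the results table and can be checked with a one-line formula:

```python
# Relative gain in percent: 100 * (improved - baseline) / baseline.
# Numbers taken from the results table above.
def relative_gain(baseline, improved):
    return 100.0 * (improved - baseline) / baseline

print(round(relative_gain(43.9, 48.1), 1))  # 9.6  (FFT, Where2Place)
print(round(relative_gain(30.7, 41.2), 1))  # 34.2 (LoRA, W2P hard)
```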

🧪 Running Evaluations

To evaluate on Where2Place:

# Generate results
python robopoint/eval/model_vqa.py \
    --model-path trace-v1-vicuna-v1.5-13b \
    --image-folder datasets/where2place/images \
    --question-file datasets/where2place/point_questions.jsonl \
    --answer-file output/trace-v1-vicuna-v1.5-13b.jsonl

# Compute accuracy
python robopoint/eval/summarize_vqa.py --answer output/trace-v1-vicuna-v1.5-13b.jsonl
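
Conceptually, Where2Place scores a prediction by whether the predicted points fall inside the ground-truth region mask. The following is a simplified illustration of that idea, not the exact logic of summarize_vqa.py:

```python
# Simplified point-in-mask metric: the per-example score is the fraction of
# predicted (normalized) points that land inside the ground-truth mask.
def point_accuracy(points, mask):
    """points: list of normalized (x, y); mask: 2D list of 0/1, mask[row][col]."""
    h, w = len(mask), len(mask[0])
    hits = 0
    for x, y in points:
        col = min(int(x * w), w - 1)
        row = min(int(y * h), h - 1)
        hits += mask[row][col]
    return hits / len(points)

# Toy 4x4 mask whose right half is the valid placement region.
mask = [[0, 0, 1, 1] for _ in range(4)]
print(point_accuracy([(0.9, 0.5), (0.6, 0.1), (0.1, 0.5)], mask))  # 2 of 3 points hit
```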

🎨 Visualization

TRACE includes comprehensive visualization tools to analyze model predictions and Chain of Reasoning outputs:

# Visualize model comparisons with reasoning analysis
python visualization/visualize_results.py \
    --answer-files output/robopoint-baseline.jsonl output/trace-v1-vicuna-v1.5-13b.jsonl \
    --labels robopoint trace \
    --data-dir datasets/where2place/images \
    --output output/visualization_results \
    --num 10

Parameter Explanation:

  • --answer-files: Model output files from model_vqa.py
    • TRACE answer file: Contains reasoning chains + coordinates
    • Baseline file: Contains coordinates only
  • --labels: Labels for each model in the visualization plots
  • --data-dir: Benchmark dataset location (images + ground-truth masks)
  • --output: Directory where visualization results will be saved
  • --num: Number of samples to visualize

Visualization Features:

  • 🎯 Coordinate Prediction Overlay: Visual comparison of predicted vs ground-truth points
  • 🧠 Chain of Reasoning Display: Step-by-step reasoning process visualization
  • 📊 Model Comparison: Side-by-side comparison of different model outputs
  • 🔍 Error Analysis: Detailed analysis of prediction accuracy and failure cases

πŸ‘οΈ Attention Analysis

TRACE provides unique insights into the model's reasoning process through comprehensive attention visualization and batch processing capabilities:

# Batch process Where2Place dataset with reasoning milestone attention
CUDA_VISIBLE_DEVICES=7 python visualization/attention_map.py \
    --model-path [MODEL_WEIGHTS_PLACEHOLDER] \
    --model-base [BASE_MODEL_PLACEHOLDER] \
    --dataset-dir [DATASET_PLACEHOLDER] \
    --output-dir where2place_individual_results \
    --start-idx 0 --end-idx 25

Attention Analysis Features:

  • πŸ” Multi-step Attention Tracking: Visualize how attention changes during each reasoning milestone:
    1. Identify Reference Object - Initial context establishment
    2. Define Target Area - Spatial area definition
    3. Determine Goal Subtype - Task classification (critical reasoning step)
    4. Generate Output - Coordinate generation
    5. Final Answer - Complete response with overlays
  • 📊 Comprehensive Visualizations:
    • Individual milestone images with transparent attention overlays
    • Combined milestone progression visualization
    • Ground truth mask overlays (cyan)
    • Predicted coordinate points (red dots)
  • 📈 Batch Processing: Process entire datasets with statistical analysis
  • 🎯 Interactive Dashboard: Summary statistics and success rates
  • 💾 Detailed Output: Individual files for each reasoning step with descriptive names
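
Pairing attention maps with reasoning milestones requires splitting the model's response into per-step segments. A hedged sketch of that splitting, assuming a "Step N:" output format (the actual format TRACE emits may differ):

```python
import re

# Split a Chain-of-Reasoning response into milestone segments so each can be
# paired with its attention maps. The "Step N:" markers are an assumed format.
def split_milestones(response):
    """Return {step_title: full segment text} for segments opened by 'Step N:'."""
    parts = re.split(r"Step \d+: ", response)
    titles = re.findall(r"Step \d+: ([^.\n]+)", response)
    return dict(zip(titles, [p.strip() for p in parts[1:]]))

response = (
    "Step 1: Identify Reference Object. The plate sits on the table.\n"
    "Step 2: Define Target Area. The vacant region left of the plate.\n"
    "Step 3: Generate Output. [(0.32, 0.61)]"
)
for title, text in split_milestones(response).items():
    print(title, "->", text)
```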

Key Parameters:

  • --model-path: Path to TRACE model
  • --model-base: Base model path
  • --dataset-dir: Dataset directory (expects images/ and masks/ subdirectories)
  • --output-dir: Output directory for all visualizations and analysis
  • --start-idx/--end-idx: Process specific range of images
  • --resume: Resume from existing results

💡 Key Finding: The attention analysis reveals that TRACE exhibits diffuse attention during the initial steps (reference identification, target definition) but concentrated attention during goal subtype determination. During final coordinate generation there is minimal visual attention, indicating that the model relies primarily on its completed textual reasoning chain rather than on continuous visual grounding, which demonstrates the effectiveness of the Chain of Reasoning approach.

📄 Citation


If you find this work helpful, please consider citing:

@misc{park2025tracetextualreasoningaffordance,
  title={TRACE: Textual Reasoning for Affordance Coordinate Extraction},
  author={Sangyun Park and Jin Kim and Yuchen Cui and Matthew S. Brown},
  year={2025},
  eprint={2511.01999},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2511.01999}
}

Note: This work builds upon the foundation of RoboPoint (Yuan et al., 2024) and represents a significant extension with Chain of Reasoning capabilities.

📚 Reference to Original Foundation:

@article{yuan2024robopoint,
  title={RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics},
  author={Yuan, Wentao and Duan, Jiafei and Blukis, Valts and Pumacay, Wilbert and Krishna, Ranjay and Murali, Adithyavairavan and Mousavian, Arsalan and Fox, Dieter},
  journal={arXiv preprint arXiv:2406.10721},
  year={2024}
}

Acknowledgements

This work was initially inspired by RoboPoint (Yuan et al., 2024). We thank the original authors for their open-source contribution.

  • RoboPoint: Initial foundation that inspired our Chain of Reasoning approach
  • LLaVA: Visual instruction tuning pipeline and multimodal architecture

Limitations and Future Work

While TRACE demonstrates significant improvements in spatial affordance prediction, some limitations remain:

  • Synthetic Reasoning: The reasoning chains are programmatically generated and may not capture the full complexity of human spatial reasoning
  • No Confidence Estimates: Like RoboPoint, TRACE doesn't provide confidence scores for predicted points
  • Fixed Output Structure: The number of output points is not controllable
  • Attention Control: While attention analysis provides insights, the model lacks explicit mechanisms to control the attention process

Future Directions:

  • Extending CoR to multi-step manipulation and navigation tasks
  • Incorporating human-generated reasoning examples
  • Adding confidence estimation and controllable output generation
  • Exploring more sophisticated reasoning structures for complex spatial relationships
