A Vision-Language Model that enhances spatial affordance prediction through explicit textual Chain of Reasoning (CoR)
🧠 Chain of Reasoning • 🎯 Spatial Precision • 🤖 Vision-Language Model • 👁️ Attention Visualization
📍 TRACE's multi-step reasoning pipeline: Given an image and a natural language instruction, the system determines the goal subtype, establishes relevant reference surfaces, defines target areas through explicit reasoning, and generates precise normalized coordinates.
TRACE (Textual Reasoning for Affordance Coordinate Extraction) is an enhanced Vision-Language Model that predicts image keypoint affordances by integrating an explicit textual Chain of Reasoning (CoR) into the spatial affordance prediction process. Building upon the RoboPoint framework, TRACE teaches models not only to predict precise spatial coordinates but also to articulate the reasoning behind their predictions.
- Unlike visual CoT methods that generate intermediate images, TRACE uses lightweight textual reasoning that leverages the VLM's native linguistic capabilities
- 200,000 training samples with programmatically generated explicit reasoning steps
- Achieves 48.1% accuracy on the challenging Where2Place benchmark, a 9.6% relative improvement over the original RoboPoint model
- Provides clear rationales for why specific spatial locations are selected
💡 Key Insight: The approach addresses the critical gap between high-level reasoning and the precise, low-level spatial understanding required for physical manipulation. As shown in the figure above, our model follows a multi-step reasoning process that first determines the goal subtype, establishes relevant reference surfaces, defines target areas, and finally generates normalized coordinates.
- ⚙️ Install
- 📦 Data Samples
- 🏗️ Dataset Construction
- 🏋️ Training Configuration
- 🎯 Model Weights
- 📊 Evaluation
- 🎨 Visualization
- 👁️ Attention Analysis
Run `./environment_setup.sh` or follow the instructions below in order.
conda create -n trace python=3.10 -y
conda activate trace
pip install --upgrade pip # enable PEP 660 support
# this is optional if you prefer to use the system's built-in nvcc
conda install -c nvidia cuda=12.1 -y
pip install -e .
# this is optional if you don't need to train the model
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
We provide 50 carefully selected samples from our TRACE dataset for review purposes:
- Samples demonstrating spatial reasoning for object-relative positioning
- Samples showing vacant area identification with reasoning chains
📍 Location: `data_samples/sample_50_cor_data.zip`
📋 Contents:
- ✅ Input images and natural language instructions
- 🧠 Complete Chain of Reasoning (CoR) explanations
- 🎯 Ground-truth coordinate annotations
- 📊 Comparison with baseline methods
💡 These samples illustrate the key contribution of our work: explicit textual reasoning that justifies spatial coordinate predictions.
The complete TRACE reasoning dataset (200,000 samples) is now publicly available on 🤗 Hugging Face: jink-ucla/TRACE
- ✅ Complete training and evaluation splits
- 🧠 All Chain of Reasoning annotations
- 📖 Detailed dataset documentation and usage examples
- 📊 Comparison baselines and evaluation metrics
The TRACE dataset consists of 200,000 training samples created by enhancing the RoboPoint data generation pipeline. The dataset is composed of two data sources:
- 100,000 novel reasoning-augmented samples with explicit textual Chain of Reasoning (CoR)
- 100,000 standard visual instruction-tuning samples from LVIS and VQA datasets
The key innovation is the programmatic generation of explicit textual reasoning steps using the Gemini API, which breaks down the spatial reasoning process into interpretable steps.
Each data sample includes:
- Input image and natural language instruction
- Multi-step textual reasoning process
- Final normalized 2D coordinates:
{(x_i, y_i) | x_i, y_i ∈ [0, 1]}
Example reasoning structure:
- Goal Subtype Identification: Determine if the task involves placement affordance, reference object identification, etc.
- Reference Surface Establishment: Identify the relevant surface or area in the image
- Target Area Definition: Define the specific region based on the instruction
- Coordinate Generation: Output precise normalized coordinates
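To make the sample format concrete, here is a minimal sketch of what one reasoning-augmented record might look like, along with a helper that maps the normalized coordinates back to pixels. The field names (`instruction`, `reasoning`, `points`) and the helper are illustrative assumptions, not the dataset's exact schema.

```python
# Hypothetical TRACE sample; real field names in the released dataset may differ.
sample = {
    "instruction": "Place the cup in the free space to the left of the plate",
    "reasoning": [
        "Goal subtype: placement affordance (find free space).",
        "Reference surface: the table top supporting the plate.",
        "Target area: vacant region immediately left of the plate.",
    ],
    "points": [[0.32, 0.61], [0.35, 0.58]],  # normalized (x, y) in [0, 1]
}

def denormalize(points, width, height):
    """Convert normalized (x, y) pairs to integer pixel coordinates."""
    return [(round(x * (width - 1)), round(y * (height - 1))) for x, y in points]

print(denormalize(sample["points"], 640, 480))  # -> [(204, 292), (224, 278)]
```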
Model Architecture:
- Base LLM: Vicuna-v1.5-13B
- Vision Encoder: CLIP-ViT-Large-Patch14-336 (penultimate layer features)
- Projector: 2-layer MLP with GELU activation
- Optimization: Flash Attention 2 for memory efficiency
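The projector above (a 2-layer MLP with GELU) can be sketched in plain Python; the dimensions and weights here are toy values for illustration, not the actual CLIP-to-Vicuna shapes used by TRACE.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp_projector(v, w1, b1, w2, b2):
    """2-layer MLP: linear -> GELU -> linear, mapping a vision feature
    vector into the LLM embedding space (toy dimensions)."""
    h = [gelu(sum(wi * xi for wi, xi in zip(row, v)) + b)
         for row, b in zip(w1, b1)]
    return [sum(wi * hi for wi, hi in zip(row, h)) + b
            for row, b in zip(w2, b2)]
```

In the actual model this projection is applied to the penultimate-layer CLIP patch features before they are concatenated with the text tokens.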
Training Setup:
For 13B Model (Main Results):
- Optimization Method: Full fine-tuning (FFT) for maximum performance
- Optimizer: AdamW with learning rate 2×10⁻⁶
- Scheduler: Cosine annealing with 3% warmup
- Duration: 1 epoch on TRACE dataset
For 7B Model (Ablations & Analysis):
- Optimization Method: Low-Rank Adaptation (LoRA) with rank r=128, α=256
- Optimizer: AdamW with learning rate 2×10⁻⁶
- Scheduler: Cosine annealing with 3% warmup
- Precision: bfloat16 mixed-precision with gradient checkpointing
- Duration: 1 epoch on TRACE dataset
Data Processing Optimizations:
- Lazy preprocessing for memory efficiency
- Square aspect ratio padding for uniform input
- Grouping by modality length to minimize padding
- 12 dataloader workers to prevent bottlenecks
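The learning-rate schedule described above (cosine annealing with 3% warmup, peak LR 2×10⁻⁶) can be written as a step-to-LR function. This is a sketch of the schedule's shape, not the trainer's exact implementation.

```python
import math

def lr_at(step, total_steps, base_lr=2e-6, warmup_frac=0.03):
    """Cosine annealing with linear warmup: LR ramps linearly to base_lr
    over the first 3% of steps, then decays to 0 along a cosine curve."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * step / warmup                      # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```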
| Model | Type | Training | Size | Performance | Download |
|---|---|---|---|---|---|
| TRACE-13B | Fine-tuned | Full Fine-tuning | 13B | 48.1% W2P | Coming Soon |
| TRACE-7B | Fine-tuned | LoRA (r=128) | 7B | Used for analysis | Coming Soon |
| Vicuna-v1.5-13B | Base Model | Pre-trained | 13B | Required for TRACE-13B | Coming Soon |
| Vicuna-v1.5-7B | Base Model | Pre-trained | 7B | Required for TRACE-7B | Coming Soon |
💡 Note: Model weights will be made publicly available soon. Links will be updated in the table above.
We evaluate TRACE on challenging spatial affordance prediction benchmarks:
| Model | RoboRefIt | Where2Place (W2P) | W2P (hard) |
|---|---|---|---|
| 🎯 RoboPoint(FFT)+TRACE | 🏆 42.9% ± 0.8 | 🏆 48.1% ± 0.1 | 🏆 55.0% ± 3.5 |
| RoboPoint(FFT) | 41.7% ± 0.6 | 43.9% ± 0.6 | 46.9% ± 4.2 |
| 🎯 RoboPoint(LoRA)+TRACE | 🏆 48.1% ± 2.8 | 🏆 43.7% ± 4.1 | 🏆 41.2% ± 7.3 |
| RoboPoint(LoRA) | 40.6% ± 3.0 | 36.1% ± 1.3 | 30.7% ± 0.2 |
| SpaceLLaVA | 20.0% ± 0.5 | 15.0% ± 1.6 | 13.6% ± 2.1 |
| GPT-4o | 6.5% ± 0.8 | 18.7% ± 2.6 | 17.8% ± 4.8 |
| Gemini | 5.2% ± 0.1 | 7.8% ± 0.2 | 6.6% ± 0.2 |
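For Where2Place, accuracy is commonly measured as the fraction of predicted points that land inside the ground-truth target-region mask. The sketch below illustrates that computation with stdlib-only code; the helper name and mask representation are assumptions, not the repository's `summarize_vqa.py`.

```python
def point_accuracy(points, mask):
    """Fraction of predicted normalized (x, y) points that fall inside the
    ground-truth mask. `mask` is a 2D list of 0/1 values, rows indexed by y."""
    h, w = len(mask), len(mask[0])
    hits = 0
    for x, y in points:
        px = min(int(x * w), w - 1)   # map normalized x to a column index
        py = min(int(y * h), h - 1)   # map normalized y to a row index
        hits += mask[py][px]
    return hits / len(points)
```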
To evaluate on Where2Place:
# Generate results
python robopoint/eval/model_vqa.py \
--model-path trace-v1-vicuna-v1.5-13b \
--image-folder datasets/where2place/images \
--question-file datasets/where2place/point_questions.jsonl \
--answer-file output/trace-v1-vicuna-v1.5-13b.jsonl
# Compute accuracy
python robopoint/eval/summarize_vqa.py --answer output/trace-v1-vicuna-v1.5-13b.jsonl

TRACE includes comprehensive visualization tools to analyze model predictions and Chain of Reasoning outputs:
# Visualize model comparisons with reasoning analysis
python visualization/visualize_results.py \
--answer-files output/robopoint-baseline.jsonl output/trace-v1-vicuna-v1.5-13b.jsonl \
--labels robopoint trace \
--data-dir datasets/where2place/images \
--output output/visualization_results \
--num 10

Parameter Explanation:
- `--answer-files`: Model output files from `model_vqa.py`
  - TRACE answer file: contains reasoning chains + coordinates
  - Baseline file: contains coordinates only
- `--labels`: Labels for each model in the visualization plots
- `--data-dir`: Benchmark dataset location (images + ground-truth masks)
- `--output`: Directory where visualization results will be saved
- `--num`: Number of samples to visualize
Visualization Features:
- 🎯 Coordinate Prediction Overlay: Visual comparison of predicted vs ground-truth points
- 🧠 Chain of Reasoning Display: Step-by-step reasoning process visualization
- 📊 Model Comparison: Side-by-side comparison of different model outputs
- 🔍 Error Analysis: Detailed analysis of prediction accuracy and failure cases
TRACE provides unique insights into the model's reasoning process through comprehensive attention visualization and batch processing capabilities:
# Batch process Where2Place dataset with reasoning milestone attention
CUDA_VISIBLE_DEVICES=7 python visualization/attention_map.py \
--model-path [MODEL_WEIGHTS_PLACEHOLDER] \
--model-base [BASE_MODEL_PLACEHOLDER] \
--dataset-dir [DATASET_PLACEHOLDER] \
--output-dir where2place_individual_results \
--start-idx 0 --end-idx 25

Attention Analysis Features:
- 🔍 Multi-step Attention Tracking: Visualize how attention changes during each reasoning milestone:
- Identify Reference Object - Initial context establishment
- Define Target Area - Spatial area definition
- Determine Goal Subtype - Task classification (critical reasoning step)
- Generate Output - Coordinate generation
- Final Answer - Complete response with overlays
- 📊 Comprehensive Visualizations:
- Individual milestone images with transparent attention overlays
- Combined milestone progression visualization
- Ground truth mask overlays (cyan)
- Predicted coordinate points (red dots)
- 🚀 Batch Processing: Process entire datasets with statistical analysis
- 🎯 Interactive Dashboard: Summary statistics and success rates
- 💾 Detailed Output: Individual files for each reasoning step with descriptive names
Key Parameters:
- `--model-path`: Path to TRACE model
- `--model-base`: Base model path
- `--dataset-dir`: Dataset directory (expects `images/` and `masks/` subdirectories)
- `--output-dir`: Output directory for all visualizations and analysis
- `--start-idx` / `--end-idx`: Process a specific range of images
- `--resume`: Resume from existing results
💡 Key Finding: The attention analysis reveals that TRACE exhibits diffuse attention during the initial steps (reference identification, target definition) but concentrated attention during goal subtype determination. During final coordinate generation there is minimal visual attention, indicating that the model relies primarily on its completed textual reasoning chain rather than on continuous visual grounding, which demonstrates the effectiveness of the Chain of Reasoning approach.
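The diffuse-vs-concentrated pattern described above can be quantified with the entropy of each milestone's attention distribution over image patches: uniform (diffuse) attention has maximal entropy, while a sharp peak has entropy near zero. This is a stdlib sketch of that idea; the repository's `attention_map.py` may measure concentration differently.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (nats) of an attention distribution over image patches.
    Lower entropy means more concentrated attention."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)
```

Comparing this value across the five reasoning milestones would show a dip at the goal-subtype step, matching the qualitative finding above.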
If you find this work helpful, please consider citing:
@misc{park2025tracetextualreasoningaffordance,
title={TRACE: Textual Reasoning for Affordance Coordinate Extraction},
author={Sangyun Park and Jin Kim and Yuchen Cui and Matthew S. Brown},
year={2025},
eprint={2511.01999},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.01999}
}

Note: This work builds upon the foundation of RoboPoint (Yuan et al., 2024) and represents a significant extension with Chain of Reasoning capabilities.
📚 Reference to Original Foundation:
@article{yuan2024robopoint,
title={RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics},
author={Yuan, Wentao and Duan, Jiafei and Blukis, Valts and Pumacay, Wilbert and Krishna, Ranjay and Murali, Adithyavairavan and Mousavian, Arsalan and Fox, Dieter},
journal={arXiv preprint arXiv:2406.10721},
year={2024}
}

This work was initially inspired by RoboPoint (Yuan et al., 2024). We thank the original authors for their open-source contribution.
- RoboPoint: Initial foundation that inspired our Chain of Reasoning approach
- LLaVA: Visual instruction tuning pipeline and multimodal architecture
While TRACE demonstrates significant improvements in spatial affordance prediction, some limitations remain:
- Synthetic Reasoning: The reasoning chains are programmatically generated and may not capture the full complexity of human spatial reasoning
- No Confidence Estimates: Like RoboPoint, TRACE doesn't provide confidence scores for predicted points
- Fixed Output Structure: The number of output points is not controllable
- Attention Control: While attention analysis provides insights, the model lacks explicit mechanisms to control the attention process
Future Directions:
- Extending CoR to multi-step manipulation and navigation tasks
- Incorporating human-generated reasoning examples
- Adding confidence estimation and controllable output generation
- Exploring more sophisticated reasoning structures for complex spatial relationships
