A Vision-Language Model that enhances spatial affordance prediction through explicit textual Chain of Reasoning (CoR)
🧠 Chain of Reasoning • 🎯 Spatial Precision • 🤖 Vision-Language Model • 👁️ Attention Visualization
📍 TRACE's multi-step reasoning pipeline: Given an image and a natural language instruction, the system determines the goal subtype, establishes relevant reference surfaces, defines target areas through explicit reasoning, and generates precise normalized coordinates.
TRACE (Textual Reasoning for Affordance Coordinate Extraction) is an enhanced Vision-Language Model that predicts image keypoint affordances by integrating an explicit textual Chain of Reasoning (CoR) into the spatial affordance prediction process. Building upon the RoboPoint framework, TRACE teaches models not only to predict precise spatial coordinates but also to articulate the reasoning behind their predictions.
- Unlike visual CoT methods that generate intermediate images, TRACE uses lightweight textual reasoning that leverages the VLM's native linguistic capabilities
- 200,000 training samples with programmatically generated explicit reasoning steps
- Achieves 48.1% accuracy on the challenging Where2Place benchmark, a 9.6% relative improvement over the original RoboPoint model
- Provides clear rationales for why specific spatial locations are selected
💡 Key Insight: The approach addresses the critical gap between high-level reasoning and the precise, low-level spatial understanding required for physical manipulation. As shown in the figure above, our model follows a multi-step reasoning process that first determines the goal subtype, establishes relevant reference surfaces, defines target areas, and finally generates normalized coordinates.
- ⚙️ Install
- 📦 Data Samples
- 🏗️ Dataset Construction
- 🏋️ Training Configuration
- 🎯 Model Weights
- 📊 Evaluation
- 🎨 Visualization
- 👁️ Attention Analysis
Run `./environment_setup.sh` or follow the instructions below in order.
conda create -n trace python=3.10 -y
conda activate trace
pip install --upgrade pip # enable PEP 660 support
# this is optional if you prefer to use the system's built-in nvcc
conda install -c nvidia cuda=12.1 -y
pip install -e .
# this is optional if you don't need to train the model
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
We provide 50 carefully selected samples from our TRACE dataset for review purposes:
- Samples demonstrating spatial reasoning for object-relative positioning
- Samples showing vacant area identification with reasoning chains
📍 Location: `data_samples/sample_50_cor_data.zip`
📋 Contents:
- ✅ Input images and natural language instructions
- 🧠 Complete Chain of Reasoning (CoR) explanations
- 🎯 Ground-truth coordinate annotations
- 📊 Comparison with baseline methods
💡 These samples illustrate the key contribution of our work: explicit textual reasoning that justifies spatial coordinate predictions.
The complete TRACE reasoning dataset (200,000 samples) is now publicly available on 🤗 Hugging Face: jink-ucla/TRACE
- ✅ Complete training and evaluation splits
- 🧠 All Chain of Reasoning annotations
- 📖 Detailed dataset documentation and usage examples
- 📊 Comparison baselines and evaluation metrics
The TRACE dataset consists of 200,000 training samples created by enhancing the RoboPoint data generation pipeline. The dataset is composed of two data sources:
- 100,000 novel reasoning-augmented samples with explicit textual Chain of Reasoning (CoR)
- 100,000 standard visual instruction-tuning samples from LVIS and VQA datasets
The key innovation is the programmatic generation of explicit textual reasoning steps using the Gemini API, which breaks down the spatial reasoning process into interpretable steps.
Each data sample includes:
- Input image and natural language instruction
- Multi-step textual reasoning process
- Final normalized 2D coordinates:
{(x_i, y_i) | x_i, y_i ∈ [0, 1]}
Example reasoning structure:
- Goal Subtype Identification: Determine if the task involves placement affordance, reference object identification, etc.
- Reference Surface Establishment: Identify the relevant surface or area in the image
- Target Area Definition: Define the specific region based on the instruction
- Coordinate Generation: Output precise normalized coordinates
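To make the sample format concrete, here is a minimal sketch of what one reasoning-augmented record might look like, along with a helper that maps the normalized coordinates back to pixels. The field names (`instruction`, `reasoning`, `points`) and the helper are illustrative assumptions, not the dataset's exact schema.

```python
# Hypothetical TRACE sample; real field names in the released dataset may differ.
sample = {
    "instruction": "Place the cup in the free space to the left of the plate",
    "reasoning": [
        "Goal subtype: placement affordance (find free space).",
        "Reference surface: the table top supporting the plate.",
        "Target area: vacant region immediately left of the plate.",
    ],
    "points": [[0.32, 0.61], [0.35, 0.58]],  # normalized (x, y) in [0, 1]
}

def denormalize(points, width, height):
    """Convert normalized (x, y) pairs to integer pixel coordinates."""
    return [(round(x * (width - 1)), round(y * (height - 1))) for x, y in points]

print(denormalize(sample["points"], 640, 480))  # -> [(204, 292), (224, 278)]
```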
Model Architecture:
- Base LLM: Vicuna-v1.5-13B
- Vision Encoder: CLIP-ViT-Large-Patch14-336 (penultimate layer features)
- Projector: 2-layer MLP with GELU activation
- Optimization: Flash Attention 2 for memory efficiency
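The projector above (a 2-layer MLP with GELU) can be sketched in plain Python; the dimensions and weights here are toy values for illustration, not the actual CLIP-to-Vicuna shapes used by TRACE.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp_projector(v, w1, b1, w2, b2):
    """2-layer MLP: linear -> GELU -> linear, mapping a vision feature
    vector into the LLM embedding space (toy dimensions)."""
    h = [gelu(sum(wi * xi for wi, xi in zip(row, v)) + b)
         for row, b in zip(w1, b1)]
    return [sum(wi * hi for wi, hi in zip(row, h)) + b
            for row, b in zip(w2, b2)]
```

In the actual model this projection is applied to the penultimate-layer CLIP patch features before they are concatenated with the text tokens.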
Training Setup:
For 13B Model (Main Results):
- Optimization Method: Full fine-tuning (FFT) for maximum performance
- Optimizer: AdamW with learning rate 2×10⁻⁶
- Scheduler: Cosine annealing with 3% warmup
- Duration: 1 epoch on TRACE dataset
For 7B Model (Ablations & Analysis):
- Optimization Method: Low-Rank Adaptation (LoRA) with rank r=128, α=256
- Optimizer: AdamW with learning rate 2×10⁻⁶
- Scheduler: Cosine annealing with 3% warmup
- Precision: bfloat16 mixed-precision with gradient checkpointing
- Duration: 1 epoch on TRACE dataset
Data Processing Optimizations:
- Lazy preprocessing for memory efficiency
- Square aspect ratio padding for uniform input
- Grouping by modality length to minimize padding
- 12 dataloader workers to prevent bottlenecks
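The learning-rate schedule described above (cosine annealing with 3% warmup, peak LR 2×10⁻⁶) can be written as a step-to-LR function. This is a sketch of the schedule's shape, not the trainer's exact implementation.

```python
import math

def lr_at(step, total_steps, base_lr=2e-6, warmup_frac=0.03):
    """Cosine annealing with linear warmup: LR ramps linearly to base_lr
    over the first 3% of steps, then decays to 0 along a cosine curve."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * step / warmup                      # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```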
| Model | Type | Training | Size | Performance | Download |
|---|---|---|---|---|---|
| TRACE-13B | Fine-tuned | Full Fine-tuning | 13B | 48.1% W2P | Coming Soon |
| TRACE-7B | Fine-tuned | LoRA (r=128) | 7B | Used for analysis | Coming Soon |
| Vicuna-v1.5-13B | Base Model | Pre-trained | 13B | Required for TRACE-13B | Coming Soon |
| Vicuna-v1.5-7B | Base Model | Pre-trained | 7B | Required for TRACE-7B | Coming Soon |
💡 Note: Model weights will be made publicly available soon. Links will be updated in the table above.
We evaluate TRACE on challenging spatial affordance prediction benchmarks:
| Model | RoboRefIt | Where2Place (W2P) | W2P (hard) |
|---|---|---|---|
| 🎯 RoboPoint(FFT)+TRACE | 🏆 42.9% ± 0.8 | 🏆 48.1% ± 0.1 | 🏆 55.0% ± 3.5 |
| RoboPoint(FFT) | 41.7% ± 0.6 | 43.9% ± 0.6 | 46.9% ± 4.2 |
| 🎯 RoboPoint(LoRA)+TRACE | 🏆 48.1% ± 2.8 | 🏆 43.7% ± 4.1 | 🏆 41.2% ± 7.3 |
| RoboPoint(LoRA) | 40.6% ± 3.0 | 36.1% ± 1.3 | 30.7% ± 0.2 |
| SpaceLLaVA | 20.0% ± 0.5 | 15.0% ± 1.6 | 13.6% ± 2.1 |
| GPT-4o | 6.5% ± 0.8 | 18.7% ± 2.6 | 17.8% ± 4.8 |
| Gemini | 5.2% ± 0.1 | 7.8% ± 0.2 | 6.6% ± 0.2 |
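For Where2Place, accuracy is commonly measured as the fraction of predicted points that land inside the ground-truth target-region mask. The sketch below illustrates that computation with stdlib-only code; the helper name and mask representation are assumptions, not the repository's `summarize_vqa.py`.

```python
def point_accuracy(points, mask):
    """Fraction of predicted normalized (x, y) points that fall inside the
    ground-truth mask. `mask` is a 2D list of 0/1 values, rows indexed by y."""
    h, w = len(mask), len(mask[0])
    hits = 0
    for x, y in points:
        px = min(int(x * w), w - 1)   # map normalized x to a column index
        py = min(int(y * h), h - 1)   # map normalized y to a row index
        hits += mask[py][px]
    return hits / len(points)
```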
To evaluate on Where2Place:
# Generate results
python robopoint/eval/model_vqa.py \
--model-path trace-v1-vicuna-v1.5-13b \
--image-folder datasets/where2place/images \
--question-file datasets/where2place/point_questions.jsonl \
--answer-file output/trace-v1-vicuna-v1.5-13b.jsonl
# Compute accuracy
python robopoint/eval/summarize_vqa.py --answer output/trace-v1-vicuna-v1.5-13b.jsonl

TRACE includes comprehensive visualization tools to analyze model predictions and Chain of Reasoning outputs:
# Visualize model comparisons with reasoning analysis
python visualization/visualize_results.py \
--answer-files output/robopoint-baseline.jsonl output/trace-v1-vicuna-v1.5-13b.jsonl \
--labels robopoint trace \
--data-dir datasets/where2place/images \
--output output/visualization_results \
--num 10

Parameter Explanation:
- `--answer-files`: Model output files from `model_vqa.py`
  - TRACE answer file: contains reasoning chains + coordinates
  - Baseline file: contains coordinates only
- `--labels`: Labels for each model in the visualization plots
- `--data-dir`: Benchmark dataset location (images + ground-truth masks)
- `--output`: Directory where visualization results will be saved
- `--num`: Number of samples to visualize
Visualization Features:
- 🎯 Coordinate Prediction Overlay: Visual comparison of predicted vs ground-truth points
- 🧠 Chain of Reasoning Display: Step-by-step reasoning process visualization
- 📊 Model Comparison: Side-by-side comparison of different model outputs
- 🔍 Error Analysis: Detailed analysis of prediction accuracy and failure cases
TRACE provides unique insights into the model's reasoning process through comprehensive attention visualization and batch processing capabilities:
# Batch process Where2Place dataset with reasoning milestone attention
CUDA_VISIBLE_DEVICES=7 python visualization/attention_map.py \
--model-path [MODEL_WEIGHTS_PLACEHOLDER] \
--model-base [BASE_MODEL_PLACEHOLDER] \
--dataset-dir [DATASET_PLACEHOLDER] \
--output-dir where2place_individual_results \
--start-idx 0 --end-idx 25

Attention Analysis Features:
- 🔍 Multi-step Attention Tracking: Visualize how attention changes during each reasoning milestone:
- Identify Reference Object - Initial context establishment
- Define Target Area - Spatial area definition
- Determine Goal Subtype - Task classification (critical reasoning step)
- Generate Output - Coordinate generation
- Final Answer - Complete response with overlays
- 📊 Comprehensive Visualizations:
- Individual milestone images with transparent attention overlays
- Combined milestone progression visualization
- Ground truth mask overlays (cyan)
- Predicted coordinate points (red dots)
- 🚀 Batch Processing: Process entire datasets with statistical analysis
- 🎯 Interactive Dashboard: Summary statistics and success rates
- 💾 Detailed Output: Individual files for each reasoning step with descriptive names
Key Parameters:
- `--model-path`: Path to TRACE model
- `--model-base`: Base model path
- `--dataset-dir`: Dataset directory (expects `images/` and `masks/` subdirectories)
- `--output-dir`: Output directory for all visualizations and analysis
- `--start-idx` / `--end-idx`: Process a specific range of images
- `--resume`: Resume from existing results
💡 Key Finding: The attention analysis reveals that TRACE exhibits diffuse attention during the initial steps (reference identification, target definition) but concentrated attention during goal subtype determination. During final coordinate generation there is minimal visual attention, indicating that the model relies primarily on its completed textual reasoning chain rather than on continuous visual grounding, which demonstrates the effectiveness of the Chain of Reasoning approach.
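The diffuse-vs-concentrated pattern described above can be quantified with the entropy of each milestone's attention distribution over image patches: uniform (diffuse) attention has maximal entropy, while a sharp peak has entropy near zero. This is a stdlib sketch of that idea; the repository's `attention_map.py` may measure concentration differently.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (nats) of an attention distribution over image patches.
    Lower entropy means more concentrated attention."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)
```

Comparing this value across the five reasoning milestones would show a dip at the goal-subtype step, matching the qualitative finding above.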
If you find this work helpful, please consider citing:
@misc{park2025tracetextualreasoningaffordance,
title={TRACE: Textual Reasoning for Affordance Coordinate Extraction},
author={Sangyun Park and Jin Kim and Yuchen Cui and Matthew S. Brown},
year={2025},
eprint={2511.01999},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.01999}
}

Note: This work builds upon the foundation of RoboPoint (Yuan et al., 2024) and represents a significant extension with Chain of Reasoning capabilities.
📚 Reference to Original Foundation:
@article{yuan2024robopoint,
title={RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics},
author={Yuan, Wentao and Duan, Jiafei and Blukis, Valts and Pumacay, Wilbert and Krishna, Ranjay and Murali, Adithyavairavan and Mousavian, Arsalan and Fox, Dieter},
journal={arXiv preprint arXiv:2406.10721},
year={2024}
}

This work was initially inspired by RoboPoint (Yuan et al., 2024). We thank the original authors for their open-source contribution.
- RoboPoint: Initial foundation that inspired our Chain of Reasoning approach
- LLaVA: Visual instruction tuning pipeline and multimodal architecture
While TRACE demonstrates significant improvements in spatial affordance prediction, some limitations remain:
- Synthetic Reasoning: The reasoning chains are programmatically generated and may not capture the full complexity of human spatial reasoning
- No Confidence Estimates: Like RoboPoint, TRACE doesn't provide confidence scores for predicted points
- Fixed Output Structure: The number of output points is not controllable
- Attention Control: While attention analysis provides insights, the model lacks explicit mechanisms to control the attention process
Future Directions:
- Extending CoR to multi-step manipulation and navigation tasks
- Incorporating human-generated reasoning examples
- Adding confidence estimation and controllable output generation
- Exploring more sophisticated reasoning structures for complex spatial relationships
