
Reference Implementation 6 — an agentic Visual Question Answering evaluation framework #20

Open
aravind-3105 wants to merge 3 commits into main from ref-imp-6-agentic

Conversation


@aravind-3105 commented Mar 6, 2026

Summary

Adds Reference Implementation 6 — an agentic Visual Question Answering evaluation framework for the ChartQAPro dataset, built on CrewAI. The framework decomposes chart QA into an explicit Plan → OCR → Inspect → Verify loop, producing fully traceable Model Evaluation Packets (MEPs) per sample. Includes multi-pass evaluation, failure taxonomy, HTML reporting, a Streamlit dashboard, and optional Opik observability integration.
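The Plan → OCR → Inspect → Verify loop described above can be sketched roughly as follows. This is an illustrative outline only: the function names, the `Trace` container, and the stub return values are hypothetical, and the actual implementation wires these stages together as CrewAI agents and tools rather than plain functions.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four-stage loop; stage names follow the PR
# description (Plan -> OCR -> Inspect -> Verify), but all identifiers here
# are invented for illustration.

@dataclass
class Trace:
    plan: str = ""
    ocr_text: str = ""
    answer: str = ""
    verdict: str = ""
    log: list = field(default_factory=list)

def plan_question(question: str) -> str:
    # Stage 1: a text-only planner LLM decomposes the chart question.
    return f"steps-for: {question}"

def read_chart(chart: str) -> str:
    # Stage 2: optional OCR pre-read of the chart image.
    return f"ocr({chart})"

def inspect(chart: str, plan: str, ocr_text: str) -> str:
    # Stage 3: a multimodal vision agent answers using plan + OCR context.
    return "42"

def verify(question: str, answer: str) -> str:
    # Stage 4: a VLM critique of the candidate answer ("Pass 2.5").
    return "accept"

def evaluate_sample(question: str, chart: str) -> Trace:
    # Run all four stages and record them, yielding a traceable record
    # in the spirit of a Model Evaluation Packet.
    t = Trace()
    t.plan = plan_question(question)
    t.log.append("plan")
    t.ocr_text = read_chart(chart)
    t.log.append("ocr")
    t.answer = inspect(chart, t.plan, t.ocr_text)
    t.log.append("inspect")
    t.verdict = verify(question, t.answer)
    t.log.append("verify")
    return t
```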

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Added implementations/agentic_vqa_eval/ — full reference implementation for agentic ChartQAPro evaluation
  • Implemented multi-agent pipeline: PlannerAgent (text-only LLM) → OcrReaderTool (optional perception pre-read) → VisionAgent (multimodal LLM + tool) → VerifierAgent (Pass 2.5 — VLM critique)
  • Introduced the MEP schema (mep/schema.py, mep/writer.py) — portable JSON trace capturing plan, OCR, vision, verifier output, tool logs, timestamps, and errors
  • Added multi-pass evaluation suite: eval_outputs.py (accuracy + LLM-as-judge), eval_traces.py (latency/replayability), eval_topk.py (hit@1/2/3), error_taxonomy.py (VLM-based failure classification)
  • Added eval/report.py (self-contained HTML report) and eval/dashboard.py (Streamlit sample browser)
  • Added Opik observability integration (opik_integration/) — trace viewer, prompt versioning, dataset registration, experiment comparison; fully optional via OPIK_URL_OVERRIDE
  • Added pyproject.toml with project dependencies, ruff/mypy/pytest config aligned with repo conventions
  • Added run_pipeline.ipynb (end-to-end execution) and analysis.ipynb (results visualization) notebooks
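A minimal sketch of what an MEP record might look like, based on the fields listed above (plan, OCR, vision, verifier output, tool logs, timestamps, errors). The real schema lives in `mep/schema.py` and its field names may differ; everything here is an assumption for illustration.

```python
import json
import time
from dataclasses import dataclass, field, asdict

# Hypothetical MEP record; field names are guessed from the PR description,
# not taken from mep/schema.py.

@dataclass
class MEP:
    sample_id: str
    plan: str = ""
    ocr: str = ""
    vision_answer: str = ""
    verifier_output: str = ""
    tool_logs: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    created_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        # Portable JSON trace, one per evaluated sample.
        return json.dumps(asdict(self), indent=2)
```

A writer component (analogous to `mep/writer.py`) would then serialize one such packet per sample to disk for the downstream eval passes.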

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check <src_dir>)
  • Manual testing performed (describe below)

Manual testing details:
Pipeline run end-to-end on the ChartQAPro test split (25 samples) using the openai_openai config. MEPs were generated, all eval passes executed, and the summary CSV and HTML report produced. The OCR ablation (--no_ocr) and verifier skip (--no_verifier) flags were verified. Opik tracing was validated against a self-hosted Docker stack.
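The hit@1/2/3 metric computed by eval_topk.py can be illustrated with a generic sketch (this is not the module's actual code; the function names and input shape are assumptions):

```python
def hit_at_k(ranked_answers: list, gold: str, k: int) -> bool:
    # True if the gold answer appears among the top-k ranked candidates.
    return gold in ranked_answers[:k]

def topk_accuracy(samples: list, k: int) -> float:
    # samples: list of (ranked_candidate_answers, gold_answer) pairs.
    # Returns the fraction of samples whose gold answer is in the top-k.
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in samples)
    return hits / len(samples) if samples else 0.0
```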

Screenshots/Recordings

Related Issues

Deployment Notes

  • Requires OPENAI_API_KEY and/or GEMINI_API_KEY in .env
  • Install from implementations/agentic_vqa_eval/ via uv sync, then activate with source .venv/bin/activate
  • Opik is optional — set OPIK_URL_OVERRIDE=http://localhost:5173/api to enable; omit to disable silently
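The deployment steps above, collected into one shell sequence. The key values are placeholders, and the `.env` layout is an assumption; only the commands and variable names stated in the notes are taken from this PR.

```shell
cd implementations/agentic_vqa_eval/

# Provide model credentials (either or both keys; values are placeholders).
cat > .env <<'EOF'
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
EOF

# Install dependencies and activate the project environment.
uv sync
source .venv/bin/activate

# Optional: enable Opik tracing against a self-hosted stack.
# Omit this variable to disable Opik silently.
export OPIK_URL_OVERRIDE=http://localhost:5173/api
```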

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

aravind-3105 self-assigned this Mar 6, 2026
aravind-3105 added the enhancement (New feature or request) label Mar 6, 2026
