
Reference Implementation 6 — an agentic Visual Question Answering evaluation framework #20

Open
aravind-3105 wants to merge 3 commits into main from ref-imp-6-agentic

Conversation


@aravind-3105 commented Mar 6, 2026

Summary

Adds Reference Implementation 6 — an agentic Visual Question Answering evaluation framework for the ChartQAPro dataset, built on CrewAI. The framework decomposes chart QA into an explicit Plan → OCR → Inspect → Verify loop, producing fully traceable Model Evaluation Packets (MEPs) per sample. Includes multi-pass evaluation, failure taxonomy, HTML reporting, a Streamlit dashboard, and optional Opik observability integration.
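The Plan → OCR → Inspect → Verify loop described above can be sketched roughly as follows. This is an illustrative outline only: the function names, the `Trace` container, and the stub return values are hypothetical, and the actual implementation wires these stages together as CrewAI agents and tools rather than plain functions.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four-stage loop; stage names follow the PR
# description (Plan -> OCR -> Inspect -> Verify), but all identifiers here
# are invented for illustration.

@dataclass
class Trace:
    plan: str = ""
    ocr_text: str = ""
    answer: str = ""
    verdict: str = ""
    log: list = field(default_factory=list)

def plan_question(question: str) -> str:
    # Stage 1: a text-only planner LLM decomposes the chart question.
    return f"steps-for: {question}"

def read_chart(chart: str) -> str:
    # Stage 2: optional OCR pre-read of the chart image.
    return f"ocr({chart})"

def inspect(chart: str, plan: str, ocr_text: str) -> str:
    # Stage 3: a multimodal vision agent answers using plan + OCR context.
    return "42"

def verify(question: str, answer: str) -> str:
    # Stage 4: a VLM critique of the candidate answer ("Pass 2.5").
    return "accept"

def evaluate_sample(question: str, chart: str) -> Trace:
    # Run all four stages and record them, yielding a traceable record
    # in the spirit of a Model Evaluation Packet.
    t = Trace()
    t.plan = plan_question(question)
    t.log.append("plan")
    t.ocr_text = read_chart(chart)
    t.log.append("ocr")
    t.answer = inspect(chart, t.plan, t.ocr_text)
    t.log.append("inspect")
    t.verdict = verify(question, t.answer)
    t.log.append("verify")
    return t
```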

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Added implementations/agentic_vqa_eval/ — full reference implementation for agentic ChartQAPro evaluation
  • Implemented multi-agent pipeline: PlannerAgent (text-only LLM) → OcrReaderTool (optional perception pre-read) → VisionAgent (multimodal LLM + tool) → VerifierAgent (Pass 2.5 — VLM critique)
  • Introduced the MEP schema (mep/schema.py, mep/writer.py) — portable JSON trace capturing plan, OCR, vision, verifier output, tool logs, timestamps, and errors
  • Added multi-pass evaluation suite: eval_outputs.py (accuracy + LLM-as-judge), eval_traces.py (latency/replayability), eval_topk.py (hit@1/2/3), error_taxonomy.py (VLM-based failure classification)
  • Added eval/report.py (self-contained HTML report) and eval/dashboard.py (Streamlit sample browser)
  • Added Opik observability integration (opik_integration/) — trace viewer, prompt versioning, dataset registration, experiment comparison; fully optional via OPIK_URL_OVERRIDE
  • Added pyproject.toml with project dependencies, ruff/mypy/pytest config aligned with repo conventions
  • Added run_pipeline.ipynb (end-to-end execution) and analysis.ipynb (results visualization) notebooks
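A minimal sketch of what an MEP record might look like, based on the fields listed above (plan, OCR, vision, verifier output, tool logs, timestamps, errors). The real schema lives in `mep/schema.py` and its field names may differ; everything here is an assumption for illustration.

```python
import json
import time
from dataclasses import dataclass, field, asdict

# Hypothetical MEP record; field names are guessed from the PR description,
# not taken from mep/schema.py.

@dataclass
class MEP:
    sample_id: str
    plan: str = ""
    ocr: str = ""
    vision_answer: str = ""
    verifier_output: str = ""
    tool_logs: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    created_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        # Portable JSON trace, one per evaluated sample.
        return json.dumps(asdict(self), indent=2)
```

A writer component (analogous to `mep/writer.py`) would then serialize one such packet per sample to disk for the downstream eval passes.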

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check <src_dir>)
  • Manual testing performed (describe below)

Manual testing details:
Pipeline run end-to-end on the ChartQAPro test split (25 samples) using the openai_openai config. MEPs were generated, all eval passes executed, and the summary CSV and HTML report produced. The OCR ablation (--no_ocr) and verifier skip (--no_verifier) flags were verified. Opik tracing was validated against a self-hosted Docker stack.
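The hit@1/2/3 metric computed by eval_topk.py can be illustrated with a generic sketch (this is not the module's actual code; the function names and input shape are assumptions):

```python
def hit_at_k(ranked_answers: list, gold: str, k: int) -> bool:
    # True if the gold answer appears among the top-k ranked candidates.
    return gold in ranked_answers[:k]

def topk_accuracy(samples: list, k: int) -> float:
    # samples: list of (ranked_candidate_answers, gold_answer) pairs.
    # Returns the fraction of samples whose gold answer is in the top-k.
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in samples)
    return hits / len(samples) if samples else 0.0
```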

Screenshots/Recordings

Related Issues

Deployment Notes

  • Requires OPENAI_API_KEY and/or GEMINI_API_KEY in .env
  • Install from implementations/agentic_vqa_eval/ via uv sync, then activate with source .venv/bin/activate
  • Opik is optional — set OPIK_URL_OVERRIDE=http://localhost:5173/api to enable; omit to disable silently
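The deployment steps above, collected into one shell sequence. The key values are placeholders, and the `.env` layout is an assumption; only the commands and variable names stated in the notes are taken from this PR.

```shell
cd implementations/agentic_vqa_eval/

# Provide model credentials (either or both keys; values are placeholders).
cat > .env <<'EOF'
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
EOF

# Install dependencies and activate the project environment.
uv sync
source .venv/bin/activate

# Optional: enable Opik tracing against a self-hosted stack.
# Omit this variable to disable Opik silently.
export OPIK_URL_OVERRIDE=http://localhost:5173/api
```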

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

aravind-3105 self-assigned this Mar 6, 2026
aravind-3105 added the enhancement (New feature or request) label Mar 6, 2026
