@ping-Toven

Summary

Add a new TauBench Verified benchmark suite by mirroring the existing TauBench integration and wiring it through registry, config, and docs metadata.

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Changes Made

  • Added tau_bench_verified_{retail,airline,telecom} evals plus dataset/solver/scorer wrappers following the existing TauBench file pattern.
  • Extended the TauBench dataset and solver internals to source verified data and to switch TAU2_DATA_DIR safely on a per-sample basis.
  • Registered the benchmark metadata and eval group, added dependency group entries, regenerated the docs benchmark snippet, and added dataset unit tests.
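
For reviewers unfamiliar with the per-sample TAU2_DATA_DIR switching mentioned above, here is a minimal sketch of one way such a switch could be made safe. It is not the code in this PR. The helper name `temporary_env` and the lock are hypothetical, and the sketch assumes TAU2_DATA_DIR is read from the process environment and that samples run in threads rather than as interleaved async tasks.

```python
import os
import threading
from contextlib import contextmanager
from typing import Iterator

# Hypothetical module-level lock: the process environment is global state,
# so concurrent samples must not change it at the same time.
# RLock allows nested use from the same thread without deadlocking.
_ENV_LOCK = threading.RLock()


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable for the duration of a block, then restore it."""
    with _ENV_LOCK:
        previous = os.environ.get(name)
        os.environ[name] = value
        try:
            yield
        finally:
            # Restore the prior state even if the block raised.
            if previous is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = previous


# Usage: point a single sample at a (hypothetical) verified data directory.
with temporary_env("TAU2_DATA_DIR", "/tmp/tau2-verified"):
    current = os.environ["TAU2_DATA_DIR"]
```

Holding the lock for the whole block serializes any work that depends on the variable. That is the simplest way to avoid one sample seeing another sample's directory, at the cost of parallelism inside the guarded section.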

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Related Issues

N/A

Additional Context

Verified TauBench data can be overridden via OPENBENCH_TAU2_VERIFIED_DATA_DIR when needed.
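
As an illustration of the override described above, the sketch below shows how a lookup with a fallback might work. It assumes OPENBENCH_TAU2_VERIFIED_DATA_DIR is read as an environment variable. The function name `resolve_verified_data_dir` and the default path are hypothetical and are not taken from this PR.

```python
import os
from pathlib import Path

ENV_VAR = "OPENBENCH_TAU2_VERIFIED_DATA_DIR"


def resolve_verified_data_dir(default_dir: Path) -> Path:
    """Return the override directory if the variable is set, else the default."""
    override = os.environ.get(ENV_VAR, "").strip()
    if override:
        # Expand "~" so that home-relative overrides work as expected.
        return Path(override).expanduser()
    return default_dir


# Usage with a hypothetical default location.
data_dir = resolve_verified_data_dir(Path("/opt/openbench/tau2_verified"))
```

Treating an empty or whitespace-only value as unset keeps a stray `export OPENBENCH_TAU2_VERIFIED_DATA_DIR=` from silently redirecting the data lookup to the current directory.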

@ping-Toven merged commit e783d94 into main on Feb 9, 2026
10 checks passed