Benchmarking Suite for spec-kit Across Multiple Tasks and LLMs #388

jaycangel · 2025-09-20T09:07:12Z

**Feature Request:**

I’d like to propose adding a comprehensive benchmarking suite for spec-kit that evaluates how well it supports spec-driven development (SDD) across multiple tasks and different large language models.

Motivation

Raw stats and general-purpose leaderboards alone aren’t very informative for spec-kit-specific use cases, because SDD’s systematic approach to planning, research, and task management drives different development behaviours and outcomes.

Proposal

Create a set of tasks with increasing complexity that represent realistic spec-driven development scenarios.
Have these tasks implemented by multiple LLMs via spec-kit, then run the results against automated, externally verifiable, specification-focused test cases (these are very different from unit tests).
Publish the results as part of spec-kit’s documentation or a dedicated leaderboard.
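
To make the proposal a bit more concrete, here is a minimal sketch of what such a harness could look like. Everything in it is hypothetical: `BenchmarkTask`, `SpecCheck`, and `run_benchmark` are illustrative names, not part of spec-kit, and the "implementers" are stub functions standing in for "an LLM driven through spec-kit". The point is only the shape: tasks ordered by complexity, several models, and checks that look at observable output rather than internal code.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical types for illustration only; none of these exist in spec-kit.
@dataclass(frozen=True)
class SpecCheck:
    name: str
    passes: Callable[[str], bool]  # judges the produced artifact from the outside


@dataclass(frozen=True)
class BenchmarkTask:
    name: str
    complexity: int  # used to order tasks from simple to hard
    spec: str
    checks: tuple[SpecCheck, ...]


# Stand-in for "model X implementing the task via spec-kit".
Implementer = Callable[[BenchmarkTask], str]


def run_benchmark(
    tasks: list[BenchmarkTask],
    implementers: dict[str, Implementer],
) -> dict[str, dict[str, float]]:
    """Return {model: {task: fraction of spec checks passed}}."""
    results: dict[str, dict[str, float]] = {}
    for model, implement in implementers.items():
        results[model] = {}
        for task in sorted(tasks, key=lambda t: t.complexity):
            artifact = implement(task)
            passed = sum(1 for check in task.checks if check.passes(artifact))
            results[model][task.name] = passed / len(task.checks)
    return results


# Toy data: one trivial task and two stub "models".
tasks = [
    BenchmarkTask(
        name="greeting",
        complexity=1,
        spec="Output must greet the user Ada by name.",
        checks=(
            SpecCheck("mentions name", lambda out: "Ada" in out),
            SpecCheck("is a greeting", lambda out: out.lower().startswith("hello")),
        ),
    ),
]
implementers = {
    "stub-model-a": lambda task: "Hello, Ada!",
    "stub-model-b": lambda task: "Ada says hi",
}

results = run_benchmark(tasks, implementers)
for model, scores in results.items():
    print(model, scores)
```

In a real suite the artifact would presumably be a built project rather than a string, and the checks would exercise it through its public interface, but the per-model, per-task score table is what would feed the docs page or leaderboard.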

Benefits

Feedback for spec-kit Development: Provide actionable insights into how spec-kit’s specs might need tuning or adjustment for different LLM service providers.
Automated Regression Testing: As spec-kit becomes more feature-rich, there needs to be an automated way of testing changes and preventing regressions.
Provide Examples for Users: Offer concrete examples to users on how to write effective specs by showcasing benchmark tasks and their implementations.
Establish Industry Benchmark Standard: Position spec-kit as the industry standard for measuring spec-driven development and as the most modern and efficient approach to agentic coding. Other industry benchmarks are becoming less relevant as they don't reflect how development is now actually done.
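
As a rough illustration of the regression-testing benefit, the comparison step could be as small as the sketch below. It assumes benchmark results are stored as `{model: {task: score}}` with scores between 0 and 1; that format, the `find_regressions` name, and the sample numbers are all made up for the example and are not real measurements.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """List (model, task, old, new) where the score dropped by more than tolerance.

    Scores missing from the candidate run are skipped here; a stricter
    suite might prefer to treat them as failures.
    """
    regressions = []
    for model, task_scores in baseline.items():
        for task, old in task_scores.items():
            new = candidate.get(model, {}).get(task)
            if new is not None and old - new > tolerance:
                regressions.append((model, task, old, new))
    return regressions


# Made-up scores for illustration, not real benchmark data.
baseline = {"stub-model-a": {"greeting": 1.0, "todo-api": 0.75}}
candidate = {"stub-model-a": {"greeting": 1.0, "todo-api": 0.5}}

regressions = find_regressions(baseline, candidate)
for model, task, old, new in regressions:
    print(f"{model} / {task}: {old:.2f} -> {new:.2f}")
```

A check like this could run on every change to spec-kit's templates or prompts and fail the build when the list is non-empty.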

Would the maintainers and community be open to exploring this?

jeremyeder · 2025-09-22T18:48:02Z

I think the industry needs refreshed best-practices reference material for SDD+TDD with LLMs. spec-kit is one project trying to find that pattern, but it seems to be an integration project, at least at the moment. I don't think SDD+TDD is the end-game for any of us; it's the obvious hammer and nail right now for putting guardrails in place.

Other discussions on this board indicate you're not alone in seeing this gap.

1 reply

jaycangel Sep 25, 2025
Author

I agree with your point. Spec-kit is definitely still evolving, but without a benchmark and some kind of automated evaluation, it’s hard to see how progress can really be tracked or measured.

I also think we’ve moved past simply using base LLMs to make code edits. The structure and methodology around how changes are planned and applied make a significant difference to the final code quality. It’s not just raw “vibe coding” anymore; there’s clearly some process and discipline behind it.
