Labels: AI, enhancement, good first issue, help wanted
Description
This issue proposes the creation of a TLA+ dataset and benchmark suite based on logic puzzles and existing community modules. The goal is to enable consistent evaluation of LLMs on TLA+-related tasks and facilitate broader adoption of formal methods in AI-assisted development.
📌 Motivation
There is growing interest in evaluating language models on formal specification tasks, but TLA+ currently lacks a dataset for measuring model performance on them. Having such a dataset would support:
- Objective (or at least consistent) comparison of LLM performance on TLA+ tasks.
- Enhanced support for tools such as Aider and Codex CLI in formal-language contexts.
- A reference point for developers and researchers choosing the best tool/model for TLA+-related workflows.
🧩 Proposal Components
1. Logic Puzzles as Benchmarks
Logic puzzles such as Die Hard and River Crossing could form the basis of a benchmark dataset. These puzzles:
- Are well known and have abundant prose descriptions available online.
- Require converting informal problem descriptions into formal specs.
- Exercise a model’s ability to obtain counterexamples, a key TLA+ skill: a puzzle’s solution typically appears as a TLC error trace violating a “not solved” invariant (see the sketch after this list).
- Early results are available at https://www.youtube.com/watch?v=dZv0ykx5aGo&list=PLWLcqZLzY8u8C4dz41PglOOXncqr0c0du
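As a concrete illustration of the counterexample-as-solution pattern, here is a minimal sketch in the spirit of the well-known Die Hard spec from the Examples repository (operator names and layout are illustrative, not a canonical reference spec):

```tla
------------------------------ MODULE DieHard ------------------------------
EXTENDS Integers

VARIABLES small, big   \* gallons of water in the 3- and 5-gallon jugs

TypeOK == small \in 0..3 /\ big \in 0..5

Init == small = 0 /\ big = 0

FillSmall  == small' = 3 /\ big' = big
FillBig    == big' = 5   /\ small' = small
EmptySmall == small' = 0 /\ big' = big
EmptyBig   == big' = 0   /\ small' = small

SmallToBig == IF big + small =< 5
              THEN /\ big' = big + small
                   /\ small' = 0
              ELSE /\ big' = 5
                   /\ small' = small - (5 - big)

BigToSmall == IF big + small =< 3
              THEN /\ small' = big + small
                   /\ big' = 0
              ELSE /\ small' = 3
                   /\ big' = big - (3 - small)

Next == FillSmall \/ FillBig \/ EmptySmall \/ EmptyBig
          \/ SmallToBig \/ BigToSmall

\* TLC "solves" the puzzle by reporting a behavior that violates this
\* invariant, i.e. a sequence of steps leaving exactly 4 gallons in the big jug.
NotSolved == big /= 4
=============================================================================
```

Checking `NotSolved` as an invariant with TLC yields an error trace that is exactly the puzzle’s solution, so grading a model’s answer can amount to checking that TLC produces such a trace.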
2. TLA+ Dataset
- Create a dataset of problem statements (ideally in natural language) and their corresponding TLA+ specifications.
- CommunityModules, Examples, and other open-source TLA+ specs could serve as seeds for this dataset.
- Could follow the structure of existing coding-problem repositories (e.g., Exercism, LeetCode), adapted for formal methods; one possible per-entry layout is sketched below.
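For example, each benchmark entry could be a small directory pairing the prose statement with its reference spec and TLC configuration (directory and file names below are purely illustrative):

```
die_hard/
  problem.md        # natural-language puzzle statement
  DieHard.tla       # reference TLA+ specification
  DieHard.cfg       # TLC configuration (constants, invariants to check)
  metadata.json     # difficulty, tags, source links
```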
3. Benchmarking Framework
- Design scripts or tooling to run LLMs against the TLA+ dataset (a minimal runner is sketched after this list).
- Measure performance in terms of spec correctness, ability to catch errors (via TLC), and the quality of model explanations.
- Optionally, support community-submitted results for model comparisons.
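To make the framework concrete, here is a minimal runner sketch. It assumes TLC is invoked via a local tla2tools.jar, the dataset follows the per-entry layout above, and `generate_spec()` is a placeholder for whichever LLM API the framework eventually adopts:

```python
"""Minimal sketch of a benchmark runner: ask an LLM for a spec, then
model-check the result with TLC.  generate_spec() is a placeholder for
whichever LLM API the framework adopts, and tla2tools.jar is assumed to
be available locally."""
import pathlib
import subprocess

TLA2TOOLS = "tla2tools.jar"  # assumed local path to the TLA+ tools jar


def generate_spec(problem_statement: str) -> str:
    """Placeholder: call an LLM and return the TLA+ module text it produced."""
    raise NotImplementedError("wire up the LLM client of your choice here")


def check_with_tlc(spec_text: str, module_name: str, entry_dir: pathlib.Path) -> bool:
    """Write the generated spec next to the entry's .cfg and run TLC on it."""
    (entry_dir / f"{module_name}.tla").write_text(spec_text)
    # The dataset entry is expected to ship a matching <module>.cfg listing
    # the constants and invariants/properties TLC should check.
    result = subprocess.run(
        ["java", "-cp", TLA2TOOLS, "tlc2.TLC", f"{module_name}.tla"],
        cwd=entry_dir,
        capture_output=True,
        text=True,
        timeout=300,
    )
    # Exit code 0 means TLC parsed the spec and found no violation.  For
    # counterexample-style puzzles (like Die Hard above), the grader would
    # instead inspect TLC's output for the expected error trace.
    return result.returncode == 0


if __name__ == "__main__":
    entry = pathlib.Path("dataset/die_hard")  # illustrative entry path
    problem = (entry / "problem.md").read_text()
    spec = generate_spec(problem)
    print("PASS" if check_with_tlc(spec, "DieHard", entry) else "FAIL")
```

Community-submitted results could then be produced by re-running the same script with different model backends plugged into the placeholder.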
🚀 Next Steps
- Identify and gather an initial problem set (e.g., 5–10 logic puzzles).
- Draft initial problem statements and reference specs.
- Define benchmark evaluation criteria.
- Explore automation/tooling for running LLMs against the benchmark.
Related: FM-5