Labels: AI, enhancement, good first issue, help wanted
Description
This issue proposes the creation of a TLA+ dataset and benchmark suite based on logic puzzles and existing community modules. The goal is to enable consistent evaluation of LLMs on TLA+-related tasks and facilitate broader adoption of formal methods in AI-assisted development.
📌 Motivation
There is growing interest in evaluating language models on formal specification tasks, but TLA+ currently lacks a dataset for measuring model performance on them. Having such a dataset would support:
- Objective (or at least consistent) comparison of LLM performance on TLA+ tasks.
- Enhanced support for tools such as Aider and Codex CLI in formal-language contexts.
- A reference point for developers and researchers choosing the best tool/model for TLA+-related workflows.
🧩 Proposal Components
1. Logic Puzzles as Benchmarks
Logic puzzles such as Die Hard and River Crossing could form the basis of a benchmark dataset. These puzzles:
- Are well known and have abundant prose descriptions available online.
- Require converting informal problem descriptions into formal specs.
- Exercise a model’s ability to obtain counterexamples, a key TLA+ skill: a puzzle’s solution typically appears as a TLC error trace violating a “not solved” invariant (see the sketch after this list).
- Early results are available at https://www.youtube.com/watch?v=dZv0ykx5aGo&list=PLWLcqZLzY8u8C4dz41PglOOXncqr0c0du
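As a concrete illustration of the counterexample-as-solution pattern, here is a minimal sketch in the spirit of the well-known Die Hard spec from the Examples repository (operator names and layout are illustrative, not a canonical reference spec):

```tla
------------------------------ MODULE DieHard ------------------------------
EXTENDS Integers

VARIABLES small, big   \* gallons of water in the 3- and 5-gallon jugs

TypeOK == small \in 0..3 /\ big \in 0..5

Init == small = 0 /\ big = 0

FillSmall  == small' = 3 /\ big' = big
FillBig    == big' = 5   /\ small' = small
EmptySmall == small' = 0 /\ big' = big
EmptyBig   == big' = 0   /\ small' = small

SmallToBig == IF big + small =< 5
              THEN /\ big' = big + small
                   /\ small' = 0
              ELSE /\ big' = 5
                   /\ small' = small - (5 - big)

BigToSmall == IF big + small =< 3
              THEN /\ small' = big + small
                   /\ big' = 0
              ELSE /\ small' = 3
                   /\ big' = big - (3 - small)

Next == FillSmall \/ FillBig \/ EmptySmall \/ EmptyBig
          \/ SmallToBig \/ BigToSmall

\* TLC "solves" the puzzle by reporting a behavior that violates this
\* invariant, i.e. a sequence of steps leaving exactly 4 gallons in the big jug.
NotSolved == big /= 4
=============================================================================
```

Checking `NotSolved` as an invariant with TLC yields an error trace that is exactly the puzzle’s solution, so grading a model’s answer can amount to checking that TLC produces such a trace.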
2. TLA+ Dataset
- Create a dataset of problem statements (ideally in natural language) and their corresponding TLA+ specifications.
- CommunityModules, Examples, and other open-source TLA+ specs could serve as seeds for this dataset.
- Could follow the structure of existing coding-problem repositories (e.g., Exercism, LeetCode), adapted for formal methods; one possible per-entry layout is sketched below.
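For example, each benchmark entry could be a small directory pairing the prose statement with its reference spec and TLC configuration (directory and file names below are purely illustrative):

```
die_hard/
  problem.md        # natural-language puzzle statement
  DieHard.tla       # reference TLA+ specification
  DieHard.cfg       # TLC configuration (constants, invariants to check)
  metadata.json     # difficulty, tags, source links
```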
3. Benchmarking Framework
- Design scripts or tooling to run LLMs against the TLA+ dataset (a minimal runner is sketched after this list).
- Measure performance in terms of spec correctness, ability to catch errors (via TLC), and the quality of model explanations.
- Optionally, support community-submitted results for model comparisons.
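To make the framework concrete, here is a minimal runner sketch. It assumes TLC is invoked via a local tla2tools.jar, the dataset follows the per-entry layout above, and `generate_spec()` is a placeholder for whichever LLM API the framework eventually adopts:

```python
"""Minimal sketch of a benchmark runner: ask an LLM for a spec, then
model-check the result with TLC.  generate_spec() is a placeholder for
whichever LLM API the framework adopts, and tla2tools.jar is assumed to
be available locally."""
import pathlib
import subprocess

TLA2TOOLS = "tla2tools.jar"  # assumed local path to the TLA+ tools jar


def generate_spec(problem_statement: str) -> str:
    """Placeholder: call an LLM and return the TLA+ module text it produced."""
    raise NotImplementedError("wire up the LLM client of your choice here")


def check_with_tlc(spec_text: str, module_name: str, entry_dir: pathlib.Path) -> bool:
    """Write the generated spec next to the entry's .cfg and run TLC on it."""
    (entry_dir / f"{module_name}.tla").write_text(spec_text)
    # The dataset entry is expected to ship a matching <module>.cfg listing
    # the constants and invariants/properties TLC should check.
    result = subprocess.run(
        ["java", "-cp", TLA2TOOLS, "tlc2.TLC", f"{module_name}.tla"],
        cwd=entry_dir,
        capture_output=True,
        text=True,
        timeout=300,
    )
    # Exit code 0 means TLC parsed the spec and found no violation.  For
    # counterexample-style puzzles (like Die Hard above), the grader would
    # instead inspect TLC's output for the expected error trace.
    return result.returncode == 0


if __name__ == "__main__":
    entry = pathlib.Path("dataset/die_hard")  # illustrative entry path
    problem = (entry / "problem.md").read_text()
    spec = generate_spec(problem)
    print("PASS" if check_with_tlc(spec, "DieHard", entry) else "FAIL")
```

Community-submitted results could then be produced by re-running the same script with different model backends plugged into the placeholder.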
🚀 Next Steps
- Identify and gather an initial problem set (e.g., 5–10 logic puzzles).
- Draft initial problem statements and reference specs.
- Define benchmark evaluation criteria.
- Explore automation/tooling for running LLMs against the benchmark.
Related: FM-5