Skip to content

Create and curate a TLA+ Dataset for LLM Training and Evaluation #1196

@lemmy

Description

@lemmy

This issue proposes the creation of a TLA+ dataset and benchmark suite based on logic puzzles and existing community modules. The goal is to enable consistent evaluation of LLMs on TLA+-related tasks and facilitate broader adoption of formal methods in AI-assisted development.

📌 Motivation

There is growing interest in evaluating language models on formal specification tasks. TLA+ lacks a dataset to test model performance. Having such a dataset would support:

  • Objective (or at least consistent) comparison of LLM performance on TLA+ tasks.
  • Enhanced support for tools like Aider, Codex CLI in formal language contexts.
  • A reference point for developers and researchers choosing the best tool/model for TLA+-related workflows.

🧩 Proposal Components

1. Logic Puzzles as Benchmarks

Logic puzzles like Die Hard, River Crossing, etc., could form the basis of a benchmark dataset. These puzzles:

2. TLA+ Dataset

  • Create a dataset of problem statements (ideally in natural language), and their corresponding TLA+ specifications.
  • CommunityModules, Examples, and other open-source TLA+ specs could serve as seeds for this dataset.
  • Could follow the structure of existing coding problem repositories (e.g., Exercism, LeetCode) adapted for formal methods.

3. Benchmarking Framework

  • Design scripts or tooling to run LLMs against the TLA+ dataset.
  • Measure performance in terms of spec correctness, ability to catch errors (via TLC), and model explanation quality.
  • Optionally, support community-submitted results for model comparisons.

🚀 Next Steps

  • Identify and gather initial problem set (e.g., 5–10 logic puzzles).
  • Draft initial problem statements and reference specs.
  • Define benchmark evaluation criteria.
  • Explore automation/tooling for running LLMs against the benchmark.

Related: FM-5

Metadata

Metadata

Assignees

No one assigned

    Labels

    AIWork related to TLAi+enhancementLets change things for the bettergood first issueYour entry point to contributing to TLA+help wantedWe need your help

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions