
TFL-automatisation


Multi-agent pipelines for Theory of Formal Languages problems — classify a language and produce an exam-ready proof that it belongs (or does not belong) to a given complexity class. Four independent pipelines, one shared orchestration pattern, one local web UI.

The system targets the formal languages course at ИУ-9, МГТУ им. Баумана. Inputs are Russian problem statements; outputs are structured Markdown / HTML / JSON reports with LaTeX-typeset proofs, grammar constructions, PDA diagrams, and case analyses.


At a glance

| Pipeline | Directory | Language class | Agents | Flagship theorems |
|----------|-----------|----------------|--------|-------------------|
| REG | agent_system/ | Regular | 12 | Myhill–Nerode, pumping lemma (reg), closure under ⋃ ⋂ ¬ · * |
| CFL | cfl_system/ | Context-free | 15 | Bar-Hillel pumping, Ogden's lemma, Parikh's theorem, interchange lemma, CFL ∩ REG |
| DCFL | dcfl_system/ | Deterministic CFL | 8 | DCFL pumping, Shallit's lemma, inherent ambiguity |
| LL | ll_system/ | LL(k) grammars | 10 | FIRST / FOLLOW, LL(k) conflict detection, left-recursion and factoring transforms |

Each pipeline runs in parallel over its specialist agents, fuses their results through a reasoning agent, audits the selected proof with an independent proof-checker, and formalizes the verdict via a structured Markdown document. See Architecture below.


Quickstart

# One-time setup
python -m venv .venv
.venv/Scripts/activate        # or source .venv/bin/activate on *nix
pip install -r requirements.txt   # anthropic, langgraph, python-dotenv, ...
echo "ANTHROPIC_API_KEY=sk-..." > .env

# Install Graphviz (for PDA state diagrams in HTML reports)
#   Windows: https://graphviz.org/download/ (add bin/ to PATH)
#   macOS:   brew install graphviz
#   Linux:   apt install graphviz
dot -V                        # sanity check

# Run the unified UI
.venv/Scripts/python -m ui_server.server --port 8765
# → open http://127.0.0.1:8765/

Or run a pipeline directly from the command line:

# Mock (no LLM, fast, free) — exercises validation + fallback reasoning
.venv/Scripts/python -m cfl_system.orchestrator \
    cfl_system/examples/task11_ai_bj_between.json \
    --save cfl_system/examples/live_outputs/

# Live (Anthropic API, real money)
.venv/Scripts/python -m cfl_system.orchestrator \
    cfl_system/examples/task11_ai_bj_between.json \
    --live --verbose \
    --save cfl_system/examples/live_outputs/

Artifacts land as {task}_result.{json,md,html} beside the IR.


Architecture

Every pipeline is a LangGraph StateGraph, wired with the same topology. Specialists run concurrently via Send() fan-out; the reasoning agent consolidates their evidence; disagreements trigger a selective retry or a hypothesis inversion.

flowchart TD
    IR([IR JSON]) --> V[validate_ir]
    V --> H[analyze_hypothesis<br/><i>pure fn: guess class</i>]
    H --> LP[language_preprocess<br/><i>pure fn: bounded, parikh, filter</i>]
    LP --> CL[classifier<br/><i>LLM advisory verdict</i>]
    CL --> D[setup_dispatch]

    D -.fan-out Send().-> S1[specialist #1]
    D -.fan-out Send().-> S2[specialist #2]
    D -.fan-out Send().-> SN[specialist #N]

    S1 --> CS[collect_specialists]
    S2 --> CS
    SN --> CS

    CS --> BO[build_oracle<br/><i>grammar / regex / predicate</i>]
    BO --> VC[verify_claims<br/><i>per-agent audit</i>]
    VC --> OT[oracle_test<br/><i>CYK / DFA / word-membership</i>]
    OT --> PC[proof_checker<br/><i>LLM: independent audit</i>]
    PC --> R{reasoning<br/><i>consolidate + verdict</i>}

    R -->|done| F[formalizer<br/><i>structured MD proof</i>]
    R -->|retry| RP[retry_planner<br/><i>pick agents + hints</i>]
    R -->|invert| INV[invert_hypothesis]

    RP --> D
    INV --> D

    F --> AR[assemble_result]
    AR --> END([result.json + .md + .html])

    style D fill:#1e3a5f,stroke:#4a90c0,color:#e6e8ec
    style R fill:#2d4a1e,stroke:#7acb9d,color:#e6e8ec
    style F fill:#4a3a1e,stroke:#d4a86a,color:#e6e8ec

Key properties

  • Fan-out parallelism. All specialists are dispatched in one Send() burst and run concurrently in LangGraph's Pregel thread pool. End-to-end wall time ≈ max(specialist time), not sum.
  • Selective retry. retry_planner can re-dispatch a subset of agents with specific hints (e.g. "try pumping word $a^p b^p c^p$ instead of $a^{2p}$") without re-running the whole pipeline.
  • Hypothesis inversion. If every specialist under the current guess (cfl / non_cfl) fails, invert_hypothesis flips the hypothesis and restarts the dispatch. Bounded by MAX_INVERSIONS.
  • Honest verification. proof_verified: true is set only if proof_checker actually ran and returned status = verified. Agent errors, API failures, and JSON-parse problems are tracked in state["errors"] and surfaced in the final result — no silent success.
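The wall-time claim in the fan-out bullet can be illustrated with a stdlib sketch. This uses concurrent.futures as a stand-in for LangGraph's Pregel thread pool, and the specialist names and sleep durations are hypothetical placeholders for real LLM calls:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_specialist(name: str, seconds: float) -> dict:
    """Stand-in for one specialist agent; sleep simulates LLM latency."""
    time.sleep(seconds)
    return {"agent": name, "status": "ok"}

# Hypothetical workloads: three specialists with different latencies.
specialists = [("pumping", 0.3), ("ogden", 0.2), ("parikh", 0.1)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
    results = list(pool.map(lambda s: run_specialist(*s), specialists))
elapsed = time.perf_counter() - start

# Wall time tracks the slowest specialist (~0.3 s), not the sum (~0.6 s).
assert elapsed < 0.55
```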

Live-runner reliability layer

The LiveRunner (same pattern across all four pipelines) handles the messy reality of calling a frontier LLM in production:

Opus response → _extract_json (3 strategies: whole / fenced / braces)
              ├─ success → return parsed dict
              └─ failure → Haiku repair (claude-haiku-4-5, ~$0.01, ~2s)
                         ├─ success → return parsed dict
                         └─ failure → Opus retry with detailed parse-error
                                      (position + context snippet + prior
                                       response excerpt) up to N times
                                      └─ final failure → agent_error dict
                                         with errors recorded in state
  • Streaming API is mandatory. The Anthropic SDK rejects non-streaming requests whose max_tokens implies a projected generation time over 10 minutes, a limit that the older 4096-token runs on formalizer / proof_checker hit silently.
  • Temperature handled per model. Opus 4.7 (adaptive thinking) deprecated temperature; LiveRunner drops it for that model via an allow-list.
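The three extraction strategies (whole / fenced / braces) can be sketched as follows. This is a simplified stand-in for _extract_json, not the project's actual implementation:

```python
import json
import re

def extract_json(text: str):
    """Try three strategies in order: whole response, fenced block, outermost braces."""
    # 1. The whole response is valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. First ```json ... ``` (or bare ```) fenced block.
    m = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Span between the first '{' and the last '}'.
    i, j = text.find("{"), text.rfind("}")
    if 0 <= i < j:
        try:
            return json.loads(text[i:j + 1])
        except json.JSONDecodeError:
            pass
    return None  # caller escalates to the Haiku repair step

assert extract_json('{"verdict": "cfl"}') == {"verdict": "cfl"}
assert extract_json('Proof:\n```json\n{"ok": true}\n```\ndone') == {"ok": True}
assert extract_json('Noise before {"a": 1} noise after') == {"a": 1}
```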

The four pipelines

REG — agent_system/

Classify a regular-language problem and prove it (or disprove regularity). Specialists:

flowchart LR
    IR --> re[re_builder<br/><i>regex construction</i>]
    IR --> dfa[dfa_builder<br/><i>DFA + CYK oracle</i>]
    IR --> pump[pumping_agent<br/><i>pumping lemma</i>]
    IR --> ner[nerode_agent<br/><i>Myhill–Nerode</i>]
    IR --> clo[closure_agent<br/><i>closure properties</i>]
    IR --> gram[grammar_analyzer<br/><i>right-linear check</i>]

Theorems / methods:

  • Myhill–Nerode theorem. Infinite-rank right-congruence → not regular.
  • Pumping lemma. For non-regularity: choose word $z$, case analysis on partitions $z = uvw$.
  • Closure under boolean ops + concatenation + star. Reduce unknown language to a known one via $L \cup R$, $L \cap R$, $\overline{L}$, $h^{-1}(L)$.
  • Grammar analysis. Right-linear grammars generate exactly regular languages.

Formalizer can emit Lean 4 proof stubs in addition to Markdown (optional, gated by availability of a Lean build).
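The Myhill–Nerode argument can be machine-checked on a finite sample. For the classic $L = \{a^n b^n : n \ge 1\}$, the prefixes $a^1, a^2, \dots$ are pairwise distinguishable, so the Nerode congruence has infinite index. An illustrative sketch (not the pipeline's actual oracle):

```python
def in_lang(w: str) -> bool:
    """Membership in L = { a^n b^n : n >= 1 }."""
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n

def distinguishes(x: str, y: str, suffix: str) -> bool:
    """A suffix z distinguishes x and y iff exactly one of xz, yz lies in L."""
    return in_lang(x + suffix) != in_lang(y + suffix)

# For any i != j, the suffix b^i separates a^i from a^j.
for i in range(1, 20):
    for j in range(i + 1, 20):
        assert distinguishes("a" * i, "a" * j, "b" * i)
# 19 pairwise-distinguishable prefixes checked; the pattern extends to all
# i != j, so the right congruence has infinite rank and L is not regular.
```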

CFL — cfl_system/

Classify a context-free problem. Full specialist cast:

flowchart LR
    IR --> cfg[cfg_builder<br/><i>CFG construction</i>]
    IR --> pda[pda_builder<br/><i>PDA construction</i>]
    IR --> decomp[decomposition<br/><i>concat/union split</i>]
    IR --> pk[parikh<br/><i>semilinearity</i>]
    IR --> pump[pumping_cfl<br/><i>Bar-Hillel</i>]
    IR --> og[ogden<br/><i>Ogden's lemma</i>]
    IR --> clo[closure_reduction<br/><i>∩ REG, h, h⁻¹</i>]
    IR --> inter[interchange<br/><i>interchange lemma</i>]
    IR --> morph[morphism<br/><i>morphic image</i>]

Theorems / methods:

  • Bar-Hillel pumping lemma. For non-CFL: every long enough $z \in L$ decomposes as $uvwxy$ with $|vwx| \le p$ and $|vx| \ge 1$, and $u v^i w x^i y \in L$ for all $i \ge 0$. Case split on where $vwx$ falls.
  • Ogden's lemma. Strengthened pumping with marked positions — catches non-CFLs that survive plain pumping.
  • Parikh's theorem. Parikh image of any CFL is semilinear (a finite union of linear sets). Non-semilinear image $\Rightarrow$ language is not CFL. Used to knock out languages like $\{a^n b^m : n = k^2\}$ via Ginsburg–Spanier.
  • Closure under ∩ REG. If $L$ is CFL and $R$ is regular, $L \cap R$ is CFL. Contrapositive: intersect with a carefully chosen regex, then apply pumping to the simpler intersection.
  • Interchange lemma, morphism closure. Fallback attacks when the above are inconclusive.
  • CYK oracle. Pure-function membership test for the proposed CFG, validated against ≥3 positive and ≥3 negative words.
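A minimal CYK membership test of the kind the oracle bullet describes, run on a hand-written CNF grammar for $\{a^n b^n : n \ge 1\}$ and validated against ≥3 positive and ≥3 negative words. This is an illustrative sketch; the real oracle module in cfl_system/lib is not shown here:

```python
from itertools import product

def cyk(word, start, unary, binary):
    """CNF membership: unary maps terminal -> nonterminals, binary maps (B, C) -> nonterminals."""
    n = len(word)
    if n == 0:
        return False  # this grammar has no epsilon production
    # table[i][j] = nonterminals deriving word[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][i] = set(unary.get(ch, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for B, C in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary.get((B, C), set())
    return start in table[0][n - 1]

# CNF grammar for a^n b^n (n >= 1):  S -> A B | A X,  X -> S B,  A -> a,  B -> b
unary = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S"}, ("A", "X"): {"S"}, ("S", "B"): {"X"}}

# Validate against >= 3 positive and >= 3 negative words, as the pipeline does.
assert all(cyk(w, "S", unary, binary) for w in ["ab", "aabb", "aaabbb"])
assert not any(cyk(w, "S", unary, binary) for w in ["", "a", "ba", "aab"])
```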

DCFL — dcfl_system/

Classify a deterministic-context-free problem. Fewer specialists, each heavier:

flowchart LR
    IR --> stack[stack_strategy<br/><i>DPDA construction</i>]
    IR --> clo[closure_reduction<br/><i>DCFL closure</i>]
    IR --> pump[dcfl_pumping<br/><i>DCFL pumping</i>]
    IR --> shall[shallit<br/><i>Shallit's lemma</i>]
    IR --> amb[inh_ambiguity<br/><i>inherent ambiguity</i>]

Theorems / methods:

  • DCFL pumping lemma. Unlike CFL pumping, it bounds |uv| instead of just |vwx|; the tighter constraint can separate languages that are CFL but not DCFL.
  • Shallit's lemma. Counting-based tool: if specific ratios of letter frequencies force an unbounded stack state, the language cannot be DCFL.
  • Inherent ambiguity arguments. Every DCFL has an unambiguous grammar; showing the language is inherently ambiguous (every grammar for it admits two distinct parse trees for some word) rules DCFL out.
  • Closure under complement (DCFL-specific). DCFLs are closed under complement but CFLs are not — useful discriminator.

LL — ll_system/

Classify a grammar as LL(k) for some bounded k, detect conflicts, suggest transformations.

flowchart LR
    IR --> gb[ll_grammar_builder<br/><i>build FIRST/FOLLOW</i>]
    IR --> sub[substitution_agent<br/><i>productions unfolding</i>]
    IR --> amb[ambiguity_detector<br/><i>ε/LL conflicts</i>]
    IR --> pre[prefix_classes_agent<br/><i>prefix classes for k</i>]
    IR --> mark[marker_analyzer<br/><i>marker-based LL</i>]
    IR --> tr[ll_grammar_transformer<br/><i>left-rec / factoring</i>]

Theorems / methods:

  • FIRST / FOLLOW sets. Computed per-nonterminal; the LL(1) test is $\text{FIRST}(\alpha_i) \cap \text{FIRST}(\alpha_j) = \emptyset$ for every pair of productions of the same nonterminal, plus a FOLLOW-disjointness condition when $\varepsilon$ is derivable.
  • LL(k) generalization. For $k > 1$, prefix classes of length $k$ must be pairwise disjoint.
  • Grammar transformations. Left-recursion elimination, left-factoring — makes a non-LL(1) grammar potentially LL(1), detected by ll_grammar_transformer.
  • First/Follow oracle. Pure-function computation used both for verification and as an independent source of truth against the LLM-proposed sets.
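The LL(1) test above can be sketched as a fixed-point FIRST/FOLLOW computation plus a pairwise select-set disjointness check. An illustrative sketch; the pipeline's own oracle may use a different representation:

```python
EPS = "ε"

def first_of_seq(seq, FIRST):
    """FIRST of a symbol sequence; terminals have FIRST = {themselves}."""
    out = set()
    for sym in seq:
        s = FIRST.get(sym, {sym})
        out |= s - {EPS}
        if EPS not in s:
            return out
    out.add(EPS)  # every symbol was nullable (or seq is empty)
    return out

def compute_first(grammar):
    FIRST = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                f = first_of_seq(prod, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

def compute_follow(grammar, start, FIRST):
    FOLLOW = {A: set() for A in grammar}
    FOLLOW[start].add("$")
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in grammar:
                        continue  # terminal
                    tail = first_of_seq(prod[i + 1:], FIRST)
                    add = (tail - {EPS}) | (FOLLOW[A] if EPS in tail else set())
                    if not add <= FOLLOW[sym]:
                        FOLLOW[sym] |= add
                        changed = True
    return FOLLOW

def ll1_conflicts(grammar, FIRST, FOLLOW):
    """Pairs of productions of one nonterminal whose select sets overlap."""
    bad = []
    for A, prods in grammar.items():
        selects = []
        for prod in prods:
            f = first_of_seq(prod, FIRST)
            selects.append((f - {EPS}) | (FOLLOW[A] if EPS in f else set()))
        for i in range(len(selects)):
            for j in range(i + 1, len(selects)):
                if selects[i] & selects[j]:
                    bad.append((A, i, j))
    return bad

# S -> a S b | ε   (productions are tuples of symbols; nonterminals are grammar keys)
grammar = {"S": [("a", "S", "b"), ()]}
FIRST = compute_first(grammar)
FOLLOW = compute_follow(grammar, "S", FIRST)
assert FIRST["S"] == {"a", EPS}
assert FOLLOW["S"] == {"b", "$"}
assert ll1_conflicts(grammar, FIRST, FOLLOW) == []  # the grammar is LL(1)
```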

Models and cost structure

All four pipelines use the same three-tier model stack:

| Role | Model (alias) | Rate (input / output) | Used by |
|------|---------------|-----------------------|---------|
| Heavy reasoning | claude-opus-4-7 | $5 / $25 per MTok | All specialists, reasoning, proof_checker, formalizer |
| Fast structured | claude-sonnet-4-6 | $3 / $15 per MTok | classifier, retry_planner, input_parser |
| JSON repair | claude-haiku-4-5 | $1 / $5 per MTok | _repair_json_with_haiku hook in LiveRunner |

Actual run cost of the CFL pipeline on a medium problem (task_w1bw2w3_ticket50.json, closure_reduction + pumping, proof_checker verified): ≈ $2–3. Simple problems (single-path, short proof) come in at ≈ $1.

Opus 4.7 uses adaptive thinking — it picks its reasoning depth per query and does not accept the temperature parameter. LiveRunner._model_accepts_temperature() is an allow-list that drops the field for models that reject it.


TFL Lab — local web UI

ui_server/ serves a single-page workspace for running any pipeline against any IR in the browser.

  • Sidebar: project switcher (REG / CFL / DCFL / LL) + examples/*.json file list
  • Editor: monospace IR editor with live JSON validator and line gutter
  • Run bar: Mock Run (free, no LLM) and Live Run (gated by an explicit "I confirm API spend" checkbox with red border)
  • Result pane: four tabs — HTML (iframe of the rendered report with KaTeX + inline Graphviz SVG), JSON (formatted), Markdown (Preview ↔ Raw GitHub-style toggle), Log (streamed stderr, color-coded)

Backend is stdlib-only (http.server + ThreadingHTTPServer); frontend is vanilla JS + CDN-loaded marked + KaTeX. No framework, no build step.

.venv/Scripts/python -m ui_server.server --port 8765
# → http://127.0.0.1:8765/
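The stdlib-only backend pattern, reduced to its essentials. A sketch under assumptions, not the actual ui_server/server.py; the /api/ping route and Handler class are hypothetical:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical health route; the real server also serves static/ and run endpoints.
        if self.path == "/api/ping":
            body = json.dumps({"ok": True}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to loopback only; port 0 asks the OS for an ephemeral port.
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/api/ping") as resp:
    reply = json.loads(resp.read())
server.shutdown()
server.server_close()
assert reply == {"ok": True}
```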

Repository layout

.
├── agent_system/           # REG pipeline
│   ├── orchestrator.py     # LangGraph StateGraph + LiveRunner
│   ├── config.py           # models + max_tokens per agent
│   ├── prompts/            # system prompts (one per agent)
│   ├── lib/                # oracle, IR schema, renderer, word generator
│   ├── examples/           # task IRs
│   ├── templates/          # Lean 4 proof templates
│   └── tests/              # pytest suite
├── cfl_system/             # CFL pipeline (same layout)
├── dcfl_system/            # DCFL pipeline (same layout)
├── ll_system/              # LL pipeline (same layout)
├── ui_server/              # TFL Lab web UI
│   ├── server.py           # stdlib http.server
│   ├── static/index.html   # single-page frontend
│   └── __init__.py
├── pumping_regular/        # Legacy standalone pumping tool (reference)
├── reverse_morfism/        # Legacy inverse-homomorphism solver
├── pumping_len.py          # Standalone min pumping-length finder for regex
├── .env                    # ANTHROPIC_API_KEY (git-ignored)
└── README.md               # you are here

Each pipeline follows the same module contract — if you can read one, you can read all four.


Tests

Pure-function modules have full pytest coverage; LLM-driven layers are covered by mock-mode integration tests.

.venv/Scripts/python -m pytest agent_system/tests cfl_system/tests \
                               dcfl_system/tests ll_system/tests -q
# → 1294 passed, 3 skipped (as of current HEAD)

The 3 skipped tests exercise MockRunner against %TEMP% on Windows and skip in sandboxed CI environments.


External dependencies

  • Python 3.12+ with anthropic, langgraph, python-dotenv, pytest
  • Graphviz dot binary — rendering PDA state diagrams as inline SVG in HTML reports. Falls back to Mermaid via CDN if dot is absent. See cfl_system/CLAUDE.md for install notes.
  • Lean 4 (optional) — REG pipeline can emit Lean stubs via agent_system/templates/*.lean.
  • Anthropic API key in .env for --live runs. Mock mode works offline.

Design non-goals

  • Not a solver for arbitrary undecidable questions. All four pipelines assume the input is a well-posed problem from the exam problem set — the class being tested is decidable or admits a standard proof technique.
  • Not a formal proof assistant. Proofs are natural-language Markdown; the proof_checker agent is an independent LLM audit, not a machine-checked verification. Lean stubs (REG only) are the closest this project gets to machine-checked proofs, and even those are opt-in.
  • No multi-user state. TFL Lab binds to 127.0.0.1 and has no auth. It is a single-user research tool.

License

MIT — use, modify, distribute freely. Provided as-is, with no warranty.

Acknowledgements

Built for the Theory of Formal Languages course at МГТУ им. Н.Э. Баумана, ИУ-9.

Powered by Claude (Opus 4.7 / Sonnet 4.6 / Haiku 4.5) via the Anthropic API, orchestrated through LangGraph.
