EqualExperts/satsuma-lang

An Equal Experts open-source project

Satsuma


📖 Website, documentation & examples → equalexperts.github.io/satsuma-lang

Try it Live! — edit Satsuma in your browser and watch the visualization update (everything runs client-side; your source is never uploaded)

Satsuma is a domain-specific language for source-to-target data mapping.

It is designed to replace ad hoc spreadsheets, wiki tables, and overly verbose structured specs with a format that is:

  • readable by humans
  • parseable by tools
  • compact enough for AI agents to generate and consume reliably
  • stable enough to act as the source of truth for downstream tooling

Satsuma is intended to sit between systems and describe how data moves from one shape to another, whether those systems are databases, APIs, files, messages, events, or mixed enterprise platforms.

What makes Satsuma different is that it does not force a false choice between formal structure and human intent. The language keeps schemas, mappings, metadata, and references deterministic and parser-backed, while still allowing natural language exactly where real projects need it: notes, business rules, underspecified transforms, and review context.

That makes Satsuma a good fit for AI agents. Deterministic tooling can extract the structural facts with high confidence, while an LLM handles the reasoning over the natural-language parts. The parser and CLI are not competing with the agent; they are the reliable substrate that lets the agent reason safely.

Why Satsuma Exists

Most mapping specifications today are hard to trust operationally:

  • spreadsheets are inconsistent and drift from implementation
  • free-form docs are readable but not machine-checkable
  • YAML and JSON are parseable but too noisy for large mapping inventories
  • vendor tools often hide critical logic behind UI configuration

Satsuma aims to solve that by making mapping intent explicit in a language that both people and parsers can work with directly.

That matters even more in AI-assisted delivery. Agents can produce better code, better reviews, and better impact analysis when they operate against a constrained language instead of reverse-engineering free-form implementation logic.

The intended operating model is hybrid:

  • Satsuma captures the deterministic structure of the integration
  • the satsuma CLI extracts facts, topology, metadata, lineage, and NL content
  • the LLM reasons over those extracted facts plus the embedded natural-language intent

This is the core idea: use deterministic tools for what must be exact, and use LLM reasoning for what cannot be fully formalized without making the language unusable.

Natural Language as a First-Class Part of the Spec

Satsuma treats natural language as part of the specification surface, not as an embarrassing escape hatch.

In real mapping work, some intent is naturally formal:

  • source and target structures
  • field references
  • metadata tags
  • imports and reusable definitions
  • file-based workspace boundaries and import-reachable symbol graphs

Some intent is not naturally formal, at least not without turning the mapping document into a programming language:

  • business rules that depend on domain interpretation
  • legacy behaviors that are known but not fully codified
  • transformation notes that require analyst review
  • implementation guidance and caveats for downstream teams

Satsuma keeps both in one versioned artifact. That is important for AI agents:

  • the parser-backed parts provide reliable structure
  • the natural-language parts preserve the reasoning context humans actually use
  • the CLI can surface both without inventing semantics
  • the agent can then apply judgment instead of scraping prose from Excel cells

Extensible Metadata: Satsuma's Superpower

Satsuma's ( ) metadata accepts any vocabulary token. There are no reserved metadata keywords. When you write (merge upsert, match_on customer_id) or (dimension, scd 2), you are not using built-in language features — you are using conventions that anyone can define and document.

That means your organisation can encode its own standards directly into specs:

schema payments (
    owner "payments-team",
    data_domain "finance",
    cost_center "CC-4200",
    audit_level high,
    compliance {PCI-DSS, SOX}
) {
  ...
}

None of those tokens required a language change. They work today. Write an LLM-Guidelines.md that tells AI agents how to interpret your custom tokens (what DDL to generate, what tests to create, what policies to enforce), and every AI agent in your organisation knows how to act on them. The convention guides that ship with Satsuma (Kimball, Data Vault, Merge Strategy, Governance, Reports & Models) are ready-made examples of this pattern.
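
As a rough illustration of that pattern, the sketch below shows how a downstream tool or agent might act on custom tokens once they have been extracted. Everything in it is hypothetical: the dictionary shape, the `required_controls` helper, and the house rules are assumptions made for illustration, not part of Satsuma or its CLI.

```python
# Hypothetical: metadata from the `payments` schema above, already extracted
# into a plain dict. The shape is an assumption, not the CLI's output format.
PAYMENTS_METADATA = {
    "owner": "payments-team",
    "data_domain": "finance",
    "cost_center": "CC-4200",
    "audit_level": "high",
    "compliance": ["PCI-DSS", "SOX"],
}

# Hypothetical house rules, the kind of thing an LLM-Guidelines.md might state.
COMPLIANCE_CONTROLS = {
    "PCI-DSS": "mask card data outside production",
    "SOX": "require change approval for mapping edits",
}


def required_controls(metadata):
    """Translate custom metadata tokens into the controls they imply."""
    controls = []
    if metadata.get("audit_level") == "high":
        controls.append("write an audit record for every load")
    for regime in metadata.get("compliance", []):
        # Unknown regimes are flagged for review rather than silently ignored.
        controls.append(COMPLIANCE_CONTROLS.get(regime, f"review needed: {regime}"))
    return controls


print(required_controls(PAYMENTS_METADATA))
```

The point of the sketch is that the interpretation lives in your organisation's conventions, not in the language: adding a new token means extending a lookup like this one (or the guidelines an agent reads), not changing the grammar.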

What Satsuma Covers

Satsuma supports:

  • schema and record structure declarations
  • source-to-target field mappings
  • computed target fields
  • transform bodies expressed as natural-language pipe steps plus value maps
  • conditional mapping logic
  • comments and rich notes with semantic intent
  • reusable fragments and imports
  • multi-file platform modeling for platform-wide lineage

The long-term tooling model is parser-first:

  1. grammar and parser
  2. AST/CST conventions
  3. linter and validator
  4. formatter
  5. editor support
  6. visualizers and generators

For the package map, dependency graph, and design principles behind the tooling, see Satsuma Tooling Architecture.

For agent workflows specifically, the model is:

  1. author or generate .stm
  2. validate and extract with deterministic tooling
  3. let the agent reason over the extracted structure plus NL intent
  4. generate code, reviews, documentation, or impact analysis from that combined view
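
Steps 2 and 3 of that workflow could be wired together roughly as follows. This is a minimal sketch, not the project's own integration: it uses only the `satsuma validate` and `satsuma summary` commands shown later in this README, and it assumes that a non-zero exit code signals a validation failure. The helper names `run_satsuma` and `build_agent_context` are hypothetical.

```python
import subprocess
from pathlib import Path


def run_satsuma(args, runner=subprocess.run):
    """Run the satsuma CLI and capture its text output."""
    return runner(["satsuma", *args], capture_output=True, text=True)


def build_agent_context(stm_path, runner=subprocess.run):
    """Validate a spec, extract its structure, and assemble the agent's input."""
    validation = run_satsuma(["validate", stm_path], runner)
    # Assumption: a non-zero exit code means validation failed.
    if validation.returncode != 0:
        raise ValueError(
            f"{stm_path} failed validation:\n{validation.stdout}{validation.stderr}"
        )

    summary = run_satsuma(["summary", stm_path], runner)
    source = Path(stm_path).read_text(encoding="utf-8")
    return (
        "Structural summary (from the parser-backed CLI):\n"
        f"{summary.stdout}\n"
        "Full source, including natural-language notes and transforms:\n"
        f"{source}"
    )
```

The `runner` parameter exists only so the sketch can be exercised without the CLI installed; in normal use the default `subprocess.run` is enough. The returned string is what you would hand to the agent in step 3.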

Install the CLI

A universal pre-built package is published on every merge to main. It uses WASM internally, so the same package works on macOS, Linux, and Windows:

npm install -g https://github.com/EqualExperts/satsuma-lang/releases/download/latest/satsuma-cli-latest.tgz

This gives you the satsuma command on your PATH. Run satsuma --help to see available commands.

Can't install the CLI or VS Code extension? Satsuma is still useful without any tooling. LLMs can read and generate .stm files directly — our tests show 3–8x fewer tokens than equivalent spreadsheets or YAML — and plain-text files version-control cleanly. See Using Satsuma Without the CLI for workflows with web LLMs like ChatGPT, Gemini, or Claude.ai.

Example

note { "Customer sync — legacy CRM to analytics warehouse" }

schema crm (note "Legacy CRM — SQL Server 2008") {
  CUST_ID     INT            (pk)
  CUST_TYPE   CHAR(1)        (enum {R, B, G})  //! NULL means Retail
  FIRST_NM    VARCHAR(100)
  LAST_NM     VARCHAR(100)
  EMAIL_ADDR  VARCHAR(255)   (pii)
  STATUS      CHAR(1)        (enum {A, I})
}

schema warehouse (note "Analytics Model — PostgreSQL 16") {
  customer_id   UUID         (pk, required)
  customer_type VARCHAR(20)  (enum {retail, business, government})
  display_name  VARCHAR(200) (required)
  email_address VARCHAR(255) (format email)
  is_active     BOOLEAN
}

mapping {
  source { `crm` }
  target { `warehouse` }

  CUST_ID    -> customer_id   { uuid_v5("namespace", CUST_ID) }
  EMAIL_ADDR -> email_address { trim | lowercase | validate_email | null_if_invalid }
  STATUS     -> is_active     { map { A: true, I: false } }

  CUST_TYPE -> customer_type {
    map { R: "retail", B: "business", G: "government", null: "retail" }
  }

  -> display_name {
    "If @CUST_TYPE is null or R, trim and concat @FIRST_NM + ' ' + @LAST_NM.
     Otherwise 'UNKNOWN'."
  }
}

Pipe steps such as trim | lowercase | validate_email | null_if_invalid are natural-language instructions. Quotes are optional; tooling classifies them as nl, not as a separate structural pipeline type. map { ... } remains a structural value-mapping construct.
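
To make the example concrete, here is one possible reading of that mapping as Python, applied to a single row. It is a sketch under stated assumptions, not generated output: the pipe steps and the display_name rule are natural language, so the email pattern, the choice of `uuid.NAMESPACE_URL` for the `"namespace"` placeholder, and the handling of missing names are all judgment calls an implementer (or agent) would have to confirm.

```python
import re
import uuid

# Assumption: "namespace" in the spec is a placeholder; a real project
# would choose and document its own namespace UUID.
CRM_NAMESPACE = uuid.NAMESPACE_URL

# Structural value maps, copied from the two map { ... } blocks.
CUSTOMER_TYPES = {"R": "retail", "B": "business", "G": "government", None: "retail"}
STATUS_FLAGS = {"A": True, "I": False}

# Deliberately loose check; the spec only says "validate_email".
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_email(raw):
    """One reading of: trim | lowercase | validate_email | null_if_invalid."""
    if raw is None:
        return None
    candidate = raw.strip().lower()
    return candidate if EMAIL_PATTERN.match(candidate) else None


def display_name(row):
    """Join names for retail customers (null or R); otherwise 'UNKNOWN'."""
    if row["CUST_TYPE"] in (None, "R"):
        first = (row["FIRST_NM"] or "").strip()
        last = (row["LAST_NM"] or "").strip()
        return f"{first} {last}".strip()
    return "UNKNOWN"


def map_customer(row):
    """Apply every arrow in the mapping block to one source row."""
    return {
        "customer_id": uuid.uuid5(CRM_NAMESPACE, str(row["CUST_ID"])),
        # Unmapped codes raise KeyError, which surfaces gaps in the value map.
        "customer_type": CUSTOMER_TYPES[row["CUST_TYPE"]],
        "display_name": display_name(row),
        "email_address": clean_email(row["EMAIL_ADDR"]),
        "is_active": STATUS_FLAGS[row["STATUS"]],
    }
```

Note how the two kinds of intent land differently: the `map { ... }` blocks translate mechanically into lookup tables, while each natural-language step needed an interpretation that should be reviewed.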

For richer examples, see examples/db-to-db/pipeline.stm, examples/edi-to-json/pipeline.stm, and examples/multi-source/multi-source-hub.stm.

Learn Satsuma

The Lessons are the fastest way to get productive with Satsuma and an AI agent. Ten core lessons build the mental model progressively, and four role-specific playbooks adapt it to how you actually work:

#  | Lesson | What you learn
---|--------|---------------
01 | What Satsuma Is Really For | The hybrid model, three delimiters, setting up your AI agent
02 | Reading Satsuma with an Agent | Schemas, metadata, nested structures, asking the agent to explain
03 | Writing Schemas from Imperfect Inputs | Drafting from DDL, JSON, Excel; preserving ambiguity
04 | Reuse, Imports, and Multi-File Thinking | Fragments, imports, workspace organization
05 | Mapping Blocks | Arrows, transforms, value maps, multi-source mappings
06 | Natural Language Transforms | When to formalize vs. keep it natural, @ref references
07 | Nested Data, Arrays, and Complex Shapes | Dotted paths, array notation, nested arrow blocks
08 | The Satsuma CLI | The CLI as the agent's deterministic toolkit
09 | Human-Agent Workflows | Impact analysis, coverage checks, PII audits, change review
10 | End-to-End Delivery | The full delivery loop from gathering to versioned source of truth

Playbooks: Business Analyst · Data Engineer · Governance & Audit · Integration Engineer

Start with Lesson 01 or jump to a suggested reading path based on your role.

Repository Guide

Current Status

What exists today:

  • the Satsuma v2 language specification
  • a canonical example corpus (25 .stm files covering major integration patterns)
  • a tree-sitter parser (315 corpus tests, all examples parse clean)
  • a TypeScript CLI (satsuma) with commands for structural extraction, analysis, validation, formatting, and diff — see SATSUMA-CLI.md
  • satsuma fmt — opinionated, zero-config formatter (CLI + VS Code Format Document)
  • a VS Code extension with an LSP server (go-to-definition, find-references, completions, hover, rename, code lens, semantic tokens, diagnostics, folding, document symbols, formatting) and TextMate grammar
  • satsuma lint with 3 rules (hidden NL source refs, unresolved NL refs, duplicate definitions) and --fix support
  • namespace support for multi-team, multi-domain platform modelling
  • data modelling conventions for Kimball and Data Vault patterns with canonical examples
  • pre-built CLI release artifacts published on every merge to main

Agent Skills

Satsuma ships a suite of Agent Skills that turn .stm files into a hub for the rest of your stack. Each skill is self-contained, follows the agentskills.io standard, and works with any agent runtime that supports skills (Claude Code, Claude Desktop, etc.).

  • Excel → Satsuma (excel-to-satsuma) — surveys a spreadsheet, generates idiomatic Satsuma with chunked extraction, and self-critiques the output. Includes a Python CLI tool and /excel-to-satsuma slash command. A lite prompt for web LLMs is also available for zero-setup conversion.
  • Satsuma → Excel (satsuma-to-excel) — produces stakeholder-ready Excel workbooks from .stm files. A lite web-LLM prompt is also available.
  • Explain a spec (satsuma-explainer) — plain-English walkthroughs, risk assessments, PII audits, coverage checks, and impact analysis for non-technical stakeholders.
  • dbt → Satsuma (satsuma-from-dbt) — reverse-engineers Satsuma mapping specs from an existing dbt project so you can adopt Satsuma without starting over.
  • Satsuma → dbt (satsuma-to-dbt) — scaffolds an idiomatic dbt project (staging/marts, Kimball stars, Data Vault 2.0, exposures) from .stm specs.
  • Synthetic test data (satsuma-sample-data) — generates realistic CSV/JSON fixtures from a schema, respecting types, enums, PII patterns, and referential integrity across tables.
  • OpenLineage export (satsuma-to-openlineage) — emits OpenLineage events with column-level lineage for Marquez, DataHub, Atlan, and OpenMetadata.

Multi-File Lineage

Satsuma supports multi-file platform modeling through imports and namespaces. That gives tooling a consistent way to traverse lineage and impact across a platform from a chosen entry file, without treating a whole directory as an implicit merged workspace.

In practical terms:

  • library files define reusable schemas, fragments, and lookups
  • integration files define source/target structures and mapping blocks
  • namespace-qualified imports connect those files into one platform graph
  • only symbols reachable through the entry file's import graph are in scope

This matters when multiple teams have similarly named schemas or when lineage needs to cross project boundaries cleanly.
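
The scoping rule can be pictured as a graph traversal from the entry file. The sketch below is a simplified model, not Satsuma's resolver: the file names and symbol labels are hypothetical, the labels are not Satsuma namespace syntax, and an import is treated as bringing in a whole file, which may be coarser than the real import semantics.

```python
from collections import deque

# Hypothetical workspace: each file lists what it imports and what it defines.
WORKSPACE = {
    "platform.stm": {"imports": ["lib/common.stm", "crm/sync.stm"], "defines": []},
    "lib/common.stm": {"imports": [], "defines": ["common.address"]},
    "crm/sync.stm": {"imports": ["lib/common.stm"], "defines": ["crm.customers"]},
    # Sits in the same directory tree but is never imported.
    "billing/draft.stm": {"imports": [], "defines": ["billing.customers"]},
}


def reachable_symbols(entry_file, workspace):
    """Collect symbols from every file reachable through the import graph."""
    seen = {entry_file}
    queue = deque([entry_file])
    symbols = set()
    while queue:
        current = queue.popleft()
        symbols.update(workspace[current]["defines"])
        for imported in workspace[current]["imports"]:
            if imported not in seen:  # guards against cycles and repeat visits
                seen.add(imported)
                queue.append(imported)
    return symbols


print(sorted(reachable_symbols("platform.stm", WORKSPACE)))
```

In this model `billing.customers` never enters scope from `platform.stm`, even though its file lives alongside the others, so it cannot collide with `crm.customers`.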

Parser-First Tooling

Downstream tools should be built on the parser, not on text heuristics.

The parser work lives in tooling/tree-sitter-satsuma/ and is responsible for syntax parsing only. Semantic validation, formatting, import resolution, and code generation should consume the parser output rather than reinterpreting raw Satsuma text.

The same principle applies to AI-agent integrations. Agents should prefer parser-backed CLI output over raw file scraping for structural questions, then apply reasoning only where the language intentionally carries natural-language meaning.

If you are contributing tooling, start here:

Development

Prerequisites

  • Node.js 22+
  • Python 3.12+

Setup

Install all dependencies across every package, build the WASM parser, and compile all bundles in one step from the repo root:

npm run install:all

This is the only setup step needed. It runs npm install in every package, builds satsuma-core, builds the WASM parser (tree-sitter build --wasm), runs the CLI prebuild to copy generated sources and WASM into dist, builds satsuma-viz, and compiles the VS Code LSP server.

To start fresh, wipe all node_modules and dist directories and reinstall:

npm run reinstall

Tree-sitter parser

cd tooling/tree-sitter-satsuma
npm run generate          # regenerate parser from grammar.js
npm test                  # corpus tests + fixture tests + CST tests + smoke tests

Individual test suites:

node_modules/.bin/tree-sitter test --wasm   # corpus tests only (always use --wasm)
python3 scripts/test_fixtures.py            # example and recovery fixtures
python3 scripts/test_cst_summary.py         # CST consumer unit tests
python3 scripts/test_smoke_summary.py       # smoke test all examples

Always use the --wasm flag with the tree-sitter CLI — there is no native build in this repository.

When changing grammar.js, regenerate and commit the generated parser sources:

npm run generate
git add src/

CI will fail if src/ diverges from grammar.js.
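
If you want to catch that locally before pushing, a check along these lines may help. It is a sketch of one way to detect stale generated sources, not the script CI actually runs: it regenerates the parser and then relies on `git diff --exit-code`, which exits non-zero when tracked files differ from the index. The function name is hypothetical.

```python
import subprocess


def generated_sources_are_fresh(parser_dir, runner=subprocess.run):
    """Regenerate the parser, then report whether src/ is unchanged."""
    # Raises CalledProcessError if generation itself fails.
    runner(["npm", "run", "generate"], cwd=parser_dir, check=True)
    # Exit code 0 means no differences in tracked files under src/.
    # Caveat: brand-new untracked files are not reported by git diff.
    diff = runner(["git", "diff", "--exit-code", "--", "src/"], cwd=parser_dir)
    return diff.returncode == 0
```

Called as `generated_sources_are_fresh("tooling/tree-sitter-satsuma")` from the repo root, a `False` result means the generated sources changed and should be staged with `git add src/`.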

Satsuma CLI

cd tooling/satsuma-cli
npm test                  # full test suite
npm link                  # symlink `satsuma` onto your PATH for local use

Quick usage:

satsuma summary examples/db-to-db/pipeline.stm                       # structural overview
satsuma validate examples/db-to-db/pipeline.stm                      # structural + semantic validation
satsuma schema customers examples/db-to-db/pipeline.stm              # show a specific schema
satsuma lineage --from legacy_sqlserver examples/db-to-db/pipeline.stm   # trace data flow

To set up an AI agent, print the built-in reference:

satsuma agent-reference               # print grammar, conventions, and CLI guide
satsuma agent-reference > .github/copilot-instructions.md   # feed to Copilot

To build the distributable tarball (the same way CI and releases do):

npm run pack              # produces satsuma-cli.tgz, verified and smoke-tested

See SATSUMA-CLI.md for the full command reference.

VS Code extension

We haven't published the extension to the official VS Code Marketplace yet. A pre-built .vsix is published on every merge to main. Download it from the latest release and install:

  • In VS Code, open the Command Palette (Cmd+Shift+P on macOS, Ctrl+Shift+P on Windows and Linux) and choose Extensions: Install from VSIX...
  • Navigate to the downloaded .vsix and select it.
  • Commands like Satsuma: Show Warnings are available, and .stm files get syntax highlighting and LSP features.

To build from source or run the test suite:

cd tooling/vscode-satsuma
npm run check             # validate manifest/grammar + run all tests
npm run build             # build client + server + webview bundles

Viz harness (local Playwright tests)

tooling/satsuma-viz-harness/ is a standalone HTTP server and browser client for testing the <satsuma-viz> web component against canonical fixtures — without VS Code or the LSP in the loop.

This is a local developer-machine workflow only. It is not run in CI.

cd tooling/satsuma-viz-harness
npm run build             # build server + client bundles

To run the Playwright test suite, start the sentinel watcher in a separate terminal, then trigger a run by touching the sentinel file:

# Terminal 1 — leave running
./watch-and-test.sh

# Terminal 2 — trigger a run
touch .run-tests
# Results appear in .playwright-results.txt

The watcher kills any stale server on port 3333, runs npx playwright test, writes results to .playwright-results.txt, and resets the sentinel.

The test suite covers:

  • Overview rendering (schema cards, mapping nodes) for the sfdc-to-snowflake fixture
  • Clicking a mapping card to open the detail view
  • Field-hover and navigate event pipeline validation (programmatic dispatch)
  • Cross-file lineage merging for an import-reachable fixture set
  • Layout stability on the sap-po-to-mfcs larger fixture

CI

The full CI picture is documented in CI-WORKFLOWS.md. In brief:

  • CI (ci.yml) runs on every push and pull request to main. It verifies linting, the tree-sitter parser (corpus tests, generated-source staleness, conflict count), the CLI test suite, the VS Code extension tests and LSP server, a CLI tarball pack-and-install smoke test, and the Excel skill tests.
  • Release (release.yml) runs on every push to main to publish a rolling latest pre-release, and on manual dispatch to create a tagged release with changelog-sourced release notes.
  • Security (security.yml) runs on every push and PR, and is also called as a gate by the release workflow. It covers npm audit across all packages and Semgrep SAST.

Grammar conflict count is enforced in CI against tooling/tree-sitter-satsuma/CONFLICTS.expected — update that file when adding or removing documented conflicts.

Contributing

Contributions are most useful when they strengthen the parser-backed ecosystem.

Good contribution areas:

  • grammar and parser improvements
  • corpus tests and malformed-input recovery tests
  • additional canonical examples
  • AST/CST mapping documentation
  • editor tooling and visualization work

When syntax, semantics, or supported constructs are in question, prefer the spec over secondary docs and call out mismatches explicitly.

Concerned About Security?

Satsuma is a local-only, read-only analysis tool — it makes no network calls, stores no credentials, and runs no user-supplied code. The entire toolchain is open source and auditable. Automated security scanning (npm audit, Semgrep, Gitleaks, Dependabot) runs on every pull request, and releases are gated behind passing security checks.

For a full threat model, dependency audit, and enterprise adoption guidance, see the Security Report.

Maintained by Equal Experts

Satsuma is an Equal Experts open-source project, maintained by Equal Experts, the software consultancy that helps clients build the things they need to run their business.

The project was created by Thorben Louw and is now under EE stewardship. EE engineers use Satsuma across client engagements in data engineering and integration work, and we contribute the language and its tooling back to the community under the MIT licence so other teams can use it too.

License

This project is licensed under the MIT License.

Citing Satsuma

If you use Satsuma in academic or technical writing, please cite it. See CITATION.cff for machine-readable citation metadata.

About

A DSL and tooling for source-to-target mappings that works for humans AND AIs.
