Support data vary dimensions for environment-isolated data storage #384

@stack72

Problem

In multi-environment workflows, the same model runs for dev, staging, and prod, but all data versions are stored under a single data name with one shared latest symlink:

.swamp/data/command/shell/{model-id}/result/
├── 1/       (dev run)
├── 2/       (prod run)
├── 3/       (dev run)
└── latest -> 3   # points to dev, not prod

This causes three problems:

  1. Interleaved data: versions from different environments are mixed together with no structural separation
  2. latest is environment-blind: it always points to whichever environment ran most recently, regardless of which environment you care about
  3. Unsafe CEL references: model["deploy-app"].resource.result.result.attributes.exitCode silently returns data from the wrong environment if another environment ran more recently

Proposed Solution: Vary Dimensions

Rather than overriding the data name with a CEL expression (manual name construction), introduce a vary mechanism. You declare which dimensions the data should vary by, and the system computes the storage path automatically.

The composite key for data becomes: specName + dataName + vary1 + vary2 + ...

This is an extension, not an override — the vary dimensions are appended to the base specName+dataName to create isolated storage paths.
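A minimal sketch of how the data-name part of that key could be computed. The helper name composeDataName and the "-" separator are assumptions for illustration (the separator is inferred from directory names such as result-dev used in this issue), not existing code:

```typescript
// Hypothetical helper: name and separator are assumptions, not swamp's API.
function composeDataName(baseDataName: string, varyValues: string[]): string {
  // With no vary dimensions the base name comes back unchanged,
  // so data written without `vary` keeps its current path.
  return [baseDataName, ...varyValues].join("-");
}

const unvaried = composeDataName("result", []);
const singleDimension = composeDataName("result", ["dev"]);
const twoDimensions = composeDataName("result", ["dev", "us-east-1"]);

console.log(unvaried);        // result
console.log(singleDimension); // result-dev
console.log(twoDimensions);   // result-dev-us-east-1
```

Because the values are appended in declaration order, the order of entries under vary would determine the resulting path.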

How it works

The workflow step declares which dimensions the data should vary by:

steps:
  - name: deploy-app
    task:
      type: model_method
      modelIdOrName: deploy-app
      methodName: execute
      inputs:
        environment: ${{ inputs.environment }}
    dataOutputOverrides:
      - specName: result
        vary:
          - environment

When environment=dev, the data is stored under a composite name that includes the vary dimension. When environment=prod, it gets a separate path. Each gets its own latest symlink.

Result on disk

Each environment gets its own data path with its own latest symlink:

.swamp/data/command/shell/{model-id}/
├── result-dev/
│   ├── 1/
│   ├── 2/
│   └── latest -> 2     # latest dev, always correct
├── result-staging/
│   ├── 1/
│   └── latest -> 1     # latest staging, always correct
└── result-prod/
    ├── 1/
    ├── 2/
    ├── 3/
    └── latest -> 3     # latest prod, always correct
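The effect on latest can be illustrated with a small in-memory stand-in for the directory layout. VersionStore is a hypothetical class written for this example; it only models "one version counter and one latest pointer per data name" and does not touch the filesystem:

```typescript
interface VersionRecord {
  version: number;
  env: string;
}

// Hypothetical in-memory model of the per-data-name version directories.
class VersionStore {
  private readonly byName = new Map<string, VersionRecord[]>();

  // Append a new version under dataName and return its number.
  write(dataName: string, env: string): number {
    const records = this.byName.get(dataName) ?? [];
    const version = records.length + 1;
    records.push({ version, env });
    this.byName.set(dataName, records);
    return version;
  }

  // Plays the role of the `latest` symlink for one data name.
  latest(dataName: string): VersionRecord | undefined {
    const records = this.byName.get(dataName);
    return records?.[records.length - 1];
  }
}

const runs = ["dev", "prod", "dev"];

// Today: every run writes to the same data name.
const shared = new VersionStore();
for (const env of runs) shared.write("result", env);

// With vary: each environment writes to its own data name.
const varied = new VersionStore();
for (const env of runs) varied.write(`result-${env}`, env);

console.log(shared.latest("result"));      // { version: 3, env: 'dev' }
console.log(varied.latest("result-prod")); // { version: 1, env: 'prod' }
console.log(varied.latest("result-dev"));  // { version: 2, env: 'dev' }
```

With the shared name, a consumer interested in prod reads a dev result; with varied names, each lookup stays within its environment.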

Multiple vary dimensions

Vary is composable. Multiple dimensions create further isolation:

dataOutputOverrides:
  - specName: result
    vary:
      - environment
      - region

With environment=dev and region=us-east-1, the composite key becomes specName + dataName + dev + us-east-1, stored at something like result-dev-us-east-1/.

CEL access

Existing CEL patterns work with the computed data names. Because the computed names contain hyphens, which CEL would parse as subtraction in dot notation, they are accessed with index syntax:

# Access latest prod result
${{ model["deploy-app"].resource.result["result-prod"].attributes.exitCode }}

# Access latest dev us-east-1 result
${{ model["deploy-app"].resource.result["result-dev-us-east-1"].attributes.exitCode }}

# data.latest() also works
${{ data.latest("deploy-app", "result-prod").attributes.exitCode }}

Why Vary Instead of Name Override

The original proposal was to add a name field to dataOutputOverrides with CEL expressions to construct data names manually. The vary approach is better because:

  • No manual name construction: you don't write CEL to build names, you just declare dimensions
  • The system knows the dimensions: it can list, query, and reason about them
  • Composable: vary: [environment, region] naturally extends to multi-dimensional isolation
  • It's additive (extension), not destructive (override): the base specName+dataName stays intact

What This Replaces

This approach subsumes several related issues by solving the core data isolation problem through naming:

Current Codebase Context

The foundation for this already exists:

  • specName vs dataName are already distinct concepts in the codebase. specName is the key in the model's resources map; dataName is the on-disk directory name (second arg to writeResource(specName, dataName, data)).
  • latest symlink is already per-dataName, so varying the dataName automatically gives each variant its own latest.
  • CEL context already indexes as model[name].resource[specName][dataName], so varied data names populate naturally.
  • DataOutputOverride currently has: specName, lifetime, garbageCollection, tags.
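To make the indexing in the third bullet concrete, here is a rough TypeScript model of a context shaped as model[name].resource[specName][dataName]. The type names and the exitCode values are invented for illustration; the real CEL context may carry more fields:

```typescript
// Hypothetical types mirroring model[name].resource[specName][dataName].
type Attributes = Record<string, unknown>;
type ResourceIndex = Record<string, Record<string, { attributes: Attributes }>>;
type ModelContext = Record<string, { resource: ResourceIndex }>;

// Illustrative data only: two varied data names under the same specName.
const model: ModelContext = {
  "deploy-app": {
    resource: {
      result: {
        "result-dev": { attributes: { exitCode: 1 } },
        "result-prod": { attributes: { exitCode: 0 } },
      },
    },
  },
};

// Varied names are just additional keys at the dataName level.
const prodExit = model["deploy-app"].resource["result"]["result-prod"].attributes.exitCode;
const devExit = model["deploy-app"].resource["result"]["result-dev"].attributes.exitCode;

console.log(prodExit, devExit); // 0 1
```

Nothing about the lookup changes when vary is used; the dataName level simply gains one key per combination of dimension values.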

Implementation

  1. Add a vary field to DataOutputOverride — an array of input/context key names
  2. At runtime, resolve the vary values from the step's inputs/context
  3. Compute the composite dataName by appending the resolved vary values to the base dataName
  4. Pass the composite dataName through to writeResource() / createFileWriter()
  5. The CEL context, latest symlinks, and data queries all work automatically since they're already keyed by dataName
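Steps 1 to 3 could look roughly like the sketch below. It models only the specName and vary fields of DataOutputOverride, and the function name resolveDataName, the "-" separator, and the choice to throw on a missing or non-string value are all assumptions rather than decided behaviour:

```typescript
// Hypothetical, reduced shape: the real type also has lifetime,
// garbageCollection and tags, which are omitted here.
interface DataOutputOverride {
  specName: string;
  vary?: string[];
}

// Resolve each vary key from the step inputs and append it to the base name.
function resolveDataName(
  override: DataOutputOverride,
  baseDataName: string,
  inputs: Record<string, unknown>,
): string {
  const parts = [baseDataName];
  for (const key of override.vary ?? []) {
    const value = inputs[key];
    // Assumption: an unresolved dimension is an error, because silently
    // falling back to the base name would mix environments again.
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`vary dimension "${key}" has no string value in step inputs`);
    }
    parts.push(value);
  }
  return parts.join("-");
}

const override: DataOutputOverride = { specName: "result", vary: ["environment", "region"] };
const dataName = resolveDataName(override, "result", {
  environment: "prod",
  region: "us-east-1",
});
console.log(dataName); // result-prod-us-east-1
```

The resulting string is what step 4 would hand to writeResource() / createFileWriter() in place of the base dataName.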

Key files

  • src/domain/models/data_output_override.ts — add vary field to type + schema
  • src/domain/models/data_writer.ts — compute composite dataName from vary values
  • src/domain/models/method_execution_service.ts — pass vary context through
  • src/domain/workflows/execution_service.ts — resolve vary values from step inputs

Use Case

A team has a single deploy-pipeline workflow reused across dev, staging, and prod:

swamp workflow run deploy-pipeline --input-file inputs/dev.yaml
swamp workflow run deploy-pipeline --input-file inputs/prod.yaml

Each run produces data under a vary-extended name. The latest symlink for each environment is always correct. Downstream models can reference a specific environment's data via CEL without risk of reading the wrong environment's data.

Metadata

Labels

  • beta: Issues required to close out before public beta
  • enhancement: New feature or request
  • in-discussion: A feature or issue that is in active discussion
