Skip to content

Comments

feat: add vary dimensions for environment-isolated data storage (#384)#435

Merged
stack72 merged 1 commit intomainfrom
feat/vary-dimensions
Feb 23, 2026
Merged

feat: add vary dimensions for environment-isolated data storage (#384)#435
stack72 merged 1 commit intomainfrom
feat/vary-dimensions

Conversation

@stack72
Copy link
Contributor

@stack72 stack72 commented Feb 23, 2026

Summary

Multi-environment workflows (dev, staging, prod) previously stored all data
versions under a single data name with one shared latest symlink. This caused
interleaved data, environment-blind latest pointers, and unsafe CEL references
that could silently return wrong-environment data.

This change introduces a vary mechanism on dataOutputOverrides that
automatically computes composite data names (e.g., result-prod,
result-dev-us-east-1) from declared dimension keys, keeping each environment's
data isolated with its own versioning and latest symlink.

What changed

Write-side: vary field on dataOutputOverrides

  • Added vary?: string[] to DataOutputOverride interface and Zod schema
  • Created composeDataName() utility that joins base name + vary values with
    hyphens (e.g., composeDataName("result", ["prod"])"result-prod")
  • Vary dimensions are resolved at step execution time from evaluated step inputs
    in execution_service.ts, producing a resolvedVarySuffix on the method
    context
  • Both createResourceWriter and createFileWriterFactory in data_writer.ts
    apply the suffix to instance names before writing
  • Step serialization (fromData/toData) round-trips the vary field

Read-side: CEL data.*() with vary dimensions

  • data.latest(model, spec, [varyValues]) — 3-arg form
  • data.version(model, spec, [varyValues], version) — 4-arg form
  • data.listVersions(model, spec, [varyValues]) — 3-arg form
  • Composite name is resolved in the CEL layer before delegating to the data
    cache — no changes needed to DataNamespace, DataCache, or model code
  • Fixed a latent bug where CEL int values (bigint) didn't match number keys
    in the DataCache Map, causing data.version() to silently return null

Documentation

  • design/workflow.md — vary syntax, on-disk layout, access patterns
  • design/expressions.md — 3-arg data.latest/version/listVersions with vary
  • Skill references updated with forEach + vary examples

Impact on existing workflows

Zero breaking changes. Existing workflows that don't use vary are
completely unaffected:

  • The vary field is optional — omitting it preserves current behavior
  • All 2-argument data.latest(), data.version(), and data.listVersions()
    calls continue to work unchanged
  • Models still call writeResource(specName, instanceName, data) with no
    changes — the vary mechanism intercepts and modifies the instance name only
    when vary is declared
  • The data.version() bigint fix improves reliability for all users, not just
    vary users

What this enables

Workflows can now safely run the same model across multiple environments with
isolated data:

steps:
  - name: scan-${{ self.env }}
    forEach:
      item: env
      in: ${{ inputs.environments }}
    task:
      type: model_method
      modelIdOrName: scanner
      methodName: execute
      inputs:
        environment: ${{ self.env }}
    dataOutputOverrides:
      - specName: result
        vary:
          - environment

This produces isolated on-disk storage per environment:

.swamp/data/scanner/{id}/
  result-dev/      → own versioning, own latest symlink
  result-staging/  → own versioning, own latest symlink
  result-prod/     → own versioning, own latest symlink

Downstream steps access the correct environment's data dynamically:

scanResult: ${{ data.latest('scanner', 'result', [self.env]).attributes.count }}

Plan vs implementation

All planned items were implemented with three deviations:

  1. Number() conversion in CEL — Not in plan. Discovered during integration
    testing that CEL int values don't match number keys in JavaScript Maps.
    Fixed in both 3-arg and 4-arg data.version() paths.
  2. Documentation examples — Initial implementation used hardcoded values
    like ['prod'] which defeats vary's purpose. Fixed to show dynamic access
    via self.env and inputs.environment.
  3. Test consolidation — Plan called for separate integration test files.
    Combined into integration/vary_test.ts for cohesion.

Test coverage

  • 8 unit tests for composeDataName() utility
  • 7 unit tests for CEL evaluator vary overloads
  • 4 unit tests for data writer with resolvedVarySuffix
  • 7 integration tests including full CLI workflow run with forEach + vary
  • All 1862 tests pass

UAT tracking

UAT scenarios filed at systeminit/swamp-uat#25 covering multi-environment
end-to-end workflows, cross-workflow data access, re-run versioning, and error
cases.

Closes #384

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

## Summary

Multi-environment workflows (dev, staging, prod) previously stored all data
versions under a single data name with one shared `latest` symlink. This caused
interleaved data, environment-blind `latest` pointers, and unsafe CEL references
that could silently return wrong-environment data.

This change introduces a `vary` mechanism on `dataOutputOverrides` that
automatically computes composite data names (e.g., `result-prod`,
`result-dev-us-east-1`) from declared dimension keys, keeping each environment's
data isolated with its own versioning and `latest` symlink.

## What changed

### Write-side: `vary` field on `dataOutputOverrides`

- Added `vary?: string[]` to `DataOutputOverride` interface and Zod schema
- Created `composeDataName()` utility that joins base name + vary values with
  hyphens (e.g., `composeDataName("result", ["prod"])` → `"result-prod"`)
- Vary dimensions are resolved at step execution time from evaluated step inputs
  in `execution_service.ts`, producing a `resolvedVarySuffix` on the method
  context
- Both `createResourceWriter` and `createFileWriterFactory` in `data_writer.ts`
  apply the suffix to instance names before writing
- Step serialization (`fromData`/`toData`) round-trips the `vary` field

### Read-side: CEL `data.*()` with vary dimensions

- `data.latest(model, spec, [varyValues])` — 3-arg form
- `data.version(model, spec, [varyValues], version)` — 4-arg form
- `data.listVersions(model, spec, [varyValues])` — 3-arg form
- Composite name is resolved in the CEL layer before delegating to the data
  cache — no changes needed to `DataNamespace`, `DataCache`, or model code
- Fixed a latent bug where CEL `int` values (bigint) didn't match `number` keys
  in the DataCache Map, causing `data.version()` to silently return null

### Documentation

- `design/workflow.md` — vary syntax, on-disk layout, access patterns
- `design/expressions.md` — 3-arg `data.latest/version/listVersions` with vary
- Skill references updated with forEach + vary examples

## Impact on existing workflows

**Zero breaking changes.** Existing workflows that don't use `vary` are
completely unaffected:

- The `vary` field is optional — omitting it preserves current behavior
- All 2-argument `data.latest()`, `data.version()`, and `data.listVersions()`
  calls continue to work unchanged
- Models still call `writeResource(specName, instanceName, data)` with no
  changes — the vary mechanism intercepts and modifies the instance name only
  when `vary` is declared
- The `data.version()` bigint fix improves reliability for all users, not just
  vary users

## What this enables

Workflows can now safely run the same model across multiple environments with
isolated data:

```yaml
steps:
  - name: scan-${{ self.env }}
    forEach:
      item: env
      in: ${{ inputs.environments }}
    task:
      type: model_method
      modelIdOrName: scanner
      methodName: execute
      inputs:
        environment: ${{ self.env }}
    dataOutputOverrides:
      - specName: result
        vary:
          - environment
```

This produces isolated on-disk storage per environment:

```
.swamp/data/scanner/{id}/
  result-dev/      → own versioning, own latest symlink
  result-staging/  → own versioning, own latest symlink
  result-prod/     → own versioning, own latest symlink
```

Downstream steps access the correct environment's data dynamically:

```yaml
scanResult: ${{ data.latest('scanner', 'result', [self.env]).attributes.count }}
```

## Plan vs implementation

All planned items were implemented with three deviations:

1. **`Number()` conversion in CEL** — Not in plan. Discovered during integration
   testing that CEL `int` values don't match `number` keys in JavaScript Maps.
   Fixed in both 3-arg and 4-arg `data.version()` paths.
2. **Documentation examples** — Initial implementation used hardcoded values
   like `['prod']` which defeats vary's purpose. Fixed to show dynamic access
   via `self.env` and `inputs.environment`.
3. **Test consolidation** — Plan called for separate integration test files.
   Combined into `integration/vary_test.ts` for cohesion.

## Test coverage

- 8 unit tests for `composeDataName()` utility
- 7 unit tests for CEL evaluator vary overloads
- 4 unit tests for data writer with `resolvedVarySuffix`
- 7 integration tests including full CLI workflow run with forEach + vary
- All 1862 tests pass

## UAT tracking

UAT scenarios filed at systeminit/swamp-uat#25 covering multi-environment
end-to-end workflows, cross-workflow data access, re-run versioning, and error
cases.

Closes #384

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@stack72 stack72 force-pushed the feat/vary-dimensions branch from 0df2b73 to 600126a Compare February 23, 2026 22:46
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat: add vary dimensions for environment-isolated data storage

Verdict: APPROVED

This PR introduces a well-designed vary mechanism for isolating data per environment in forEach workflows. The implementation is clean, follows project guidelines, and maintains full backward compatibility.

Code Quality ✓

  • TypeScript strict mode: No any types; proper use of unknown for CEL dynamic values
  • Named exports: All exports are named (e.g., composeDataName from mod.ts)
  • Copyright headers: Present on all new files
  • Error handling: Proper UserError for missing vary dimensions with clear messages

Domain-Driven Design ✓

  • composeDataName: Pure utility function appropriately placed in src/domain/data/
  • DataOutputOverride: Value Object with immutable semantics (documented in source)
  • Vary resolution: Happens in execution_service.ts (Domain Service) - correct layer for orchestrating step execution
  • CEL changes: Infrastructure layer (src/infrastructure/cel/) - appropriate separation

Test Coverage ✓

  • 8 unit tests for composeDataName() including edge cases (empty strings, whitespace)
  • 7 unit tests for CEL evaluator vary overloads
  • 4 unit tests for data writer with resolvedVarySuffix
  • 7 integration tests including full CLI workflow run with forEach + vary
  • Tests co-located with source files per project conventions

Security ✓

  • Input validation for empty base names and vary values
  • Error messages don't expose sensitive information
  • No injection vulnerabilities identified

Backward Compatibility ✓

  • All 2-argument data.latest(), data.version(), data.listVersions() calls unchanged
  • vary field is optional - existing workflows unaffected
  • The bigint→number fix in data.version() improves reliability for all users

Suggestions (non-blocking)

  1. Special character handling in vary values: composeDataName doesn't sanitize vary values for characters that could be problematic in file paths (e.g., /, \, ..). Current usage appears safe since values come from controlled inputs, but consider adding validation if vary values could come from untrusted sources in the future.

  2. Explicit bigint test: The PR mentions fixing a latent bigint comparison bug. Consider adding a test that explicitly verifies CEL int (bigint) values work correctly with data.version() to prevent regression.


The documentation updates in design/*.md and skill references are thorough and provide clear examples. The PR description is excellent with clear before/after explanations.

LGTM!

@stack72 stack72 merged commit d5dacad into main Feb 23, 2026
3 checks passed
@stack72 stack72 deleted the feat/vary-dimensions branch February 23, 2026 22:51
stack72 added a commit that referenced this pull request Feb 23, 2026
…gression test

Address review feedback from #435:

1. Path safety validation in composeDataName(): Vary values that contain
   path separators (/ or \) or relative path components (. or ..) are now
   rejected with clear error messages. This prevents vary values from
   creating data outside the expected directory structure, hardening
   against future use cases where vary values may come from untrusted
   sources.

2. Explicit bigint regression test: Added a test that simulates the
   DataCache's strict number-keyed Map.get() behavior to verify that CEL
   int values (which may be bigint) are correctly converted to number
   before cache lookup. This covers both the 3-arg and 4-arg (with vary)
   forms of data.version(), preventing regression of the fix from #435.

Test coverage:
- 4 new unit tests for path safety (forward slash, backslash, .., .)
- 1 new regression test for bigint-to-number conversion
- All 1867 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support data vary dimensions for environment-isolated data storage

1 participant