Skip to content

Comments

refactor: unify data access — make data.latest() the canonical accessor#453

Merged
stack72 merged 1 commit intomainfrom
unify-data-access
Feb 24, 2026
Merged

refactor: unify data access — make data.latest() the canonical accessor#453
stack72 merged 1 commit intomainfrom
unify-data-access

Conversation

@stack72
Copy link
Contributor

@stack72 stack72 commented Feb 24, 2026

Closes #434

Summary

  • Remove the DataCache class and replace with direct synchronous disk reads via Deno's sync filesystem APIs
  • data.latest() now always returns fresh on-disk state instead of reading from a snapshot taken at buildContext() time
  • model.*.resource and model.*.file CEL patterns are deprecated with warnings but still work for backward compatibility

Why this is the right move

The DataCache was an implementation accident, not a design choice. It existed solely because CEL evaluation is synchronous but all UnifiedDataRepository methods were async. Deno provides sync filesystem APIs (readTextFileSync, readLinkSync, readDirSync, readFileSync), making the cache unnecessary complexity.

Simplicity of reasoning. Before this change, users and contributors had to think about "which accessor sees what data when" — data.latest() read from a cache snapshot taken at buildContext() time, and model.*.resource was eagerly populated from a findAllGlobal() scan. After this change, data.latest() always means "read the latest from disk" — no hidden staleness, no snapshot timing to reason about. The behavior is obvious.

Alpha is the right time. The user base is small, model.*.resource is embedded in docs/skills/examples, and this kind of simplification only gets harder later. We keep backward compatibility with deprecation warnings so nothing breaks immediately.

The alternative is worse. Keeping the cache but updating it after each workflow step would leave unnecessary abstraction in place, add update logic, and make the code harder to reason about ("is the cache up to date? when was it last refreshed?"). Removing it entirely is cleaner.

What changed

Core implementation

  • Removed DataCache class from model_resolver.ts — the unnecessary abstraction that bridged sync CEL with async repo methods
  • Added ModelCoordinates interface and ModelCoordinatesMap type — maps model names to disk coordinates (modelType + modelId), supporting orphan recovery when models are deleted and recreated with new UUIDs
  • Added 5 sync methods to UnifiedDataRepository interface and FileSystemUnifiedDataRepository:
    • getLatestVersionSync — reads latest symlink, falls back to dir scan
    • findByNameSync — resolves version, reads metadata.yaml
    • listVersionsSync — scans version directories
    • getContentSync — reads raw content file
    • findAllForModelSync — scans all data names for a model
  • Rewrote data.* CEL namespace functions to use sync disk reads instead of cache lookups
  • Fixed findByTag deduplication — prevents duplicate records when data exists under both current and orphan coordinates
  • Fixed findByTag/findBySpec to scan all versions — not just latest

Deprecation

  • Added deprecation warnings in CelEvaluator for model.*.resource and model.*.file patterns via LogTape logger with deduplication
  • model.*.resource and model.*.file are still eagerly populated in buildContext() for backward compatibility — they work but emit warnings
  • model.*.input, model.*.definition, model.*.execution are NOT deprecated — these are model metadata, not versioned data

Documentation

  • Updated design/expressions.md and design/vaults.md — examples use data.latest() instead of deprecated patterns
  • Updated repo_service.ts CLAUDE.md guidance — now recommends data.latest() over model.*.resource
  • Updated skill references (swamp-model, swamp-workflow) — marked data.latest() as preferred, model.*.resource as deprecated

Tests

  • Rewrote model_resolver_test.ts — removed DataCache tests, added 8+ sync behavior tests including deduplication and post-buildContext freshness
  • Added 8 sync method tests to unified_data_repository_test.ts
  • Added 3 deprecation warning tests to cel_evaluator_test.ts
  • Added integration test verifying data.latest() sees data written after buildContext()
  • Fixed 6 mock files to implement new sync interface methods

Plan vs implementation

All 5 planned steps were fully implemented as designed:

Step Planned Result
1. Add sync read methods to repo 5 sync methods + make helpers public Fully implemented
2. Remove DataCache, rewrite data.* functions Delete cache, add coords map, sync reads Fully implemented
3. Add deprecation warnings Regex detection, LogTape logger, dedup Fully implemented
4. Update tests Remove cache tests, add sync tests Fully implemented
5. Update documentation Design docs, skill refs, repo guidance Fully implemented

Three unplanned items were discovered and fixed during testing:

  1. 6 mock UnifiedDataRepository implementations needed sync method stubs to pass type checking
  2. findByTag/findBySpec initially only scanned latest version — fixed to scan all versions
  3. findByTag needed deduplication logic to prevent duplicate records when data exists under both current and orphan coordinates

Known trade-off

findByTag() and findBySpec() scan all metadata files synchronously on every call — O(total_data_items). This is acceptable for alpha: these functions are rarely used compared to data.latest(), the data is local filesystem (not network I/O), and if performance becomes a problem the fix (lightweight tag index during findAllGlobal()) is well-understood.

User impact

  • data.latest() now always returns fresh data — no more stale snapshots from buildContext() time. If step A writes data, step B's data.latest() sees it immediately.
  • model.*.resource/model.*.file still work but emit deprecation warnings; users should migrate to data.latest()
  • No breaking changes — all existing workflows and models continue to work without modification
  • Vary dimensions fully supported via data.latest() 3-argument form
  • Migration path is clear — replace model.<name>.resource.<spec>.<instance>.attributes.<field> with data.latest("<name>", "<dataName>").attributes.<field>

Test plan

  • deno check — passes
  • deno lint — passes
  • deno fmt — passes
  • deno run test — 1926 tests, 0 failures
  • deno run compile — binary compiles successfully
  • Integration tests for cel_data_access — 20/20 pass
  • Integration tests for data_tagging — 13/13 pass

🤖 Generated with Claude Code

Remove the DataCache class and replace it with direct synchronous disk reads
via Deno's sync filesystem APIs. This makes `data.latest()` always return
fresh on-disk state instead of reading from a snapshot taken at buildContext()
time. The `model.*.resource` and `model.*.file` CEL patterns are deprecated
with warnings, but still work for backward compatibility.

## Why this change

The DataCache was an implementation accident, not a design choice. It existed
solely because CEL evaluation is synchronous but all UnifiedDataRepository
methods were async. Deno provides sync filesystem APIs (readTextFileSync,
readLinkSync, readDirSync, readFileSync), making the cache unnecessary.

Before this change, users had to reason about "which accessor sees what data
when" — `data.latest()` read from a cache snapshot, and `model.*.resource`
was eagerly populated. After this change, `data.latest()` always means "read
the latest from disk" with no hidden staleness or snapshot timing to reason
about.

Alpha is the right time for this change. The user base is small, and this kind
of simplification only gets harder later. The alternative — keeping the cache
but updating it after each step — would leave unnecessary abstraction in place
and make the code harder to reason about.

## What changed

### Core implementation

- **Removed `DataCache` class** from `model_resolver.ts` — the unnecessary
  abstraction that bridged sync CEL with async repo methods
- **Added `ModelCoordinates` interface and `ModelCoordinatesMap` type** — maps
  model names to disk coordinates (modelType + modelId), supporting orphan
  recovery when models are deleted and recreated with new UUIDs
- **Added 5 sync methods** to `UnifiedDataRepository` interface and
  `FileSystemUnifiedDataRepository`:
  - `getLatestVersionSync` — reads `latest` symlink, falls back to dir scan
  - `findByNameSync` — resolves version, reads metadata.yaml
  - `listVersionsSync` — scans version directories
  - `getContentSync` — reads raw content file
  - `findAllForModelSync` — scans all data names for a model
- **Rewrote `data.*` CEL namespace functions** to use sync disk reads instead
  of cache lookups
- **Fixed `findByTag` deduplication** — prevents duplicate records when data
  exists under both current and orphan coordinates
- **Fixed `findByTag`/`findBySpec` to scan all versions** — not just latest

### Deprecation

- **Added deprecation warnings** in `CelEvaluator` for `model.*.resource` and
  `model.*.file` patterns via LogTape logger with deduplication
- `model.*.resource` and `model.*.file` are still eagerly populated in
  `buildContext()` for backward compatibility — they work but emit warnings
- `model.*.input`, `model.*.definition`, `model.*.execution` are NOT
  deprecated — these are model metadata, not versioned data

### Documentation

- Updated `design/expressions.md` and `design/vaults.md` — examples use
  `data.latest()` instead of deprecated patterns
- Updated `repo_service.ts` CLAUDE.md guidance — now recommends `data.latest()`
  over `model.*.resource`
- Updated skill references (swamp-model, swamp-workflow) — marked `data.latest()`
  as preferred, `model.*.resource` as deprecated

### Tests

- Rewrote `model_resolver_test.ts` — removed DataCache tests, added 8+ sync
  behavior tests including deduplication and post-buildContext freshness
- Added 8 sync method tests to `unified_data_repository_test.ts`
- Added 3 deprecation warning tests to `cel_evaluator_test.ts`
- Added integration test verifying `data.latest()` sees data written after
  `buildContext()`
- Fixed 6 mock files to implement new sync interface methods

## Plan vs implementation

All 5 planned steps were fully implemented. Three unplanned items were
discovered and fixed during testing:

1. 6 mock `UnifiedDataRepository` implementations needed sync method stubs
2. `findByTag`/`findBySpec` initially only scanned latest version — fixed to
   scan all versions
3. `findByTag` needed deduplication logic for orphan coordinate overlap

## Known trade-off

`findByTag()` and `findBySpec()` scan all metadata files synchronously on
every call — O(total_data_items). This is acceptable for alpha: these functions
are rarely used, the data is local filesystem, and if it becomes a problem the
fix (lightweight tag index during `findAllGlobal()`) is well-understood.

## User impact

- **`data.latest()` now always returns fresh data** — no more stale snapshots
  from buildContext() time
- **`model.*.resource`/`model.*.file` still work** but emit deprecation
  warnings; users should migrate to `data.latest()`
- **No breaking changes** — all existing workflows and models continue to work
- **Vary dimensions fully supported** via `data.latest()` 3-argument form

## Verification

- `deno check` — passes
- `deno lint` — passes
- `deno fmt` — passes
- `deno run test` — 1926 tests, 0 failures
- `deno run compile` — binary compiles successfully

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved

This is a well-structured refactoring that simplifies data access by removing the DataCache abstraction and replacing it with direct synchronous disk reads.

Code Quality & CLAUDE.md Compliance ✓

  • TypeScript strict mode: No any types found
  • Named exports: All exports use named exports
  • AGPLv3 headers: Present on all modified files
  • Test coverage: Comprehensive unit tests (8+ sync method tests in unified_data_repository_test.ts, rewritten model_resolver_test.ts, 3 deprecation tests in cel_evaluator_test.ts) and integration tests verifying data.latest() freshness

Domain-Driven Design ✓

  • ModelCoordinates and ModelCoordinatesMap are proper value objects (no identity, equality by value)
  • UnifiedDataRepository correctly abstracts persistence as a Repository for the Data aggregate
  • ModelResolver acts as a domain service for resolving model references
  • Sync methods are correctly added to the repository interface, not mixed into domain logic

Architecture ✓

  • Clean removal of unnecessary DataCache abstraction
  • data.latest() now always returns fresh on-disk state (no staleness issues)
  • Backward compatibility maintained via deprecation warnings for model.*.resource and model.*.file
  • Error handling in sync methods is consistent with existing async patterns (catches NotFound, rethrows others)

Security ✓

  • No security vulnerabilities identified
  • All file operations are read-only
  • Proper YAML parsing via existing parseYaml utility

Suggestions (non-blocking)

  1. Inline O(n) documentation: Consider adding a code comment in findByTag/findBySpec noting the O(total_data_items) scan characteristic (currently only in PR description)

  2. Minor encapsulation change: getDataNameDir and getMetadataPath changed from private to public — acceptable for the implementation, but worth noting for future maintainers

The known trade-off (O(n) scan in findByTag/findBySpec) is appropriate for alpha as documented in the PR.

🤖 Generated with Claude Code

@stack72 stack72 merged commit dff8ccb into main Feb 24, 2026
4 checks passed
@stack72 stack72 deleted the unify-data-access branch February 24, 2026 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discussion: Unify data access patterns for vary dimensions

1 participant