Conversation

@msfstef
Contributor

@msfstef msfstef commented Dec 9, 2025

Closes #3451

With @magnetised's move of the handle -> shape and shape -> handle lookups into the ShapeDb abstraction, we no longer need to persist anything else in ShapeStatus.

For snapshot_started?, we can read directly from the storage, which has its own read-through cache. Since this is only called in await_snapshot_start?, we can call it there directly and get rid of all the now-unused APIs.

For latest_offset, I've taken that out of ShapeStatus; instead we read it from storage through the same read-through cache. The ShapeCache APIs are responsible for collating the shape handles found in the indices with an offset from the storage, to maintain the same outward-facing API behaviour.

With this change, ShapeStatus neither recovers any of the metadata (getting rid of 66% of reads on a cold restore) nor persists anything other than the lookups. Everything else is populated upon recovery.

I've gotten rid of all unused APIs and calls in the consumer and shape status, and I took the chance to do some API renaming for consistent behaviour (e.g. fetch_handle_by_shape and fetch_shape_by_handle, both of which return {:ok, res} or :error).

I've also split Storage.get_current_position into Storage.get_latest_offset and Storage.get_pg_snapshot to avoid doing two reads when only one is needed, since in most places we need either one or the other. Only the consumer on initialization needs both, and it can simply call both APIs with the same performance.
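
Roughly, the split looks like this (a sketch only; the delegation through a {module, opts} storage tuple is an assumption, not copied from the actual module):

defmodule Storage do
  # before: get_current_position/1 returned both values, forcing two reads
  # even when the caller only needed one of them
  def get_latest_offset({mod, shape_opts}), do: mod.get_latest_offset(shape_opts)
  def get_pg_snapshot({mod, shape_opts}), do: mod.get_pg_snapshot(shape_opts)
end

# only the consumer needs both, on initialization:
#   latest_offset = Storage.get_latest_offset(storage)
#   pg_snapshot = Storage.get_pg_snapshot(storage)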

I've also changed most ETS tables from ordered_set to set (basically everything except the relation lookups), since most of them don't do any ordered/range lookups.
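
For reference, the distinction at the :ets level (a generic example, not Electric's actual tables):

# ordered_set keeps keys sorted, which only pays off when you need ordered
# or range traversal (e.g. :ets.next/2 or matching on a key range)
_relations = :ets.new(:relation_lookup, [:ordered_set, :public])

# set is a plain hash table: cheaper point lookups and inserts, no ordering
handles = :ets.new(:handle_lookup, [:set, :public])
:ets.insert(handles, {"shape-handle-1", :some_shape})
[{"shape-handle-1", :some_shape}] = :ets.lookup(handles, "shape-handle-1")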

If I have time I will also benchmark cold shape recovery for 50k shapes with this change.

@msfstef msfstef requested review from alco and magnetised December 9, 2025 17:32
@codecov

codecov bot commented Dec 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.13%. Comparing base (850ad3d) to head (fe0d5c8).
⚠️ Report is 4 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3572   +/-   ##
=======================================
  Coverage   88.13%   88.13%           
=======================================
  Files          18       18           
  Lines        1643     1643           
  Branches      409      412    +3     
=======================================
  Hits         1448     1448           
  Misses        193      193           
  Partials        2        2           
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/typescript-client 93.76% <ø> (ø)
packages/y-electric 55.66% <ø> (ø)
typescript 88.13% <ø> (ø)
unit-tests 88.13% <ø> (ø)


@msfstef msfstef force-pushed the msfstef/use-storage-directly-for-metadata branch from 20001a4 to 8ee5277 on December 9, 2025 17:34
@msfstef msfstef self-assigned this Dec 9, 2025
@msfstef
Contributor Author

msfstef commented Dec 10, 2025

benchmark this

@msfstef
Contributor Author

msfstef commented Dec 10, 2025

One thing I figured out is that the read-through cache is not actually active if the shape consumer is not "alive", because when a consumer is shut down we also take down the stack ETS entry:

def terminate(writer_state(opts: opts) = state) do
  close_all_files(state)
  delete_shape_ets_entry(opts.stack_id, opts.shape_handle)
end

defp terminate_writer(state) do
  {writer, state} = Map.pop(state, :writer)

  try do
    if writer, do: ShapeCache.Storage.terminate(writer)
  rescue
    # In the case of shape removal, the deletion of the storage directory
    # may happen before we have a chance to terminate the storage
    File.Error -> :ok
  end

  state
end

So we somehow need to keep this read-through cache alive, or use a different one, since this one is also not alive if the consumer is never started (which would be common with lazy initialization). Open to ideas; ideally I would like to have a single cache rather than a separate one. Perhaps upon recovering shapes we could do an initialization on them that doesn't necessarily start a whole process?

@magnetised would like to hear your thoughts

Contributor

@magnetised magnetised left a comment

Very nice. Feels like we're working with the grain of this thing

end

defp mark_snapshot_started(%State{stack_id: stack_id, shape_handle: shape_handle} = state) do
  :ok = ShapeCache.ShapeStatus.mark_snapshot_as_started(stack_id, shape_handle)
Contributor

re our conversation yesterday about giving shape db some kind of validity flag, is it worth keeping this?

Contributor Author

I think not, because the validity will rely on a snapshot_finished? rather than this snapshot-started flag, so it needs to be implemented in a different way anyway.

ctx do
  expect_shape_status(
    set_latest_offset: fn _, @shape_handle1, _ ->
      raise "The unexpected error"
Contributor

could we keep this test but just move the crash onto a function that is called?

Contributor Author

I tried but couldn't find something that made sense - we are literally crashing the process in the next test (which is exactly what a raise does: it sends an exit signal to itself, AFAIK), so I figured this is unnecessary coverage

@magnetised
Contributor

benchmark this

@magnetised
Contributor

So we somehow need to keep this read-through cache alive, or use a different one, since this one is also not alive if the consumer is never started (which would be common with lazy initialization). Open to ideas; ideally I would like to have a single cache rather than a separate one. Perhaps upon recovering shapes we could do an initialization on them that doesn't necessarily start a whole process?

Good catch. We need some stack-level equivalent to the shape status ets for the storage. We have the Storage.stack_start_link/1 function so that shouldn't be too hard to wrangle in
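
Something along these lines could work (a sketch; StackStorageCache, its table name, and the key layout are all hypothetical, only the stack-level ETS idea is from the discussion):

defmodule StackStorageCache do
  # hypothetical stack-scoped read-through cache for storage metadata;
  # created once per stack (e.g. from Storage.stack_start_link/1) so it
  # survives individual shape consumers shutting down
  def create(stack_id),
    do: :ets.new(table(stack_id), [:set, :public, :named_table])

  def get(stack_id, shape_handle, key, read_from_disk) do
    case :ets.lookup(table(stack_id), {shape_handle, key}) do
      [{_, value}] ->
        value

      [] ->
        # cache miss: fall back to disk and remember the value
        value = read_from_disk.()
        :ets.insert(table(stack_id), {{shape_handle, key}, value})
        value
    end
  end

  defp table(stack_id), do: :"storage_stack_cache_#{stack_id}"
end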

@msfstef msfstef force-pushed the msfstef/use-storage-directly-for-metadata branch from 8ee5277 to 1f87aca on December 11, 2025 09:20
@electric-sql electric-sql deleted a comment from github-actions bot Dec 11, 2025
@msfstef
Contributor Author

msfstef commented Dec 11, 2025

benchmark this

@msfstef msfstef requested a review from magnetised December 11, 2025 14:07
@msfstef
Contributor Author

msfstef commented Dec 11, 2025

@magnetised I've added a read-through cache on the storage layer so that readers read directly from memory when the shape is "sleeping"

@github-actions
Contributor

github-actions bot commented Dec 11, 2025

Benchmark results, triggered for 43ec0

  • write fanout completed

write fanout results

  • unrelated shapes one client latency completed

unrelated shapes one client latency results

  • many shapes one client latency completed

many shapes one client latency results

  • concurrent shape creation completed

concurrent shape creation results

  • diverse shape fanout completed

diverse shape fanout results

Contributor

@magnetised magnetised left a comment

Looks good, though I'm confused why we're only reading from the read-through cache when the writer is inactive. Feels messy to have two modes. Would it be horrible to remove the split?

end
end

defp delete_shape_ets_read_through_cache_entry(stack_id, shape_handle) do
Contributor

nit: I'm all for clear function names but this is a little verbose, no?

Contributor Author

very verbose - but couldn't think of a better name that retains the clarity

WriteLoop.cached_chunk_boundaries(initial_acc)
)

# remove any existing read-through cache now that writer is active
Contributor

because?

Contributor Author

the read-through cache is now unnecessary, and will also be outdated, so I'd rather remove it than leave outdated data hanging in memory

@msfstef
Contributor Author

msfstef commented Dec 11, 2025

Looks good, though I'm confused why we're only reading from the read-through cache when the writer is inactive. Feels messy to have two modes. Would it be horrible to remove the split?

@magnetised the read-through cache is a cache that gets populated on read - the storage_meta cache is a purely write-populated cache. I was slightly worried about any potential conflicts, but that's easily resolvable.

Because storage_meta stores a bunch of things that are only relevant when a writer is active, I wasn't sure if it was reasonable to keep the whole thing around for as long as the shape itself is around. I think you're right that we can consolidate the two; I suppose I was trying not to "touch" the main implementation too much while still implementing the desired behaviour.

If you're willing to have a go at consolidating them I'd be happy to accept some help on this. We could also merge this and do it in a separate PR (?)

Member

@alco alco left a comment

Lovely!

@magnetised
Contributor

@msfstef I switched things so that some keys are always written to the stack cache, even if there's a writer active. Simplifies things at the expense of splitting reads/writes across two tables.

Contributor Author

@msfstef msfstef left a comment

@magnetised love the change - had some comments about naming that makes it very hard to decode what the intent was, and questions around what the existing ETS storage does now that it is not getting updated for these keys

) do
  if old_last_persisted_txn_offset != last_persisted_txn_offset do
    write_metadata!(opts, :last_persisted_txn_offset, last_persisted_txn_offset)
    write_cached_metadata!(opts, :last_persisted_txn_offset, last_persisted_txn_offset)
Contributor Author

Not sure how this affects performance; I had decided to leave it as is purely to avoid doing additional roundtrips to ETS, as there seemed to have been an intentional effort to avoid them.

This call is on the hot path so it's an important thing to note.

Contributor

this last_persisted_txn_offset is available to the read path, so if we don't do a write-through here then reads will go to the disk (or will be stale).
also note that the original does a write to the disk here, then a write to ets immediately below, so there's not a huge difference (apart from having to split the writes between tables, so not being able to issue a single update_element call).

do: Postgrex.query!(ctx.db_conn, stmt)

Process.sleep(100)
Process.sleep(120)
Contributor Author

we could alternatively just have a helper, assert_with_timeout, that retries the assertion for a maximum of 500ms
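
Something like the following could do it (a hypothetical test helper, not existing code; the module name and publication_updated?/1 are made up for illustration):

defmodule Support.AssertWithTimeout do
  # retry the assertions in `fun` until they pass or the timeout elapses
  def assert_with_timeout(fun, timeout \\ 500, interval \\ 20) do
    fun.()
  rescue
    error in [ExUnit.AssertionError] ->
      if timeout > 0 do
        Process.sleep(interval)
        assert_with_timeout(fun, timeout - interval, interval)
      else
        reraise error, __STACKTRACE__
      end
  end
end

# instead of Process.sleep(120) followed by a plain assertion:
#   assert_with_timeout(fn -> assert publication_updated?(ctx) end)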

]

# Record that caches read-through metadata to avoid disk reads
defrecord :storage_meta_stack,
Contributor Author

I have to admit the name storage_meta_stack, as opposed to storage_meta, is a bit confusing - what was the reasoning for this name?

Contributor

not sure what you mean by "as opposed to" here. I renamed it from storage_meta_read_through to storage_meta_stack - the reasoning is that these are the values maintained at the stack level, not by the writer.

I'll admit that the naming isn't 100%. perhaps storage_meta_global or something

Contributor Author

That's what I figured; the confusing aspect is that storage_meta is also stored in what we call the "stack ets" - so the whole thing is a bit hard to decode

a reference to it being a cache record (similar to how you renamed the ets to a cache table) might work?

end

defp write_metadata_cache(%__MODULE__{} = opts, key, value)
     when key in @stack_cached_keys do
Contributor Author

if we're always writing to this, and not the other, should the other not be storing them in ETS at all? This looks like a good change but I'm confused that it's leaving "code residue" in the existing ETS that's a bit hard to decode

Contributor

the existing ets is still used for everything that isn't made available for the read path. also, what do you mean by "the other" here? wanna be sure what you mean.

Contributor Author

by "the other" I do mean the original stack ets - basically I'm asking, given that we're storing the same keys in two different ETS tables and records, how are we ensuring they are never out of sync?

@magnetised magnetised force-pushed the msfstef/use-storage-directly-for-metadata branch from 6f73ec4 to 32829d4 on December 17, 2025 16:25
@magnetised
Contributor

benchmark this

@github-actions
Contributor

github-actions bot commented Dec 17, 2025

Benchmark results, triggered for 32829

  • write fanout completed

write fanout results

  • unrelated shapes one client latency completed

unrelated shapes one client latency results

  • many shapes one client latency completed

many shapes one client latency results

  • concurrent shape creation completed

concurrent shape creation results

  • diverse shape fanout completed

diverse shape fanout results

@magnetised
Contributor

magnetised commented Dec 18, 2025

@msfstef this now uses the stack ets for the read cache and write cache, i've merged it into this commit:

fe0d5c8 (#3572)

the read path is careful about initializing the cache from disk -- there's a potential race with the writer initialising, so it uses insert_new; the write path isn't - it just initializes the cached values because it knows that it's in charge. rather than deleting the cached values when the writer terminates, it just nils the cached values that aren't needed for the write path.
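
For illustration, the difference between the two paths looks roughly like this (table, key, and offset values are made up; only the insert_new vs insert distinction is from the comment above):

# a toy stack-level cache table (names and values are illustrative only)
stack_ets = :ets.new(:storage_meta_stack_cache, [:set, :public])
shape_handle = "shape-1"

# read path: may race with a writer initializing the same keys, so it only
# fills the cache if nothing is there yet
offset_read_from_disk = {0, 0}
:ets.insert_new(stack_ets, {{shape_handle, :latest_offset}, offset_read_from_disk})

# write path: the writer owns these keys and can overwrite unconditionally
new_offset = {100, 0}
:ets.insert(stack_ets, {{shape_handle, :latest_offset}, new_offset})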

i've looked at cleanup as well and decided to always explicitly delete the writer's ets table at the cost of a few extra file writes.

@magnetised magnetised force-pushed the msfstef/use-storage-directly-for-metadata branch from 28d702d to fe0d5c8 on December 18, 2025 10:03
Contributor Author

@msfstef msfstef left a comment

Beautiful, much better @magnetised

My only question is around populating the read-through cache - it seems that if it is missing, we populate it in its entirety with all read-path keys (rather than lazily populating each key) - was this deliberate? Why not load only the desired key with actual data and the rest with placeholders?

:removed -> :ok
end
# always need to terminate writer to remove the writer ets (which belongs
# to this process). leads to unecessary writes in the case of a deleted
Contributor Author

to clarify - by the unnecessary writes you mean that after the shape has been marked for removal, it might receive some events that it will write to the log? in a future PR we could probably just fix that, since we mark the removal anyway

otherwise the terminate_writer technically just closes files (which is a no-op if they were already closed)

Contributor

a notionally deleted shape may have some buffered writes that the Storage.terminate call will flush to disk. we could save those writes, which is what I was doing by deleting the writer, but in the end I think it's worth knowing that the cleanup will always execute.

defp populate_read_through_cache!(%__MODULE__{} = opts, extra_keys) do
  %{shape_handle: handle, stack_ets: stack_ets} = opts

  read_keys = Enum.into(extra_keys, MapSet.new(@read_path_keys))
Contributor Author

so to clarify - if there's no active "cache", then any request for a single key will trigger as many reads as there are @read_path_keys? I thought the idea was to populate it only with the desired key, with everything else getting a :not_cached placeholder

Unless the thought here is that if someone is doing a read, they will probably need all the read path keys anyway, so we might as well load them all up front. Would appreciate knowing what the logic is

Contributor

Unless the thought here is that if someone is doing a read, they will probably need all the read path keys anyway, so we might as well load them all up front

That's the logic -- there is no partial read: if you need one value you'll need them all.

@magnetised magnetised merged commit 3c88770 into main Dec 18, 2025
56 of 57 checks passed
@magnetised magnetised deleted the msfstef/use-storage-directly-for-metadata branch December 18, 2025 10:48
@github-actions
Contributor

This PR has been released! 🚀

The following packages include changes from this PR:

  • @core/sync-service@1.2.10

Thanks for contributing to Electric!
