Add metrics for loading data from storage #696

DZakh · 2025-08-19T14:01:12Z

A lot of useful metrics for db load performance insights.

Summary by CodeRabbit

New Features
- Added storage-load metrics with per-operation timing, total time, counts, and size tracking; wired into ID- and field-based load paths.
Refactor
- Replaced the reorg detection counter with a gauge using improved labeling for clearer monitoring.
Chores
- Removed the deprecated per-partition fetched-block metric and stopped emitting it to streamline telemetry.

coderabbitai · 2025-08-19T14:01:19Z

Walkthrough

The changes remove the PartitionBlockFetched metric, switch ReorgCount from a Counter to a SafeGauge with int chainId labels, add a new StorageLoad instrumentation module, integrate StorageLoad timing and size metrics into LoadLayer’s load paths, and delete PartitionBlockFetched emissions from GlobalState.

Changes

Cohort / File(s)	Summary
Prometheus metrics overhaul `codegenerator/cli/npm/envio/src/Prometheus.res`	Removed `PartitionBlockFetched`. Replaced `ReorgCount`'s Counter with a `SafeGauge` using int chainId labels. Added `StorageLoad` module with operationLabelsSchema, multiple SafeCounter metrics (time, totalTime, count, where-size, size), `operationRef` state, `startOperation` and `endOperation` lifecycle logic.
Load instrumentation `codegenerator/cli/templates/static/codegen/src/LoadLayer.res`	Added `Prometheus.StorageLoad.startOperation` at load start and `endOperation` at completion for `loadById` and `loadByField`. `loadByField` accumulates sizes across batches; whereSize and size passed to `endOperation`. No public signatures changed; errors do not call `endOperation`.
GlobalState metric removal `codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res`	Removed emission of `Prometheus.PartitionBlockFetched` inside `validatePartitionQueryResponse` when `Env.Benchmark.shouldSaveData` is true; remaining logic unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant LL as LoadLayer
  participant DB as Database
  participant PM as Prometheus.StorageLoad

  rect rgba(230,245,255,0.6)
    note over LL,PM: New instrumentation
    Caller->>LL: loadById / loadByField(key,...)
    LL->>PM: startOperation(operation=key)
    PM-->>LL: timerRef
  end

  LL->>DB: fetch entities (batched for loadByField)
  DB-->>LL: entities (one or many batches)
  LL->>LL: accumulate size (loadByField)

  rect rgba(230,255,230,0.6)
    LL->>PM: endOperation(timerRef, operation=key, whereSize, size)
    PM-->>LL: metrics recorded (time, totalTime, counts, sizes)
  end

  alt Error occurs
    note over LL,PM: endOperation not invoked (existing behavior)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Indexing performance metrics #545 — touches GlobalState.res's validatePartitionQueryResponse; related to metric emissions and may conflict with or duplicate changes.

Suggested reviewers

JasoonS
JonoPrest

Poem

A bunny times each hop and bite,
Counts the loads by day and night.
Gauges hum where counters slept,
Timers start and sizes kept.
Hooray — the metrics rabbit hopped with delight. 🥕

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0723a81 and 0d35ad8.

📒 Files selected for processing (1)

codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res (0 hunks)

💤 Files with no reviewable changes (1)

codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res

✨ Finishing Touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch dz/storage-load-metrics

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

codegenerator/cli/templates/static/codegen/src/LoadLayer.res (1)
169-172: Size accumulation approach is fine; optional: avoid mutable ref

Current ref-based accumulation works and is safe under JS’s single-threaded model. If you prefer an immutable style, gather lengths and reduce after Promise.all.

Example (illustrative; minimal edits shown):
let sizes = ref([])
...
entities->Array.length->sizes->Array.push->ignore
...
let totalSize = sizes.contents->Array.reduce(0, (acc, n) => acc + n)
timerRef->Prometheus.StorageLoad.endOperation(
  ~operation=key,
  ~whereSize=fieldValues->Array.length,
  ~size=totalSize,
)
Also applies to: 226-226, 230-234
codegenerator/cli/npm/envio/src/Prometheus.res (1)

621-631: Optional: clarify metric naming to avoid confusion between “time” and “total time”

Current names are close: envio_storage_load_time vs envio_storage_load_total_time. Consider renaming the first to envio_storage_load_envelope_time or similar to better convey that it measures the wall time while any operations of that label are in flight.

Also applies to: 633-649

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a021fd1 and 0723a81.

📒 Files selected for processing (3)

codegenerator/cli/npm/envio/src/Prometheus.res (1 hunks)
codegenerator/cli/templates/static/codegen/src/LoadLayer.res (4 hunks)
codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res (0 hunks)

💤 Files with no reviewable changes (1)

codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res

🧰 Additional context used

📓 Path-based instructions (4)

**/*.{res,resi}

📄 CodeRabbit Inference Engine (.cursor/rules/rescript.mdc)

**/*.{res,resi}: Never use [| item |] to create an array. Use [ item ] instead.
Must always use = for setting value to a field. Use := only for ref values created using ref function.
ReScript has record types which require a type definition before hand. You can access record fields by dot like foo.myField.
It's also possible to define an inline object, it'll have quoted fields in this case.
Use records when working with structured data, and objects to conveniently pass payload data between functions.
Never use %raw to access object fields if you know the type.

Files:

codegenerator/cli/templates/static/codegen/src/LoadLayer.res
codegenerator/cli/npm/envio/src/Prometheus.res

codegenerator/cli/templates/{dynamic/**/*.hbs,static/**}

📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)

Templates live under codegenerator/cli/templates: dynamic/ for Handlebars (.hbs), static/ for raw Rescript files copied verbatim.

Files:

codegenerator/cli/templates/static/codegen/src/LoadLayer.res

{**/generated/src/**/*.res,codegenerator/cli/templates/static/codegen/src/**/*.res,codegenerator/cli/templates/dynamic/codegen/src/**/*.res}

📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)

Runtime code lives in each project’s generated/src, but template versions (good for editing) are under codegenerator/cli/templates/static/codegen/src or codegenerator/cli/templates/dynamic/codegen/src.

Files:

codegenerator/cli/templates/static/codegen/src/LoadLayer.res

codegenerator/cli/npm/envio/**

📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)

Library-fied runtime shared across indexers lives in codegenerator/cli/npm/envio.

Files:

codegenerator/cli/npm/envio/src/Prometheus.res

🔇 Additional comments (2)

codegenerator/cli/templates/static/codegen/src/LoadLayer.res (1)

16-16: LGTM: loadById instrumentation is correctly placed

startOperation is initiated at the beginning of the async load and endOperation reports where-size and fetched size after data is written to the in-memory cache. Looks correct.

Also applies to: 47-51

codegenerator/cli/npm/envio/src/Prometheus.res (1)

618-692: Solid, self-contained StorageLoad instrumentation

Label schema is minimal, low-friction to adopt.

Envelope timer (pendingCount + operationRef.timerRef) and per-call timer separation gives both wall time and total time views.

Counters for attempts, where-size, and size provide actionable signals.

coderabbitai · 2025-08-19T14:07:30Z

codegenerator/cli/npm/envio/src/Prometheus.res

+  let startOperation = (~operation) => {
+    switch operations->Utils.Dict.dangerouslyGetNonOption(operation) {
+    | Some(operationRef) => operationRef.pendingCount = operationRef.pendingCount + 1
+    | None =>
+      operations->Js.Dict.set(
+        operation,
+        (
+          {
+            pendingCount: 1,
+            timerRef: Hrtime.makeTimer(),
+          }: operationRef
+        ),
+      )
+    }
+    Hrtime.makeTimer()
+  }


🛠️ Refactor suggestion

Add a failure path to prevent pendingCount leaks and skewed envio_storage_load_time

When endOperation isn’t called (e.g., exceptions in the load path), operationRef.pendingCount never returns to 0. This prevents envio_storage_load_time from being emitted again for that label and leaves the operations entry alive indefinitely.

Introduce a failOperation that:

Decrements pendingCount

Emits totalTimeCounter and whereSizeCounter

Does NOT increment counter (success) or sizeCounter

Emits timeCounter and deletes the operations entry when pendingCount reaches 0

Apply this diff within the StorageLoad module:

let endOperation = (timerRef, ~operation, ~whereSize, ~size) => { @@ sizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=size) } + + /* Failure path: ensure pendingCount is decremented and time is accounted for, + without incrementing success counters or size. */ + let failOperation = (timerRef, ~operation, ~whereSize) => { + let operationRef = operations->Js.Dict.unsafeGet(operation) + operationRef.pendingCount = operationRef.pendingCount - 1 + if operationRef.pendingCount === 0 { + timeCounter->SafeCounter.handleInt( + ~labels={operation}, + ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis, + ) + operations->Utils.Dict.deleteInPlace(operation) + } + totalTimeCounter->SafeCounter.handleInt( + ~labels={operation}, + ~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis, + ) + whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize) + }

Then call failOperation in the LoadLayer catch blocks before re-raising the error (see the other file’s comment for an example).

Also applies to: 674-691

coderabbitai · 2025-08-19T14:07:30Z

codegenerator/cli/templates/static/codegen/src/LoadLayer.res

  let inMemTable = inMemoryStore->InMemoryStore.getInMemTable(~entityConfig)

  let load = async idsToLoad => {
+    let timerRef = Prometheus.StorageLoad.startOperation(~operation=key)


🛠️ Refactor suggestion

⚠️ Potential issue

endOperation is skipped on exceptions → pendingCount leak and broken aggregated timing

The current pattern starts an operation timer but only calls endOperation on the success path. If an exception is thrown (e.g., StorageError), the Prometheus.StorageLoad.operations entry’s pendingCount is never decremented. This prevents envio_storage_load_time from ever being emitted again for that operation label and leaves a dangling timerRef in the dict.

Recommend adding a failure path that decrements pendingCount and records attempt duration (without incrementing the “successful count” or “size”), then invoking it in catch blocks.

Apply this diff to add a failure handler in Prometheus.StorageLoad (see Prometheus.res comment for details), then invoke it from both load paths’ catch blocks.

Prometheus.res (adds failOperation):

@@ module StorageLoad = { @@ let endOperation = (timerRef, ~operation, ~whereSize, ~size) => { let operationRef = operations->Js.Dict.unsafeGet(operation) operationRef.pendingCount = operationRef.pendingCount - 1 if operationRef.pendingCount === 0 { timeCounter->SafeCounter.handleInt( ~labels={operation}, ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis, ) operations->Utils.Dict.deleteInPlace(operation) } totalTimeCounter->SafeCounter.handleInt( ~labels={operation}, ~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis, ) counter->SafeCounter.increment(~labels={operation}) whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize) sizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=size) } + + /* Decrement pendingCount and record attempt duration on failures without + incrementing success counters or size. Prevents pendingCount leaks. */ + let failOperation = (timerRef, ~operation, ~whereSize) => { + let operationRef = operations->Js.Dict.unsafeGet(operation) + operationRef.pendingCount = operationRef.pendingCount - 1 + if operationRef.pendingCount === 0 { + timeCounter->SafeCounter.handleInt( + ~labels={operation}, + ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis, + ) + operations->Utils.Dict.deleteInPlace(operation) + } + totalTimeCounter->SafeCounter.handleInt( + ~labels={operation}, + ~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis, + ) + whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize) + } }

Then, call failOperation inside the catch branches in this file before re-raising:

ReScript (illustrative change; outside the changed lines)

// loadById catch (Lines 26-29 block) } catch { | Persistence.StorageError({message, reason}) => { Prometheus.StorageLoad.failOperation( timerRef, ~operation=key, ~whereSize=idsToLoad->Array.length, ) reason->ErrorHandling.mkLogAndRaise(~logger=eventItem->Logging.getEventLogger, ~msg=message) } } // loadByField catch (Lines 202-215 block inside the async index => { ... }) } catch { | Persistence.StorageError({message, reason}) => { Prometheus.StorageLoad.failOperation( timerRef, ~operation=key, ~whereSize=fieldValues->Array.length, ) reason->ErrorHandling.mkLogAndRaise( ~logger=Logging.createChildFrom( ~logger=eventItem->Logging.getEventLogger, ~params={ "operator": operatorCallName, "tableName": entityConfig.table.tableName, "fieldName": fieldName, "fieldValue": fieldValue, }, ), ~msg=message, ) } }

If you prefer not to expand the Prometheus API, the alternative is wrapping the whole load body in try/finally and calling endOperation in finally with size=0 on failures, but that would change the semantic of “successful operations” counter.

Also applies to: 47-51, 169-172, 226-226, 230-234

🤖 Prompt for AI Agents

In codegenerator/cli/templates/static/codegen/src/LoadLayer.res around line 16 (and also touching the catch blocks at ~26-29, ~169-172, ~202-215, ~226-234), the timerRef created by Prometheus.StorageLoad.startOperation is not ended on exceptions which leaks pendingCount and leaves a dangling timer; update the failure paths to call Prometheus.StorageLoad.failOperation(timerRef, ~operation=key, ~whereSize=<number of items attempted>) inside each StorageError catch before re-raising (use idsToLoad->Array.length for loadById and fieldValues->Array.length for loadByField), and ensure the Prometheus.res change adding failOperation is applied so these calls compile; alternatively wrap the whole load body in try/finally and call endOperation with size=0 on failure if you prefer not to add failOperation.

JonoPrest · 2025-08-20T10:50:15Z

codegenerator/cli/npm/envio/src/Prometheus.res

+    if operationRef.pendingCount === 0 {
+      timeCounter->SafeCounter.handleInt(
+        ~labels={operation},
+        ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+      )
+      operations->Utils.Dict.deleteInPlace(operation)
+    }
+    totalTimeCounter->SafeCounter.handleInt(


Did you by any chance swap totalTimeCounter and timeCounter?

Let's say we have 3 Entity.get queries running in parallel for 10ms

timeCounter - will increase by 10ms

totalTimeCounter - will increase by 30ms

Ah ok, understood. So timeCounter is accounting for concurrency 👍🏼

JonoPrest · 2025-08-20T10:51:35Z

codegenerator/cli/npm/envio/src/Prometheus.res

+  let totalTimeCounter = SafeCounter.makeOrThrow(
+    ~name="envio_storage_load_total_time",
+    ~help="Cumulative time spent loading data from storage during the indexing process. (milliseconds)",
+    ~labelSchema=operationLabelsSchema,
+  )


How does this represent cumulative time? It doesn't look like it builds on a previous time. Not sure I fully understand

We always increment the counter with the amount of time every query spent.

Ah sorry, I think I was confused. This is a counter and not a guage 👍🏼 got it

JonoPrest

Looks cool, maybe you can explain the timeCounter vs the totalTimeCounter?

Add metrics for loading data from storage

0723a81

DZakh requested a review from JonoPrest August 19, 2025 14:01

coderabbitai bot reviewed Aug 19, 2025

View reviewed changes

JonoPrest reviewed Aug 20, 2025

View reviewed changes

JonoPrest approved these changes Aug 20, 2025

View reviewed changes

Merge branch 'main' into dz/storage-load-metrics

0d35ad8

DZakh enabled auto-merge (squash) August 20, 2025 12:52

DZakh merged commit a1b6d02 into main Aug 20, 2025
2 checks passed

DZakh deleted the dz/storage-load-metrics branch August 20, 2025 13:00

coderabbitai bot mentioned this pull request Oct 27, 2025

Effect API metrics + latency metrics + Dev Console with disabled TUI #795

Merged

Add metrics for loading data from storage #696

Add metrics for loading data from storage #696

Uh oh!

Conversation

DZakh commented Aug 19, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 19, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 19, 2025

Choose a reason for hiding this comment

Uh oh!

JonoPrest Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

DZakh Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

JonoPrest Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

JonoPrest Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

DZakh Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

JonoPrest Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

JonoPrest left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

DZakh commented Aug 19, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 19, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)