Skip to content

Conversation

@DZakh
Copy link
Member

@DZakh DZakh commented Aug 19, 2025

A lot of useful metrics for db load performance insights.

image

Summary by CodeRabbit

  • New Features

    • Added storage-load metrics with per-operation timing, total time, counts, and size tracking; wired into ID- and field-based load paths.
  • Refactor

    • Replaced the reorg detection counter with a gauge using improved labeling for clearer monitoring.
  • Chores

    • Removed the deprecated per-partition fetched-block metric and stopped emitting it to streamline telemetry.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 19, 2025

Walkthrough

The changes remove the PartitionBlockFetched metric, switch ReorgCount from a Counter to a SafeGauge with int chainId labels, add a new StorageLoad instrumentation module, integrate StorageLoad timing and size metrics into LoadLayer’s load paths, and delete PartitionBlockFetched emissions from GlobalState.

Changes

Cohort / File(s) Summary
Prometheus metrics overhaul
codegenerator/cli/npm/envio/src/Prometheus.res
Removed PartitionBlockFetched. Replaced ReorgCount's Counter with a SafeGauge using int chainId labels. Added StorageLoad module with operationLabelsSchema, multiple SafeCounter metrics (time, totalTime, count, where-size, size), operationRef state, startOperation and endOperation lifecycle logic.
Load instrumentation
codegenerator/cli/templates/static/codegen/src/LoadLayer.res
Added Prometheus.StorageLoad.startOperation at load start and endOperation at completion for loadById and loadByField. loadByField accumulates sizes across batches; whereSize and size passed to endOperation. No public signatures changed; errors do not call endOperation.
GlobalState metric removal
codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res
Removed emission of Prometheus.PartitionBlockFetched inside validatePartitionQueryResponse when Env.Benchmark.shouldSaveData is true; remaining logic unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant LL as LoadLayer
  participant DB as Database
  participant PM as Prometheus.StorageLoad

  rect rgba(230,245,255,0.6)
    note over LL,PM: New instrumentation
    Caller->>LL: loadById / loadByField(key,...)
    LL->>PM: startOperation(operation=key)
    PM-->>LL: timerRef
  end

  LL->>DB: fetch entities (batched for loadByField)
  DB-->>LL: entities (one or many batches)
  LL->>LL: accumulate size (loadByField)

  rect rgba(230,255,230,0.6)
    LL->>PM: endOperation(timerRef, operation=key, whereSize, size)
    PM-->>LL: metrics recorded (time, totalTime, counts, sizes)
  end

  alt Error occurs
    note over LL,PM: endOperation not invoked (existing behavior)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Indexing performance metrics #545 — touches GlobalState.res's validatePartitionQueryResponse; related to metric emissions and may conflict with or duplicate changes.

Suggested reviewers

  • JasoonS
  • JonoPrest

Poem

A bunny times each hop and bite,
Counts the loads by day and night.
Gauges hum where counters slept,
Timers start and sizes kept.
Hooray — the metrics rabbit hopped with delight. 🥕

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0723a81 and 0d35ad8.

📒 Files selected for processing (1)
  • codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res (0 hunks)
💤 Files with no reviewable changes (1)
  • codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res
✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dz/storage-load-metrics

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@DZakh DZakh requested a review from JonoPrest August 19, 2025 14:01
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
codegenerator/cli/templates/static/codegen/src/LoadLayer.res (1)

169-172: Size accumulation approach is fine; optional: avoid mutable ref

Current ref-based accumulation works and is safe under JS’s single-threaded model. If you prefer an immutable style, gather lengths and reduce after Promise.all.

Example (illustrative; minimal edits shown):

let sizes = ref([])
...
entities->Array.length->sizes->Array.push->ignore
...
let totalSize = sizes.contents->Array.reduce(0, (acc, n) => acc + n)
timerRef->Prometheus.StorageLoad.endOperation(
  ~operation=key,
  ~whereSize=fieldValues->Array.length,
  ~size=totalSize,
)

Also applies to: 226-226, 230-234

codegenerator/cli/npm/envio/src/Prometheus.res (1)

621-631: Optional: clarify metric naming to avoid confusion between “time” and “total time”

Current names are close: envio_storage_load_time vs envio_storage_load_total_time. Consider renaming the first to envio_storage_load_envelope_time or similar to better convey that it measures the wall time while any operations of that label are in flight.

Also applies to: 633-649

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a021fd1 and 0723a81.

📒 Files selected for processing (3)
  • codegenerator/cli/npm/envio/src/Prometheus.res (1 hunks)
  • codegenerator/cli/templates/static/codegen/src/LoadLayer.res (4 hunks)
  • codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res (0 hunks)
💤 Files with no reviewable changes (1)
  • codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{res,resi}

📄 CodeRabbit Inference Engine (.cursor/rules/rescript.mdc)

**/*.{res,resi}: Never use [| item |] to create an array. Use [ item ] instead.
Must always use = for setting value to a field. Use := only for ref values created using ref function.
ReScript has record types which require a type definition before hand. You can access record fields by dot like foo.myField.
It's also possible to define an inline object, it'll have quoted fields in this case.
Use records when working with structured data, and objects to conveniently pass payload data between functions.
Never use %raw to access object fields if you know the type.

Files:

  • codegenerator/cli/templates/static/codegen/src/LoadLayer.res
  • codegenerator/cli/npm/envio/src/Prometheus.res
codegenerator/cli/templates/{dynamic/**/*.hbs,static/**}

📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)

Templates live under codegenerator/cli/templates: dynamic/ for Handlebars (.hbs), static/ for raw Rescript files copied verbatim.

Files:

  • codegenerator/cli/templates/static/codegen/src/LoadLayer.res
{**/generated/src/**/*.res,codegenerator/cli/templates/static/codegen/src/**/*.res,codegenerator/cli/templates/dynamic/codegen/src/**/*.res}

📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)

Runtime code lives in each project’s generated/src, but template versions (good for editing) are under codegenerator/cli/templates/static/codegen/src or codegenerator/cli/templates/dynamic/codegen/src.

Files:

  • codegenerator/cli/templates/static/codegen/src/LoadLayer.res
codegenerator/cli/npm/envio/**

📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)

Library-fied runtime shared across indexers lives in codegenerator/cli/npm/envio.

Files:

  • codegenerator/cli/npm/envio/src/Prometheus.res
🔇 Additional comments (2)
codegenerator/cli/templates/static/codegen/src/LoadLayer.res (1)

16-16: LGTM: loadById instrumentation is correctly placed

startOperation is initiated at the beginning of the async load and endOperation reports where-size and fetched size after data is written to the in-memory cache. Looks correct.

Also applies to: 47-51

codegenerator/cli/npm/envio/src/Prometheus.res (1)

618-692: Solid, self-contained StorageLoad instrumentation

  • Label schema is minimal, low-friction to adopt.
  • Envelope timer (pendingCount + operationRef.timerRef) and per-call timer separation gives both wall time and total time views.
  • Counters for attempts, where-size, and size provide actionable signals.

Comment on lines +657 to +672
let startOperation = (~operation) => {
switch operations->Utils.Dict.dangerouslyGetNonOption(operation) {
| Some(operationRef) => operationRef.pendingCount = operationRef.pendingCount + 1
| None =>
operations->Js.Dict.set(
operation,
(
{
pendingCount: 1,
timerRef: Hrtime.makeTimer(),
}: operationRef
),
)
}
Hrtime.makeTimer()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add a failure path to prevent pendingCount leaks and skewed envio_storage_load_time

When endOperation isn’t called (e.g., exceptions in the load path), operationRef.pendingCount never returns to 0. This prevents envio_storage_load_time from being emitted again for that label and leaves the operations entry alive indefinitely.

Introduce a failOperation that:

  • Decrements pendingCount
  • Emits totalTimeCounter and whereSizeCounter
  • Does NOT increment counter (success) or sizeCounter
  • Emits timeCounter and deletes the operations entry when pendingCount reaches 0

Apply this diff within the StorageLoad module:

   let endOperation = (timerRef, ~operation, ~whereSize, ~size) => {
@@
     sizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=size)
   }
+
+  /* Failure path: ensure pendingCount is decremented and time is accounted for,
+     without incrementing success counters or size. */
+  let failOperation = (timerRef, ~operation, ~whereSize) => {
+    let operationRef = operations->Js.Dict.unsafeGet(operation)
+    operationRef.pendingCount = operationRef.pendingCount - 1
+    if operationRef.pendingCount === 0 {
+      timeCounter->SafeCounter.handleInt(
+        ~labels={operation},
+        ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+      )
+      operations->Utils.Dict.deleteInPlace(operation)
+    }
+    totalTimeCounter->SafeCounter.handleInt(
+      ~labels={operation},
+      ~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+    )
+    whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize)
+  }

Then call failOperation in the LoadLayer catch blocks before re-raising the error (see the other file’s comment for an example).

Also applies to: 674-691

let inMemTable = inMemoryStore->InMemoryStore.getInMemTable(~entityConfig)

let load = async idsToLoad => {
let timerRef = Prometheus.StorageLoad.startOperation(~operation=key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

endOperation is skipped on exceptions → pendingCount leak and broken aggregated timing

The current pattern starts an operation timer but only calls endOperation on the success path. If an exception is thrown (e.g., StorageError), the Prometheus.StorageLoad.operations entry’s pendingCount is never decremented. This prevents envio_storage_load_time from ever being emitted again for that operation label and leaves a dangling timerRef in the dict.

Recommend adding a failure path that decrements pendingCount and records attempt duration (without incrementing the “successful count” or “size”), then invoking it in catch blocks.

Apply this diff to add a failure handler in Prometheus.StorageLoad (see Prometheus.res comment for details), then invoke it from both load paths’ catch blocks.

Prometheus.res (adds failOperation):

@@
 module StorageLoad = {
@@
   let endOperation = (timerRef, ~operation, ~whereSize, ~size) => {
     let operationRef = operations->Js.Dict.unsafeGet(operation)
     operationRef.pendingCount = operationRef.pendingCount - 1
     if operationRef.pendingCount === 0 {
       timeCounter->SafeCounter.handleInt(
         ~labels={operation},
         ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
       )
       operations->Utils.Dict.deleteInPlace(operation)
     }
     totalTimeCounter->SafeCounter.handleInt(
       ~labels={operation},
       ~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
     )
     counter->SafeCounter.increment(~labels={operation})
     whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize)
     sizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=size)
   }
+
+  /* Decrement pendingCount and record attempt duration on failures without
+     incrementing success counters or size. Prevents pendingCount leaks. */
+  let failOperation = (timerRef, ~operation, ~whereSize) => {
+    let operationRef = operations->Js.Dict.unsafeGet(operation)
+    operationRef.pendingCount = operationRef.pendingCount - 1
+    if operationRef.pendingCount === 0 {
+      timeCounter->SafeCounter.handleInt(
+        ~labels={operation},
+        ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+      )
+      operations->Utils.Dict.deleteInPlace(operation)
+    }
+    totalTimeCounter->SafeCounter.handleInt(
+      ~labels={operation},
+      ~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+    )
+    whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize)
+  }
 }

Then, call failOperation inside the catch branches in this file before re-raising:

ReScript (illustrative change; outside the changed lines)

// loadById catch (Lines 26-29 block)
} catch {
| Persistence.StorageError({message, reason}) => {
    Prometheus.StorageLoad.failOperation(
      timerRef,
      ~operation=key,
      ~whereSize=idsToLoad->Array.length,
    )
    reason->ErrorHandling.mkLogAndRaise(~logger=eventItem->Logging.getEventLogger, ~msg=message)
}
}

// loadByField catch (Lines 202-215 block inside the async index => { ... })
} catch {
| Persistence.StorageError({message, reason}) => {
    Prometheus.StorageLoad.failOperation(
      timerRef,
      ~operation=key,
      ~whereSize=fieldValues->Array.length,
    )
    reason->ErrorHandling.mkLogAndRaise(
      ~logger=Logging.createChildFrom(
        ~logger=eventItem->Logging.getEventLogger,
        ~params={
          "operator": operatorCallName,
          "tableName": entityConfig.table.tableName,
          "fieldName": fieldName,
          "fieldValue": fieldValue,
        },
      ),
      ~msg=message,
    )
}
}

If you prefer not to expand the Prometheus API, the alternative is wrapping the whole load body in try/finally and calling endOperation in finally with size=0 on failures, but that would change the semantic of “successful operations” counter.

Also applies to: 47-51, 169-172, 226-226, 230-234

🤖 Prompt for AI Agents
In codegenerator/cli/templates/static/codegen/src/LoadLayer.res around line 16
(and also touching the catch blocks at ~26-29, ~169-172, ~202-215, ~226-234),
the timerRef created by Prometheus.StorageLoad.startOperation is not ended on
exceptions which leaks pendingCount and leaves a dangling timer; update the
failure paths to call Prometheus.StorageLoad.failOperation(timerRef,
~operation=key, ~whereSize=<number of items attempted>) inside each StorageError
catch before re-raising (use idsToLoad->Array.length for loadById and
fieldValues->Array.length for loadByField), and ensure the Prometheus.res change
adding failOperation is applied so these calls compile; alternatively wrap the
whole load body in try/finally and call endOperation with size=0 on failure if
you prefer not to add failOperation.

Comment on lines +677 to +684
if operationRef.pendingCount === 0 {
timeCounter->SafeCounter.handleInt(
~labels={operation},
~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
)
operations->Utils.Dict.deleteInPlace(operation)
}
totalTimeCounter->SafeCounter.handleInt(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you by any chance swap totalTimeCounter and timeCounter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say we have 3 Entity.get queries running in parallel for 10ms

  • timeCounter - will increase by 10ms
  • totalTimeCounter - will increase by 30ms

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, understood. So timeCounter is accounting for concurrency 👍🏼

Comment on lines +627 to +631
let totalTimeCounter = SafeCounter.makeOrThrow(
~name="envio_storage_load_total_time",
~help="Cumulative time spent loading data from storage during the indexing process. (milliseconds)",
~labelSchema=operationLabelsSchema,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this represent cumulative time? It doesn't look like it builds on a previous time. Not sure I fully understand

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always increment the counter with the amount of time every query spent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I think I was confused. This is a counter and not a guage 👍🏼 got it

Copy link
Collaborator

@JonoPrest JonoPrest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks cool, maybe you can explain the timeCounter vs the totalTimeCounter?

@DZakh DZakh enabled auto-merge (squash) August 20, 2025 12:52
@DZakh DZakh merged commit a1b6d02 into main Aug 20, 2025
2 checks passed
@DZakh DZakh deleted the dz/storage-load-metrics branch August 20, 2025 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants