-
Notifications
You must be signed in to change notification settings - Fork 28
Add metrics for loading data from storage #696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes remove the PartitionBlockFetched metric, switch ReorgCount from a Counter to a SafeGauge with int chainId labels, add a new StorageLoad instrumentation module, integrate StorageLoad timing and size metrics into LoadLayer’s load paths, and delete PartitionBlockFetched emissions from GlobalState. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Caller
participant LL as LoadLayer
participant DB as Database
participant PM as Prometheus.StorageLoad
rect rgba(230,245,255,0.6)
note over LL,PM: New instrumentation
Caller->>LL: loadById / loadByField(key,...)
LL->>PM: startOperation(operation=key)
PM-->>LL: timerRef
end
LL->>DB: fetch entities (batched for loadByField)
DB-->>LL: entities (one or many batches)
LL->>LL: accumulate size (loadByField)
rect rgba(230,255,230,0.6)
LL->>PM: endOperation(timerRef, operation=key, whereSize, size)
PM-->>LL: metrics recorded (time, totalTime, counts, sizes)
end
alt Error occurs
note over LL,PM: endOperation not invoked (existing behavior)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. 📜 Recent review detailsConfiguration used: CodeRabbit UI 💡 Knowledge Base configuration:
You can enable these sources in your CodeRabbit configuration. 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
codegenerator/cli/templates/static/codegen/src/LoadLayer.res (1)
169-172: Size accumulation approach is fine; optional: avoid mutable refCurrent ref-based accumulation works and is safe under JS’s single-threaded model. If you prefer an immutable style, gather lengths and reduce after Promise.all.
Example (illustrative; minimal edits shown):
let sizes = ref([]) ... entities->Array.length->sizes->Array.push->ignore ... let totalSize = sizes.contents->Array.reduce(0, (acc, n) => acc + n) timerRef->Prometheus.StorageLoad.endOperation( ~operation=key, ~whereSize=fieldValues->Array.length, ~size=totalSize, )Also applies to: 226-226, 230-234
codegenerator/cli/npm/envio/src/Prometheus.res (1)
621-631: Optional: clarify metric naming to avoid confusion between “time” and “total time”Current names are close: envio_storage_load_time vs envio_storage_load_total_time. Consider renaming the first to envio_storage_load_envelope_time or similar to better convey that it measures the wall time while any operations of that label are in flight.
Also applies to: 633-649
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
codegenerator/cli/npm/envio/src/Prometheus.res(1 hunks)codegenerator/cli/templates/static/codegen/src/LoadLayer.res(4 hunks)codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res(0 hunks)
💤 Files with no reviewable changes (1)
- codegenerator/cli/templates/static/codegen/src/globalState/GlobalState.res
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{res,resi}
📄 CodeRabbit Inference Engine (.cursor/rules/rescript.mdc)
**/*.{res,resi}: Never use[| item |]to create an array. Use[ item ]instead.
Must always use=for setting value to a field. Use:=only for ref values created usingreffunction.
ReScript has record types which require a type definition before hand. You can access record fields by dot likefoo.myField.
It's also possible to define an inline object, it'll have quoted fields in this case.
Use records when working with structured data, and objects to conveniently pass payload data between functions.
Never use %raw to access object fields if you know the type.
Files:
codegenerator/cli/templates/static/codegen/src/LoadLayer.rescodegenerator/cli/npm/envio/src/Prometheus.res
codegenerator/cli/templates/{dynamic/**/*.hbs,static/**}
📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)
Templates live under codegenerator/cli/templates: dynamic/ for Handlebars (.hbs), static/ for raw Rescript files copied verbatim.
Files:
codegenerator/cli/templates/static/codegen/src/LoadLayer.res
{**/generated/src/**/*.res,codegenerator/cli/templates/static/codegen/src/**/*.res,codegenerator/cli/templates/dynamic/codegen/src/**/*.res}
📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)
Runtime code lives in each project’s generated/src, but template versions (good for editing) are under codegenerator/cli/templates/static/codegen/src or codegenerator/cli/templates/dynamic/codegen/src.
Files:
codegenerator/cli/templates/static/codegen/src/LoadLayer.res
codegenerator/cli/npm/envio/**
📄 CodeRabbit Inference Engine (.cursor/rules/navigation.mdc)
Library-fied runtime shared across indexers lives in codegenerator/cli/npm/envio.
Files:
codegenerator/cli/npm/envio/src/Prometheus.res
🔇 Additional comments (2)
codegenerator/cli/templates/static/codegen/src/LoadLayer.res (1)
16-16: LGTM: loadById instrumentation is correctly placedstartOperation is initiated at the beginning of the async load and endOperation reports where-size and fetched size after data is written to the in-memory cache. Looks correct.
Also applies to: 47-51
codegenerator/cli/npm/envio/src/Prometheus.res (1)
618-692: Solid, self-contained StorageLoad instrumentation
- Label schema is minimal, low-friction to adopt.
- Envelope timer (pendingCount + operationRef.timerRef) and per-call timer separation gives both wall time and total time views.
- Counters for attempts, where-size, and size provide actionable signals.
| let startOperation = (~operation) => { | ||
| switch operations->Utils.Dict.dangerouslyGetNonOption(operation) { | ||
| | Some(operationRef) => operationRef.pendingCount = operationRef.pendingCount + 1 | ||
| | None => | ||
| operations->Js.Dict.set( | ||
| operation, | ||
| ( | ||
| { | ||
| pendingCount: 1, | ||
| timerRef: Hrtime.makeTimer(), | ||
| }: operationRef | ||
| ), | ||
| ) | ||
| } | ||
| Hrtime.makeTimer() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add a failure path to prevent pendingCount leaks and skewed envio_storage_load_time
When endOperation isn’t called (e.g., exceptions in the load path), operationRef.pendingCount never returns to 0. This prevents envio_storage_load_time from being emitted again for that label and leaves the operations entry alive indefinitely.
Introduce a failOperation that:
- Decrements pendingCount
- Emits totalTimeCounter and whereSizeCounter
- Does NOT increment counter (success) or sizeCounter
- Emits timeCounter and deletes the operations entry when pendingCount reaches 0
Apply this diff within the StorageLoad module:
let endOperation = (timerRef, ~operation, ~whereSize, ~size) => {
@@
sizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=size)
}
+
+ /* Failure path: ensure pendingCount is decremented and time is accounted for,
+ without incrementing success counters or size. */
+ let failOperation = (timerRef, ~operation, ~whereSize) => {
+ let operationRef = operations->Js.Dict.unsafeGet(operation)
+ operationRef.pendingCount = operationRef.pendingCount - 1
+ if operationRef.pendingCount === 0 {
+ timeCounter->SafeCounter.handleInt(
+ ~labels={operation},
+ ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+ )
+ operations->Utils.Dict.deleteInPlace(operation)
+ }
+ totalTimeCounter->SafeCounter.handleInt(
+ ~labels={operation},
+ ~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+ )
+ whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize)
+ }Then call failOperation in the LoadLayer catch blocks before re-raising the error (see the other file’s comment for an example).
Also applies to: 674-691
| let inMemTable = inMemoryStore->InMemoryStore.getInMemTable(~entityConfig) | ||
|
|
||
| let load = async idsToLoad => { | ||
| let timerRef = Prometheus.StorageLoad.startOperation(~operation=key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
endOperation is skipped on exceptions → pendingCount leak and broken aggregated timing
The current pattern starts an operation timer but only calls endOperation on the success path. If an exception is thrown (e.g., StorageError), the Prometheus.StorageLoad.operations entry’s pendingCount is never decremented. This prevents envio_storage_load_time from ever being emitted again for that operation label and leaves a dangling timerRef in the dict.
Recommend adding a failure path that decrements pendingCount and records attempt duration (without incrementing the “successful count” or “size”), then invoking it in catch blocks.
Apply this diff to add a failure handler in Prometheus.StorageLoad (see Prometheus.res comment for details), then invoke it from both load paths’ catch blocks.
Prometheus.res (adds failOperation):
@@
module StorageLoad = {
@@
let endOperation = (timerRef, ~operation, ~whereSize, ~size) => {
let operationRef = operations->Js.Dict.unsafeGet(operation)
operationRef.pendingCount = operationRef.pendingCount - 1
if operationRef.pendingCount === 0 {
timeCounter->SafeCounter.handleInt(
~labels={operation},
~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
)
operations->Utils.Dict.deleteInPlace(operation)
}
totalTimeCounter->SafeCounter.handleInt(
~labels={operation},
~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
)
counter->SafeCounter.increment(~labels={operation})
whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize)
sizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=size)
}
+
+ /* Decrement pendingCount and record attempt duration on failures without
+ incrementing success counters or size. Prevents pendingCount leaks. */
+ let failOperation = (timerRef, ~operation, ~whereSize) => {
+ let operationRef = operations->Js.Dict.unsafeGet(operation)
+ operationRef.pendingCount = operationRef.pendingCount - 1
+ if operationRef.pendingCount === 0 {
+ timeCounter->SafeCounter.handleInt(
+ ~labels={operation},
+ ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+ )
+ operations->Utils.Dict.deleteInPlace(operation)
+ }
+ totalTimeCounter->SafeCounter.handleInt(
+ ~labels={operation},
+ ~value=timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis,
+ )
+ whereSizeCounter->SafeCounter.handleInt(~labels={operation}, ~value=whereSize)
+ }
}Then, call failOperation inside the catch branches in this file before re-raising:
ReScript (illustrative change; outside the changed lines)
// loadById catch (Lines 26-29 block)
} catch {
| Persistence.StorageError({message, reason}) => {
Prometheus.StorageLoad.failOperation(
timerRef,
~operation=key,
~whereSize=idsToLoad->Array.length,
)
reason->ErrorHandling.mkLogAndRaise(~logger=eventItem->Logging.getEventLogger, ~msg=message)
}
}
// loadByField catch (Lines 202-215 block inside the async index => { ... })
} catch {
| Persistence.StorageError({message, reason}) => {
Prometheus.StorageLoad.failOperation(
timerRef,
~operation=key,
~whereSize=fieldValues->Array.length,
)
reason->ErrorHandling.mkLogAndRaise(
~logger=Logging.createChildFrom(
~logger=eventItem->Logging.getEventLogger,
~params={
"operator": operatorCallName,
"tableName": entityConfig.table.tableName,
"fieldName": fieldName,
"fieldValue": fieldValue,
},
),
~msg=message,
)
}
}If you prefer not to expand the Prometheus API, the alternative is wrapping the whole load body in try/finally and calling endOperation in finally with size=0 on failures, but that would change the semantic of “successful operations” counter.
Also applies to: 47-51, 169-172, 226-226, 230-234
🤖 Prompt for AI Agents
In codegenerator/cli/templates/static/codegen/src/LoadLayer.res around line 16
(and also touching the catch blocks at ~26-29, ~169-172, ~202-215, ~226-234),
the timerRef created by Prometheus.StorageLoad.startOperation is not ended on
exceptions which leaks pendingCount and leaves a dangling timer; update the
failure paths to call Prometheus.StorageLoad.failOperation(timerRef,
~operation=key, ~whereSize=<number of items attempted>) inside each StorageError
catch before re-raising (use idsToLoad->Array.length for loadById and
fieldValues->Array.length for loadByField), and ensure the Prometheus.res change
adding failOperation is applied so these calls compile; alternatively wrap the
whole load body in try/finally and call endOperation with size=0 on failure if
you prefer not to add failOperation.
| if operationRef.pendingCount === 0 { | ||
| timeCounter->SafeCounter.handleInt( | ||
| ~labels={operation}, | ||
| ~value=operationRef.timerRef->Hrtime.timeSince->Hrtime.toMillis->Hrtime.intFromMillis, | ||
| ) | ||
| operations->Utils.Dict.deleteInPlace(operation) | ||
| } | ||
| totalTimeCounter->SafeCounter.handleInt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you by any chance swap totalTimeCounter and timeCounter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say we have 3 Entity.get queries running in parallel for 10ms
- timeCounter - will increase by 10ms
- totalTimeCounter - will increase by 30ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, understood. So timeCounter is accounting for concurrency 👍🏼
| let totalTimeCounter = SafeCounter.makeOrThrow( | ||
| ~name="envio_storage_load_total_time", | ||
| ~help="Cumulative time spent loading data from storage during the indexing process. (milliseconds)", | ||
| ~labelSchema=operationLabelsSchema, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this represent cumulative time? It doesn't look like it builds on a previous time. Not sure I fully understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always increment the counter with the amount of time every query spent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, I think I was confused. This is a counter and not a guage 👍🏼 got it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks cool, maybe you can explain the timeCounter vs the totalTimeCounter?
A lot of useful metrics for db load performance insights.
Summary by CodeRabbit
New Features
Refactor
Chores