docs: add clarification about memory usage (#1237)

paulohtb6 · travisdowns · Feediver1 · web-flow · commit 57005236e0d3 · 2025-08-01T15:33:19.000-03:00
Co-authored-by: Travis Downs <travis.downs@gmail.com>
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
diff --git a/modules/manage/partials/monitor-health.adoc b/modules/manage/partials/monitor-health.adoc
@@ -37,7 +37,7 @@ rate(redpanda_uptime_seconds_total[5m])
 
 For the total CPU busy (non-idle) time, monitor xref:reference:public-metrics-reference.adoc#redpanda_cpu_busy_seconds_total[`redpanda_cpu_busy_seconds_total`].
 
-To detect unexpected idling, you can query the rate of change as a percentage of the shard that is in use at a given point in time.
+To detect unexpected idling, you can query the rate of change as a fraction of the shard that is in use at a given point in time.
 
 [,promql]
 ----
@@ -53,18 +53,34 @@ This high host-level CPU utilization happens because Redpanda uses Seastar, whic
 Use xref:reference:public-metrics-reference.adoc#redpanda_cpu_busy_seconds_total[`redpanda_cpu_busy_seconds_total`] to monitor the actual Redpanda CPU utilization. When it indicates close to 100% utilization over a given period of time, make sure to also monitor produce and consume <<latency,latency>> as they may then start to increase as a result of resources becoming overburdened.
 ====
 
-==== Memory allocated
+==== Memory availability and pressure
 
-To monitor the percentage of memory allocated, use a formula with xref:reference:public-metrics-reference.adoc#redpanda_memory_allocated_memory[`redpanda_memory_allocated_memory`] and xref:reference:public-metrics-reference.adoc#redpanda_memory_free_memory[`redpanda_memory_free_memory`]:
+To monitor memory, use xref:reference:public-metrics-reference.adoc#redpanda_memory_available_memory[`redpanda_memory_available_memory`], which includes both free memory and reclaimable memory from the batch cache. This provides a more accurate picture than using allocated memory alone, because allocated memory counts batch cache memory as in use even though it can be reclaimed.
+
+To monitor the fraction of memory available:
+
+[,promql]
+----
+min(redpanda_memory_available_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory))
+----
+
+To monitor memory pressure (fraction of memory being used), which may be more intuitive for alerting:
+
+[,promql]
+----
+1 - min(redpanda_memory_available_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory))
+----
+
+You can also monitor the lowest available memory recorded since the process started to understand historical memory pressure:
 
 [,promql]
 ----
-sum(redpanda_memory_allocated_memory) / (sum(redpanda_memory_free_memory) + sum(redpanda_memory_allocated_memory))
+min(redpanda_memory_available_memory_low_water_mark / (redpanda_memory_free_memory + redpanda_memory_allocated_memory))
 ----
 
 ==== Disk used
 
-To monitor the percentage of disk consumed, use a formula with xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_free_bytes[`redpanda_storage_disk_free_bytes`] and xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_total_bytes[`redpanda_storage_disk_total_bytes`]:
+To monitor the fraction of disk consumed, use a formula with xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_free_bytes[`redpanda_storage_disk_free_bytes`] and xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_total_bytes[`redpanda_storage_disk_total_bytes`]:
 
 [,promql]
 ----
diff --git a/modules/reference/pages/public-metrics-reference.adoc b/modules/reference/pages/public-metrics-reference.adoc
@@ -685,6 +685,8 @@ Total memory allocated (in bytes) per CPU shard.
 
 * `shard`
 
+*Usage*: This metric includes reclaimable memory from the batch cache. For monitoring memory pressure, consider using `redpanda_memory_available_memory` instead, which provides a more accurate picture of memory that can be immediately reallocated.
+
 ---
 
 === redpanda_memory_available_memory
@@ -697,7 +699,7 @@ Total memory (in bytes) available to a CPU shard—including both free and recla
 
 * `shard`
 
-*Usage*: Indicates memory pressure on each shard.
+*Usage*: This metric is more useful than `redpanda_memory_allocated_memory` for monitoring memory pressure, as it accounts for reclaimable memory in the batch cache. A low value indicates the system is approaching memory exhaustion.
 
 ---
 
@@ -711,6 +713,8 @@ The lowest recorded available memory (in bytes) per CPU shard since the process
 
 * `shard`
 
+*Usage*: This metric shows how close the system has come to memory exhaustion, which is useful for capacity planning and for understanding historical memory pressure patterns.
+
 ---
 
 === redpanda_memory_free_memory
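
The memory queries in the `monitor-health.adoc` hunk can be sanity-checked by hand. The following is a minimal Python sketch of the arithmetic, not part of the change itself. The `ShardMemory` type, the helper names, and the byte values are all hypothetical, and memory pressure is assumed here to be one minus the available fraction of the worst-off shard.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ShardMemory:
    """Hypothetical per-shard sample; all values are in bytes."""
    free: int       # stands in for redpanda_memory_free_memory
    allocated: int  # stands in for redpanda_memory_allocated_memory
    available: int  # stands in for redpanda_memory_available_memory


def available_fraction(shards):
    # Mirrors min(available / (free + allocated)): the shard with the
    # smallest share of available memory determines the result.
    return min(s.available / (s.free + s.allocated) for s in shards)


def memory_pressure(shards):
    # Assumption for this sketch: pressure is the complement of the
    # available fraction on the worst-off shard.
    return 1.0 - available_fraction(shards)


# Made-up values: each shard has 4 GiB in total, and 1 GiB of the
# allocated memory is reclaimable batch cache.
shards = [
    ShardMemory(free=1 * GIB, allocated=3 * GIB, available=2 * GIB),
    ShardMemory(free=2 * GIB, allocated=2 * GIB, available=3 * GIB),
]

print(available_fraction(shards), memory_pressure(shards))  # prints "0.5 0.5"
```

With these sample values the first shard has allocated 75% of its memory but still has 50% available, which is the gap between allocated-based and available-based monitoring that the change describes.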