Merged
26 changes: 21 additions & 5 deletions modules/manage/partials/monitor-health.adoc
@@ -37,7 +37,7 @@ rate(redpanda_uptime_seconds_total[5m])

For the total CPU busy (non-idle) time, monitor xref:reference:public-metrics-reference.adoc#redpanda_cpu_busy_seconds_total[`redpanda_cpu_busy_seconds_total`].

To detect unexpected idling, you can query the rate of change as a percentage of the shard that is in use at a given point in time.
To detect unexpected idling, you can query the rate of change as a fraction of the shard that is in use at a given point in time.

[,promql]
----
@@ -53,18 +53,34 @@ This high host-level CPU utilization happens because Redpanda uses Seastar, whic
Use xref:reference:public-metrics-reference.adoc#redpanda_cpu_busy_seconds_total[`redpanda_cpu_busy_seconds_total`] to monitor the actual Redpanda CPU utilization. When it indicates close to 100% utilization over a given period of time, make sure to also monitor produce and consume <<latency,latency>> as they may then start to increase as a result of resources becoming overburdened.
====

==== Memory allocated
==== Memory availability and pressure

To monitor the percentage of memory allocated, use a formula with xref:reference:public-metrics-reference.adoc#redpanda_memory_allocated_memory[`redpanda_memory_allocated_memory`] and xref:reference:public-metrics-reference.adoc#redpanda_memory_free_memory[`redpanda_memory_free_memory`]:
To monitor memory, use xref:reference:public-metrics-reference.adoc#redpanda_memory_available_memory[`redpanda_memory_available_memory`], which includes both free memory and reclaimable memory from the batch cache. This provides a more accurate picture than allocated memory alone, because allocated memory counts batch cache memory that can be reclaimed and therefore overstates memory pressure.

To monitor the fraction of memory available:

[,promql]
----
min(redpanda_memory_available_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory))
----
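
The query above collapses every shard on every broker into a single value. To see which broker holds the lowest value, one option is to group the result by broker. This is a sketch that assumes the default Prometheus `instance` target label identifies each broker; substitute whichever label your scrape configuration uses.

[,promql]
----
min by (instance) (redpanda_memory_available_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory))
----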

To monitor memory pressure (fraction of memory being used), which may be more intuitive for alerting:

[,promql]
----
1 - min(redpanda_memory_available_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory))
----
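
For alerting, a threshold comparison on the available fraction can serve as a rule expression. The following is a sketch only: the `0.1` threshold is an illustrative assumption, not a Redpanda recommendation, and the `instance` grouping label assumes the default Prometheus target labels. The expression returns a series only for brokers where at least one shard has less than 10% of its memory available.

[,promql]
----
min by (instance) (redpanda_memory_available_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory)) < 0.1
----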

You can also monitor the lowest available memory since the process started to understand historical memory pressure:

[,promql]
----
sum(redpanda_memory_allocated_memory) / (sum(redpanda_memory_free_memory) + sum(redpanda_memory_allocated_memory))
min(redpanda_memory_available_memory_low_water_mark / (redpanda_memory_free_memory + redpanda_memory_allocated_memory))
----

==== Disk used

To monitor the percentage of disk consumed, use a formula with xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_free_bytes[`redpanda_storage_disk_free_bytes`] and xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_total_bytes[`redpanda_storage_disk_total_bytes`]:
To monitor the fraction of disk consumed, use a formula with xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_free_bytes[`redpanda_storage_disk_free_bytes`] and xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_total_bytes[`redpanda_storage_disk_total_bytes`]:

[,promql]
----
6 changes: 5 additions & 1 deletion modules/reference/pages/public-metrics-reference.adoc
@@ -697,6 +697,8 @@ Total memory allocated (in bytes) per CPU shard.

* `shard`

*Usage*: This metric includes reclaimable memory from the batch cache. For monitoring memory pressure, consider using `redpanda_memory_available_memory` instead, which provides a more accurate picture of memory that can be immediately reallocated.

---

=== redpanda_memory_available_memory
@@ -709,7 +711,7 @@ Total memory (in bytes) available to a CPU shard—including both free and recla

* `shard`

*Usage*: Indicates memory pressure on each shard.
*Usage*: This metric is more useful than `redpanda_memory_allocated_memory` for monitoring memory pressure, as it accounts for reclaimable memory in the batch cache. A low value indicates the system is approaching memory exhaustion.

---

@@ -723,6 +725,8 @@ The lowest recorded available memory (in bytes) per CPU shard since the process

* `shard`

*Usage*: This metric helps identify the closest the system has come to memory exhaustion. Useful for capacity planning and understanding historical memory pressure patterns.

---

=== redpanda_memory_free_memory