Prometheus Gauge Collision for cache_size #9600
Description
Expected Behavior
The cache_size{cache_type="mutablestate"} Prometheus metric should reflect the configured
capacity of the mutable state cache (the workflow execution LRU cache in the history service).
When cacheSizeBasedLimit: true and hostLevelCacheMaxSizeBytes: 629145600 are set in dynamic
config, the metric should report 629145600 (the byte-mode capacity).
Actual Behavior
cache_size{cache_type="mutablestate"} always reports 128000 - the default value of
ReplicationProgressCacheMaxSize - regardless of the mutable state cache's actual configuration.
This happens because the replication progress cache (service/history/replication/progress_cache.go:61)
reuses MutableStateCacheTypeTagValue as its metrics tag:
```go
// service/history/replication/progress_cache.go, lines 60-63 (commit a4e6f11)
return &progressCacheImpl{
	cache: cache.NewWithMetrics(maxSize, opts, handler.WithTags(
		metrics.CacheTypeTag(metrics.MutableStateCacheTypeTagValue), // <-- should be its own tag
	)),
}
```
Both caches call cache.NewWithMetrics(), which records a cache_size gauge at construction time. Since Prometheus gauges use last-write-wins semantics, whichever cache is constructed last determines the reported value. In practice, the replication progress cache is constructed after the mutable state cache (via fx dependency ordering), so it overwrites the gauge with its own maxSize of 128000.
This makes it impossible to monitor the actual mutable state cache capacity via Prometheus.
Impact
This bug is particularly misleading when investigating cacheSizeBasedLimit. Users who enable
byte-based cache limiting (cacheSizeBasedLimit: true) and check the cache_size{mutablestate}
metric will see 128000 instead of their configured byte limit - leading them to incorrectly
conclude that byte mode did not activate.
This has already caused confusion for at least two independent users:
- Our team spent significant time investigating a phantom "bug" in cacheSizeBasedLimit, including full source code tracing, unit tests, and Docker-level debugging, before discovering the gauge collision. We were about to switch to count-based mode as a workaround for a problem that didn't exist.
- @andropler, in community thread #18787 and issue #8902, reported the same cache_size = 128000 observation and switched to count-based mode. That observation is consistent with this gauge collision; byte mode may have been working for them too.
Steps to Reproduce the Problem
- Deploy Temporal v1.29.1 (or latest main) with the history service and this dynamic config:

```yaml
history.cacheSizeBasedLimit:
  - value: true
history.hostLevelCacheMaxSizeBytes:
  - value: 629145600 # 600 MiB
```

- Wait for the history service to start.
- Scrape the Prometheus metrics endpoint (default :9090/metrics).
- Observe: cache_size{cache_type="mutablestate"} 128000
- Expected: cache_size{cache_type="mutablestate"} 629145600
Verification via debug logging
We built a patched binary from v1.29.1 source with fmt.Printf in both NewHostLevelCache
(cache.go) and NewProgressCache (progress_cache.go). Output confirms the initialization
order and gauge overwrite:
```
DEBUG: HistoryCacheSizeBasedLimit = true
DEBUG NewHostLevelCache: HistoryCacheLimitSizeBased=true maxSize(count)=128000
DEBUG NewHostLevelCache: maxSize(bytes)=629145600
DEBUG NewProgressCache: maxSize=128000, using tag=MutableStateCacheTypeTagValue
```
The mutable state cache correctly enters byte mode with maxSize=629145600. Then the replication
progress cache overwrites the gauge with 128000.
Suggested Fix
Give the replication progress cache its own metric tag value. For example:
```go
// common/metrics/metric_defs.go - add a new constant
ReplicationProgressCacheTypeTagValue = "replication_progress"
```

```go
// service/history/replication/progress_cache.go:61 - use the new tag
cache: cache.NewWithMetrics(maxSize, opts, handler.WithTags(
	metrics.CacheTypeTag(metrics.ReplicationProgressCacheTypeTagValue),
)),
```

This is a one-line behavioral change (plus the new constant definition). It would allow both caches to report their cache_size independently via distinct cache_type label values.
Specifications
- Version: v1.29.1 (also confirmed on latest main; the code is unchanged)
- Platform: Linux/arm64 (Docker), also observed on Kubernetes (EKS)
- File: service/history/replication/progress_cache.go:61
- Introduced in: 4d6dc3614
Related Issues
- #8902 - "History service memory usage upward trend"
- Community thread #18787 - "Memory OOM issues with history pod and size-based cache configuration"