Prometheus Gauge Collision for cache_size #9600
Description
Expected Behavior
The cache_size{cache_type="mutablestate"} Prometheus metric should reflect the configured
capacity of the mutable state cache (the workflow execution LRU cache in the history service).
When cacheSizeBasedLimit: true and hostLevelCacheMaxSizeBytes: 629145600 are set in dynamic
config, the metric should report 629145600 (the byte-mode capacity).
Actual Behavior
cache_size{cache_type="mutablestate"} always reports 128000 - the default value of
ReplicationProgressCacheMaxSize - regardless of the mutable state cache's actual configuration.
This happens because the replication progress cache (service/history/replication/progress_cache.go:61)
reuses MutableStateCacheTypeTagValue as its metrics tag:
```go
// service/history/replication/progress_cache.go, lines 60-63 (commit a4e6f11)
return &progressCacheImpl{
	cache: cache.NewWithMetrics(maxSize, opts, handler.WithTags(
		metrics.CacheTypeTag(metrics.MutableStateCacheTypeTagValue), // <-- should be its own tag
	)),
}
```
Both caches call cache.NewWithMetrics(), which records a cache_size gauge at construction time. Since Prometheus gauges use last-write-wins semantics, whichever cache is constructed last determines the reported value. In practice, the replication progress cache is constructed after the mutable state cache (via fx dependency ordering), so it overwrites the gauge with its own maxSize of 128000.
This makes it impossible to monitor the actual mutable state cache capacity via Prometheus.
Impact
This bug is particularly misleading when investigating cacheSizeBasedLimit. Users who enable
byte-based cache limiting (cacheSizeBasedLimit: true) and check the cache_size{mutablestate}
metric will see 128000 instead of their configured byte limit - leading them to incorrectly
conclude that byte mode did not activate.
This has already caused confusion for at least two independent users:
- Our team spent significant time investigating a phantom "bug" in cacheSizeBasedLimit, including full source code tracing, unit tests, and Docker-level debugging, before discovering the gauge collision. We were about to switch to count-based mode as a workaround for a problem that didn't exist.
- @andropler, in community thread #18787 and issue #8902, reported the same cache_size = 128000 observation and switched to count-based mode. That observation is consistent with this gauge collision; byte mode may have been working for them too.
Steps to Reproduce the Problem
- Deploy Temporal v1.29.1 (or latest main) with the history service and this dynamic config:

```yaml
history.cacheSizeBasedLimit:
  - value: true
history.hostLevelCacheMaxSizeBytes:
  - value: 629145600 # 600 MiB
```

- Wait for the history service to start.
- Scrape the Prometheus metrics endpoint (default :9090/metrics).
- Observe: cache_size{cache_type="mutablestate"} 128000
- Expected: cache_size{cache_type="mutablestate"} 629145600
Verification via debug logging
We built a patched binary from v1.29.1 source with fmt.Printf in both NewHostLevelCache
(cache.go) and NewProgressCache (progress_cache.go). Output confirms the initialization
order and gauge overwrite:
```
DEBUG: HistoryCacheSizeBasedLimit = true
DEBUG NewHostLevelCache: HistoryCacheLimitSizeBased=true maxSize(count)=128000
DEBUG NewHostLevelCache: maxSize(bytes)=629145600
DEBUG NewProgressCache: maxSize=128000, using tag=MutableStateCacheTypeTagValue
```
The mutable state cache correctly enters byte mode with maxSize=629145600. Then the replication
progress cache overwrites the gauge with 128000.
Suggested Fix
Give the replication progress cache its own metric tag value. For example:
```go
// common/metrics/metric_defs.go - add a new constant
ReplicationProgressCacheTypeTagValue = "replication_progress"
```

```go
// service/history/replication/progress_cache.go:61 - use the new tag
cache: cache.NewWithMetrics(maxSize, opts, handler.WithTags(
	metrics.CacheTypeTag(metrics.ReplicationProgressCacheTypeTagValue),
)),
```

This is a one-line behavioral change (plus the new constant definition). It would allow both caches to report their cache_size independently via distinct cache_type label values.
Specifications
- Version: v1.29.1 (also confirmed on latest main; the code is unchanged)
- Platform: Linux/arm64 (Docker), also observed on Kubernetes (EKS)
- File: service/history/replication/progress_cache.go:61
- Introduced in: 4d6dc3614
Related Issues
- #8902 - "History service memory usage upward trend"
- Community thread #18787 - "Memory OOM issues with history pod and size-based cache configuration"