
Conversation

@mem mem (Contributor) commented Nov 12, 2025

In order to have more resilience in the face of outages, implement an out-of-process cache using memcached. In order to support the existing mode of operation, add an in-process cache using Otter. Rework the tenant cache to use this new system.


Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>
@mem mem requested a review from a team as a code owner November 12, 2025 01:42
@mem mem requested review from The-9880 and d0ugal November 12, 2025 01:42
@mem (author):
This is the bit that needs the most attention.

//
// This interface allows the agent to use different cache backends interchangeably,
// including a no-op implementation that eliminates the need for nil checks in client code.
type Cache interface {
@mem (author):
There are three implementations of this interface:

- A memcached one. This is the one I would like to use eventually.
- A local one, based on Otter.
- A noop one, which is a fallback in case the other two are not available.

@mem (author):
I did consider a multi-tiered implementation, too, using L1 with Otter and L2 with memcached. I would prefer to get some numbers before going that way.

A reviewer (Contributor):
Yeah, I think I initially expected a multi-tiered implementation. It's not something we need to implement immediately, I'm fine with running this using one or the other first and getting some data out of it.

type Cache interface {
// Set stores a value in the cache with the specified expiration time.
// Returns an error if the operation fails.
Set(ctx context.Context, key string, value any, expiration time.Duration) error
@mem (author):
This is the bit where I struggled the most.

Otter uses generics to have a more strongly-typed cache. Since the goal is to use memcached, which does not have a strongly-typed API, I thought about multiple clients, one per type, but it was weird.

@mem mem (Contributor, author) commented Nov 12, 2025

Apologies about the large PR.

I did consider breaking it up into multiple steps, but it's hard to see where this is going without the changes in the tenant manager code, and it's hard to make the tenant manager changes without the other code.

@The-9880 The-9880 (Contributor) left a comment:

This looks like it'll work to me. I left a suggestion about typed caches which I think could be interesting.

effectiveType := cacheType
if cacheType == "auto" {
if len(memcachedServers) > 0 {
effectiveType = "memcached"
A reviewer (Contributor):
This may make sense as a typed string in the cache package, so the valid values are defined upfront and we don't need to handle the literals.

e.g:

package cache

type Type string

const (
	TypeMemcached Type = "memcached"
	TypeLocal     Type = "local"
	TypeNoop      Type = "noop"
)
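As a usage sketch of that suggestion, a hypothetical `ParseType` helper (not part of the suggestion above) would keep validation of the config literals in one place:

```go
package main

import "fmt"

type Type string

const (
	TypeMemcached Type = "memcached"
	TypeLocal     Type = "local"
	TypeNoop      Type = "noop"
)

// ParseType validates a config string against the known cache types,
// so literals are handled once rather than at every call site.
func ParseType(s string) (Type, error) {
	switch t := Type(s); t {
	case TypeMemcached, TypeLocal, TypeNoop:
		return t, nil
	}
	return "", fmt.Errorf("cache: unknown type %q", s)
}

func main() {
	t, err := ParseType("memcached")
	fmt.Println(t, err) // memcached <nil>
	_, err = ParseType("bogus")
	fmt.Println(err)
}
```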


ctx := t.Context()

t.Run("flush clears all items", func(t *testing.T) {
A reviewer (Contributor):
Is t.Run() here just to visually separate the setup? Or to annotate the test? I don't see a big advantage to it otherwise as there's only one subtest.

func TestLocalCacheCapacity(t *testing.T) {
logger := zerolog.New(io.Discard)

t.Run("respects max capacity", func(t *testing.T) {
A reviewer (Contributor):
Similar q, why bother with t.Run() in this test?


// Check if there's already a cached tenant
var existing cachedTenant
err := tm.cache.Get(ctx, key, &existing)
A reviewer (Contributor):
Want to log this if it's non-nil?


// Store in cache without TTL - we'll check ValidUntil manually
// This allows us to return stale data if the API is unavailable
if err := tm.cache.Set(ctx, key, cached, 0); err != nil {
A reviewer (Contributor):
What do you think about defining typed caches on top of the Cache interface?

The assumption being that items of the same object class (e.g. cachedTenant) would be cached with the same TTL (0 in this case). The counter-case would be if you ever wanted to have different TTLs for objects of the same kind, e.g. using validUntil as the TTL for cachedTenant.

This lets us omit expiration from Set by constructing a Cache<cachedTenant> which uses 0 for all its entries. You could also move the key namespace prefix (tenant:) to the typed cache's Set logic via its constructor.

So NewManager would accept a CacheProvider as an argument, and would initialize its own cache interface like:

tenantCache := cacheProvider.NewCache<cachedTenant>(namespace="tenant:", TTL=0)

tm := &Manager{
		tenantCh:      tenantCh,
		tenantsClient: tenantsClient,
		timeout:       timeout,
		cache:         tenantCache,
		logger:        logger,
		fetchMutexes:  xsync.NewMap[int64, *sync.Mutex](),
	}

If we end up caching other objects, they would initialize a separate Cache<otherObject> with its own TTL and key prefix.


// Store in cache without TTL - we'll check ValidUntil manually
// This allows us to return stale data if the API is unavailable
if err := tm.cache.Set(ctx, key, cached, 0); err != nil {
@The-9880 The-9880 (Contributor) commented Nov 21, 2025:

Another note: it's documented on the implementations, but 0 means something different between caching backends:

- Local: max of 24h TTL, if config.DefaultTTL is not defined/valid.
- Memcached: as long as the memcached server is alive (no explicit expiry).

It makes sense, but it did trip me up for a moment.
