
Conversation

@mem mem (Contributor) commented Nov 12, 2025

In order to have more resilience in the face of outages, implement an out-of-process cache using memcached. In order to support the existing mode of operation, add an in-process cache using Otter. Rework the tenant cache to use this new system.


Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>
@mem mem requested a review from a team as a code owner November 12, 2025 01:42
@mem mem requested review from The-9880 and d0ugal November 12, 2025 01:42
@mem (author):
This is the bit that needs the most attention.

//
// This interface allows the agent to use different cache backends interchangeably,
// including a no-op implementation that eliminates the need for nil checks in client code.
type Cache interface {
@mem (author):
There are three implementations of this interface:

- A memcached one. This is the one I would like to use eventually.
- A local one, based on Otter.
- A noop one, which is a fallback in case the other two are not available.

@mem (author):
I did consider a multi-tiered implementation, too, using L1 with Otter and L2 with memcached. I would prefer to get some numbers before going that way.

A reviewer (Contributor):
Yeah, I think I initially expected a multi-tiered implementation. It's not something we need to implement immediately, I'm fine with running this using one or the other first and getting some data out of it.

type Cache interface {
// Set stores a value in the cache with the specified expiration time.
// Returns an error if the operation fails.
Set(ctx context.Context, key string, value any, expiration time.Duration) error
@mem (author):
This is the bit where I struggled the most.

Otter uses generics to have a more strongly-typed cache. Since the goal is to use memcached, which does not have a strongly-typed API, I thought about multiple clients, one per type, but it was weird.

@mem mem (Contributor, author) commented Nov 12, 2025

Apologies about the large PR.

I did consider breaking it up into multiple steps, but it's hard to see where this is going without the changes in the tenant manager code, and it's hard to make the tenant manager changes without the other code.

@The-9880 The-9880 (Contributor) left a comment:

This looks like it'll work to me. I left a suggestion about typed caches which I think could be interesting.

effectiveType := cacheType
if cacheType == "auto" {
if len(memcachedServers) > 0 {
effectiveType = "memcached"
A reviewer (Contributor):
This may make sense as a typed string in the cache package, so the valid values are defined upfront and we don't need to handle the literals.

e.g:

package cache

type Type string

const (
	TypeMemcached Type = "memcached"
	TypeLocal     Type = "local"
	TypeNoop      Type = "noop"
)
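As a usage sketch of that suggestion, a hypothetical `ParseType` helper (not part of the suggestion above) would keep validation of the config literals in one place:

```go
package main

import "fmt"

type Type string

const (
	TypeMemcached Type = "memcached"
	TypeLocal     Type = "local"
	TypeNoop      Type = "noop"
)

// ParseType validates a config string against the known cache types,
// so literals are handled once rather than at every call site.
func ParseType(s string) (Type, error) {
	switch t := Type(s); t {
	case TypeMemcached, TypeLocal, TypeNoop:
		return t, nil
	}
	return "", fmt.Errorf("cache: unknown type %q", s)
}

func main() {
	t, err := ParseType("memcached")
	fmt.Println(t, err) // memcached <nil>
	_, err = ParseType("bogus")
	fmt.Println(err)
}
```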


ctx := t.Context()

t.Run("flush clears all items", func(t *testing.T) {
A reviewer (Contributor):
Is t.Run() here just to visually separate the setup? Or to annotate the test? I don't see a big advantage to it otherwise as there's only one subtest.

func TestLocalCacheCapacity(t *testing.T) {
logger := zerolog.New(io.Discard)

t.Run("respects max capacity", func(t *testing.T) {
A reviewer (Contributor):
Similar q, why bother with t.Run() in this test?


// Check if there's already a cached tenant
var existing cachedTenant
err := tm.cache.Get(ctx, key, &existing)
A reviewer (Contributor):
Want to log this if it's non-nil?


// Store in cache without TTL - we'll check ValidUntil manually
// This allows us to return stale data if the API is unavailable
if err := tm.cache.Set(ctx, key, cached, 0); err != nil {
A reviewer (Contributor):
What do you think about defining typed caches on top of the Cache interface?

The assumption being that items of the same object class (e.g. cachedTenant) would be cached with the same TTL (0 in this case). The counter-case would be if you ever wanted to have different TTLs for objects of the same kind, e.g. using validUntil as the TTL for cachedTenant.

This lets us omit expiration from Set by constructing a Cache<cachedTenant> which uses 0 for all its entries. You could also move the key namespace prefix (tenant:) to the typed cache's Set logic via its constructor.

So NewManager would accept a CacheProvider as an argument, and would initialize its own cache interface like:

tenantCache := cacheProvider.NewCache<cachedTenant>(namespace="tenant:", TTL=0)

tm := &Manager{
		tenantCh:      tenantCh,
		tenantsClient: tenantsClient,
		timeout:       timeout,
		cache:         tenantCache,
		logger:        logger,
		fetchMutexes:  xsync.NewMap[int64, *sync.Mutex](),
	}

If we end up caching other objects, they would initialize a separate Cache<otherObject> with its own TTL and key prefix.


// Store in cache without TTL - we'll check ValidUntil manually
// This allows us to return stale data if the API is unavailable
if err := tm.cache.Set(ctx, key, cached, 0); err != nil {
@The-9880 The-9880 (Contributor) commented Nov 21, 2025:

Another note: it's documented on the implementations, but 0 means something different between caching backends:

- Local: max of 24h TTL, if config.DefaultTTL is not defined/valid.
- Memcached: as long as the memcached server is alive (no explicit expiry).

It makes sense, but it did trip me up for a moment.
