Description
Many of our realms extend CachingUsernamePasswordRealm, which maintains a local (per-node) cache of users and credentials so that we do not need to do a lookup on the underlying datastore for every request.
This is particularly important when the datastore is external, e.g. an LDAP directory. Without the cache, we would need to do an LDAP bind (and potentially also group/metadata lookups) for every REST request.
This cache has a per-entry expiry, where an entry is a "user". That means that when the cache entry for a user expires, every authentication request for that user will perform a lookup against the underlying datastore until the cache is populated again (when the first of those authentications succeeds).
Additionally, various other factors can cause a cache entry to be evicted (e.g. we invalidate on an incorrect password so that we can check for an updated password in the datastore, and there is an API to clear the cache by realm/user).
Consequently, for a single user on a single node, there are periods where we make no requests to the datastore, and then short periods where we make potentially many requests at once.
e.g. if someone has a number of Beats running on various source systems, and those Beats connect directly to Elasticsearch using a single user defined in LDAP, then every 30 minutes or so every one of those Beats could trigger an LDAP bind, causing cyclic spikes in load on the LDAP server.
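To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative names only, not Elasticsearch code) of what happens when many requests for the same user arrive just after that user's cache entry has expired: every request sees a miss before any lookup completes, so each one performs its own simulated LDAP bind.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only (not Elasticsearch code): a naive per-user cache
// where every request that sees a miss triggers its own datastore lookup.
class NaiveCacheDemo {
    static final AtomicInteger ldapBinds = new AtomicInteger();
    static final Map<String, String> cache = new HashMap<>();

    static String lookupInDatastore(String user) {
        ldapBinds.incrementAndGet(); // simulated LDAP bind
        return "credentials-for-" + user;
    }

    // Phase 1: all requests check the cache before any lookup has completed,
    // so each one decides to do its own lookup.
    // Phase 2: the lookups run and (re)populate the cache.
    static int simulateSimultaneousRequests(String user, int requests) {
        ldapBinds.set(0);
        List<Runnable> pendingLookups = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            if (!cache.containsKey(user)) {
                pendingLookups.add(() -> cache.put(user, lookupInDatastore(user)));
            }
        }
        pendingLookups.forEach(Runnable::run);
        return ldapBinds.get();
    }

    public static void main(String[] args) {
        // 50 Beats re-authenticating just after the cache entry expired:
        int binds = simulateSimultaneousRequests("beats_writer", 50);
        System.out.println("LDAP binds for one expiry: " + binds); // prints 50, not 1
    }
}
```

One expiry of one cache entry costs one bind per in-flight request, rather than one bind in total.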
This behaviour is problematic, but the solution is not obvious.
- We can limit the impact of this through connection pooling (and have done so), but this doesn't reduce the total number of requests to the LDAP directory; it just flattens and stretches the spikes, since there are fewer simultaneous requests but no change in the total number of requests.
- We could implement a "cache loader" approach, where only one task may be loading the cache entry for a given key at a time. That is, if we find that the cache has no entry for a specific user, we perform a lookup against the underlying datastore, and any other lookups for that same user wait for the first lookup to complete. This sort of concurrency behaviour is difficult to do well and can easily be a source of subtle bugs, particularly as we need the "wait for cache to populate" step to be asynchronous.
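A minimal sketch of what such a cache loader could look like (hypothetical, not Elasticsearch code): a ConcurrentHashMap of CompletableFutures, so that at most one load per key is ever in flight and the asynchronous "wait for cache to populate" falls out of sharing the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only (not Elasticsearch code): at most one in-flight
// load per key; concurrent callers for the same user share the same future
// instead of each performing a datastore lookup.
class SingleFlightCache {
    private final ConcurrentHashMap<String, CompletableFuture<String>> cache =
            new ConcurrentHashMap<>();
    final AtomicInteger datastoreLookups = new AtomicInteger();

    CompletableFuture<String> get(String user) {
        // computeIfAbsent creates at most one future per key, so only the
        // first caller triggers the (asynchronous) datastore lookup; everyone
        // else waits on the shared future.
        return cache.computeIfAbsent(user, u -> CompletableFuture.supplyAsync(() -> {
            datastoreLookups.incrementAndGet(); // simulated LDAP bind
            return "credentials-for-" + u;
        }));
    }

    // e.g. on an incorrect password, or via the clear-cache API
    void invalidate(String user) {
        cache.remove(user);
    }

    public static void main(String[] args) {
        SingleFlightCache c = new SingleFlightCache();
        CompletableFuture<?>[] requests = new CompletableFuture<?>[50];
        for (int i = 0; i < 50; i++) {
            requests[i] = c.get("beats_writer"); // 50 "simultaneous" requests
        }
        CompletableFuture.allOf(requests).join();
        System.out.println("datastore lookups: " + c.datastoreLookups.get()); // prints 1
    }
}
```

The sketch also hints at the subtle bugs mentioned above: a production version would additionally need to evict failed futures (so that an LDAP outage is not cached forever) and handle invalidation racing with an in-flight load.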
- An alternative, for which I have a prototype, is to simply limit the number of concurrent authentication requests that a single realm can perform simultaneously (with a BlockingQueue to hold pending requests). The limit is global and does not depend on the identity of the authenticating user. It sits in front of the cache, so the first N requests execute and populate the cache as needed, and subsequent requests, once released from the queue, find the cache ready and the relevant user entries populated.
This works quite well when a realm authenticates a small number of distinct users that each perform multiple simultaneous requests to Elasticsearch. When there are many distinct users involved it can also help, but the result is very similar to using a connection pool.
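The idea above can be sketched as follows (illustrative only; the prototype uses a BlockingQueue of pending requests, whereas here a fair Semaphore stands in as the throttle): a global, per-realm cap on concurrent datastore authentications, sitting in front of the cache, with a re-check after waiting so that requests released from the queue usually find the cache already populated.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only (not the prototype's code): a realm-wide cap on
// concurrent datastore authentications, sitting in front of the cache.
class ThrottledRealm {
    private final Semaphore permits;
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger datastoreLookups = new AtomicInteger();

    ThrottledRealm(int maxConcurrentAuthentications) {
        this.permits = new Semaphore(maxConcurrentAuthentications, true);
    }

    String authenticate(String user) throws InterruptedException {
        String cached = cache.get(user);
        if (cached != null) {
            return cached; // fast path: a cache hit needs no permit
        }
        permits.acquire(); // at most N requests proceed towards the datastore
        try {
            // Re-check after waiting: a request released from the queue will
            // usually find the entry that an earlier request populated.
            cached = cache.get(user);
            if (cached != null) {
                return cached;
            }
            datastoreLookups.incrementAndGet(); // simulated LDAP bind
            String credentials = "credentials-for-" + user;
            cache.put(user, credentials);
            return credentials;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) throws Exception {
        ThrottledRealm realm = new ThrottledRealm(1); // limit of 1 for the demo
        ExecutorService pool = Executors.newFixedThreadPool(8);
        CountDownLatch done = new CountDownLatch(50);
        for (int i = 0; i < 50; i++) {
            pool.submit(() -> {
                try {
                    realm.authenticate("beats_writer");
                } catch (InterruptedException ignored) {
                }
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        System.out.println("datastore lookups: " + realm.datastoreLookups.get()); // prints 1
    }
}
```

With a limit of N, a single expired user costs at most N datastore lookups per expiry rather than one per pending request; with many distinct users the limit behaves much like a connection pool, as noted above.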