Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identity provider key caching behavior causing lookup conflict #2915

Conversation

coolgang123
Copy link
Contributor

Problem:
UAA caches the key from Identity provider. The default cache eviction time is 10 minutes. When the Identity provider changes the key, the UAA may end up with stale key in the cache. All the token validations will fail during stale key period.

Resolution:
In case of signature validation failure, invalidate the cache and retrieve the latest/current key from Identity Provider and use it for validation. This behaviour is applicable for all signature validation failure.

Another potential solution considered:
Provide an endpoint or update existing endpoint that enables Identity provider to trigger cache invalidation upon key change.

Testing done:
Unit test and IT added
Manual testing done by deploying a UAA and an Idenity provider.

Problem:
UAA caches the key from Identity provider. The default cache eviction time is 10 minutes. When the Identity provider changes the key, the UAA may end up with stale key in the cache.
All the token validations will fail during stale key period.

Resolution:
In case of signature validation failure, invalidate the cache and retrieve the latest/current key from Identity Provider and use it for validation. This behaviour is applicable for all signature validation failure.

Another potential solution considered:
Provide an endpoint or update existing endpoint that enables Identity provider to trigger cache invalidation upon key change.

Testing done:
Unit test and IT added
Manual testing done by deploying a UAA and an Idenity provider.

Signed-off-by: Peter Chen <peter-h.chen@broadcom.com>
Copy link

CLA Missing ID CLA Not Signed

@cf-gitbot
Copy link

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/187737226

The labels on this github issue will be updated when the story is started.

.andExpect(header("Accept", "application/json"))
.andRespond(withStatus(OK).contentType(APPLICATION_JSON).body(response));

externalOAuthAuthenticationManager.authenticate(xCodeToken); //with first key
Copy link
Member

@peterhaochen47 peterhaochen47 Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 896: It took me a while to understand what happens here. Is this true: in this line, the code is going to fetch the first key (which does not match xCodeToken), and then immediately re-fetches the key (getting the second key) and retries (succeed). If so, the //with first key comment is a little confusing, since it's eventually not using the first key anymore.

.andRespond(withStatus(OK).contentType(APPLICATION_JSON).body(response));

externalOAuthAuthenticationManager.authenticate(xCodeToken); //with first key
externalOAuthAuthenticationManager.authenticate(xCodeToken); //with second key
Copy link
Member

@peterhaochen47 peterhaochen47 Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 897: And then here it's gonna use the previously cached key (aka the second key), and it works.

@@ -184,7 +186,7 @@ public void getExternalAuthenticationDetails_whenProviderIssuerMatchesUaaIssuer_
throws ParseException, JOSEException {
oidcConfig.setIssuer(tokenEndpointBuilder.getTokenEndpoint(IdentityZoneHolder.get()));
expectedException.expect(InvalidTokenException.class);
expectedException.expectMessage("Could not verify token signature.");
expectedException.expectMessage("Invalid signature hash.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error message something the code consumes internally, or is it returned to the user? If the former then it's fine, if the latter we then might need to look deeper to see if there's any unexpected effects (since the user interface has technically changed).

@strehle
Copy link
Member

strehle commented Jun 5, 2024

Please explain the conflict situation in more detail, because we use the current mechanism for a longer period and don't have problems with rotation. The assumption for rotation is:

  1. that the external IdP does a key change with a unique key id (kid) in JWKS_URI
  2. that the external IdP keeps non active keys for a period of time (e.g. a Day) in JWKS

The cache is currently hard coded in amount of entries and 10 min, yes. So we maybe should have options to configured these parameters, but from our point of view we would increase evict time in cache from 10 min to 24 hours.

If you have an IdP which does not support a key rotation in the described way, then your option to mitigate this is, that you configure (and later update) the IdP configuration with a tokenKey , e.g. https://docs.cloudfoundry.org/api/uaa/version/77.10.0/index.html#oauth-oidc -> search for config.tokenKey.
With that, you are not using the cache, but if my assumptions with your IdP are correct, then in your case you should not use a cache but the configured signature validation key. You can control this key.

Allowed structure of this tokenKey is either

@strehle strehle added waiting-cla clarification needed The issue is not accepted but we need clarification labels Jun 5, 2024
@coolgang123
Copy link
Contributor Author

The following is the reason for proposed fix:

We have a customer who was running UAA version 74. The UAA used to issue a GET /jwks call with every token request for grant_type of urn:ietf:params:oauth:grant-type:jwt-bearer.

Recently, they have upgraded to UAA versions 77 which do NOT issue a callback on every request and appears to be caching the keys for some length of time. As stated, the cache duration is 10 minutes.

Customer refreshes keys in agressive manner. They generate new jwks keys on startup and every 10 minutes thereafter. So, the key generation runs into a race condition with caching and thus they encounter could not verify token signature errors.

The rotation assumptions may not work in the this aggressive rotation scenario.

@strehle
Copy link
Member

strehle commented Jun 6, 2024

The following is the reason for proposed fix:

We have a customer who was running UAA version 74. The UAA used to issue a GET /jwks call with every token request for grant_type of urn:ietf:params:oauth:grant-type:jwt-bearer.

Recently, they have upgraded to UAA versions 77 which do NOT issue a callback on every request and appears to be caching the keys for some length of time. As stated, the cache duration is 10 minutes.

Customer refreshes keys in agressive manner. They generate new jwks keys on startup and every 10 minutes thereafter. So, the key generation runs into a race condition with caching and thus they encounter could not verify token signature errors.

The rotation assumptions may not work in the this aggressive rotation scenario.

Okay, I now get it. So I assume you run on UAA before https://github.com/cloudfoundry/uaa/releases/tag/v74.30.0 ?

Because with this version we introduced the cache, but in your case I simply would answer, you should deactivate the cache.

Unfortunately you dont have the option to do this, so I think a small fix to allow to omit keyCache would solve your problem.

btw: For OIDC IdPs we have 2 caches

  1. well-known -> if you have defined discoveryUrl . Do you use this ?
  2. tokenKey -> if you either have defined discoveryUrl only or tokenUrl with tokenKey

I would start to omit keyCache only because I assume your IdP endpoint wont change, correct ?

Adding a option cacheJwk=true would be an easy task. You could set this then to false and would have the same logic in 77.11.0 as with 74.5.xxxx . What do you think ?

@strehle
Copy link
Member

strehle commented Jun 6, 2024

see #2920

@strehle
Copy link
Member

strehle commented Jun 11, 2024

think we can close this because pr #2920 is merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification needed The issue is not accepted but we need clarification delivered waiting-cla
Projects
Development

Successfully merging this pull request may close these issues.

4 participants