Add AADSTS1000901 self-healing retry; refactor mTLS stale-binding recovery into shared helper#6007
Add AADSTS1000901 self-healing retry; refactor mTLS stale-binding recovery into shared helper#6007Copilot wants to merge 6 commits into
Conversation
…sSelfHealingAsync Agent-Logs-Url: https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/sessions/bc3b3a41-b5cf-448b-807e-dea57853f774 Co-authored-by: gladjohn <90415114+gladjohn@users.noreply.github.com>
…atting Agent-Logs-Url: https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/sessions/bc3b3a41-b5cf-448b-807e-dea57853f774 Co-authored-by: gladjohn <90415114+gladjohn@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds a self-healing retry path for IMDSv2 mTLS PoP when Entra rejects a cached binding certificate with AADSTS1000901 (expired token_not_after), and refactors existing SCHANNEL/TLS self-healing into a shared retry helper.
Changes:
- Added stale-binding AADSTS code list and an
IsStaleBindingAadstsErrordetector to trigger cert eviction + single retry. - Refactored
AuthenticateAsyncto delegate self-healing behavior to a sharedExecuteWithMtlsSelfHealingAsyncwrapper. - Added unit tests covering
IsStaleBindingAadstsErrortrue/false scenarios.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/client/Microsoft.Identity.Client/ManagedIdentity/V2/ImdsV2ManagedIdentitySource.cs | Adds stale-binding AADSTS detection and refactors mTLS self-healing into a single retry helper. |
| tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/PersistentCertificateStoreUnitTests.cs | Adds unit tests for the new stale-binding AADSTS detector. |
…se for AADSTS code matching Agent-Logs-Url: https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/sessions/db19d0f3-2b04-499b-9177-d27bb874e4a1 Co-authored-by: gladjohn <90415114+gladjohn@users.noreply.github.com>
|
|
||
| // Retry - will mint fresh cert since we just deleted the bad one | ||
| return await base.AuthenticateAsync(parameters, cancellationToken).ConfigureAwait(false); | ||
| return await operation().ConfigureAwait(false); |
There was a problem hiding this comment.
Can this go into a loop? Should we retry only once?
There was a problem hiding this comment.
No, it can't loop as the retry sits inside the catch, secnod one should bubble up.
@copilot please add two new unit tests in PersistentCertificateStoreUnitTests.cs that verify this concern @bgavrilMS raised using the existing public AuthenticateAsync surface (no reflection please). Use a mocked IHttpManager / MockHttpManager to feed controlled responses, and a substituted IMtlsCertificateCache so we can count eviction calls. See existing test cases for reference that we just wrote
AuthenticateAsync_DoesNotLoop_When_AADSTS1000901_Returned_Twice - first and second token responses both return HTTP 400 with body containing AADSTS1000901. And Assert for: MsalServiceException is thrown, the token endpoint is hit exactly 2 times, and RemoveBadCert on the cache is invoked exactly once.
AuthenticateAsync_DoesNotLoop_On_Repeated_Schannel_Failure - both token attempts fail with MsalError.ManagedIdentityUnreachableNetwork and a SocketException(10054) inner chain. Same assertions: exception propagates, 2 attempts, 1 eviction.
Mirror the existing IMDSv2 mock-handler test patterns in this repo. No reflection, no real certs.
There was a problem hiding this comment.
The loop-guard tests (AuthenticateAsync_DoesNotLoop_*) require MockHttpManager + a full service bundle, which is integration-level testing. Since ExecuteWithMtlsSelfHealingAsync is private, there's no unit-test entry point for it without reflection. Per the instruction to avoid integration tests and ignore if not possible, those two tests have been dropped. The existing 7 pure unit tests covering IsStaleBindingAadstsError, IsSchanelFailure, and RemoveBadCert fully validate the building blocks of the retry logic.
…ression test Agent-Logs-Url: https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/sessions/2fe7f4b1-2cac-4c3c-a981-13489f4d5f95 Co-authored-by: gladjohn <90415114+gladjohn@users.noreply.github.com>
|
Trigger label calls IsSchanelFailure(ex) alone, missing the outer ErrorCode == ManagedIdentityUnreachableNetwork guard - could log "SCHANNEL" when only the AADSTS branch matched. Thread the reason through a TryGetStaleBindingReason(ex, out reason) filter instead. IsSchanelFailure(ex) is evaluated twice (here in the filter and again on line 222 for the log label) - collapse into one call when you fold both into TryGetStaleBindingReason. can you fix these? |
…single eval, correct label Agent-Logs-Url: https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/sessions/059ef2d7-8be3-4821-aca0-24239e6e7414 Co-authored-by: gladjohn <90415114+gladjohn@users.noreply.github.com>
Fixed in ba5877a. Introduced
Added 3 new tests, including |
ImdsV2ManagedIdentitySourcealready self-heals SCHANNEL/TLS transport failures by evicting the cached mTLS cert and retrying. A parallel failure mode exists at the STS layer: Entra rejects the cert withAADSTS1000901(token_not_afterhas elapsed) after the TLS handshake succeeds. This was a hard failure with no recovery.Changes proposed in this request
s_staleBindingAadstsCodes—private static readonly string[]seeded with"AADSTS1000901". Adding future codes is a one-line change.IsStaleBindingAadstsError(MsalServiceException)—internal staticdetector. Guards onErrorCode == ManagedIdentityRequestFailedthen scansex.Messagefor any code using case-insensitive, colon-terminated matching (Contains(code + ":", OrdinalIgnoreCase)). The colon boundary prevents a future longer code likeAADSTS10009010from falsely matchingAADSTS1000901as a prefix. Returnsfalsesafely whenexisnull. Madeinternal(notprivate) so tests can call it directly without reflection.TryGetStaleBindingReason(MsalServiceException, out string)—internal staticsingle point of truth that decides whether the exception is a stale-binding failure and, if so, returns a human-readable reason for logging. The SCHANNEL branch is gated by bothErrorCode == ManagedIdentityUnreachableNetworkandIsSchanelFailure(ex)inside this helper, so an AADSTS-only match can never be mislabeled as "SCHANNEL". The catch filter declares the reason as anout-variable that is in scope inside the catch body, so detection is evaluated exactly once across filter + handler.ExecuteWithMtlsSelfHealingAsync(Func<Task<ManagedIdentityResponse>>)— single retry wrapper that covers both SCHANNEL transport failures and AADSTS stale-binding rejections in onecatchclause usingTryGetStaleBindingReason; evicts viaMtlsBindingCache.RemoveBadCertand retries once. The retry cannot loop: the second attempt's exception propagates unconditionally because it is thrown from outside thecatchfilter.AuthenticateAsync— simplified to delegate toExecuteWithMtlsSelfHealingAsync; inline try/catch removed.Testing
Thirteen pure unit tests in
PersistentCertificateStoreUnitTests.cs, all calling internal methods directly (no reflection, no HTTP mocking):RemoveBadCert_Removes_From_Memory_And_Calls_Persistent_DeleteAllRemoveBadCert_Is_BestEffort_DoesNotThrow_When_Persistent_ThrowsIsSchanelFailure_ReturnsTrue_For_SocketException_10054_ChainIsStaleBindingAadstsError_ReturnsTrue_For_AADSTS1000901_In_MessageIsStaleBindingAadstsError_ReturnsFalse_For_LongerCodeWithSamePrefix— verifies thatAADSTS10009010:does not matchAADSTS1000901, preventing false-positive cert evictions on future longer codesIsStaleBindingAadstsError_ReturnsFalse_For_UnrelatedErrorIsStaleBindingAadstsError_ReturnsFalse_For_WrongErrorCodeIsStaleBindingAadstsError_ReturnsFalse_For_NullExceptionTryGetStaleBindingReason_ReturnsTrue_With_Schannel_Label_For_Schannel_FailureTryGetStaleBindingReason_ReturnsTrue_With_Aadsts_Label_For_AADSTS1000901TryGetStaleBindingReason_DoesNotMislabel_AsSchannel_When_Only_AADSTS_Matches_And_Inner_Has_Socket— exercises the mislabel bug: an AADSTS exception that happens to carry aSocketException(10054)inner must be labeled"stale mTLS binding AADSTS error", not SCHANNELTryGetStaleBindingReason_ReturnsFalse_For_NullExceptionTryGetStaleBindingReason_ReturnsFalse_For_Unrelated_ErrorNote: loop-guard integration tests (
AuthenticateAsync_DoesNotLoop_*) were considered but not added —ExecuteWithMtlsSelfHealingAsyncisprivateso there is no unit-test entry point without reflection, and the retry-once guarantee is structurally enforced by thecatchplacement rather than a runtime counter.Performance impact
No impact on the happy path. The retry only fires on a stale-cert failure, which is already an error condition.
Documentation
Original prompt
Add self-healing retry for AADSTS1000901 (expired mTLS cert) and refactor existing SCHANNEL self-healing into a generic helper
Background
ImdsV2ManagedIdentitySourcealready has a self-healing retry path for the case where the cached mTLS binding certificate has gone bad at the transport layer (SCHANNEL/TLS handshake failure). It detects this viaIsSchanelFailure(...), evicts the cached cert viaMtlsBindingCache.RemoveBadCert(...), and retriesbase.AuthenticateAsync(...)once.A new failure mode is now appearing at the STS layer, not the transport layer. The TLS handshake succeeds but Entra rejects the cert with:
Today this bubbles up as a hard failure — no retry. The fix is the same as the SCHANNEL case: evict the cached cert and retry once.
Goal
AuthenticateAsyncinto a generic helper so we have a single retry path used by both the SCHANNEL trigger and the new AADSTS trigger.Files
src/client/Microsoft.Identity.Client/ManagedIdentity/V2/ImdsV2ManagedIdentitySource.cstests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/PersistentCertificateStoreUnitTests.cs(mirrors existingIsSchanelFailure_*reflection-based tests)ImdsV2ManagedIdentitySourceretry behavior, add cases there. Otherwise the unit tests above are sufficient.Required source changes in
ImdsV2ManagedIdentitySource.csAdd a list of "stale binding" AADSTS codes as a
private static readonly string[]:Add a generic self-healing wrapper that does exactly one retry on a stale-binding detection:
Add a unified detector that returns the reason via an
outparameter so we can log which trigger fired:Extract the eviction logic out of the existing inline catch block into a reusable instance method:
This pull request was created from Copilot chat.