Skip to content

Using gMSA on multiple containers simultaneously causes Domain Trust Relationship to fail #405

Open

Description

Describe the bug
When running multiple containers simultaneously using the same gMSA on either the same host or different hosts, it causes one or multiple containers to lose their domain trust relationship leading to various issues including LsaLookUp and negotiate auth failures. This especially happens when the count of containers is equal to or more than count of domain controllers in the environment. However, it is also possible to run into this issue when the count of containers is less than count of domain controllers in the environment, provided two or more containers attempt to talk to the same domain controller.

To Reproduce

  1. Build an image from the following Dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:6.0-windowsservercore-ltsc2019 AS base

USER ContainerAdministrator
RUN reg.exe add "HKLM\SYSTEM\CurrentControlSet\Control\Lsa" /v LsaLookupCacheMaxSize /t REG_DWORD /d 0 /f

USER ContainerUser
ENTRYPOINT ["powershell.exe", "1..500 | %{ [void][System.Security.Principal.NTAccount]::new('contoso\\someobj').Translate([System.Security.Principal.SecurityIdentifier]).Value; Start-Sleep -Milliseconds (Get-Random -Minimum 100 -Maximum 1000); }"]

Replace contoso\someobj above with sam name of an actual object.

  1. Run the container image simultaneously on multiple hosts using the following command. To increase the chances of running into the issue, if there are N domain controllers in the environment, run the container image simultaneously on at least N+1 hosts
docker run --security-opt "credentialspec=file://gmsa-credspec.json" --hostname <gMSAName>  -it <image>

Replace <gMSAName> with actual gMSA and file://gmsa-credspec.json with actual gMSA Credential Spec file and <image> with the container image

  1. Monitor the output of all the containers, eventually one or more containers will start throwing the following error message. This usually happens within first few seconds of the container starting, assuming the docker run ... in (2) above was run simultaneously on different hosts. If it does not happen, repeat (2) until it does.

    Exception calling "Translate" with "1" argument(s): "The trust relationship between this workstation and the primary domain failed.

    While a running container is throwing the above error message in its output, exec into it and try performing some domain operation - that will fail as well.

Expected behavior
gMSAs on multiple Windows Containers is officially supported since at least Windows Server 2019. Running gMSA on multiple containers simultaneously should not result in trust relationship to fail.

Configuration:

  • Edition: Windows Server 2022
  • Base Image being used: Windows Server Core
  • Container engine: docker

Additional context

  • While the reproducer uses a PowerShell base image to demonstrate the bug, we had originally run into this issue in an ASP.NET Core web application while performing negotiate authentication.

  • The container image in the reproducer purposefully disables LSA LookUp Cache by setting LsaLookupCacheMaxSize to 0 to simplify the example.

  • If you were to observe traffic of a container that has run into this issue, the packet capture will indicate a lot of DSERPC/RPC_NETLOGON failure messages. You may also observe packets reporting nca_s_fault_sec_pkg_error
    image

  • Sometimes the container may "autorecover". It is purely a chance event. Whenever this happens, you can see RPC_NETLOGON packets in the network capture. Typically this results in the container recovering its domain trust relationship only when the NETLOGON happens through a different domain controller than what container had earlier communicated to.
    image

  • It is also possible to re-establish domain trust relationship of a failing container by running the following command in a failing container (the runtime user should be a ContainerAdministrator or should have administrators privileges)

    nltest.exe /sc_reset:contoso.com

    If the above command does not succeed, you may have to run it more than once. When the command succeeds, more often than not, all the affected containers and not just the current container "recover".

  • As mentioned in the bug description, it is very easy to run into this issue when the count of containers is more than the number of domain controllers in the environment but that is not the only scenario.

  • docker run ... is not the only way to run into this issue. It can be also be reproduced on an orchestration platform like Kubernetes, by setting replicas count of the Deployment to N+1; or by using scaling feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

Labels

P0Needs attention ASAPbugSomething isn't workinggMSAauthentication account across containers🔖 ADOHas corresponding ADO item

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions