Using gMSA on multiple containers simultaneously causes Domain Trust Relationship to fail

**Describe the bug**
When multiple containers run simultaneously with the same gMSA, on the same host or on different hosts, one or more of them lose their domain trust relationship. This leads to various issues, including LSA lookup and Negotiate authentication failures. It happens most often when the number of containers is equal to or greater than the number of domain controllers in the environment. It can also happen with fewer containers than domain controllers, provided two or more containers attempt to talk to the same domain controller.

**To Reproduce** 
1. Build an image from the following Dockerfile:
  ```Dockerfile
  FROM mcr.microsoft.com/dotnet/aspnet:6.0-windowsservercore-ltsc2019 AS base
  
  USER ContainerAdministrator
  RUN reg.exe add "HKLM\SYSTEM\CurrentControlSet\Control\Lsa" /v LsaLookupCacheMaxSize /t REG_DWORD /d 0 /f
  
  USER ContainerUser
  ENTRYPOINT ["powershell.exe", "1..500 | %{ [void][System.Security.Principal.NTAccount]::new('contoso\\someobj').Translate([System.Security.Principal.SecurityIdentifier]).Value; Start-Sleep -Milliseconds (Get-Random -Minimum 100 -Maximum 1000); }"]
  ```

  Replace `contoso\someobj` above with the SAM account name of an actual object in your domain.

2. Run the container image **simultaneously** on multiple hosts using the following command. **To increase the chances of running into the issue, if there are N domain controllers in the environment, run the container image simultaneously on at least N+1 hosts.**

   ```powershell
   docker run --security-opt "credentialspec=file://gmsa-credspec.json" --hostname <gMSAName> -it <image>
   ```
   Replace `<gMSAName>` with the actual gMSA name, `file://gmsa-credspec.json` with the actual gMSA credential spec file, and `<image>` with the container image.

3. Monitor the output of all the containers. Eventually, one or more containers will start throwing the following error message. This usually happens within the first few seconds of the container starting, assuming the `docker run ...` in step 2 was run simultaneously on different hosts. If it does not happen, repeat step 2 until it does.

    > Exception calling "Translate" with "1" argument(s): "The trust relationship between this workstation and the primary domain failed."

   While a running container is throwing the above error message in its output, `exec` into it and try performing a domain operation; that fails as well.
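
As one example of such a domain operation, the lookup from the reproducer can be repeated interactively after `exec`-ing into the failing container. This is a minimal sketch, not part of the original reproducer: `contoso\someobj` is the same placeholder used in the Dockerfile, and the `try`/`catch` wrapper is added here only for readability.

```powershell
# Run inside the affected container, for example after:
#   docker exec -it <container> powershell.exe
# 'contoso\someobj' is a placeholder; use the SAM account name of a real object.
try {
    $account = [System.Security.Principal.NTAccount]::new('contoso\someobj')
    $sid = $account.Translate([System.Security.Principal.SecurityIdentifier])
    Write-Output "Lookup succeeded: $($sid.Value)"
}
catch {
    # In a container that has lost its trust relationship, this is expected
    # to report the same "trust relationship ... failed" message.
    Write-Output "Lookup failed: $($_.Exception.Message)"
}
```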

**Expected behavior**
Using a gMSA with multiple Windows containers simultaneously has been officially supported since at least [Windows Server 2019](https://learn.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/gmsa-troubleshooting#using-a-gmsa-with-more-than-one-container-simultaneously-leads-to-intermittent-failures-on-windows-server-2016-and-windows-10-versions-1709-and-1803). Running a gMSA on multiple containers simultaneously should not cause the trust relationship to fail.

**Configuration:**
 - Edition: Windows Server 2022
 - Base Image being used: Windows Server Core
 - Container engine: docker

**Additional context**
- While the reproducer uses PowerShell to demonstrate the bug, we originally ran into this issue in an ASP.NET Core web application while performing Negotiate authentication.

- The container image in the reproducer deliberately disables the LSA lookup cache by setting `LsaLookupCacheMaxSize` to `0` to simplify the example.

- If you observe the network traffic of a container that has run into this issue, the packet capture shows many DCERPC/RPC_NETLOGON failure messages. You may also see packets reporting `nca_s_fault_sec_pkg_error`.
![image](https://github.com/microsoft/Windows-Containers/assets/3116732/9fbcf351-f8fb-4372-a97c-57236b613e12)

- Sometimes the container may "autorecover". This is purely a chance event. Whenever it happens, RPC_NETLOGON packets are visible in the network capture. Typically, the container recovers its domain trust relationship only when the NETLOGON goes through a _different_ domain controller than the one the container had communicated with earlier.
![image](https://github.com/microsoft/Windows-Containers/assets/3116732/58e178b6-339c-4a6f-92c6-dbe3b8780dce)

- It is also possible to re-establish the domain trust relationship of a failing container by running the following command inside it (the runtime user must be `ContainerAdministrator` or otherwise have administrator privileges):
  ```cmd
  nltest.exe /sc_reset:contoso.com
  ```

  If the above command does not succeed, you may have to run it more than once. When the command succeeds, more often than not all the affected containers "recover", not just the current one.

- As mentioned in the bug description, it is very easy to run into this issue when the count of containers is more than the number of domain controllers in the environment but that is not the only scenario.

- `docker run ...` is not the only way to run into this issue. It can also be reproduced on an orchestration platform like Kubernetes by setting the `replicas` count of the Deployment to N+1, or by using a scaling feature.
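
Because the `nltest.exe /sc_reset` workaround above may need more than one attempt, the retry can be scripted. The following is a minimal sketch under stated assumptions: `contoso.com` is the placeholder domain from the command above, the attempt count and delay are arbitrary illustrative values, and it assumes `nltest.exe` signals failure through a non-zero exit code. It must run as `ContainerAdministrator` or another user with administrator privileges.

```powershell
# Placeholder domain from the workaround above; replace with the real one.
$domain = 'contoso.com'
# Arbitrary illustrative limit, not a recommended value.
$maxAttempts = 5

for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
    # Ask Netlogon to reset the secure channel to the domain.
    nltest.exe "/sc_reset:$domain"

    # Assumption: a zero exit code means the reset succeeded.
    if ($LASTEXITCODE -eq 0) {
        Write-Output "Secure channel reset succeeded on attempt $attempt."
        break
    }

    Write-Output "Attempt $attempt failed (exit code $LASTEXITCODE); retrying."
    Start-Sleep -Seconds 5
}
```

This only automates the manual workaround; it does not address the underlying loss of the trust relationship.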

Using gMSA on multiple containers simultaneously causes Domain Trust Relationship to fail #405

avin3sh opened on Aug 2, 2023
