Problem with NVIDIA GSP and g4dn, g5, and g5g instances #1523
Comments
@chiragjn can you please open a service ticket? (since you are using the stock unmodified AMI that the EKS team ships!) |
Ah okay, I have created one now. Just curious, are the stock AMIs not built using this codebase? The changelog seems to indicate that |
it's a layer above what's here in this repo @chiragjn (cough! check license of things cough!) |
Hey, we're experiencing this problem with K80 GPU EC2s like p2.xlarge. It works perfectly for the A10G/V100 GPUs (We're also using the stock AMIs) |
@bhavitsharma this is expected, as newer versions of the NVIDIA driver have dropped support for the chipsets used in p2: #1448 (comment) |
@cartermckinnon, as far as I understand, this is only for kubernetes 1.28. We're running 1.27 |
@bhavitsharma the GPU cards on P2s do not support 5xx series drivers. The 1.28 GPU AMI has always provided the 535 driver, but starting with release |
I am still waiting to hear back from the support team; just posting here that the issue is not consistently reproducible. We got another g5 node and things are working fine 🙃 |
@chiragjn I haven't been able to reproduce this, and we haven't received any other reports of weirdness on g5 instances. Have you narrowed down a reproduction? |
@cartermckinnon We are also not able to reproduce this consistently; just a few hours ago we had an issue, and so far it has happened in roughly 4 out of 20 attempts |
@chiragjn this sounds like something that needs to be reported using AWS support. Can you please open one? thanks! |
I have reported it; I am guessing they too are having trouble reproducing this. We are doing some tests of our own, and I'll post updates on our results |
I confirm we have the same problem on Kubelet: |
@dmegyesi do you see this on other |
@cartermckinnon I was able to get hold of a faulty g5.12xlarge node and check the logs, which led me to NVIDIA/open-gpu-kernel-modules#446. It reports a few different issues.
1 and 2 do not apply to g5 instances, so based on 3, I tried disabling GSP. First I checked whether GSP was enabled.
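A minimal sketch of such a check, assuming the stock 535 driver (which reports a GSP firmware version in `nvidia-smi -q` and exposes the `EnableGpuFirmware` parameter in procfs):

```sh
# A version string under "GSP Firmware Version" means GSP is in use; "N/A" means it is not.
nvidia-smi -q | grep -i "GSP Firmware"

# The kmod parameter can also be read back from procfs (exact value encoding varies by driver).
grep EnableGpuFirmware /proc/driver/nvidia/params
```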
It showed that the GSP firmware was in use.
Then I tried disabling it.
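For illustration, one way this can be attempted without a reboot, assuming nothing is holding the GPU devices open (a sketch, not necessarily the exact commands that were run):

```sh
# Unload the NVIDIA kernel modules (dependents first; this fails if a process still holds the GPU)...
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia

# ...then reload the main module with GSP firmware offload disabled.
sudo modprobe nvidia NVreg_EnableGpuFirmware=0
```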
Checked again; it did not work. Funnily enough, AWS' documentation on EC2 NVIDIA driver installation mentions this issue, but under the GRID and gaming drivers.
And it points to https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#disabling-gsp for the reason
Great, let's try this
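The documented approach boils down to persisting the module option and rebooting; roughly (the conf file name is arbitrary):

```sh
# Persist the option so the driver loads with GSP disabled on every boot...
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee /etc/modprobe.d/nvidia.conf

# ...then reboot so the module is reloaded with the new setting.
sudo reboot
```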
Checked again; it still did not work. At this point I went for the nuclear option: deleting the GSP firmware.
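A sketch of that step, assuming the firmware blobs live under the 535 driver's directory in /lib/firmware (the exact path depends on the installed driver version):

```sh
# Remove the GSP firmware blobs so the driver cannot load them and has to
# fall back to the non-GSP code path.
sudo rm /lib/firmware/nvidia/535.*/gsp*.bin

# Reload the driver (or reboot) so it comes up without GSP.
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
```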
And it works! dmesg complains, but it works!
And now my workload runs on this node |
@chiragjn thanks! That certainly looks like the smoking gun. Requiring a reboot puts us in a tough position; and I'm not sure we can do something at runtime before |
Apologies for not answering earlier, I was attending re:Invent and couldn't follow up with our customers before. Yes, we have seen this on various sizes of g5 instances. I'd say roughly 1 out of 5 times the machines actually worked; it seems random, and we can't see a pattern even with the same workload. We run the NVIDIA DCGM exporter on the nodes, which also touches the GPUs; not sure if this is relevant info. |
We also run the DCGM exporter, and we can confirm that not running it at least reduces the failure rate. But like you said, some nodes still run fine, so we don't suspect it is strictly a DCGM problem. |
@cartermckinnon Any luck with figuring out a solution? 😅 Any insights into the best place in the node lifecycle to fix this would also be great |
I came across NVIDIA/gpu-operator#634 (comment) which is also pointing to GSP as a possible source for this issue |
@cartermckinnon I have a working but quite hacky solution to disable GSP. Is it possible for the AMI team to configure the kernel params and disable GSP while building the kernel modules? |
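For reference, a hypothetical sketch of what such a user-data workaround can look like (assuming the 535 driver layout; not necessarily the exact workaround described above):

```sh
#!/usr/bin/env bash
# Persist the module option so any future load of the driver has GSP disabled...
echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/disable-gsp.conf

# ...remove the GSP firmware blobs as a belt-and-braces measure,
# then reload the module before any GPU workload is scheduled.
rm -f /lib/firmware/nvidia/*/gsp*.bin
rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia 2>/dev/null || true
modprobe nvidia
```

Baking the modprobe option into the AMI itself would avoid the reload/reboot dance entirely, since the driver would never load with GSP in the first place.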
Bumping this again. I am not sure how or where I can get support for this. |
Sorry for the delay; we're doing a rework of our NVIDIA setup to address #1494 which has taken priority.
Yes, I expect to get a fix out for this in the next few weeks. |
We've also appeared to hit this. So far a surefire way to trigger it has been to run a pod that just runs some
Before recreating the pod we run these commands without issue:
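(likely nvidia-smi health queries similar to the ones echoed in the reproduction pod spec further down, e.g.:)

```sh
nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.failure --format=csv,noheader
nvidia-smi --query-gpu=gpu_uuid,ecc.errors.uncorrected.volatile.sram --format=csv,noheader
```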
This was on a g5.48xlarge and Kube 1.26 AMI Version v20240110 |
hey @sidewinder12s - I am trying to reproduce the issue based on your comments above and just wanted to make sure I was following the process you were using. If I have the following pod spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo
      image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi']
      # args: ['nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.failure --format=csv,noheader']
      # args: ['nvidia-smi --query-gpu=gpu_uuid,ecc.errors.uncorrected.volatile.sram --format=csv,noheader']
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      resources:
        requests:
          nvidia.com/gpu: 4
        limits:
          nvidia.com/gpu: 4
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'true'
      effect: 'NoSchedule'
```

based on your comments (and correct me if I am wrong), you are saying if I deploy it, then remove it, then re-deploy it - that's when you see the issue? Something like:
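(a sketch with kubectl; the manifest filename here is an assumption)

```sh
kubectl apply -f gpu-test-nvidia-smi.yaml    # 1. deploy the pod
kubectl delete pod gpu-test-nvidia-smi       # 2. remove it once it has run
kubectl apply -f gpu-test-nvidia-smi.yaml    # 3. re-deploy it and check whether nvidia-smi now fails
```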
Is that correct? |
Roughly yes, though we're only ever assigning 1 GPU, and I am not sure if we're setting those env vars (it's not in our pod spec, but maybe we set it within the container image). This image is a health-checking image that constantly performs a bunch of health checks, including those nvidia-smi commands (largely to catch GPU problems that have caused us issues in the past). It's been a few weeks, but I'm also not sure if it was caused by a pod restarting, the daemonset pod being recreated/updated, or both. |
If you are using an image that's built on top of an NVIDIA image, they will have already added those environment variables. Since I'm just using a minimal AL2023 image here for simplicity, I have to add those to get the full SMI output details. But thank you for sharing, I'm going to keep digging into trying to reproduce the issue |
@cartermckinnon Apologies for the ping, but was any decision made here? 😅 |
The latest release (which will complete today) addresses this issue in Kubernetes 1.29 for |
Is there any documentation that mentions this? My userdata script that tries to disable GSP is sadly now broken with the new release rolling out the open source kmod |
In my tests, the kmod param
We intend to load the proprietary kmod on g5 types for the time being so that |
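For anyone who needs to confirm which flavor a node ended up with, a quick check (assuming modinfo is available on the host):

```sh
# The open GPU kernel module reports a "Dual MIT/GPL" license,
# whereas the proprietary kmod reports "NVIDIA".
modinfo -F license nvidia
```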
We've completed our rollout of this across all active k8s versions, so I'm going to close this issue. If you continue seeing this problem, please mention me here or open a case with AWS support. |
What happened:
We provisioned a g5.* instance and it was booted with the latest AMI, Release v20231116.
When we try to run any GPU workloads, the container toolkit (CLI) fails to communicate with the GPU devices. When we shell into the node and run
nvidia-smi -q
it really struggles to get output and a bunch of values are `Unknown Error`.
Adding lscpu and nvidia-smi logs
lscpu+nvidia-smi.log.txt
Workload runc errors
I am reporting this because we have seen similar issues in the last few days with A100 + Driver 535 + AMD EPYC configurations elsewhere.
How to reproduce it (as minimally and precisely as possible):
Provision a g5 instance with latest AMI, run
nvidia-smi -q
on the host.

Environment:
- `aws eks describe-cluster --name <name> --query cluster.platformVersion`: eks.7
- `aws eks describe-cluster --name <name> --query cluster.version`: 1.27 (v1.27.7-eks-4f4795d)
- `uname -a`: Linux ip-10-2-53-244.eu-west-1.compute.internal 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- `cat /etc/eks/release` (on a node):