Description
Describe the bug
An NVIDIA A10 GPU node running the AKS Ubuntu image enters a near-stuck state (2000+ times slower) after running CUDA workloads for several minutes.
To Reproduce
- Create an AKS cluster (Kubernetes version >= 1.29.0)
- Create an A10 node pool with the "Ubuntu 22.04.5 LTS / 5.15.0-1081-azure" image
- Log on to the A10 node with "kubectl debug node/xxx", using the container image "mcr.microsoft.com/mirror/nvcr/nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04"
- Install the following packages inside this container:
apt update
apt install vim
apt install python3
apt install pip
pip install numpy
pip install torch
- Use vim to create a Python file called "pytorch_test.py" and paste the following code into it:
import torch
import torch.nn as nn
import datetime
import time

# Set device to GPU
device = torch.device('cuda')

# Define a simple neural network model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1000, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create the model and move it to the GPU
model = SimpleModel().to(device)

# Create a random input tensor
input_tensor = torch.randn(64, 1000).to(device)

for i in range(1800):
    print(f"[{datetime.datetime.now()}] Torch running - loop {i}")
    for j in range(1000):
        output = model(input_tensor)
    time.sleep(1)
- Run the PyTorch code with the command "python3 pytorch_test.py"
The code runs and prints a timestamp for each loop, and each loop completes in less than 1 second (not counting the 1-second sleep).
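To make the per-loop timing easier to compare across nodes, the timestamps printed by the script can be turned into durations. The following is a minimal sketch, not part of the original repro; the function name `loop_durations` and the regular expression are illustrative assumptions based only on the print format used in pytorch_test.py. Each duration is the gap between two consecutive timestamps, so it includes the 1-second sleep.

```python
import re
from datetime import datetime

# Matches lines such as "[<timestamp>] Torch running - loop 5",
# which is the format printed by pytorch_test.py.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] Torch running - loop (?P<loop>\d+)$")


def loop_durations(lines):
    """Return (loop_index, seconds) pairs derived from consecutive log lines.

    Hypothetical helper: lines that do not match the expected format are skipped.
    """
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            # str(datetime.now()) uses a space separator, which fromisoformat accepts.
            stamps.append((int(match["loop"]), datetime.fromisoformat(match["ts"])))
    # The duration of a loop is the gap until the next loop's timestamp.
    return [
        (loop, (next_ts - ts).total_seconds())
        for (loop, ts), (_, next_ts) in zip(stamps, stamps[1:])
    ]
```

For example, after saving the output with "python3 -u pytorch_test.py | tee run.log" (run.log is a hypothetical file name), calling `loop_durations(open("run.log"))` gives one duration per completed loop.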
Expected behavior
The code should continue running at the same speed for at least several days.
Actual behavior
After 7 to 20 minutes, execution suddenly becomes extremely slow: each loop takes nearly 1 minute, and the process appears almost stuck.
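One way to pin down when the slowdown begins is to compare each loop's duration against a baseline taken from the first few loops. This is a rough sketch under stated assumptions: the helper name `first_slow_loop`, the 10-loop baseline window, and the 10x threshold are arbitrary illustrative choices, not values from this report.

```python
from statistics import median


def first_slow_loop(durations, baseline_loops=10, factor=10.0):
    """Return the index of the first loop slower than factor x baseline, or None.

    durations is a list of per-loop wall-clock times in seconds.
    baseline_loops and factor are assumed defaults, chosen only for illustration.
    """
    if len(durations) <= baseline_loops:
        return None
    # The median of the first few loops is treated as the healthy baseline.
    baseline = median(durations[:baseline_loops])
    for index, seconds in enumerate(durations[baseline_loops:], start=baseline_loops):
        if seconds > factor * baseline:
            return index
    return None
```

Using the median rather than the mean keeps a single noisy early loop from skewing the baseline. Since the script sleeps for 1 second per loop, the returned index also approximates how long the node ran before degrading.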
Environment (please complete the following information):
- Kubernetes version: >= 1.29.0
- Node image: AKSUbuntu-2204gen2containerd-202502.26.0
Additional context
The issue does not occur on a standard Ubuntu 22.04 LTS image (one not provided by AKS) with the same GRID driver.