[BUG] Nvidia A10 GPU node with AKS Ubuntu image ran into nearly stuck state (2000+ times slower) after running CUDA for several minutes #4868

**Describe the bug**
An Nvidia A10 GPU node running the AKS Ubuntu image becomes nearly stuck (2000+ times slower) after running CUDA workloads for several minutes.

**To Reproduce**

- Create an AKS cluster (Kubernetes version >= 1.29.0)
- Create an A10 node pool with the "Ubuntu 22.04.5 LTS / 5.15.0-1081-azure" image
- Log on to the A10 node with "kubectl debug node/xxx", using the container image "mcr.microsoft.com/mirror/nvcr/nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04"
- Install the following packages inside this container:
> apt update
> apt install vim
> apt install python3
> apt install pip
> pip install numpy
> pip install torch

- Use vim to create a Python file named "pytorch_test.py" and paste the following code into it:
```
import torch
import torch.nn as nn
import datetime
import time

# Set device to GPU
device = torch.device('cuda')

# Define a simple neural network model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1000, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create the model and move it to the GPU
model = SimpleModel().to(device)

# Create a random input tensor
input_tensor = torch.randn(64, 1000).to(device)

for i in range(1800):
    print(f"[{datetime.datetime.now()}] Torch running - loop {i}")
    for j in range(1000):
        output = model(input_tensor)
    time.sleep(1)
```
- Run the PyTorch script with the command "python3 pytorch_test.py"

The script runs and prints a timestamp for each loop, and each loop takes less than 1 second.
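Because CUDA kernel launches are asynchronous, the printed timestamps alone do not show how long the GPU work itself takes. The following is a minimal sketch of a timing helper, not part of the original script: `time_batch` and its parameters are hypothetical names introduced here for illustration. It synchronizes before starting and before stopping the clock so that queued GPU work is included in the measurement.

```python
import time
from typing import Callable, Optional


def time_batch(step: Callable[[], object],
               iterations: int = 1000,
               sync: Optional[Callable[[], None]] = None) -> float:
    """Return wall-clock seconds spent running `step` `iterations` times.

    `sync` (hypothetical parameter) is called before the clock starts and
    before it stops, so that queued asynchronous work such as CUDA kernels
    is counted in the elapsed time.
    """
    if sync is not None:
        sync()  # drain any work queued before the measurement
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    if sync is not None:
        sync()  # wait for the work launched inside the loop
    return time.perf_counter() - start
```

In "pytorch_test.py", the inner `for j in range(1000)` loop could then be replaced with `elapsed = time_batch(lambda: model(input_tensor), 1000, torch.cuda.synchronize)` and `elapsed` printed next to the timestamp, which keeps the 1-second sleep out of the measured time.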

**Expected behavior**
The code should continue running at normal speed for at least several days.

**Actual behavior**
After 7 to 20 minutes, execution suddenly becomes extremely slow, with each loop taking nearly 1 minute, so the script is almost stuck.
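To pin down when the slowdown starts, the timestamps printed by "pytorch_test.py" can be analyzed after the fact. The sketch below assumes the output was saved to a file and uses the exact line format the script prints; the function names and the 10-second threshold are hypothetical choices made for illustration, not values from this report.

```python
import re
from datetime import datetime
from typing import List, Optional, Tuple

# Matches lines such as "[<timestamp>] Torch running - loop 12"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] Torch running - loop (?P<loop>\d+)")


def loop_durations(lines: List[str]) -> List[Tuple[int, float]]:
    """Return (loop index, seconds until the next loop line) pairs."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamps.append((int(match["loop"]),
                           datetime.fromisoformat(match["ts"])))
    # Pair each timestamp with the following one to get per-loop durations.
    return [
        (loop, (next_ts - ts).total_seconds())
        for (loop, ts), (_, next_ts) in zip(stamps, stamps[1:])
    ]


def first_slow_loop(durations: List[Tuple[int, float]],
                    threshold_s: float = 10.0) -> Optional[int]:
    """Return the first loop index slower than `threshold_s`, or None."""
    for loop, seconds in durations:
        if seconds > threshold_s:
            return loop
    return None
```

For example, after running `python3 pytorch_test.py | tee run.log`, calling `first_slow_loop(loop_durations(open("run.log").read().splitlines()))` returns the loop at which the slowdown began, or `None` if no loop exceeded the threshold. Note that each measured duration includes the script's 1-second sleep.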

**Environment (please complete the following information):**

- Kubernetes version: >= 1.29.0
- Node image: AKSUbuntu-2204gen2containerd-202502.26.0

**Additional context**
The issue does not occur on a standard Ubuntu 22.04 LTS image (not provided by AKS) with the same GRID driver.
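When comparing the AKS node against a standard Ubuntu machine, it may help to record the driver version and GPU state on both, before and after the slowdown. The following is a rough sketch that assumes `nvidia-smi` is available on the PATH in the environment where it runs; the helper names and the choice of fields are illustrative assumptions, not something taken from this report.

```python
import csv
import io
import subprocess
from typing import Dict, List

# nvidia-smi query fields captured in each snapshot (illustrative selection)
FIELDS = ["name", "driver_version", "pstate", "clocks.sm", "temperature.gpu"]


def parse_snapshot(csv_text: str) -> List[Dict[str, str]]:
    """Parse `nvidia-smi --format=csv,noheader` output into one dict per GPU."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        dict(zip(FIELDS, (cell.strip() for cell in row)))
        for row in reader
        if row  # skip blank lines
    ]


def gpu_snapshot() -> List[Dict[str, str]]:
    """Run nvidia-smi (assumed to be on PATH) and return the parsed fields."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_snapshot(result.stdout)
```

Calling `gpu_snapshot()` once while the loops are fast and again once they are slow gives two records that can be compared field by field, for instance to see whether the performance state or SM clock changed.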


