
[BUG] Nvidia A10 GPU node with AKS Ubuntu image ran into nearly stuck state (2000+ times slower) after running CUDA for several minutes #4868

Open
@yinhew

Description


Describe the bug
Nvidia A10 GPU node with AKS Ubuntu image ran into stuck state (2000+ times slower) after running CUDA for several minutes

To Reproduce

  • Create an AKS cluster (Kubernetes version >= 1.29.0)
  • Create an A10 node pool with the "Ubuntu 22.04.5 LTS / 5.15.0-1081-azure" image
  • Log on to the A10 node via "kubectl debug node/xxx", using the container "mcr.microsoft.com/mirror/nvcr/nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04"
  • Install the following packages inside this container:

apt update
apt install vim
apt install python3
apt install python3-pip
pip install numpy
pip install torch

  • Use vim to create a Python file called "pytorch_test.py" and paste the code below into it:
import torch
import torch.nn as nn
import datetime
import time

# Set device to GPU
device = torch.device('cuda')

# Define a simple neural network model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1000, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create the model and move it to the GPU
model = SimpleModel().to(device)

# Create a random input tensor
input_tensor = torch.randn(64, 1000).to(device)

for i in range(1800):
    print(f"[{datetime.datetime.now()}] Torch running - loop {i}")
    for j in range(1000):
        output = model(input_tensor)
    time.sleep(1)
  • Run the PyTorch script with the command "python3 pytorch_test.py"

You will see the code running and printing a timestamp for each loop, with each loop taking less than 1 second.
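
To quantify how much slower each loop becomes, the inner loop can be timed explicitly. The following variant of the script is a sketch only (not part of the original reproduction): it wraps the 1000 forward passes in time.perf_counter() and calls torch.cuda.synchronize() so queued GPU work is included in the measurement; the 5-second threshold used to flag slow loops is an arbitrary choice.

import torch
import torch.nn as nn
import datetime
import time

device = torch.device('cuda')

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1000, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleModel().to(device)
input_tensor = torch.randn(64, 1000).to(device)

for i in range(1800):
    start = time.perf_counter()
    for j in range(1000):
        output = model(input_tensor)
    # Wait for queued GPU work so the elapsed time reflects actual GPU execution
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    marker = "  <-- SLOW" if elapsed > 5.0 else ""  # arbitrary threshold for the degraded state
    print(f"[{datetime.datetime.now()}] loop {i} took {elapsed:.2f}s{marker}")
    time.sleep(1)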

Expected behavior
The code should keep running at this speed for at least several days.

Actual behavior
After 7~20 minutes, execution suddenly becomes extremely slow, with each loop taking nearly 1 minute; the process appears almost stuck.

Environment (please complete the following information):

  • Kubernetes version: >= 1.29.0
  • Node image: AKSUbuntu-2204gen2containerd-202502.26.0

Additional context
The issue does not happen on a normal (not AKS-provided) Ubuntu 22.04 LTS installation with the same GRID driver.
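
One way to compare the AKS node with the plain Ubuntu host is to log GPU clocks while pytorch_test.py is running, since a clock drop at the moment of the slowdown would point at the driver or platform rather than the workload. The snippet below is a sketch under the assumption that nvidia-smi is available on PATH inside the debug container; the query fields are standard nvidia-smi --query-gpu options and the 10-second poll interval is arbitrary.

import subprocess
import time

# Poll SM/graphics clocks, temperature and utilization every 10 seconds
# while the PyTorch workload runs in another shell.
QUERY = "timestamp,clocks.sm,clocks.gr,temperature.gpu,utilization.gpu"

while True:
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(result.stdout.strip())
    time.sleep(10)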
