GPUDriveTorchEnv fails to initialize with CUDA_ERROR_INVALID_DEVICE due to incompatible CUDA version #500

@rshimomura0815

Description

Summary

When initializing the GPUDriveTorchEnv environment, the simulation fails with a CUDA_ERROR_INVALID_DEVICE error.

After troubleshooting, the root cause appears to be an incompatibility between the version of the CUDA Toolkit that PyTorch is bundled with and the explicit requirements of the Madrona Engine backend. While torch.cuda.is_available() returns True, the underlying Madrona engine cannot initialize a CUDA context.

This issue can be reproduced by running the official Docker container and executing example code, such as the 05_step_with_expert_actions.ipynb notebook.

Reproduction Steps

  1. Set up the environment using the official Dockerfile provided in the repository.
  2. Start the Docker container.
  3. Run any code that initializes GPUDriveTorchEnv, for example, the tutorial notebook examples/tutorials/05_step_with_expert_actions.ipynb.
  4. The kernel will crash, and checking the container logs reveals the CUDA error.

Expected Behavior

The GPUDriveTorchEnv environment should initialize successfully and allow the simulation to run on the GPU.

Actual Behavior

The program terminates with the following error:

$ uv run python gpudrive/env/env_torch.py 
/gpudrive/.venv/lib/python3.11/site-packages/pygame/pkgdata.py:25: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  from pkg_resources import resource_stream, resource_exists
Error at /gpudrive/external/madrona/src/mw/cuda_exec.cpp:1332 in GPUEngineState madrona::initEngineAndUserState(uint32_t, uint32_t, uint32_t, void *, uint32_t, void *, uint32_t, uint32_t, uint32_t, const GPUKernels &, ExecutorMode, CUdevice, CUcontext, cudaStream_t)
CUDA_ERROR_INVALID_DEVICE: invalid device ordinal

Environment and Diagnosis

  • GPU: NVIDIA RTX 4090
  • Host NVIDIA Driver: 576.88 (Supports up to CUDA 12.9)
  • Environment: Docker, built from the Dockerfile in the main branch.

The core of the issue is a dependency conflict:

Madrona's Requirement: The Madrona Engine's documentation explicitly states that it requires CUDA Toolkit 12.4 or 12.8+ and that CUDA 12.5 and 12.6 do not work due to bugs.
Source: Madrona GPU-Backend Dependencies

Installed PyTorch's CUDA Version: The PyTorch build installed by the default dependency resolution (uv sync) is compiled against CUDA 12.6. This was confirmed inside the container:

>>> import torch
>>> print(f"PyTorch CUDA Version: {torch.version.cuda}")
PyTorch CUDA Version: 12.6
>>> print(f"Is CUDA available? {torch.cuda.is_available()}")
Is CUDA available? True
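
The mismatch above can be checked mechanically. The sketch below encodes the supported set as stated in the Madrona docs cited above (12.4, or 12.8 and newer, with 12.5 and 12.6 broken); the function name is mine, not part of any API, and the version string is what torch.version.cuda reports:

```python
# Check whether the CUDA toolkit version PyTorch was built against falls in
# the range Madrona's docs say it supports: 12.4, or 12.8+. Versions 12.5
# and 12.6 are reported broken. "madrona_compatible" is a hypothetical
# helper for illustration, not part of Madrona or GPUDrive.

def madrona_compatible(cuda_version: str) -> bool:
    """Return True if a 'major.minor' CUDA version string should work with Madrona."""
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    if major != 12:
        return False
    return minor == 4 or minor >= 8

# torch.version.cuda in this container reports "12.6", which fails the check
# and matches the CUDA_ERROR_INVALID_DEVICE failure seen at init time:
print(madrona_compatible("12.6"))  # False
print(madrona_compatible("12.8"))  # True
```

This is why torch.cuda.is_available() returning True is not sufficient: PyTorch's own runtime works with the 12.6 toolkit, but Madrona's kernel compilation path does not.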

I'm not familiar with the current development environment, so any guidance on a workaround would be appreciated. (Perhaps applying #474 and moving to CUDA 12.8 is the way to go?)
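
If moving to a CUDA 12.8 build of PyTorch turns out to be the fix, one way to express it with uv is to pull torch from the cu128 wheel index. This is only a sketch, assuming the project manages dependencies through pyproject.toml and that a cu128 wheel exists for the pinned torch version; the index name here is arbitrary:

```
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cu128" }
```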

Thank you!
