GPUDriveTorchEnv fails to initialize with CUDA_ERROR_INVALID_DEVICE due to incompatible CUDA version #500

@rshimomura0815

Description

Summary

When initializing the GPUDriveTorchEnv environment, the simulation fails with a CUDA_ERROR_INVALID_DEVICE error.

After troubleshooting, the root cause appears to be an incompatibility between the version of the CUDA Toolkit that PyTorch is bundled with and the explicit requirements of the Madrona Engine backend. While torch.cuda.is_available() returns True, the underlying Madrona engine cannot initialize a CUDA context.

This issue can be reproduced by running the official Docker container and executing example code, such as the 05_step_with_expert_actions.ipynb notebook.

Reproduction Steps

  1. Set up the environment using the official Dockerfile provided in the repository.
  2. Start the Docker container.
  3. Run any code that initializes GPUDriveTorchEnv, for example, the tutorial notebook examples/tutorials/05_step_with_expert_actions.ipynb.
  4. The kernel will crash, and checking the container logs reveals the CUDA error.

Expected Behavior

The GPUDriveTorchEnv environment should initialize successfully and allow the simulation to run on the GPU.

Actual Behavior

The program terminates with the following error:

$ uv run python gpudrive/env/env_torch.py 
/gpudrive/.venv/lib/python3.11/site-packages/pygame/pkgdata.py:25: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  from pkg_resources import resource_stream, resource_exists
Error at /gpudrive/external/madrona/src/mw/cuda_exec.cpp:1332 in GPUEngineState madrona::initEngineAndUserState(uint32_t, uint32_t, uint32_t, void *, uint32_t, void *, uint32_t, uint32_t, uint32_t, const GPUKernels &, ExecutorMode, CUdevice, CUcontext, cudaStream_t)
CUDA_ERROR_INVALID_DEVICE: invalid device ordinal

Environment and Diagnosis

  • GPU: NVIDIA RTX 4090
  • Host NVIDIA Driver: 576.88 (Supports up to CUDA 12.9)
  • Environment: Docker, built from the Dockerfile in the main branch.

The core of the issue is a dependency conflict:

Madrona's Requirement: The Madrona Engine's documentation explicitly states that it requires CUDA Toolkit 12.4 or 12.8+ and that CUDA 12.5 and 12.6 do not work due to bugs.
Source: Madrona GPU-Backend Dependencies

Installed PyTorch's CUDA Version: The PyTorch build installed by the default dependency resolution (uv sync) is compiled against CUDA 12.6. This was confirmed inside the container:

>>> import torch
>>> print(f"PyTorch CUDA Version: {torch.version.cuda}")
PyTorch CUDA Version: 12.6
>>> print(f"Is CUDA available? {torch.cuda.is_available()}")
Is CUDA available? True
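
The mismatch above can be checked mechanically. The sketch below encodes the supported set as stated in the Madrona docs cited above (12.4, or 12.8 and newer, with 12.5 and 12.6 broken); the function name is mine, not part of any API, and the version string is what torch.version.cuda reports:

```python
# Check whether the CUDA toolkit version PyTorch was built against falls in
# the range Madrona's docs say it supports: 12.4, or 12.8+. Versions 12.5
# and 12.6 are reported broken. "madrona_compatible" is a hypothetical
# helper for illustration, not part of Madrona or GPUDrive.

def madrona_compatible(cuda_version: str) -> bool:
    """Return True if a 'major.minor' CUDA version string should work with Madrona."""
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    if major != 12:
        return False
    return minor == 4 or minor >= 8

# torch.version.cuda in this container reports "12.6", which fails the check
# and matches the CUDA_ERROR_INVALID_DEVICE failure seen at init time:
print(madrona_compatible("12.6"))  # False
print(madrona_compatible("12.8"))  # True
```

This is why torch.cuda.is_available() returning True is not sufficient: PyTorch's own runtime works with the 12.6 toolkit, but Madrona's kernel compilation path does not.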

I'm not familiar with the current development environment, so any guidance on a workaround would be appreciated. (Perhaps applying #474 and moving to CUDA 12.8 is the way to go?)
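
If moving to a CUDA 12.8 build of PyTorch turns out to be the fix, one way to express it with uv is to pull torch from the cu128 wheel index. This is only a sketch, assuming the project manages dependencies through pyproject.toml and that a cu128 wheel exists for the pinned torch version; the index name here is arbitrary:

```
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cu128" }
```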

Thank you!
