This repository was archived by the owner on Feb 11, 2025. It is now read-only.

Between "No GPU/TPU found, falling back to CPU." and "failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error" #157

@sbisw002

I am trying to get JAX to use the GPU in my new ThinkPad, which has an "NVIDIA RTX 4000 Ada 12 GB" graphics card.

No matter which combination of CUDA driver (12.4), cuDNN, jax, and jaxlib I try, the best result I get is either a) "No GPU/TPU found, falling back to CPU." or b) "failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error".

When I run the Data Sampler section of https://github.com/PredictiveIntelligenceLab/ImprovedDeepONets/blob/main/Stokes/PI_DeepONet_Stokes.ipynb I get errors like these:

a)
Installation:
pip install jaxlib==0.4.7+cuda12.cudnn88 jax==0.4.7 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Run:
runfile('/home/saumya/NeuralN/Op Net/ImprovedDeepONets/Stokes/PI_DeepONet_Stokes-Copy1', wdir='/home/saumya/NeuralN/Op Net/ImprovedDeepONets/Stokes')
2024-03-19 11:48:27.682846: I external/xla/xla/service/service.cc:168] XLA service 0x8dd95c0 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2024-03-19 11:48:27.682867: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Interpreter,
2024-03-19 11:48:27.689135: I external/xla/xla/pjrt/tfrt_cpu_pjrt_client.cc:218] TfrtCpuClient created.
2024-03-19 11:48:29.450971: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2024-03-19 11:48:29.450988: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: saumya-TP-GPU
2024-03-19 11:48:29.450991: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: saumya-TP-GPU
2024-03-19 11:48:29.451052: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 550.54.14
2024-03-19 11:48:29.451064: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 550.54.14 Release Build (dvs-builder@U16-A24-2-2) Thu Feb 22 01:44:50 UTC 2024
GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
"
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

b)
Installation:
pip install jaxlib==0.4.9+cuda12.cudnn88 jax==0.4.9 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Run:
2024-03-19 12:10:31.130411: I external/xla/xla/service/service.cc:168] XLA service 0x6a1d490 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2024-03-19 12:10:31.130427: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Interpreter,
2024-03-19 12:10:31.134477: I external/xla/xla/pjrt/tfrt_cpu_pjrt_client.cc:433] TfrtCpuClient created.
2024-03-19 12:10:50.428065: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2024-03-19 12:10:50.428083: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: saumya-TP-GPU
2024-03-19 12:10:50.428086: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: saumya-TP-GPU
2024-03-19 12:10:50.428143: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 550.54.14
2024-03-19 12:10:50.428156: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 550.54.14 Release Build (dvs-builder@U16-A24-2-2) Thu Feb 22 01:44:50 UTC 2024
GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
"
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
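
Both logs report libcuda at 550.54.14, while the kernel-module line says NOT_FOUND even though the quoted text contains 550.54.14. As a cross-check, here is a minimal sketch that extracts the version from that text and can be compared with the libcuda version by hand. It assumes the text comes from /proc/driver/nvidia/version (the log does not name the path), and the function name and regular expression are hypothetical helpers, not part of JAX or XLA.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed location of the NVIDIA driver version file on Linux (not named in the log).
VERSION_FILE = Path("/proc/driver/nvidia/version")

# Accepts both wordings:
#   "... Open Kernel Module for x86_64 550.54.14 ..."  (as quoted in the log above)
#   "... x86_64 Kernel Module  535.104.05 ..."          (proprietary-module wording)
_VERSION_RE = re.compile(r"Kernel Module\s+(?:for\s+\S+\s+)?(\d+(?:\.\d+)+)")


def kernel_module_version(text: str) -> Optional[str]:
    """Return the driver version found in the version-file text, or None."""
    match = _VERSION_RE.search(text)
    return match.group(1) if match else None


# The exact string quoted in the log output above.
SAMPLE = (
    "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 550.54.14 "
    "Release Build (dvs-builder@U16-A24-2-2) Thu Feb 22 01:44:50 UTC 2024\n"
    "GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)\n"
)

if __name__ == "__main__":
    print("from log sample:", kernel_module_version(SAMPLE))
    if VERSION_FILE.exists():
        print("from this machine:", kernel_module_version(VERSION_FILE.read_text()))
    else:
        print(f"{VERSION_FILE} not found (no NVIDIA kernel module loaded?)")
```

If the extracted version matches the libcuda version, the NOT_FOUND line may only reflect how the diagnostic parses the open kernel module's wording, and the cuInit failure would need to be investigated separately.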

My system:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

$ nvidia-smi
Tue Mar 19 12:21:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  ERR!                           Off |   00000000:01:00.0 N/A |                  N/A |
|ERR!  ERR!   ERR!              N/A / N/A |      14MiB /  12282MiB |     N/A      Default |
|                                         |                        |                 ERR! |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Python version:
$ whereis python | tr ' ' '\n' | grep ^/ | sort
/home/saumya/anaconda3/envs/OpNet/bin/python
$ python --version && python3 --version
Python 3.9.18
Python 3.9.18
