CUBLAS_STATUS_NOT_INITIALIZED on NVIDIA Blackwell GPUs with PyTorch cu130 #343

@jhsmith409

Description

GLiNER inference fails with CUBLAS_STATUS_NOT_INITIALIZED on NVIDIA Blackwell architecture GPUs (compute capability 12.0) when using PyTorch 2.10.0+cu130.

The error occurs in DeBERTa v2's F.linear call during the encoder forward pass — specifically in cublasLtMatmulAlgoGetHeuristic. Model loading succeeds; only inference triggers the error.

Environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (SM 12.0, compute capability 12.0)
  • Driver: 595.45.04
  • PyTorch: 2.10.0+cu130 (from https://download.pytorch.org/whl/cu130)
  • GLiNER: 0.2.26
  • transformers: 4.57.6 (also reproduced with 5.1.0)
  • CUDA container: nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04
  • OS: Ubuntu 24.04

Reproduction

import torch
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
model = model.to("cuda").eval()

# This fails:
with torch.no_grad():
    entities = model.predict_entities(
        "Apple CEO Tim Cook announced new products in Cupertino.",
        ["person", "organization", "location"],
        threshold=0.5
    )

Error

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling
`cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(),
Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(),
preference.descriptor(), 1, &heuristicResult, &returnedResult)`

Full traceback points to:

transformers/models/deberta_v2/modeling_deberta_v2.py → DisentangledSelfAttention.forward
  → self.query_proj(query_states)
    → F.linear(input, self.weight, self.bias)

Key findings

  • Basic CUDA matmul works: `torch.randn(10,10,device='cuda') @ torch.randn(10,10,device='cuda')` succeeds, including in FP16
  • Fails in both FP32 and FP16 — the error is not related to .half() precision
  • Fails with transformers 4.57.6 and 5.1.0 — not a transformers regression
  • Specific to PyTorch cu130 — PyTorch 2.8.0+cu128 on the same GPU works perfectly
  • Upgrading the system cuBLAS from 13.0.2.14 to 13.3.0.5 (via the CUDA 13.2 container) did not help, since the PyTorch cu130 wheel bundles its own cuBLAS rather than using the system library
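The failure can likely be isolated without GLiNER at all, by calling `F.linear` directly with DeBERTa-sized tensors. A minimal sketch, assuming a hidden size of 768 (DeBERTa-v2-base sized; the exact shapes GLiNER's encoder uses are an assumption here). It falls back to CPU when no GPU is present:

```python
import torch
import torch.nn.functional as F

# Hedged isolation sketch: shapes are assumptions modeled on a 768-dim
# DeBERTa v2 query projection; on an affected cu130 + Blackwell setup the
# F.linear call below is the one that raises CUBLAS_STATUS_NOT_INITIALIZED.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(1, 128, 768, device=device)  # (batch, seq_len, hidden)
w = torch.randn(768, 768, device=device)     # query_proj weight
b = torch.randn(768, device=device)          # query_proj bias

out = F.linear(x, w, b)
print(out.shape)  # torch.Size([1, 128, 768])
```

If this snippet raises on cu130 but runs on cu128, that would confirm the problem sits below transformers, in PyTorch's cuBLASLt path.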

Workaround

Use PyTorch cu128 wheels instead of cu130:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

This is likely an upstream PyTorch bug in its cuBLAS handling on Blackwell for certain tensor shapes used by DeBERTa, but it is documented here since GLiNER users with Blackwell GPUs will hit it.
