Description
GLiNER inference fails with CUBLAS_STATUS_NOT_INITIALIZED on NVIDIA Blackwell architecture GPUs (compute capability 12.0) when using PyTorch 2.10.0+cu130.
The error occurs in DeBERTa v2's F.linear call during the encoder forward pass — specifically in cublasLtMatmulAlgoGetHeuristic. Model loading succeeds; only inference triggers the error.
Environment
- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (sm_120, compute capability 12.0)
- Driver: 595.45.04
- PyTorch: 2.10.0+cu130 (from https://download.pytorch.org/whl/cu130)
- GLiNER: 0.2.26
- transformers: 4.57.6 (also reproduced with 5.1.0)
- CUDA container: nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04
- OS: Ubuntu 24.04
Reproduction
import torch
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
model = model.to("cuda").eval()

# This fails:
with torch.no_grad():
    entities = model.predict_entities(
        "Apple CEO Tim Cook announced new products in Cupertino.",
        ["person", "organization", "location"],
        threshold=0.5,
    )
Error
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling
`cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(),
Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(),
preference.descriptor(), 1, &heuristicResult, &returnedResult)`
Full traceback points to:
transformers/models/deberta_v2/modeling_deberta_v2.py → DisentangledSelfAttention.forward
→ self.query_proj(query_states)
→ F.linear(input, self.weight, self.bias)
Key findings
- Basic CUDA matmul works — torch.randn(10,10,device='cuda') @ torch.randn(10,10,device='cuda') succeeds, including in FP16
- Fails in both FP32 and FP16 — the error is not related to .half() precision
- Fails with transformers 4.57.6 and 5.1.0 — not a transformers regression
- Specific to PyTorch cu130 — PyTorch 2.8.0+cu128 on the same GPU works perfectly
- Upgrading the system cuBLAS from 13.0.2.14 to 13.3.0.5 (via the CUDA 13.2 container) did not help, since the PyTorch cu130 wheels bundle and link their own cuBLAS rather than the system library
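The findings above can be condensed into a preflight check run before loading GLiNER. This is only a sketch: in practice you would pass in torch.__version__ and torch.cuda.get_device_capability(), and the failing combination encoded here is just what this report establishes — other cu130/GPU pairings are untested.

```python
# Preflight check encoding this report's findings: PyTorch cu130 builds on
# compute capability 12.0 (Blackwell) hit the cuBLAS failure, while cu128
# builds on the same GPU work. Values are taken from this report only.

def is_known_bad_combo(torch_version: str, capability: tuple) -> bool:
    """Return True if this torch build + GPU matches the failing setup."""
    is_cu130_build = "+cu130" in torch_version
    is_blackwell_12_0 = capability == (12, 0)
    return is_cu130_build and is_blackwell_12_0

# Combinations from this report:
print(is_known_bad_combo("2.10.0+cu130", (12, 0)))  # failing combo
print(is_known_bad_combo("2.8.0+cu128", (12, 0)))   # known-good combo
```

A check like this lets a service fail fast with an actionable message (install the cu128 wheels) instead of surfacing CUBLAS_STATUS_NOT_INITIALIZED deep inside the encoder forward pass.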
Workaround
Use PyTorch cu128 wheels instead of cu130:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
This is likely an upstream PyTorch bug with cuBLAS on Blackwell for certain tensor shapes used by DeBERTa, but documenting here since GLiNER users with Blackwell GPUs will hit this.
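For pinned deployments, the workaround can be captured in a requirements file. The exact version pins below mirror this report's known-good environment and are an assumption — adjust them to whatever the cu128 index currently serves:

```
--index-url https://download.pytorch.org/whl/cu128
torch==2.8.0+cu128
gliner==0.2.26
```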