Title: DML provider crash / runtime error on attention Einsum (SCUNet-GAN.simpl.onnx)
Description
Running SCUNet-GAN.simpl.onnx (simplified from SCUNet-GAN.onnx) with the DirectML execution provider (ONNX Runtime DmlExecutionProvider) on Windows with an AMD GPU crashes the process with an access violation (exit code 3221225477, i.e. 0xC0000005) in my environment. Binary-search isolation points to an Einsum node in the MSA block. Replacing that Einsum with Transpose+MatMul avoids the crash, but DML then fails at runtime on the MatMul node with "The parameter is incorrect".
Files to reproduce
- abra/ai/super-resolution/models/SCUNet-GAN.simpl.onnx (simplified full model)
- deployment/tmp_prefix_65.onnx (minimal crashing prefix discovered by binary search)
- deployment/tmp_prefix_65_fixed.onnx (Einsum replaced with Transpose+MatMul; triggers MatMul runtime error on DML)
- deployment/onnx_dml_repro/run_repro.py (simple runner)
- deployment/onnx_dml_repro/logs.txt (captured outputs & environment info)
Environment
- OS: Windows 11 10.0.26200-SP0
- Python: 3.12.10
- onnx: 1.17.0
- onnxruntime: 1.23.2
- onnxruntime-directml: 1.21.1
- GPU: AMD (local machine)
Reproduction steps
- Activate virtualenv (.venv\Scripts\activate) in the repo root.
- (Quick test) Run:
  .venv\Scripts\python abra/ai/super-resolution/deployment/test_directml_safe.py abra/ai/super-resolution/models/SCUNet-GAN.simpl.onnx
  - Expected: process returns 3221225477 (access violation) in my earlier runs; current runs sometimes show NO_SUCHFILE referencing original model (see logs). Either way the run fails on DML.
- (Isolated repro) Run:
  .venv\Scripts\python deployment/onnx_dml_repro/run_repro.py deployment/tmp_prefix_65.onnx
  - Observed: earlier binary search identified tmp_prefix_65.onnx as the minimal prefix that reproduces the problematic provider behavior.
- (Einsum replacement experiment) Run:
  .venv\Scripts\python deployment/onnx_dml_repro/run_repro.py deployment/tmp_prefix_65_fixed.onnx
  - Observed: CPU runs OK; DML returns runtime error about MatMul parameters (see logs).
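The contents of run_repro.py are not reproduced in this report. As a rough sketch of how such a runner could be structured, the snippet below runs each provider in a child process so that a native crash in the provider does not take down the caller. The names `CHILD_SCRIPT`, `run_isolated`, and `compare_providers` are hypothetical, and the child script assumes float32 inputs and treats every dynamic dimension as 1, which may not match the real model.

```python
import subprocess
import sys

# Hypothetical child script (not the actual run_repro.py): loads the model
# with a single execution provider and runs one inference on a constant input.
# Assumptions: all inputs are float32; dynamic/symbolic dims are set to 1.
CHILD_SCRIPT = r"""
import sys
import numpy as np
import onnxruntime as ort

model_path, provider = sys.argv[1], sys.argv[2]
sess = ort.InferenceSession(model_path, providers=[provider])
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feeds[inp.name] = np.full(shape, 0.5, dtype=np.float32)
outputs = sess.run(None, feeds)
print("OK", [o.shape for o in outputs])
"""


def run_isolated(args, timeout=600):
    """Run a command in a child process and collect its exit code and output."""
    proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return {
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def compare_providers(model_path):
    """Run the same model once per provider, each in its own process."""
    results = {}
    for provider in ("CPUExecutionProvider", "DmlExecutionProvider"):
        cmd = [sys.executable, "-c", CHILD_SCRIPT, model_path, provider]
        results[provider] = run_isolated(cmd)
    return results
```

Calling `compare_providers("deployment/tmp_prefix_65.onnx")` would then return one result per provider, so a CPU success can be compared side by side with the DML exit code and stderr.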
Observed logs
See deployment/onnx_dml_repro/logs.txt (contains MatMul runtime message and the earlier segfault exit code). Key excerpts:
- MatMul error (DML):
  "Non-zero status code returned while running MatMul node... The parameter is incorrect."
- Earlier segfault on simplified model:
  returncode: 3221225477
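For readers unfamiliar with the decimal exit code: Windows reports a crashed process's NTSTATUS value as its exit code, and 3221225477 is 0xC0000005 (STATUS_ACCESS_VIOLATION). A small helper along these lines can make such codes readable in logs; the function name and the short lookup table are illustrative and not part of the repro scripts.

```python
# A few common NTSTATUS crash codes; this table is illustrative, not exhaustive.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(returncode):
    """Format a process exit code as hex plus a known NTSTATUS name, if any."""
    # Mask to 32 bits so a signed representation maps to the same value.
    code = returncode & 0xFFFFFFFF
    name = NTSTATUS_NAMES.get(code, "unknown")
    return f"0x{code:08X} ({name})"


print(describe_exit_code(3221225477))
```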
Notes & suggestions
- The problem appears tied to Einsum/attention matmul patterns used in the model's MSA block. Replacing Einsum removed the segfault, but DML failed to run MatMul on the replacement (shape/parameter issue).
- Minimal prefix (tmp_prefix_65.onnx) is fairly small (56 nodes) and should be helpful as a repro; SCUNet-GAN.simpl.onnx demonstrates the failure at production scale.
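The exact Einsum equation is not quoted in this report. Assuming it has the windowed-attention form `hbwpc,hbwqc->hbwpq` on rank-5 tensors (an assumption about the model, to be checked against the actual node), the NumPy sketch below shows that a Transpose of the last two axes followed by a batched MatMul computes the same result, and that the same product can also be written with the leading axes flattened to a rank-3 MatMul. Whether tensor rank has anything to do with the DML "parameter is incorrect" error is untested here; the rank-3 form is only a candidate experiment.

```python
import numpy as np

# Assumed equation (not confirmed by this report): q and k are laid out as
# (heads, batch, windows, tokens, channels).
EQUATION = "hbwpc,hbwqc->hbwpq"


def scores_einsum(q, k):
    """Reference: attention scores computed directly with einsum."""
    return np.einsum(EQUATION, q, k)


def scores_transpose_matmul(q, k):
    """Transpose with perm=[0, 1, 2, 4, 3], then a batched rank-5 MatMul."""
    k_t = np.transpose(k, (0, 1, 2, 4, 3))
    return np.matmul(q, k_t)


def scores_rank3_matmul(q, k):
    """Same product with the three leading axes flattened into one batch axis."""
    h, b, w, p, c = q.shape
    n_k = k.shape[3]
    q3 = q.reshape(h * b * w, p, c)
    k3 = k.reshape(h * b * w, n_k, c)
    out = np.matmul(q3, np.transpose(k3, (0, 2, 1)))
    return out.reshape(h, b, w, p, n_k)


rng = np.random.default_rng(0)
q = rng.standard_normal((2, 1, 3, 4, 5)).astype(np.float32)
k = rng.standard_normal((2, 1, 3, 4, 5)).astype(np.float32)
ref = scores_einsum(q, k)
print(np.allclose(ref, scores_transpose_matmul(q, k), atol=1e-5))
print(np.allclose(ref, scores_rank3_matmul(q, k), atol=1e-5))
```

If both checks pass on CPU, a numerical mismatch can be ruled out as the cause of the MatMul failure, which leaves the provider's handling of the node's shapes or attributes as the more likely area to investigate.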
Would you like me to open this as an issue on the ONNX Runtime (runtime/directml provider) GitHub repo and attach these files? I can create the issue and attach the deployment/tmp_prefix_65.onnx and abra/ai/super-resolution/models/SCUNet-GAN.simpl.onnx files for triage.