
DML provider crash / runtime error on attention Einsum (SCUNet-GAN.simpl.onnx) #26842

@TheColorRed

Description

Running SCUNet-GAN.simpl.onnx (simplified from SCUNet-GAN.onnx) with the DirectML execution provider (ONNX Runtime DmlExecutionProvider) on Windows with an AMD GPU crashes the process with an access violation (exit code 3221225477, i.e. 0xC0000005) in my environment. Bisecting the graph by prefix isolates the crash to an Einsum node in the MSA (attention) block. Replacing that Einsum with Transpose+MatMul avoids the crash, but the replacement MatMul then fails on DML with a runtime error ("The parameter is incorrect"), while the same graph runs fine on CPU.

Files to reproduce

  • abra/ai/super-resolution/models/SCUNet-GAN.simpl.onnx (simplified full model)
  • deployment/tmp_prefix_65.onnx (minimal crashing prefix discovered by binary search)
  • deployment/tmp_prefix_65_fixed.onnx (Einsum replaced with Transpose+MatMul; triggers MatMul runtime error on DML)
  • deployment/onnx_dml_repro/run_repro.py (simple runner)
  • deployment/onnx_dml_repro/logs.txt (captured outputs & environment info)

Environment

  • OS: Windows 11 10.0.26200-SP0
  • Python: 3.12.10
  • onnx: 1.17.0
  • onnxruntime: 1.23.2
  • onnxruntime-directml: 1.21.1
  • GPU: AMD (local machine)
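The list above shows both onnxruntime and onnxruntime-directml installed, at different versions. Both wheels provide the same `onnxruntime` import package, so it may be worth confirming which build is actually loaded before triaging. A minimal sketch using only the standard library (the helper name `installed_versions` is made up for illustration):

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    packages = ["onnx", "onnxruntime", "onnxruntime-directml"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

Once the package imports, `onnxruntime.get_available_providers()` shows whether `DmlExecutionProvider` is present in the build that was loaded.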

Reproduction steps

  1. Activate virtualenv (.venv\Scripts\activate) in the repo root.
  2. (Quick test) Run: .venv\Scripts\python abra/ai/super-resolution/deployment/test_directml_safe.py abra/ai/super-resolution/models/SCUNet-GAN.simpl.onnx
    • Observed: earlier runs exited with 3221225477 (access violation); current runs sometimes fail instead with NO_SUCHFILE referencing the original model (see logs). Either way, the run fails on DML.
  3. (Isolated repro) Run: .venv\Scripts\python deployment/onnx_dml_repro/run_repro.py deployment/tmp_prefix_65.onnx
    • Observed: the earlier binary search identified tmp_prefix_65.onnx as the minimal prefix that still reproduces the crash on DML.
  4. (Einsum replacement experiment) Run: .venv\Scripts\python deployment/onnx_dml_repro/run_repro.py deployment/tmp_prefix_65_fixed.onnx
    • Observed: CPU runs OK; DML returns runtime error about MatMul parameters (see logs).
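Because the DML failure kills the interpreter, the exit code can only be captured if each provider runs in its own child process. The sketch below shows that pattern. It is not the contents of run_repro.py; it assumes float32 inputs and substitutes a placeholder size of 64 for any dynamic dimension, and the name `run_isolated` is hypothetical.

```python
import subprocess
import sys

# Child script: load the model with one provider and run it once on random data.
# Assumptions: all inputs are float32; dynamic dims are replaced by 64.
CHILD = r"""
import sys
import numpy as np
import onnxruntime as ort

model_path, provider = sys.argv[1], sys.argv[2]
sess = ort.InferenceSession(model_path, providers=[provider])
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 64 for d in inp.shape]
    feeds[inp.name] = np.random.rand(*shape).astype(np.float32)
sess.run(None, feeds)
print("OK", provider)
"""


def run_isolated(model_path, provider, timeout=600):
    """Run one inference in a child process; return (returncode, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, model_path, provider],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__" and len(sys.argv) > 1:
    for provider in ("CPUExecutionProvider", "DmlExecutionProvider"):
        code, out, err = run_isolated(sys.argv[1], provider)
        print(f"{provider}: returncode={code}")
        print(out.strip() or err.strip()[-500:])
```

A hard crash in the child shows up as a non-zero return code with no Python traceback, whereas the MatMul failure surfaces as an exception message in stderr.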

Observed logs

See deployment/onnx_dml_repro/logs.txt (contains MatMul runtime message and the earlier segfault exit code). Key excerpt:

  • MatMul error (DML):
    "Non-zero status code returned while running MatMul node... The parameter is incorrect."

  • Earlier access violation on the simplified model:
    returncode: 3221225477
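The decimal return code is easier to search for in hexadecimal NTSTATUS form. A small sketch of the conversion (the function names are illustrative, and the lookup table only covers the one code seen in this report):

```python
# Windows reports a crashed process's exit code as an unsigned 32-bit NTSTATUS.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def to_ntstatus(returncode):
    """Format a return code (signed or unsigned) as a 32-bit hex NTSTATUS string."""
    return f"0x{returncode & 0xFFFFFFFF:08X}"


def describe(returncode):
    """Return the hex form plus a symbolic name when the code is in the table."""
    name = KNOWN_STATUS.get(returncode & 0xFFFFFFFF, "unknown")
    return f"{to_ntstatus(returncode)} ({name})"


print(describe(3221225477))  # 0xC0000005 (STATUS_ACCESS_VIOLATION)
```

The mask also handles tools that print the same code as a signed value (-1073741819).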

Notes & suggestions

  • The problem appears tied to the Einsum/attention matmul patterns used in the model's MSA block. Replacing the Einsum removed the crash, but DML then failed to run the replacement MatMul (possibly a shape or parameter issue; CPU runs the same graph without error).
  • The minimal prefix (tmp_prefix_65.onnx) is fairly small (56 nodes) and should be helpful as a repro; SCUNet-GAN.simpl.onnx demonstrates the failure on the full model.
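For reference, this is the kind of rewrite the replacement experiment performs, shown in NumPy rather than as ONNX graph surgery. The equation "hbwpc,hbwqc->hbwpq" is an assumption (a typical windowed attention-score pattern); the equation in the attached model may differ, and the dimension sizes are arbitrary.

```python
import numpy as np


def scores_einsum(q, k):
    """Attention scores via Einsum (assumed equation, see note above)."""
    return np.einsum("hbwpc,hbwqc->hbwpq", q, k)


def scores_transpose_matmul(q, k):
    """Same result: swap the last two axes of k, then batched MatMul."""
    k_t = np.transpose(k, (0, 1, 2, 4, 3))  # (h, b, w, c, q)
    return np.matmul(q, k_t)                # (h, b, w, p, q)


rng = np.random.default_rng(0)
h, b, w, p, c = 2, 1, 3, 4, 5  # arbitrary illustrative sizes
q = rng.standard_normal((h, b, w, p, c)).astype(np.float32)
k = rng.standard_normal((h, b, w, p, c)).astype(np.float32)

a = scores_einsum(q, k)
m = scores_transpose_matmul(q, k)
assert a.shape == m.shape == (h, b, w, p, p)
assert np.allclose(a, m, atol=1e-5)
```

Under this assumption the MatMul operands are 5-D, so the ranks and shapes of the inputs reaching the failing MatMul in tmp_prefix_65_fixed.onnx may be worth adding to the logs.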

I can attach deployment/tmp_prefix_65.onnx and abra/ai/super-resolution/models/SCUNet-GAN.simpl.onnx for triage.

Metadata

Labels: ep:DML (issues related to the DirectML execution provider)
