
[Performance] Severe performance penalty with transformer model and DirectML #20983

Open

Description

Describe the issue

I am testing Meta's Segment Anything (SAM) encoder model on both Linux (CUDA) and Windows (DirectML). Running the model on the two platforms with identical hardware (Intel i9-9900, NVIDIA RTX A2000 12GB), I see very different runtimes (median over 115 images):

  • On Linux+CUDA, model loading takes ~2s and encoding takes ~370ms per image.
  • On Windows+DirectML, model loading takes ~14s and encoding takes ~780ms per image.

I obtained these numbers using the C++ API v1.14.1 with some custom code, but I got comparable results with more recent versions (including the latest 1.18.0), with different hardware, and with the Python bindings. I thus decided to try profiling the model execution. Comparing the profiling on Linux+CUDA vs Windows+DirectML, it seems that the longer runtime on Windows+DirectML is related to the time spent in Memcpy_token_..._kernel_time. Why would DirectML need to make copies when CUDA doesn't? Can that really be related to the specific execution provider? [note: a very hacky test using CUDA on Windows might suggest that the CUDA EP also suffers from a similar issue on Windows, but I cannot tell for sure]

I am now wondering whether the issue I see is caused by some mistake on my side, e.g. in the model export, or whether it actually reflects a limitation of DirectML or Windows with this model. Other models (in particular, models without attention layers) do not show comparable platform-dependent differences. I also wonder whether the optimizations suggested for transformer models might have an impact, but I don't think SAM or ViT transformers are supported, or at least I did not understand how to apply the optimizations.

I am running out of ideas, at least given the time and hardware available to me, so I am writing to ask whether anybody has experienced similar issues or understands what is going on. Thanks.

Linux+CUDA profiling: https://drive.google.com/file/d/19NykxOWKMxZebQn3UQ9oOOs2atDv7O_8/view?usp=drive_link
Windows+DirectML profiling: https://drive.google.com/file/d/1mTCB1CzbQVj1EysXJ-hJ1wSGF077cAhV/view?usp=drive_link
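
For reference, the profiling traces above can be produced with onnxruntime's built-in profiler. Below is a minimal sketch, not the exact code used here: the file names are placeholders, and the exact point at which the trace is written out is my assumption based on the public API.

            // Enable the built-in profiler; onnxruntime emits a chrome-trace JSON file.
            #include <onnxruntime_cxx_api.h>

            Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "sam_profiling");
            Ort::SessionOptions session_options;
            session_options.EnableProfiling(ORT_TSTR("sam_encoder_profile"));  // output file prefix
            // ... append the CUDA or DirectML execution provider here ...
            Ort::Session session(env, ORT_TSTR("sam_encoder.onnx"), session_options);
            // ... run the encoder on the test images ...
            // The JSON trace (prefix + timestamp) should be written out by EndProfilingAllocated()
            // or at session teardown, and can be inspected in chrome://tracing.

The Memcpy_token_..._kernel_time entries mentioned above show up as individual events in these traces, which is how their contribution to the total runtime was compared between the two platforms.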

To reproduce

The exported ONNX model is available here.

For CUDA on Linux, the EP is created with the following options:

            OrtCUDAProviderOptions cuda_options;
            cuda_options.device_id = 0;
            cuda_options.arena_extend_strategy = 0;
            cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearch::OrtCudnnConvAlgoSearchDefault;
            cuda_options.gpu_mem_limit = 0;
            cuda_options.do_copy_in_default_stream = true;
            cuda_options.has_user_compute_stream = false;
            cuda_options.default_memory_arena_cfg = nullptr;
            session_options.AppendExecutionProvider_CUDA(cuda_options);

For DirectML on Windows, this is the set-up (based on this):

            session_options.DisableMemPattern();
            session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);

--- EDIT 10/06/24 ---
It turns out that the two options above no longer seem to be required. Removing them has a positive impact on the Windows+DirectML runtime (~750ms per image), which, however, remains far from the Linux+CUDA one.
--- END EDIT ---

In both cases I set session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED); and append the CPU provider as suggested in the documentation.
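
For completeness, the Windows+DirectML set-up boils down to something like the sketch below. It is an approximation of the set-up described above, not the exact code: the device index 0 and model path are placeholders, OrtSessionOptionsAppendExecutionProvider_DML comes from dml_provider_factory.h, OrtSessionOptionsAppendExecutionProvider_CPU from cpu_provider_factory.h, and whether DisableMemPattern/ORT_SEQUENTIAL are still needed depends on the onnxruntime version (see the edit above).

            // Sketch of the Windows+DirectML session set-up (placeholder names and paths).
            #include <onnxruntime_cxx_api.h>
            #include <dml_provider_factory.h>

            Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "sam_dml");
            Ort::SessionOptions session_options;
            session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
            // Older guidance required these two calls for DML; per the edit above they no longer seem needed.
            // session_options.DisableMemPattern();
            // session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
            Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(session_options, /*device_id=*/0));
            Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_CPU(session_options, /*use_arena=*/1));
            Ort::Session session(env, ORT_TSTR("sam_encoder.onnx"), session_options);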

--- EDIT 11/06/24 ---
Note that image preparation (resizing, normalization, padding), which is performed outside the onnxruntime inference call, is included in the per-image runtimes reported above. It cannot, however, explain the observed difference (~55ms on Linux vs ~60ms on Windows).
--- END EDIT ---
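
To be explicit, the per-image numbers are measured roughly as in the sketch below (placeholder names: preprocess() stands for the resize/normalize/pad step mentioned in the edit, and input_names/output_names are the model's actual tensor names):

            // Sketch of how the per-image runtime is measured (preprocess() and names are placeholders).
            #include <chrono>

            const auto t0 = std::chrono::steady_clock::now();
            Ort::Value input_tensor = preprocess(image);          // resize, normalize, pad (~55-60 ms)
            const auto t1 = std::chrono::steady_clock::now();
            auto outputs = session.Run(Ort::RunOptions{nullptr},
                                       input_names, &input_tensor, 1,
                                       output_names, 1);          // encoder forward pass
            const auto t2 = std::chrono::steady_clock::now();
            const auto prep_ms  = std::chrono::duration<double, std::milli>(t1 - t0).count();
            const auto total_ms = std::chrono::duration<double, std::milli>(t2 - t0).count();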

Urgency

This might be an important issue for DirectML-based inference on Windows.

Platform

Windows

OS Version

11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

Onnxruntime default DirectML version

Model File

Meta's Segment Anything (SAM) model exported with default settings, opset v17, constant folding optimization enabled, no dynamic input axes. Exported model available here.

Is this a quantized model?

No


Metadata

Assignees

No one assigned

    Labels

    ep:CUDA: issues related to the CUDA execution provider
    ep:DML: issues related to the DirectML execution provider
    performance: issues related to performance regressions
    platform:windows: issues related to the Windows platform
