
Update win-ort-main to tip main 250211 #23646


Merged
merged 80 commits into win-ort-main from ashritms/main2win-ort-main-250211
Feb 11, 2025

Conversation

ashrit-ms
Contributor

Description

This PR updates the win-ort-main branch to the tip of the main branch as of 2025-02-11.

PR List

74c778e [WebNN EP] Automatically move input CPU tensors to ml-tensor (#23073)
3775057 use correct total length to fix static kv_cache performance (#23615)
3901e96 remove --use_vcpkg flag for Python-CUDA-Packaging-Pipeline (#23631)
c610df5 Add python_requires to package metadata (#23604)
2d27d68 [QNN EP] Add QNN EP to ARM64X build targets (#23635)
e666503 [webgpu] no longer need pass-in gpu adapter for custom context (#23593)
af679a0 Fix logic for selecting alternate name for blob (#23617)
e206950 [ARM CPU] Add fp16 mlas kernels for exp, tanh, softmax, logsoftmax, softcap (#23597)
9ba5619 Update pybind and json to the latest (#23589)
c54736c Migrate iOS release pipeline to 1 ES (#23606)
3981326 Increase timeout for Windows TensorRT CI (#23625)
0274b7b fix on trtCudaVersion (#23616)
740e9ab update run CI script (#23621)
5ef1832 [WebGPU] Support PIX Capture for WebGPU EP (#23192)
0114551 Fix for C4267 warning (#23610)
002916a Validate the context_file_path before EP compile graphs (#23611)
0887e36 [webgpu] Use pushErrorScope()/popErrorScope() once for an inference run (#23438)
65008cb Auto-generated baselines by 1ES Pipeline Templates (#23603)
09e5724 [CUDA] Fix beam search of num_beams > 32 (#23599)
82840f6 Implement Flash Attention 2 for webgpu EP (#23576)
a6ea57b OpenVINO EP Weights Sharing Feature (#23553)
2c2ff4a [CUDA] Fix BeamSearchTest.DummyT5WithSequenceInputIds test failure in Windows (#23596)
d981b15 [webgpu/js] Optimize resize webgpu op & fix precision issues (#23591)
328a13c Enable VCPKG in more pipelines (#23590)
6728d60 [TensorRT EP] support TensorRT 10.8-GA (#23592)
d1fb58b Quantization tool: Allow user to override calibrator's session EP (#23559)
649ced4 Enable user loading model with external data from memory buffer (#23557)
544bdd6 Fix ConvTranspose for certain attribute combinations (#23488)
8f6ddf3 Delete extra cgmanifest entries and files (#23583)
5f6a315 Enable VCPKG in CI build (#23426)
e1e3f62 Bump lintrunner from 0.12.5 to 0.12.7 (#23326)
cd8775f Fix Node JS Samples (#23581)
6b4f9c4 [WebGPU EP] Batch Norm Implementation (#23525)
1fce51b Fix all instances of 4244 and 4267 warnings in OV EP code (#23567)
c29ca1c Update QNN default version to 2.31 (#23573)
2fc75a4 [mobile] Add Android BrowserStack test project back (#23551)
9e18b6a [CUDA] Update nvcc flags (#23572)
b47e1e6 [QNN EP] Make offloading graph input/output quantization (to CPU) the default (#23368)
75a9b40 [ROCm] Update CI to use rocm 6.3.2 (#23577)
26ff2b6 Bump ruff from 0.9.3 to 0.9.4 (#23563)
b2560a7 Update react-native to 0.72 (#23509)
faee912 [js] update JavaScript API to support QNN EP options (#23486)
816e8cb [EP Perf] Update env to ubuntu 22.04 (#23570)
cddc271 Use Eigen in Round implementation (#23571)
e8b0bdb Shape inference: ReduceMean dispatcher, quant_pre_process: skip_symbolic_shape bugfix (#23558)
267b493 delete the supported domain version upper bounds (#23237)
bb7f961 remove log spam from cpuinfo (#23548)
169917b Use latest vcpkg commit in configuration, sync manifest with deps.txt (#23554)
a9d4d08 Add of ReduceMax Gradient (#23501)
6bbf1bd [js/web] upgrade version of flatbuffers (#23545)
271c509 DP4AMatMul perf refinements (#23539)
cb69c59 Add fusions for SigLIP and Conformer-Encoder (#23528)
61fae9b Remove "--enable_pybind" from webgpu pipeline (#23550)
0bb4ea6 Update BiasGelu fusion and related ops (#23518)
4dde74a Add more details to BrowserStack script failure (#23520)
ead9d5c Set ANDROID_USE_LEGACY_TOOLCHAIN_FILE to false (#23544)
7e24088 Enable dlpack by default (#23110)
dc2f7a9 Add overload of TryParseStringWithClassicLocale() that uses std::from_chars() (#23541)
5407c69 Fix the issue that the new generated EP context model not able to find external data (#23537)
fbae88f [js/web] use the recommended workaround for Vite (#23531)
d5338da Fix tensor external data info length parsing issue. (#23526)
e3e4173 [ROCm EP] Fix transpose helper for gfx gridsize constraints (#23527)
80bc1d2 Enable Ep context with external data for CPU nodes (#23498)
bf023ab [js/web] allow import .mjs/.wasm file (#23487)
655a23f [onnxruntime/build] Add new flag enable_generic_interface to build primary EPs by default (#23342)
a770a8d Update RN to 0.71.19 (#23381)
1cf0ebd Delete Prefast workflow until the build failure is fixed (#23510)
d2c5e24 Add of GlobalMaxPool Gradient (#23502)
ded8730 Remove thrust::unary_function (#23506)
8db97a6 [webgpu] Bump version of Dawn to b9b4a370 (#23494)
fdde2e2 Fix for gcc 13.3.1: Avoid creating a copy (#23500)
96ec1dd Bump ruff from 0.9.2 to 0.9.3 (#23496)
42f0c00 Adds the new System.Numerics.Tensors as an input/output type when using dotnet 8.0 and up. (#23261)
97c2bbe Fix shape infer of onnx GroupNorm (#23477)
1fc9c48 Enable coremltools for Linux build (#23481)
13348c5 [ARM CPU] hgemm optimized for gqa (#23107)
c89a798 Enable opti on Microsoft.ML.OnnxRuntime with RelWithDebInfo config (#23463)
d00ae32 Revert "[Mobile] Add BrowserStack Android MAUI Test (#23383)" (#23474)
8b1d3b3 Align AvgPool ceil_mode on last value to torch (#16752)
06fc73b [TRT EP Perf Tool] Add annotations import to python script to support annotations on Python 3.8 (#23466)

Motivation and Context

This update includes the change to add QNN EP to ARM64X build targets.

adrianlizarraga and others added 30 commits February 11, 2025 09:06
… annotations on Python 3.8 (#23466)

### Description
Adds `from __future__ import annotations` to the Python script to support
annotations on Python 3.8.



### Motivation and Context
The pipeline that runs this script uses Ubuntu 20.04's default Python
version (3.8), which does not support these annotations unless one imports
them from __future__.
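
For illustration, a minimal sketch of what the import enables (the function and names are hypothetical):
```python
# Minimal sketch: with the future import, PEP 604/585 syntax in annotations
# parses on Python 3.8 because annotations stay unevaluated strings.
from __future__ import annotations


def tokenize(text: str | None) -> list[str]:  # TypeError on 3.8 without the import
    return text.split() if text else []


print(tokenize("a b"))  # ['a', 'b']
```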
Fix #16203

Prior to this PR, if `ceil_mode` was on, the computed average would always
divide by the kernel size, even when the window contained fewer pixels than
the kernel size, which caused this operator to differ between ORT and torch.

However, this fix only applies to the change in #15597, which only
supports AvgPool since opset 19. Older opset versions remain the same,
as they use the mlas files.

Also, the PR fixes the shape mismatch caused by a sliding window starting
in the padding. More detail: onnx/onnx#6650 (and this PR is also validated
with the tests added in onnx/onnx#6650)
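
For illustration, a minimal torch repro of the divisor rule this fix aligns with (a sketch using torch as the reference; applies to AvgPool opset 19+ as noted above):
```python
# Minimal sketch of the ceil_mode divisor behavior, with torch as reference.
import torch

pool = torch.nn.AvgPool1d(kernel_size=2, stride=2, ceil_mode=True)
x = torch.tensor([[1.0, 2.0, 3.0]])  # (channels, length)
print(pool(x))  # tensor([[1.5000, 3.0000]])
# The trailing ceil-mode window contains only [3.], so torch divides by 1;
# before this fix, ORT divided by the kernel size (2), yielding 1.5.
```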
This reverts commit 9f9fcf7.

### Motivation and Context
- NuGet packaging pipelines were failing with this error:
```
Files\dotnet\packs\Microsoft.NET.Runtime.MonoTargets.Sdk\8.0.12\Sdk\RuntimeComponentManifest.targets(3,5):
error : Empty ResolveFrameworkReference.RuntimePackPath while trying to
read runtime components manifest. ResolvedFrameworkReference available:
{ Microsoft.NETCore.App, RuntimePackPath: }
```
…23463)

Microsoft.ML.OnnxRuntime is not built with the Release configuration but
with RelWithDebInfo, which is not recognized by the MSBuild SDK.
Consequently, the optimizations are not enabled. A fix would be to simply
force the configuration to Release when building the .NET code even if
RelWithDebInfo was set in the command-line arguments, but I could not find
an easy way to do that. Instead, I try to mimic the behavior of the
Release configuration by setting the optimize property.

I can see a 15% performance improvement using this simple model summing
up the 3 inputs:
```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;
using Microsoft.ML.OnnxRuntime;

var config = DefaultConfig.Instance; //.WithOptions(ConfigOptions.DisableOptimizationsValidator);
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);

public class OnnxBench
{
    private const int Iterations = 100_000;
    private const int BatchSize = 50;
    
    private InferenceSession _session = default!;
    private string[] _inputNames = default!;
    private OrtValue[] _inputValues = default!;
    private RunOptions _runOptions = default!;

    [GlobalSetup]
    public void GlobalSetup()
    {
        using SessionOptions sessionOptions = new();
        sessionOptions.InterOpNumThreads = 1;
        sessionOptions.IntraOpNumThreads = 1;
        sessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
        sessionOptions.ExecutionMode = ExecutionMode.ORT_SEQUENTIAL;

        _session = new InferenceSession(
            Convert.FromBase64String("CAo6cAoOCgFBCgFCEgFEIgNBZGQKDgoBQwoBRBIBWCIDQWRkEgJscloRCgFBEgwKCggBEgYKAAoCCAFaEQoBQhIMCgoIARIGCgAKAggBWhEKAUMSDAoKCAESBgoACgIIAWIRCgFYEgwKCggBEgYKAAoCCAFCBAoAEBU="),
            sessionOptions);
        _inputNames = ["A", "B", "C"];
        _inputValues =
        [
            OrtValue.CreateTensorValueFromMemory(new float[BatchSize], [BatchSize, 1]),
            OrtValue.CreateTensorValueFromMemory(new float[BatchSize], [BatchSize, 1]),
            OrtValue.CreateTensorValueFromMemory(new float[BatchSize], [BatchSize, 1]),
        ];
        _runOptions = new RunOptions();
    }

    [Benchmark(OperationsPerInvoke = Iterations)]
    public float Run()
    {
        var inputValues0Span = _inputValues[0].GetTensorMutableDataAsSpan<float>();
        var inputValues1Span = _inputValues[1].GetTensorMutableDataAsSpan<float>();
        var inputValues2Span = _inputValues[2].GetTensorMutableDataAsSpan<float>();
        for (int i = 0; i < BatchSize; i += 1)
        {
            inputValues0Span[i] = Random.Shared.NextSingle();
            inputValues1Span[i] = Random.Shared.NextSingle();
            inputValues2Span[i] = Random.Shared.NextSingle();
        }
        
        float sum = 0f;
        for (int i = 0; i < Iterations; i += 1)
        {
            using var output = _session.Run(_runOptions, _inputNames, _inputValues, _session.OutputNames);
            ReadOnlySpan<float> outputData = output[0].GetTensorDataAsSpan<float>();
            for (int j = 0; j < outputData.Length; j += 1)
            {
                sum += outputData[j];
            }
        }
        
        return sum;
    }
}
```

| Method | Mean     | Error     | StdDev    |
|------- |---------:|----------:|----------:|
| Before | 5.003 us | 0.0318 us | 0.0297 us |
| After   | 4.325 us | 0.0568 us | 0.0503 us |
### Description
Add fp16 kernels for GQA matmul on ARM CPU.
The kernels are MLAS HGEMM computing C = alpha * A * B' + beta * C (B transposed).
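
Stated as a standard GEMM update, the operation the new kernels compute is:
```latex
C \leftarrow \alpha \, A B^{\mathsf{T}} + \beta \, C
```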


### Motivation and Context
Add fp16 support for GQA, speed up the operator and reduce memory usage.

__Token Generation__

| | HGEMM Runtime (ns) | SGEMM Runtime (ns) | Speed-up (%) |
|---------------------------------|--------------------|--------------------|--------------|
| M:1/N:4096/K:4096 | 251551 | 1775905 | 85.84 |
| M:1/N:11008/K:4096 | 892507 | 4649145 | 80.80 |
| M:1/N:4096/K:11008 | 866860 | 3240015 | 73.25 |
| M:1/N:11008/K:11008 | 2631615 | 8783877 | 70.04 |

__Prompting__

| | HGEMM Runtime (ns) | SGEMM Runtime (ns) | Speed-up (%) |
|---------------------------------|--------------------|--------------------|--------------|
| M:1024/N:4096/K:4096 | 90508701 | 111283029 | 18.67 |
| M:2048/N:4096/K:4096 | 181307522 | 240211107 | 24.52 |
| M:1024/N:11008/K:4096 | 241120234 | 307707933 | 21.64 |
| M:2048/N:11008/K:4096 | 481091232 | 648921367 | 25.86 |
| M:1024/N:4096/K:11008 | 241736343 | 310129880 | 22.05 |
| M:2048/N:4096/K:11008 | 480456703 | 644814999 | 25.49 |
| M:1024/N:11008/K:11008 | 642121440 | 847925766 | 24.27 |
| M:2048/N:11008/K:11008 | 1276097154 | 1731314509 | 26.29 |
### Description

Enable coremltools for Linux build. In order to do this, I did:

1. Add uuid-devel to the Linux images and regenerate them.
2. Patch the coremltools code a little bit to add some missing header
files.

### Motivation and Context
To make the code simpler. Later on I will create another PR to remove
the COREML_ENABLE_MLPROGRAM C/C++ macro.
Also, after this PR I will bring more changes to
onnxruntime_provider_coreml.cmake to make it work with vcpkg.
### Description
Fix shape infer of onnx GroupNorm.


### Motivation and Context

Unable to run shape inference for onnx `GroupNorm`.


[model.onnx](https://raw.githubusercontent.com/onnx/onnx/refs/heads/main/onnx/backend/test/data/node/test_group_normalization_example/model.onnx)

```
> python D:\source\cognition\onnxruntime\onnxruntime\python\tools\symbolic_shape_infer.py --input model.onnx
Traceback (most recent call last):
  File "D:\source\cognition\onnxruntime\onnxruntime\python\tools\symbolic_shape_infer.py", line 2999, in <module>
    out_mp = SymbolicShapeInference.infer_shapes(
  File "D:\source\cognition\onnxruntime\onnxruntime\python\tools\symbolic_shape_infer.py", line 2935, in infer_shapes
    raise Exception("Incomplete symbolic shape inference")
```
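
For illustration, the same check via the Python API (the entry point is taken from the traceback above; the model path is a placeholder):
```python
# Sketch: run symbolic shape inference programmatically.
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("model.onnx")
# Raised "Incomplete symbolic shape inference" for GroupNorm before this fix.
inferred = SymbolicShapeInference.infer_shapes(model)
onnx.save(inferred, "model_inferred.onnx")
```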
…ng dotnet 8.0 and up. (#23261)

### Description
Adds the new System.Numerics.Tensors as an input/output type when using
dotnet 8.0 and up. It does not change/remove any of the existing API,
only adds additional ones.


### Motivation and Context
Now that C#/Dotnet has an official tensor type built into the language,
we want to expand the places that it can be used.
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.9.2 to 0.9.3.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
This change avoids creating a copy of the loop variable. GCC 13.3 suggests
using a reference type to prevent copying.


### Motivation and Context
While building onnxruntime 1.20.1 with the latest changes using GCC 13.3, I
get a build error like
```
onnxruntime-1.20.1/onnxruntime/core/optimizer/selectors_actions/selector_action_transformer.cc: In function 'onnxruntime::common::Status onnxruntime::MatchAndProcess(Graph&, const GraphViewer&, Node&, bool&, const logging::Logger&, const std::string&, const SelectorActionRegistry&, const SatRuntimeOptimizationSaveContext*)':
onnxruntime-1.20.1/onnxruntime/core/optimizer/selectors_actions/selector_action_transformer.cc:150:23: error: loop variable 'op_schema' creates a copy from type 'const gsl::not_null<const onnx::OpSchema*>' [-Werror=range-loop-construct]
  150 |       for (const auto op_schema : action_saved_state.produced_node_op_schemas) {
      |                       ^~~~~~~~~
onnxruntime-1.20.1/onnxruntime/core/optimizer/selectors_actions/selector_action_transformer.cc:150:23: note: use reference type to prevent copying
  150 |       for (const auto op_schema : action_saved_state.produced_node_op_schemas) {
      |                       ^~~~~~~~~
      |                       &
```
### Description

This PR updates the version of Dawn to
`b9b4a37041dec3dd62ac92014a6cc1aece48d9f3` (ref:
[chromium](https://chromium.googlesource.com/chromium/src.git/+/67f86f01ddb0e5cbdac4a050c17c468deb740c6c/DEPS#399))
in the `deps.txt` file.

The newer version of Dawn includes the previous changes from dawn.patch
so that we can remove the patch file.

There are some interface changes, and the code is updated accordingly.
### Description
Remove thrust::unary_function which is deprecated in later versions of
CUDA.

### Motivation and Context
Addresses issue: #23499
### Description
Added gradient computation support for the GlobalMaxPool node.



### Motivation and Context
Improve the training capabilities of ONNX Runtime.
### Description
Delete Prefast workflow until the build failure is fixed


### Motivation and Context
Right now the pipelines are failing due to an environment change from
GitHub.
### Description

Upgrading RN to 0.71.19, including Android and iOS changes. This PR
also includes the E2E test changes.

Used React-Native upgrade
[helper](https://react-native-community.github.io/upgrade-helper/?from=0.70.15&to=0.71.19&package=onnxruntime-android&name=onnxruntime)
as the reference.



### Motivation and Context
Need newer RN version to fix S360 work items.
…imary EPs by default (#23342)

### Description
- Add a new build flag in build.py to build onnxruntime.dll supporting
interfaces for all primary EPs (QNN, TensorRT, OpenVINO, VitisAI).
- Modify onnxruntime.dll/onnxruntime_shared.dll build settings to remove
the dependency on the IHV SDK toolset being installed on the system.
- Change CMake variables to be explicit about building EP vs ORT, e.g.
onnxruntime_USE_TENSORRT vs onnxruntime_USE_TENSORRT_INTERFACE, to
evolve the build system to build ORT independently of EPs.



### Motivation and Context
Changes in the build system are required to evolve the repo to build the
components independently while removing unnecessary dependencies.

---------

Co-authored-by: Lei Cao <jslhcl@gmail.com>
Co-authored-by: Karim Vadsariya <kvadsariya@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

Allow importing the `.mjs` and `.wasm` files.

When using Vite, this enables a web app to consume ORT-web with a simpler setup:
```js
import * as ort from 'onnxruntime-web';

import wasmFileUrl from 'onnxruntime-web/.wasm?url';
ort.env.wasm.wasmPaths = { wasm: wasmFileUrl };
```
### Description
When the user dumps the EP context model, if some nodes are not partitioned to the EP and they have external initializers, the dumped model still points to the old external data file. It does not make sense for the newly generated model to point to the old external data file.
For example, a model has nodes A, B, C, D, all with external initializers in ext.bin, so ext.bin contains data for A, B, C, D.
After dumping the EP context model, node A is on CPU, and nodes B, C, D are on the EP and dumped as an EPContext node. If A's data is still in ext.bin, the newly generated model has to depend on the old ext.bin, which contains all the external data for the old model, which is a big overhead.

Fix:
For the newly generated model, the user should have the option to specify a new external data file, so that the new model either packs all initializers into the ONNX model or keeps all initializers in the new external data file.
Add the option ep.context_model_external_initializers_file_name to specify the new external data file and size threshold. All initializers will be placed in the external data file if the option is specified; otherwise all initializers will be embedded inside the EP context ONNX model.

### Motivation and Context
Fix the issue #23358
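
For illustration, a hedged Python sketch of dumping an EP context model with the new option (the two `ep.context_*` keys are existing session config keys; the initializer-file key is quoted from the description above; file names and EP selection are placeholders):
```python
# Sketch: dump an EP context model whose initializers go into a fresh file.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("ep.context_enable", "1")
so.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")
# New option from this PR: route initializers of the generated model into a
# new external data file instead of keeping references to the old ext.bin.
so.add_session_config_entry(
    "ep.context_model_external_initializers_file_name", "model_ctx.bin")
sess = ort.InferenceSession("model.onnx", so)  # EP providers omitted for brevity
```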
Remove inline default transposeHelper and ensure we use the proper check
via CanUse_hipBlasTransposeHelper_MLFloat16

Related to change in ROCm Onnxruntime repo:
ROCm#82

### Description

Required to correctly limit grid size of transpose helper kernel

### Motivation and Context
The compiler was defaulting to the inline constructor that was removed
instead of using the overloaded case with proper checks.
Removed the inline default "true" case, as it is incorrect for newer
AMD cards/targets.

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Fix tensor external data info length parsing issue.

The old implementation was parsing a `size_t` value with `strtol` (via `OrtStrToPtrDiff`) on ARM64 MSVC.

https://github.com/microsoft/onnxruntime/blob/bf023ab3d565668c13a5334b505df0eb6acf3625/onnxruntime/core/platform/path_lib.h#L74

If we have `sizeof(size_t) == 8` and `sizeof(long) == 4` (as is the case for x64 and ARM64 MSVC), `strtol` will return a maximum value of `2^31-1` even for a larger, valid `size_t` value. `strtol` will also set `errno` to `ERANGE`, but we weren't checking that.

Updated to use `ParseStringWithClassicLocale` which will parse directly to the target type.

Added some tests.
### Description

After some investigation and debugging, I decided to follow the recommended
workaround as suggested in vitejs/vite#8427.

### Motivation and Context

There is a known issue with Vite 5.x when using a WebAssembly package.
Detailed information is in vitejs/vite#8427.

There were previous attempts to fix this problem (#23487). I tried
various ways to make it work out of the box for Vite users, but none
of them worked: some "fixes" fixed the usage of Vite but broke other
use cases/bundlers, and some introduced other issues. Eventually I figured
out that there is no good way to fix this inside ONNX Runtime.

Considering the root cause is inside Vite and it may be fixed in Vite
v6. I think now the best way is to follow the recommended workaround.
…d external data (#23537)

Fix the issue that the newly generated EP context model is not able to find external data.

### Description
The newly generated EP context model was not able to find the external data file because it lost track of the source model path, which was used to locate the external initializers.

Related to issue: #23358
…rom_chars()` (#23541)

Add overload of `TryParseStringWithClassicLocale()` that uses `std::from_chars()` for certain types.

Reduce binary size. It recently increased after PR #23526.
### Description
This PR will enable python dlpack interface by default.


### Motivation and Context
The dlpack python interface is useful in inference mode, not only in
training mode, since some inference result preprocessing may be written
in torch, and unnecessary device transfers should be reduced in those cases.
closes #15963 closes
#22061

TODOs:
- [x] Add tests like
https://github.com/microsoft/onnxruntime/blob/5407c69028ae6dd4e87521aea147c22153d8e6c7/orttraining/orttraining/test/python/orttraining_test_ortvalue.py
that's unrelated to training feature

---------

Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
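
For illustration, a hedged sketch of the inference-side use case described above, assuming OrtValue exposes the DLPack protocol as in the training bindings (the exact API surface may differ):
```python
# Sketch: hand an OrtValue to torch without a host round-trip, assuming
# OrtValue implements __dlpack__ (as in the training bindings).
import numpy as np
import onnxruntime as ort
import torch

ov = ort.OrtValue.ortvalue_from_numpy(np.arange(4, dtype=np.float32))
t = torch.from_dlpack(ov)  # zero-copy exchange via DLPack
print(t + 1)               # postprocess directly in torch
```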
NDK has two toolchain cmake files as you can see in 

https://android.googlesource.com/platform/ndk/+/refs/heads/main/build/cmake

By default NDK uses the legacy one to provide the best compatibility.
We don't need to, so this PR changes it to use the new one.

The new toolchain cmake file uses standard cmake flags like
CMAKE_ANDROID_RTTI to control C++ features.
### Description
Add details about how to access the BrowserStack logs

### Motivation and Context
- browserstack link on its own is confusing to people who don't have
context.

Let me know if you have suggestions to make the text more clear or
informative
### Description
(1) Update BiasGelu fusion to support onnx Gelu-20.

Since onnx Gelu-20 supports float/double/bf16/fp16, we update related
ops to support these data types in the CUDA and ROCm execution providers:
(2) Add double support for the Gelu/FastGelu ops in the CUDA/ROCm
execution providers.
(3) Add BFloat16 support for Gelu ops in the CUDA execution provider.

(4) Add unit tests.
(5) Update operator documents.

### Motivation and Context
#23491
There is a crash in the WebGPU CI pipeline. It crashes at process
shutdown when unloading onnxruntime_pybind11_state.pyd.
Here is the callstack:

```
 	dxil.dll!DxcSwapThreadMalloc()	Unknown
 	dxil.dll!DxcThreadMalloc::DxcThreadMalloc(struct IMalloc *)	Unknown
 	dxil.dll!DxcValidator::Release(void)	Unknown
 	[Inline Frame] webgpu_dawn.dll!Microsoft::WRL::ComPtr<IDxcValidator>::InternalRelease() Line 235	C++
 	[Inline Frame] webgpu_dawn.dll!Microsoft::WRL::ComPtr<IDxcValidator>::{dtor}() Line 290	C++
 	webgpu_dawn.dll!dawn::native::d3d12::Backend::`scalar deleting destructor'(unsigned int)	C++
 	webgpu_dawn.dll!`eh vector destructor iterator'(void * ptr, unsigned __int64 size, unsigned __int64 count, void(*)(void *) destructor)	C++
 	webgpu_dawn.dll!dawn::native::InstanceBase::~InstanceBase() Line 197	C++
 	webgpu_dawn.dll!dawn::native::InstanceBase::`scalar deleting destructor'(unsigned int)	C++
 	webgpu_dawn.dll!dawn::native::InstanceBase::DeleteThis() Line 218	C++
 	ucrtbase.dll!<lambda>(void)()	Unknown
 	ucrtbase.dll!__crt_seh_guarded_call<int>::operator()<<lambda_7777bce6b2f8c936911f934f8298dc43>,<lambda>(void) &,<lambda_3883c3dff614d5e0c5f61bb1ac94921c>>()	Unknown
 	ucrtbase.dll!_execute_onexit_table()	Unknown
 	onnxruntime_pybind11_state.pyd!dllmain_crt_process_detach(const bool is_terminating) Line 182	C++
>	onnxruntime_pybind11_state.pyd!dllmain_dispatch(HINSTANCE__ * const instance, const unsigned long reason, void * const reserved) Line 293	C++
 	ntdll.dll!LdrpCallInitRoutine()	Unknown
 	ntdll.dll!LdrShutdownProcess()	Unknown
 	ntdll.dll!RtlExitUserProcess()	Unknown
 	kernel32.dll!ExitProcessImplementation()	Unknown
 	ucrtbase.dll!exit_or_terminate_process()	Unknown
 	ucrtbase.dll!common_exit()	Unknown
 	python312.dll!00007ff9cab3ec8d()	Unknown
 	python312.dll!00007ff9cab3efbf()	Unknown
 	python312.dll!00007ff9cab3edee()	Unknown
 	python312.dll!00007ff9cab57f4c()	Unknown
 	python312.dll!00007ff9cab57579()	Unknown
 	python312.dll!00007ff9cab573be()	Unknown
 	python312.dll!00007ff9cab5729b()	Unknown
 	python312.dll!00007ff9cabacfcb()	Unknown
 	python312.dll!00007ff9cabacd7d()	Unknown
 	python312.dll!00007ff9cab99e2d()	Unknown
 	python.exe!00007ff78a641230()	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown
```
It might be because the destruction order of some global variables is
wrong. I saw DX DLLs getting destroyed earlier than the WebGPU
instance in our code in onnxruntime_pybind11_state.pyd.
### Description
This PR adds fusions for [Google's SigLIP
model](https://huggingface.co/google/siglip-base-patch16-224/) and
Microsoft's internal conformer-encoder model.

Here is an example of how to run the ORT transformer optimizer for the
SigLIP model.
```
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers
$ python3 optimizer.py --input /path/to/model.onnx --output /path/to/model_opt.onnx --model_type clip --num_heads 16 --hidden_size 1152 --use_external_data_format --opt_level 0 --disable_shape_inference
```

Here is an example of how to run the ORT transformer optimizer for the
conformer-encoder model.
```
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers
$ python3 optimizer.py --input /path/to/model.onnx --output /path/to/model_opt.onnx --model_type conformer --num_heads 16 --hidden_size 1024 --use_external_data_format --opt_level 0 --disable_shape_inference --convert_attribute
```

### Motivation and Context
This PR helps optimize multi-modal models that use SigLIP for the vision
encoder and conformer-encoder for the speech encoder.

This PR uses changes from the following PRs:
- pytorch/pytorch#144801
- microsoft/onnxscript#2018
- microsoft/onnxscript#2019
- microsoft/onnxscript#2020
- microsoft/onnxscript#2021
- microsoft/onnxscript#2022
- microsoft/onnxscript#2024
- microsoft/onnxscript#2025
- microsoft/onnxscript#2029
- microsoft/onnxscript#2033

### Introduction of ONNX Script

This PR introduces [ONNX
Script](https://github.com/microsoft/onnxscript) into the ORT
transformer optimizer as an optional step via the
`fold_transpose_initializers()` method of the `DynamoOnnxHelper` class.
In this change:

1. Vectorization of k is updated to 4.
2. Tile_A and Tile_B are stored transposed in shared memory, which improves
memory locality for our access pattern.
3. Lane output is switched to individual vectors and its loop is unrolled,
which solves the problem where lane output was not kept in registers before.

Perf improvements are not very consistent with this change. On Tigerlake
GPU with 32.0.101.6460 (latest Intel drivers):
```
Baseline

model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       7.36557e+06                         <<<<
        avg (tokens/s): 135.903
        p50 (us):       7.35498e+06
        stddev (us):    27599
        n:              5 * 1001 token(s)

With Change

model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       6.52302e+06                           <<<<
        avg (tokens/s): 153.457
        p50 (us):       6.52224e+06
        stddev (us):    10407.3
        n:              5 * 1001 token(s)
```

However, comparing before and after profiles in Intel GPA, one can
clearly see straight runs of ALU work without being interspersed with
writebacks to the local memory that contained lane_output before.


![image](https://github.com/user-attachments/assets/e01d3474-8406-4a61-b352-2ecbf0855a7f)
tianleiwu and others added 22 commits February 11, 2025 09:06
… Windows (#23596)

### Description
BeamSearchTest.DummyT5WithSequenceInputIds failed in Windows because
early stopping was triggered. The cause is that state.early_stopping_ is
interpreted as true in the cuda kernel at some point, although printf still
shows its value as false. The root cause is unknown.

Updating the code to use early_stopping as a template parameter seems to
work around the issue.

Other changes:
* Add some debug code (not built into the binary unless
DEBUG_GENERATION is defined) to assist debugging the beam search scorer in
CUDA.
* Enable the DummyT5WithSequenceInputIds test in CI. This test was not
previously run in the Windows CUDA CI pipeline.

### Motivation and Context

Fix a unit test BeamSearchTest.DummyT5WithSequenceInputIds failure in
Windows.
### Description
These changes ensure that weight sharing happens between two models using the session context option ep_weight_sharing.

Key changes introduced in this feature are:
- Creating a shared context between two models.
- Extracting external constant initializers and relabelling them as inputs to the model, to allow weight loading from the direct blob.
- Creating EP context nodes when subgraph partitioning is happening.

### Motivation and Context
This change is required to ensure that LLMs with prefill and kv_cache models can use the same shared weights.
It is also required to ensure EP context nodes can be formed even when the model is being subgraph partitioned.

---------

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
### Description
This change implements FlashAttention 2 for the webgpu EP for the MHA
operator.

Numbers from an Alderlake device show a 2.2x speed-up for prefill, which,
considering that attention is 50% of the prefill phase (the other 50% being
MatMul), implies a 4x speed-up for attention with this implementation. This
is in line with the expected perf gain of 2-4x for FlashAttention over
regular attention.

```
Baseline
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       9.54997e+06   <<<<<
        avg (tokens/s): 104.817
        p50 (us):       9.49218e+06
        stddev (us):    251442
        n:              5 * 1001 token(s)
------
With FlashAttention 2
PS C:\onnxruntime> C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       4.27937e+06     <<<<<
        avg (tokens/s): 233.913
        p50 (us):       4.27687e+06
        stddev (us):    5344.1
        n:              5 * 1001 token(s)
```

### Motivation and Context

On integrated GPUs, memory bandwidth is at a premium. FlashAttention makes
the softmax computation (and therefore the output attention vector
computation) a running operation instead of maintaining the full QK^T
attention scores in memory. As a result, we see significant improvements in
prefill speed: a 200% speed-up measured here.
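
For illustration, a numpy sketch of the running-softmax idea described above (not the WGSL shader from this PR; names are illustrative):
```python
# One pass over score blocks: no full attention-score row kept in memory.
import numpy as np

def streamed_attention_row(score_blocks, value_blocks):
    m, l, acc = -np.inf, 0.0, 0.0   # running max, normalizer, weighted sum
    for s, v in zip(score_blocks, value_blocks):
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)    # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + (p * v).sum()
        m = m_new
    return acc / l

scores = np.array([0.1, 2.0, -1.0, 3.0])
values = np.array([1.0, 2.0, 3.0, 4.0])
out = streamed_attention_row([scores[:2], scores[2:]], [values[:2], values[2:]])
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ values         # exact softmax-weighted sum
assert np.isclose(out, ref)
```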

This change uses techniques from co-operative matrix multiply to use
registers from a subgroup for fast in-register matrix multiply. Without
the co-operative matrix multiply technique, ADL showed about 6.0s prefill
time.

Tested on ADL/TGL Intel integrated and Nvidia 4070.

### Future Work
- Fine tuning and profiling optimizations.
- The current implementation is for prefill only; a generation-phase
optimized FA2 implementation is possible, however attention is a tiny part
of the generation phase.
### Description
* Pass topk_scores to beam scorer in slow topk path.
* Add an env variable `ORT_BEAM_SEARCH_USE_FAST_TOPK` to enable/disable fast topk.
* Add a test case for slow topk path.

### Motivation and Context

This bug was introduced in
#16272

Beam search uses a fast cuda kernel when the number of beams is <= 32. When
the beam size is larger than that threshold, we use another code path (a
slower cuda kernel) to get the top k. In this slow topk path, topk_scores
shall be passed to the beam scorer, but it was not.

This bug causes incorrect results when num_beams > 32. It was not
found previously since such large beam sizes are rarely used.
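
For illustration, a hedged sketch of toggling the environment variable named above ("0"/"1" semantics assumed):
```python
# Force the slow top-k path, e.g. when debugging; set before session creation.
import os
os.environ["ORT_BEAM_SEARCH_USE_FAST_TOPK"] = "0"
```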
…un (#23438)

The CPU wall time spent waiting for PopErrorScope is non-trivial, and
validation errors are not expected to happen in Release builds.

### Description
Validate the context_file_path before the EP compiles graphs, so that it fails fast. This avoids the possibility of the EP generating a new file (context binary file or blob file) that overwrites an existing file. Return an error if the path points to a folder.
### Description
A recent
[commit](1fce51b)
is causing an OVEP warning in
[openvino_provider_factory.cc](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/openvino/openvino_provider_factory.cc#L151).
This PR fixes the warning.

### Motivation and Context
Minor fix
The PIX capture tool requires a 'present' to end a frame capture. ORT
doesn't do rendering work, so no 'present' happens.

To avoid the PIX capture tool waiting endlessly, this PR adds a blank
surface and a 'present' on it in each session run.

The surface is created in the WebGPU EP constructor and closed in the
WebGPU EP destructor.

### Description

Add `Win_TRT_Minimal_CUDA_Test_CI`.
### Description
The TensorRT 10.8 zip file has a suffix of cuda-12.8, not cuda-12.6.
### Description

Increase the timeout from 150 minutes to 180 minutes.
### Description
Update pybind and json to the latest. 

### Motivation and Context
Resolve #23512
…oftcap (#23597)

### Description
Add fp16 mlas kernels for exp, tanh, softmax, logsoftmax, softcap on ARM
CPU



### Motivation and Context
Enables fast fp16 group query attention on the CPU EP.
### Description
When context embed mode is 0, there were some unhandled corner cases in OVEP that generated inconsistent/incorrect compiled blob names. This PR corrects that.


### Motivation and Context
Fix corner cases when OVEP generates external compiled blob names.
### Description

Remove the need to pass in the GPU adapter for the custom context.

With the introduction of the `wgpuDeviceGetAdapterInfo` API, we no
longer need the user to specify the GPU adapter when creating a custom
context.
### Description
Currently, when we build with --buildasx, only onnxruntime.dll is
built as an ARM64X binary.

This change addresses the above issue by adding the
onnxruntime_provider_shared and
onnxruntime_provider_qnn targets to the ARM64X_TARGETS build target
list.

### Motivation and Context
For QNN EP, which supports ARM64, --buildasx does not build the EP DLLs
such as onnxruntime_provider_shared.dll and onnxruntime_provider_qnn.dll
as ARM64X binaries.
### Description
Support for Python 3.8 and Python 3.9 was dropped at 1.20. Declare the
3.10 requirement in the metadata.

### Motivation and Context
Helps solvers like uv and poetry build accurate solutions; e.g. see
python-poetry/poetry#10151 and
astral-sh/uv#11274
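
For illustration, a minimal sketch of declaring the floor with setuptools (the repo's actual packaging scripts may differ):
```python
# Sketch: python_requires lets resolvers skip incompatible interpreters.
from setuptools import setup

setup(
    name="onnxruntime",         # illustrative
    python_requires=">=3.10",   # 3.8/3.9 support was dropped at 1.20
)
```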
When using static kv_cache, past_sequence_length is the max sequence
length of the kv_cache.
Issue 1: total_sequence_length will be larger than the cache entry.
Issue 2: we do way more calculations than needed, so things are noticeably
slower.
### Description
If it would improve performance, this patch moves CPU input tensors to
ml-tensor before sending them to the ONNX Runtime WebNN EP.

### Motivation and Context
We currently perform 2 extra copies on input tensors located on the
CPU when using the WebNN EP (JS -(copy)-> wasm heap -(copy)-> JS ->
WebNN API). This patch removes these extra copies.
@ashrit-ms ashrit-ms requested review from a team as code owners February 11, 2025 17:12
@ashrit-ms ashrit-ms self-assigned this Feb 11, 2025
@ashrit-ms ashrit-ms merged commit 0420687 into win-ort-main Feb 11, 2025
36 of 38 checks passed
@ashrit-ms ashrit-ms deleted the ashritms/main2win-ort-main-250211 branch February 11, 2025 17:15
ashrit-ms added a commit that referenced this pull request Mar 17, 2025
### Description
This change reverts the following PRs made to win-ort-main
0420687 Update win-ort-main to tip main 250211 (#23646)
480bcdf [VitisAI] Add vaip Integration Using FetchContent
(Cherry-pick of PR#22038 to win-ort-main branch) (#23608)
4b5b5f7 Update win-ort-main to tip main 250123 (#23473)
df87317 Update win-ort-main to tip main 250116 (#23398)

and cherry-picks commits between 6806174 and e0b66ca

---------

Signed-off-by: Junze Wu <junze.wu@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: Michael Tyler <michael.tyler@arm.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com>
Co-authored-by: Yueqing Zhang <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Wu, Junze <junze.wu@intel.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Matthieu Darbois <mayeut@users.noreply.github.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: wonchung-microsoft <wonchung@microsoft.com>
Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: PARK DongHa <luncliff@gmail.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Sam Webster <13457618+samwebster@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: Corentin Maravat <101636442+cocotdf@users.noreply.github.com>
Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Ted Themistokleous <107195283+TedThemistokleous@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@outlook.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Co-authored-by: ikalinic <ilija.kalinic@amd.com>
Co-authored-by: sstamenk <sstamenk@amd.com>
Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: Peishen Yan <peishen.yan@intel.com>
Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Alexis Tsogias <1114095+Zyrin@users.noreply.github.com>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
Co-authored-by: sushraja-msft <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Caroline Zhu <wolfivyaura@gmail.com>
Co-authored-by: Grégoire <gregoire.verdier@gmail.com>
Co-authored-by: Jing Fang <126209182+fajin-corp@users.noreply.github.com>
Co-authored-by: Yateng Hong <yatengh@microsoft.com>
Co-authored-by: Michael Sharp <51342856+michaelgsharp@users.noreply.github.com>
Co-authored-by: Malik Shahzad Muzaffar <shahzad.malik.muzaffar@cern.ch>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Karim Vadsariya <karim.vadsariya@microsoft.com>
Co-authored-by: Lei Cao <jslhcl@gmail.com>
Co-authored-by: Karim Vadsariya <kvadsariya@microsoft.com>
Co-authored-by: Takeshi Watanabe <take-cheeze@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Gavin Kinsey <98115505+ms-gavinkinsey@users.noreply.github.com>
Co-authored-by: Jon Campbell <jcampbell@cephable.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
Co-authored-by: shaoboyan091 <shaoboyan@microsoft.com>
Co-authored-by: David Hotham <david.hotham@microsoft.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Enrico Galli <enrico.galli@intel.com>
Co-authored-by: amarin16 <91211819+amarin16@users.noreply.github.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: David Fan <30608893+jiafatom@users.noreply.github.com>
Co-authored-by: Bilyana Indzheva <36890669+bili2002@users.noreply.github.com>
Co-authored-by: Michael Tyler <67695629+MichaelTylerArm@users.noreply.github.com>
Co-authored-by: Ranjit Ranjan <165394499+ranjitshs@users.noreply.github.com>
Co-authored-by: liqun Fu <liqfu@microsoft.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: rayngun <103146671+rayngun@users.noreply.github.com>
Co-authored-by: Surendar Rama Sitaraman <surendar.rama.sitaraman@intel.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com>
Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>