feat: Enable EpContext OVIR Encapsulation #704


Merged: 7 commits from ankit/epctx_encaps_feature into ovep-develop on Jun 24, 2025

Conversation


@ankitm3k ankitm3k commented Jun 11, 2025

Description

This PR enables the EPContext OVIR Encapsulation model import, compilation & inference feature.

To use this feature, enable the session option ep.context_file_path (an absolute path is required when explicitly using the CreateSessionFromArray() API):
onnxruntime_perf_test.exe -e openvino -m times -r 1 -o 0 -I -l -i "device_type|NPU" -C "ep.context_file_path|model.onnx" model.onnx

https://jira.devtools.intel.com/browse/CVS-169087
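
For reference, a minimal C++ sketch of the same setup through the ORT API (illustrative only, not code from this PR; paths and option values mirror the command above, and the AppendExecutionProvider_OpenVINO_V2 wrapper is assumed to be available in your ORT build):

#include <onnxruntime_cxx_api.h>
#include <fstream>
#include <vector>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "epctx"};
  Ort::SessionOptions so;

  // Session option from this PR; an absolute path is needed when the
  // session is created from an in-memory buffer (CreateSessionFromArray).
  so.AddConfigEntry("ep.context_file_path", "C:\\models\\model.onnx");

  // OpenVINO EP options, matching "device_type|NPU" in the command above.
  so.AppendExecutionProvider_OpenVINO_V2({{"device_type", "NPU"}});

  // Read the EPContext wrapper model into memory and create the session
  // from the buffer rather than from a file path.
  std::ifstream f("C:\\models\\model.onnx", std::ios::binary);
  std::vector<char> bytes{std::istreambuf_iterator<char>(f), {}};
  Ort::Session session(env, bytes.data(), bytes.size(), so);
  return 0;
}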

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds support for the EPContext OVIR Encapsulation feature by updating the model import, compilation, and inference logic across the OpenVINO provider. Key changes include:

  • Adding a new parameter (enable_causallm) to the OVCore::ImportModel API.
  • Updating the OVCore::ImportModel implementation to branch on XML model stream detection and enable stateful compilation.
  • Introducing a helper function to detect XML model streams and enforcing OpenVINO SDK version compatibility in the onnx_ctx_model_helper.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:
  • onnxruntime/core/providers/openvino/ov_interface.h: Added the bool enable_causallm parameter to ImportModel.
  • onnxruntime/core/providers/openvino/ov_interface.cc: Refactored ImportModel logic and repositioned log messages based on the model format.
  • onnxruntime/core/providers/openvino/onnx_ctx_model_helper.cc: Added the XML stream check and enforced SDK version compatibility with an error message.
  • onnxruntime/core/providers/openvino/backends/basic_backend.cc: Updated the call to ImportModel to supply the new parameter and model path name.
  • onnxruntime/core/providers/openvino/backend_utils.h: Declared the new IsModelStreamXML helper.
  • onnxruntime/core/providers/openvino/backend_utils.cc: Implemented IsModelStreamXML to detect XML headers in the model stream.
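
The IsModelStreamXML helper presumably peeks at the stream head for an XML declaration. A hypothetical sketch (an assumption, not the PR's actual code):

#include <istream>
#include <string>

// OpenVINO IR is XML, so check whether the stream begins with "<?xml",
// then restore the read position so the caller can re-read the stream.
static bool IsModelStreamXML(std::istream& model_stream) {
  const std::streampos start = model_stream.tellg();
  std::string head(5, '\0');
  model_stream.read(&head[0], static_cast<std::streamsize>(head.size()));
  const bool is_xml = (model_stream.gcount() == 5) && (head == "<?xml");
  model_stream.clear();    // clear eof/fail bits from a short read
  model_stream.seekg(start);
  return is_xml;
}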
Comments suppressed due to low confidence (1)

onnxruntime/core/providers/openvino/ov_interface.h:82

  • [nitpick] Consider renaming 'enable_causallm' to 'enableCausalLM' to follow typical C++ camelCase naming conventions and improve readability.
bool enable_causallm,
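
For context, the full ImportModel signature can be reconstructed from the exception text quoted later in this thread (parameter names other than enable_causallm are illustrative):

// ov_interface.h (reconstructed)
OVExeNetwork ImportModel(std::istream& model_stream,
                         std::string hw_target,
                         const std::map<std::string, ov::Any>& device_config,
                         bool enable_causallm,
                         std::string model_file_path);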

@RyanMetcalfeInt8 RyanMetcalfeInt8 left a comment

LGTM. I tested this on my LNL machine with some EPCtx models that are used for AI Toolkit. They seem to work fine.

The only issue I see, which is somewhat outside the scope of this specific PR: compared to the msb_release_v2 branch, Python applications are unable to recover from "KV-Cache is full." exceptions.

With this branch (and ovep-develop), when these exceptions are thrown, the infer request is deleted from the infer request queue (with a "delete Request0" message printed), and when the application tries to start another generation sequence, the application crashes.

With msb_release_v2, the application is able to catch these exceptions and then proceed / try again with the next generation sequence (after rewinding KV-Cache state, etc.).

// where weights from bin file is directly consumed
std::string xml_file_name = name;
if (name.size() >= 5 && name.substr(name.size() - 5) == ".onnx") {
  xml_file_name.replace(name.size() - 5, 5, ".xml");
}
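
For comparison, the same extension swap via std::filesystem (a sketch, not the PR's code):

#include <filesystem>
#include <string>

// 'name' is the input model path, e.g. "model.onnx" becomes "model.xml".
std::filesystem::path xml_path{name};
if (xml_path.extension() == ".onnx") {
  xml_path.replace_extension(".xml");
}
std::string xml_file_name = xml_path.string();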


Have we validated this with CreateSessionFromArray, where the model is passed in memory? Is there a way to decouple from the location on disk, since the onnx model and its contents should be portable?

Author
This is just file-name handling code: your input model name string, i.e. model.onnx, is now represented as an input xml file name string, i.e. model.xml. This won't impact the memory usage.

This is not regarding memory usage but whether a model passed in memory would work.

This can be verified by:
onnxruntime_perf_test.exe -e openvino -m times -r 1 -o 0 -I -l -i "device_type|NPU" -C "ep.context_file_path|C:\resnet50_int8_st.onnx" C:\resnet50_int8_st.onnx

Author

This is not regarding memory usage but whether a model passed in memory would work.

This can be verified by: onnxruntime_perf_test.exe -e openvino -m times -r 1 -o 0 -I -l -i "device_type|NPU" -C "ep.context_file_path|C:\resnet50_int8_st.onnx" C:\resnet50_int8_st.onnx

Tested on the latest commit 143f4c1 and it is functional.

@@ -73,7 +73,8 @@ BasicBackend::BasicBackend(std::unique_ptr<ONNX_NAMESPACE::ModelProto>& model_pr
 exe_network_ = OVCore::Get()->ImportModel(*model_stream,
                                           hw_target,
                                           device_config,
-                                          subgraph_context_.subgraph_name);
+                                          enable_causallm,
+                                          session_context_.onnx_model_path_name.string());

@javier-intel, @preetha-intel: Does this change, while adding support for OV IR wrapped in ONNX, impact pre-compiled and partitioned ONNX models, since we no longer pass any reference to subgraph_context into ImportModel?

Author

The field subgraph_context_.subgraph_name was redundant in the current implementation; it was only used in exception handling to report the graph name. We are using the original model name here instead so the model's XML contents can be parsed while loading the model.

With subgraph_context_.subgraph_name we get better error handling inside ImportModel, so let's try to retain the argument.
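
i.e., keeping the graph name lets ImportModel report it on failure, along these lines (an illustrative sketch mirroring the exception text quoted elsewhere in this thread, not the actual code):

try {
  // ... read/compile the network via ov::Core ...
} catch (const std::exception& e) {
  // Surface the subgraph name so failures identify which graph broke.
  throw std::runtime_error("[OpenVINO-EP] Exception while Loading Network for graph: " +
                           subgraph_name + ": " + std::string(e.what()));
}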

Author
Fixed.

@ankitm3k ankitm3k force-pushed the ankit/epctx_encaps_feature branch 2 times, most recently from c1aa179 to 1748e06 Compare June 16, 2025 07:51
@RyanMetcalfeInt8 RyanMetcalfeInt8 left a comment


I see failures after testing with the latest force-push. It seems like when I use EPCtx-wrapped models, the control flow is attempting to use the ImportModel path:

>python chat_sequence_test.py --ortgenai_model_path C:\Users\LocalAdmin\ort_build\EPCtxModels\Qwen2.5-1.5B-Instruct_context_ov_dynamic_sym_bkp_int8_sym_r1\Qwen2.5-1.5B-Instruct_context_ov_dynamic_sym_bkp_int8_sym_r1\model
Creating Model...
2025-06-16 07:39:25.1757742 [E:onnxruntime:onnxruntime-genai, inference_session.cc:2288 onnxruntime::InferenceSession::Initialize::<lambda_e5577c429b426a2a020919e219e36787>::operator ()] Exception during initialization: C:\Users\LocalAdmin\ort_build\BuildArtifacts-20250616_065035\onnxruntime\onnxruntime\core\providers\openvino\ov_interface.cc:232 class onnxruntime::openvino_ep::OVExeNetwork __cdecl onnxruntime::openvino_ep::OVCore::ImportModel(class std::basic_istream<char,struct std::char_traits<char> > &,class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >,const class std::map<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >,class ov::Any,struct std::less<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >,class std::allocator<struct std::pair<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const ,class ov::Any> > > &,bool,class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >) [OpenVINO-EP]  Exception while Loading Network for graph: C:\Users\LocalAdmin\ort_build\EPCtxModels\Qwen2.5-1.5B-Instruct_context_ov_dynamic_sym_bkp_int8_sym_r1\Qwen2.5-1.5B-Instruct_context_ov_dynamic_sym_bkp_int8_sym_r1\model\openvino_model.onnxException from src\inference\src\cpp\core.cpp:112:
Exception from src\inference\src\dev\plugin.cpp:53:
Exception from src\plugins\intel_npu\src\plugin\npuw\compiled_model.cpp:460:
Failed to compile Model0_FCEW000__0 for all devices in [NPU]

Traceback (most recent call last):
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\chat_sequence\chat_sequence_test.py", line 277, in <module>
    main(args)
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\chat_sequence\chat_sequence_test.py", line 115, in main
    chat = OrtGenaiChat(args, search_options)
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\common\ortgenai_chat.py", line 22, in __init__
    self.model = og.Model(config)
RuntimeError: Exception during initialization: C:\Users\LocalAdmin\ort_build\BuildArtifacts-20250616_065035\onnxruntime\onnxruntime\core\providers\openvino\ov_interface.cc:232 class onnxruntime::openvino_ep::OVExeNetwork __cdecl onnxruntime::openvino_ep::OVCore::ImportModel(class std::basic_istream<char,struct std::char_traits<char> > &,class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >,const class std::map<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >,class ov::Any,struct std::less<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >,class std::allocator<struct std::pair<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const ,class ov::Any> > > &,bool,class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >) [OpenVINO-EP]  Exception while Loading Network for graph: C:\Users\LocalAdmin\ort_build\EPCtxModels\Qwen2.5-1.5B-Instruct_context_ov_dynamic_sym_bkp_int8_sym_r1\Qwen2.5-1.5B-Instruct_context_ov_dynamic_sym_bkp_int8_sym_r1\model\openvino_model.onnxException from src\inference\src\cpp\core.cpp:112:
Exception from src\inference\src\dev\plugin.cpp:53:
Exception from src\plugins\intel_npu\src\plugin\npuw\compiled_model.cpp:460:
Failed to compile Model0_FCEW000__0 for all devices in [NPU]

@ankitm3k ankitm3k force-pushed the ankit/epctx_encaps_feature branch from 1748e06 to 143f4c1 Compare June 17, 2025 14:50
@RyanMetcalfeInt8

With this latest branch, I need to add the following to my genai_config.json:

"ep_context_file_path" : "C:/Users/LocalAdmin/ort_build/EPCtxModels/Phi-3.5-mini-instruct_context_ov_dynamic_sym_gs128_bkp_int8_sym/model/openvino_model.onnx",

This is unacceptable, as requiring this new option to be passed as an absolute path will become very problematic from a user experience / distribution perspective.

Can we still have it support the 'old way' (the method we used in msb_release_v2 branch) if ep_context_file_path session option is not specified?

@ankitm3k
Author

With this latest branch, I need to add the following to my genai_config.json:

"ep_context_file_path" : "C:/Users/LocalAdmin/ort_build/EPCtxModels/Phi-3.5-mini-instruct_context_ov_dynamic_sym_gs128_bkp_int8_sym/model/openvino_model.onnx",

This is unacceptable, as requiring this new option to be passed as an absolute path will become very problematic from a user experience / distribution perspective.

Can we still have it support the 'old way' (the method we used in msb_release_v2 branch) if ep_context_file_path session option is not specified?

Fixed.
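
Presumably the fix is along these lines (a hypothetical sketch, not the actual diff; the member name ep_context_file_path is an assumption):

// Prefer the ep.context_file_path session option when set; otherwise fall
// back to deriving the OVIR location from the loaded model's own path
// (the 'old way' used in msb_release_v2).
std::filesystem::path ctx_path = session_context_.ep_context_file_path;
if (ctx_path.empty()) {
  ctx_path = session_context_.onnx_model_path_name;
}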

@ankitm3k ankitm3k force-pushed the ankit/epctx_encaps_feature branch from fdd5b0d to 0908804 Compare June 18, 2025 10:41
@ankitm3k ankitm3k force-pushed the ankit/epctx_encaps_feature branch from 0908804 to 56a0ce5 Compare June 18, 2025 11:14
@RyanMetcalfeInt8

Using this latest branch, I seem to get some failures for NPU when using ORT GenAI.

It seems to fail within the call to ExportCompiledBlobAsEPCtxNode:

auto status = onnxruntime::openvino_ep::BackendManager::ExportCompiledBlobAsEPCtxNode(subgraph);

Specifically, it fails at:

compiled_model.export_model(blob_file);

Here is the exception:

2025-06-18 09:13:32.0906159 [E:onnxruntime:onnxruntime-genai, inference_session.cc:2288 onnxruntime::InferenceSession::Initialize::<lambda_4aa6d47481a0927abf40a87f6c9c773d>::operator ()] Exception during initialization: Exception from src\inference\src\cpp\compiled_model.cpp:132:
Exception from src\plugins\intel_npu\src\plugin\npuw\serialization.cpp:164:
NPUW: Assertion false && "Unsupported type" failed


Traceback (most recent call last):
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\chat_sequence\chat_sequence_test.py", line 277, in <module>
    main(args)
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\chat_sequence\chat_sequence_test.py", line 115, in main
    chat = OrtGenaiChat(args, search_options)
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\common\ortgenai_chat.py", line 22, in __init__
    self.model = og.Model(config)
RuntimeError: Exception during initialization: Exception from src\inference\src\cpp\compiled_model.cpp:132:
Exception from src\plugins\intel_npu\src\plugin\npuw\serialization.cpp:164:
NPUW: Assertion false && "Unsupported type" failed

@ankitm3k
Author

Using this latest branch, I seem to get some failures for NPU when using ORT GenAI.

It seems to fail within the call to ExportCompiledBlobAsEPCtxNode:

auto status = onnxruntime::openvino_ep::BackendManager::ExportCompiledBlobAsEPCtxNode(subgraph);

Specifically, it fails at:

compiled_model.export_model(blob_file);

Here is the exception:

2025-06-18 09:13:32.0906159 [E:onnxruntime:onnxruntime-genai, inference_session.cc:2288 onnxruntime::InferenceSession::Initialize::<lambda_4aa6d47481a0927abf40a87f6c9c773d>::operator ()] Exception during initialization: Exception from src\inference\src\cpp\compiled_model.cpp:132:
Exception from src\plugins\intel_npu\src\plugin\npuw\serialization.cpp:164:
NPUW: Assertion false && "Unsupported type" failed


Traceback (most recent call last):
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\chat_sequence\chat_sequence_test.py", line 277, in <module>
    main(args)
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\chat_sequence\chat_sequence_test.py", line 115, in main
    chat = OrtGenaiChat(args, search_options)
  File "C:\Users\LocalAdmin\Workspace\onnxruntime-genai-dev-tools\tests\common\ortgenai_chat.py", line 22, in __init__
    self.model = og.Model(config)
RuntimeError: Exception during initialization: Exception from src\inference\src\cpp\compiled_model.cpp:132:
Exception from src\plugins\intel_npu\src\plugin\npuw\serialization.cpp:164:
NPUW: Assertion false && "Unsupported type" failed

The precompiled blob export is functional for the CPU/GPU plugins, while the NPU/NPUW plugins run into the serialization issue above, which I believe the OV toolkit team can fix.

@RyanMetcalfeInt8

The precompiled blob export is functional for the CPU/GPU plugins, while the NPU/NPUW plugins run into the serialization issue above, which I believe the OV toolkit team can fix.

Yes, I agree that it's a fix that can be requested from the NPU/NPUW team. But I am running with OpenVINO 2025.2, which was just released today. So I think OV EP needs to be conscious of this limitation and skip the export for problematic cases (presumably when device==NPU and enable_causallm=True).

By the way, I'm a little bit confused why this function ExportCompiledBlobAsEPCtxNode is being called in my case, as ORT GenAI doesn't enable or set any kind of export feature. Is this expected?
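
The suggested guard might look like this (an illustrative sketch based on the condition above; variable names are assumptions, not the actual code):

// Skip the compiled-blob export in the known-problematic configuration.
const bool skip_blob_export = (device_type == "NPU") && enable_causallm;
if (!skip_blob_export) {
  auto status = onnxruntime::openvino_ep::BackendManager::ExportCompiledBlobAsEPCtxNode(subgraph);
}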

@ankitm3k ankitm3k force-pushed the ankit/epctx_encaps_feature branch from 56a0ce5 to 9a77fcd Compare June 18, 2025 19:35
@ankitm3k
Author

The precompiled blob export is functional for the CPU/GPU plugins, while the NPU/NPUW plugins run into the serialization issue above, which I believe the OV toolkit team can fix.

Yes, I agree that it's a fix that can be requested from the NPU/NPUW team. But I am running with OpenVINO 2025.2, which was just released today. So I think OV EP needs to be conscious of this limitation and skip the export for problematic cases (presumably when device==NPU and enable_causallm=True).

By the way, I'm a little bit confused why this function ExportCompiledBlobAsEPCtxNode is being called in my case, as ORT GenAI doesn't enable or set any kind of export feature. Is this expected?

Missed an edge case; fixed it. The export won't get triggered in the latest fix.

@RyanMetcalfeInt8

Missed an edge case; fixed it. The export won't get triggered in the latest fix.

Okay, with this version I am able to run ORT GenAI on NPU with the models that we were testing with msb_release_v2 last month.

@vthaniel

@ankitm3k
Can you please check the following issue
https://jira.devtools.intel.com/browse/CVS-169356

@ankitm3k ankitm3k force-pushed the ankit/epctx_encaps_feature branch from e74d3d6 to 01a26b7 Compare June 23, 2025 13:42
@ankitm3k
Author

@ankitm3k Can you please check the following issue https://jira.devtools.intel.com/browse/CVS-169356

It's fixed now; we are merge-ready.

@RyanMetcalfeInt8

I don't think this is quite ready for merge yet. Running with the latest branch here, I hit some exceptions when using ORT GenAI:

>python chat_sequence_test.py  --ortgenai_model_path C:\Users\LocalAdmin\ort_build\EPCtxModels\Phi-3.5-mini-instruct_context_ov_dynamic_sym_gs128_bkp_int8_asym_r1_noAWQ_wqe_noSE\model -r
Creating Model...
Creating Tokenizer...
Prompt (Use quit() to exit): hello
2025-06-23 09:08:11.9734765 [E:onnxruntime:onnxruntime-genai, sequential_executor.cc:572 onnxruntime::ExecuteKernel] Non-zero status code returned while running OpenVINO-EP-subgraph_1 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_1_0' Status Message: the ort_value must contain a constructed tensor or sparse tensor
An error occurred: Non-zero status code returned while running OpenVINO-EP-subgraph_1 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_1_0' Status Message: the ort_value must contain a constructed tensor or sparse tensor

@vthaniel

vthaniel commented Jun 24, 2025

@ankitm3k Can you please check the following issue https://jira.devtools.intel.com/browse/CVS-169356

@ankitm3k @sfatimar
The issue is fixed now.
Unit tests and feature tests pass.

@RyanMetcalfeInt8 RyanMetcalfeInt8 left a comment

Approving this one, as it seems like the ORT GenAI failures were caused by ovep-develop changes that were pulled into this branch during a rebase. @ankitm3k will raise a ticket to track that.

@sfatimar sfatimar merged commit 6d04a2e into ovep-develop Jun 24, 2025
6 of 8 checks passed
gblong1 added a commit to gblong1/onnxruntime that referenced this pull request Jun 24, 2025
RyanMetcalfeInt8 added a commit that referenced this pull request Jun 24, 2025
ankitm3k added a commit that referenced this pull request Jun 24, 2025
* feat: Enable EpContext OVIR Encapsulation

* fix: refactor EpCtx OVIR parsing logic to use ep.context_file_path

* fix: Fix logic for parsing model_file_path

* fix: enable EPCtx OVIR encapsulation compiled blob caching

* fix: fix merge conflicts

* fix: fix bugs
javier-intel pushed a commit that referenced this pull request Jun 24, 2025
javier-intel pushed a commit that referenced this pull request Jun 24, 2025
javier-intel pushed a commit that referenced this pull request Jun 25, 2025