feat: ORT GenAI Stateful Compilation changes #676
Conversation
@@ -106,7 +106,8 @@ BackendManager::BackendManager(SessionContext& session_context,
   subgraph_context_.has_dynamic_input_shape = true;
   LOGS_DEFAULT(INFO) << "[OpenVINO-EP] Model has symbolic input dims";
   if ((session_context_.device_type.find("CPU") != std::string::npos ||
-      session_context_.device_type.find("GPU") != std::string::npos) &&
+      session_context_.device_type.find("GPU") != std::string::npos ||
+      (session_context_.device_type.find("NPU") != std::string::npos && session_context_.enable_causallm)) &&
Can this condition be simplified? Since this is not valid only for a dynamic model on NPU.
fixed
ovInfReq.set_tensor(tensor_name, tensor);
}

void StatefulOVInferRequest::CacheTensor(const std::string& tensor_name, std::vector<int64_t>& cache) {
Note that we may eventually need to support caching position_ids / logits, which have more complicated shapes than just [1, <num_input_tokens>].
will handle that logic in the next PR
LGTM
Pull Request Overview
This PR introduces stateful compilation support for ORT GenAI using OpenVINO by integrating a new provider option, enable_causallm, along with several supporting changes in compilation, inference request handling, and backend communication. Key changes include adding stateful model transformation utilities, updating the OpenVINO interface to support causal LM functionality, and modifying test cases and backend management to incorporate KV cache rewind and dynamic shapes handling.
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/test/perftest/ort_test_session.cc | Extended option parsing for enable_causallm with boolean value checking and error messaging |
| onnxruntime/core/providers/openvino/ov_interface.{h,cc} | Added StatefulCompileModel mechanism and updated OVExeNetwork to carry extra stateful attributes |
| onnxruntime/core/providers/openvino/openvino_provider_factory.cc | Integrated parsing logic for enable_causallm and adjusted dynamic shapes flags for NPU devices |
| onnxruntime/* (multiple backend and execution provider files) | Updated backend and infer request handling to support KV cache operations, stateful inference, and additional configuration for ORT GenAI |
Comments suppressed due to low confidence (2)
onnxruntime/core/providers/openvino/backends/basic_backend.h:54
- It would be helpful to add a comment explaining why tensor names such as "beam_idx", "past_key_values", and "present" are being skipped when session_context.enable_causallm is true. This aids future maintainers in understanding the rationale behind bypassing KV cache tensor mapping in stateful model scenarios.
if ((onnx_name.empty() || onnx_name == "beam_idx" ||
onnxruntime/core/providers/openvino/ov_interface.h:90
- [nitpick] Consider renaming the member variable 'compiled_model_obj' to simply 'compiled_model' for improved clarity and consistency with standard naming conventions.
ov::CompiledModel compiled_model_obj;
Description
This PR enables the essential features for running ORT GenAI with OVEP using stateful compilation of ov::Model, inspired by the OV GenAI pipeline flow.

I have introduced a new provider option, enable_causallm, which can be set to True to enable the ORT GenAI pipeline with causal LLM models that are fully supported on OVEP, via the custom config file genai_config.json.

Sample genai_config.json -

FYI, the GenAI models in ONNX format are usually stateless in nature and require dynamic-shape compilation.