Refactor TensorRT EP code to better handle dynamic shape subgraphs #4504
Conversation
looks like there are some build errors on Windows TensorRT CI
@@ -67,9 +67,13 @@
ORT_TENSORRT_MIN_SUBGRAPH_SIZE: minimum node size in a subgraph after partitioning
ORT_TENSORRT_FP16_ENABLE: Enable FP16 mode in TensorRT
By default TensorRT execution provider builds an ICudaEngine with max workspace size = 1 GB, max partition iterations = 1000, min subgraph size = 1 and FP16 mode is disabled.
ORT_TENSORRT_ENGINE_CACHE_ENABLE: Enable TensorRT engine caching
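
For context, a hypothetical usage sketch in C++: the provider reads these variables from the process environment, so they would need to be set before the session is created (the exact read point is an assumption; POSIX setenv shown):

    #include <cstdlib>

    int main() {
      // Assumption: the TensorRT EP reads these at session-creation time,
      // so they must be exported before constructing the InferenceSession.
      setenv("ORT_TENSORRT_FP16_ENABLE", "1", /*overwrite=*/1);
      setenv("ORT_TENSORRT_ENGINE_CACHE_ENABLE", "1", /*overwrite=*/1);
      // ... create the ONNX Runtime session with the TensorRT EP here ...
      return 0;
    }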
I think we need some more documentation on the engine caching (why it's needed, how it works, when you would use it, and what some of the pitfalls and limitations are).
Examples might be:
- if you enabled fp16 and serialized engines, you need to enable fp16 when deploying/running them.
- engines are built specifically for the underlying hardware and aren't portable.
- caveats about input shape changes.
(See the sketch after this thread.)
Good point. I've added more explanations in the doc.
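
To make the pitfalls above concrete, here is a minimal C++ sketch of cache keying; it is not the PR's actual code, and EngineCachePath and TryLoadCachedEngine are hypothetical helpers. The idea is that anything a serialized engine depends on (the GPU it was built on, FP16 mode) should be part of the cache key, so a foreign or stale cache is never loaded.

    #include <cuda_runtime_api.h>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical helper: encode in the cache file name everything the
    // serialized engine depends on. Engines are specific to the GPU they were
    // built on and to build options such as FP16, so both go into the key.
    std::string EngineCachePath(const std::string& model_hash, bool fp16_enabled) {
      int device = 0;
      cudaGetDevice(&device);
      cudaDeviceProp prop{};
      cudaGetDeviceProperties(&prop, device);
      std::ostringstream name;
      name << "trt_" << model_hash
           << "_sm" << prop.major << prop.minor        // compute capability
           << (fp16_enabled ? "_fp16" : "_fp32");      // precision mode
      return name.str() + ".engine";
    }

    // Hypothetical helper: a cache miss (no file for this exact key) simply
    // means the engine has to be built from scratch for this configuration.
    bool TryLoadCachedEngine(const std::string& path, std::vector<char>& blob) {
      std::ifstream file(path, std::ios::binary | std::ios::ate);
      if (!file) return false;
      blob.resize(static_cast<size_t>(file.tellg()));
      file.seekg(0);
      file.read(blob.data(), static_cast<std::streamsize>(blob.size()));
      return file.good();
    }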
}
cudaDeviceSynchronize();
for (const auto& binding_index : binding_buffers_to_freeup) {
are we leaking memory if the enqueueV2() fails above?
For a FAIL status, would the session run quit or continue to try other EPs? If the latter, we may need to free the buffers when FAIL occurs.
the session doesn't get destroyed automatically.
can enqueue failure be intermittent? (can there be a success after failure?)
in any case, a user could issue Run() again, or do the fallback manually (create a new session), so it does seem like it could cause a leak.
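
One way to address this, as a sketch only: free the EP-allocated binding buffers whether or not enqueueV2() succeeds, so a failed Run() that the caller retries cannot leak device memory. The function name and signature are assumptions; buffers and binding_buffers_to_freeup follow the snippet above.

    #include <NvInfer.h>
    #include <cuda_runtime_api.h>
    #include <vector>

    // Sketch: run the engine, then release the temporary binding buffers on
    // both the success and the failure path before reporting the result.
    bool EnqueueAndFreeBuffers(nvinfer1::IExecutionContext* trt_context,
                               std::vector<void*>& buffers,
                               const std::vector<int>& binding_buffers_to_freeup,
                               cudaStream_t stream) {
      const bool ok = trt_context->enqueueV2(buffers.data(), stream, nullptr);
      // Synchronize and free even when enqueueV2() fails: the caller may map
      // the failure to a FAIL status and issue Run() again, and bailing out
      // without cudaFree would leak these allocations.
      cudaDeviceSynchronize();
      for (const auto& binding_index : binding_buffers_to_freeup) {
        cudaFree(buffers[binding_index]);
        buffers[binding_index] = nullptr;
      }
      return ok;  // the caller translates false into a FAIL status
    }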
TensorRT requires shape tensor inputs to be deterministic when an engine is built, so for dynamic shape subgraphs the engine must be built at run time rather than at compile phase. This PR moves all engine builds into compute(), except for static shape engines.
Even when a model's inputs are static, the inputs at subgraph level could still be dynamic, and the engine for such a subgraph may need to be rebuilt at run time.
In some cases it takes TRT a long time to build an engine because of engine optimization. In this PR the engine of a static shape subgraph is serialized/cached when it is first built, and will be deserialized/loaded to save build time for new sessions. The dynamic shape case will be addressed later due to its extra complexity, since the engine is created dynamically at run time. Thanks @jywu-msft for drafting the code.
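
The serialize/deserialize cycle described above maps onto the public TensorRT API roughly as follows; this is a minimal sketch with placeholder file handling, not the PR's implementation:

    #include <NvInfer.h>
    #include <fstream>
    #include <string>
    #include <vector>

    // Cache an engine after the first (expensive) build.
    void SaveEngine(nvinfer1::ICudaEngine& engine, const std::string& path) {
      nvinfer1::IHostMemory* blob = engine.serialize();
      std::ofstream file(path, std::ios::binary);
      file.write(static_cast<const char*>(blob->data()),
                 static_cast<std::streamsize>(blob->size()));
      blob->destroy();
    }

    // Reload a cached engine in a new session; returns nullptr on a cache
    // miss, in which case the caller falls back to building the engine.
    nvinfer1::ICudaEngine* LoadEngine(nvinfer1::IRuntime& runtime,
                                      const std::string& path) {
      std::ifstream file(path, std::ios::binary | std::ios::ate);
      if (!file) return nullptr;
      std::vector<char> blob(static_cast<size_t>(file.tellg()));
      file.seekg(0);
      file.read(blob.data(), static_cast<std::streamsize>(blob.size()));
      return runtime.deserializeCudaEngine(blob.data(), blob.size());
    }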