Tentatively eliminate graph break overhead #3741
base: main
Conversation
self.cudagraphs_enabled = torch_tensorrt.runtime.get_cudagraphs_mode()
self.requires_unique_output = False
What do these do?
Moved these states out of every runtime call and into __init__ to avoid repeated overhead.
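A minimal sketch of that pattern, assuming the state is cached in the Python runtime module's constructor (the class name and forward body here are illustrative, not the PR's exact code):

```python
import torch
import torch_tensorrt

class _RuntimeModuleSketch:  # illustrative stand-in for the Python runtime module
    def __init__(self) -> None:
        # Query the cudagraphs mode once at construction time...
        self.cudagraphs_enabled = torch_tensorrt.runtime.get_cudagraphs_mode()
        self.requires_unique_output = False

    def forward(self, *inputs: torch.Tensor) -> None:
        # ...and reuse the cached flag here, instead of calling
        # torch_tensorrt.runtime.get_cudagraphs_mode() on every invocation.
        if self.cudagraphs_enabled:
            ...
```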
if self.sync_stream:
    self._engine_stream.wait_stream(self._caller_stream)
So if PyTorch is not on the default stream, both PyTorch and TRT can run on the same stream and the outputs still match?
Do you know if there is a performance benefit to running PyTorch and TRT together on one stream, versus PyTorch on the default stream and TRT on a separate stream?
- Correct, the outputs match. Moreover, if we run both on the default stream, the outputs also match; I verified this with the TRT team as well. Not sure whether we can implement that.
- Yes, there is a 15%-20% improvement when there are multiple graph breaks. Running them on different streams requires wait_stream(), which takes a lot of time.
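A hedged sketch of the stream handling being discussed (attribute names follow the diff; the two methods are shown outside their class for brevity, and the rule for setting sync_stream here is an assumption, not necessarily the PR's exact logic):

```python
import torch

def setup_streams(self, device: torch.device) -> None:
    # Reuse the stream PyTorch is already recording work on for this device,
    # so the engine shares it with the surrounding PyTorch ops.
    self._caller_stream = torch.cuda.current_stream(device)
    self._engine_stream = self._caller_stream
    # wait_stream() is only needed when the engine runs on a different stream
    # than the caller; skipping it is where the 15%-20% savings come from.
    self.sync_stream = self._engine_stream != self._caller_stream

def sync_streams(self) -> None:
    if self.sync_stream:
        # Make the engine stream wait for all work queued on the caller's stream.
        self._engine_stream.wait_stream(self._caller_stream)
```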
def set_requires_unique_output(self, requires_unique_output: bool) -> None:
    self.requires_unique_output = requires_unique_output
What does this do? Consider adding a docstring for it.
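For example, a docstring along these lines (my reading of the flag based on the surrounding diff, not the PR author's wording) would make the intent explicit:

```python
def set_requires_unique_output(self, requires_unique_output: bool) -> None:
    """Control whether this module must return freshly allocated output tensors.

    When True, outputs are not aliased to the runtime's reused/preallocated
    buffers, so the caller can safely hold references to them (e.g. for the
    final graph output). When False, outputs may reuse internal buffers that
    are overwritten on the next call.
    """
    self.requires_unique_output = requires_unique_output
```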
Can you include similar changes to the C++ runtime as well?
@@ -174,6 +173,8 @@ def __init__(
    self.cudagraph: Optional[torch.cuda.CUDAGraph] = None
    self._caller_stream: Optional[torch.cuda.Stream] = None
    self._engine_stream: Optional[torch.cuda.Stream] = None
    self.output_tensors: Optional[List[torch.Tensor]] = None
    self.sync_stream = True
Just inherit stream from PyTorch / input tensors
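A minimal sketch of inheriting the stream from the input tensors' device, as suggested (illustrative only; the PR may handle device selection differently):

```python
import torch

def inherit_stream(inputs: list[torch.Tensor]) -> torch.cuda.Stream:
    # Use whatever stream PyTorch is currently recording work on for the
    # inputs' device, instead of allocating a dedicated engine stream and
    # paying for wait_stream() synchronization.
    device = inputs[0].device if inputs else torch.device("cuda")
    return torch.cuda.current_stream(device)
```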
@@ -381,16 +405,17 @@ def setup_input_tensors(

    # For shape tensors, we use CPU pointers and for data tensors, we use GPU pointers
    # as per TensorRT requirements
-   if self.engine.is_shape_inference_io(input_name):
+   if self.is_shape_inference_io[i]:
Probably better to make this a dictionary and key on names, instead of implicitly relying on input order to stay the same over time
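A sketch of the dictionary approach the comment suggests, shown as a helper method outside its class; `self.input_names` is assumed to exist on the runtime module:

```python
from typing import Dict

def _build_shape_io_lookup(self) -> Dict[str, bool]:
    # Built once (e.g. in __init__), keyed on I/O names rather than positions,
    # so a change in input ordering cannot silently break the lookup.
    return {
        name: self.engine.is_shape_inference_io(name) for name in self.input_names
    }
```

setup_input_tensors could then test `self.is_shape_inference_io[input_name]` instead of indexing by position.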
    input_name, tuple(contiguous_inputs[i].shape)
)
if shape_changed:
    self.context.set_input_shape(
Can we safely assume execution context holds shape between inference calls?
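If the answer is yes, a per-input shape cache like the following (hypothetical `self.shape_key` attribute, initialized to an empty dict in __init__) would make the shape_changed guard in the diff explicit:

```python
import torch

def _update_input_shape(self, input_name: str, tensor: torch.Tensor) -> None:
    new_shape = tuple(tensor.shape)
    shape_changed = self.shape_key.get(input_name) != new_shape
    if shape_changed:
        # Only touch the TensorRT execution context when the shape actually
        # changed since the previous call; otherwise rely on it keeping state.
        self.context.set_input_shape(input_name, new_shape)
        self.shape_key[input_name] = new_shape
```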
@@ -994,6 +994,10 @@ def preserve_module_specs(
        ) as f:
            f.write(trt_module.get_layer_info())

    # Only set the requires_unique_output flag for the last TRT Module when user has access to the output tensor
    if trt_module and settings.use_python_runtime:
        trt_module.set_requires_unique_output(True)
How is this going to work with serialization in C++?
Also, make the name clearer, e.g. trt_module.module_is_output_operator or trt_module.requires_unowned_output_tensor.
Yeah, once we think all the changes in PyTorch are valid, I can make changes accordingly.