# Torch-TensorRT v2.10.0
Torch-TensorRT 2.10.0 targets Linux x86-64 and Windows with PyTorch 2.10, CUDA 12.9/13.0, TensorRT 10.14, and Python 3.10-3.13.

Torch-TensorRT wheels are available:

- x86-64 Linux and Windows:
  - CUDA 13.0 + Python 3.10-3.13 is available via PyPI
  - CUDA 12.9/13.0 + Python 3.10-3.13 is also available via the PyTorch index
- aarch64 SBSA Linux and Jetson Thor:
  - CUDA 13.0 + Python 3.10-3.13 + Torch 2.10 + TensorRT 10.14
- Jetson Orin
## Important Changes

Retracing is now the default behavior when saving a compiled graph module with torch_tensorrt.save. Torch-TensorRT re-exports the graph using torch.export.export(strict=False) before saving, which preserves the completeness of the output FX graph and fills in metadata. For example:
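A minimal sketch of the new default; the model and inputs are illustrative, and retrace=True is shown explicitly even though it is now the default:

```python
import torch
import torch_tensorrt

# Illustrative model and inputs
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().cuda()
inputs = [torch.randn(1, 3, 224, 224).cuda()]

trt_module = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)

# retrace=True is now the default: the compiled module is re-exported with
# torch.export.export(strict=False) before being written to disk.
torch_tensorrt.save(trt_module, "trt_model.ep", inputs=inputs, retrace=True)

# The saved artifact can be reloaded with torch.export
reloaded = torch.export.load("trt_model.ep").module()
```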
## New Features

### LLM improvements
The run_llm script now supports compiling models that have previously been quantized with the TensorRT Model Optimizer toolkit and uploaded to HuggingFace. The following inference scenarios are now supported:

1. Standard high-precision model: compile directly and run inference in fp16/bf16 via Torch-TensorRT Autocast.
2. Standard high-precision model: quantize and compile on device with TensorRT Model Optimizer, then run inference in fp8/nvfp4.
3. Previously quantized model uploaded to HuggingFace: compile directly and run inference in fp8/nvfp4.
Notes (an example invocation is sketched below):

- --model_precision (mandatory): tells the LLM tool the model's precision.
- --quant_format (optional, used only for quantized-model inference): for a pre-quantized ModelOpt checkpoint, tells the tool the quantization format of the checkpoint.
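A hypothetical invocation for the pre-quantized case (scenario 3); the script path and model name are illustrative, only the two flags above come from the release notes:

```bash
# Scenario 3: a pre-quantized NVFP4 checkpoint from HuggingFace
# (script path and model name are assumptions for illustration)
python tools/llm/run_llm.py \
  --model nvidia/Llama-3.1-8B-Instruct-NVFP4 \
  --model_precision nvfp4 \
  --quant_format nvfp4
```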
### Improvements to Engine Caching
Before this release, weight-stripped engines could only be refitted once due to a limitation of TensorRT (< 10.14), so we cached engines with their weights included to keep the engine caching feature working correctly, which consumed unnecessary disk space. As of this release, if TensorRT >= 10.14 is installed, engine caching saves only weight-stripped engines to disk, regardless of compilation_settings.strip_engine_weights. When a cached engine is pulled from the cache, it is automatically refitted and remains refittable, which means compiled TRT modules can be refitted multiple times with refit_module_weights(). For example:
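A minimal sketch of multi-refit with engine caching; the model and inputs are illustrative, so consult the refitting and engine caching docs for the authoritative usage:

```python
import torch
import torch_tensorrt
from torch_tensorrt.dynamo import refit_module_weights

def make_model():
    # Illustrative model; same architecture, freshly initialized weights each call
    return torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().cuda()

inputs = [torch.randn(1, 3, 224, 224).cuda()]
exp_program = torch.export.export(make_model(), tuple(inputs))

trt_module = torch_tensorrt.dynamo.compile(
    exp_program,
    arg_inputs=inputs,
    cache_built_engines=True,   # save built engines to the engine cache
    reuse_cached_engines=True,  # pull matching engines from the cache
    immutable_weights=False,    # keep the engine refittable
)

# With TensorRT >= 10.14, the cached weight-stripped engine stays refittable,
# so refit_module_weights() can be called repeatedly with new weights.
updated_program = torch.export.export(make_model(), tuple(inputs))
trt_module = refit_module_weights(trt_module, updated_program)
```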
### Autocast

Before TensorRT 10.12, TensorRT would implicitly pick the kernels for each layer that yielded the best performance (i.e., weak typing). Weak typing is deprecated in newer TensorRT versions, but it remains a good way to maximize performance. Therefore, this release provides a way for users to get weak-typing-like mixed precision behavior, called Autocast. Unlike PyTorch Autocast, Torch-TensorRT Autocast is rule-based: it intelligently selects nodes to keep in FP32 to maintain model accuracy while benefiting from reduced precision on the remaining nodes. Torch-TensorRT Autocast also lets users specify nodes to exclude from Autocast, since some nodes are more sensitive and can disproportionately affect accuracy. In addition, Torch-TensorRT Autocast can cooperate with PyTorch Autocast, allowing both to be used in the same model: Torch-TensorRT Autocast respects the precision of nodes inside a PyTorch Autocast context. Please refer to the Torch-TRT mixed precision doc for more details.
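A sketch of what enabling rule-based Autocast might look like; the setting names below (enable_autocast, autocast_low_precision_type, autocast_excluded_nodes) are assumptions based on the description above, so check the mixed precision doc for the exact API:

```python
import torch
import torch_tensorrt

# Illustrative model and inputs
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().cuda()
inputs = [torch.randn(1, 3, 224, 224).cuda()]

# NOTE: these setting names are assumptions, not confirmed API;
# consult the Torch-TRT mixed precision doc before use.
trt_module = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    enable_autocast=True,                       # assumed: turn on rule-based Autocast
    autocast_low_precision_type=torch.float16,  # assumed: target reduced precision
    autocast_excluded_nodes={"^conv"},          # assumed: keep matching nodes in FP32
)
```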
### Compilation Resource Management

Compiling large models on limited-resource hardware is challenging. Before this release, successfully compiling the FLUX model (24 GB) required at least 128 GB of host memory, more than 5x the model size. This consumption limited Torch-TensorRT's ability to compile large models under constrained resources.
#### Host Memory Optimization

This release introduces malloc trimming during engine building, reducing peak host memory consumption:

```bash
export TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM=1
python example.py
```

With this environment variable set, peak host memory usage is reduced to about 3x the model size.
If GPU memory is sufficient, you can disable offloading by setting offload_module_to_cpu=False to further reduce host memory to about 2x the model size. A more detailed explanation can be found here: https://github.com/pytorch/TensorRT/blob/main/docsrc/contributors/resource_management.rst
#### Resource Aware Partitioner

A new feature called Resource Aware Partitioner addresses situations where available host memory is smaller than 3x the model size. In the compilation settings, set enable_resource_partitioning=True and (optionally) a cpu_memory_budget; the partitioner will automatically shard the graph so that compilation fits into very constrained resources (< 2x the model size) without sacrificing performance or accuracy. Example usage can be found here:
https://github.com/pytorch/TensorRT/blob/b7ae84fc020b1f0428b019d39c6284c7d52626e7/examples/dynamo/low_cpu_memory_compilation.py
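A sketch under the assumption that cpu_memory_budget is specified in bytes; the model and inputs are illustrative, and the linked example above is the authoritative reference:

```python
import torch
import torch_tensorrt

# Illustrative model and inputs
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().cuda()
inputs = [torch.randn(1, 3, 224, 224).cuda()]
exp_program = torch.export.export(model, tuple(inputs))

trt_module = torch_tensorrt.dynamo.compile(
    exp_program,
    arg_inputs=inputs,
    offload_module_to_cpu=True,          # offload weights to host during compilation
    enable_resource_partitioning=True,   # shard the graph to fit the memory budget
    cpu_memory_budget=16 * 2**30,        # assumed to be in bytes (16 GiB here)
)
```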
### Debugger

#### TensorRT API Capture

This release adds a TensorRT API Capture and Replay feature that streamlines reproducing and debugging issues in your model. It allows you to record the engine-building phase of your model and later replay the engine-build steps.
Capture:

The capture feature is disabled by default. You can enable it via an environment variable:

```bash
TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1 python your_model_test.py
```

After enabling capture, you should see shim.json and shim.bin generated.
Replay:

Use the tensorrt_player tool to replay the captured TRT engine build without the original framework:

```bash
tensorrt_player -j /absolute/path/to/shim.json -o /absolute/path/to/output_engine
```

Limitations:
- This feature is currently restricted to Linux (x86-64 and aarch64).
- This feature currently captures and records only one TRT engine. If a graph break causes multiple engines to be built, only the first engine is recorded. The next release will support recording multiple engines in the same bin file.

More details:
https://docs.pytorch.org/TensorRT/getting_started/capture_and_replay.html?highlight=capture+replay#
## What's Changed

## New Contributors

Full Changelog: v2.9.0...v2.10.0