Bug Description
I'm testing BERT with the commands
python perf_run.py --backends=dynamo --inputs="(512, 128)@int32;(512, 128)@int32" ...
and
python perf_run.py --backends=dynamo --inputs="(256, 128)@int32;(256, 128)@int32" ...
I only changed the inputs and saved the resulting TRT engines, but the two engines have different sizes:
Using --inputs="(256, 128)@int32;(256, 128)@int32":
Number of layers: 1277
Number of inputs: 2
Number of outputs: 2
Input 0: input_ids, shape: (256, 128), dtype: DataType.INT32
Input 1: attention_mask, shape: (256, 128), dtype: DataType.INT32
Output 0: output0, shape: (256, 128, 768), dtype: DataType.FLOAT
Output 1: output1, shape: (256, 768), dtype: DataType.FLOAT
TRT Engine uses: 516.5105247497559 Mb of Memory
Using --inputs="(512, 128)@int32;(512, 128)@int32":
Number of layers: 1277
Number of inputs: 2
Number of outputs: 2
Input 0: input_ids, shape: (512, 128), dtype: DataType.INT32
Input 1: attention_mask, shape: (512, 128), dtype: DataType.INT32
Output 0: output0, shape: (512, 128, 768), dtype: DataType.FLOAT
Output 1: output1, shape: (512, 768), dtype: DataType.FLOAT
TRT Engine uses: 612.4900169372559 Mb of Memory
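A quick back-of-envelope check on the two dumps above (assuming the reported "Memory" figure includes activation workspace, which I am not certain of): the size difference between the two engines is almost exactly the extra float32 activation memory implied by growing output0 from (256, 128, 768) to (512, 128, 768).

```python
# Compare the reported engine-size difference with the growth of the
# output0 activation tensor when the static batch goes 256 -> 512.
# Assumption: the reported number includes activation memory in MiB.

def mib(num_elems, bytes_per_elem=4):
    """Size in MiB of a tensor with num_elems float32 elements."""
    return num_elems * bytes_per_elem / (1024 * 1024)

# output0 is (batch, 128, 768) float32; the batch dimension doubles.
extra_output0 = mib((512 - 256) * 128 * 768)  # 96.0 MiB

# Reported sizes from the two engine dumps above.
reported_diff = 612.4900169372559 - 516.5105247497559  # ~95.98 MiB

print(f"extra output0 activation:  {extra_output0:.2f} MiB")
print(f"reported engine size diff: {reported_diff:.2f} MiB")
```

The two numbers agree to within ~0.02 MiB, which is consistent with the batch size being baked statically into the engine.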
To Reproduce
Steps to reproduce the behavior:
cd tools/perf
python perf_run.py ...
Expected behavior
The Torch-TensorRT Dynamo backend looks faster than Inductor but slightly slower than the ONNX path. To understand why, I pulled out the engines and ran them directly. When I checked them, I found that the engine size keeps growing as the input size increases. However, the engines exported through the ONNX path stay the same size. I would expect the engine size to be independent of the input batch size, as it is for the ONNX path.
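To quantify the growth, the two reported sizes can be decomposed with a simple two-point linear fit (a rough model on my part, since there are only two data points): a fixed batch-independent part, plus a per-sequence cost that matches one (128, 768) float32 slice of output0.

```python
# Two-point linear fit of reported engine memory vs. batch size.
# Assumption (mine): memory = fixed part + slope * batch.
sizes = {256: 516.5105247497559, 512: 612.4900169372559}  # batch -> MiB

slope = (sizes[512] - sizes[256]) / (512 - 256)  # MiB per batch element
fixed = sizes[256] - 256 * slope                 # batch-independent part

# For comparison: one (128, 768) float32 slice of output0.
per_seq_output0 = 128 * 768 * 4 / (1024 * 1024)  # 0.375 MiB

print(f"slope: {slope:.4f} MiB/sequence (output0 slice: {per_seq_output0} MiB)")
print(f"fixed part (weights etc.): {fixed:.1f} MiB")
```

If this model holds, the ~420 MiB fixed part would be the weights and the rest scales linearly with the static batch size, which would explain why the ONNX-path engines (presumably built with dynamic shapes) do not grow.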