Description

I was following this Whisper doc to run Whisper on Triton Inference Server with the TensorRT-LLM backend. Building the TensorRT-LLM engine for the decoder with the command below fails with the error shown under "actual behavior", while the same step works fine for the encoder.

System Info
System specs:
OS: Ubuntu 24
CPU: x86_64
GPU specs (from nvidia-smi):
- GPU 0: NVIDIA RTX A6000, 23516 MiB / 49140 MiB memory in use, 0% utilization
- Driver Version: 535.183.01
- CUDA Version: 12.6
- No processes listed
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Building the TensorRT-LLM engine for the decoder:

```
trtllm-build --checkpoint_dir ${checkpoint_dir}/decoder \
    --output_dir ${output_dir}/decoder \
    --moe_plugin disable \
    --max_beam_width ${MAX_BEAM_WIDTH} \
    --max_batch_size ${MAX_BATCH_SIZE} \
    --max_seq_len 114 \
    --max_input_len 14 \
    --max_encoder_input_len 3000 \
    --gemm_plugin ${INFERENCE_PRECISION} \
    --bert_attention_plugin ${INFERENCE_PRECISION} \
    --gpt_attention_plugin ${INFERENCE_PRECISION}
```
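The placeholder variables follow the Whisper example; a minimal sketch of how they might be set (illustrative values only, not necessarily the exact ones I used):

```
# Illustrative values, roughly following the Whisper example; the real values may differ.
INFERENCE_PRECISION=float16        # precision passed to the gemm/attention plugins
MAX_BEAM_WIDTH=4                   # beam width the engine is built for
MAX_BATCH_SIZE=8                   # maximum batch size baked into the engine
checkpoint_dir=whisper_checkpoint  # output of convert_checkpoints.py (contains encoder/ and decoder/)
output_dir=whisper_engines         # where trtllm-build writes the engines
```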
Expected behavior
The trtllm-build command should build and save the TensorRT-LLM decoder engine, which is required at inference time, just as it does for the encoder.
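If the decoder build succeeded, the output directory should contain the serialized engine together with its build config; a rough sanity check (exact file names may vary across TensorRT-LLM versions):

```
ls ${output_dir}/decoder
# expected, approximately: config.json  rank0.engine
```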
actual behavior
Instead, the build fails with the following error:
```
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 627, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 425, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    engine = build_model(build_config,
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 360, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 653, in from_checkpoint
    model.load(weights, from_pruned=is_checkpoint_pruned)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/modeling_utils.py", line 675, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:
```
additional notes
I used the nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 image and converted the checkpoints with this convert_checkpoints.py script.
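To narrow down which weights are missing, the converted decoder checkpoint can be inspected and its tensor names compared against the "Required but not provided tensors" list above. This is only a sketch and assumes the usual TensorRT-LLM checkpoint layout of a config.json next to rank*.safetensors files:

```
# List what convert_checkpoints.py actually produced for the decoder.
ls -la ${checkpoint_dir}/decoder

# Dump the tensor names stored in the converted checkpoint (assumes a
# rank0.safetensors file) to compare against the missing-tensor list
# reported by trtllm-build.
python3 -c "
from safetensors import safe_open
with safe_open('${checkpoint_dir}/decoder/rank0.safetensors', framework='pt') as f:
    for name in f.keys():
        print(name)
"
```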