Description
System Info
Not strictly necessary for this case, but I use the 25.09 NGC TensorRT-LLM container for Triton Inference Server.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
At the moment, a lot of the tutorials/docs, and especially the links, are deprecated, non-functioning, or no longer valid. This makes things very confusing. An example:
If you look at this section, the links are broken:
https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#scheduling
This one, for example, leads nowhere, and searching for the file does not really help either:
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.15.0/docs/source/advanced/batch-manager.md
There is also some ambiguity about how the batching method in TensorRT-LLM can be enabled/changed.
So you basically have these options:
1. static batching
2. dynamic batching
3. inflight_batching
4. inflight_fused_batching
But in order to set these (from my limited understanding):
1. As of now you set this in config.pbtxt:
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "STATIC_BATCH"
  }
}
2. You just set it as usual in config.pbtxt as well:
dynamic_batching {
  preferred_batch_size: [ 32, 16, 8, 4 ]
  max_queue_delay_microseconds: 500
  default_queue_policy: { max_queue_size: 256 }
}
3. and 4. are the same:
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_batching"
  }
}
What confuses the heck out of me: I can define all of these in any combination in the same config and it still works (see the sketch below). However, I am completely clueless about how the batching then works internally. Static? Dynamic? Static when I send a whole batch to the model, dynamic plus some kind of fusing when it's one request at a time? I can also send batches when STATIC_BATCH is not enabled, just with inflight_fused_batching.
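A minimal sketch of such a mixed config.pbtxt (values are hypothetical and only meant to show the combination that loads without error, not a recommended setup):
dynamic_batching {
  preferred_batch_size: [ 8, 4 ]
  max_queue_delay_microseconds: 500
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "STATIC_BATCH"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}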
What does inflight_fused_batching + dynamic_batching do?
How can I set up a config to exclusively enable static batching, dynamic batching, or inflight batching?
Expected behavior
Maybe a clearer explanation of the settings, how they work in combination, and links that actually work.
Gemini gave me this explanation, which seems plausible:
_Capacity Scheduler Policy (CapacitySchedulerPolicy): This is the parameter you were looking at (kMAX_UTILIZATION, kGUARANTEED_NO_EVICT, kSTATIC_BATCH). This controls the scheduling logic—how the backend manages the shared KV cache memory and decides which requests to run in a batch at any given moment.
Batching Strategy (In-Flight Batching/Continuous Batching): This is the fundamental, high-level technique used by the scheduler. For high-performance LLM serving, TensorRT-LLM uses In-Flight Batching (also called Continuous Batching) by default, which is an architectural feature and is not typically a separate, switchable strategy parameter._
This would mean dynamic_batching is probably enabled by setting it together with inflight_batching? If so, the scheduler policy and the batching mode would be two separate knobs, as in the sketch below.
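A sketch of how the policy would then be selected on its own, assuming the lowercase spellings max_utilization and guaranteed_no_evict are the accepted string values for the enum names Gemini mentions (I have not verified this against the 25.09 container):
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"  # or "guaranteed_no_evict"; per the explanation above, this only controls KV-cache scheduling, not the batching technique itself
  }
}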
Actual behavior
The current docs.
Additional notes
I use TensorRT-LLM on Triton, so I can't drive a scheduler from Python and have to use config.pbtxt. A sketch of the kind of config-only setup I am hoping is possible follows below.
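This sketch rests on two assumptions I would like confirmed: that gpt_model_type "V1" selects the static (non-inflight) path, and that omitting the dynamic_batching block keeps Triton-side batching off:
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "V1"  # assumption: V1 = static batching instead of inflight batching
  }
}
# assumption: with no dynamic_batching block, Triton performs no server-side batching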