
Batching documentation confusing - can you update the docs of the main repository please? #809

@protonicage

Description

System Info

Not strictly necessary for this case, but I use the 25.09 NGC TensorRT-LLM container for Triton Inference Server.

Who can help?

@juney-nvidia @kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

At the moment, a lot of the tutorials/docs, and especially the links, are deprecated, broken, or contain information that is no longer valid. This makes things very confusing. An example:

If you look at this section the links are broken:
https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#scheduling

This one, for example, leads nowhere, and searching for the file does not really help either:
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.15.0/docs/source/advanced/batch-manager.md

There is also some ambiguity about how the batching method in TensorRT-LLM can be enabled/changed.

So you basically have the options

  1. static batching
  2. dynamic batching
  3. inflight_batching
  4. inflight_fused_batching

But in order to set these (from my limited understanding):

  1. As of now you set in config.pbtxt:
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "STATIC_BATCH"
  }
}
  2. You just set it as usual in the config.pbtxt as well:
dynamic_batching {
    preferred_batch_size: [ 32, 16, 8, 4 ]
    max_queue_delay_microseconds: 500
    default_queue_policy: { max_queue_size: 256 }
}
  3. and 4. are set the same way, only the string_value differs ("inflight_batching" vs. "inflight_fused_batching"):
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_batching"
  }
}

What confuses the heck out of me: I can define all of these in any combination in the same config and it still works. However, I am completely clueless how the batching then works internally. Static? Dynamic? Static when I give a batch to the model, dynamic plus some kind of fusing when it's one request at a time? I can also send batches when STATIC_BATCH is not enabled, just with inflight_fused_batching.
What does inflight_fused_batching + dynamic_batching do?
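For concreteness, this is roughly the combination I tested; the model name, backend, and tensor definitions are elided, and the values are just the ones from above:

# all three batching-related settings in one config.pbtxt - this loads without errors
dynamic_batching {
    preferred_batch_size: [ 32, 16, 8, 4 ]
    max_queue_delay_microseconds: 500
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "STATIC_BATCH"
  }
}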

How can I set a config like this to enable exclusively static batching, dynamic batching, or in-flight batching?
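My current guess at an "exclusive" in-flight batching config would be something like the fragment below (only gpt_model_type, no dynamic_batching block, no batch_scheduler_policy), but nothing in the docs confirms that leaving the other settings out actually disables them:

# guess: pure in-flight batching - no dynamic_batching, no explicit scheduler policy
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_batching"
  }
}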

Expected behavior

A clearer explanation of the settings, how they work in combination, and links that work.

Gemini gave me this explanation, which seems plausible:

_Capacity Scheduler Policy (CapacitySchedulerPolicy): This is the parameter you were looking at (kMAX_UTILIZATION, kGUARANTEED_NO_EVICT, kSTATIC_BATCH). This controls the scheduling logic: how the backend manages the shared KV cache memory and decides which requests to run in a batch at any given moment._

_Batching Strategy (In-Flight Batching/Continuous Batching): This is the fundamental, high-level technique used by the scheduler. For high-performance LLM serving, TensorRT-LLM uses In-Flight Batching (also called Continuous Batching) by default, which is an architectural feature and is not typically a separate, switchable strategy parameter._

This would mean dynamic_batching is presumably meant to be enabled together with inflight_batching?
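If that reading is correct, the intended pairing would presumably be just these two settings together in config.pbtxt, along the lines of (values are only illustrative):

# hypothetical pairing: Triton-level dynamic batching feeding the in-flight batcher
dynamic_batching {
    max_queue_delay_microseconds: 500
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_batching"
  }
}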

Actual behavior

The current docs.

Additional notes

I use TensorRT-LLM on Triton, so I can't configure a scheduler from Python and have to use config.pbtxt.
