
Batching documentation confusing - can you update the docs of the main repository please? #809

@protonicage

Description

System Info

Not strictly necessary for this case, but I use the 25.09 NGC TensorRT-LLM container for Triton Inference Server.

Who can help?

@juney-nvidia @kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

At the moment, a lot of the tutorials/docs, and especially the links, are deprecated, broken, or contain information that is no longer valid. This makes things very confusing. An example:

If you look at this section the links are broken:
https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#scheduling

This one, for example, leads nowhere, and searching for the file does not really help either:
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.15.0/docs/source/advanced/batch-manager.md

There is also some ambiguity about how the batching method in TensorRT-LLM can be enabled/changed.

So you basically have the options

  1. static batching
  2. dynamic batching
  3. inflight_batching
  4. inflight_fused_batching

But in order to set these (from my limited understanding):

  1. As of now you set in config.pbtxt:
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "STATIC_BATCH"
  }
}
  2. You just set it as usual in the config.pbtxt as well:
dynamic_batching {
    preferred_batch_size: [ 32, 16, 8, 4 ]
    max_queue_delay_microseconds: 500
    default_queue_policy: { max_queue_size: 256 }
}
  3. and 4. are set the same way, only the string_value differs ("inflight_batching" vs. "inflight_fused_batching"):
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_batching"
  }
}

What confuses the heck out of me: I can define all of these in any combination in the same config and it still works. However, I am completely clueless how the batching then works internally. Static? Dynamic? Static when I give a batch to the model, dynamic plus some kind of fusing when it's one request at a time? I can also send batches when STATIC_BATCH is not enabled, just with inflight_fused_batching.
What does inflight_fused_batching + dynamic_batching do?
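For concreteness, this is roughly the combination I tested; the model name, backend, and tensor definitions are elided, and the values are just the ones from above:

# all three batching-related settings in one config.pbtxt - this loads without errors
dynamic_batching {
    preferred_batch_size: [ 32, 16, 8, 4 ]
    max_queue_delay_microseconds: 500
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "STATIC_BATCH"
  }
}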

How can I set a config like this to enable exclusively static batching, dynamic batching, or in-flight batching?
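My current guess at an "exclusive" in-flight batching config would be something like the fragment below (only gpt_model_type, no dynamic_batching block, no batch_scheduler_policy), but nothing in the docs confirms that leaving the other settings out actually disables them:

# guess: pure in-flight batching - no dynamic_batching, no explicit scheduler policy
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_batching"
  }
}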

Expected behavior

A clearer explanation of the settings, how they work in combination, and links that work.

Gemini gave me this explanation, which seems plausible:

_Capacity Scheduler Policy (CapacitySchedulerPolicy): This is the parameter you were looking at (kMAX_UTILIZATION, kGUARANTEED_NO_EVICT, kSTATIC_BATCH). This controls the scheduling logic: how the backend manages the shared KV cache memory and decides which requests to run in a batch at any given moment._

_Batching Strategy (In-Flight Batching/Continuous Batching): This is the fundamental, high-level technique used by the scheduler. For high-performance LLM serving, TensorRT-LLM uses In-Flight Batching (also called Continuous Batching) by default, which is an architectural feature and is not typically a separate, switchable strategy parameter._

This would mean dynamic_batching is presumably meant to be enabled together with inflight_batching?
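If that reading is correct, the intended pairing would presumably be just these two settings together in config.pbtxt, along the lines of (values are only illustrative):

# hypothetical pairing: Triton-level dynamic batching feeding the in-flight batcher
dynamic_batching {
    max_queue_delay_microseconds: 500
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_batching"
  }
}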

Actual behavior

The current docs.

Additional notes

I use TensorRT-LLM on Triton, so I can't configure a scheduler from Python and have to use config.pbtxt.
