From ed74630a6addce79e101821d94a134232686adb7 Mon Sep 17 00:00:00 2001
From: Sindhu Somasundaram <56774226+sindhuvahinis@users.noreply.github.com>
Date: Wed, 22 Nov 2023 09:26:26 -0800
Subject: [PATCH] [doc] Updating new TensorRT-LLM configurations (#1340)
---
...ations_large_model_inference_containers.md | 26 +++++++++++++------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/serving/docs/lmi/configurations_large_model_inference_containers.md b/serving/docs/lmi/configurations_large_model_inference_containers.md
index 4f625b6dd..f248567ed 100644
--- a/serving/docs/lmi/configurations_large_model_inference_containers.md
+++ b/serving/docs/lmi/configurations_large_model_inference_containers.md
@@ -107,17 +107,27 @@ If you are using Neuron container and engine set to Python, the following parame
| option.compiled_graph_path | No | Provide an s3 URI, or a local directory that stores the pre-compiled graph for your model (NEFF cache) to skip runtime compilation. | Default: `None` |
-### TensorRT LLM
+### TensorRT-LLM
If you specify MPI engine in TensorRT LLM container, the following parameters will be accessible.
-| Item | Required | Description | Example value |
-|-------------------------------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------|
-| option.quantize | No | Quantize model with supported quantization methods (smoothquant) | `smoothquant` , Default: `None` |
-| Advanced parameters |
-| option.use_custom_all_reduce | No | Custom all reduce kernel is used for GPUs that have NVLink enabled. This can help to speed up model inference speed with better communication. Turn this on by setting true on P4D, P4De, G5 or other GPUs that are NVLink connected | Default: `false` |
-| option.max_input_len | No | Maximum input token size you expect the model to have per request. This is a compilation parameter that set to the model for Just-in-Time compilation. If you set this value too low, the model will unable to consume the long input. | Default: `2048` |
-| option.max_output_len | No | Maximum output token size you expect the model to have per request. This is a compilation parameter that set to the model for Just-in-Time compilation. If you set this value too low, the model will unable to produce tokens beyond the value you set. | Default: `512` |
+| Item | Required | Description | Example value |
+|----------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
+| option.max_input_len | No | Maximum number of input tokens you expect the model to receive per request. This is a compilation parameter that is set on the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. | Default values for:<br>Llama is `512`<br>Falcon is `1024` |
+| option.max_output_len | No | Maximum number of output tokens you expect the model to produce per request. This is a compilation parameter that is set on the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default values for:<br>Llama is `512`<br>Falcon is `1024` |
+| option.use_custom_all_reduce | No | The custom all-reduce kernel is used for GPUs that have NVLink enabled. It can speed up model inference through better communication. Turn it on by setting this to `true` on P4D, P4De, P5, and other instances whose GPUs are NVLink connected. | `true`, `false`.<br>Default is `false` |
+| Advanced parameters |
+| option.tokens_per_block | No | Number of tokens per block used in the paged attention algorithm. | Default value is `64` |
+| option.batch_scheduler_policy | No | Scheduler policy of the TensorRT-LLM batch manager. | `max_utilization`, `guaranteed_no_evict`<br>Default value is `max_utilization` |
+| option.kv_cache_free_gpu_mem_fraction | No | Fraction of free GPU memory allocated for the KV cache. The larger the value, the more GPU memory the model will try to take over. The more memory reserved, the larger the KV cache can be, which allows longer input+output sequences or a larger batch size. | Float between 0 and 1.<br>Default is `0.95` |
+| option.max_num_sequences | No | Maximum number of input requests processed in a batch. If you don't set this, `max_rolling_batch_size` is used as its value. Generally you don't need to change it unless you want the model compiled for a batch size different from the one the model server sets. | Integer greater than 0<br>Default value is the batch size set while building the TensorRT engine |
+| option.enable_trt_overlap | No | Overlaps the execution of batches of requests. It may have a negative impact on performance when the number of requests is too small. In our experiments, turning this on had more negative impact than leaving it off. | `true`, `false`.<br>Default is `false` |
+| Advanced parameters: Quantization |
+| option.quantize | No | Quantization method to apply. Currently only `smoothquant` is supported, for Llama models in just-in-time compilation mode. | `smoothquant` |
+| option.smoothquant_alpha | No | SmoothQuant alpha parameter. | Default value is `0.8` |
+| option.smoothquant_per_token | No | Only applied when `option.quantize` is set to `smoothquant`. Enables choosing a custom smoothquant scaling factor for each token at run time. This is usually a little slower and more accurate. | `true`, `false`.<br>Default is `false` |
+| option.smoothquant_per_channel | No | Only applied when `option.quantize` is set to `smoothquant`. Enables choosing a custom smoothquant scaling factor for each channel at run time. This is usually a little slower and more accurate. | `true`, `false`.<br>Default is `false` |
+| option.multi_query_mode | No | Only needed when `option.quantize` is set to `smoothquant`. This should be set for models that support multi-query attention, e.g. llama-70b. | `true`, `false`.<br>Default is `false` |
## Aliases