[0.25.0][cherrypick][doc] Updating new TensorRT-LLM configurations (d…
sindhuvahinis committed Nov 28, 2023
1 parent cd3ef4a commit 95323b5
Showing 1 changed file with 18 additions and 8 deletions.
26 changes: 18 additions & 8 deletions serving/docs/configurations_large_model_inference_containers.md
@@ -107,17 +107,27 @@ If you are using the Neuron container with the engine set to Python, the following parameters will be accessible (a sample configuration sketch follows the table below).
| option.compiled_graph_path | No | Provide an S3 URI or a local directory that stores the pre-compiled graph for your model (NEFF cache) to skip runtime compilation. | Default: `None` |
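
For illustration, here is a minimal `serving.properties` sketch for the Python engine on Neuron, using only `option.compiled_graph_path` from the table above; the bucket paths and `option.model_id` are hypothetical placeholders rather than values from this documentation.

```
# Minimal sketch, not a recommended configuration
engine=Python
# Hypothetical model location: replace with your own model id or path
option.model_id=s3://my-bucket/my-model/
# Point to a pre-compiled graph (NEFF cache) to skip runtime compilation
option.compiled_graph_path=s3://my-bucket/neff-cache/
```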


### TensorRT LLM
### TensorRT-LLM

If you specify the MPI engine in the TensorRT-LLM container, the following parameters will be accessible (a sample configuration sketch follows the tables below).

| Item | Required | Description | Example value |
|-------------------------------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------|
| option.quantize | No | Quantize the model with a supported quantization method (`smoothquant`). | `smoothquant`, Default: `None` |
| Advanced parameters |
| option.use_custom_all_reduce | No | Use the custom all-reduce kernel on GPUs that have NVLink enabled. This can speed up model inference through better communication. Turn this on by setting it to `true` on P4D, P4De, G5, or other NVLink-connected GPUs. | Default: `false` |
| option.max_input_len | No | Maximum input token size you expect the model to handle per request. This is a compilation parameter passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. | Default: `2048` |
| option.max_output_len | No | Maximum output token size you expect the model to produce per request. This is a compilation parameter passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default: `512` |
| Item | Required | Description | Example value |
|----------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| option.max_input_len | No | Maximum input token size you expect the model to handle per request. This is a compilation parameter passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. | Default values:<br/> Llama: `512` <br/> Falcon: `1024` |
| option.max_output_len | No | Maximum output token size you expect the model to produce per request. This is a compilation parameter passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default values:<br/> Llama: `512` <br/> Falcon: `1024` |
| option.use_custom_all_reduce | No | Use the custom all-reduce kernel on GPUs that have NVLink enabled. This can speed up model inference through better communication. Turn this on by setting it to `true` on P4D, P4De, P5, and other NVLink-connected GPUs. | `true`, `false`. <br/> Default is `false` |
| Advanced parameters |
| option.tokens_per_block | No | Number of tokens per block used by the paged attention algorithm. | Default value is `64` |
| option.batch_scheduler_policy | No | Scheduler policy of the TensorRT-LLM batch manager. | `max_utilization`, `guaranteed_no_evict` <br/> Default value is `max_utilization` |
| option.kv_cache_free_gpu_mem_fraction | No | Fraction of free GPU memory allocated for the KV cache. The larger the value, the more GPU memory the model will try to take over. The more memory reserved for the KV cache, the larger it can be, which allows longer input+output sequences or larger batch sizes. | Float between 0 and 1. <br/> Default is `0.95` |
| option.max_num_sequences | No | Maximum number of input requests processed in the batch. If you don't set this, `max_rolling_batch_size` is applied as the value. Generally you don't have to touch it unless you want the model to be compiled for a batch size different from the one the model server sets. | Integer greater than 0. <br/> Default value is the batch size set while building the TensorRT engine |
| option.enable_trt_overlap | No | Parameter to overlap the execution of request batches. It may have a negative impact on performance when the number of requests is too small. In our experiments, turning this on hurt performance more often than it helped. | `true`, `false`. <br/> Default is `false` |
| Advanced parameters: Quantization |
| option.quantize | No | Currently only supports `smoothquant` for Llama models in just-in-time compilation mode. | `smoothquant` |
| option.smoothquant_alpha | No | The SmoothQuant alpha parameter. | Default value is `0.8` |
| option.smoothquant_per_token | No | This is only applied when `option.quantize` is set to `smoothquant`. This enables choosing a custom SmoothQuant scaling factor for each token at run time. This is usually a little slower and more accurate. | `true`, `false`. <br/> Default is `false` |
| option.smoothquant_per_channel | No | This is only applied when `option.quantize` is set to `smoothquant`. This enables choosing a custom SmoothQuant scaling factor for each channel at run time. This is usually a little slower and more accurate. | `true`, `false`. <br/> Default is `false` |
| option.multi_query_mode | No | This is only needed when `option.quantize` is set to `smoothquant`. This should be set for models that support multi-query attention, e.g. Llama-70B. | `true`, `false`. <br/> Default is `false` |
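
For illustration, here is a minimal `serving.properties` sketch combining several of the parameters above, assuming the `engine=MPI` key format used by the container; `option.model_id` and `option.tensor_parallel_degree` are not covered by this table, and all values are examples rather than recommendations.

```
# Minimal sketch for the TensorRT-LLM (MPI) engine; values are illustrative only
engine=MPI
# Assumed options not described in the table above
option.model_id=meta-llama/Llama-2-13b-hf
option.tensor_parallel_degree=4
# Just-in-time compilation limits per request (Llama defaults are 512/512)
option.max_input_len=1024
option.max_output_len=512
# Enable the custom all-reduce kernel on NVLink-connected GPUs (e.g. P4D, P5)
option.use_custom_all_reduce=true
# Reserve 90% of free GPU memory for the KV cache instead of the default 0.95
option.kv_cache_free_gpu_mem_fraction=0.9
# Optional SmoothQuant quantization for Llama models
option.quantize=smoothquant
option.smoothquant_alpha=0.8
option.smoothquant_per_token=true
```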


## Aliases
