[docs][lmi] update guidance on advanced configurations
siddvenk committed Apr 1, 2024
1 parent 449c75f commit 23d596d
Showing 5 changed files with 36 additions and 23 deletions.
2 changes: 1 addition & 1 deletion serving/docs/lmi/deployment_guide/configurations.md
@@ -108,7 +108,7 @@ The following list of configurations is intended to highlight the relevant configurations
| option.revision | The commit hash of a HuggingFace Hub Model Id. We recommend setting this value to ensure you use a specific version of the model artifacts | None | `dc1d3b3bfdb69df26f8fc966c16353274b138c56` |
| option.rolling_batch | Enables continuous batching (iteration-level batching) with one of the supported backends. Available backends differ by container; see [Inference Library Configurations](#inference-library-configuration) for the mappings | None | `auto`, `vllm`, `lmi-dist`, `deepspeed`, `trtllm` |
| option.max_rolling_batch_size | The maximum number of requests/sequences the model can process at a time. This parameter should be tuned to maximize throughput while staying within the available memory limits. `job_queue_size` should be set to a value equal to or higher than this value. If the current batch is full, new requests will be queued until they can be processed. | `32` for all backends except DeepSpeed. `4` for DeepSpeed | Integer |
| option.dtype | The data type you plan to cast the model weights to | `fp16` | `fp32`, `fp16`, `bf16` (only on G5/P4/P5 or newer instance types), `int8` (only in lmi-dist) |
| option.dtype | The data type you plan to cast the model weights to. If not provided, LMI will use the model's default data type. | `fp16` | `fp32`, `fp16`, `bf16` (only on G5/P4/P5 or newer instance types), `int8` (only in lmi-dist) |
| option.tensor_parallel_degree | The number of GPUs to shard the model across. The recommended value is `max`, which partitions the model across all available GPUs | `1` for DeepSpeed and Transformers NeuronX containers, `max` for TensorRT-LLM container | Value between 1 and the number of available GPUs. For Inferentia, this represents the number of neuron cores |
| option.entryPoint | The inference handler to use. This is either one of the built-in handlers provided by lmi, or the name of a custom script provided to LMI | `djl_python.huggingface` for DeepSpeed Container, `djl_python.tensorrt_llm` for TensorRT-LLM container, `djl_python.transformers_neuronx` for Transformers NeuronX container | `djl_python.huggingface` (vllm, lmi-dist, hf-accelerate), `djl_python.deepspeed` (deepspeed), `djl_python.tensorrt_llm` (tensorrt-llm), `djl_python.transformers_neuronx` (transformers neuronx / optimum neuron), `<custom_script>.py` (custom handler) |
| option.parallel_loading | If using multiple workers (multiple model copies), setting to `true` will load the model workers in parallel. This should only be set to `true` if using multiple model copies, and there is sufficient CPU memory to load N copies of the model (memory at least N * model size in GB) | `false` | `true`, `false` |
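
For reference, here is a minimal `serving.properties` sketch that combines several of the options above. The model id is a placeholder, and the values are illustrative rather than tuned recommendations:

```
option.model_id=<huggingface-hub-model-id>
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=32
option.dtype=fp16
option.tensor_parallel_degree=max
```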
16 changes: 10 additions & 6 deletions serving/docs/lmi/user_guides/deepspeed_user_guide.md
@@ -134,16 +134,20 @@ The Dynamic Int8 quantization method uses DeepSpeed's [Mixture-of-Quantization](

## Advanced DeepSpeed Configurations

Here are the advanced parameters that are available when using DeepSpeed.
Each advanced configuration is specified with a Configuration Type.
`LMI` means the configuration is processed by LMI and translated into the appropriate backend configurations.
`Pass Through` means the configuration is passed down directly to the library.
If you encounter an issue with a `Pass Through` configuration, it is likely an issue with the underlying library and not LMI.
The following table lists the advanced configurations that are available with the DeepSpeed backend.
There are two types of advanced configurations: `LMI` and `Pass Through`.
`LMI` configurations are processed by LMI and translated into configurations that DeepSpeed uses.
`Pass Through` configurations are passed directly to the backend library. These configurations are opaque from the perspective of the model server and LMI.
We recommend that you file an [issue](https://github.com/deepjavalibrary/djl-serving/issues/new?assignees=&labels=bug&projects=&template=bug_report.md&title=) for any issues you encounter with configurations.
For `LMI` configurations, if we determine there is an issue with the configuration, we will attempt to provide a workaround for the current release and to fix the issue in the next release.
For `Pass Through` configurations, it is possible that our investigation reveals an issue with the backend library.
In that situation, there is nothing LMI can do until the issue is fixed in the backend library.



| Item | LMI Version | Configuration Type | Description | Example value |
|------------------------------|-------------|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------|
| option.task | >= 0.25.0 | LMI | The task used in Hugging Face for different pipelines. Default is text-generation | `text-generation` |
| option.quantize | >= 0.25.0 | LMI | Specify this option to quantize your model using one of the quantization methods supported by DeepSpeed. SmoothQuant typically provides quantization with better output quality | `dynamic_int8`, `smoothquant` |
| option.max_tokens | >= 0.25.0 | LMI | The total number of tokens (input and output) that DeepSpeed can work with. The number of output tokens is the difference between the total number of tokens and the number of input tokens. By default, we set the value to 1024. If you need longer sequence generation, you may want to set this to a higher value (2048, 4096, ...) | 1024 |
| option.low_cpu_mem_usage | >= 0.25.0 | Pass Through | Reduce CPU memory usage when loading models. We recommend that you set this to `true`. | Default: `true` |
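
To illustrate, here is a hypothetical `serving.properties` sketch that exercises the options above. The values are examples only, not recommendations:

```
option.entryPoint=djl_python.deepspeed
option.task=text-generation
option.quantize=smoothquant
option.max_tokens=2048
option.low_cpu_mem_usage=true
```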
13 changes: 8 additions & 5 deletions serving/docs/lmi/user_guides/tnx_user_guide.md
@@ -77,11 +77,14 @@ Currently, we allow customers to use `option.quantize=static_int8` or `OPTION_QUANTIZE`

## Advanced Transformers NeuronX Configurations

Here are the advanced parameters that are available when using Transformers NeuronX.
Each advanced configuration is specified with a Configuration Type.
`LMI` means the configuration is processed by LMI and translated into the appropriate backend configurations.
`Pass Through` means the configuration is passed down directly to the library.
If you encounter an issue with a `Pass Through` configuration, it is likely an issue with the underlying library and not LMI.
The following table lists the advanced configurations that are available with the Transformers NeuronX backend.
There are two types of advanced configurations: `LMI` and `Pass Through`.
`LMI` configurations are processed by LMI and translated into configurations that Transformers NeuronX uses.
`Pass Through` configurations are passed directly to the backend library. These configurations are opaque from the perspective of the model server and LMI.
We recommend that you file an [issue](https://github.com/deepjavalibrary/djl-serving/issues/new?assignees=&labels=bug&projects=&template=bug_report.md&title=) for any issues you encounter with configurations.
For `LMI` configurations, if we determine there is an issue with the configuration, we will attempt to provide a workaround for the current release and to fix the issue in the next release.
For `Pass Through` configurations, it is possible that our investigation reveals an issue with the backend library.
In that situation, there is nothing LMI can do until the issue is fixed in the backend library.
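
As a sketch, the static int8 quantization mentioned earlier could be enabled in `serving.properties` as shown below; the tensor parallel degree is illustrative, and each `option.*` key can equivalently be supplied as an upper-cased `OPTION_*` environment variable (e.g. `OPTION_QUANTIZE`):

```
option.entryPoint=djl_python.transformers_neuronx
option.quantize=static_int8
option.tensor_parallel_degree=2
```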

| Item | LMI Version | Configuration Type | Description | Example value |
|--------------------------------------------|-------------|--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
13 changes: 8 additions & 5 deletions serving/docs/lmi/user_guides/trt_llm_user_guide.md
@@ -69,11 +69,14 @@ More details about additional (optional) quantization configurations are available

## Advanced TensorRT-LLM Configurations

Here are the advanced parameters that are available when using TensorRT-LLM.
Each advanced configuration is specified with a Configuration Type.
`LMI` means the configuration is processed by LMI and translated into the appropriate backend configurations.
`Pass Through` means the configuration is passed down directly to the library.
If you encounter an issue with a `Pass Through` configuration, it is likely an issue with the underlying library and not LMI.
The following table lists the advanced configurations that are available with the TensorRT-LLM backend.
There are two types of advanced configurations: `LMI` and `Pass Through`.
`LMI` configurations are processed by LMI and translated into configurations that TensorRT-LLM uses.
`Pass Through` configurations are passed directly to the backend library. These configurations are opaque from the perspective of the model server and LMI.
We recommend that you file an [issue](https://github.com/deepjavalibrary/djl-serving/issues/new?assignees=&labels=bug&projects=&template=bug_report.md&title=) for any issues you encounter with configurations.
For `LMI` configurations, if we determine there is an issue with the configuration, we will attempt to provide a workaround for the current release and to fix the issue in the next release.
For `Pass Through` configurations, it is possible that our investigation reveals an issue with the backend library.
In that situation, there is nothing LMI can do until the issue is fixed in the backend library.
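
For illustration, here is a minimal `serving.properties` sketch for this backend; the values are examples grounded in the common configurations table, not tuned recommendations:

```
option.entryPoint=djl_python.tensorrt_llm
option.rolling_batch=trtllm
option.max_rolling_batch_size=32
option.tensor_parallel_degree=max
```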

| Item | LMI Version | Configuration Type | Description | Example value |
|---------------------------------------------------------------|-------------|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|