PaddlePaddle · ming1753 · Aug 1, 2025 · Jul 9, 2025 · Jul 9, 2025 · Jul 9, 2025
diff --git a/README.md b/README.md
@@ -64,6 +64,7 @@ Learn how to use FastDeploy through our documentation:
 - [Offline Inference Development](./docs/offline_inference.md)
 - [Online Service Deployment](./docs/online_serving/README.md)
 - [Full Supported Models List](./docs/supported_models.md)
+- [Optimal Deployment](./docs/optimal_deployment/README.md)
 
 ## Supported Models
 

diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,123 @@
+
+# ERNIE-4.5-VL-28B-A3B-Paddle
+
+## 1. Environment Preparation
+### 1.1 Support Status
+
+The minimum number of cards required for deployment on the following hardware is as follows:
+| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:| :------:|
+| A30 [24G] | 2 | 2 | 4 |
+| L20 [48G] | 1 | 1 | 2 |
+| H20 [144G] | 1 | 1 |  1 |
+| A100 [80G] | 1 | 1 |  1 |
+| H800 [80G] | 1 | 1 |  1 |
+
+### 1.2 Install Fastdeploy
+
+Installation process reference documentation [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md)
+
+> ⚠️ Precautions:
+> - FastDeploy only supports models in Paddle format – please ensure to download models with the `-Paddle` file extension.
+> - The model name will trigger an automatic download. If the model has already been downloaded, you can directly use the absolute path to the model's download location.
+
+## 2.How to Use
+### 2.1 Basic: Launching the Service
+**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+  --port 8180 \
+  --metrics-port 8181 \
+  --engine-worker-queue-port 8182 \
+  --tensor-parallel-size 1 \
+  --max-model-len 32768 \
+  --max-num-seqs 256 \
+  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+  --reasoning-parser ernie-45-vl \
+  --gpu-memory-utilization 0.9 \
+  --enable-chunked-prefill \
+  --max-num-batched-tokens 384 \
+  --quantization wint4 \
+  --enable-mm
+```
+**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+  --port 8180 \
+  --metrics-port 8181 \
+  --engine-worker-queue-port 8182 \
+  --tensor-parallel-size 2 \
+  --max-model-len 131072 \
+  --max-num-seqs 256 \
+  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+  --reasoning-parser ernie-45-vl \
+  --gpu-memory-utilization 0.9 \
+  --enable-chunked-prefill \
+  --max-num-batched-tokens 384 \
+  --quantization wint4 \
+  --enable-mm
+```
+An example is a set of configurations that can run stably while also delivering relatively good performance. If you have further requirements for precision or performance, please continue reading the content below.
+### 2.2 Advanced: How to Achieve Better Performance
+
+#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
+> **Context Length**
+- **Parameters：** `--max-model-len`
+- **Description：** Controls the maximum context length that the model can process.
+- **Recommendation：** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).
+
+   ⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
+> **Maximum sequence count**
+- **Parameters：** `--max-num-seqs`
+- **Description：** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
+- **Recommendation：** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.
+
+> **Multi-image and multi-video input**
+- **Parameters**：`--limit-mm-per-prompt`
+- **Description**：Our model supports multi-image and multi-video input in a single prompt. Please use this **Parameters** setting to limit the number of images/videos per request, ensuring efficient resource utilization.
+- **Recommendation**：We recommend setting the number of images and videos in a single prompt to **100 each** to balance performance and memory usage.
+
+> **Available GPU memory ratio during initialization**
+- **Parameters：** `--gpu-memory-utilization`
+- **Description：** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
+- **Recommendation：** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.
+
+#### 2.2.2 Chunked Prefill
+- **Parameters：** `--enable-chunked-prefill`
+- **Description：** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
+- **Other relevant configurations**:
+
+    `--max-num-batched-tokens`：Limit the maximum number of tokens per chunk, with a recommended setting of 384.
+
+#### 2.2.3  **Quantization precision**
+- **Parameters：** `--quantization`
+
+- **Supported precision types：**
+  - WINT4 (Suitable for most users)
+  - WINT8
+  - BFLOAT16 (When the `--quantization` parameter is not set, BFLOAT16 is used by default.)
+
+- **Recommendation：**
+  - Unless you have extremely stringent precision requirements, we strongly recommend using WINT4 quantization. This will significantly reduce memory consumption and increase throughput.
+  - If slightly higher precision is required, you may try WINT8.
+  - Only consider using BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
+
+## 3. FAQ
+**Note:** Deploying multimodal services requires adding parameters to the configuration `--enable-mm`.
+
+### 3.1 Out of Memory
+If the service prompts "Out of Memory" during startup, please try the following solutions:
+1. Ensure no other processes are occupying GPU memory;
+2. Use WINT4/WINT8 quantization and enable chunked prefill;
+3. Reduce context length and maximum sequence count as needed;
+4. Increase the number of GPU cards for deployment (e.g., 2 or 4 cards) by modifying the parameter `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.
+
+If the service starts normally but later reports insufficient memory, try:
+1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`;
+2. Increase the number of deployment cards (parameter adjustment as above).
diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -0,0 +1,99 @@
+
+# ERNIE-4.5-VL-424B-A47B-Paddle
+
+## 1. Environment Preparation
+### 1.1 Support Status
+The minimum number of cards required for deployment on the following hardware is as follows:
+| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:| :------:|
+| H20 [144G] | 8 | 8 |  8 |
+| A100 [80G] | 8 | 8 |  - |
+| H800 [80G] | 8 | 8 |  - |
+
+### 1.2 Install Fastdeploy
+
+Installation process reference documentation [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md)
+
+> ⚠️ Precautions:
+> - FastDeploy only supports models in Paddle format – please ensure to download models with the `-Paddle` file extension.
+> - The model name will trigger an automatic download. If the model has already been downloaded, you can directly use the absolute path to the model's download location.
+
+## 2.How to Use
+### 2.1 Basic: Launching the Service
+**Example 1:** Deploying a 128K context service on 8x H800 GPUs.
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
+  --port 8180 \
+  --metrics-port 8181 \
+  --engine-worker-queue-port 8182 \
+  --tensor-parallel-size 8 \
+  --max-model-len 131072 \
+  --max-num-seqs 16 \
+  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+  --reasoning-parser ernie-45-vl \
+  --gpu-memory-utilization 0.8 \
+  --enable-chunked-prefill \
+  --max-num-batched-tokens 384 \
+  --quantization wint4 \
+  --enable-mm
+```
+
+An example is a set of configurations that can run stably while also delivering relatively good performance. If you have further requirements for precision or performance, please continue reading the content below.
+### 2.2 Advanced: How to Achieve Better Performance
+
+#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
+> **Context Length**
+- **Parameters：** `--max-model-len`
+- **Description：** Controls the maximum context length that the model can process.
+- **Recommendation：** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).
+
+   ⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
+> **Maximum sequence count**
+- **Parameters：** `--max-num-seqs`
+- **Description：** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
+- **Recommendation：** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.
+
+> **Multi-image and multi-video input**
+- **Parameters**：`--limit-mm-per-prompt`
+- **Description**：Our model supports multi-image and multi-video input in a single prompt. Please use this **Parameters** setting to limit the number of images/videos per request, ensuring efficient resource utilization.
+- **Recommendation**：We recommend setting the number of images and videos in a single prompt to **100 each** to balance performance and memory usage.
+
+> **Available GPU memory ratio during initialization**
+- **Parameters：** `--gpu-memory-utilization`
+- **Description：** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
+- **Recommendation：** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.
+
+#### 2.2.2 Chunked Prefill
+- **Parameters：** `--enable-chunked-prefill`
+- **Description：** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
+- **Other relevant configurations**:
+
+    `--max-num-batched-tokens`：Limit the maximum number of tokens per chunk, with a recommended setting of 384.
+
+#### 2.2.3  **Quantization precision**
+- **Parameters：** `--quantization`
+
+- **Supported precision types：**
+  - wint4 (Suitable for most users)
+  - wint8
+  - bfloat16 (When the `--quantization` parameter is not set, bfloat16 is used by default.)
+
+- **Recommendation：**
+  - Unless you have extremely stringent precision requirements, we strongly recommend using wint4 quantization. This will significantly reduce memory consumption and increase throughput.
+  - If slightly higher precision is required, you may try wint8.
+  - Only consider using bfloat16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
+
+## 3. FAQ
+**Note:** Deploying multimodal services requires adding parameters to the configuration `--enable-mm`.
+
+### 3.1 Out of Memory
+If the service prompts "Out of Memory" during startup, please try the following solutions:
+1. Ensure no other processes are occupying GPU memory;
+2. Use wint4/wint8 quantization and enable chunked prefill;
+3. Reduce context length and maximum sequence count as needed.
+
+If the service starts normally but later reports insufficient memory, try:
+1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`.
diff --git a/docs/optimal_deployment/README.md b/docs/optimal_deployment/README.md
@@ -0,0 +1,4 @@
+# Optimal Deployment
+
+- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
+- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)
diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,124 @@
+
+# ERNIE-4.5-VL-28B-A3B-Paddle
+
+## 一、环境准备
+### 1.1 支持情况
+在下列硬件上部署所需要的最小卡数如下：
+| 设备[显存] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:| :------:|
+| A30 [24G] | 2 | 2 | 4 |
+| L20 [48G] | 1 | 1 | 2 |
+| H20 [144G] | 1 | 1 |  1 |
+| A100 [80G] | 1 | 1 |  1 |
+| H800 [80G] | 1 | 1 |  1 |
+
+### 1.2 安装fastdeploy
+
+安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md)
+
+> ⚠️ 注意事项
+> - FastDeploy只支持Paddle格式的模型，注意下载Paddle后缀的模型
+> - 使用模型名称会自动下载模型，如果已经下载过模型，可以直接使用模型下载位置的绝对路径
+
+## 二、如何使用
+### 2.1 基础：启动服务
+ **示例1：** 4090上单卡部署32K上下文的服务
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --tensor-parallel-size 1 \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+    --reasoning-parser ernie-45-vl \
+    --gpu-memory-utilization 0.9 \
+    --enable-chunked-prefill \
+    --max-num-batched-tokens 384 \
+    --quantization wint4 \
+    --enable-mm
+```
+ **示例2：** H800上双卡部署128K上下文的服务
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --tensor-parallel-size 2 \
+    --max-model-len 131072 \
+    --max-num-seqs 128 \
+    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+    --reasoning-parser ernie-45-vl \
+    --gpu-memory-utilization 0.9 \
+    --enable-chunked-prefill \
+    --max-num-batched-tokens 384 \
+    --quantization wint4 \
+    --enable-mm
+```
+示例是可以稳定运行的一组配置，同时也能得到比较好的性能。
+如果对精度、性能有进一步的要求，请继续阅读下面的内容。
+### 2.2 进阶：如何获取更优性能
+
+#### 2.2.1 评估应用场景，正确设置参数
+> **上下文长度**
+- **参数：** `--max-model-len`
+- **描述：** 控制模型可处理的最大上下文长度。
+- **推荐：** 更长的上下文会导致吞吐降低，根据实际情况设置，`ERNIE-4.5-VL-28B-A3B-Paddle`最长支持**128k**（131072）长度的上下文。
+
+   ⚠️ 注：更长的上下文会显著增加GPU显存需求，设置更长的上下文之前确保硬件资源是满足的。
+> **最大序列数量**
+- **参数：** `--max-num-seqs`
+- **描述：** 控制服务可以处理的最大序列数量，支持1～256。
+- **推荐：** 如果您不知道实际应用场景中请求的平均序列数量是多少，我们建议设置为**256**。如果您的应用场景中请求的平均序列数量明显少于256，我们建议设置为一个略大于平均值的较小值，以进一步降低显存占用，优化服务性能。
+
+> **多图、多视频输入**
+- **参数**：`--limit-mm-per-prompt`
+- **描述**：我们的模型支持单次提示词（prompt）中输入多张图片和视频。请使用此参数限制每次请求的图片/视频数量，以确保资源高效利用。
+- **推荐**：我们建议将单次提示词（prompt）中的图片和视频数量均设置为100个，以平衡性能与内存占用。
+
+> **初始化时可用的显存比例**
+- **参数：** `--gpu-memory-utilization`
+- **用处：** 用于控制 FastDeploy 初始化服务的可用显存，默认0.9，即预留10%的显存备用。
+- **推荐：** 推荐使用默认值0.9。如果服务压测时提示显存不足，可以尝试调低该值。
+
+#### 2.2.2 Chunked Prefill
+- **参数：** `--enable-chunked-prefill`
+- **用处：** 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**。
+
+- **其他相关配置**:
+
+    `--max-num-batched-tokens`：限制每个chunk的最大token数量。多模场景下每个chunk会向上取整保持图片的完整性，因此实际每次推理的总token数会大于该值。我们推荐设置为384。
+
+#### 2.2.3  **量化精度**
+- **参数：** `--quantization`
+
+- **已支持的精度类型：**
+  - WINT4 (适合大多数用户)
+  - WINT8
+  - BFLOAT16 (未设置 `--quantization` 参数时，默认使用BFLOAT16)
+
+- **推荐：**
+  - 除非您有极其严格的精度要求，否则我们建议使用WINT4量化。这将显著降低内存占用并提升吞吐量。
+  - 若需要稍高的精度，可尝试WINT8。
+  - 仅当您的应用场景对精度有极致要求时候才尝试使用BFLOAT16，因为它需要更多显存。
+
+## 三、常见问题FAQ
+**注意：** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。
+
+### 3.1 显存不足(OOM)
+如果服务启动时提示显存不足，请尝试以下方法：
+1. 确保无其他进程占用显卡显存；
+2. 使用WINT4/WINT8量化，开启chunked prefill；
+3. 酌情降低上下文长度和最大序列数量；
+4. 增加部署卡数，使用2卡或4卡部署，即修改参数 `--tensor-parallel-size 2` 或 `--tensor-parallel-size 4`。
+
+如果可以服务可以正常启动，运行时提示显存不足，请尝试以下方法：
+1. 酌情降低初始化时可用的显存比例，即调整参数 `--gpu-memory-utilization` 的值；
+2. 增加部署卡数，参数修改同上。