Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ Learn how to use FastDeploy through our documentation:
- [Offline Inference Development](./docs/offline_inference.md)
- [Online Service Deployment](./docs/online_serving/README.md)
- [Full Supported Models List](./docs/supported_models.md)
- [Optimal Deployment](./docs/optimal_deployment/README.md)

## Supported Models

Expand Down
123 changes: 123 additions & 0 deletions docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@

# ERNIE-4.5-VL-28B-A3B-Paddle

## 1. Environment Preparation
### 1.1 Support Status

The minimum number of cards required for deployment on the following hardware is as follows:
| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
|:----------:|:----------:|:------:| :------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |

### 1.2 Install Fastdeploy

Installation process reference documentation [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md)

> ⚠️ Precautions:
> - FastDeploy only supports models in Paddle format – please ensure to download models with the `-Paddle` file extension.
> - The model name will trigger an automatic download. If the model has already been downloaded, you can directly use the absolute path to the model's download location.

## 2.How to Use
### 2.1 Basic: Launching the Service
**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1

python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 256 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1

python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 256 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
An example is a set of configurations that can run stably while also delivering relatively good performance. If you have further requirements for precision or performance, please continue reading the content below.
### 2.2 Advanced: How to Achieve Better Performance

#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
> **Context Length**
- **Parameters:** `--max-model-len`
- **Description:** Controls the maximum context length that the model can process.
- **Recommendation:** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).

⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
> **Maximum sequence count**
- **Parameters:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
- **Recommendation:** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.

> **Multi-image and multi-video input**
- **Parameters**:`--limit-mm-per-prompt`
- **Description**:Our model supports multi-image and multi-video input in a single prompt. Please use this **Parameters** setting to limit the number of images/videos per request, ensuring efficient resource utilization.
- **Recommendation**:We recommend setting the number of images and videos in a single prompt to **100 each** to balance performance and memory usage.

> **Available GPU memory ratio during initialization**
- **Parameters:** `--gpu-memory-utilization`
- **Description:** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
- **Recommendation:** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.

#### 2.2.2 Chunked Prefill
- **Parameters:** `--enable-chunked-prefill`
- **Description:** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
- **Other relevant configurations**:

`--max-num-batched-tokens`:Limit the maximum number of tokens per chunk, with a recommended setting of 384.

#### 2.2.3 **Quantization precision**
- **Parameters:** `--quantization`

- **Supported precision types:**
- WINT4 (Suitable for most users)
- WINT8
- BFLOAT16 (When the `--quantization` parameter is not set, BFLOAT16 is used by default.)

- **Recommendation:**
- Unless you have extremely stringent precision requirements, we strongly recommend using WINT4 quantization. This will significantly reduce memory consumption and increase throughput.
- If slightly higher precision is required, you may try WINT8.
- Only consider using BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.

## 3. FAQ
**Note:** Deploying multimodal services requires adding parameters to the configuration `--enable-mm`.

### 3.1 Out of Memory
If the service prompts "Out of Memory" during startup, please try the following solutions:
1. Ensure no other processes are occupying GPU memory;
2. Use WINT4/WINT8 quantization and enable chunked prefill;
3. Reduce context length and maximum sequence count as needed;
4. Increase the number of GPU cards for deployment (e.g., 2 or 4 cards) by modifying the parameter `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.

If the service starts normally but later reports insufficient memory, try:
1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`;
2. Increase the number of deployment cards (parameter adjustment as above).
99 changes: 99 additions & 0 deletions docs/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@

# ERNIE-4.5-VL-424B-A47B-Paddle

## 1. Environment Preparation
### 1.1 Support Status
The minimum number of cards required for deployment on the following hardware is as follows:
| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
|:----------:|:----------:|:------:| :------:|
| H20 [144G] | 8 | 8 | 8 |
| A100 [80G] | 8 | 8 | - |
| H800 [80G] | 8 | 8 | - |

### 1.2 Install Fastdeploy

Installation process reference documentation [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md)

> ⚠️ Precautions:
> - FastDeploy only supports models in Paddle format – please ensure to download models with the `-Paddle` file extension.
> - The model name will trigger an automatic download. If the model has already been downloaded, you can directly use the absolute path to the model's download location.

## 2.How to Use
### 2.1 Basic: Launching the Service
**Example 1:** Deploying a 128K context service on 8x H800 GPUs.
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1

python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 16 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.8 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```

An example is a set of configurations that can run stably while also delivering relatively good performance. If you have further requirements for precision or performance, please continue reading the content below.
### 2.2 Advanced: How to Achieve Better Performance

#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
> **Context Length**
- **Parameters:** `--max-model-len`
- **Description:** Controls the maximum context length that the model can process.
- **Recommendation:** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).

⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
> **Maximum sequence count**
- **Parameters:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
- **Recommendation:** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.

> **Multi-image and multi-video input**
- **Parameters**:`--limit-mm-per-prompt`
- **Description**:Our model supports multi-image and multi-video input in a single prompt. Please use this **Parameters** setting to limit the number of images/videos per request, ensuring efficient resource utilization.
- **Recommendation**:We recommend setting the number of images and videos in a single prompt to **100 each** to balance performance and memory usage.

> **Available GPU memory ratio during initialization**
- **Parameters:** `--gpu-memory-utilization`
- **Description:** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
- **Recommendation:** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.

#### 2.2.2 Chunked Prefill
- **Parameters:** `--enable-chunked-prefill`
- **Description:** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
- **Other relevant configurations**:

`--max-num-batched-tokens`:Limit the maximum number of tokens per chunk, with a recommended setting of 384.

#### 2.2.3 **Quantization precision**
- **Parameters:** `--quantization`

- **Supported precision types:**
- wint4 (Suitable for most users)
- wint8
- bfloat16 (When the `--quantization` parameter is not set, bfloat16 is used by default.)

- **Recommendation:**
- Unless you have extremely stringent precision requirements, we strongly recommend using wint4 quantization. This will significantly reduce memory consumption and increase throughput.
- If slightly higher precision is required, you may try wint8.
- Only consider using bfloat16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.

## 3. FAQ
**Note:** Deploying multimodal services requires adding parameters to the configuration `--enable-mm`.

### 3.1 Out of Memory
If the service prompts "Out of Memory" during startup, please try the following solutions:
1. Ensure no other processes are occupying GPU memory;
2. Use wint4/wint8 quantization and enable chunked prefill;
3. Reduce context length and maximum sequence count as needed.

If the service starts normally but later reports insufficient memory, try:
1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`.
4 changes: 4 additions & 0 deletions docs/optimal_deployment/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Optimal Deployment

- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)
124 changes: 124 additions & 0 deletions docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@

# ERNIE-4.5-VL-28B-A3B-Paddle

## 一、环境准备
### 1.1 支持情况
在下列硬件上部署所需要的最小卡数如下:
| 设备[显存] | WINT4 | WINT8 | BFLOAT16 |
|:----------:|:----------:|:------:| :------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |

### 1.2 安装fastdeploy

安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md)

> ⚠️ 注意事项
> - FastDeploy只支持Paddle格式的模型,注意下载Paddle后缀的模型
> - 使用模型名称会自动下载模型,如果已经下载过模型,可以直接使用模型下载位置的绝对路径

## 二、如何使用
### 2.1 基础:启动服务
**示例1:** 4090上单卡部署32K上下文的服务
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1

python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 32 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
**示例2:** H800上双卡部署128K上下文的服务
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1

python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 128 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
示例是可以稳定运行的一组配置,同时也能得到比较好的性能。
如果对精度、性能有进一步的要求,请继续阅读下面的内容。
### 2.2 进阶:如何获取更优性能

#### 2.2.1 评估应用场景,正确设置参数
> **上下文长度**
- **参数:** `--max-model-len`
- **描述:** 控制模型可处理的最大上下文长度。
- **推荐:** 更长的上下文会导致吞吐降低,根据实际情况设置,`ERNIE-4.5-VL-28B-A3B-Paddle`最长支持**128k**(131072)长度的上下文。

⚠️ 注:更长的上下文会显著增加GPU显存需求,设置更长的上下文之前确保硬件资源是满足的。
> **最大序列数量**
- **参数:** `--max-num-seqs`
- **描述:** 控制服务可以处理的最大序列数量,支持1~256。
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

当前不支持256以上的 batch-size?如果不支持,后续需要排查

- **推荐:** 如果您不知道实际应用场景中请求的平均序列数量是多少,我们建议设置为**256**。如果您的应用场景中请求的平均序列数量明显少于256,我们建议设置为一个略大于平均值的较小值,以进一步降低显存占用,优化服务性能。

> **多图、多视频输入**
- **参数**:`--limit-mm-per-prompt`
- **描述**:我们的模型支持单次提示词(prompt)中输入多张图片和视频。请使用此参数限制每次请求的图片/视频数量,以确保资源高效利用。
- **推荐**:我们建议将单次提示词(prompt)中的图片和视频数量均设置为100个,以平衡性能与内存占用。

> **初始化时可用的显存比例**
- **参数:** `--gpu-memory-utilization`
- **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。
- **推荐:** 推荐使用默认值0.9。如果服务压测时提示显存不足,可以尝试调低该值。

#### 2.2.2 Chunked Prefill
- **参数:** `--enable-chunked-prefill`
- **用处:** 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**。

- **其他相关配置**:

`--max-num-batched-tokens`:限制每个chunk的最大token数量。多模场景下每个chunk会向上取整保持图片的完整性,因此实际每次推理的总token数会大于该值。我们推荐设置为384。

#### 2.2.3 **量化精度**
- **参数:** `--quantization`

- **已支持的精度类型:**
- WINT4 (适合大多数用户)
- WINT8
- BFLOAT16 (未设置 `--quantization` 参数时,默认使用BFLOAT16)

- **推荐:**
- 除非您有极其严格的精度要求,否则我们建议使用WINT4量化。这将显著降低内存占用并提升吞吐量。
- 若需要稍高的精度,可尝试WINT8。
- 仅当您的应用场景对精度有极致要求时候才尝试使用BFLOAT16,因为它需要更多显存。

## 三、常见问题FAQ
**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。

### 3.1 显存不足(OOM)
如果服务启动时提示显存不足,请尝试以下方法:
1. 确保无其他进程占用显卡显存;
2. 使用WINT4/WINT8量化,开启chunked prefill;
3. 酌情降低上下文长度和最大序列数量;
4. 增加部署卡数,使用2卡或4卡部署,即修改参数 `--tensor-parallel-size 2` 或 `--tensor-parallel-size 4`。

如果可以服务可以正常启动,运行时提示显存不足,请尝试以下方法:
1. 酌情降低初始化时可用的显存比例,即调整参数 `--gpu-memory-utilization` 的值;
2. 增加部署卡数,参数修改同上。
Loading
Loading