This repository deploys BEV 3D detection models (including BEVFormer and BEVDet) on TensorRT, supporting FP32/FP16/INT8 inference. To speed up BEVFormer on TensorRT, the project implements custom TensorRT ops that support nv_half, nv_half2, and INT8. With accuracy almost unaffected, the inference speed of BEVFormer base increases by more than four times, the engine size shrinks by more than 90%, and GPU memory usage drops by more than 80%. The project also supports common 2D object detection models from MMDetection, which can be quantized to INT8 and deployed with TensorRT with only a small number of code changes.
| Model | Data | Batch Size | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|
| BEVFormer tiny download | NuScenes | 1 | NDS: 0.354 mAP: 0.252 | 15.9 | 383 | 2167 | RTX 3090 |
| BEVFormer small download | NuScenes | 1 | NDS: 0.478 mAP: 0.370 | 5.1 | 680 | 3147 | RTX 3090 |
| BEVFormer base download | NuScenes | 1 | NDS: 0.517 mAP: 0.416 | 2.4 | 265 | 5435 | RTX 3090 |
| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|
| BEVFormer tiny | NuScenes | 1 | FP32 | - | NDS: 0.354 mAP: 0.252 | 37.9 (x1) | 136 (x1) | 2159 (x1) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.354 mAP: 0.252 | 69.2 (x1.83) | 74 (x0.54) | 1729 (x0.80) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.353 mAP: 0.249 | 65.1 (x1.72) | 58 (x0.43) | 1737 (x0.80) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.353 mAP: 0.249 | 70.7 (x1.87) | 54 (x0.40) | 1665 (x0.77) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32 | - | NDS: 0.478 mAP: 0.370 | 6.6 (x1) | 245 (x1) | 4663 (x1) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.478 mAP: 0.370 | 12.8 (x1.94) | 126 (x0.51) | 3719 (x0.80) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.476 mAP: 0.367 | 8.7 (x1.32) | 158 (x0.64) | 4079 (x0.87) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.477 mAP: 0.368 | 13.3 (x2.02) | 106 (x0.43) | 3441 (x0.74) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP32 | - | NDS: 0.517 mAP: 0.416 | 1.5 (x1) | 1689 (x1) | 13893 (x1) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517 mAP: 0.416 | 1.8 (x1.20) | 849 (x0.50) | 11865 (x0.85) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.516 mAP: 0.414 | 1.8 (x1.20) | 426 (x0.25) | 12429 (x0.89) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.515 mAP: 0.414 | 2.2 (x1.47) | 244 (x0.14) | 11011 (x0.79) | RTX 3090 |
\* Out of memory when running onnx2trt with TensorRT-8.5.1.7; these models convert successfully with TensorRT-8.4.3.1, so their engines were built with TensorRT-8.4.3.1.
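Given this version sensitivity, it is worth confirming which TensorRT build is active before converting. A minimal check, assuming the TensorRT Python bindings are installed:

```shell
# Print the active TensorRT version (requires the `tensorrt` Python package)
python -c "import tensorrt; print(tensorrt.__version__)"
```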
FP16 Plugins with nv_half (improvements relative to the FP32 baselines in the table above):
| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|
| BEVFormer tiny | NuScenes | 1 | FP32 | - | NDS: 0.354 mAP: 0.252 | 40.0 (x1.06) | 135 (x0.99) | 1693 (x0.78) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.355 mAP: 0.252 | 81.2 (x2.14) | 70 (x0.51) | 1203 (x0.56) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.351 mAP: 0.249 | 90.1 (x2.38) | 58 (x0.43) | 1105 (x0.51) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.351 mAP: 0.249 | 107.4 (x2.83) | 52 (x0.38) | 1095 (x0.51) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32 | - | NDS: 0.478 mAP: 0.370 | 7.4 (x1.12) | 250 (x1.02) | 2585 (x0.55) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.479 mAP: 0.370 | 15.8 (x2.40) | 127 (x0.52) | 1729 (x0.37) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.477 mAP: 0.367 | 17.9 (x2.71) | 166 (x0.68) | 1637 (x0.35) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.476 mAP: 0.366 | 20.4 (x3.10) | 108 (x0.44) | 1467 (x0.31) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP32 | - | NDS: 0.517 mAP: 0.416 | 3.0 (x2.00) | 292 (x0.17) | 5715 (x0.41) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517 mAP: 0.416 | 4.9 (x3.27) | 148 (x0.09) | 3417 (x0.25) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.515 mAP: 0.414 | 6.9 (x4.60) | 202 (x0.12) | 3307 (x0.24) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.514 mAP: 0.413 | 8.0 (x5.33) | 131 (x0.08) | 2429 (x0.17) | RTX 3090 |
FP16 Plugins with nv_half2 (improvements relative to the same FP32 baselines):
| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.355 mAP: 0.251 | 84.2 (x2.22) | 72 (x0.53) | 1205 (x0.56) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.354 mAP: 0.250 | 108.3 (x2.86) | 52 (x0.38) | 1093 (x0.51) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.479 mAP: 0.371 | 18.6 (x2.82) | 124 (x0.51) | 1725 (x0.37) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.477 mAP: 0.368 | 22.9 (x3.47) | 110 (x0.45) | 1487 (x0.32) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517 mAP: 0.416 | 6.6 (x4.40) | 146 (x0.09) | 3415 (x0.25) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.516 mAP: 0.415 | 8.6 (x5.73) | 159 (x0.09) | 2479 (x0.18) | RTX 3090 |
| Model | Data | Batch Size | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|
| BEVDet R50 CBGS | NuScenes | 1 | NDS: 0.38 mAP: 0.298 | 29.0 | 170 | 1858 | RTX 2080Ti |
With the custom plugin bev_pool_v2 (supporting nv_half, nv_half2, and INT8), modified from the official BEVDet:
| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|
| BEVDet R50 CBGS | NuScenes | 1 | FP32 | - | NDS: 0.38 mAP: 0.298 | 44.6 | 245 | 1032 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP16 | - | NDS: 0.38 mAP: 0.298 | 135.1 | 86 | 790 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.355 mAP: 0.274 | 234.7 | 44 | 706 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.357 mAP: 0.277 | 236.4 | 44 | 706 | RTX 2080Ti |
This project also supports common 2D object detection models in MMDetection with little modification. The following are deployment examples of YOLOx and CenterNet.
| Model | Data | Framework | Batch Size | Float/Int | Quantization Method | mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOx download | COCO | PyTorch | 32 | FP32 | - | 0.506 | 63.1 | 379 | 7617 | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP32 | - | 0.506 | 71.3 (x1) | 546 (x1) | 9943 (x1) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP16 | - | 0.506 | 296.8 (x4.16) | 192 (x0.35) | 4567 (x0.46) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP32/INT8 | PTQ entropy per-tensor | 0.488 | 556.4 (x7.80) | 99 (x0.18) | 5225 (x0.53) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP16/INT8 | PTQ entropy per-tensor | 0.479 | 550.6 (x7.72) | 99 (x0.18) | 5119 (x0.51) | RTX 3090 |
| Model | Data | Framework | Batch Size | Float/Int | Quantization Method | mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|---|
| CenterNet download | COCO | PyTorch | 32 | FP32 | - | 0.299 | 337.4 | 56 | 5171 | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP32 | - | 0.299 | 475.6 (x1) | 58 (x1) | 8241 (x1) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP16 | - | 0.297 | 1247.1 (x2.62) | 29 (x0.50) | 5183 (x0.63) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP32/INT8 | PTQ entropy per-tensor | 0.270 | 1534.0 (x3.22) | 20 (x0.34) | 6549 (x0.79) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP16/INT8 | PTQ entropy per-tensor | 0.285 | 1889.0 (x3.97) | 17 (x0.29) | 6453 (x0.78) | RTX 3090 |
```shell
git clone git@github.com:donghe4/BEVFormer_tensorrt.git
cd BEVFormer_tensorrt
PROJECT_DIR=$(pwd)
```

Download the COCO 2017 dataset to /path/to/coco and unzip it.
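For reference, a minimal sketch for fetching COCO 2017 from the official cocodataset.org mirrors (adjust /path/to/coco to your storage):

```shell
# Download the COCO 2017 images and annotations, then unzip everything in place
mkdir -p /path/to/coco && cd /path/to/coco
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip '*.zip'
```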
```shell
cd ${PROJECT_DIR}/data
ln -s /path/to/coco coco
```

Download the nuScenes V1.0 full dataset and the CAN bus expansion data HERE as /path/to/nuscenes and /path/to/can_bus, then prepare the nuScenes data as in BEVFormer.
```shell
cd ${PROJECT_DIR}/data
ln -s /path/to/nuscenes nuscenes
ln -s /path/to/can_bus can_bus
cd ${PROJECT_DIR}
sh samples/bevformer/create_data.sh
```

```
${PROJECT_DIR}/data/.
├── can_bus
│ ├── scene-0001_meta.json
│ ├── scene-0001_ms_imu.json
│ ├── scene-0001_pose.json
│ └── ...
├── coco
│ ├── annotations
│ ├── test2017
│ ├── train2017
│ └── val2017
└── nuscenes
├── maps
├── samples
├── sweeps
    └── v1.0-trainval
```

- Quick start
```shell
bash run_docker.sh
```

- Alternative
```shell
cd ${PROJECT_DIR}
docker run -it --gpus all -v $(pwd):/workspace/BEVFormer_tensorrt/ \
    -v /path/to/can_bus:/workspace/BEVFormer_tensorrt/data/can_bus \
    -v /path/to/coco:/workspace/BEVFormer_tensorrt/data/coco \
    -v /path/to/nuscenes:/workspace/BEVFormer_tensorrt/data/nuscenes \
    --shm-size=16G \
    --privileged \
    --network=host \
    --user root \
    hadonga/bev_trt:1.0 /bin/bash
```

Note: test the Docker image first; if the test succeeds, the installation steps below are unnecessary.
```shell
# Run the unit tests of the custom TensorRT plugins
cd ${PROJECT_DIR}
sh samples/test_trt_ops.sh

# Build and install the custom TensorRT plugins
cd /workspace/BEVFormer_tensorrt/TensorRT/build
cmake .. -DCMAKE_TENSORRT_PATH=/usr
make -j$(nproc)
make install

# Build and install part of the ops in MMDetection3D
cd /workspace/BEVFormer_tensorrt/third_party/bev_mmdet3d
python setup.py build develop --user
```

Download the above PyTorch checkpoints to `${PROJECT_DIR}/checkpoints/pytorch/`. The ONNX files and TensorRT engines will be saved in `${PROJECT_DIR}/checkpoints/onnx/` and `${PROJECT_DIR}/checkpoints/tensorrt/`.
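If these directories don't exist yet, they can be laid out in one step (paths taken from the text above):

```shell
# Create the expected checkpoint directories
mkdir -p ${PROJECT_DIR}/checkpoints/pytorch \
         ${PROJECT_DIR}/checkpoints/onnx \
         ${PROJECT_DIR}/checkpoints/tensorrt
```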
Supported common TensorRT ops in BEVFormer:
- Grid Sampler
- Multi-scale Deformable Attention
- Modulated Deformable Conv2d
- Rotate
- Inverse
- BEV Pool V2
- Flash Multi-Head Attention
Each op is implemented in two versions: FP32/FP16 (nv_half)/INT8 and FP32/FP16 (nv_half2)/INT8.
For specific speed comparison, see Custom TensorRT Plugins.
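Outside the provided scripts, the plugins can also be exercised directly with trtexec. A sketch, assuming the plugin library built above installs as libtensorrt_ops.so under /usr/lib and that an ONNX file named bevformer_base.onnx was exported (both names are assumptions; adjust to your build):

```shell
# Build an FP16 engine from the exported ONNX, loading the custom
# plugin library first (library path/name and ONNX filename assumed).
trtexec --onnx=${PROJECT_DIR}/checkpoints/onnx/bevformer_base.onnx \
        --plugins=/usr/lib/libtensorrt_ops.so \
        --fp16 \
        --saveEngine=${PROJECT_DIR}/checkpoints/tensorrt/bevformer_base.engine
```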
The following tutorial uses BEVFormer base as an example.
- Evaluate with PyTorch
```shell
cd ${PROJECT_DIR}
# default gpu_id is 0
sh samples/bevformer/base/pth_evaluate.sh -d ${gpu_id}
```

- Evaluate with TensorRT and MMDeploy Plugins
```shell
# convert .pth to .onnx
sh samples/bevformer/base/pth2onnx.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP32)
sh samples/bevformer/base/onnx2trt.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP16)
sh samples/bevformer/base/onnx2trt_fp16.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP32)
sh samples/bevformer/base/trt_evaluate.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16)
sh samples/bevformer/base/trt_evaluate_fp16.sh -d ${gpu_id}

# Quantization
# calibrate and convert .onnx to a TensorRT engine (FP32/INT8)
sh samples/bevformer/base/onnx2trt_int8.sh -d ${gpu_id}
# calibrate and convert .onnx to a TensorRT engine (FP16/INT8)
sh samples/bevformer/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP32/INT8)
sh samples/bevformer/base/trt_evaluate_int8.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16/INT8)
sh samples/bevformer/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# quantization-aware training
# default gpu_ids is 0,1,2,3,4,5,6,7
sh samples/bevformer/base/quant_aware_train.sh -d ${gpu_ids}
# then follow the post-training quantization process above
```

- Evaluate with TensorRT and Custom Plugins
```shell
# nv_half
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP32)
sh samples/bevformer/plugin/base/onnx2trt.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/onnx2trt_fp16.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP32)
sh samples/bevformer/plugin/base/trt_evaluate.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/trt_evaluate_fp16.sh -d ${gpu_id}

# nv_half2
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx_2.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/onnx2trt_fp16_2.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/trt_evaluate_fp16_2.sh -d ${gpu_id}

# Quantization
# nv_half
# calibrate and convert .onnx to a TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8.sh -d ${gpu_id}
# calibrate and convert .onnx to a TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# nv_half2
# calibrate and convert .onnx to a TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16_2.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16_2.sh -d ${gpu_id}
```

This project is mainly based on these excellent open source projects: