This repository deploys BEV 3D detection models (including BEVFormer and BEVDet) on TensorRT, supporting FP32/FP16/INT8 inference. To speed up BEVFormer on TensorRT, the project implements custom TensorRT ops that support nv_half, nv_half2, and INT8. With accuracy almost unaffected, the inference speed of BEVFormer base increases by more than four times, the engine size shrinks by more than 90%, and GPU memory usage drops by more than 80%. The project also supports common 2D object detection models from MMDetection, which can be quantized to INT8 and deployed with TensorRT with only a small number of code changes.
| Model | Data | Batch Size | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|
| BEVFormer tiny download | NuScenes | 1 | NDS: 0.354 mAP: 0.252 | 15.9 | 383 | 2167 | RTX 3090 |
| BEVFormer small download | NuScenes | 1 | NDS: 0.478 mAP: 0.370 | 5.1 | 680 | 3147 | RTX 3090 |
| BEVFormer base download | NuScenes | 1 | NDS: 0.517 mAP: 0.416 | 2.4 | 265 | 5435 | RTX 3090 |
| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|
| BEVFormer tiny | NuScenes | 1 | FP32 | - | NDS: 0.354 mAP: 0.252 | 37.9 (x1) | 136 (x1) | 2159 (x1) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.354 mAP: 0.252 | 69.2 (x1.83) | 74 (x0.54) | 1729 (x0.80) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.353 mAP: 0.249 | 65.1 (x1.72) | 58 (x0.43) | 1737 (x0.80) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.353 mAP: 0.249 | 70.7 (x1.87) | 54 (x0.40) | 1665 (x0.77) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32 | - | NDS: 0.478 mAP: 0.370 | 6.6 (x1) | 245 (x1) | 4663 (x1) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.478 mAP: 0.370 | 12.8 (x1.94) | 126 (x0.51) | 3719 (x0.80) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.476 mAP: 0.367 | 8.7 (x1.32) | 158 (x0.64) | 4079 (x0.87) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.477 mAP: 0.368 | 13.3 (x2.02) | 106 (x0.43) | 3441 (x0.74) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP32 | - | NDS: 0.517 mAP: 0.416 | 1.5 (x1) | 1689 (x1) | 13893 (x1) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517 mAP: 0.416 | 1.8 (x1.20) | 849 (x0.50) | 11865 (x0.85) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.516 mAP: 0.414 | 1.8 (x1.20) | 426 (x0.25) | 12429 (x0.89) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.515 mAP: 0.414 | 2.2 (x1.47) | 244 (x0.14) | 11011 (x0.79) | RTX 3090 |
\* Out of memory when running onnx2trt with TensorRT-8.5.1.7; these models convert successfully with TensorRT-8.4.3.1, so their engines were built with TensorRT-8.4.3.1.
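Given this version sensitivity, it is worth confirming which TensorRT build is active before converting. A minimal check, assuming the TensorRT Python bindings are installed:

```shell
# Print the active TensorRT version (requires the `tensorrt` Python package)
python -c "import tensorrt; print(tensorrt.__version__)"
```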
FP16 Plugins with nv_half (improvements relative to the FP32 baselines in the table above):
| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|
| BEVFormer tiny | NuScenes | 1 | FP32 | - | NDS: 0.354 mAP: 0.252 | 40.0 (x1.06) | 135 (x0.99) | 1693 (x0.78) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.355 mAP: 0.252 | 81.2 (x2.14) | 70 (x0.51) | 1203 (x0.56) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.351 mAP: 0.249 | 90.1 (x2.38) | 58 (x0.43) | 1105 (x0.51) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.351 mAP: 0.249 | 107.4 (x2.83) | 52 (x0.38) | 1095 (x0.51) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32 | - | NDS: 0.478 mAP: 0.370 | 7.4 (x1.12) | 250 (x1.02) | 2585 (x0.55) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.479 mAP: 0.370 | 15.8 (x2.40) | 127 (x0.52) | 1729 (x0.37) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.477 mAP: 0.367 | 17.9 (x2.71) | 166 (x0.68) | 1637 (x0.35) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.476 mAP: 0.366 | 20.4 (x3.10) | 108 (x0.44) | 1467 (x0.31) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP32 | - | NDS: 0.517 mAP: 0.416 | 3.0 (x2.00) | 292 (x0.17) | 5715 (x0.41) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517 mAP: 0.416 | 4.9 (x3.27) | 148 (x0.09) | 3417 (x0.25) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.515 mAP: 0.414 | 6.9 (x4.60) | 202 (x0.12) | 3307 (x0.24) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.514 mAP: 0.413 | 8.0 (x5.33) | 131 (x0.08) | 2429 (x0.17) | RTX 3090 |
FP16 Plugins with nv_half2 (improvements relative to the same FP32 baselines):
| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.355 mAP: 0.251 | 84.2 (x2.22) | 72 (x0.53) | 1205 (x0.56) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.354 mAP: 0.250 | 108.3 (x2.86) | 52 (x0.38) | 1093 (x0.51) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.479 mAP: 0.371 | 18.6 (x2.82) | 124 (x0.51) | 1725 (x0.37) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.477 mAP: 0.368 | 22.9 (x3.47) | 110 (x0.45) | 1487 (x0.32) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517 mAP: 0.416 | 6.6 (x4.40) | 146 (x0.09) | 3415 (x0.25) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.516 mAP: 0.415 | 8.6 (x5.73) | 159 (x0.09) | 2479 (x0.18) | RTX 3090 |
| Model | Data | Batch Size | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|
| BEVDet R50 CBGS | NuScenes | 1 | NDS: 0.38 mAP: 0.298 | 29.0 | 170 | 1858 | RTX 2080Ti |
With the custom plugin bev_pool_v2 (supporting nv_half, nv_half2, and INT8), modified from the official BEVDet:
| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|
| BEVDet R50 CBGS | NuScenes | 1 | FP32 | - | NDS: 0.38 mAP: 0.298 | 44.6 | 245 | 1032 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP16 | - | NDS: 0.38 mAP: 0.298 | 135.1 | 86 | 790 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP32/INT8 | PTQ entropy per-tensor | NDS: 0.355 mAP: 0.274 | 234.7 | 44 | 706 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP16/INT8 | PTQ entropy per-tensor | NDS: 0.357 mAP: 0.277 | 236.4 | 44 | 706 | RTX 2080Ti |
This project also supports common 2D object detection models in MMDetection with little modification. The following are deployment examples of YOLOx and CenterNet.
| Model | Data | Framework | Batch Size | Float/Int | Quantization Method | mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOx download | COCO | PyTorch | 32 | FP32 | - | 0.506 | 63.1 | 379 | 7617 | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP32 | - | 0.506 | 71.3 (x1) | 546 (x1) | 9943 (x1) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP16 | - | 0.506 | 296.8 (x4.16) | 192 (x0.35) | 4567 (x0.46) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP32/INT8 | PTQ entropy per-tensor | 0.488 | 556.4 (x7.80) | 99 (x0.18) | 5225 (x0.53) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP16/INT8 | PTQ entropy per-tensor | 0.479 | 550.6 (x7.72) | 99 (x0.18) | 5119 (x0.51) | RTX 3090 |
| Model | Data | Framework | Batch Size | Float/Int | Quantization Method | mAP | FPS | Size (MB) | Memory (MB) | Device |
|---|---|---|---|---|---|---|---|---|---|---|
| CenterNet download | COCO | PyTorch | 32 | FP32 | - | 0.299 | 337.4 | 56 | 5171 | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP32 | - | 0.299 | 475.6 (x1) | 58 (x1) | 8241 (x1) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP16 | - | 0.297 | 1247.1 (x2.62) | 29 (x0.50) | 5183 (x0.63) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP32/INT8 | PTQ entropy per-tensor | 0.270 | 1534.0 (x3.22) | 20 (x0.34) | 6549 (x0.79) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP16/INT8 | PTQ entropy per-tensor | 0.285 | 1889.0 (x3.97) | 17 (x0.29) | 6453 (x0.78) | RTX 3090 |
```shell
git clone git@github.com:donghe4/BEVFormer_tensorrt.git
cd BEVFormer_tensorrt
PROJECT_DIR=$(pwd)
```

Download the COCO 2017 dataset to /path/to/coco and unzip it.
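For reference, a minimal sketch for fetching COCO 2017 from the official cocodataset.org mirrors (adjust /path/to/coco to your storage):

```shell
# Download the COCO 2017 images and annotations, then unzip everything in place
mkdir -p /path/to/coco && cd /path/to/coco
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip '*.zip'
```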
```shell
cd ${PROJECT_DIR}/data
ln -s /path/to/coco coco
```

Download the nuScenes V1.0 full dataset and the CAN bus expansion data HERE as /path/to/nuscenes and /path/to/can_bus, then prepare the nuScenes data as in BEVFormer.
```shell
cd ${PROJECT_DIR}/data
ln -s /path/to/nuscenes nuscenes
ln -s /path/to/can_bus can_bus
cd ${PROJECT_DIR}
sh samples/bevformer/create_data.sh
```

```
${PROJECT_DIR}/data/.
├── can_bus
│ ├── scene-0001_meta.json
│ ├── scene-0001_ms_imu.json
│ ├── scene-0001_pose.json
│ └── ...
├── coco
│ ├── annotations
│ ├── test2017
│ ├── train2017
│ └── val2017
└── nuscenes
├── maps
├── samples
├── sweeps
    └── v1.0-trainval
```

- Quick start
```shell
bash run_docker.sh
```

- Alternative
```shell
cd ${PROJECT_DIR}
docker run -it --gpus all -v $(pwd):/workspace/BEVFormer_tensorrt/ \
    -v /path/to/can_bus:/workspace/BEVFormer_tensorrt/data/can_bus \
    -v /path/to/coco:/workspace/BEVFormer_tensorrt/data/coco \
    -v /path/to/nuscenes:/workspace/BEVFormer_tensorrt/data/nuscenes \
    --shm-size=16G \
    --privileged \
    --network=host \
    --user root \
    hadonga/bev_trt:1.0 /bin/bash
```

Note: test the Docker image first; if the test succeeds, the installation steps below are unnecessary.
```shell
# Run the unit tests of the custom TensorRT plugins
cd ${PROJECT_DIR}
sh samples/test_trt_ops.sh

# Build and install the custom TensorRT plugins
cd /workspace/BEVFormer_tensorrt/TensorRT/build
cmake .. -DCMAKE_TENSORRT_PATH=/usr
make -j$(nproc)
make install

# Build and install part of the ops in MMDetection3D
cd /workspace/BEVFormer_tensorrt/third_party/bev_mmdet3d
python setup.py build develop --user
```

Download the above PyTorch checkpoints to `${PROJECT_DIR}/checkpoints/pytorch/`. The ONNX files and TensorRT engines will be saved in `${PROJECT_DIR}/checkpoints/onnx/` and `${PROJECT_DIR}/checkpoints/tensorrt/`.
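If these directories don't exist yet, they can be laid out in one step (paths taken from the text above):

```shell
# Create the expected checkpoint directories
mkdir -p ${PROJECT_DIR}/checkpoints/pytorch \
         ${PROJECT_DIR}/checkpoints/onnx \
         ${PROJECT_DIR}/checkpoints/tensorrt
```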
Supported common TensorRT ops in BEVFormer:
- Grid Sampler
- Multi-scale Deformable Attention
- Modulated Deformable Conv2d
- Rotate
- Inverse
- BEV Pool V2
- Flash Multi-Head Attention
Each op is implemented in two versions: FP32/FP16 (nv_half)/INT8 and FP32/FP16 (nv_half2)/INT8.
For specific speed comparison, see Custom TensorRT Plugins.
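Outside the provided scripts, the plugins can also be exercised directly with trtexec. A sketch, assuming the plugin library built above installs as libtensorrt_ops.so under /usr/lib and that an ONNX file named bevformer_base.onnx was exported (both names are assumptions; adjust to your build):

```shell
# Build an FP16 engine from the exported ONNX, loading the custom
# plugin library first (library path/name and ONNX filename assumed).
trtexec --onnx=${PROJECT_DIR}/checkpoints/onnx/bevformer_base.onnx \
        --plugins=/usr/lib/libtensorrt_ops.so \
        --fp16 \
        --saveEngine=${PROJECT_DIR}/checkpoints/tensorrt/bevformer_base.engine
```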
The following tutorial uses BEVFormer base as an example.
- Evaluate with PyTorch
```shell
cd ${PROJECT_DIR}
# default gpu_id is 0
sh samples/bevformer/base/pth_evaluate.sh -d ${gpu_id}
```

- Evaluate with TensorRT and MMDeploy Plugins
```shell
# convert .pth to .onnx
sh samples/bevformer/base/pth2onnx.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP32)
sh samples/bevformer/base/onnx2trt.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP16)
sh samples/bevformer/base/onnx2trt_fp16.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP32)
sh samples/bevformer/base/trt_evaluate.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16)
sh samples/bevformer/base/trt_evaluate_fp16.sh -d ${gpu_id}

# Quantization
# calibrate and convert .onnx to a TensorRT engine (FP32/INT8)
sh samples/bevformer/base/onnx2trt_int8.sh -d ${gpu_id}
# calibrate and convert .onnx to a TensorRT engine (FP16/INT8)
sh samples/bevformer/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP32/INT8)
sh samples/bevformer/base/trt_evaluate_int8.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16/INT8)
sh samples/bevformer/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# quantization-aware training
# default gpu_ids is 0,1,2,3,4,5,6,7
sh samples/bevformer/base/quant_aware_train.sh -d ${gpu_ids}
# then follow the post-training quantization process above
```

- Evaluate with TensorRT and Custom Plugins
```shell
# nv_half
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP32)
sh samples/bevformer/plugin/base/onnx2trt.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/onnx2trt_fp16.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP32)
sh samples/bevformer/plugin/base/trt_evaluate.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/trt_evaluate_fp16.sh -d ${gpu_id}

# nv_half2
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx_2.sh -d ${gpu_id}
# convert .onnx to a TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/onnx2trt_fp16_2.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/trt_evaluate_fp16_2.sh -d ${gpu_id}

# Quantization
# nv_half
# calibrate and convert .onnx to a TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8.sh -d ${gpu_id}
# calibrate and convert .onnx to a TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# nv_half2
# calibrate and convert .onnx to a TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16_2.sh -d ${gpu_id}
# evaluate with the TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16_2.sh -d ${gpu_id}
```

This project is mainly based on these excellent open source projects: