Hi! Thanks for your great work! We are trying to deploy the pretrained model on some edge devices.
We followed the ONNX model export script and obtained the lightstereo-s-sceneflow-general.onnx model. Inspecting it with Netron, we identified a potential optimization.
Here is the original structure of the correlation_volume computation in the ONNX graph.

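For comparison, the common per-disparity formulation looks roughly like the sketch below (our reconstruction for illustration, not the repo's exact code; `correlation_volume_naive` is our name). Each loop iteration slices both features with a different offset, which exports to ONNX as per-disparity Slice/Pad subgraphs with varying shapes that edge runtimes struggle to fuse:

```python
import torch
import torch.nn.functional as F

def correlation_volume_naive(left_feature, right_feature, max_disp):
    # Per-disparity loop: every iteration produces differently-shaped
    # slices, so the exported graph contains max_disp distinct
    # Slice/Mul/ReduceMean/Pad chains instead of one fused pattern.
    b, c, h, w = left_feature.shape
    volume = []
    for i in range(max_disp):
        if i == 0:
            cost = (left_feature * right_feature).mean(dim=1)
        else:
            cost = (left_feature[:, :, :, i:] * right_feature[:, :, :, :-i]).mean(dim=1)
            cost = F.pad(cost, (i, 0, 0, 0))  # zero-fill the invalid left columns
        volume.append(cost)
    return torch.stack(volume, dim=1).contiguous()
```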
We modified the code at link.
```python
import torch
import torch.nn.functional as F

def correlation_volume(left_feature, right_feature, max_disp):
    b, c, h, w = left_feature.shape
    # Pad once on the left, then take shifted views of the same tensor.
    padded_right = F.pad(right_feature, (max_disp, 0, 0, 0))
    cost_volume = torch.stack([
        (left_feature * padded_right[:, :, :, max_disp - i : max_disp + w - i]).mean(dim=1)  # compute similarity
        for i in range(max_disp)
    ], dim=1)
    return cost_volume.contiguous()
```

With this change, LightStereo inference throughput on RKNN and ONNX Runtime improves significantly.
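As a sanity check, the single-pad version can be verified numerically against a straightforward per-disparity reference (a self-contained sketch; `correlation_volume_ref` is our illustrative name, not from the repo):

```python
import torch
import torch.nn.functional as F

def correlation_volume(left_feature, right_feature, max_disp):
    b, c, h, w = left_feature.shape
    padded_right = F.pad(right_feature, (max_disp, 0, 0, 0))
    cost_volume = torch.stack([
        (left_feature * padded_right[:, :, :, max_disp - i : max_disp + w - i]).mean(dim=1)
        for i in range(max_disp)
    ], dim=1)
    return cost_volume.contiguous()

def correlation_volume_ref(left, right, max_disp):
    # Reference: shift the right feature explicitly for each disparity.
    planes = []
    for i in range(max_disp):
        if i == 0:
            shifted = right
        else:
            shifted = torch.zeros_like(right)
            shifted[:, :, :, i:] = right[:, :, :, :-i]
        planes.append((left * shifted).mean(dim=1))
    return torch.stack(planes, dim=1)

torch.manual_seed(0)
l = torch.randn(1, 8, 6, 10)
r = torch.randn(1, 8, 6, 10)
assert torch.allclose(correlation_volume(l, r, 4),
                      correlation_volume_ref(l, r, 4), atol=1e-6)
```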
| orangepi-5-plus-16GB | qps | cpu |
|---|---|---|
| lightstereo(fp16) - origin | 3.7 | 65% |
| lightstereo(fp16) - opt | 9 | 35% |
| lightstereo(fp16) - origin - async | 14 | 210% |
| lightstereo(fp16) - opt - async | 29 | 90% |

| intel-i7-11800H | qps | cpu |
|---|---|---|
| lightstereo(fp16) - origin | 7 | 800% |
| lightstereo(fp16) - opt | 9 | 800% |
However, on NVIDIA devices the original model's inference is already well-optimized thanks to the Myelin optimization engine in TensorRT, making this change redundant there.
| nvidia-3080-laptop | qps | cpu |
|---|---|---|
| lightstereo(fp16) - origin | 388 | 150% |
| lightstereo(fp16) - opt | 370 | 150% |
| lightstereo(fp16) - origin - async | 418 | 170% |
| lightstereo(fp16) - opt - async | 390 | 170% |

| jetson-orin-nx-16GB | qps | cpu |
|---|---|---|
| lightstereo(fp16) - origin | 70 | 65% |
| lightstereo(fp16) - opt | 65 | 70% |
| lightstereo(fp16) - origin - async | 76 | 80% |
| lightstereo(fp16) - opt - async | 69 | 85% |
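For context on methodology, qps figures of this kind can be obtained with a minimal timing loop like the one below (an illustrative sketch, not our exact harness; `measure_qps` is a hypothetical name, and `infer` stands for any zero-argument callable that runs one inference):

```python
import time

def measure_qps(infer, n_warmup=10, n_iters=100):
    # Warm up so caches, JIT compilation, and runtime pools settle
    # before timing begins.
    for _ in range(n_warmup):
        infer()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - t0
    return n_iters / elapsed  # queries per second
```

Async variants pipeline multiple in-flight requests, which raises qps at the cost of higher aggregate CPU load, consistent with the tables above.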
Would it be helpful if I submitted a PR for this? I'd be happy to.
Please let me know if this aligns with the project's direction. I completely understand if this optimization isn’t a priority at the moment.
You can find our implementation and test code at link.