Hi! Thanks for your great work! We are trying to deploy the pretrained model on some edge devices.
We followed the ONNX model export script and obtained the lightstereo-s-sceneflow-general.onnx model. Inspecting it with Netron, we identified a potential optimization.
Here is the original structure of the correlation_volume computation in the ONNX graph.

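For comparison, the common per-disparity formulation looks roughly like the sketch below (our reconstruction for illustration, not the repo's exact code; `correlation_volume_naive` is our name). Each loop iteration slices both features with a different offset, which exports to ONNX as per-disparity Slice/Pad subgraphs with varying shapes that edge runtimes struggle to fuse:

```python
import torch
import torch.nn.functional as F

def correlation_volume_naive(left_feature, right_feature, max_disp):
    # Per-disparity loop: every iteration produces differently-shaped
    # slices, so the exported graph contains max_disp distinct
    # Slice/Mul/ReduceMean/Pad chains instead of one fused pattern.
    b, c, h, w = left_feature.shape
    volume = []
    for i in range(max_disp):
        if i == 0:
            cost = (left_feature * right_feature).mean(dim=1)
        else:
            cost = (left_feature[:, :, :, i:] * right_feature[:, :, :, :-i]).mean(dim=1)
            cost = F.pad(cost, (i, 0, 0, 0))  # zero-fill the invalid left columns
        volume.append(cost)
    return torch.stack(volume, dim=1).contiguous()
```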
We modified the code at link.
```python
import torch
import torch.nn.functional as F

def correlation_volume(left_feature, right_feature, max_disp):
    b, c, h, w = left_feature.shape
    # Pad once on the left, then take shifted views of the same tensor.
    padded_right = F.pad(right_feature, (max_disp, 0, 0, 0))
    cost_volume = torch.stack([
        (left_feature * padded_right[:, :, :, max_disp - i : max_disp + w - i]).mean(dim=1)  # compute similarity
        for i in range(max_disp)
    ], dim=1)
    return cost_volume.contiguous()
```

With this change, LightStereo inference throughput on RKNN and ONNX Runtime improves significantly.
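As a sanity check, the single-pad version can be verified numerically against a straightforward per-disparity reference (a self-contained sketch; `correlation_volume_ref` is our illustrative name, not from the repo):

```python
import torch
import torch.nn.functional as F

def correlation_volume(left_feature, right_feature, max_disp):
    b, c, h, w = left_feature.shape
    padded_right = F.pad(right_feature, (max_disp, 0, 0, 0))
    cost_volume = torch.stack([
        (left_feature * padded_right[:, :, :, max_disp - i : max_disp + w - i]).mean(dim=1)
        for i in range(max_disp)
    ], dim=1)
    return cost_volume.contiguous()

def correlation_volume_ref(left, right, max_disp):
    # Reference: shift the right feature explicitly for each disparity.
    planes = []
    for i in range(max_disp):
        if i == 0:
            shifted = right
        else:
            shifted = torch.zeros_like(right)
            shifted[:, :, :, i:] = right[:, :, :, :-i]
        planes.append((left * shifted).mean(dim=1))
    return torch.stack(planes, dim=1)

torch.manual_seed(0)
l = torch.randn(1, 8, 6, 10)
r = torch.randn(1, 8, 6, 10)
assert torch.allclose(correlation_volume(l, r, 4),
                      correlation_volume_ref(l, r, 4), atol=1e-6)
```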
| orangepi-5-plus-16GB | qps | cpu |
|---|---|---|
| lightstereo(fp16) - origin | 3.7 | 65% |
| lightstereo(fp16) - opt | 9 | 35% |
| lightstereo(fp16) - origin - async | 14 | 210% |
| lightstereo(fp16) - opt - async | 29 | 90% |

| intel-i7-11800H | qps | cpu |
|---|---|---|
| lightstereo(fp16) - origin | 7 | 800% |
| lightstereo(fp16) - opt | 9 | 800% |
However, on NVIDIA devices the original model's inference is already well-optimized thanks to the Myelin optimization engine in TensorRT, making this change redundant there.
| nvidia-3080-laptop | qps | cpu |
|---|---|---|
| lightstereo(fp16) - origin | 388 | 150% |
| lightstereo(fp16) - opt | 370 | 150% |
| lightstereo(fp16) - origin - async | 418 | 170% |
| lightstereo(fp16) - opt - async | 390 | 170% |

| jetson-orin-nx-16GB | qps | cpu |
|---|---|---|
| lightstereo(fp16) - origin | 70 | 65% |
| lightstereo(fp16) - opt | 65 | 70% |
| lightstereo(fp16) - origin - async | 76 | 80% |
| lightstereo(fp16) - opt - async | 69 | 85% |
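For context on methodology, qps figures of this kind can be obtained with a minimal timing loop like the one below (an illustrative sketch, not our exact harness; `measure_qps` is a hypothetical name, and `infer` stands for any zero-argument callable that runs one inference):

```python
import time

def measure_qps(infer, n_warmup=10, n_iters=100):
    # Warm up so caches, JIT compilation, and runtime pools settle
    # before timing begins.
    for _ in range(n_warmup):
        infer()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - t0
    return n_iters / elapsed  # queries per second
```

Async variants pipeline multiple in-flight requests, which raises qps at the cost of higher aggregate CPU load, consistent with the tables above.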
Would it be helpful if I submitted a PR for this? I'd be happy to.
Please let me know if this aligns with the project's direction. I completely understand if this optimization isn’t a priority at the moment.
You can find our implementation and test code at link.