Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to optimise OPAL #291

Merged
merged 8 commits into from
Jan 3, 2025
Merged

Try to optimise OPAL #291

merged 8 commits into from
Jan 3, 2025

Conversation

maks
Copy link
Collaborator

@maks maks commented Jan 2, 2025

As part of trying to improve performance for #283 this PR does some basic optimisation for the OPAL audio rendering code. The main optimisation in the OPAL code is removing function calls from the tight loop by creating a new method for generating samples together for a whole buffer and then further removing more function calls by inlining several small functions. At the cost of 200bytes of ram, the constant LUTs are also forced to be in ram all the time.

For a release builds takes rendering time for 1 Render() method call at default bpm and default OPAL instrument settings from:

Render took: 3838 (0 ts)[268860620]
Render took: 3820 (0 ts)[268860620]
Render took: 3813 (0 ts)[268860620]
Render took: 3764 (0 ts)[268860620]
Render took: 3800 (0 ts)[268860620]
Render took: 3811 (0 ts)[268860620]
Render took: 3822 (0 ts)[268860620]
Render took: 3802 (0 ts)[268860620]
Render took: 3773 (0 ts)[268860620]
Render took: 3827 (0 ts)[268860620]

to:

[OPALINSTRUMENT] Render took: 2832 us [799])
[OPALINSTRUMENT] Render took: 2812 us [799])
[OPALINSTRUMENT] Render took: 2783 us [799])
[OPALINSTRUMENT] Render took: 2829 us [799])
[OPALINSTRUMENT] Render took: 2827 us [799])
[OPALINSTRUMENT] Render took: 2805 us [798])
[OPALINSTRUMENT] Render took: 2786 us [799])
[OPALINSTRUMENT] Render took: 2792 us [799])
[OPALINSTRUMENT] Render took: 2813 us [799])
[OPALINSTRUMENT] Render took: 2840 us [799])
[OPALINSTRUMENT] Render took: 2814 us [799])

so from around avg of 3.8ms to 2.8ms which is a decent if not stellar improvement.

maks added 3 commits December 31, 2024 11:59
removing function calls from tight loops saves 30-40 cycles per
function call on RP2040
reduce opal sample rendering time by sacrificing 200bytes of ram for
const sine, exp LUTs.
@maks maks requested a review from democloid January 2, 2025 02:22
Copy link
Collaborator

@democloid democloid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Wouldn't inlining Sample and Output methods in the current code achieve the same results?

Do you have a sense on how much of this gain is due to the method change vs putting that data in memory? (BTW, wouldn't you achieve not being in flash by not defining it const?)

sources/CMakeLists.txt Outdated Show resolved Hide resolved
sources/Externals/opal/opal.cpp Outdated Show resolved Hide resolved
sources/Externals/opal/opal.cpp Outdated Show resolved Hide resolved
@maks
Copy link
Collaborator Author

maks commented Jan 2, 2025

Very nice!

Wouldn't inlining Sample and Output methods in the current code achieve the same results?

Unfortunately no as the linker complains if you try to call an inline function, in the this case Sample() from another compilation object, which sort of makes sense because the compiler has removed it as its already been inlined (ie. deleted) from its own compilation object.

Do you have a sense on how much of this gain is due to the method change vs putting that data in memory? (BTW, wouldn't you achieve not being in flash by not defining it const?)

Good question! I just reran test with the force into ram macro removed from the LUTs and I get:

[OPALINSTRUMENT] Render took: 3306 us [799])
[OPALINSTRUMENT] Render took: 3292 us [799])
[OPALINSTRUMENT] Render took: 3249 us [799])
[OPALINSTRUMENT] Render took: 3310 us [799])
[OPALINSTRUMENT] Render took: 3337 us [799])
[OPALINSTRUMENT] Render took: 3304 us [799])
[OPALINSTRUMENT] Render took: 3280 us [799])

so moving the LUT data into ram is what gives approx half the performance improvement.

@maks
Copy link
Collaborator Author

maks commented Jan 3, 2025

Wouldn't inlining Sample and Output methods in the current code achieve the same results?

Oops! sorry I see what you mean now, make them inline and then call them from the new SampleBuffer function instead.
Yes that works and just tested gives same perf improvement 👍🏻

and have compiler do the work instead of manually inlining the code
@maks maks merged commit eda27c7 into master Jan 3, 2025
3 checks passed
@maks maks deleted the 283-try-to-optimise-opal branch January 3, 2025 03:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants