[AUDIO_WORKLET] Optimise the copy back from wasm's heap to JS #22753
Nice idea! It looks like this PR carries the rename from the other PR. Rebase/merge should hide it?
A shower thought, as all good ideas are!
Yes, this one was rebased off the earlier one so should have the same commit IDs.
EDIT: every test I tried suggests no, the stack is only used once we hit Wasm in the callback. @juj or someone who knows more about the internals of this than me: on entering To rephrase: unless the audio worklet explicitly uses the stack functions from JS, nothing external from Wasm will before
Some notes: lots of experiments with the stack allocations, minimum sizes, various flags ( Next is to benchmark it.
Benchmarks of the main part of the audio copy done: https://wip.numfum.com/cw/2024-10-29/index.html Testing on my M2 Mac Studio in Chrome and Safari this PR is around 15x faster on the float copy, e.g. the original being 0.625µs per
@juj if we're in agreement that the simplified standalone JavaScript test code is doing the right thing (it's short and a copy and paste), I can gather numbers from regular hardware (we have a wall of Chromebooks at work). Next I'll need to create tests to show that this still works with various input and output configs. EDIT: a 7-12x speed-up seems typical on x64 Windows or Linux.
The tests pass the audio context in a void* for convenience, which needs shortening/widening for 64-bit pointers.
Tests moved to #23394 (since they don't require these changes).
These are the audio worklet tests from #22753 extracted to a standalone PR. The tests are:
- Multiple stereo inputs mixing in the processor to a single stereo output
- Multiple stereo inputs copying in the processor to multiple stereo outputs
- Multiple mono inputs mixing in the processor to a single mono output
- Multiple mono inputs copying in the processor to L+R stereo outputs

The tests use different stack sizes (from 2kB to 6kB depending on the requirement). The audio tracks were composed by Tim Wright especially for Emscripten and released under a CC0 license.
Since this has had a lot of comments and changes since its first commit in October, I've edited the initial comment to contain the findings and results, and merged the tests which landed in main. Besides the comment/error message in
Odd, GH closed this when I renamed the branch to merge into a fork (it's residing in cw-audio-optimised-copy). If there's any interest in this I can re-open another PR.
Short version: this improves the copy back from the audio worklet's heap to JS by 7-12x depending on the browser.
Since we pass in the stack for the worklet from the caller's heap, its address doesn't change. And since the render quantum size doesn't change after the audio worklet creation, the stack positions for the audio buffers do not change either. This optimisation adds one-time subarray views and replaces the float-by-float copy with a simple set() per channel (per output).

To prove this doesn't break anything, tests of the audio worklet API to compare before and after have already landed in #23394, merged here; they can be run using:
A benchmark of the extracted copy before and after is here:
https://wip.numfum.com/cw/2024-10-29/index.html
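For readers who want the shape of the change without opening the benchmark, here is a minimal standalone sketch of the two copy strategies. It is not the code from this PR: `heapF32` is a plain `Float32Array` standing in for the Wasm heap, and `stackPtr`, `copyPerSample`, `createChannelViews` and `copyWithViews` are hypothetical names chosen for illustration. It assumes a fixed render quantum of 128 samples and a heap buffer that is never replaced after the views are created.

```javascript
// Assumed fixed render quantum size, in samples per channel.
const QUANTUM = 128;
// Stand-in for the Wasm heap; in a real build this would be the module's float heap view.
const heapF32 = new Float32Array(4096);

// Before: copy float-by-float from the heap into each output channel.
function copyPerSample(heap, byteOffset, channels) {
  let idx = byteOffset >> 2; // byte address to float index
  for (const channel of channels) {
    for (let i = 0; i < QUANTUM; i++) {
      channel[i] = heap[idx++];
    }
  }
}

// After, step 1 (done once): create one subarray view per channel.
// The views share the heap's buffer, so no sample data is copied here.
function createChannelViews(heap, byteOffset, numChannels) {
  const views = [];
  let idx = byteOffset >> 2;
  for (let c = 0; c < numChannels; c++) {
    views.push(heap.subarray(idx, idx + QUANTUM));
    idx += QUANTUM;
  }
  return views;
}

// After, step 2 (done every callback): one set() per channel, no allocation.
function copyWithViews(views, channels) {
  for (let c = 0; c < channels.length; c++) {
    channels[c].set(views[c]);
  }
}

// Demo: fill two channels' worth of heap data, then compare both strategies.
const stackPtr = 1024; // hypothetical byte address of the audio buffers
const numChannels = 2;
for (let i = 0; i < numChannels * QUANTUM; i++) {
  heapF32[(stackPtr >> 2) + i] = Math.sin(i);
}
const before = [new Float32Array(QUANTUM), new Float32Array(QUANTUM)];
const after = [new Float32Array(QUANTUM), new Float32Array(QUANTUM)];
copyPerSample(heapF32, stackPtr, before);
const views = createChannelViews(heapF32, stackPtr, numChannels);
copyWithViews(views, after);
const same = before.every((ch, c) => ch.every((v, i) => v === after[c][i]));
console.log(same ? "copies match" : "copies differ");
```

Because the views are built once and reused, the per-callback path allocates nothing, which is what makes the approach compatible with a garbage-free render loop.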
This PR fulfils the garbage-free requirement as well as delivering the performance boost. Multiple output buffers are created with addresses at the first entries in the audio worklet's stack, and the number of inputs and outputs can change dynamically (whether or not the predefined buffers are used). There are tests for buffer overflows as well as sanity checks.
(Edited to take the content from the conversation below and turn this from a proposal to a summary.)