-
Notifications
You must be signed in to change notification settings - Fork 5k
[wasm] Jiterpreter multithreading support #88279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @BrzVlad, @kotlarmilos Issue DetailsThis PR contains initial work to make the jiterpreter thread safe. Key changes:
I haven't been able to test any of this with threading enabled because for some reason right now I can't get any of the samples to work right with threading turned on, presumably something bit-rotted. It all seems to work with threads disabled, at least.
|
There are two problems with enabling traces in threaded builds right now:
|
Almost ready for review. Despite what the comment says, I realized it's not actually fine to use the jitcall table for do_jit_call_indirect, we need special handling for that function |
Tagging subscribers to 'arch-wasm': @lewing Issue DetailsThis PR enables jiterpreter traces when threading is enabled by making lots of cchanges:
So far I've tested this with the basic browser threading sample (fibonacci) and the threaded version of the raytracer that pavel and aleksey put together.
|
I'm not sure what the right solution is for the worker interop exports/imports, I was unable to successfully build locally until I removed all the conditional logic around them - the linker was complaining it couldn't find those two functions. I'm not sure it's worth making them conditional since their implementation is guarded and they're small. |
I expect this to cause some small amount of performance regression in single-threaded configurations, so setting NO-MERGE until the perf tooling is all functioning as expected again and we will be able to spot any unexpected changes |
Tried reverting the corebindings changes, but when I do I can't run tests locally no matter what I do. Will see if it works on CI. |
Checkpoint Checkpoint Checkpoint thread safety work Checkpoint Fix build Disable traces in threaded builds due to emscripten compat issue STASH: Debugging atomics issue Checkpoint threading support Allocate jiterp tables in each thread's startup Make jiterpreter counters work cross-thread Checkpoint Checkpoint Checkpoint Checkpoint Basic implementation of trace self-healing so all threads will eventually jit all traces Don't force atomics anymore Checkpoint Cleanup Fix jiterp elapsed time Disable table allocation message unless stats are active Fix build Fix missing linker imports in threaded build Fix do_jit_call dispatcher attempting to use not-yet-allocated JIT tables Don't reuse the jit call wrapper table for do_jit_call dispatcher Raise default table size Fix typo Comment out debug print Fix boost_back_branch_target Silence unhelpful clang warning Silence another warning Address PR feedback: Unify things around Module.getMemory Revert corebindings changes
I will start reading the code in a minute, but before that, few questions on the description.
Is all the atomics introduced here only in place for MT build flavor ? Is there perf or size impact on non-MT build ?
Could you please elaborate, I probably don't have the context to understand what this is about.
We could have "repository" of SharedArrayBuffer blobs by trace index on the main thread and broadcast updates to workers. Probably another PR.
You could install different wrapper to each slot, like
I still want it strongly typed :)
Was this undone already ?
What is "Trace transfer " ? |
Right now most of the atomics are not ifdefed out, but size increase is a good reason to do it, so I'll do it.
We have to pre-reserve tables for each kind of generated code. I haven't been able to figure out why, but the ones for interp_entry - transitions from native code to interpreter - have to be very big. The size I picked actually isn't big enough for System.Text.Json.Tests.
The problem is that we need to reconstruct the correct imports on each thread, so we would need to serialize all of that somewhere while building the trace the first time and stash it. Some constants in the trace could also be thread-local, so we would need to record their locations and patch them.
Yeah, I considered doing this. The problem is that we would now need to either include ~4000 unique dummy functions in the .wasm file (not reasonable) or generate them at startup (also kind of unreasonable)
Yeah, I need to figure this out. Thanks for the reminder.
No. As I mentioned on discord, I decided to stick with stdatomic because we don't have mono_ versions of all the atomics I need for the types I'm using, and adding new mono_ atomics would require me to implement them on every platform. So I decided jiterpreter.c specifically will only use stdatomic
It's an old optimization that predates the jiterp having support for backward branches. It used to be worthwhile but in my testing it doesn't actually do anything anymore, so instead of making it thread-safe I just removed it. |
LibraryTests failure is in hybrid ICU, looks like maybe the browser ICU data doesn't match what we expect it to be |
/azp run runtime-wasm |
Azure Pipelines successfully started running 1 pipeline(s). |
|
||
// Atomically update the hit count. We may lose a race here, but that's not a big | ||
// problem since losing the race indicates that the trace entry point is probably hot. | ||
mono_atomic_cas_i64 (&trace_info->hit_count, new_hit_count, old_hit_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#ifdef DISABLE_THREADS
Do you mean that interp needs to to have many re-entry thunks ? Do they have different signatures ? Do they carry some values in a closure ? Does the number of them grow with number of interpreted methods ? Or with number of jiterp traces ?
Is that pointer to a local stack ? To instance of something that is thread specific ? Could we have TLS getter for jiterp ?
I was not proposing wasm methods, I was proposing to create JS wrappers during start. Functions 4000 per thread is not too bad, they would be each executed just once, and therefore not JITed by the browser. But I don't know, maybe the extra argument doesn't have significant performance hit and it it's not worth it ?
Why you can't just implement wasm and let the other platforms throw? As the methods are new, they would not be used elsewhere until needed. |
Yes, each thunk is dedicated to a specific method. They have different signatures (there's one table per signature) and they close over their target method
Yeah some sort of TLS getter is one potential solution here, but it would make things much slower. Another option is to use a global import for the constant and then specify a different value each time we instantiate the module - I did tests with imported globals for constants before, and while it worked it made traces slower, so it's not enabled right now. The thread local values are things like scratch buffers.
You can't put JS functions directly into the WASM function table right now. You have to first import them into a WASM module and then re-export them, which is what Module.addFunction does. So we would end up needing to instantiate a bunch of modules in order to create these wrappers. The cost of that might not be too bad, but it's complicated to implement and hard to measure the impact.
I guess I could do that, but it makes me uncomfortable :) |
Adjust default table size a bit
Re: earlier questions, |
This PR enables jiterpreter traces when threading is enabled by making lots of changes:
JITERPRETER_xxx [2-byte relative function pointer value] [4-byte trace index]
This means that to run a trace we need to add the value of a global to the function pointer we loaded from memory, but the overhead of that is not too bad. The existing opcode patching model where we reuse the 4-byte storage for different values just was not possible to do in a thread-safe way given WASM's alignment constraints.
So far I've tested this with the basic browser threading sample (fibonacci) and the threaded version of the raytracer that pavel and aleksey put together.