-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Fix embind failing in workers when STACK_OVERFLOW_CHECK is enabled #12366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kripken recently changed this so might have more context
Thanks @tklajnscek ! I think this fixes it and is correct. However I am worried about the Also, I think the embind call should be a few lines lower down, right before the Please also add a test - adding to an existing test to run with the stack check should be enough, as mentioned in #12356 |
Ok great. Had a global already, but wanted to keep it lean 🤣 Will do the changes 1st thing tomorrow. Cheers! |
1d354be
to
975c0ae
Compare
#include <emscripten.h> | ||
|
||
static pthread_t thread; | ||
static int done = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think done
should be an atomic<bool>
or something like that - in which case using C++ is probably better.
I have a few other comments on the test, but reading the original issue, this should reproduce with
emcc -s USE_PTHREADS=1 --bind -s STACK_OVERFLOW_CHECK=2 tests/pthread/test_pthread_create.cpp
So maybe we don't even need a new testcase? How about just doing this:
diff --git a/tests/test_core.py b/tests/test_core.py
index a2a9b7824..a0557575a 100644
--- a/tests/test_core.py
+++ b/tests/test_core.py
@@ -8179,6 +8179,11 @@ NODEFS is no longer included by default; build with -lnodefs.js
self.emcc_args += ['-DPOOL']
test()
+ print('with embind and stack overflow checks (see #12356)')
+ self.set_setting('STACK_OVERFLOW_CHECK', '2')
+ self.emcc_args += ['--bind']
+ test()
+
@node_pthreads
def test_pthread_exceptions(self):
self.set_setting('PTHREAD_POOL_SIZE', '2')
Other than the test, code looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried using that test, pretty much like you wrote it, but the problem is that it gets stuck in a loop and never exists when the thread aborts - probably because the results never come in and the main thread keeps looping. That's why I made a new one with an explicit "timeout". I guess you could make a case for fixing all the pthread tests to actually exit if the threads hard fail... but I guess that's for another PR to keep this to the point.
Let me know if you still think I should change the test.
Oh, as for int done
being atomic - technically you're right, but this is one thread writing with one thread reading and there's no cache/memory order concerns here since it's just a simple flag. So I thought it was unnecessary complexity, but yeah, might be clearer if it's an atomic... I'll change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like the existing test also suffers from the "if the thread aborts it never returns issue".
Given that problem exists today, and (IIUC) it only effects failing tests (since successful runs will always return), I think it would be better to modify the existing test.
Also, the solution of looping for an arbirary number of times in the main loop does not seem great. I think we can up with a better solution to that problem, but that doesn't have to be part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time looking into this... I think the problem is that in library_thread.js in loadWasmModuleToWorker
the worker.onerror
function is defined as:
worker.onerror = function(e) {
err('pthread sent an error! ' + e.filename + ':' + e.lineno + ': ' + e.message);
};
In a normal program when a thread crashes/aborts it takes the whole process down with it, but what's happening here is that it's just getting reported as a benign error and the execution just carries on as if nothing happened. So the main thread is left there waiting for results from a worker thread that never returns.
If I flip that err
to an abort
the tests bail out nicely because an abort fires on the main thread. That's also highly likely the proper thing to do to match the behavior of a native app.
Thoughts?
(I'll run all the tests to see what happens.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave aside the issue of the test hanging if it fails - that's annoying, and we should look into it, but it's separate from this. A test like this should not fail, and is deterministic, so it would be like all the other tests we have already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Let's tackle that separately. Pushed the change to remove the custom test.
@@ -12,6 +12,9 @@ | |||
var threadInfoStruct = 0; // Info area for this thread in Emscripten HEAP (shared). If zero, this worker is not currently hosting an executing pthread. | |||
var selfThreadId = 0; // The ID of this thread. 0 if not hosting a pthread. | |||
var parentThreadId = 0; // The ID of the parent pthread that launched this thread. | |||
#if EMBIND | |||
var initializedJS = false; // Guard variable for one-time init of the JS state (currently only embind types registration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is wrong with re-initializing the state at the start of each thread?
It seems cleaner to me to clear the state when the thread exits an re-create it on thread start, just like we would do for any per-thread state that lives in the linear memory.
Maybe I'm missing something but preserving JS state and then re-using it on another thread seem, in the general case, wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could maybe add a mechanism to "clear" the embind data, and then when a pthread exits call that, then call init when one is started. That would be more complex though - in particular I'm not sure how hard the "clear" would be to do. But I agree if it's practical it would be cleaner, as you say.
(Regardless this PR doesn't change this property - the JS was always initialized once and then the Worker could be reused for more pthreads. This just moves the init to the right place.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems cleaner to me to clear the state when the thread exits an re-create it on thread start, just like we would do for any per-thread state that lives in the linear memory.
The weirdness here comes from the fact that ___embind_register_native_and_builtin_types
is a native function, but it has JS side-effects so the usual "crt is inited on the main thread and then shared directly via SAB with the workers" doesn't apply because the JS side-effects don't get shared.
I disagree that it needs clearing since it's analogous to one-time-crt-init and the JS legitimately lives on for reuse. I think it makes it extra error prone if we clear the embind specific bits and init again on each run as we're just introducing another thing that can fail (the clear), especially when someone doing changes might not be aware of both.
If you really want the clear/init on every run then I'd suggest making it a very formal thing, much like the ATINIT/ATEXIT, but for the workers. That way it's not something tied to this specific case.
… running it after the stack space has been established (which is the correct thing to do anyways)
…/stack_overflow_check.c)
a02280d
to
95e5169
Compare
Rebased to make it re-run the tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thanks @tklajnscek !
This fixes #12356.