Fix embind failing in workers when STACK_OVERFLOW_CHECK is enabled #12366

tklajnscek · 2020-09-29T13:56:21Z

This fixes #12356.

sbc100

@kripken recently changed this so might have more context

src/worker.js

kripken · 2020-09-30T20:22:36Z

Thanks @tklajnscek ! I think this fixes it and is correct. However I am worried about the Object.keys(registeredTypes) stuff, that's looking at embind internals, which it would be good to not look at outside of the embind code. Instead, how about adding a global variable here, initializedJS, that guards this situation?

Also, I think the embind call should be a few lines lower down, right before the try (so it's after we set up the thread status, etc., and it is then valid to run code).

Please also add a test - adding to an existing test to run with the stack check should be enough, as mentioned in #12356

tklajnscek · 2020-09-30T20:26:00Z

Ok great. Had a global already, but wanted to keep it lean 🤣

Will do the changes 1st thing tomorrow. Cheers!

kripken · 2020-10-01T16:51:41Z

tests/core/pthread/stack_overflow_check.c

+#include <emscripten.h>
+
+static pthread_t thread;
+static int done = 0;


I think done should be an atomic<bool> or something like that - in which case using C++ is probably better.

I have a few other comments on the test, but reading the original issue, this should reproduce with

emcc -s USE_PTHREADS=1 --bind -s STACK_OVERFLOW_CHECK=2 tests/pthread/test_pthread_create.cpp

So maybe we don't even need a new testcase? How about just doing this:

diff --git a/tests/test_core.py b/tests/test_core.py index a2a9b7824..a0557575a 100644 --- a/tests/test_core.py +++ b/tests/test_core.py @@ -8179,6 +8179,11 @@ NODEFS is no longer included by default; build with -lnodefs.js self.emcc_args += ['-DPOOL'] test() + print('with embind and stack overflow checks (see #12356)') + self.set_setting('STACK_OVERFLOW_CHECK', '2') + self.emcc_args += ['--bind'] + test() + @node_pthreads def test_pthread_exceptions(self): self.set_setting('PTHREAD_POOL_SIZE', '2')

Other than the test, code looks good!

I actually tried using that test, pretty much like you wrote it, but the problem is that it gets stuck in a loop and never exists when the thread aborts - probably because the results never come in and the main thread keeps looping. That's why I made a new one with an explicit "timeout". I guess you could make a case for fixing all the pthread tests to actually exit if the threads hard fail... but I guess that's for another PR to keep this to the point.

Let me know if you still think I should change the test.

Oh, as for int done being atomic - technically you're right, but this is one thread writing with one thread reading and there's no cache/memory order concerns here since it's just a simple flag. So I thought it was unnecessary complexity, but yeah, might be clearer if it's an atomic... I'll change that.

Sounds like the existing test also suffers from the "if the thread aborts it never returns issue".

Given that problem exists today, and (IIUC) it only effects failing tests (since successful runs will always return), I think it would be better to modify the existing test.

Also, the solution of looping for an arbirary number of times in the main loop does not seem great. I think we can up with a better solution to that problem, but that doesn't have to be part of this PR.

I spent some time looking into this... I think the problem is that in library_thread.js in loadWasmModuleToWorker the worker.onerror function is defined as:

worker.onerror = function(e) { err('pthread sent an error! ' + e.filename + ':' + e.lineno + ': ' + e.message); };

In a normal program when a thread crashes/aborts it takes the whole process down with it, but what's happening here is that it's just getting reported as a benign error and the execution just carries on as if nothing happened. So the main thread is left there waiting for results from a worker thread that never returns.

If I flip that err to an abort the tests bail out nicely because an abort fires on the main thread. That's also highly likely the proper thing to do to match the behavior of a native app.

Thoughts?

(I'll run all the tests to see what happens.)

I think we can leave aside the issue of the test hanging if it fails - that's annoying, and we should look into it, but it's separate from this. A test like this should not fail, and is deterministic, so it would be like all the other tests we have already?

Agreed. Let's tackle that separately. Pushed the change to remove the custom test.

sbc100 · 2020-10-01T17:18:44Z

src/worker.js

@@ -12,6 +12,9 @@
 var threadInfoStruct = 0; // Info area for this thread in Emscripten HEAP (shared). If zero, this worker is not currently hosting an executing pthread.
 var selfThreadId = 0; // The ID of this thread. 0 if not hosting a pthread.
 var parentThreadId = 0; // The ID of the parent pthread that launched this thread.
+#if EMBIND
+var initializedJS = false; // Guard variable for one-time init of the JS state (currently only embind types registration)


What is wrong with re-initializing the state at the start of each thread?

It seems cleaner to me to clear the state when the thread exits an re-create it on thread start, just like we would do for any per-thread state that lives in the linear memory.

Maybe I'm missing something but preserving JS state and then re-using it on another thread seem, in the general case, wrong.

We could maybe add a mechanism to "clear" the embind data, and then when a pthread exits call that, then call init when one is started. That would be more complex though - in particular I'm not sure how hard the "clear" would be to do. But I agree if it's practical it would be cleaner, as you say.

(Regardless this PR doesn't change this property - the JS was always initialized once and then the Worker could be reused for more pthreads. This just moves the init to the right place.)

It seems cleaner to me to clear the state when the thread exits an re-create it on thread start, just like we would do for any per-thread state that lives in the linear memory.

The weirdness here comes from the fact that ___embind_register_native_and_builtin_types is a native function, but it has JS side-effects so the usual "crt is inited on the main thread and then shared directly via SAB with the workers" doesn't apply because the JS side-effects don't get shared.

I disagree that it needs clearing since it's analogous to one-time-crt-init and the JS legitimately lives on for reuse. I think it makes it extra error prone if we clear the embind specific bits and init again on each run as we're just introducing another thing that can fail (the clear), especially when someone doing changes might not be aware of both.

If you really want the clear/init on every run then I'd suggest making it a very formal thing, much like the ATINIT/ATEXIT, but for the workers. That way it's not something tied to this specific case.

… running it after the stack space has been established (which is the correct thing to do anyways)

…/stack_overflow_check.c)

tklajnscek · 2020-10-05T17:10:07Z

Rebased to make it re-run the tests.
I think this is good to go @kripken

kripken

Perfect, thanks @tklajnscek !

sbc100 reviewed Sep 29, 2020

View reviewed changes

src/worker.js Show resolved Hide resolved

tklajnscek force-pushed the embind_pthread_fix branch from 1d354be to 975c0ae Compare October 1, 2020 11:44

kripken reviewed Oct 1, 2020

View reviewed changes

sbc100 reviewed Oct 1, 2020

View reviewed changes

tklajnscek added 3 commits October 5, 2020 14:33

Fix embind failing in workers when STACK_OVERFLOW_CHECK is enabled by…

93a78a0

… running it after the stack space has been established (which is the correct thing to do anyways)

Reused an existing test instead of adding a new test (removed pthread…

c699bcc

…/stack_overflow_check.c)

Fixed test

95e5169

tklajnscek force-pushed the embind_pthread_fix branch from a02280d to 95e5169 Compare October 5, 2020 12:34

kripken approved these changes Oct 5, 2020

View reviewed changes

kripken merged commit 59421d2 into emscripten-core:master Oct 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix embind failing in workers when STACK_OVERFLOW_CHECK is enabled #12366

Fix embind failing in workers when STACK_OVERFLOW_CHECK is enabled #12366

Uh oh!

tklajnscek commented Sep 29, 2020

Uh oh!

sbc100 left a comment

Uh oh!

Uh oh!

kripken commented Sep 30, 2020

Uh oh!

tklajnscek commented Sep 30, 2020

Uh oh!

kripken Oct 1, 2020

Uh oh!

tklajnscek Oct 1, 2020

Uh oh!

sbc100 Oct 1, 2020

Uh oh!

tklajnscek Oct 2, 2020

Uh oh!

kripken Oct 2, 2020

Uh oh!

tklajnscek Oct 2, 2020

Uh oh!

sbc100 Oct 1, 2020

Uh oh!

kripken Oct 1, 2020 •

edited

Loading

Uh oh!

tklajnscek Oct 2, 2020

Uh oh!

tklajnscek commented Oct 5, 2020

Uh oh!

kripken left a comment

Uh oh!

Uh oh!

Fix embind failing in workers when STACK_OVERFLOW_CHECK is enabled #12366

Fix embind failing in workers when STACK_OVERFLOW_CHECK is enabled #12366

Uh oh!

Conversation

tklajnscek commented Sep 29, 2020

Uh oh!

sbc100 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kripken commented Sep 30, 2020

Uh oh!

tklajnscek commented Sep 30, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kripken Oct 1, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tklajnscek commented Oct 5, 2020

Uh oh!

kripken left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kripken Oct 1, 2020 •

edited

Loading