Skip to content

Fix embind failing in workers when STACK_OVERFLOW_CHECK is enabled #12366

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 5, 2020

Conversation

tklajnscek
Copy link
Contributor

This fixes #12356.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kripken recently changed this so might have more context

@kripken
Copy link
Member

kripken commented Sep 30, 2020

Thanks @tklajnscek ! I think this fixes it and is correct. However I am worried about the Object.keys(registeredTypes) stuff, that's looking at embind internals, which it would be good to not look at outside of the embind code. Instead, how about adding a global variable here, initializedJS, that guards this situation?

Also, I think the embind call should be a few lines lower down, right before the try (so it's after we set up the thread status, etc., and it is then valid to run code).

Please also add a test - adding to an existing test to run with the stack check should be enough, as mentioned in #12356

@tklajnscek
Copy link
Contributor Author

Ok great. Had a global already, but wanted to keep it lean 🤣

Will do the changes 1st thing tomorrow. Cheers!

#include <emscripten.h>

static pthread_t thread;
static int done = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think done should be an atomic<bool> or something like that - in which case using C++ is probably better.

I have a few other comments on the test, but reading the original issue, this should reproduce with

emcc -s USE_PTHREADS=1 --bind -s STACK_OVERFLOW_CHECK=2 tests/pthread/test_pthread_create.cpp

So maybe we don't even need a new testcase? How about just doing this:

diff --git a/tests/test_core.py b/tests/test_core.py
index a2a9b7824..a0557575a 100644
--- a/tests/test_core.py
+++ b/tests/test_core.py
@@ -8179,6 +8179,11 @@ NODEFS is no longer included by default; build with -lnodefs.js
     self.emcc_args += ['-DPOOL']
     test()
 
+    print('with embind and stack overflow checks (see #12356)')
+    self.set_setting('STACK_OVERFLOW_CHECK', '2')
+    self.emcc_args += ['--bind']
+    test()
+
   @node_pthreads
   def test_pthread_exceptions(self):
     self.set_setting('PTHREAD_POOL_SIZE', '2')

Other than the test, code looks good!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually tried using that test, pretty much like you wrote it, but the problem is that it gets stuck in a loop and never exists when the thread aborts - probably because the results never come in and the main thread keeps looping. That's why I made a new one with an explicit "timeout". I guess you could make a case for fixing all the pthread tests to actually exit if the threads hard fail... but I guess that's for another PR to keep this to the point.

Let me know if you still think I should change the test.

Oh, as for int done being atomic - technically you're right, but this is one thread writing with one thread reading and there's no cache/memory order concerns here since it's just a simple flag. So I thought it was unnecessary complexity, but yeah, might be clearer if it's an atomic... I'll change that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like the existing test also suffers from the "if the thread aborts it never returns issue".

Given that problem exists today, and (IIUC) it only effects failing tests (since successful runs will always return), I think it would be better to modify the existing test.

Also, the solution of looping for an arbirary number of times in the main loop does not seem great. I think we can up with a better solution to that problem, but that doesn't have to be part of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time looking into this... I think the problem is that in library_thread.js in loadWasmModuleToWorker the worker.onerror function is defined as:

worker.onerror = function(e) {
  err('pthread sent an error! ' + e.filename + ':' + e.lineno + ': ' + e.message);
};

In a normal program when a thread crashes/aborts it takes the whole process down with it, but what's happening here is that it's just getting reported as a benign error and the execution just carries on as if nothing happened. So the main thread is left there waiting for results from a worker thread that never returns.

If I flip that err to an abort the tests bail out nicely because an abort fires on the main thread. That's also highly likely the proper thing to do to match the behavior of a native app.

Thoughts?

(I'll run all the tests to see what happens.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave aside the issue of the test hanging if it fails - that's annoying, and we should look into it, but it's separate from this. A test like this should not fail, and is deterministic, so it would be like all the other tests we have already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Let's tackle that separately. Pushed the change to remove the custom test.

@@ -12,6 +12,9 @@
var threadInfoStruct = 0; // Info area for this thread in Emscripten HEAP (shared). If zero, this worker is not currently hosting an executing pthread.
var selfThreadId = 0; // The ID of this thread. 0 if not hosting a pthread.
var parentThreadId = 0; // The ID of the parent pthread that launched this thread.
#if EMBIND
var initializedJS = false; // Guard variable for one-time init of the JS state (currently only embind types registration)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is wrong with re-initializing the state at the start of each thread?

It seems cleaner to me to clear the state when the thread exits an re-create it on thread start, just like we would do for any per-thread state that lives in the linear memory.

Maybe I'm missing something but preserving JS state and then re-using it on another thread seem, in the general case, wrong.

Copy link
Member

@kripken kripken Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could maybe add a mechanism to "clear" the embind data, and then when a pthread exits call that, then call init when one is started. That would be more complex though - in particular I'm not sure how hard the "clear" would be to do. But I agree if it's practical it would be cleaner, as you say.

(Regardless this PR doesn't change this property - the JS was always initialized once and then the Worker could be reused for more pthreads. This just moves the init to the right place.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems cleaner to me to clear the state when the thread exits an re-create it on thread start, just like we would do for any per-thread state that lives in the linear memory.

The weirdness here comes from the fact that ___embind_register_native_and_builtin_types is a native function, but it has JS side-effects so the usual "crt is inited on the main thread and then shared directly via SAB with the workers" doesn't apply because the JS side-effects don't get shared.

I disagree that it needs clearing since it's analogous to one-time-crt-init and the JS legitimately lives on for reuse. I think it makes it extra error prone if we clear the embind specific bits and init again on each run as we're just introducing another thing that can fail (the clear), especially when someone doing changes might not be aware of both.

If you really want the clear/init on every run then I'd suggest making it a very formal thing, much like the ATINIT/ATEXIT, but for the workers. That way it's not something tied to this specific case.

… running it after the stack space has been established (which is the correct thing to do anyways)
@tklajnscek
Copy link
Contributor Author

Rebased to make it re-run the tests.
I think this is good to go @kripken

Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks @tklajnscek !

@kripken kripken merged commit 59421d2 into emscripten-core:master Oct 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stack overflow check broken when used with pthreads and embind in 2.0.3 onwards
3 participants