Port kleisauke's pthreads+printf fix #13837


Closed · wants to merge 1 commit

Conversation

@kripken (Member) commented Apr 6, 2021

See d5d5f69 and #13194 (comment)

The key fixes are to initialize multithreading in musl properly, and also to
build libc_rt_wasm with pthreads support (for the latter, we could instead
move the locking code out of that library and into libc, but this is safer).

The test is verified as failing before this change and passing with it, both
in the musl internals it prints and in the output interleaving: without the
fix the test shows interleaved output with high probability, but with the fix
it never did in my testing.

Fixes #13194
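
For context, the failure mode the test targets can be illustrated with a hypothetical reduction like the one below (this is not the PR's actual test; the worker/main structure here is made up for illustration): two threads printing concurrently should never interleave within a line once stdio locking works.

```c
// Hypothetical reduction (not the PR's test): without working file
// locks, the two threads' lines can interleave mid-line; with the fix,
// each printf's output comes out intact.
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
  for (int i = 0; i < 1000; i++)
    printf("%s: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n", (const char *)arg);
  return NULL;
}

int main(void) {
  pthread_t a, b;
  pthread_create(&a, NULL, worker, "thread-A");
  pthread_create(&b, NULL, worker, "thread-B");
  pthread_join(a, NULL);
  pthread_join(b, NULL);
  return 0;
}
```

Built with something like `emcc -pthread`, interleaved output would indicate the missing locks.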

@kripken requested review from sbc100 and kleisauke, April 6, 2021 23:32
@@ -24,12 +25,28 @@ EM_JS(void, initPthreadsJS, (void), {
PThread.initRuntime();
})

static void init_file_lock(FILE *f)
Collaborator:

Can you add a comment along the lines of "This code is copied from musl/src/thread/pthread_create.c. Keep in sync"?

Same below where init_file_lock is called.
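
For reference, the musl code being pointed to looks roughly like the following (paraphrased from musl 1.1.x's src/thread/pthread_create.c; the exact upstream code differs slightly and changes between versions). A FILE's lock starts at -1, meaning "single-threaded, skip locking"; creating the first thread arms the locks and flips libc.threaded:

```c
/* Paraphrased from musl's src/thread/pthread_create.c (not exact). */
static void init_file_lock(FILE *f)
{
	if (f && f->lock < 0) f->lock = 0;
}

/* ...in pthread_create(), before the first thread is started: */
if (!libc.threaded) {
	for (FILE *f = *__ofl_lock(); f; f = f->next)
		init_file_lock(f);
	__ofl_unlock();
	init_file_lock(__stdin_used);
	init_file_lock(__stdout_used);
	init_file_lock(__stderr_used);
	libc.threaded = 1;
}
```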

#include <stdlib.h>
#include "../../system/lib/libc/musl/include/features.h"
#include "../../system/lib/libc/musl/src/internal/stdio_impl.h"
#include "../../system/lib/libc/musl/src/internal/libc.h"
Collaborator:

IIRC, including these musl-internal headers from code other than musl-internal code is not supported, and it no longer works with the musl upgrade I'm looking at landing soon.

Any way we can avoid this? I guess not... I can look into a workaround when I next loop back around.

Collaborator:

It can be avoided by removing those printf statements at the beginning of main(); I'm not sure if those are useful.

Another testcase that appears to do this is tests/pthread/test_pthread_locale.c:

#include "../../system/lib/libc/musl/src/internal/pthread_impl.h"

@kleisauke (Collaborator) left a comment:

Thanks! The test suite probably fails due to a possible deadlock in test_pthread_exit_process, which can be avoided by including commit ffe2696.

Comment on lines +28 to +29 (Collaborator):

Suggested change:

- printf("stdin lock: %d\n", stderr->lock);
- printf("stdout lock: %d\n", stdin->lock);
+ printf("stdin lock: %d\n", stdin->lock);
+ printf("stdout lock: %d\n", stdout->lock);

@kripken (Member, Author) commented Apr 7, 2021

@kleisauke wrote:

> Thanks! The test suite probably fails due to a possible deadlock in test_pthread_exit_process, which can be avoided by including commit ffe2696.

Reading ffe2696, I'm not sure I understand it. Why does it say the main runtime thread does not need to acquire file locks? The musl stdio data structures can be accessed on any thread, so the main thread must lock them to guard against a race with a pthread?

(It's only the syscalls that are proxied to the main thread, and so the main thread is special there. But __lockfile is called from e.g. fputc() which can happen on any thread, so even the main thread must lock there IIUC.)

If that commit prevents a deadlock, then that seems to confirm that another thread is holding the lock, and that there is a possible race?
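
For background on why every thread, main included, must take the lock: musl's stdio wraps each operation in lock/unlock macros, roughly as below (paraphrased from musl's src/internal/stdio_impl.h; the real macros differ slightly). A negative f->lock means the process is single-threaded and locking is skipped; otherwise every caller locks, whichever thread it runs on:

```c
/* Paraphrased from musl's stdio_impl.h (not the exact macros). */
#define FLOCK(f) int __need_unlock = ((f)->lock >= 0 ? __lockfile((f)) : 0)
#define FUNLOCK(f) do { if (__need_unlock) __unlockfile((f)); } while (0)

/* e.g. fputc-like functions do, in effect:
 *   FLOCK(f); ...write the byte... FUNLOCK(f);
 * so a main-thread fputc and a pthread fputc contend on the same lock. */
```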

@kleisauke (Collaborator) commented Apr 8, 2021

Ah yes, you're right. I probably got confused; with that commit I meant to check whether the current "process" is single-threaded (i.e. when all threads have exited), similar to this musl check:
https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c?id=cfdfd5ea3ce14c6abf7fb22a531f3d99518b5a1b#n137

The wasm2.test_pthread_exit_process deadlock seems to be caused by the proxied main thread, see for example this log:

'PROXY_TO_PTHREAD' tid: 10490864
'thread_main_exit' tid: 12589328
main tid: 5247192
__lockfile(): tid: 10490864 f->lock: 0 stack: Error
    at 1885 (eval at globalEval (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2832), <anonymous>:1:17202)
    at _emscripten_asm_const_int (eval at globalEval (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2832), <anonymous>:1:29252)
    at <anonymous>:wasm-function[69]:0x48bb
    at <anonymous>:wasm-function[68]:0x46e6
    at main (<anonymous>:wasm-function[24]:0x81c)
    at Module._main (eval at globalEval (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2832), <anonymous>:1:50943)
    at ___call_main (eval at globalEval (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2832), <anonymous>:1:27574)
    at <anonymous>:wasm-function[54]:0x19ab
    at Object.invokeEntryPoint (eval at globalEval (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2832), <anonymous>:1:26929)
    at onmessage (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:1430)
main
__lockfile(): tid: 12589328 f->lock: 0 stack: Error
    at 1885 (eval at globalEval (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2832), <anonymous>:1:17202)
    at _emscripten_asm_const_int (eval at globalEval (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2832), <anonymous>:1:29252)
    at <anonymous>:wasm-function[69]:0x48bb
    at <anonymous>:wasm-function[68]:0x46e6
    at <anonymous>:wasm-function[23]:0x804
    at Object.invokeEntryPoint (eval at globalEval (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2832), <anonymous>:1:26929)
    at onmessage (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:1430)
    at MessagePort.<anonymous> (/tmp/emscripten_test/test_pthread_exit_runtime.worker.js:1:2633)
    at MessagePort.[nodejs.internal.kHybridDispatch] (node:internal/event_target:459:20)
    at MessagePort.exports.emitMessage (node:internal/per_context/messageport:23:28)
calling exit
__lockfile(): tid: 5247192 f->lock: 10490864 stack: Error
    at 1885 (/tmp/emscripten_test/test_pthread_exit_runtime.js:1:17202)
    at _emscripten_asm_const_int (/tmp/emscripten_test/test_pthread_exit_runtime.js:1:29252)
    at <anonymous>:wasm-function[69]:0x48bb
    at fflush (<anonymous>:wasm-function[74]:0x4a29)
    at fflush (<anonymous>:wasm-function[74]:0x4a4e)
    at Module._fflush (/tmp/emscripten_test/test_pthread_exit_runtime.js:1:51424)
    at flush_NO_FILESYSTEM (/tmp/emscripten_test/test_pthread_exit_runtime.js:1:43269)
    at exitRuntime (/tmp/emscripten_test/test_pthread_exit_runtime.js:1:12452)
    at exit (/tmp/emscripten_test/test_pthread_exit_runtime.js:1:57985)
    at Worker.worker.onmessage (/tmp/emscripten_test/test_pthread_exit_runtime.js:1:25476)

Interpreting that, it looks like the file lock is still held by the proxied thread when the main thread is shut down. I could work around this by inheriting the tid of the main thread within the proxied thread; see for example commit 894b964 (I'm not sure what the best approach is to resolve this).

@kripken (Member, Author) commented Apr 8, 2021

Interesting, thanks @kleisauke - it looks like what's happening in your traces is that flush_NO_FILESYSTEM calls fflush(), and there is still a pthread (the proxied main from PROXY_TO_PTHREAD) holding the lock.

I don't think we can share the tid between the main runtime thread and the proxied main. The proxied main is just a normal pthread, and it and the main runtime thread need to lock if they access the same things.

__lockfile calls __wait, and __wait has special code to handle this type of thing: the main runtime thread will call emscripten_main_thread_process_queued_calls(), which would let a proxied call from the proxied main complete.

A deadlock then seems to indicate that more than one lock is involved - say, one thread holding the lock on stdout and another on stderr, or some other lock. Does the new code enabled by libc.threaded = 1; perhaps add any new locks? Or maybe newer musl fixes a bug here, so this can't be backported?
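
For context, the special handling in __wait referred to above looks roughly like the sketch below (a simplified paraphrase of emscripten's system/lib/libc/musl/src/thread/__wait.c, not the exact code; a_inc/a_dec are musl's atomic helpers, and the 100 ms timeout is illustrative):

```c
// Simplified paraphrase of emscripten's __wait (not the exact code).
// The main runtime thread must not block in a futex wait, or proxied
// calls from other threads would never be serviced.
#include <emscripten/threading.h>

void __wait(volatile int *addr, volatile int *waiters, int val, int priv)
{
	if (waiters) a_inc(waiters);
	while (*addr == val) {
		if (emscripten_is_main_runtime_thread()) {
			// Service proxied calls so that whichever thread holds
			// the lock can make progress, then re-check the lock word.
			emscripten_main_thread_process_queued_calls();
		} else {
			// Other threads can block (with a timeout) until woken.
			emscripten_futex_wait(addr, val, 100 /* ms */);
		}
	}
	if (waiters) a_dec(waiters);
}
```

The relanding commit quoted later in this thread describes fixing exactly this spot: __wait was blocking the main thread forever rather than looping and calling emscripten_main_thread_process_queued_calls.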

@kleisauke (Collaborator):

Indeed, sharing the tid between the main thread and the proxied main is not a good solution. I could resolve the above-mentioned test hang with commit 8c1fd43. This is probably because the proxied main thread is shut down while still holding the lock, causing the fflush / __wait call to wait indefinitely for a thread that has already been terminated (though I couldn't verify this properly).

Cherry-picking that commit onto this PR doesn't resolve the test hang in asan.test_pthread_exit_process locally (unfortunately), but I can confirm that also applying musl commits c21f750727515602a9e84f2a190ee8a0a2aeb2a1 and 0db393d3a77bb9f300a356c6a5484fc2dddb161d seems to resolve it.

@kripken (Member, Author) commented Apr 26, 2021

Thanks @kleisauke !

What do we think of this PR, given that information? Is there still value in cherry-picking those changes before the full musl upgrade? This PR was originally meant to be a small forward port, but since we would need to forward-port multiple things, it feels less and less clean and more and more risky to me...

@kleisauke (Collaborator):

I just opened PR #14481 to resolve the test hang. Indeed, it feels a bit risky to cherry-pick those commits, so it's probably better to wait for the full musl upgrade.

@kripken (Member, Author) commented Jun 18, 2021

Thanks @kleisauke , that makes sense. Closing this then.

@kripken closed this Jun 18, 2021
@kripken deleted the thread branch June 18, 2021 14:28
@sbc100 restored the thread branch September 17, 2021 13:47
sbc100 added a commit that referenced this pull request Sep 17, 2021
This is a second attempt at landing a version of d5d5f69.

The first time we tried it (#13837) we ran into issues with
test_pthread_exit_process deadlocking, which I tracked down to an issue
with `system/lib/libc/musl/src/thread/__wait.c` where it was blocking
the main thread forever rather than looping and calling
`emscripten_main_thread_process_queued_calls`.

Fixes: #13194
sbc100 added a commit that referenced this pull request Sep 17, 2021
This is a second attempt at landing a version of d5d5f69.

The first time we tried it (#13837) we ran into issues with
test_pthread_exit_process deadlocking, which I tracked down to an issue
with `system/lib/libc/musl/src/thread/__wait.c` where it was blocking
the main thread forever rather than looping and calling
`emscripten_main_thread_process_queued_calls`.

Includes a version of #14481, so that should land before this does.

Fixes: #13194
@sbc100 (Collaborator) commented Sep 17, 2021

I tracked down the deadlock to an issue with __lock.c: #15065

@kleisauke (Collaborator):

Great catch! I think that is probably why musl commit c21f750727515602a9e84f2a190ee8a0a2aeb2a1 seemed to resolve it: in that case it doesn't call __wait. I don't know how I overlooked that.

sbc100 pushed a commit that referenced this pull request Sep 20, 2021
Although not defined by the POSIX standard, it seems that on most
systems, atexit handlers are run before threads are terminated.

Helps: #13837
Fixes: #14993
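
A minimal illustration of that behavior, as a hypothetical standalone test (not the actual one from #14481): exit() should run the atexit handler before the still-running worker thread is terminated.

```c
// Hypothetical illustration (not the test from #14481): on most
// systems the atexit handler runs before exit() terminates the
// still-running worker thread.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void handler(void) { printf("atexit handler ran\n"); }

static void *worker(void *arg) {
  (void)arg;
  for (;;) sleep(1);  // stay alive until exit() tears the process down
  return NULL;
}

int main(void) {
  atexit(handler);
  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  sleep(1);  // give the worker time to start
  exit(0);   // expected: "atexit handler ran" is printed first
}
```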
@sbc100 deleted the thread branch November 30, 2021 04:43
Linked issue: #13194 (fprintf is non-thread-safe)