Description
There is a race condition that results in a deadlock, which seems to trace back to the `em_proxying_queue` / `em_task_queue` rewrite of the proxying mechanism that @tlively landed in #15737.
It can be observed on the CI as a flaky `browser.test_pthread_cancel` test. Running a short ad hoc Python script that repeats `test/runner browser.test_pthread_cancel` a hundred times, I get 19/100 failures, or 19.0%.
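For anyone who wants to reproduce the failure rate, a minimal sketch of such a rerun script could look like the following. This is not the script I used verbatim: the helper names `count_failures` and `format_summary` and the 300 second timeout are assumptions for illustration, and a run that times out is counted as a failure because the hang shows up as a timeout.

```python
import subprocess
import sys


def count_failures(cmd, runs, timeout_s=300):
    """Run `cmd` `runs` times and return how many runs failed or timed out."""
    failures = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,  # assumed limit; a hung browser run counts as a failure
            )
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            failed = True
        if failed:
            failures += 1
    return failures


def format_summary(failures, runs):
    """Format the tally the same way as the numbers quoted above."""
    return f"{failures}/{runs} failures, or {100.0 * failures / runs:.1f}%"
```

Run from the root of an Emscripten checkout, `print(format_summary(count_failures(["test/runner", "browser.test_pthread_cancel"], 100), 100))` would print the tally. The exact way to invoke `test/runner` may differ per platform, so treat that command list as an assumption.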
When the race condition occurs, the browser hangs with the main thread deadlocked in the following call stack:
Script terminated by timeout at:
_emscripten_get_now
test.wasm.futex_wait_main_browser_thread
test.wasm.emscripten_futex_wait
test.wasm.__timedwait_cp
test.wasm.__timedwait
test.wasm.__pthread_mutex_timedlock
test.wasm.__pthread_mutex_lock // 6
test.wasm.emscripten_proxy_execute_queue
test.wasm.emscripten_current_thread_process_queued_calls
test.wasm.emscripten_main_thread_process_queued_calls // 5
test.wasm._emscripten_yield
test.wasm.emscripten_futex_wait
test.wasm.__timedwait_cp
test.wasm.__timedwait
test.wasm.__pthread_mutex_timedlock
test.wasm.__pthread_mutex_lock // 4
test.wasm.emscripten_builtin_malloc // 3
test.wasm.em_task_queue_create // 2
test.wasm.get_or_add_tasks_for_thread
test.wasm.do_proxy
test.wasm.emscripten_proxy_async
test.wasm.pthread_kill
test.wasm.pthread_cancel // 1
test.wasm.__original_main
test.wasm.main
callMain
doRun
run/<
I.e.
- (`// 1`) the main thread has decided to `pthread_cancel()` the pthread in the test.
- (`// 2`) nested inside `pthread_cancel()`, the main thread decides to allocate a new proxying queue for the pthread.
- (`// 3`) it calls `malloc()`
- (`// 4`) that tries to take the malloc lock
- (`// 5`) while waiting for the malloc lock, the main thread decides to run operations proxied to it
- (`// 6`) running proxied operations involves taking a lock again, which never returns: the main thread is hung.
I do not understand exactly why the main thread hangs when attempting to acquire the lock at step 6 (`// 6` in the call stack).
Two hypotheses come to mind:
a) Is the main thread attempting to acquire the very proxying queue lock that it is in the process of allocating at `// 2`? (and so the proxying queue is not 'well-formed' yet, or the main thread already implicitly holds that lock?)
b) Or does the pthread hold the proxying queue lock for some reason while it is waiting for something from the main thread, and the main thread is attempting to acquire that queue lock?
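To make the two hypotheses concrete, here is a small Python model of each locking pattern. It is only a sketch of the general shapes, not Emscripten code: `queue_lock` stands in for the proxying queue lock, the function names are made up, and a timeout stands in for the real, unbounded hang so that the sketch terminates. It does not tell us which of the two (if either) is what actually happens.

```python
import threading


def reentrant_acquire_blocks(timeout_s=0.2):
    """Hypothesis (a): a thread tries to re-acquire a non-recursive lock it already holds."""
    queue_lock = threading.Lock()  # non-recursive, like a default pthread mutex
    with queue_lock:
        # The second acquisition on the same thread can never succeed;
        # the timeout replaces what would be an endless wait.
        got_it = queue_lock.acquire(timeout=timeout_s)
        if got_it:
            queue_lock.release()
    return not got_it


def circular_wait_blocks(timeout_s=0.2):
    """Hypothesis (b): a worker holds the lock while waiting on the main thread."""
    queue_lock = threading.Lock()
    worker_has_lock = threading.Event()
    main_done = threading.Event()

    def worker():
        with queue_lock:
            worker_has_lock.set()
            # Waits for something that only the main thread can provide.
            main_done.wait()

    thread = threading.Thread(target=worker)
    thread.start()
    worker_has_lock.wait()
    # The main thread needs the same lock before it could ever signal the worker.
    got_it = queue_lock.acquire(timeout=timeout_s)
    if got_it:
        queue_lock.release()
    # Break the cycle artificially so the model exits instead of hanging.
    main_done.set()
    thread.join()
    return not got_it
```

Both functions return `True` when the modelled acquisition is stuck. In the real test there is no timeout and nothing breaks the cycle, which would match the main thread never returning from the lock at `// 6`.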
Not quite sure. When I try to follow the logic in the pthread proxying implementation, it is now rather alien to me, as it has been rewritten around different concepts, so I feel very unfamiliar with the internals at this point.