Pthread task queue deadlock race condition problem (with pthread_cancel()?) #24570

There is a race condition that results in a deadlock, which seems to trace back to the `em_proxying_queue` / `em_task_queue` rewrite of the proxying mechanism that @tlively landed in #15737.

It can be observed on the CI as a flaky `browser.test_pthread_cancel` test. Running a short ad hoc Python script that repeats `test/runner browser.test_pthread_cancel` a hundred times, I get 19/100 failures, or 19.0%.
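A rerun script along these lines is enough to measure the failure rate. This is a minimal sketch, not the exact script used above: the helper names `count_failures` and `report` and the `timeout_s` parameter are made up for illustration, and it assumes a non-zero exit code (or a timeout) means the test failed.

```python
import subprocess


def count_failures(cmd, runs, timeout_s=None):
    """Run `cmd` `runs` times and return how many runs failed or timed out."""
    failures = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            # Assumption: a run that exceeds the timeout counts as a failure.
            failed = True
        failures += int(failed)
    return failures


def report(cmd, runs, timeout_s=None):
    """Format the failure count as 'N/M failures, or P%'."""
    failures = count_failures(cmd, runs, timeout_s)
    return f"{failures}/{runs} failures, or {100.0 * failures / runs:.1f}%"
```

From the root of an Emscripten checkout, `print(report(["test/runner", "browser.test_pthread_cancel"], 100))` would then print a line in the same form as the figure quoted above.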

When the race condition occurs, the browser hangs with the main thread deadlocked in the following call stack:

```
Script terminated by timeout at:
_emscripten_get_now
test.wasm.futex_wait_main_browser_thread
test.wasm.emscripten_futex_wait
test.wasm.__timedwait_cp
test.wasm.__timedwait
test.wasm.__pthread_mutex_timedlock
test.wasm.__pthread_mutex_lock // 6
test.wasm.emscripten_proxy_execute_queue
test.wasm.emscripten_current_thread_process_queued_calls
test.wasm.emscripten_main_thread_process_queued_calls // 5
test.wasm._emscripten_yield
test.wasm.emscripten_futex_wait
test.wasm.__timedwait_cp
test.wasm.__timedwait
test.wasm.__pthread_mutex_timedlock
test.wasm.__pthread_mutex_lock // 4
test.wasm.emscripten_builtin_malloc // 3
test.wasm.em_task_queue_create  // 2
test.wasm.get_or_add_tasks_for_thread
test.wasm.do_proxy
test.wasm.emscripten_proxy_async
test.wasm.pthread_kill
test.wasm.pthread_cancel        // 1
test.wasm.__original_main
test.wasm.main
callMain
doRun
run/<
```

I.e.
1. main thread has decided to `pthread_cancel()` the pthread in the test.
2. nested inside `pthread_cancel()`, main thread decides to allocate a new proxying queue for the pthread.
3. it calls `malloc()`
4. that tries to take the malloc lock, which is contended, so the main thread has to wait for it
5. while waiting for the malloc lock, the main thread decides to run operations proxied to it
6. running proxied operations involves taking another lock, and that call never returns: the main thread is hung.

I do not understand exactly why the main thread is hanging on attempting to acquire the lock at step 6.

Two hypotheses come to mind:
a) is the main thread attempting to acquire the lock of the very proxying queue that it is in the process of allocating at frame `// 2`? (so the proxying queue is not 'well-formed' yet, or the main thread already implicitly holds that lock?)
b) or does the pthread hold the proxying queue lock for some reason while it waits for something from the main thread, and the main thread is meanwhile attempting to acquire that queue lock?
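To make the two hypotheses concrete, here is a toy model of each in plain Python threading. It is only a sketch of the locking patterns, not of the actual Emscripten proxying code: `queue_lock` stands in for whatever mutex guards the queue, the function names are invented, and the timeouts exist only so that the model returns instead of hanging the way the real main thread does.

```python
import threading


def reacquire_on_same_thread(timeout_s=0.1):
    """Hypothesis (a): a thread re-locks a non-recursive lock it already holds."""
    queue_lock = threading.Lock()  # non-recursive stand-in for the queue mutex
    queue_lock.acquire()
    try:
        # A plain blocking lock would wait forever here; the timeout lets
        # the toy model report the failure instead.
        got_it = queue_lock.acquire(timeout=timeout_s)
        if got_it:
            queue_lock.release()
        return got_it
    finally:
        queue_lock.release()


def worker_holds_lock_while_waiting(timeout_s=0.1):
    """Hypothesis (b): a worker holds the lock while waiting on the main thread."""
    queue_lock = threading.Lock()
    worker_has_lock = threading.Event()
    main_replied = threading.Event()

    def worker():
        with queue_lock:
            worker_has_lock.set()
            main_replied.wait()  # blocked on the main thread, lock still held

    thread = threading.Thread(target=worker)
    thread.start()
    worker_has_lock.wait()
    # The main thread now wants the same lock and cannot get it.
    got_it = queue_lock.acquire(timeout=timeout_s)
    if got_it:
        queue_lock.release()
    main_replied.set()  # break the cycle so the model terminates
    thread.join()
    return got_it
```

Both functions return `False`, meaning the lock could not be acquired. In either scenario, an untimed lock call like the one at frame `// 6` would never return.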

I am not quite sure. When I try to follow the logic in the pthread proxying implementation, it is now rather alien to me: it has been rewritten around different concepts, so I feel very unfamiliar with the internals at this point.

