
[Multithreading] PROXY_TO_PTHREAD & MAIN_THREAD_EM_ASM causes perf degradation #22570

Open
ravisumit33 opened this issue Sep 15, 2024 · 16 comments


ravisumit33 (Contributor) commented Sep 15, 2024

Please include the following in your bug report:

Version of emscripten/emsdk:

emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.56 (cf90417346b78455089e64eb909d71d091ecc055)
clang version 19.0.0git (https://github.com/llvm/llvm-project 34ba90745fa55777436a2429a51a3799c83c6d4c)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: ~/emsdk/upstream/bin

I am trying to port my application from a single-threaded to a multi-threaded environment. I cannot guarantee the maximum number of threads my application will need at a time, so I settled on PROXY_TO_PTHREAD. In single-threaded mode, my application worked like this:

  1. The C++ main function does some initialization. After main exits, we keep the runtime alive.
  2. We have exposed a C++ function to process events coming from the UI.

To port this architecture to a multi-threaded environment, I used PROXY_TO_PTHREAD to create a proxied main thread and kept that thread alive for further processing. I use the proxying API to forward events coming from the UI to this detached thread. Once the work is done, this thread calls MAIN_THREAD_EM_ASM to send the response back to the main application thread. This is the only MAIN_THREAD_EM_ASM call the detached thread makes; the rest is plain C++ execution without waiting on anything else.
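For context, a stripped-down sketch of this setup looks roughly like the following (not my actual code; names such as process_event, handle_event, and Module.onResponse are placeholders):

```cpp
// Build (roughly): emcc app.cpp -pthread -sPROXY_TO_PTHREAD -sEXPORTED_FUNCTIONS=_main,_process_event
#include <emscripten.h>
#include <emscripten/proxying.h>
#include <pthread.h>
#include <cstdint>

static pthread_t app_thread;       // the proxied pthread that main() runs on
static em_proxying_queue* queue;   // the built-in system proxying queue

// Runs on the detached (proxied main) thread for each UI event.
static void handle_event(void* arg) {
  int event = (int)(intptr_t)arg;
  // ... heavy C++ work, no other blocking calls ...
  // The only MAIN_THREAD_EM_ASM this thread makes: report the result back to
  // the main application thread. Module.onResponse is a placeholder handler.
  MAIN_THREAD_EM_ASM({ Module.onResponse($0); }, event);
}

// Exported to JS; invoked on the main application thread for each UI event
// and proxied asynchronously to the detached thread.
extern "C" EMSCRIPTEN_KEEPALIVE void process_event(int event) {
  emscripten_proxy_async(queue, app_thread, handle_event, (void*)(intptr_t)event);
}

int main() {
  app_thread = pthread_self();     // with PROXY_TO_PTHREAD, main() runs on a pthread
  queue = emscripten_proxy_get_system_queue();
  // ... initialization ...
  emscripten_exit_with_live_runtime();  // keep this thread and the runtime alive
}
```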

Functionality-wise, this model worked well. But during performance analysis I found a degradation of around 200-400 ms. Profiling showed that the detached thread completed its work on time but then waited around 200-400 ms for the MAIN_THREAD_EM_ASM call to complete, i.e. for the main application thread to receive the response. The main application thread was completely idle during this time. This can be seen in the screenshot below.

[Screenshot: Chrome profile showing the detached thread waiting ~200-400 ms while the main application thread is idle]

Is this performance degradation expected? Is there another way I could model my app to avoid it? How can I minimise the time the detached thread takes to send back the response?

ravisumit33 changed the title from "[Multithreading] PROXY_TO_PTHREAD causes perf degradation" to "[Multithreading] PROXY_TO_PTHREAD & MAIN_THREAD_EM_ASM causes perf degradation" on Sep 15, 2024
ravisumit33 (Contributor, Author) commented:

@sbc100 @kripken Any thoughts on this?

sbc100 (Collaborator) commented Sep 19, 2024

To be clear, this is not some kind of regression? I.e. you are not claiming that some previous version of emscripten had a faster version of MAIN_THREAD_EM_ASM?

As far as I know there are no delays built into the proxying system. The call to MAIN_THREAD_EM_ASM should use a postMessage to wake the main thread, which should then use a shared-memory futex to wake the secondary thread once it's done.

@tlively are you aware of any reason for such a delay?

@ravisumit33 perhaps you could share an example of a simple program that demonstrates the delay you are talking about?
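Even something as small as the following would be useful (just an untested sketch to illustrate what I mean; both timestamps are taken on the same thread, so the difference is the full round trip of the proxied call):

```cpp
// Build (roughly): emcc repro.cpp -pthread -sPROXY_TO_PTHREAD -o repro.html
#include <emscripten.h>
#include <cstdio>

int main() {
  // With PROXY_TO_PTHREAD, main() runs on a pthread, so this call is proxied
  // to the main runtime thread and blocks until it has executed there.
  double start = emscripten_get_now();
  MAIN_THREAD_EM_ASM({ console.log("on main runtime thread"); });
  double end = emscripten_get_now();
  printf("MAIN_THREAD_EM_ASM round trip: %.3f ms\n", end - start);
}
```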

sbc100 (Collaborator) commented Sep 19, 2024

Are you doing anything on the main UI thread that is likely to be blocking it? I.e. are you doing synchronous proxying to your background thread? Can you give more details on what you mean by "I used proxying to proxy events coming from UI to this detached thread"?

ravisumit33 (Contributor, Author) commented:

Sorry for not providing complete details about the issue. I am doing an async proxy to the detached thread. My main application thread isn't the main UI thread; I instantiate wasm in a web worker.

ravisumit33 (Contributor, Author) commented:

> To be clear, this is not some kind of regression? I.e. you are not claiming that some previous version of emscripten had a faster version of MAIN_THREAD_EM_ASM?
>
> As far as I know there are no delays built into the proxying system. The call to MAIN_THREAD_EM_ASM should use a postMessage to wake the main thread, which should then use a shared-memory futex to wake the secondary thread once it's done.
>
> @tlively are you aware of any reason for such a delay?
>
> @ravisumit33 perhaps you could share an example of a simple program that demonstrates the delay you are talking about?

I will try to reproduce this in a simple program. Just to be clear, the delay isn't in proxying from the main application thread to the detached thread. The delay is in receiving the response from the background (detached) thread, which sends the response back synchronously (MAIN_THREAD_EM_ASM).

sbc100 (Collaborator) commented Sep 19, 2024

So you have the following JS contexts:

0: The main browser UI thread
1: The worker that starts your wasm program
2: The worker that runs the main function inside a pthread (due to PROXY_TO_PTHREAD).

Is that correct?

sbc100 (Collaborator) commented Sep 19, 2024

> I instantiate wasm in a web-worker

I think this aspect could be a clue, since it's not the most common setup. Can you explain a little more about this setup? I assume you create this worker using the normal new Worker API and communicate with it solely through postMessage to/from the main UI browser thread? (I.e. the main UI browser thread doesn't do any shared memory stuff?)

ravisumit33 (Contributor, Author) commented:

Yes, the list of JS contexts is correct. I create the worker that instantiates wasm using the new Worker API as you mentioned, and communicate with it through postMessage from the main UI browser thread. The main UI browser thread doesn't do any shared memory stuff.

ravisumit33 (Contributor, Author) commented Sep 19, 2024

I have highlighted the delay in a red rectangle below. As can be seen, the background thread (lower track) is just waiting until the main application thread (upper track) has received the response. The main application thread is idle during the delay.
[Screenshot: Chrome profile with the delay highlighted in a red rectangle; the background thread waits while the main application thread is idle]

tlively (Member) commented Sep 20, 2024

Instead of using MAIN_THREAD_EM_ASM to communicate the results back, can you use emscripten_proxy_callback, emscripten_proxy_callback_with_ctx, emscripten_proxy_promise, or emscripten_proxy_promise_with_ctx? I don't know where the pause could be coming from, but these would be more direct methods of reporting the results.
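Roughly, the callback-based flow would look something like this (an untested sketch using the system proxying queue; do_work, on_result, and on_cancel are placeholder names):

```cpp
#include <emscripten/proxying.h>
#include <pthread.h>

// Runs on the background (detached) thread.
static void do_work(void* arg) {
  // ... compute the result and stash it via arg ...
}

// Runs back on the thread that called emscripten_proxy_callback (the main
// application thread) once do_work has completed and that thread returns to
// its event loop.
static void on_result(void* arg) {
  // ... hand the result to JS directly from the main application thread ...
}

// Runs on the calling thread if the target thread's queue is torn down first.
static void on_cancel(void* arg) {}

// Called on the main application thread instead of having the worker send the
// result back with MAIN_THREAD_EM_ASM.
void dispatch(pthread_t worker_thread, void* ctx) {
  emscripten_proxy_callback(emscripten_proxy_get_system_queue(), worker_thread,
                            do_work, on_result, on_cancel, ctx);
}
```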

An example program that demonstrates the issue would certainly be helpful.

sbc100 (Collaborator) commented Sep 20, 2024

I think we should try to get to the bottom of this since MAIN_THREAD_EM_ASM shouldn't have this kind of delay. I agree a simple repro case would be great here.

sbc100 (Collaborator) commented Sep 20, 2024

By the way, I see that you have .worker.js in your filename. Does that mean you are using a version of emscripten from before #21701 landed (that change removed the worker.js output file), i.e. older than 3.1.58?

edit: I see you are using 3.1.56, would upgrading to the latest version be difficult?

sbc100 (Collaborator) commented Sep 26, 2024

Were you able to build a reproducer?

If you are able to run experiments, can you confirm if the issue also occurs when you run the main application thread on the main browser thread?

ravisumit33 (Contributor, Author) commented:

Sorry for replying late. I tried multiple approaches to build the reproducer but failed to do so. While trying to build the reproducer, I did both profiling and added logs for the timestamps just before sending the response to the main application thread and just after the message is received on the main application thread. The message was getting received almost instantly. Then I did the same in my codebase, and strangely the message was also getting received almost instantly there, as shown by the difference of timestamps (performance.timeOrigin + performance.now(); a sketch of this logging is at the end of this comment). With all these experiments I have come to the following conclusions:

  1. Chrome's profiler is not showing the correct picture. Maybe a bug in the profiler?
  2. The perf degradation (200-300 ms) is actually extra time my code takes to complete the function call itself. I am not sure why it takes longer in the multi-threaded build than in the single-threaded build, as the logic is the same.

I am still exploring the root cause of the issue.
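For reference, the timestamp logging I mentioned is essentially a small helper like this (simplified sketch; the labels are placeholders). performance.timeOrigin + performance.now() gives absolute times that are comparable across workers:

```cpp
#include <emscripten.h>
#include <cstdio>

// Absolute wall-clock time in ms; comparable across workers because each
// context's performance.timeOrigin is included.
static double absolute_now_ms() {
  return EM_ASM_DOUBLE({ return performance.timeOrigin + performance.now(); });
}

// Call once just before proxying the response and once in the handler on the
// receiving thread, then diff the logged values.
void log_timestamp(const char* label) {
  printf("%s: %.3f ms\n", label, absolute_now_ms());
}
```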

ravisumit33 (Contributor, Author) commented:

> By the way, I see that you have .worker.js in your filename. Does that mean you are using a version of emscripten from before #21701 landed (that change removed the worker.js output file), i.e. older than 3.1.58?
>
> edit: I see you are using 3.1.56, would upgrading to the latest version be difficult?

I am unable to build my codebase with 3.1.57 and have opened a new issue for that: #22646

ravisumit33 (Contributor, Author) commented:

> Were you able to build a reproducer?
>
> If you are able to run experiments, can you confirm if the issue also occurs when you run the main application thread on the main browser thread?

I can't run the main application thread on the main browser thread in my codebase due to its structure; that would involve a whole lot of changes.
