Speed up (up to 7x!) building large system libraries #20577


Merged (12 commits, Oct 31, 2023)

Conversation

@RReverser (Collaborator) commented Oct 30, 2023

Large system libraries like libc consist of lots of individual objects.

Previously, we'd compile each of them individually, although the processes themselves were parallelised across CPU cores.

However, this resulted in a lot of per-object overhead: create a Python process for Emscripten itself, compile our Python code, execute a bunch of custom logic in said Python code, create another process for Clang with the computed flags, and finally get an object. This overhead is even more pronounced on Windows where (leaving aside details on why) creating a new process is much more expensive than on Linux.

Luckily, Clang, and, by extension, Emscripten, allows compiling lots of source files into object files in a single command invocation / single process, as long as all of them share the same compilation flags and target directory (which has to be the current working directory).

In this PR I'm leveraging that feature by grouping all the sources within a library by the compiler and flags they use. Then, I'm splitting said groups into chunks. Chunk size is currently chosen as 2 * number_of_cores as a heuristic to strike a balance between creating large batches and avoiding workload imbalance where some CPU core might have already finished and sits idle while others are still processing their share of source files. Finally, each chunk is added to the list of commands, and all of those commands are executed in parallel across CPU cores like before. (There is still one exception for commands that need to be singled out, but I won't go into those details here.)
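The grouping and chunking described above can be sketched roughly as follows (`get_compiler_and_flags` and all other names here are illustrative stand-ins, not the actual emscripten code):

```python
import os
from itertools import islice

def batch_sources(sources, get_compiler_and_flags):
    """Minimal sketch of the batching strategy described above."""
    # 1. Group sources by the compiler and flags they use; only files
    #    sharing both can be compiled in a single clang invocation.
    groups = {}
    for src in sources:
        compiler, flags = get_compiler_and_flags(src)
        groups.setdefault((compiler, tuple(flags)), []).append(src)

    # 2. Split each group into chunks; the heuristic above uses
    #    2 * number_of_cores as the chunk size.
    chunk_size = 2 * (os.cpu_count() or 1)
    commands = []
    for (compiler, flags), files in groups.items():
        it = iter(files)
        while chunk := list(islice(it, chunk_size)):
            # One clang invocation compiles the whole chunk.
            commands.append([compiler, *flags, '-c', *chunk])

    # 3. `commands` is then executed in parallel across CPU cores,
    #    as the per-object commands were before.
    return commands
```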

On my laptop with 8 cores, this leads to the following speedups on large system libraries when running ./em++ temp.cpp -s STRICT, where temp.cpp is just int main() {} and the cache is empty:

| Library Name | Before (Windows) | After (Windows) | Before (Linux) | After (Linux) | Speedup (Windows) | Speedup (Linux) |
| --- | --- | --- | --- | --- | --- | --- |
| libc-debug.a | 141.7s | 29.6s | 53.2s | 7.7s | 4.8x | 7.0x |
| libcompiler_rt.a | 22.7s | 6.6s | 8.7s | 1.8s | 3.5x | 4.8x |
| libc++-noexcept.a | 39.5s | 37.0s | 20.8s | 18.6s | 1.1x | 1.1x |

Less verbose than full profiling and makes regular build_objects output closer to Ninja's, which already prints build time.
@RReverser (Collaborator, Author) commented Oct 30, 2023

I'm not seeing those CI failures locally on Windows, but I'm pretty sure I know why they're happening - I fixed the same issue for Ninja in #20556, so it should be easy to fix here too. Will do tomorrow.

@kripken (Member) left a comment

Wow, excellent idea! And great speedups!

(I didn't even know you can pass in multiple filenames with -c...)

@RReverser (Collaborator, Author)

> (I didn't even know you can pass in multiple filenames with -c...)

It's actually something I only learnt a few days ago myself, and immediately was like "wait a second... I know where this would be useful" 😅

@sbc100 (Collaborator) commented Oct 31, 2023

Nice!

I guess it's a good job that clang doesn't use parallelism internally to compile the different source files that it's passed, otherwise we would end up with N * N processes.

@sbc100 (Collaborator) commented Oct 31, 2023

Note that when you have multiple inputs in a single compile command you lose the ability to specify output filenames. So things like clang -c dir1/foo.c dir2/foo.c won't work, since both sources compile to foo.o.

Luckily I guess in this case we don't care much about the names of the object files, and we address name collisions as they arise.

@RReverser (Collaborator, Author) commented Oct 31, 2023

> I guess it's a good job that clang doesn't use parallelism internally to compile the different source files that it's passed, otherwise we would end up with N * N.

I don't think they would do that implicitly, but I actually searched for it and saw a PR where someone was adding parallelism for clang-cl and said they could do it for clang itself via -j if there was interest, but that didn't go anywhere.

It would be nice if Clang had such opt-in parallelism, since then we wouldn't even need to spawn a bunch of processes and could instead pass all the source files in a batch to a single Clang command and let it build those in parallel threads (which are much more lightweight), making builds even faster... Oh well.

@RReverser (Collaborator, Author)

> Note that when you have multiple inputs in a single compile command you lose the ability to specify output filenames. So things like clang -c dir1/foo.c dir2/foo.c won't work, since both sources compile to foo.o.

Yup, I know - that's why I still have that exemption in the code where commands that would result in the same object file are split out and handled separately.
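That exemption can be illustrated with a small sketch (a hypothetical helper, not the actual code): sources whose object-file basenames would collide are singled out, since compiling them in a shared batch would overwrite the same foo.o.

```python
import os
from collections import Counter

def split_collisions(sources):
    """Separate sources whose object file names would collide.

    dir1/foo.c and dir2/foo.c both produce foo.o when batched into one
    clang invocation, so they must be compiled individually instead.
    """
    stems = [os.path.splitext(os.path.basename(s))[0] for s in sources]
    counts = Counter(stems)
    batchable, singled_out = [], []
    for src, stem in zip(sources, stems):
        (singled_out if counts[stem] > 1 else batchable).append(src)
    return batchable, singled_out
```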

@RReverser (Collaborator, Author)

Had to fix one more issue this uncovered (see comment in code), and I think this is good to go.

@sbc100 (Collaborator) commented Oct 31, 2023

I'm surprised the speedup is bigger on Linux, given that Windows is the one where we expect process creation to be more expensive.

@RReverser (Collaborator, Author) commented Oct 31, 2023

> I'm surprised the speedup is bigger on Linux, given that Windows is the one where we expect process creation to be more expensive.

I think this is one of the cases where it should be looked at not so much as a multiplier kind of improvement (new_time / previous_time like I did), but as an absolute reduction of overhead * number_of_files.

So e.g. libc-debug, which has 1024 files, reduced overhead by (141.7-29.6)/1024 = 0.11 seconds per file on Windows and (53.2-7.7)/1024 = 0.04 seconds on Linux.

Looking at it this way scales more correctly when we look at other libraries - e.g. for libcompiler_rt, which has 175 files, we get a (22.7-6.6)/175 = 0.09 seconds reduction per file on Windows and (8.7-1.8)/175 = 0.04 seconds on Linux - almost exactly the same time per file.
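For what it's worth, the per-file arithmetic above can be checked directly from the benchmark table (rounding to two decimals):

```python
# Per-file overhead reduction = (time_before - time_after) / number_of_files,
# using the numbers from the benchmark table in the PR description.
timings = {
    # library: (files, before_win, after_win, before_linux, after_linux)
    'libc-debug':     (1024, 141.7, 29.6, 53.2, 7.7),
    'libcompiler_rt': (175,  22.7,  6.6,  8.7,  1.8),
}
for name, (n, before_win, after_win, before_lin, after_lin) in timings.items():
    print(f'{name}: {(before_win - after_win) / n:.2f}s/file (Windows), '
          f'{(before_lin - after_lin) / n:.2f}s/file (Linux)')
# libc-debug: 0.11s/file (Windows), 0.04s/file (Linux)
# libcompiler_rt: 0.09s/file (Windows), 0.04s/file (Linux)
```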

@dschuff (Member) commented Oct 31, 2023

One thing to watch out for in testing here: the CircleCI builder actually uses Ninja to prebuild the libraries, so IIUC that's not covering this change. Can you maybe try a commit that removes EMCC_USE_NINJA from the .config.yml and test that way? It may still be slower than Ninja, so we may want to revert it, but at least we can test it before landing. The emscripten-releases CI uses this built-in way, so we will get coverage there, but not until after it lands here.

@sbc100 (Collaborator) commented Oct 31, 2023

If this method ends up being faster than ninja I wonder if we should just remove the EMCC_USE_NINJA support?

@RReverser (Collaborator, Author)

> One thing to watch out for in testing here: the CircleCI builder actually uses Ninja to prebuild the libraries, so IIUC that's not covering this change. Can you maybe try a commit that removes EMCC_USE_NINJA from the .config.yml and test that way? It may still be slower than Ninja, so we may want to revert it, but at least we can test it before landing. The emscripten-releases CI uses this built-in way, so we will get coverage there, but not until after it lands here.

It's still much faster than Ninja, yes.

I actually initially implemented this change for Ninja, as I'm using it locally as well. But I only got partway there, because Ninja doesn't have built-in support for batching yet (ninja-build/ninja#1004), and the workaround - finding out which files have changed so that only those are batched - is rather involved: you need a task that just collects changed filenames into a temporary file, then you chunk that temporary file into batches... it gets complicated.

So I decided to stop there and implement this for the non-Ninja version of the build script instead, which is both much easier implementation-wise and benefits more users, as most haven't opted in to Ninja yet.

I'll try to remove EMCC_USE_NINJA from config.

@RReverser (Collaborator, Author)

> One thing to watch out for in testing here: the CircleCI builder actually uses Ninja to prebuild the libraries, so IIUC that's not covering this change.

If this is the case (which it seems to be), I'm a bit surprised some tests failed before due to a change in this non-Ninja path (caused by me spawning subprocesses in a different working directory when building system libraries)...

@dschuff (Member) commented Oct 31, 2023

> One thing to watch out for in testing here: the CircleCI builder actually uses Ninja to prebuild the libraries, so IIUC that's not covering this change.

> If this is the case (which it seems to be), I'm a bit surprised some tests failed before due to a change in this non-Ninja path...

It only uses Ninja for the 'build libraries' step, before tests start. But that doesn't build every configuration of every library. IIRC there are still some libraries that get built on demand during test runs, and we don't use Ninja for those.

@RReverser (Collaborator, Author) commented Oct 31, 2023

@dschuff Looks like it passes w/o EMCC_USE_NINJA too. CI seems to have a lot of build-time variation from commit to commit though, so it's hard to judge whether it's any faster there, but at least it still works.

Should I revert this last change or push it a bit further by removing ninja installation step as well?

@RReverser (Collaborator, Author) commented Nov 14, 2023

Which strongly suggests that most of the overhead reduced here did come from executing the Python code, and actually invoking Clang is relatively cheap.

That also matches the observation that currently emcc.py will still invoke a separate clang.exe per source even for -c ...multiple files..., so all this PR did was merge the computation of cflags up to that point, and it clearly yielded a significant speedup.

That computation is the same overhead that, I think, caching would help reduce more universally.

@sbc100 (Collaborator) commented Nov 14, 2023

> Which strongly suggests that most of the overhead reduced here did come from executing the Python code, and actually invoking Clang is relatively cheap.
>
> That also matches the observation that currently emcc.py will invoke a separate clang.exe per source even for -c ...multiple files..., so all this PR did was merge the computation of cflags up to that point, and it clearly yielded a significant speedup.
>
> That computation is the same overhead that, I think, caching would help reduce more universally.

I imagine there are two sources of overhead here:
(1) the cost of starting python.exe to run the very first line of Python code in emcc.py
(2) the cost of running the emcc.py Python code up until subprocess creation

I would love to get rid of both of them one day, but you are right that getting rid of (2) could be done by caching the clang arguments.

We've also not spent much time optimizing (2), so there could be a lot of other low-hanging fruit here even without caching.
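A minimal sketch of what caching the computed clang arguments (the suggested fix for (2)) might look like - the function names, cache location, and on-disk layout here are all hypothetical, not the real emscripten implementation:

```python
import hashlib
import json
import os
import tempfile

# Hypothetical on-disk cache location for computed clang flags.
CACHE_DIR = os.path.join(tempfile.gettempdir(), 'emcc-flags-cache')

def cached_cflags(settings, compute):
    """Return the clang flags for `settings`, computing them at most once.

    `settings` is any JSON-serializable description of the inputs that
    determine the flags; `compute` is the (expensive) Python-side flag
    computation that we want to skip on repeat compilations.
    """
    key = hashlib.sha256(
        json.dumps(settings, sort_keys=True).encode()).hexdigest()
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, key + '.json')
    if os.path.exists(path):
        # Cache hit: skip the flag computation entirely.
        with open(path) as f:
            return json.load(f)
    flags = compute(settings)
    with open(path, 'w') as f:
        json.dump(flags, f)
    return flags
```

A real implementation would also need to invalidate the cache when the emscripten or clang version changes, which is glossed over here.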

sbc100 added a commit to sbc100/emscripten that referenced this pull request Nov 14, 2023
For folks that compile a bunch of source files at once, this should
help by reducing the number of clang subprocesses that we spawn.  This
includes the new system library build process that we use internally.
See emscripten-core#20577.

Do the same for PCH compilation and pre-processing (-E).
@RReverser (Collaborator, Author)

> I imagine there are two sources of overhead here:
> (1) the cost of starting python.exe to run the very first line of Python code in emcc.py
> (2) the cost of running the emcc.py Python code up until subprocess creation

Fair enough, that's plausible. I'm not very familiar with Python execution and don't know what the cost of (1) looks like - I only measured (2), the actual Python code.

> We've also not spent much time optimizing (2), so there could be a lot of other low-hanging fruit here even without caching.

Right, caching (as usual, I guess, heh) is more of a suggested band-aid. I certainly agree that properly optimising (2) is better long-term for both cached and uncached scenarios; I just figured caching might be much quicker to implement, to get the wins sooner.

sbc100 added a commit to sbc100/emscripten that referenced this pull request Nov 15, 2023
This change moves more than half of the code in emcc.py, which is used
only for linking, into its own file (link.py).  This makes reasoning
about the code easier and paves the way for a fast path when only
compiling (e.g. this code shouldn't even need to be imported when only
compiling).  See emscripten-core#20577.
sbc100 added a commit that referenced this pull request Nov 15, 2023
…#20713)

kripken pushed a commit that referenced this pull request Dec 14, 2023
Since #20577, relative paths are used in the command line when building system libraries.

As such, we need to properly handle those relative paths to generate a source
map pointing to the real source file locations.
sbc100 added a commit to sbc100/emscripten that referenced this pull request Dec 15, 2023
Basically gives emscripten developers a way to opt out of emscripten-core#20577.

I've found it useful for debugging to build just one file at a time and
have absolute (not relative) paths in the build commands.  This allows
me to copy and paste a single failing command and run it in isolation
when working on system library changes.
sbc100 added a commit that referenced this pull request Dec 18, 2023
…20929)

5 participants