Speed up (up to 7x!) building large system libraries #20577
Conversation
Less verbose than full profiling and makes regular build_objects output closer to Ninja's, which already prints build time.
I'm not seeing those Windows CI failures locally, but I'm pretty sure I know why they're happening - I fixed the same issue for Ninja in #20556, so it should be easy to fix here too. Will do tomorrow.
Wow, excellent idea! And great speedups!
(I didn't even know you can pass in multiple filenames with -c...)
Co-authored-by: Alon Zakai <alonzakai@gmail.com>
It's actually something I only learnt a few days ago myself, and immediately was like "wait a second... I know where this would be useful" 😅
Nice! I guess it's a good job that clang doesn't internally use parallelism to compile the different source files that it's passed, otherwise we would end up with N * N.
Note that when you have multiple inputs in a single compile command you lose the ability to specify output filenames. Luckily, I guess in this case we don't care much about the names of the object files, and we address name collisions as they arise.
I don't think they would do that implicitly, but I actually searched for it and saw a PR where someone was adding parallelism for clang-cl and said they could do it for clang itself as well. It would be nice if Clang had such opt-in parallelism, since then we wouldn't even need to spawn a bunch of processes and could instead pass all the source files in a batch to a single Clang command and let it build them in parallel threads (which are much more lightweight), making builds even faster... Oh well.
Yup, I know, that's why I still have that exemption in the code where commands that would result in the same object file are split out and handled separately.
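To illustrate the kind of exemption being described, here is a minimal sketch (the function below is made up for illustration and is not the actual emscripten code): sources whose basenames would map to the same object filename in the shared output directory get singled out, while the rest can share one batched invocation.

```python
# Illustrative sketch only; not the real emscripten implementation.
import os
from collections import defaultdict

def split_colliding_sources(sources):
  """Return (batchable, singled_out): sources whose basenames map to a unique
  .o name can share one clang invocation; colliding ones are compiled alone."""
  by_obj = defaultdict(list)
  for src in sources:
    obj = os.path.splitext(os.path.basename(src))[0] + '.o'
    by_obj[obj].append(src)
  batchable, singled_out = [], []
  for srcs in by_obj.values():
    if len(srcs) == 1:
      batchable.append(srcs[0])
    else:
      # e.g. a/foo.c and b/foo.c would both want to write foo.o into the cwd,
      # so they are handled by separate commands with explicit output names.
      singled_out.extend(srcs)
  return batchable, singled_out
```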
Had to fix one more issue this uncovered (see comment in code), and I think this is good to go.
I'm surprised the speedup is bigger on Linux, given that Windows is the one where we expect process creation to be more expensive.
I think this is one of the cases where it should be looked at not so much as a multiplier kind of improvement, but rather in terms of the fixed per-object overhead that gets removed. Looking at it this way scales more correctly when we look at other libraries.
One thing to watch out for on testing here: the CircleCI builder actually uses ninja to prebuild the libraries, so IIUC that's not covering this change. Can you maybe try a commit that removes EMCC_USE_NINJA from the .config.yml and tests that way? It may still be slower than ninja, so we may want to revert it, but at least we can test it before landing. The emscripten-releases CI uses this built-in way, so we will get coverage, but not until after it lands here.
If this method ends up being faster than ninja, I wonder if we should just remove the EMCC_USE_NINJA support?
It's still much faster than Ninja, yes. I actually initially implemented this change for Ninja, as I'm using it locally as well. But I only got there partially, because Ninja doesn't have built-in support for batching yet (ninja-build/ninja#1004), and the workaround to find out which files have changed, so that only those are batched, is rather involved - you need a task that just collects changed filenames into a temporary file, then chunk that temporary file into batches... it gets complicated. So I decided to stop there and implement this for the non-Ninja version of the building script instead, which is both much easier implementation-wise and benefits more users, as most haven't opted in to Ninja yet. I'll try to remove EMCC_USE_NINJA from the config.
If this is the case (which it seems to be), I'm a bit surprised some tests failed before due to a change in this non-Ninja path (because I'm now spawning subprocesses in a different working directory when building system libraries)...
It only uses ninja for the 'build libraries' step, before tests start. But that doesn't build every configuration of every library. IIRC there are still some libraries that get built on-demand during testing runs, and we don't use ninja for that.
@dschuff Looks like it passes w/o EMCC_USE_NINJA too. CI seems to have a lot of build-time variation from commit to commit though, so it's hard to judge whether it's any faster there, but at least it still works. Should I revert this last change, or push it a bit further by removing the ninja installation step as well?
That also matches the observation that currently emcc.py will still invoke a separate clang.exe for each source, even when a single emcc command is given multiple source files. That computation is the same overhead that I think caching would help reduce more universally.
I imagine there are two sources of overhead here: (1) starting the Python process itself (process creation and loading/compiling our Python code), and (2) the actual Python code that computes the clang arguments. I would love to get rid of both of them one day, but you are right that getting rid of (2) could be done by caching the clang arguments. We've also not spent much time optimizing (2), so there could be a lot of other low-hanging fruit here even without caching.
For folks that compile a bunch of source files at once, this should help by reducing the number of clang subprocesses that we spawn. This includes the new system library build process that we use internally. See emscripten-core#20577. Do the same for PCH compilation and pre-processing (-E).
Fair enough, it's plausible. I'm not very familiar with Python execution and don't know what the cost of (1) looks like; I only measured (2), the actual Python code.
Right, caching (as usual, I guess, heh) is more of a suggested band-aid. I certainly agree properly optimising (2) is better long-term for both cached and uncached scenarios; I just figured caching might be much quicker to implement, to get those wins sooner.
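As an aside, a minimal sketch of the kind of argument cache being discussed might look like this (entirely hypothetical: none of these names exist in emcc.py, and a real cache would also need to key on things like the compiler version, settings files, and environment):

```python
# Hypothetical sketch of caching the computed clang arguments (overhead (2)).
import hashlib
import json
import os

CACHE_DIR = os.path.expanduser('~/.emcc-arg-cache')  # made-up location

def cached_clang_args(user_flags, compute_clang_args):
  """Reuse a previously computed clang argument list for identical emcc flags,
  skipping the expensive Python-side argument computation."""
  key = hashlib.sha256(json.dumps(user_flags, sort_keys=True).encode()).hexdigest()
  path = os.path.join(CACHE_DIR, key + '.json')
  if os.path.exists(path):
    with open(path) as f:
      return json.load(f)
  args = compute_clang_args(user_flags)  # placeholder for the real computation
  os.makedirs(CACHE_DIR, exist_ok=True)
  with open(path, 'w') as f:
    json.dump(args, f)
  return args
```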
This change moves the more than half of the code in emcc.py which is used only for linking into its own file (link.py). This makes reasoning about the code easier and paves the way for a fast path when only compiling (e.g. this code shouldn't even need to be imported when only compiling). See emscripten-core#20577.
Since #20577, relative paths are used in the command lines that build system libraries. As such, we need to handle those relative paths properly to generate a source map pointing to the real source file locations.
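For illustration, the handling that commit implies could be sketched roughly like this (assumed, not the actual implementation; resolve_source_path and compile_cwd are made-up names):

```python
# Sketch: resolve relative paths from the batched compile commands against the
# directory the compiler ran in before writing them into the source map.
import os

def resolve_source_path(path, compile_cwd):
  """Map a path as it appeared on the compile command line to an absolute
  source location suitable for a source map entry."""
  if os.path.isabs(path):
    return os.path.normpath(path)
  return os.path.normpath(os.path.join(compile_cwd, path))
```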
Basically gives emscripten developers a way to opt out of emscripten-core#20577 (see #20929). I've found it useful for debugging to build just one file at a time and have absolute (not relative) paths in the build commands. This allows me to copy and paste a single failing command and run it in isolation when working on system library changes.
Large system libraries like libc consist of lots of individual objects.
Previously, we'd compile each of them individually, although the processes themselves were parallelised across CPU cores.
However, this resulted in a lot of per-object overhead: create a Python process for Emscripten itself, compile our Python code, execute a bunch of custom logic in said Python code, create another process for Clang with the computed flags, and finally get an object. This overhead is even more pronounced on Windows where (leaving aside details on why) creating a new process is much more expensive than on Linux.
Luckily, Clang (and, by extension, Emscripten) allows compiling lots of source files into object files in a single command invocation / single process, as long as all of them share the same compilation flags and target directory (which has to be the current working directory).
In this PR I'm leveraging that feature by grouping all the sources within a library by the compiler and flags they use. Then, I'm splitting said groups into chunks. Chunk size is currently chosen as
2 * number_of_cores
as a heuristic to strike a balance between creating large batches and avoiding workload imbalance where some CPU core might've already finished and sits idle while others are still processing their share of source files. Finally, each chunk is added to the list of commands, and all of those commands are executed in parallel across CPU cores like before. (There is still one exception for commands that need to be singled out, but I won't go into those details here. A rough sketch of the batching is included at the end of this description.)

On my laptop with 8 cores, this leads to the following speedups on large system libraries when building
./em++ temp.cpp -s STRICT
where temp.cpp
is just int main() {}
and the cache is empty:
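To make the batching described above concrete, here is a rough sketch of the grouping-and-chunking heuristic (illustrative only; the names are invented and the real build_objects logic differs in details, e.g. the singled-out commands are omitted):

```python
# Rough sketch of grouping sources by compiler/flags and chunking each group.
import multiprocessing
from collections import defaultdict

def batch_compile_commands(sources_with_flags):
  """Group sources by the compiler and flags they share, then split each group
  into chunks of 2 * number_of_cores files per compiler invocation."""
  groups = defaultdict(list)
  for src, compiler, flags in sources_with_flags:
    groups[(compiler, tuple(flags))].append(src)

  chunk_size = 2 * multiprocessing.cpu_count()
  commands = []
  for (compiler, flags), srcs in groups.items():
    for i in range(0, len(srcs), chunk_size):
      chunk = srcs[i:i + chunk_size]
      # One compiler invocation handles the whole chunk; the resulting .o files
      # land in the current working directory.
      commands.append([compiler, '-c', *flags, *chunk])
  return commands  # run in parallel across CPU cores, as before
```

Splitting each flag group into bounded chunks, rather than issuing one huge batch per group, is what keeps all cores busy until the end of the build while still amortising the per-process overhead.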