[aotcompile] Reduce memory pressure from LLVM: Use more shards than threads, make temporary files #59348
Conversation
Is there a way to optionally disable this? I'm concerned that this would cause even slower compilation on systems with slow (distributed) filesystems.
It would be easy to add an environment variable that keeps

EDIT: LLVM has its own logic for determining the path returned by
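A minimal sketch of that kind of opt-out switch, assuming a hypothetical variable name (`JULIA_IMAGE_TEMP_FILES` is not something this PR defines); note also that LLVM's `sys::path::system_temp_directory` respects `TMPDIR`/`TMP`/`TEMP` on Unix, so the destination directory can already be redirected via the environment:

```c++
// Hypothetical opt-out gate; the variable name is made up for illustration.
#include <cstdlib>
#include <cstring>

static bool use_temporary_files()
{
    // Default to writing shards to temporary files; setting the variable to
    // "0" would keep the old in-memory buffers instead.
    const char *opt = std::getenv("JULIA_IMAGE_TEMP_FILES");
    return !(opt && std::strcmp(opt, "0") == 0);
}
```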
I'll chime in a little bit with some historical context here: I considered a queue of smaller compilation units because it also has the advantage of mitigating the long-tail problem, where one shard occupies most of the runtime. The reason for the upfront partition into N shards = N threads is that there is (or was) a high fixed cost per shard for deserializing a module from bitcode, plus other costs such as context creation with all its singletons, and when one thread has to handle two shards it necessarily does so serially. In fact, the fixed cost was high enough that it simply didn't make sense to have even 2x the number of shards, as that increased the time taken very quickly. It would probably be useful to collect performance measurements to see whether the fixed costs are still a problem.
Thanks for the context. I suspected as much, but thought it would be worth experimenting with while we look for ways to get the memory high water mark down. I guess we'll see from the data. Ultimately, I'd like to emit code into a reasonable number of modules with separate contexts to begin with and avoid the LLVM linker and serialization, but there are some other changes that need to happen before that. |
Also reuse already-computed ModuleInfo
Yes, clusters, which are environments where Julia is often used. This is Fugaku:

```julia
julia> tempdir()
"/home/u13541"

julia> run(`df -HT $(tempdir())`);
Filesystem       Type   Size Used Avail Use% Mounted on
global:/.vol0006 lliofs  26P  18P  6.3P  75% /vol0006
```

By default
😱 Ok, in fairness I wanted to try making
Force-pushed from 632e8bd to ef35884
Note that Windows has a special API for temporary files that will initially put the temporary file in RAM and only spill that temporary file to disk if the machine starts running low on memory. Quite useful.
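The comment doesn't name the exact API, but it is likely describing `CreateFile` with the `FILE_ATTRIBUTE_TEMPORARY` hint, which tells the cache manager to keep the file's data in memory and write it out only under memory pressure. A small sketch (an assumption, not code from this PR):

```c++
#include <windows.h>

// Open a scratch file that Windows will try to keep in the file cache
// (FILE_ATTRIBUTE_TEMPORARY) and that disappears when the last handle is
// closed (FILE_FLAG_DELETE_ON_CLOSE).
HANDLE open_scratch_file(const wchar_t *path)
{
    return CreateFileW(path,
                       GENERIC_READ | GENERIC_WRITE,
                       0,              // no sharing
                       nullptr,        // default security attributes
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                       nullptr);
}
```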
That sounds exactly like what I'm looking for. Looks like the LLVM helpers I used at one point created temporary files
I need to re-add the fallback that doesn't serialize/deserialize anything when we decide not to partition, and produce some evidence that the partitioning process doesn't add substantial overhead when the target weight is large. Beyond addressing memory pressure, I think it would be useful to partition more if we decide to enable more parallelism in package precompiles, coordinated via the make jobserver protocol.
This PR implements some changes (on top of #59329) to reduce the memory pressure
when compiling large system/package images, especially on Windows where this is
a recurring problem.
First, the compilation outputs are written to temporary files rather than to
memory buffers, and these files are mmap'd when we must read them again to
produce the `.a`. The theory is that Windows will not count the size of our
mapped files towards the physical memory and swapfile limit it uses for
`VirtualAlloc` (TODO: have someone more familiar with Windows verify this). If
profiling reveals writing to temporary files is a big performance hit for small
outputs, I will add a heuristic that avoids the temporary file in those cases.
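As a rough illustration of the pattern (not the PR's actual code), LLVM's support library can create the temporary file and later map it back in; `MemoryBuffer::getFile` typically mmaps large files rather than copying them. The `jl_img` prefix below is made up:

```c++
#include "llvm/ADT/SmallString.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// Write a compiled output to a temporary file, then hand back a (usually
// mmap-backed) buffer for reading it when the archive is assembled.
static ErrorOr<std::unique_ptr<MemoryBuffer>> emit_to_temp_and_map(StringRef payload)
{
    int FD;
    SmallString<128> Path;
    if (std::error_code EC = sys::fs::createTemporaryFile("jl_img", "o", FD, Path))
        return EC;
    {
        raw_fd_ostream OS(FD, /*shouldClose=*/true);
        OS << payload; // stream the output to disk instead of holding it in RAM
    }
    return MemoryBuffer::getFile(Path);
}
```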
Second, image compilation now produces a number of shards that is independent of
`JULIA_IMAGE_THREADS`. We partition into as many pieces as required to get
about 500000 weight in each partition, and have the compile threads work from a
queue of these partitions. The idea is that we can work on smaller pieces, one
at a time, cleaning up the LLVM contexts as we go. The weight target was chosen
based on what takes roughly 10 seconds to compile on my machine, but it should be
adjusted if we find that it is too aggressive. We don't want it to be too low,
or we'll start hitting the scaling issues with LLVM again. A possible TODO is
to use a different number for Windows or 32-bit platforms, or to set it based on
the available memory.
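A simplified sketch of the partitioning and queueing scheme described above (the real implementation lives in the aotcompile machinery; `Item` and `compile_shard` here are stand-ins, and the 500000 target just mirrors the number quoted):

```c++
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Item { size_t weight; /* e.g. one function or global to emit */ };
using Shard = std::vector<Item>;

// Greedily close a shard once it reaches the target weight.
std::vector<Shard> partition(const std::vector<Item> &items, size_t target = 500000)
{
    std::vector<Shard> shards(1);
    size_t current = 0;
    for (const Item &it : items) {
        if (current >= target) {
            shards.emplace_back();
            current = 0;
        }
        shards.back().push_back(it);
        current += it.weight;
    }
    return shards;
}

// Each worker pulls the next shard off a shared queue, so LLVM state can be
// created and torn down one shard at a time.
void compile_all(const std::vector<Shard> &shards, unsigned nthreads,
                 void (*compile_shard)(const Shard &))
{
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&] {
            for (size_t i = next++; i < shards.size(); i = next++)
                compile_shard(shards[i]);
        });
    for (std::thread &w : workers)
        w.join();
}
```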
Windows trace of building a system image with 1 thread and 1 shard, with the patch that destroys the LLVM context after serializing the combined module and with the deserialization/materialization overhead:

*(trace screenshot)*

The same, but divided into ~40 shards of weight 500000:

*(trace screenshot)*
Fix #58201