[aotcompile] Reduce memory pressure from LLVM: Use more shards than threads, make temporary files #59348
Conversation
Is there a way to optionally disable this? I'm concerned that this would cause even slower compilation on systems with slow (distributed) filesystems.
It would be easy to add an environment variable that keeps

EDIT: LLVM has its own logic for determining the path returned by
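A minimal sketch of that kind of opt-out switch, assuming a hypothetical variable name (`JULIA_IMAGE_TEMP_FILES` is not something this PR defines); note also that LLVM's `sys::path::system_temp_directory` respects `TMPDIR`/`TMP`/`TEMP` on Unix, so the destination directory can already be redirected via the environment:

```c++
// Hypothetical opt-out gate; the variable name is made up for illustration.
#include <cstdlib>
#include <cstring>

static bool use_temporary_files()
{
    // Default to writing shards to temporary files; setting the variable to
    // "0" would keep the old in-memory buffers instead.
    const char *opt = std::getenv("JULIA_IMAGE_TEMP_FILES");
    return !(opt && std::strcmp(opt, "0") == 0);
}
```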
I'll chime in a little bit with some historical context here: I considered a queue of smaller compilation units because it also has the advantage of mitigating the long-tail problem, where one shard occupies most of the runtime. The reason for the upfront partition into N shards = N threads is that there is (or was) a high fixed cost per shard for deserializing a module from bitcode, plus other costs such as context creation with all its singletons, and when one thread has to handle two shards it necessarily does so serially. In fact, the fixed cost was high enough that it simply didn't make sense to have even 2x the number of shards, as that increased the time taken very quickly. It would probably be useful to collect performance measurements to see whether the fixed costs are still a problem.
Thanks for the context. I suspected as much, but thought it would be worth experimenting with while we look for ways to get the memory high water mark down. I guess we'll see from the data. Ultimately, I'd like to emit code into a reasonable number of modules with separate contexts to begin with and avoid the LLVM linker and serialization, but there are some other changes that need to happen before that. |
Also reuse already-computed ModuleInfo
Yes, clusters, which are environments where Julia is often used. This is Fugaku:

```julia
julia> tempdir()
"/home/u13541"

julia> run(`df -HT $(tempdir())`);
Filesystem       Type   Size Used Avail Use% Mounted on
global:/.vol0006 lliofs  26P  18P  6.3P  75% /vol0006
```

By default
😱 Ok, in fairness I wanted to try making
Force-pushed from 632e8bd to ef35884
Note that Windows has a special API for temporary files that will initially put the temporary file in RAM and only spill that temporary file to disk if the machine starts running low on memory. Quite useful.
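The comment doesn't name the exact API, but it is likely describing `CreateFile` with the `FILE_ATTRIBUTE_TEMPORARY` hint, which tells the cache manager to keep the file's data in memory and write it out only under memory pressure. A small sketch (an assumption, not code from this PR):

```c++
#include <windows.h>

// Open a scratch file that Windows will try to keep in the file cache
// (FILE_ATTRIBUTE_TEMPORARY) and that disappears when the last handle is
// closed (FILE_FLAG_DELETE_ON_CLOSE).
HANDLE open_scratch_file(const wchar_t *path)
{
    return CreateFileW(path,
                       GENERIC_READ | GENERIC_WRITE,
                       0,              // no sharing
                       nullptr,        // default security attributes
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                       nullptr);
}
```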
That sounds exactly like what I'm looking for. Looks like the LLVM helpers I used at one point created temporary files
I need to re-add the fallback that doesn't serialize/deserialize anything when we decide not to partition, and produce some evidence that the partitioning process doesn't add substantial overhead when the target weight is large. Beyond addressing memory pressure, I think it would be useful to partition more if we decide to enable more parallelism in package precompiles, coordinated via the make jobserver protocol.
This PR implements some changes (on top of #59329) to reduce the memory pressure
when compiling large system/package images, especially on Windows where this is
a recurring problem.
First, the compilation outputs are written to temporary files rather than to
memory buffers, and these files are mmap'd when we must read them again to
produce the `.a`. The theory is that Windows will not count the size of our
mapped files towards the physical memory and swapfile limit it uses for
`VirtualAlloc` (TODO: have someone more familiar with Windows verify this). If
profiling reveals writing to temporary files is a big performance hit for small
outputs, I will add a heuristic that avoids the temporary file in those cases.
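As a rough illustration of the pattern (not the PR's actual code), LLVM's support library can create the temporary file and later map it back in; `MemoryBuffer::getFile` typically mmaps large files rather than copying them. The `jl_img` prefix below is made up:

```c++
#include "llvm/ADT/SmallString.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// Write a compiled output to a temporary file, then hand back a (usually
// mmap-backed) buffer for reading it when the archive is assembled.
static ErrorOr<std::unique_ptr<MemoryBuffer>> emit_to_temp_and_map(StringRef payload)
{
    int FD;
    SmallString<128> Path;
    if (std::error_code EC = sys::fs::createTemporaryFile("jl_img", "o", FD, Path))
        return EC;
    {
        raw_fd_ostream OS(FD, /*shouldClose=*/true);
        OS << payload; // stream the output to disk instead of holding it in RAM
    }
    return MemoryBuffer::getFile(Path);
}
```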
Second, image compilation now produces a number of shards that is independent of
`JULIA_IMAGE_THREADS`. We partition into as many pieces as required to get
about 500000 weight in each partition, and have the compile threads work from a
queue of these partitions. The idea is that we can work on smaller pieces, one
at a time, cleaning up the LLVM contexts as we go. The weight target was chosen
based on what takes roughly 10 seconds to compile on my machine, but it should be
adjusted if we find that it is too aggressive. We don't want it to be too low,
or we'll start hitting the scaling issues with LLVM again. A possible TODO is
to use a different number for Windows or 32-bit platforms, or to set it based on
the available memory.
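A simplified sketch of the partitioning and queueing scheme described above (the real implementation lives in the aotcompile machinery; `Item` and `compile_shard` here are stand-ins, and the 500000 target just mirrors the number quoted):

```c++
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Item { size_t weight; /* e.g. one function or global to emit */ };
using Shard = std::vector<Item>;

// Greedily close a shard once it reaches the target weight.
std::vector<Shard> partition(const std::vector<Item> &items, size_t target = 500000)
{
    std::vector<Shard> shards(1);
    size_t current = 0;
    for (const Item &it : items) {
        if (current >= target) {
            shards.emplace_back();
            current = 0;
        }
        shards.back().push_back(it);
        current += it.weight;
    }
    return shards;
}

// Each worker pulls the next shard off a shared queue, so LLVM state can be
// created and torn down one shard at a time.
void compile_all(const std::vector<Shard> &shards, unsigned nthreads,
                 void (*compile_shard)(const Shard &))
{
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&] {
            for (size_t i = next++; i < shards.size(); i = next++)
                compile_shard(shards[i]);
        });
    for (std::thread &w : workers)
        w.join();
}
```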
Windows trace of building a system image with 1 thread and 1 shard, with the patch that destroys the LLVM context after serializing the combined module and with the deserialization/materialization overhead:

*(trace screenshot)*

The same, but divided into ~40 shards of weight 500000:

*(trace screenshot)*
Fix #58201