@xal-0
Member

@xal-0 xal-0 commented Aug 20, 2025

This PR implements some changes (on top of #59329) to reduce the memory pressure
when compiling large system/package images, especially on Windows where this is
a recurring problem.

First, the compilation outputs are written to temporary files rather than to
memory buffers, and these files are mmap'd when we must read them again to
produce the .a. The theory is that Windows will not count the size of our
mapped files towards the physical memory and swapfile limit it uses for
VirtualAlloc (TODO: have someone more familiar with Windows verify this). If
profiling reveals writing to temporary files is a big performance hit for small
outputs, I will add a heuristic that avoids the temporary file in those cases.
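
As a rough illustration of the temporary-file idea (this is not the PR's C++ code), the Python sketch below streams output chunks to a temporary file instead of holding them in memory, then maps the file read-only when it has to be read again and copied into an archive. The function names `write_output_to_tempfile` and `copy_mapped` are hypothetical.

```python
import mmap
import os
import tempfile


def write_output_to_tempfile(chunks):
    """Stream output chunks to a temporary file and return its path.

    The directory comes from tempfile's usual lookup (TMPDIR, TEMP, TMP, ...).
    """
    f = tempfile.NamedTemporaryFile(prefix="aot-", suffix=".o", delete=False)
    with f:
        for chunk in chunks:
            f.write(chunk)
    return f.name


def copy_mapped(path, dest):
    """Copy the file at `path` into the binary file object `dest` via a read-only mapping.

    Returns the number of bytes written.
    """
    with open(path, "rb") as src:
        if os.fstat(src.fileno()).st_size == 0:
            return 0  # mmap cannot map an empty file
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # The mapping is file-backed, so its pages can be dropped and
            # re-read from disk instead of occupying anonymous memory.
            return dest.write(view)
```

The caller is responsible for deleting the temporary file once the archive has been written.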

Second, image compilation now produces a number of shards that is independent of
JULIA_IMAGE_THREADS. We partition into as many pieces as required to get
about 500000 weight in each partition, and have the compile threads work from a
queue of these partitions. The idea is that we can work on smaller pieces, one
at a time, cleaning up the LLVM contexts as we go. The weight target was chosen
based on what takes about 10s to compile on my machine, but it should be
adjusted if we find that it is too aggressive. We don't want it to be too low,
or we'll start hitting the scaling issues with LLVM again. A possible TODO is
to use a different number for Windows or 32-bit platforms, or to set it based on
the available memory.
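
A minimal sketch of the partition-and-queue scheme, in Python for brevity. The greedy grouping is an assumption made for illustration (the actual partitioner in the PR may differ); only the 500000 weight target comes from the description above. `partition_by_weight`, `compile_partitions`, and `compile_one` are hypothetical names.

```python
import queue
import threading

TARGET_WEIGHT = 500_000  # weight target from the PR description


def partition_by_weight(weights, target=TARGET_WEIGHT):
    """Greedily group item indices into consecutive partitions of about `target` weight."""
    partitions, current, current_weight = [], [], 0
    for index, weight in enumerate(weights):
        current.append(index)
        current_weight += weight
        if current_weight >= target:
            partitions.append(current)
            current, current_weight = [], 0
    if current:
        partitions.append(current)
    return partitions


def compile_partitions(partitions, compile_one, n_threads):
    """Have `n_threads` workers pull partitions from a shared queue until it is empty."""
    work = queue.Queue()
    for i, part in enumerate(partitions):
        work.put((i, part))
    results = [None] * len(partitions)

    def worker():
        while True:
            try:
                i, part = work.get_nowait()
            except queue.Empty:
                return
            # Per-partition state (an LLVM context in the real compiler) can be
            # released as soon as compile_one returns.
            results[i] = compile_one(part)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With this shape the number of partitions depends only on the total weight, while the thread count only controls how many partitions are in flight at once.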

Windows trace of building a system image with 1 thread and 1 shard, with the patch that destroys the LLVM context after serializing the combined module and with the deserialization/materialization overhead:
image

The same, but divided into ~40 shards of weight 500000:
image

Fix #58201

@xal-0 added the performance (Must go faster) and compiler:llvm (For issues that relate to LLVM) labels Aug 20, 2025
@giordano
Member

Is there a way to optionally disable this? I'm concerned that this would cause even slower compilation on systems with slow (distributed) filesystems.

@xal-0
Member Author

xal-0 commented Aug 20, 2025

It would be easy to add an environment variable that keeps AOTOutput in memory, but are there really systems where mktemp() gives you a path to a distributed filesystem?

EDIT: LLVM has its own logic for determining the path returned by createTemporaryFile but seems to respect TMPDIR.

@pchintalapudi
Member

I'll chime in a little bit with some historical context here:

I considered a queue of smaller compilation units because it also helps with the long-tail problem, where one shard occupies most of the runtime. The reason for the upfront partition into N shards = N threads is that there is (or was) a high fixed cost per shard for deserializing a module from bitcode, plus other costs such as context creation with all its singletons, and when one thread needs to handle two shards it necessarily has to do that serially. In fact, the fixed cost was high enough that it simply didn't make sense to have even 2X the number of shards, as that would increase the time taken very quickly. It would probably be useful to collect performance measurements to see if the fixed costs are still a problem.
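
The trade-off described here can be made concrete with a toy model: each shard costs a fixed overhead plus its own work, and threads take shards from a queue in order. All numbers below are made up for illustration and are not measurements; `makespan` is a hypothetical helper.

```python
import heapq


def makespan(shard_costs, n_threads, fixed_cost):
    """Wall time when threads pull shards from a queue in order.

    Each shard takes `fixed_cost` (deserialization, context creation, ...)
    plus its own compile cost on whichever thread becomes free first.
    """
    free_at = [0.0] * n_threads
    heapq.heapify(free_at)
    for cost in shard_costs:
        start = heapq.heappop(free_at)  # earliest available thread
        heapq.heappush(free_at, start + fixed_cost + cost)
    return max(free_at)


# Balanced work, 4 threads, fixed cost 10: doubling the shard count hurts.
balanced_4 = makespan([25] * 4, 4, 10)
balanced_8 = makespan([12.5] * 8, 4, 10)

# Long tail (one shard holds most of the work): splitting helps despite the overhead.
skewed_4 = makespan([70, 10, 10, 10], 4, 10)
skewed_8 = makespan([35, 35, 5, 5, 5, 5, 5, 5], 4, 10)
```

In this model the balanced case goes from 35 to 45 time units when the shard count doubles, while the skewed case drops from 80 to 45, which is why measuring the real fixed cost matters.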

@xal-0
Member Author

xal-0 commented Aug 21, 2025

Thanks for the context. I suspected as much, but thought it would be worth experimenting with while we look for ways to get the memory high water mark down. I guess we'll see from the data.

Ultimately, I'd like to emit code into a reasonable number of modules with separate contexts to begin with and avoid the LLVM linker and serialization, but there are some other changes that need to happen before that.

@giordano
Member

giordano commented Aug 21, 2025

> but are there really systems where mktemp() gives you a path to a distributed filesystem?

Yes, clusters, which are environments where Julia is often used. This is Fugaku:

julia> tempdir()
"/home/u13541"

julia> run(`df -HT $(tempdir())`);
Filesystem       Type    Size  Used Avail Use% Mounted on
global:/.vol0006 lliofs   26P   18P  6.3P  75% /vol0006

By default TMPDIR is the user's home directory, which is on a distributed filesystem. And I/O operations on Fugaku are horribly slow. Precompiling an environment containing only Plots.jl already takes about 12 hours; this would only make it worse. Edit: to be clear, the setup isn't always like this (TMPDIR being the home directory is actually rather unusual), but many clusters have weird setups. Also, it should be possible to point TMPDIR to a node-local filesystem, which alleviates at least part of the issue, but in general I/O is troublesome.
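
One way to apply the TMPDIR workaround from a launcher script is sketched below in Python. The candidate directories are assumptions: whether `/dev/shm` or `/tmp` is actually node-local depends on the cluster, and `env_with_local_tmpdir` is a hypothetical helper.

```python
import os


def env_with_local_tmpdir(candidates=("/dev/shm", "/tmp"), base_env=None):
    """Return a copy of the environment with TMPDIR set to the first usable candidate.

    If no candidate is an existing, writable directory, TMPDIR is left unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)
    for directory in candidates:
        if os.path.isdir(directory) and os.access(directory, os.W_OK):
            env["TMPDIR"] = directory
            break
    return env


# Possible use (not executed here):
#   import subprocess
#   subprocess.run(["julia", "-e", "using Pkg; Pkg.precompile()"],
#                  env=env_with_local_tmpdir())
```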

@xal-0
Member Author

xal-0 commented Aug 21, 2025

> By default TMPDIR is the user's home directory

😱

Ok, in fairness I wanted to try making AOTOutput write to a memory buffer that gets madvised with MADV_DONTNEED/MADV_COLD when we're done with the shard anyway. It's a nicer solution than writing to temporary files on systems that support it, but I couldn't find an equivalent on Windows.

@xal-0 force-pushed the split-compile-queue branch from 632e8bd to ef35884 August 21, 2025 23:13
@grandinj

Note that Windows has a special API for temporary files that will initially put the temporary file in RAM and only spill that temporary file to disk if the machine starts running low on memory. Quite useful.

@xal-0
Member Author

xal-0 commented Aug 22, 2025

That sounds exactly like what I'm looking for. It looks like the LLVM helpers I used at one point created temporary files with FILE_ATTRIBUTE_TEMPORARY set, but that was removed in llvm/llvm-project@7a0b640. Not sure why. I might just use CreateFile directly on Windows, so the situation would be like this:

  • On Unix, always write to a memory buffer. After writing a large output, suggest that the operating system can page it out if it's low on memory with MADV_DONTNEED. On Linux, use MADV_COLD if the kernel is new enough because they messed up MADV_DONTNEED:

         MADV_COLD (since Linux 5.4)
                Deactivate a given range of pages.  This will make the  pages  a
                more  probable reclaim target should there be a memory pressure.
                This is a nondestructive operation.  The advice might be ignored
                for some pages in the range when it is not applicable.
    

    Hopefully that alleviates @giordano's concerns about TMPDIR.

  • On Windows, write large outputs (>128 MiB maybe?) to a file created with FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE.
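
The plan in the two bullets above could look roughly like the following Python sketch. It is an illustration under assumptions, not the PR's implementation: the 128 MiB threshold is the tentative figure from the list, the numeric value 20 for MADV_COLD is the generic Linux value used in case Python's `mmap` module does not export the name, and `choose_backing` and `advise_reclaimable` are hypothetical helpers. On Linux the sketch skips the advice entirely when MADV_COLD is rejected, since MADV_DONTNEED there can discard the contents of anonymous memory.

```python
import mmap
import sys

LARGE_OUTPUT_BYTES = 128 * 1024 * 1024  # tentative ">128 MiB" threshold from the list above
MADV_COLD = getattr(mmap, "MADV_COLD", 20)  # assumption: 20 is the generic Linux value


def choose_backing(size, platform=sys.platform):
    """Pick where an output of `size` bytes should live."""
    if platform == "win32" and size > LARGE_OUTPUT_BYTES:
        return "temporary-file"  # FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE
    return "memory"


def advise_reclaimable(buf):
    """Hint that `buf` (an mmap object) may be paged out; return the advice used, or None."""
    if not hasattr(buf, "madvise"):
        return None  # Windows, or Python older than 3.8
    if sys.platform.startswith("linux"):
        try:
            buf.madvise(MADV_COLD)  # nondestructive, Linux 5.4 and newer
            return "MADV_COLD"
        except OSError:
            return None  # older kernel: give no advice rather than risk the data
    advice = getattr(mmap, "MADV_DONTNEED", None)
    if advice is None:
        return None
    try:
        buf.madvise(advice)
    except OSError:
        return None
    return "MADV_DONTNEED"
```

A caller would write the shard's output into an anonymous mapping such as `mmap.mmap(-1, size)`, call `advise_reclaimable` once the shard is finished, and read the buffer again when assembling the archive.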

I need to re-add the fallback that doesn't serialize/deserialize anything when we decide not to partition and produce some evidence that the partitioning process doesn't add a substantial overhead if the target weight is large.

Other than to address memory pressure, I think it would be useful to partition more if we decide to enable more parallelism in package precompiles, coordinated via the make jobserver protocol.

@xal-0 marked this pull request as ready for review August 25, 2025 23:35

Labels

compiler:llvm (For issues that relate to LLVM), performance (Must go faster)

Development

Successfully merging this pull request may close these issues.

[regression] Julia 1.11 needs 60 % more compile time than Julia 1.10 with PackageCompiler 2.2

4 participants