Commit c97a5b2

aganea authored and AlexisPerry committed
[Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit release (llvm#91862)
### Context

We have a longstanding performance issue on Windows where, to this day, the default heap allocator still relies on locking. With core counts increasing, building and using LLVM with the default Windows heap allocator is sub-optimal. Notably, ThinLTO link times with LLD are extremely long and increase proportionally with the number of cores in the machine. In llvm@a6a37a2 I introduced the ability to build LLVM with several popular lock-free allocators. Downstream users, however, have to build their own toolchain with this option, and building an optimal toolchain is tedious and time-consuming. Additionally, LLVM is now integrated into Visual Studio, which AFAIK re-distributes the vanilla LLVM binaries/installer. The point is that many users are impacted and might not be aware of this problem, or are unable to build a more optimal version of the toolchain.

The symptom before this PR is that most of the CPU time goes to the kernel (darker blue) when linking with ThinLTO:

![16c_ryzen9_windows_heap](https://github.com/llvm/llvm-project/assets/37383324/86c3f6b9-6028-4c1a-ba60-a2fa3876fba7)

With this PR, most time is spent in user space (light blue):

![16c_ryzen9_rpmalloc](https://github.com/llvm/llvm-project/assets/37383324/646b88f3-5b6d-485d-a2e4-15b520bdaf5b)

On higher core count machines, before this PR, CPU usage becomes pretty much flat because of contention:

<img width="549" alt="VM_176_windows_heap" src="https://github.com/llvm/llvm-project/assets/37383324/f27d5800-ee02-496d-a4e7-88177e0727f0">

With this PR, similarly, most CPU time is now put to use:

<img width="549" alt="VM_176_with_rpmalloc" src="https://github.com/llvm/llvm-project/assets/37383324/7d4785dd-94a7-4f06-9b16-aaa4e2e505c8">

### Changes in this PR

The avenue I've taken here is to vendor/re-licence rpmalloc in-tree and use it when building the Windows 64-bit release. Given the permissive rpmalloc licence, prior discussions with the LLVM Foundation and @lattner suggested this vendoring. Rpmalloc's author (@mjansson) kindly agreed to ~~donate~~ re-licence the rpmalloc code in LLVM (please do correct me if I misinterpreted our past communications).

I've chosen rpmalloc because it's small and gives the best value overall. The source code is only 4 .c files. Rpmalloc statically replaces the weak CRT alloc symbols at link time and does no dynamic patching like mimalloc. As an alternative, there were several unsuccessful attempts made by Russell Gallop to use SCUDO in the past; please see the thread in https://reviews.llvm.org/D86694. If someone later comes up with a PR of similar performance that uses SCUDO, we could then delete this vendored rpmalloc folder.

I've added a new CMake flag `LLVM_ENABLE_RPMALLOC`, which essentially sets `LLVM_INTEGRATED_CRT_ALLOC` to the in-tree rpmalloc source.

### Performance

The most obvious test is profiling a ThinLTO linking step with LLD. I've used a Clang compilation as a testbed, i.e.:
```
set OPTS=/GS- /D_ITERATOR_DEBUG_LEVEL=0 -Xclang -O3 -fstrict-aliasing -march=native -flto=thin -fwhole-program-vtables -fuse-ld=lld
cmake -G Ninja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=TRUE -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_ENABLE_PDB=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON -DCMAKE_CXX_FLAGS="%OPTS%" -DCMAKE_C_FLAGS="%OPTS%" -DLLVM_ENABLE_LTO=THIN
```

I've profiled the linking step with no LTO cache, with PowerShell, such as:

```
Measure-Command { lld-link /nologo @CMakeFiles\clang.rsp /out:bin\clang.exe /implib:lib\clang.lib /pdb:bin\clang.pdb /version:0.0 /machine:x64 /STACK:10000000 /DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console /MANIFEST:EMBED,ID=1 }
```

Timings:

| Machine | Allocator | Time to link |
|--------|--------|--------|
| 16c/32t AMD Ryzen 9 5950X | Windows Heap | 10 min 38 sec |
| | **Rpmalloc** | **4 min 11 sec** |
| 32c/64t AMD Ryzen Threadripper PRO 3975WX | Windows Heap | 23 min 29 sec |
| | **Rpmalloc** | **2 min 11 sec** |
| | **Rpmalloc + /threads:64** | **1 min 50 sec** |
| 176 vCPU (2-socket) Intel Xeon Platinum 8481C (fixed clock 2.7 GHz) | Windows Heap | 43 min 40 sec |
| | **Rpmalloc** | **1 min 45 sec** |

This also improves overall performance when building with clang-cl. I've profiled a regular compilation of clang itself, i.e.:

```
set OPTS=/GS- /D_ITERATOR_DEBUG_LEVEL=0 /arch:AVX -fuse-ld=lld
cmake -G Ninja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=TRUE -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_ENABLE_PDB=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON -DCMAKE_CXX_FLAGS="%OPTS%" -DCMAKE_C_FLAGS="%OPTS%"
```

This saves approx. 30 sec when building on the Threadripper PRO 3975WX:

```
(default Windows Heap)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
  Time (mean ± σ):     392.716 s ±  3.830 s    [User: 17734.025 s, System: 1078.674 s]
  Range (min … max):   390.127 s … 399.449 s    5 runs

(rpmalloc)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
  Time (mean ± σ):     360.824 s ±  1.162 s    [User: 15148.637 s, System: 905.175 s]
  Range (min … max):   359.208 s … 362.288 s    5 runs
```
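
For reference, a minimal configure sketch that exercises the new flag; the generator, compilers and path below are placeholders borrowed from the testbed above, not requirements of the flag:

```
set ROOT=C:\src\git\llvm-project
cmake -G Ninja %ROOT%\llvm -DCMAKE_BUILD_TYPE=Release ^
  -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe ^
  -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON ^
  -DLLVM_ENABLE_RPMALLOC=ON
```
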
1 parent 32a6c3c commit c97a5b2

File tree

11 files changed (+5547, -4 lines)

llvm/CMakeLists.txt

Lines changed: 29 additions & 1 deletion
```diff
@@ -733,7 +733,35 @@ if( WIN32 AND NOT CYGWIN )
 endif()
 set(LLVM_NATIVE_TOOL_DIR "" CACHE PATH "Path to a directory containing prebuilt matching native tools (such as llvm-tblgen)")
 
-set(LLVM_INTEGRATED_CRT_ALLOC "" CACHE PATH "Replace the Windows CRT allocator with any of {rpmalloc|mimalloc|snmalloc}. Only works with CMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.")
+set(LLVM_ENABLE_RPMALLOC "" CACHE BOOL "Replace the CRT allocator with rpmalloc.")
+if(LLVM_ENABLE_RPMALLOC)
+  if(NOT (CMAKE_SYSTEM_NAME MATCHES "Windows|Linux"))
+    message(FATAL_ERROR "LLVM_ENABLE_RPMALLOC is only supported on Windows and Linux.")
+  endif()
+  if(LLVM_USE_SANITIZER)
+    message(FATAL_ERROR "LLVM_ENABLE_RPMALLOC cannot be used along with LLVM_USE_SANITIZER!")
+  endif()
+  if(WIN32)
+    if(CMAKE_CONFIGURATION_TYPES)
+      foreach(BUILD_MODE ${CMAKE_CONFIGURATION_TYPES})
+        string(TOUPPER "${BUILD_MODE}" uppercase_BUILD_MODE)
+        if(uppercase_BUILD_MODE STREQUAL "DEBUG")
+          message(WARNING "The Debug target isn't supported along with LLVM_ENABLE_RPMALLOC!")
+        endif()
+      endforeach()
+    else()
+      if(CMAKE_BUILD_TYPE AND uppercase_CMAKE_BUILD_TYPE STREQUAL "DEBUG")
+        message(FATAL_ERROR "The Debug target isn't supported along with LLVM_ENABLE_RPMALLOC!")
+      endif()
+    endif()
+  endif()
+
+  # Override the C runtime allocator with the in-tree rpmalloc
+  set(LLVM_INTEGRATED_CRT_ALLOC "${CMAKE_CURRENT_SOURCE_DIR}/lib/Support")
+  set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded")
+endif()
+
+set(LLVM_INTEGRATED_CRT_ALLOC "${LLVM_INTEGRATED_CRT_ALLOC}" CACHE PATH "Replace the Windows CRT allocator with any of {rpmalloc|mimalloc|snmalloc}. Only works with CMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.")
 if(LLVM_INTEGRATED_CRT_ALLOC)
   if(NOT WIN32)
     message(FATAL_ERROR "LLVM_INTEGRATED_CRT_ALLOC is only supported on Windows.")
```

llvm/docs/CMake.rst

Lines changed: 8 additions & 1 deletion
```diff
@@ -710,8 +710,15 @@ enabled sub-projects. Nearly all of these variable names begin with
     $ D:\git> git clone https://github.com/mjansson/rpmalloc
     $ D:\llvm-project> cmake ... -DLLVM_INTEGRATED_CRT_ALLOC=D:\git\rpmalloc
 
-  This flag needs to be used along with the static CRT, ie. if building the
+  This option needs to be used along with the static CRT, ie. if building the
   Release target, add -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.
+  Note that rpmalloc is also supported natively in-tree, see option below.
+
+**LLVM_ENABLE_RPMALLOC**:BOOL
+  Similar to LLVM_INTEGRATED_CRT_ALLOC, embeds the in-tree rpmalloc into the
+  host toolchain as a C runtime allocator. The version currently used is
+  rpmalloc 1.4.5. This option also implies linking with the static CRT, there's
+  no need to provide CMAKE_MSVC_RUNTIME_LIBRARY.
 
 **LLVM_LINK_LLVM_DYLIB**:BOOL
   If enabled, tools will be linked with the libLLVM shared library. Defaults
```
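
As documented above, the two routes differ only in where the rpmalloc sources come from and whether the static CRT must be requested explicitly. A sketch of both invocations (drive letters and the elided cmake arguments follow the docs and are purely illustrative):

```
rem Out-of-tree clone: the static CRT must be requested explicitly
D:\git> git clone https://github.com/mjansson/rpmalloc
D:\llvm-project> cmake ... -DLLVM_INTEGRATED_CRT_ALLOC=D:\git\rpmalloc -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded

rem In-tree rpmalloc 1.4.5: the static CRT is implied
D:\llvm-project> cmake ... -DLLVM_ENABLE_RPMALLOC=ON
```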

llvm/docs/ReleaseNotes.rst

Lines changed: 10 additions & 1 deletion
```diff
@@ -86,7 +86,16 @@ Changes to LLVM infrastructure
 Changes to building LLVM
 ------------------------
 
-- The ``LLVM_ENABLE_TERMINFO`` flag has been removed. LLVM no longer depends on
+* LLVM now has rpmalloc version 1.4.5 in-tree, as a replacement C allocator for
+  hosted toolchains. This supports several host platforms such as Mac or Unix,
+  however currently only the Windows 64-bit LLVM release uses it.
+  This has a great benefit in terms of build times on Windows when using ThinLTO
+  linking, especially on machines with lots of cores, to an order of magnitude
+  or more. Clang compilation is also improved. Please see some build timings in
+  (`#91862 <https://github.com/llvm/llvm-project/pull/91862#issue-2291033962>`_)
+  For more information, refer to the **LLVM_ENABLE_RPMALLOC** option in `CMake variables <https://llvm.org/docs/CMake.html#llvm-related-variables>`_.
+
+* The ``LLVM_ENABLE_TERMINFO`` flag has been removed. LLVM no longer depends on
   terminfo and now always uses the ``TERM`` environment variable for color
   support autodetection.
```

llvm/lib/Support/CMakeLists.txt

Lines changed: 2 additions & 1 deletion
```diff
@@ -101,9 +101,10 @@ if(LLVM_INTEGRATED_CRT_ALLOC)
     message(FATAL_ERROR "Cannot find the path to `git clone` for the CRT allocator! (${LLVM_INTEGRATED_CRT_ALLOC}). Currently, rpmalloc, snmalloc and mimalloc are supported.")
   endif()
 
-  if(LLVM_INTEGRATED_CRT_ALLOC MATCHES "rpmalloc$")
+  if((LLVM_INTEGRATED_CRT_ALLOC MATCHES "rpmalloc$") OR LLVM_ENABLE_RPMALLOC)
     add_compile_definitions(ENABLE_OVERRIDE ENABLE_PRELOAD)
     set(ALLOCATOR_FILES "${LLVM_INTEGRATED_CRT_ALLOC}/rpmalloc/rpmalloc.c")
+    set(delayload_flags "${delayload_flags} -INCLUDE:malloc")
   elseif(LLVM_INTEGRATED_CRT_ALLOC MATCHES "snmalloc$")
     set(ALLOCATOR_FILES "${LLVM_INTEGRATED_CRT_ALLOC}/src/snmalloc/override/new.cc")
     set(system_libs ${system_libs} "mincore.lib" "-INCLUDE:malloc")
```
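
The override here is purely static: rpmalloc.c is compiled with ENABLE_OVERRIDE/ENABLE_PRELOAD so that it provides malloc, free and friends itself, and the -INCLUDE:malloc directive forces the linker to take those definitions rather than the ones in the static CRT. A standalone sketch of that mechanism, not how the build actually invokes it (CMake drives these steps, and main.obj and the output names are hypothetical):

```
rem Compile the vendored allocator with the CRT override enabled (illustrative flags)
clang-cl /c /MT /DENABLE_OVERRIDE /DENABLE_PRELOAD llvm\lib\Support\rpmalloc\rpmalloc.c /Forpmalloc.obj

rem Force the linker to pull malloc from rpmalloc.obj rather than from the static CRT
lld-link /INCLUDE:malloc main.obj rpmalloc.obj /OUT:main.exe
```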

llvm/lib/Support/rpmalloc/CACHE.md

Lines changed: 19 additions & 0 deletions
```diff
@@ -0,0 +1,19 @@
+# Thread caches
+rpmalloc has a thread cache of free memory blocks which can be used in allocations without interfering with other threads or going to system to map more memory, as well as a global cache shared by all threads to let spans of memory pages flow between threads. Configuring the size of these caches can be crucial to obtaining good performance while minimizing memory overhead blowup. Below is a simple case study using the benchmark tool to compare different thread cache configurations for rpmalloc.
+
+The rpmalloc thread cache is configured to be unlimited, performance oriented as meaning default values, size oriented where both thread cache and global cache is reduced significantly, or disabled where both thread and global caches are disabled and completely free pages are directly unmapped.
+
+The benchmark is configured to run threads allocating 150000 blocks distributed in the `[16, 16000]` bytes range with a linear falloff probability. It runs 1000 loops, and every iteration 75000 blocks (50%) are freed and allocated in a scattered pattern. There are no cross thread allocations/deallocations. Parameters: `benchmark n 0 0 0 1000 150000 75000 16 16000`. The benchmarks are run on an Ubuntu 16.10 machine with 8 cores (4 physical, HT) and 12GiB RAM.
+
+The benchmark also includes results for the standard library malloc implementation as a reference for comparison with the nocache setting.
+
+![Ubuntu 16.10 random [16, 16000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=387883204&format=image)
+![Ubuntu 16.10 random [16, 16000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=1644710241&format=image)
+
+For single threaded case the unlimited cache and performance oriented cache settings have identical performance and memory overhead, indicating that the memory pages fit in the combined thread and global cache. As number of threads increase to 2-4 threads, the performance settings have slightly higher performance which can seem odd at first, but can be explained by low contention on the global cache where some memory pages can flow between threads without stalling, reducing the overall number of calls to map new memory pages (also indicated by the slightly lower memory overhead).
+
+As threads increase even more to 5-10 threads, the increased contention and eventual limit of global cache cause the unlimited setting to gain a slight advantage in performance. As expected the memory overhead remains constant for unlimited caches, while going down for performance setting when number of threads increases.
+
+The size oriented setting maintain good performance compared to the standard library while reducing the memory overhead compared to the performance setting with a decent amount.
+
+The nocache setting still outperforms the reference standard library allocator for workloads up to 6 threads while maintaining a near zero memory overhead, which is even slightly lower than the standard library. For use case scenarios where number of allocation of each size class is lower the overhead in rpmalloc from the 64KiB span size will of course increase.
```
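
The cache variants compared in this vendored CACHE.md are build-time choices rather than runtime options. A hedged sketch of how one might select them when compiling rpmalloc.c standalone; the ENABLE_UNLIMITED_CACHE / ENABLE_THREAD_CACHE / ENABLE_GLOBAL_CACHE switches are taken from upstream rpmalloc, are not touched by this commit, and should be verified against the vendored sources:

```
rem Unlimited caches (illustrative; defines from upstream rpmalloc, verify before use)
clang-cl /c /DENABLE_UNLIMITED_CACHE=1 rpmalloc.c

rem "nocache": disable both the per-thread cache and the global cache
clang-cl /c /DENABLE_THREAD_CACHE=0 /DENABLE_GLOBAL_CACHE=0 rpmalloc.c
```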
