Split the JIT compiler into an optimizer and concurrent compiler layer #44364
LGTM
CompileLayerT CompileLayer0;
CompileLayerT CompileLayer1;
CompileLayerT CompileLayer2;
CompileLayerT CompileLayer3;
Any reason not to make this an array too (like the OptimizeLayers)?
- CompileLayerT CompileLayer0;
- CompileLayerT CompileLayer1;
- CompileLayerT CompileLayer2;
- CompileLayerT CompileLayer3;
+ CompileLayerT CompileLayer[4];
The compiler was complaining that the compile layer's internal std::mutex is not move-constructible, which prevented me from creating the CompileLayerT instances as an array during construction. I suspect this will be less of an issue in C++17 with guaranteed copy elision, so it might be good to revisit then.
Okay, I see: the necessary expression (brace initialization) changed meaning in C++20, so it can only be compiled with -std=c++20 or later and gcc++-10 or later.
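To make the constraint discussed above concrete, here is a minimal sketch of a class that owns a std::mutex and is therefore neither copyable nor movable. ToyCompileLayer and ToyJIT are hypothetical names, not the PR's types, and the sketch assumes a non-explicit single-argument constructor, which the real CompileLayerT may not have. Under that assumption, a member array can still be built element by element with nested braces, because each element is constructed in place rather than moved.

```cpp
#include <cassert>
#include <mutex>
#include <type_traits>

// Hypothetical stand-in for a compile layer that owns a std::mutex.
class ToyCompileLayer {
public:
    // Deliberately non-explicit so a braced initializer can build it in place.
    ToyCompileLayer(int opt_level) : opt_level_(opt_level) {}

    int opt_level() const { return opt_level_; }

    // Takes the internal lock and returns how many compiles have run so far.
    int compile() {
        std::lock_guard<std::mutex> guard(lock_);
        return ++compiled_;
    }

private:
    std::mutex lock_;  // makes the class neither copyable nor movable
    int opt_level_;
    int compiled_ = 0;
};

static_assert(!std::is_move_constructible<ToyCompileLayer>::value,
              "a std::mutex member removes the implicit move constructor");
static_assert(!std::is_copy_constructible<ToyCompileLayer>::value,
              "a std::mutex member removes the implicit copy constructor");

// Hypothetical owner holding the layers as an array, as the review suggests.
struct ToyJIT {
    // Each inner brace list-initializes one element directly; no move occurs.
    ToyJIT() : layers{{0}, {1}, {2}, {3}} {}
    ToyCompileLayer layers[4];
};
```

If the real constructor is explicit or takes several arguments that cannot be written as a braced list, this form may not apply, which could be why the array version was deferred.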
LGTM
if (TM.addPassesToEmitMC(PM, Ctx, ObjStream))
    llvm_unreachable("Target does not support MC emission.");
Is there any measurable overhead to now recreating a SimpleCompiler and allocating a new legacy::PassManager for each function we compile?
I'll look into gathering these measurements soon
There is a compile-time difference between this PR and the equivalent master branch:
Master
Core.Compiler ──── 52.7408 seconds
Sysimage built. Summary:
Total ─────── 63.663837 seconds
Base: ─────── 26.390346 seconds 41.4526%
Stdlibs: ──── 37.272149 seconds 58.5452%
Precompilation complete. Summary:
Total ─────── 120.947376 seconds
Generation ── 90.811543 seconds 75.0835%
Execution ─── 30.135833 seconds 24.9165%
Performance counter stats for 'make -j80':
468,212.14 msec task-clock # 1.083 CPUs utilized
24,923,685 context-switches # 0.053 M/sec
1,770 cpu-migrations # 0.004 K/sec
5,328,582 page-faults # 0.011 M/sec
1,540,436,072,829 cycles # 3.290 GHz (83.18%)
78,252,750,530 stalled-cycles-frontend # 5.08% frontend cycles idle (83.33%)
273,153,982,805 stalled-cycles-backend # 17.73% backend cycles idle (83.47%)
2,185,937,524,077 instructions # 1.42 insn per cycle
# 0.12 stalled cycles per insn (83.49%)
414,972,231,481 branches # 886.291 M/sec (83.35%)
10,422,654,262 branch-misses # 2.51% of all branches (83.18%)
432.197106701 seconds time elapsed
430.046373000 seconds user
38.399162000 seconds sys
PR
Core.Compiler ──── 55.709 seconds
Sysimage built. Summary:
Total ─────── 65.823383 seconds
Base: ─────── 27.650612 seconds 42.0073%
Stdlibs: ──── 38.170825 seconds 57.9898%
Precompilation complete. Summary:
Total ─────── 128.880076 seconds
Generation ── 98.074744 seconds 76.0977%
Execution ─── 30.805332 seconds 23.9023%
Performance counter stats for 'make -j80':
481,978.79 msec task-clock # 1.076 CPUs utilized
22,612,514 context-switches # 0.047 M/sec
1,792 cpu-migrations # 0.004 K/sec
5,004,754 page-faults # 0.010 M/sec
1,580,536,814,458 cycles # 3.279 GHz (83.30%)
83,180,612,505 stalled-cycles-frontend # 5.26% frontend cycles idle (83.36%)
288,438,311,365 stalled-cycles-backend # 18.25% backend cycles idle (83.41%)
2,256,886,561,070 instructions # 1.43 insn per cycle
# 0.13 stalled cycles per insn (83.38%)
432,075,441,674 branches # 896.462 M/sec (83.30%)
10,795,688,654 branch-misses # 2.50% of all branches (83.25%)
447.817094335 seconds time elapsed
445.979433000 seconds user
36.109125000 seconds sys
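For reference, the relative slowdown can be computed directly from the figures above. This is a small sketch; percent_change and print_summary are hypothetical helper names, and the inputs are copied from the two runs as reported.

```cpp
#include <cassert>
#include <cstdio>

// Relative change of `candidate` versus `baseline`, in percent.
double percent_change(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

// Prints the slowdown of the PR run relative to master for a few metrics.
void print_summary() {
    std::printf("Core.Compiler:      %+.2f%%\n",
                percent_change(52.7408, 55.709));
    std::printf("Sysimage total:     %+.2f%%\n",
                percent_change(63.663837, 65.823383));
    std::printf("Precompile total:   %+.2f%%\n",
                percent_change(120.947376, 128.880076));
    std::printf("Wall-clock elapsed: %+.2f%%\n",
                percent_change(432.197106701, 447.817094335));
}
```

On these numbers the PR is roughly 3.6% slower in wall-clock time for the whole build and roughly 6.6% slower for precompilation. These are single runs, so some of the gap may be noise.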
I would guess that much of the new overhead comes from some combination of creating a new TargetMachine every time we compile a module vs just reusing the same one, creating that extra PassManager every time, or reallocating the object buffer every time. One thing we could do is simply lock around these shared resources, but if we move to a parallelized middle-end/backend we might want the extra concurrency opportunity here.
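The trade-off described above can be sketched without LLVM. ToyBackend is a hypothetical stand-in for the expensive per-compile state (TargetMachine, PassManager, object buffer); the two compiler classes are illustrative and are not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for expensive backend state.
struct ToyBackend {
    std::vector<char> object_buffer;

    // Pretend setup cost: reserve a buffer up front.
    ToyBackend() { object_buffer.reserve(1 << 16); }

    // "Emits" the module by copying its name; returns the object size.
    std::size_t emit(const std::string &module_name) {
        object_buffer.assign(module_name.begin(), module_name.end());
        return object_buffer.size();
    }
};

// Strategy A: one backend reused across compiles, serialized by a mutex.
// Setup is paid once, but concurrent callers wait on each other.
class SharedBackendCompiler {
public:
    std::size_t compile(const std::string &module_name) {
        std::lock_guard<std::mutex> guard(lock_);
        return backend_.emit(module_name);
    }

private:
    std::mutex lock_;
    ToyBackend backend_;
};

// Strategy B: a fresh backend per compile. No lock is needed, so calls
// can run in parallel, but the setup cost is paid every time.
class FreshBackendCompiler {
public:
    std::size_t compile(const std::string &module_name) const {
        ToyBackend backend;
        return backend.emit(module_name);
    }
};
```

A third option, not shown here, would be a pool of backends, which bounds the setup cost while keeping some concurrency.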
This PR broke the whitespace check.
JuliaLang#44364)
* Move optimization to IRTransformLayer
* Move to ConcurrentIRCompiler
* Create an optimization selection layer
By cutting the optimization out of the compile layer, we can remove our custom compiler and replace it with the ORC-provided compiler, which offers stronger thread-safety guarantees. This gets us closer to being able to insert a CompileOnDemand or IRSpeculationLayer to delay executable code creation.
Depends on #43827 to move the global context into jl_ExecutionEngine
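The layering described above (an optimization selection layer dispatching to per-level optimize layers, each forwarding to a compile layer) can be modeled with a toy sketch. None of these types are LLVM ORC classes: ToyModule, ToyObjectEmitter, ToyTransformLayer, ToySelectionLayer, and make_toy_pipeline are hypothetical names, and a module is represented as a plain string.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using ToyModule = std::string;  // hypothetical: a module is just text here

// Bottom layer: turns a module into "object code" with no optimization.
struct ToyObjectEmitter {
    std::string emit(const ToyModule &m) const { return "obj(" + m + ")"; }
};

// Middle layer: applies a transform, then forwards to the base layer.
struct ToyTransformLayer {
    std::function<ToyModule(ToyModule)> transform;
    const ToyObjectEmitter *base;

    std::string emit(ToyModule m) const {
        return base->emit(transform(std::move(m)));
    }
};

// Top layer: picks one transform layer by optimization level.
struct ToySelectionLayer {
    std::vector<ToyTransformLayer> by_level;

    std::string emit(ToyModule m, std::size_t opt_level) const {
        // Clamp out-of-range levels to the highest available one.
        if (opt_level >= by_level.size()) opt_level = by_level.size() - 1;
        return by_level[opt_level].emit(std::move(m));
    }
};

// Builds four transform layers (levels 0..3) over one shared emitter.
ToySelectionLayer make_toy_pipeline(const ToyObjectEmitter &base) {
    ToySelectionLayer selection;
    for (int level = 0; level < 4; ++level) {
        selection.by_level.push_back(ToyTransformLayer{
            [level](ToyModule m) { return m + "@O" + std::to_string(level); },
            &base});
    }
    return selection;
}
```

Because the emitter no longer performs optimization itself, it can be swapped for a different implementation without touching the transform layers, which is the property the description relies on.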