
Split the JIT compiler into an optimizer and concurrent compiler layer #44364


Merged (4 commits, Mar 2, 2022)

Conversation

pchintalapudi (Member):

By cutting the optimization out of the compile layer, we can remove our custom compiler and replace it with the ORC-provided compiler which provides stronger thread safety guarantees. This gets us closer to being able to insert a CompileOnDemand or IRSpeculationLayer to delay executable code creation.

Depends on #43827 to move the global context into jl_ExecutionEngine

@pchintalapudi pchintalapudi marked this pull request as draft February 27, 2022 01:26
@ViralBShah ViralBShah added the compiler:codegen Generation of LLVM IR and native code label Feb 27, 2022
@pchintalapudi pchintalapudi force-pushed the pc/opt-layer branch 2 times, most recently from 356446d to b63800f Compare February 27, 2022 18:19
@JeffBezanson JeffBezanson added the latency Latency label Mar 1, 2022
@pchintalapudi pchintalapudi marked this pull request as ready for review March 2, 2022 14:24
@vtjnash (Member) left a comment:

LGTM

Comment on lines +264 to +267
CompileLayerT CompileLayer0;
CompileLayerT CompileLayer1;
CompileLayerT CompileLayer2;
CompileLayerT CompileLayer3;
Member:

Any reason not to make this an array too (like the OptimizeLayers)

Suggested change
CompileLayerT CompileLayer0;
CompileLayerT CompileLayer1;
CompileLayerT CompileLayer2;
CompileLayerT CompileLayer3;
CompileLayerT CompileLayer[4];

Member Author:

The compiler was complaining about the compile layer's internal std::mutex not being move-constructible, which prevented me from actually creating the CompileLayerT instances during construction. I suspect this will be less of an issue in C++17 with guaranteed copy elision, so it might be good to revisit then.

Member:

Okay, I see: the necessary expression (brace initialization) changed meaning in C++20, so it can only be compiled with -std=c++20 or later and g++-10 or later.

@vtjnash (Member) left a comment:

LGTM

Comment on lines -464 to -465
if (TM.addPassesToEmitMC(PM, Ctx, ObjStream))
llvm_unreachable("Target does not support MC emission.");
Member:

Is there any measurable overhead to now recreating a SimpleCompiler and allocating a new legacy::PassManager for each function we compile?

Member Author:

I'll look into gathering these measurements soon.

Member Author:

There is a compilation time difference between this PR and the equivalent master branch:

Master

Core.Compiler ──── 52.7408 seconds

Sysimage built. Summary:
Total ───────  63.663837 seconds 
Base: ───────  26.390346 seconds 41.4526%
Stdlibs: ────  37.272149 seconds 58.5452%

Precompilation complete. Summary:
Total ─────── 120.947376 seconds
Generation ──  90.811543 seconds 75.0835%
Execution ───  30.135833 seconds 24.9165%

Performance counter stats for 'make -j80':

        468,212.14 msec task-clock                #    1.083 CPUs utilized          
        24,923,685      context-switches          #    0.053 M/sec                  
             1,770      cpu-migrations            #    0.004 K/sec                  
         5,328,582      page-faults               #    0.011 M/sec                  
 1,540,436,072,829      cycles                    #    3.290 GHz                      (83.18%)
    78,252,750,530      stalled-cycles-frontend   #    5.08% frontend cycles idle     (83.33%)
   273,153,982,805      stalled-cycles-backend    #   17.73% backend cycles idle      (83.47%)
 2,185,937,524,077      instructions              #    1.42  insn per cycle         
                                                  #    0.12  stalled cycles per insn  (83.49%)
   414,972,231,481      branches                  #  886.291 M/sec                    (83.35%)
    10,422,654,262      branch-misses             #    2.51% of all branches          (83.18%)

     432.197106701 seconds time elapsed

     430.046373000 seconds user
      38.399162000 seconds sys

PR

Core.Compiler ──── 55.709 seconds

Sysimage built. Summary:
Total ───────  65.823383 seconds 
Base: ───────  27.650612 seconds 42.0073%
Stdlibs: ────  38.170825 seconds 57.9898%

Precompilation complete. Summary:
Total ─────── 128.880076 seconds
Generation ──  98.074744 seconds 76.0977%
Execution ───  30.805332 seconds 23.9023%

Performance counter stats for 'make -j80':

        481,978.79 msec task-clock                #    1.076 CPUs utilized          
        22,612,514      context-switches          #    0.047 M/sec                  
             1,792      cpu-migrations            #    0.004 K/sec                  
         5,004,754      page-faults               #    0.010 M/sec                  
 1,580,536,814,458      cycles                    #    3.279 GHz                      (83.30%)
    83,180,612,505      stalled-cycles-frontend   #    5.26% frontend cycles idle     (83.36%)
   288,438,311,365      stalled-cycles-backend    #   18.25% backend cycles idle      (83.41%)
 2,256,886,561,070      instructions              #    1.43  insn per cycle         
                                                  #    0.13  stalled cycles per insn  (83.38%)
   432,075,441,674      branches                  #  896.462 M/sec                    (83.30%)
    10,795,688,654      branch-misses             #    2.50% of all branches          (83.25%)

     447.817094335 seconds time elapsed

     445.979433000 seconds user
      36.109125000 seconds sys

I would guess that much of the new overhead comes from some combination of creating a new TargetMachine every time we compile a module vs just reusing the same one, creating that extra PassManager every time, or reallocating the object buffer every time. One thing we could do is simply lock around these shared resources, but if we move to a parallelized middle-end/backend we might want the extra concurrency opportunity here.

@vtjnash vtjnash merged commit 15b5df4 into JuliaLang:master Mar 2, 2022
@DilumAluthge (Member): This PR broke the whitespace check.

#44415

staticfloat pushed a commit to JuliaCI/julia-buildkite-testing that referenced this pull request Mar 2, 2022

Split the JIT compiler into an optimizer and concurrent compiler layer (JuliaLang#44364)

* Move optimization to IRTransformLayer
* Move to ConcurrentIRCompiler
* Create an optimization selection layer
@pchintalapudi pchintalapudi deleted the pc/opt-layer branch March 6, 2022 22:02