codegen: add a pass for late conversion of known modify ops to call atomicrmw #57010
Conversation
Why is there such a large aotcompile.cpp diff? Looks unrelated.
It is mostly support code for this: if you don't have the IR emitted for the target, then it has to use a loop & call, which isn't what people want to see, so we need to make sure to also emit the target code into all of the correct compile units. We could land some of it separately, but it wouldn't do anything (and be mostly untested) until this landed.
(force-pushed from 666985c to c98b43f)
Is this ready to merge? Edit: no, CI seems pretty unhappy with this currently.
Also there's a merge conflict.
Please review, but hold off on merging. I think there might be one other PR I'd like to try to merge first.
CI failures look real: specifically, the following seems to be a multithreading test that breaks with this PR:
(force-pushed from 67ad4b6 to dbecbe8)
```diff
@@ -9476,7 +9552,7 @@ static jl_llvm_functions_t
     Instruction &prologue_end = ctx.builder.GetInsertBlock()->back();

     // step 11a. Emit the entry safepoint
-    if (JL_FEAT_TEST(ctx, safepoint_on_entry))
+    if (params.safepoint_on_entry && JL_FEAT_TEST(ctx, safepoint_on_entry))
```
This looks weird; shouldn't ctx be derived from params?
ctx derives from params.params, not params (I agree the naming choice here was a choice).
This broke GPUCompiler.jl's tests, together with

Lines 797 to 798 in 9b63fd9:

```cpp
if (jl_ir_inlining_cost((jl_value_t*)src) < UINT16_MAX)
    params.safepoint_on_entry = false; // ensure we don't block ExpandAtomicModifyPass from inlining this code if applicable
```

... which seems like a very ad hoc way of avoiding safepoints? It also introduces an inconsistency wrt. `code_llvm`, which doesn't go through `jl_emit_native`.
Yeah, we might just want to emit two separate copies of the function (one for inlining and one not), which is essentially what `code_llvm` does. I went back and forth while making the PR about which strategy made more sense.
(force-pushed from dbecbe8 to 7eb8691)
…tomicrmw

The ExpandAtomicModify pass can recognize our pseudo-intrinsic julia.atomicmodify and convert it into one of the known atomicrmw expressions, or simplify it with more inlining, as applicable. This ensures that our `@atomic` modify is now as fast as `Threads.Atomic` for the cases we currently implement.
(force-pushed from 7eb8691 to 5cfdf66)
This is now ready for someone to review, particularly for any correctness implications of optimizing atomics that I haven't already addressed.
```diff
@@ -823,7 +849,6 @@ void *jl_emit_native_impl(jl_array_t *codeinfos, LLVMOrcThreadSafeModuleRef llvm
     size_t idx = 0;
     for (auto &global : params.global_targets) {
         gvars[idx] = global.second->getName().str();
-        global.second->setInitializer(literal_static_pointer_val(global.first, global.second->getValueType()));
```
This broke another part of GPUCompiler.jl, which relies on the presence of `inttoptr`s in the IR to reconstruct object pointers. Instead, the IR now only contains nulls (set further down):
```julia
function kernel()
    undefined
    return
end
```

```llvm
@"*Main.##280.undefined#41043" = internal global ptr null, !julia.constgv !0
@"jl_sym#undefined#41044" = internal global ptr null, !julia.constgv !0
@"jl_global#41045" = internal global ptr null, !julia.constgv !0

define void @julia_kernel_41041() #0 !dbg !5 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %"*Main.##280.undefined#41043" = load ptr, ptr @"*Main.##280.undefined#41043"
  %0 = call ptr addrspace(10) @jl_get_binding_value_seqcst(ptr %"*Main.##280.undefined#41043")
  %1 = icmp ne ptr addrspace(10) %0, null
  br i1 %1, label %ok, label %err

err:
  %"jl_sym#undefined#41044" = load ptr, ptr @"jl_sym#undefined#41044"
  %2 = addrspacecast ptr %"jl_sym#undefined#41044" to ptr addrspace(12)
  %"jl_global#41045" = load ptr, ptr @"jl_global#41045"
  %3 = addrspacecast ptr %"jl_global#41045" to ptr addrspace(12)
  call void @ijl_undefined_var_error(ptr addrspace(12) %2, ptr addrspace(12) %3)
  unreachable

ok:
  ret void
}
```
Can you explain how we should now use `jl_emit_native` to get proper IR?
#57010 overhauled how global variables are initialized in code, breaking GPUCompiler.jl. After talking to @vtjnash, we are apparently supposed to use `jl_get_llvm_gvs` now to get the Julia memory location of a global variable (for a binding or constant value); however, the elements in there don't exactly correspond to the global variables in the module (not all module globals are "managed" by Julia, and the order is different). To make GPUCompiler.jl work again, I propose renaming `jl_get_llvm_gvs` to `jl_get_llvm_gv_inits`, and making `jl_get_llvm_gvs` instead return the list of the managed global variables, making it possible to perform the global variable initialization that Julia previously did, using something like:

```julia
if VERSION >= v"1.13.0-DEV.533"
    num_gvars = Ref{Csize_t}(0)
    @ccall jl_get_llvm_gvs(native_code::Ptr{Cvoid}, num_gvars::Ptr{Csize_t}, C_NULL::Ptr{Cvoid})::Nothing
    gvs = Vector{Ptr{LLVM.API.LLVMOpaqueValue}}(undef, num_gvars[])
    @ccall jl_get_llvm_gvs(native_code::Ptr{Cvoid}, num_gvars::Ptr{Csize_t}, gvs::Ptr{LLVM.API.LLVMOpaqueValue})::Nothing
    inits = Vector{Ptr{Cvoid}}(undef, num_gvars[])
    @ccall jl_get_llvm_gv_inits(native_code::Ptr{Cvoid}, num_gvars::Ptr{Csize_t}, inits::Ptr{Cvoid})::Nothing
    for (gv_ref, init) in zip(gvs, inits)
        gv = GlobalVariable(gv_ref)
        ptr = const_inttoptr(ConstantInt(Int64(init)), value_type(gv))
        initializer!(gv, ptr)
        obj = Base.unsafe_pointer_to_objref(init)
        @safe_info "Resolved $(name(gv)) to $obj"
    end
end
```
The ExpandAtomicModify pass can recognize our pseudo-intrinsic julia.atomicmodify and convert it into one of the known atomicrmw expressions, or simplify it with more inlining, as applicable. Ideally we could get this pass upstreamed, since there's nothing Julia-specific about it, and LLVM's IR cannot express this quite correctly without making it a new intrinsic.

This ensures that our `@atomic` modify is now as fast as `Threads.Atomic`!