
codegen: add a pass for late conversion of known modify ops to call atomicrmw #57010


Merged
vtjnash merged 2 commits into master from jn/atomic-modify-opt on May 7, 2025

Conversation

@vtjnash (Member) commented Jan 10, 2025

The ExpandAtomicModify pass can recognize our pseudo-intrinsic julia.atomicmodify and convert it into one of the known atomicrmw expressions, or simplify it with more inlining, as applicable. Ideally we could get this pass upstreamed, since nothing about it is specific to Julia, and LLVM's IR cannot express this quite correctly without making it a new intrinsic.

This ensures that our @atomic modify is now as fast as Threads.Atomic!

julia> @code_llvm Threads.atomic_add!(r, 10)
; Function Signature: atomic_add!(Base.Threads.Atomic{Int64}, Int64)
;  @ atomics.jl:307 within `atomic_add!`
define i64 @"julia_atomic_add!_2680"(ptr noundef nonnull align 8 dereferenceable(8) %"x::Atomic", i64 signext %"v::Int64") #0 {
top:
; ┌ @ Base_compiler.jl:94 within `modifyproperty!`
   %0 = atomicrmw add ptr %"x::Atomic", i64 %"v::Int64" acq_rel, align 8
; └
; ┌ @ Base_compiler.jl:54 within `getproperty`
   ret i64 %0
; └
}
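
For comparison, a hypothetical field-based counter (illustrative; `Counter` is not from this PR) whose modify the pass can now lower to the same single atomicrmw:

```julia
# A mutable struct with an atomically-accessed field; `@atomic c.count += 1`
# is the `modifyproperty!` form that the new pass can turn into
# `atomicrmw add`.
mutable struct Counter
    @atomic count::Int
end

c = Counter(0)
@atomic c.count += 1   # returns the old => new pair, like modifyfield!
```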

@gbaraldi (Member) commented

Why is there such a large aotcompile.cpp diff? Looks unrelated

@oscardssmith added the labels performance (Must go faster), multithreading (Base.Threads and related functionality), and atomics on Jan 10, 2025
@vtjnash (Member, Author) commented Jan 13, 2025

> Why is there such a large aotcompile.cpp diff? Looks unrelated

It is mostly support code for this: if you don't have the IR emitted for the target, then it has to use a loop&call, which isn't what people want to see, so we need to make sure to also emit the target code into all of the correct compile units. We could land some of it separately, but it wouldn't do anything (and would be mostly untested) until this landed.
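
For illustration, here is a minimal Julia sketch (assumed semantics, not this PR's actual output) of that loop&call fallback: when the modify op isn't visible for inlining, the backend can only read the field, call the op, and retry with a compare-and-swap. `Box` and `modify_fallback!` are illustrative names, not part of this PR.

```julia
mutable struct Box
    @atomic x::Int
end

# CAS-loop fallback: read the field, apply `op`, and try to swap the
# result in, retrying if another thread won the race.
function modify_fallback!(b::Box, op, v)
    old = @atomic b.x
    while true
        new = op(old, v)   # opaque call when `op` could not be inlined
        old, ok = @atomicreplace b.x old => new
        ok && return old => new
    end
end
```

When `op` is visible and recognized (e.g. `+` on an integer field), the new pass replaces this whole loop with a single atomicrmw.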

@vtjnash force-pushed the jn/atomic-modify-opt branch 2 times, most recently from 666985c to c98b43f on January 17, 2025 14:32
@oscardssmith (Member) commented Jan 27, 2025

Is this ready to merge? Edit: no. CI seems pretty unhappy with this currently.

@DilumAluthge (Member) commented

Also there's a merge conflict.

@vtjnash (Member, Author) commented Jan 28, 2025

Please review, but hold off on merging; I think there might be one other PR I want to try to merge first.

@oscardssmith (Member) commented

CI failures look real. Specifically, the following multithreading test breaks with this PR:

threads_exec.jl with JULIA_NUM_THREADS == 1,0: Test Failed at /cache/build/tester-demeter6-15/julialang/julia-master/julia-67ad4b675e/share/julia/test/threads_exec.jl:444
  Expression: accum[] + Int(var[]) === sum(0:nloops)
   Evaluated: 129821317250305000 === 500500

@vtjnash force-pushed the jn/atomic-modify-opt branch from 67ad4b6 to dbecbe8 on February 8, 2025 20:38
@@ -9476,7 +9552,7 @@ static jl_llvm_functions_t
     Instruction &prologue_end = ctx.builder.GetInsertBlock()->back();

     // step 11a. Emit the entry safepoint
-    if (JL_FEAT_TEST(ctx, safepoint_on_entry))
+    if (params.safepoint_on_entry && JL_FEAT_TEST(ctx, safepoint_on_entry))
A member commented:

This looks weird; shouldn't ctx be derived from params?

vtjnash (Member, Author) replied:

ctx derives from params.params, not params (I agree the naming choice here was a choice).

A member commented:

This broke GPUCompiler.jl's tests, together with julia/src/aotcompile.cpp lines 797–798 (at 9b63fd9):

if (jl_ir_inlining_cost((jl_value_t*)src) < UINT16_MAX)
    params.safepoint_on_entry = false; // ensure we don't block ExpandAtomicModifyPass from inlining this code if applicable

... which seems like a very ad hoc way of avoiding safepoints? It also introduces an inconsistency wrt. code_llvm, which doesn't go through jl_emit_native.

vtjnash (Member, Author) replied:

Yeah, we might just want to emit two separate copies of the function (one for inlining and one not for inlining), which is essentially what code_llvm does. I went back and forth while making the PR on which strategy made more sense.

vtjnash added 2 commits March 28, 2025 14:35
…tomicrmw

The ExpandAtomicModify pass can recognize our pseudo-intrinsic
julia.atomicmodify and convert it into one of the known atomicrmw
expressions, or simplify it with more inlining, as applicable.

This ensures that our `@atomic` modify is now as fast as
`Threads.Atomic` for the cases we currently implement.
@vtjnash force-pushed the jn/atomic-modify-opt branch from 7eb8691 to 5cfdf66 on March 28, 2025 14:35
@vtjnash (Member, Author) commented Mar 28, 2025

This is now ready for someone to review, particularly for any correctness implications of optimizing atomics that I haven't already addressed.

@oscardssmith requested a review from gbaraldi March 28, 2025 19:49
@vtjnash merged commit 5a62fe0 into master May 7, 2025
4 of 7 checks passed
@vtjnash deleted the jn/atomic-modify-opt branch May 7, 2025 15:03
@maleadt mentioned this pull request May 8, 2025
@@ -823,7 +849,6 @@ void *jl_emit_native_impl(jl_array_t *codeinfos, LLVMOrcThreadSafeModuleRef llvm
     size_t idx = 0;
     for (auto &global : params.global_targets) {
         gvars[idx] = global.second->getName().str();
-        global.second->setInitializer(literal_static_pointer_val(global.first, global.second->getValueType()));
A member commented:

This broke another part of GPUCompiler.jl, which relies on the presence of inttoptrs in the IR to reconstruct object pointers. Instead, the IR now only contains nulls (set further down):

function kernel()
  undefined
  return
end
@"*Main.##280.undefined#41043" = internal global ptr null, !julia.constgv !0
@"jl_sym#undefined#41044" = internal global ptr null, !julia.constgv !0
@"jl_global#41045" = internal global ptr null, !julia.constgv !0

define void @julia_kernel_41041() #0 !dbg !5 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %"*Main.##280.undefined#41043" = load ptr, ptr @"*Main.##280.undefined#41043"
  %0 = call ptr addrspace(10) @jl_get_binding_value_seqcst(ptr %"*Main.##280.undefined#41043")
  %1 = icmp ne ptr addrspace(10) %0, null
  br i1 %1, label %ok, label %err

err:
  %"jl_sym#undefined#41044" = load ptr, ptr @"jl_sym#undefined#41044"
  %2 = addrspacecast ptr %"jl_sym#undefined#41044" to ptr addrspace(12)
  %"jl_global#41045" = load ptr, ptr @"jl_global#41045"
  %3 = addrspacecast ptr %"jl_global#41045" to ptr addrspace(12)
  call void @ijl_undefined_var_error(ptr addrspace(12) %2, ptr addrspace(12) %3)
  unreachable

ok:
  ret void
}

Can you explain how we should now use jl_emit_native to get proper IR?

gbaraldi pushed a commit that referenced this pull request May 21, 2025
#57010 overhauled how global
variables are initialized in code, breaking GPUCompiler.jl. After
talking to @vtjnash, we are apparently supposed to use `jl_get_llvm_gvs`
now to get the Julia memory location of a global variable (for a binding
or constant value); however, the elements in there don't exactly
correspond to the global variables in the module (not all module globals
are "managed" by Julia, and the order is different).

To make GPUCompiler.jl work again, I propose renaming
`jl_get_llvm_gvs` to `jl_get_llvm_gv_inits` and making
`jl_get_llvm_gvs` instead return the list of managed global
variables, making it possible to perform the global-variable
initialization that Julia previously did, using something like:
```julia
if VERSION >= v"1.13.0-DEV.533"
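    # Two-call pattern: the first call with a NULL buffer only reports the
    # count of managed globals; the second call fills in the value refs.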
    num_gvars = Ref{Csize_t}(0)
    @CCall jl_get_llvm_gvs(native_code::Ptr{Cvoid}, num_gvars::Ptr{Csize_t},
                           C_NULL::Ptr{Cvoid})::Nothing
    gvs = Vector{Ptr{LLVM.API.LLVMOpaqueValue}}(undef, num_gvars[])
    @CCall jl_get_llvm_gvs(native_code::Ptr{Cvoid}, num_gvars::Ptr{Csize_t},
                           gvs::Ptr{LLVM.API.LLVMOpaqueValue})::Nothing
    inits = Vector{Ptr{Cvoid}}(undef, num_gvars[])
    @CCall jl_get_llvm_gv_inits(native_code::Ptr{Cvoid}, num_gvars::Ptr{Csize_t},
                                inits::Ptr{Cvoid})::Nothing

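    # Install the initializer that jl_emit_native previously set itself:
    # an inttoptr constant pointing at the Julia-managed memory location.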
    for (gv_ref, init) in zip(gvs, inits)
        gv = GlobalVariable(gv_ref)
        ptr = const_inttoptr(ConstantInt(Int64(init)), value_type(gv))
        initializer!(gv, ptr)
        obj = Base.unsafe_pointer_to_objref(init)
        @safe_info "Resolved $(name(gv)) to $obj"
    end
end
```
maleadt added a commit to JuliaGPU/GPUCompiler.jl that referenced this pull request May 22, 2025
maleadt added a commit to JuliaGPU/GPUCompiler.jl that referenced this pull request May 22, 2025