
RFC: unsafe_bitcast #43065

Closed

wants to merge 2 commits

Conversation

@tkf (Member) commented Nov 13, 2021

This PR tries to add a (hopefully) well-defined low-level type-punning facility usable for arbitrary pointer-free immutable objects.

The main use case is converting an arbitrary pointer-free immutable object to an opaque chunk of bytes (e.g., NTuple{N,UInt8}) and back. For example, this will be useful for using a rich set of immutable objects with GPU APIs like CUDA.shfl_sync etc., which are currently defined by reinterpreting some small subset of types to integers. If we can cast nested structs with Union fields, we can correctly execute various transducers with complex state transitions on GPUs. Another important use case is an emulation of tearable atomics, which is useful for efficient concurrent algorithms such as work-stealing deques and seqlocks.

However, casting one type to another (aka type punning) is known to be hard to do correctly when the compiler wants to infer something from the type system. The fact that there are one-hour technical talks on how to do it correctly (e.g., CppCon 2019: Timur Doumler, “Type punning in modern C++” on YouTube) is a good indication that we need an API in Base with a clear definition of when and how it can be used. For example, C++20 now has std::bit_cast as a similar API.

I haven't had time to dig deep enough into this to convince myself that the API I came up with is OK. But given the recent discussion (#32660, #42968, #43035) on expanding what reinterpret does, I think it's worth opening this as an alternative take on it; i.e., an unsafe (narrow-contract) API with wider use cases but a weaker guarantee (no cross-process roundtrip). So, I'd appreciate it if people who know the Julia and LLVM compilers could look at it.

One aspect of the API that I'm still worried about is what we can say about the returned object when the input type contains some padding. I wonder if we should instead create an "asymmetric" API unsafe_bitembed(T, x::S) -> y::T and unsafe_bitextract(S, y::T) -> x::S, where S can contain padding but T must not. We can then clearly document that T is a chunk of opaque bytes that is only usable in a meaningful way after unsafe_bitextract. I don't know if it helps the compiler, though.
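For concreteness, the asymmetric pair could look roughly like this (a hypothetical sketch: unsafe_bitembed/unsafe_bitextract do not exist anywhere, and the bodies below just express the intended contract in terms of the unsafe_bitcast proposed in this PR):

```julia
# Hypothetical sketch -- not an existing API.  `T` must be padding-free
# (an opaque byte chunk such as NTuple{N,UInt8}); `S` may contain padding.

function unsafe_bitembed(::Type{T}, x::S) where {T,S}
    sizeof(T) == sizeof(S) || throw(ArgumentError("size mismatch"))
    return Base.unsafe_bitcast(T, x)  # the primitive proposed in this PR
end

# Only valid on a `y` produced by `unsafe_bitembed(T, x::S)`.
function unsafe_bitextract(::Type{S}, y::T) where {S,T}
    sizeof(S) == sizeof(T) || throw(ArgumentError("size mismatch"))
    return Base.unsafe_bitcast(S, y)
end
```

The asymmetry is purely documentation-level here; the point is that only the embed-then-extract round trip would be defined behavior.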

ping @vtjnash @JeffBezanson @Keno @vchuravy @maleadt

tkf added 2 commits November 13, 2021 01:29
This is required for using these macros before `esc` is defined.  An
alternative fix would be to define `esc` earlier.  However, avoiding
`esc` here seems more compatible with the other macros defined at this
stage.
@tkf tkf added labels on Nov 13, 2021: needs compat annotation (Add !!! compat "Julia x.y" to the docstring), needs docs (Documentation for this change is required), needs news (A NEWS entry is required for this change)
@tkf tkf mentioned this pull request Nov 13, 2021
3 tasks
Comment on lines +478 to +481
datatype_pointerfree(T) ||
throw(ArgumentError("output type $T may contain a boxed object"))
datatype_pointerfree(S) ||
throw(ArgumentError("input type $S may contain a boxed object"))
Member

Suggested change
-datatype_pointerfree(T) ||
-    throw(ArgumentError("output type $T may contain a boxed object"))
-datatype_pointerfree(S) ||
-    throw(ArgumentError("input type $S may contain a boxed object"))
+isbitstype(T) || throw(ArgumentError("output type $T has undefined layout"))
+isbitstype(S) || throw(ArgumentError("input type $S has undefined layout"))

Member Author

Hmm... Can we support Union fields? That's one of the main motivations.
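For example (if I read the layout code right), a pointer-free struct with an isbits-Union field passes the datatype_pointerfree check but fails isbitstype:

```julia
struct UnionField
    x::Union{Int32,Float32}  # isbits Union, stored inline with a type tag
end

Base.datatype_pointerfree(UnionField)  # true: no boxed references in the layout
isbitstype(UnionField)                 # false: Union field disqualifies it
```

So switching the guard to isbitstype would rule out exactly the nested-Union case motivating this PR.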

Member

IIRC, you are prohibited from accessing the union-bits fields in most cases (TBAA violation)

@JeffBezanson (Member)

How do you operate on a tuple of bytes with an unknown format? I don't really understand how this would be used.

@tkf (Member Author) commented Nov 17, 2021

The point is to use opaque bytes only for "semantic-agnostic" low-level APIs like transport and storage.

I think a good example is CUDA.shfl_sync etc. With Base.unsafe_bitcast, a simple-minded generic implementation can be defined as

function CUDA.shfl_recurse(op, x)
    xbytes = Base.unsafe_bitcast(NTuple{sizeof(x),UInt8}, x)
    ybytes = map(op, xbytes)
    y = Base.unsafe_bitcast(typeof(x), ybytes)
    return y
end

Here, we serialize the input x into the opaque bytes xbytes. This is then moved across GPU threads by map(op, xbytes) ("transport"). Once we get the value ybytes for this thread, it is deserialized back to a value y of the original type.

From shfl_recurse above, CUDA.jl will derive the public shfl_* APIs. In a real implementation, I imagine you'd probably want to pad x and use UInt32 instead.

We can then use, e.g.,

val = something(shfl_down_sync(mask, Some{T}(val), delta))

to move val across threads even if T is a Union.

The use case for concurrent data structures would be similar; i.e., pointer-free values are internally serialized into and deserialized from opaque bytes which are stored into (say) Vector{UInt}. This is hidden inside the implementation. The opaque bytes representation is used only for invoking low-level "tearable" relaxed atomic accesses.
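In rough outline, the tearable-atomics pattern I have in mind looks like this (a hypothetical sketch: atomic_store_relaxed!/atomic_load_relaxed stand in for word-sized relaxed atomic accesses and do not exist under these names, unsafe_bitcast is the API proposed in this PR, and sizeof(T) is assumed to be a multiple of the word size):

```julia
# Tearable store/load of a pointer-free value through a word buffer.
# Individual word accesses may tear; a surrounding protocol (e.g. a seqlock
# sequence counter) is what lets readers detect and discard torn loads.
function tearable_store!(buf::Vector{UInt}, x::T) where {T}
    n = sizeof(T) ÷ sizeof(UInt)
    words = Base.unsafe_bitcast(NTuple{n,UInt}, x)
    for i in 1:n
        atomic_store_relaxed!(buf, i, words[i])  # hypothetical relaxed store
    end
end

function tearable_load(::Type{T}, buf::Vector{UInt}) where {T}
    n = sizeof(T) ÷ sizeof(UInt)
    words = ntuple(i -> atomic_load_relaxed(buf, i), n)  # hypothetical relaxed load
    return Base.unsafe_bitcast(T, words)
end
```

Only the buffer ever sees the opaque words; user code sees values of type T.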

@JeffBezanson (Member)

What does map(op, xbytes) do? Is this a different meaning of map? What might op be?

@tkf (Member Author) commented Nov 19, 2021

Sorry, I should've chosen a more concrete example. map(op, xbytes) is the basic map(::Function, ::NTuple). The closure op is something like x -> shfl_down_sync(mask, x, delta). So, concretely, we have

function shfl_down_sync(mask, x, delta)
    @assert sizeof(x) == 16  # the following code is the specialization for 16-byte typeof(x)
    xb1, xb2, xb3, xb4 = Base.unsafe_bitcast(NTuple{4,Int32}, x)
    yb1 = shfl_down_sync(mask, xb1, delta) # calls llvm.nvvm.shfl.sync.down.i32 intrinsic
    yb2 = shfl_down_sync(mask, xb2, delta)
    yb3 = shfl_down_sync(mask, xb3, delta)
    yb4 = shfl_down_sync(mask, xb4, delta)
    ybytes = (yb1, yb2, yb3, yb4)
    y = Base.unsafe_bitcast(typeof(x), ybytes)
    return y
end

(I changed NTuple{_,UInt8} to NTuple{4,Int32} since it looks like there's no llvm.nvvm.shfl.sync.down.i8)

(Comments by @Moelf, @tkf, @vtjnash, and @N5N3 here were marked as off-topic.)

@tkf (Member Author) commented Feb 15, 2022

I hid the above comments as off-topic. This PR is all about casting immutable pointer-free values. I think it would be more fruitful to discuss arrays and buffers separately elsewhere.

@maleadt (Member) commented Jun 8, 2022

Some comments by @vtjnash (as I understood them; it has been a while since I looked at this PR):

  • it would be better to extend Core.bitcast to allow this functionality rather than introducing a different function
  • these kinds of casts might result in padding becoming observable, which LLVM can then turn into UB; this should not matter for the CUDA.shfl use case, though, as undef values are probably allowed there
  • for the CUDA atomics use case, Julia's atomic intrinsics already correctly deal with the above problem at the codegen level

@KristofferC (Member)

these kinds of casts might result in padding becoming observable, which LLVM can then turn into UB.

Just cross-referencing this here: #41071

@tkf (Member Author) commented Jun 8, 2022

Another note to myself from the chat with Jameson and Tim: all we can do with bitcast'ed bytes (in general [1]) is store them, load them, and bitcast them back to the original type. CUDA's shfl can be modeled as a store plus a load, so it's OK to use bitcast there.

Footnotes

  [1] With some extra information on the types involved (e.g., no padding), there may be other allowed operations. But store and load are the minimal set of allowed operations.
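A sketch of that minimal contract (hypothetical code; unsafe_bitcast is the API proposed in this PR, and buf is assumed to hold at least sizeof(T) bytes):

```julia
# The only pattern allowed in general: bitcast, store, load, bitcast back.
function roundtrip_through_buffer!(buf::Vector{UInt8}, x::T) where {T}
    bytes = Base.unsafe_bitcast(NTuple{sizeof(T),UInt8}, x)  # serialize
    for i in 1:sizeof(T)
        buf[i] = bytes[i]                                    # store
    end
    loaded = ntuple(i -> buf[i], Val(sizeof(T)))             # load
    return Base.unsafe_bitcast(T, loaded)                    # back to T
end
```

Inspecting or computing on the intermediate bytes is outside the contract unless the type is known to have no padding.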

@vtjnash vtjnash closed this Jul 17, 2023
@vtjnash (Member) commented Jul 17, 2023

reinterpret now exists for all bitstypes, and conscientiously skips over undefined bits to avoid exposing UB to the user.
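For illustration, the round trip with today's reinterpret looks like this (a minimal example for a padding-free type; as far as I know this requires a recent Julia version):

```julia
x = (1.5, 2.5)                             # NTuple{2,Float64}, no padding
bytes = reinterpret(NTuple{16,UInt8}, x)   # opaque chunk of bytes
y = reinterpret(NTuple{2,Float64}, bytes)  # back to the original type
@assert y == x
```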

@maleadt (Member) commented Jun 20, 2024

It may still be valuable to have this as an unsafe operation, because reinterpret is often not GPU compatible. For example, doing a simple contiguous tuple reinterpret:

function vecdot_q4_kernel_MVP(scales)
  kmask1, kmask2, kmask3 = 0x3f3f3f3f, 0x0f0f0f0f, 0x03030303
  scales_uint32 = reinterpret(NTuple{3, UInt32}, scales)
  utmp0, utmp1, utmp2 = scales_uint32[1], scales_uint32[2], scales_uint32[3]

  return
end

function main()
    scales = UInt8.((1,2,3,4, 2,3,4,5, 3,4,5,6)) # NTuple{12, UInt8}
    @cuda threads=1 blocks=1 vecdot_q4_kernel_MVP(scales)
end
This fails during GPU compilation with:

Reason: unsupported dynamic function invocation (call to -)
Stacktrace:
 [1] packedsize
   @ ./reinterpretarray.jl:763
 [2] _reinterpret
   @ ./reinterpretarray.jl:805
 [3] reinterpret
   @ ./essentials.jl:584
 [4] vecdot_q4_kernel_MVP
   @ ~/Julia/pkg/CUDA/wip2.jl:9
Reason: unsupported dynamic function invocation (call to padding(T::DataType, baseoffset::Int64) @ Base reinterpretarray.jl:701)
Stacktrace:
 [1] padding
   @ ./reinterpretarray.jl:702
 [2] packedsize
   @ ./reinterpretarray.jl:762
 [3] _reinterpret
   @ ./reinterpretarray.jl:804
 [4] reinterpret
   @ ./essentials.jl:584
 [5] vecdot_q4_kernel_MVP
   @ ~/Julia/pkg/CUDA/wip2.jl:9
Reason: unsupported dynamic function invocation (call to -)
Stacktrace:
 [1] packedsize
   @ ./reinterpretarray.jl:763
 [2] _reinterpret
   @ ./reinterpretarray.jl:804
 [3] reinterpret
   @ ./essentials.jl:584
 [4] vecdot_q4_kernel_MVP
   @ ~/Julia/pkg/CUDA/wip2.jl:9

Worth reopening over? Or could we improve reinterpret for the contiguous case?

@vtjnash (Member) commented Jun 20, 2024

That seems more like a CUDA.jl issue (refusing to constant fold packedsize and padding)?

@maleadt (Member) commented Jun 20, 2024

refusing to constant fold packedsize and padding

Ah, if that's expected I will look into that. Thanks.
