
RFC: unsafe_bitcast #43065

Closed

wants to merge 2 commits

Conversation

@tkf (Member) commented Nov 13, 2021

This PR tries to add a (hopefully) well-defined low-level type-punning facility usable for arbitrary pointer-free immutable objects.

The main use case is converting an arbitrary pointer-free immutable object to an opaque chunk of bytes (e.g., NTuple{N,UInt8}) and back. For example, this will be useful for using a rich set of immutable objects with GPU APIs like CUDA.shfl_sync etc., which are currently defined by reinterpreting some small subset of types to integers. If we can cast nested structs with Union fields, we can correctly execute various transducers with complex state transitions on GPUs. Another important use case is an emulation of tearable atomics, which is useful for efficient concurrent algorithms such as work-stealing deques and seqlocks.

However, casting one type to another (aka type punning) is known to be hard to do correctly when the compiler wants to infer something from the type system. The fact that there are one-hour technical talks on how to do it correctly (e.g., CppCon 2019: Timur Doumler, “Type punning in modern C++” on YouTube) is a good indication that we need an API in Base with a clear definition of when and how it can be used. For example, C++20 now has std::bit_cast as a similar API.

I haven't had time to dig deep enough into this to convince myself that the API I came up with is OK. But given the recent discussion (#32660, #42968, #43035) on expanding what reinterpret does, I think it's worth opening this as an alternative take on it; i.e., an unsafe (narrow-contract) API with wider use cases but a weaker guarantee (no cross-process roundtrip). So, I'd appreciate it if people who know the Julia and LLVM compilers could look at it.

One aspect of the API that I'm still worried about is what we can say about the returned object when the input type contains some padding. I wonder if we should instead create an "asymmetric" API unsafe_bitembed(T, x::S) -> y::T and unsafe_bitextract(S, y::T) -> x::S, where S can contain padding but T must not. We can then clearly document that T is a chunk of opaque bytes that is only usable in a meaningful way after unsafe_bitextract. I don't know if it helps the compiler, though.
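For concreteness, the asymmetric pair could look roughly like this (a hypothetical sketch: unsafe_bitembed/unsafe_bitextract do not exist anywhere, and the bodies below just express the intended contract in terms of the unsafe_bitcast proposed in this PR):

```julia
# Hypothetical sketch -- not an existing API.  `T` must be padding-free
# (an opaque byte chunk such as NTuple{N,UInt8}); `S` may contain padding.

function unsafe_bitembed(::Type{T}, x::S) where {T,S}
    sizeof(T) == sizeof(S) || throw(ArgumentError("size mismatch"))
    return Base.unsafe_bitcast(T, x)  # the primitive proposed in this PR
end

# Only valid on a `y` produced by `unsafe_bitembed(T, x::S)`.
function unsafe_bitextract(::Type{S}, y::T) where {S,T}
    sizeof(S) == sizeof(T) || throw(ArgumentError("size mismatch"))
    return Base.unsafe_bitcast(S, y)
end
```

The asymmetry is purely documentation-level here; the point is that only the embed-then-extract round trip would be defined behavior.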

ping @vtjnash @JeffBezanson @Keno @vchuravy @maleadt

tkf added 2 commits November 13, 2021 01:29
This is required for using these macros before `esc` is defined.  An
alternative fix would be to define `esc` earlier.  However, avoiding
`esc` here seems more compatible with the other macros defined at this
stage.
@tkf tkf added labels on Nov 13, 2021: needs compat annotation (Add !!! compat "Julia x.y" to the docstring), needs docs (Documentation for this change is required), needs news (A NEWS entry is required for this change)
@tkf tkf mentioned this pull request Nov 13, 2021
3 tasks
Comment on lines +478 to +481
datatype_pointerfree(T) ||
throw(ArgumentError("output type $T may contain a boxed object"))
datatype_pointerfree(S) ||
throw(ArgumentError("input type $S may contain a boxed object"))
Member

Suggested change
-datatype_pointerfree(T) ||
-    throw(ArgumentError("output type $T may contain a boxed object"))
-datatype_pointerfree(S) ||
-    throw(ArgumentError("input type $S may contain a boxed object"))
+isbitstype(T) || throw(ArgumentError("output type $T has undefined layout"))
+isbitstype(S) || throw(ArgumentError("input type $S has undefined layout"))

Member Author

Hmm... Can we support Union fields? That's one of the main motivations.
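For example (if I read the layout code right), a pointer-free struct with an isbits-Union field passes the datatype_pointerfree check but fails isbitstype:

```julia
struct UnionField
    x::Union{Int32,Float32}  # isbits Union, stored inline with a type tag
end

Base.datatype_pointerfree(UnionField)  # true: no boxed references in the layout
isbitstype(UnionField)                 # false: Union field disqualifies it
```

So switching the guard to isbitstype would rule out exactly the nested-Union case motivating this PR.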

Member

IIRC, you are prohibited from accessing the union-bits fields in most cases (TBAA violation)

@JeffBezanson (Member)

How do you operate on a tuple of bytes with an unknown format? I don't really understand how this would be used.

@tkf (Member Author) commented Nov 17, 2021

The point is to use opaque bytes only for "semantic-agnostic" low-level APIs like transport and storage.

I think a good example is CUDA.shfl_sync etc. With Base.unsafe_bitcast, a simple-minded generic implementation can be defined as

function CUDA.shfl_recurse(op, x)
    xbytes = Base.unsafe_bitcast(NTuple{sizeof(x),UInt8}, x)
    ybytes = map(op, xbytes)
    y = Base.unsafe_bitcast(typeof(x), ybytes)
    return y
end

Here, we serialize the input x into the opaque bytes xbytes. This is then moved across GPU threads by map(op, xbytes) ("transport"). Once we get the value ybytes for this thread, it is deserialized back to a value y of the original type.

From shfl_recurse above, CUDA.jl will derive the public shfl_* APIs. In a real implementation, I imagine you'd probably want to pad x and use UInt32 instead.

We can then use, e.g.,

val = something(shfl_down_sync(mask, Some{T}(val), delta))

to move val across threads even if T is a Union.

The use case for concurrent data structures would be similar; i.e., pointer-free values are internally serialized into and deserialized from opaque bytes which are stored into (say) Vector{UInt}. This is hidden inside the implementation. The opaque bytes representation is used only for invoking low-level "tearable" relaxed atomic accesses.
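In rough outline, the tearable-atomics pattern I have in mind looks like this (a hypothetical sketch: atomic_store_relaxed!/atomic_load_relaxed stand in for word-sized relaxed atomic accesses and do not exist under these names, unsafe_bitcast is the API proposed in this PR, and sizeof(T) is assumed to be a multiple of the word size):

```julia
# Tearable store/load of a pointer-free value through a word buffer.
# Individual word accesses may tear; a surrounding protocol (e.g. a seqlock
# sequence counter) is what lets readers detect and discard torn loads.
function tearable_store!(buf::Vector{UInt}, x::T) where {T}
    n = sizeof(T) ÷ sizeof(UInt)
    words = Base.unsafe_bitcast(NTuple{n,UInt}, x)
    for i in 1:n
        atomic_store_relaxed!(buf, i, words[i])  # hypothetical relaxed store
    end
end

function tearable_load(::Type{T}, buf::Vector{UInt}) where {T}
    n = sizeof(T) ÷ sizeof(UInt)
    words = ntuple(i -> atomic_load_relaxed(buf, i), n)  # hypothetical relaxed load
    return Base.unsafe_bitcast(T, words)
end
```

Only the buffer ever sees the opaque words; user code sees values of type T.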

@JeffBezanson (Member)

What does map(op, xbytes) do? Is this a different meaning of map? What might op be?

@tkf (Member Author) commented Nov 19, 2021

Sorry, I should've chosen a more concrete example. map(op, xbytes) is the basic map(::Function, ::NTuple). The closure op is something like x -> shfl_down_sync(mask, x, delta). So, concretely, we have

function shfl_down_sync(mask, x, delta)
    @assert sizeof(x) == 16  # the following code is the specialization for 16-byte typeof(x)
    xb1, xb2, xb3, xb4 = Base.unsafe_bitcast(NTuple{4,Int32}, x)
    yb1 = shfl_down_sync(mask, xb1, delta) # calls llvm.nvvm.shfl.sync.down.i32 intrinsic
    yb2 = shfl_down_sync(mask, xb2, delta)
    yb3 = shfl_down_sync(mask, xb3, delta)
    yb4 = shfl_down_sync(mask, xb4, delta)
    ybytes = (yb1, yb2, yb3, yb4)
    y = Base.unsafe_bitcast(typeof(x), ybytes)
    return y
end

(I changed NTuple{_,UInt8} to NTuple{4,Int32} since it looks like there's no llvm.nvvm.shfl.sync.down.i8)

(Comments by @Moelf, @tkf, @vtjnash, and @N5N3 here were marked as off-topic.)

@tkf (Member Author) commented Feb 15, 2022

I hid the above comments as off-topic. This PR is all about casting immutable pointer-free values. I think it would be more fruitful to discuss arrays and buffers separately elsewhere.

@maleadt (Member) commented Jun 8, 2022

Some comments by @vtjnash (as I understood them; it has been a while since I looked at this PR):

  • it would be better to extend Core.bitcast to allow this functionality rather than introducing a different function
  • these kinds of casts might result in padding becoming observable, which LLVM can then turn into UB; this should not matter for the CUDA.shfl use case, though, as undef values are probably allowed there
  • for the CUDA atomics use case, Julia's atomic intrinsics already correctly deal with the above problem at the codegen level

@KristofferC (Member)

these kinds of casts might result in padding becoming observable, which LLVM can then turn into UB.

Just cross-referencing this here: #41071

@tkf (Member Author) commented Jun 8, 2022

Another note to myself from the chat with Jameson and Tim: all we can do with bitcast'ed bytes (in general [1]) is store them, load them, and bitcast them back to the original type. CUDA's shfl can be modeled as a store plus a load, so it's OK to use bitcast there.

Footnotes

  [1] With some extra information on the types involved (e.g., no padding), there may be other allowed operations. But store and load are the minimal set of allowed operations.
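A sketch of that minimal contract (hypothetical code; unsafe_bitcast is the API proposed in this PR, and buf is assumed to hold at least sizeof(T) bytes):

```julia
# The only pattern allowed in general: bitcast, store, load, bitcast back.
function roundtrip_through_buffer!(buf::Vector{UInt8}, x::T) where {T}
    bytes = Base.unsafe_bitcast(NTuple{sizeof(T),UInt8}, x)  # serialize
    for i in 1:sizeof(T)
        buf[i] = bytes[i]                                    # store
    end
    loaded = ntuple(i -> buf[i], Val(sizeof(T)))             # load
    return Base.unsafe_bitcast(T, loaded)                    # back to T
end
```

Inspecting or computing on the intermediate bytes is outside the contract unless the type is known to have no padding.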

@vtjnash vtjnash closed this Jul 17, 2023
@vtjnash (Member) commented Jul 17, 2023

reinterpret now exists for all bitstypes, and conscientiously skips over undefined bits to avoid exposing UB to the user.
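For illustration, the round trip with today's reinterpret looks like this (a minimal example for a padding-free type; as far as I know this requires a recent Julia version):

```julia
x = (1.5, 2.5)                             # NTuple{2,Float64}, no padding
bytes = reinterpret(NTuple{16,UInt8}, x)   # opaque chunk of bytes
y = reinterpret(NTuple{2,Float64}, bytes)  # back to the original type
@assert y == x
```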

@maleadt (Member) commented Jun 20, 2024

It may still be valuable to have this as an unsafe operation, because reinterpret is often not GPU compatible. For example, doing a simple contiguous tuple reinterpret:

function vecdot_q4_kernel_MVP(scales)
  kmask1, kmask2, kmask3 = 0x3f3f3f3f, 0x0f0f0f0f, 0x03030303
  scales_uint32 = reinterpret(NTuple{3, UInt32}, scales)
  utmp0, utmp1, utmp2 = scales_uint32[1], scales_uint32[2], scales_uint32[3]

  return
end

function main()
    scales = UInt8.((1,2,3,4, 2,3,4,5, 3,4,5,6)) # NTuple{12, UInt8}
    @cuda threads=1 blocks=1 vecdot_q4_kernel_MVP(scales)
end
This fails during GPU compilation with:

Reason: unsupported dynamic function invocation (call to -)
Stacktrace:
 [1] packedsize
   @ ./reinterpretarray.jl:763
 [2] _reinterpret
   @ ./reinterpretarray.jl:805
 [3] reinterpret
   @ ./essentials.jl:584
 [4] vecdot_q4_kernel_MVP
   @ ~/Julia/pkg/CUDA/wip2.jl:9
Reason: unsupported dynamic function invocation (call to padding(T::DataType, baseoffset::Int64) @ Base reinterpretarray.jl:701)
Stacktrace:
 [1] padding
   @ ./reinterpretarray.jl:702
 [2] packedsize
   @ ./reinterpretarray.jl:762
 [3] _reinterpret
   @ ./reinterpretarray.jl:804
 [4] reinterpret
   @ ./essentials.jl:584
 [5] vecdot_q4_kernel_MVP
   @ ~/Julia/pkg/CUDA/wip2.jl:9
Reason: unsupported dynamic function invocation (call to -)
Stacktrace:
 [1] packedsize
   @ ./reinterpretarray.jl:763
 [2] _reinterpret
   @ ./reinterpretarray.jl:804
 [3] reinterpret
   @ ./essentials.jl:584
 [4] vecdot_q4_kernel_MVP
   @ ~/Julia/pkg/CUDA/wip2.jl:9

Worth reopening over? Or could we improve reinterpret for the contiguous case?

@vtjnash (Member) commented Jun 20, 2024

That seems more like a CUDA.jl issue (refusing to constant fold packedsize and padding)?

@maleadt (Member) commented Jun 20, 2024

refusing to constant fold packedsize and padding

Ah, if that's expected I will look into that. Thanks.
