Add built-in support to write compound data types #1013

Merged
merged 25 commits on Dec 9, 2022

Conversation

tamasgal
Contributor

@tamasgal tamasgal commented Sep 27, 2022

This PR contains a utility function (EDIT: after some discussion it is now realised by adding a method to datatype()) which I carry around in my projects. I then discovered that there is already a related open issue, see #819

It is by no means fully fledged, but it works very well for structs with primitive field types. It's not reversible, meaning that read will yield the usual NamedTuple. One way to make it reversible would be to attach metadata to the dataset (e.g. the type name) and, when reading back, check whether the type is already defined in the namespace and, if not, create it dynamically. This is similar to what I do in https://github.com/JuliaHEP/UnROOT.jl, where generating types at runtime is mandatory.

I provided an example in the function docstring which demonstrates that it's fairly intuitive to use and covers the basic needs. Here it is:

julia> struct Foo
           x::Int32
           y::Float64
       end

julia> foos = [[Foo(1, 2) Foo(3, 4) Foo(5, 6)]; [Foo(7, 8) Foo(9, 10) Foo(11, 12)]]
2×3 Matrix{Foo}:
 Foo(1, 2.0)  Foo(3, 4.0)   Foo(5, 6.0)
 Foo(7, 8.0)  Foo(9, 10.0)  Foo(11, 12.0)

julia> h5open("foo.h5", "w") do h5f
           write_dataset(h5f, "the/foo", foos)
       end

julia> thefoo = h5open("foo.h5", "r") do file
           read(file, "the/foo")
       end
2×3 Matrix{NamedTuple{(:x, :y), Tuple{Int32, Float64}}}:
 (x = 1, y = 2.0)  (x = 3, y = 4.0)   (x = 5, y = 6.0)
 (x = 7, y = 8.0)  (x = 9, y = 10.0)  (x = 11, y = 12.0)

The data can be reinterpreted to the original data type using reinterpret():

julia> reinterpret(Foo, thefoo)
2×3 reinterpret(Foo, ::Matrix{NamedTuple{(:x, :y), Tuple{Int32, Float64}}}):
 Foo(1, 2.0)  Foo(3, 4.0)   Foo(5, 6.0)
 Foo(7, 8.0)  Foo(9, 10.0)  Foo(11, 12.0)

Member

@mkitti mkitti left a comment

At minimum I think this should be write_compound_dataset since one could also write a compound type to an attribute.

My preference would be to extend HDF5.datatype to work on arbitrary structs and named tuples so that we can just use the normal write_dataset interface. After enabling that, we should reevaluate whether a special convenience method is still warranted.

@tamasgal tamasgal changed the title Utility function to write compound data types Add built-in support to write compound data types Oct 10, 2022
@tamasgal
Contributor Author

Let me comment at the end since I changed a few things.

  • write_compound_dataset() is now removed
  • datatype() now has a new method, datatype(A::AbstractArray{T}) where {T}

so that write_dataset() now automatically works with an array of (flat) structs.

@mkitti I am still not sure about the type freedom, so let me know what you think!

@mkitti
Member

mkitti commented Oct 12, 2022

@tamasgal, in case you don't follow what I'm talking about in the comments, I mean the following:

julia> import HDF5: API, Datatype

julia> HDF5.datatype(x::T) where T = HDF5.datatype(x, Val(isstructtype(T)))

julia> function HDF5.datatype(x::T, isstruct::Val{true}) where T
           dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
           for (idx, fn) in enumerate(fieldnames(T))
               API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
           end
           Datatype(dtype)
       end

julia> datatype(MyStruct(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "x" : 0;
      H5T_STD_I64LE "y" : 8;
   }

julia> datatype((x=5, y=3.f0))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "x" : 0;
      H5T_IEEE_F32LE "y" : 8;
   }

@tamasgal
Contributor Author

Ah yes, I was struggling with that. I did not come up with the Val solution, thanks, I will try ;)

@mkitti
Member

mkitti commented Oct 12, 2022

People may still want to override this method for their specific types. Your implementation relies on Julia's memory-aligned layout, which may take up more space than needed.

julia> struct MyStruct32
           x::Int32
           y::Int32
       end

julia> datatype(MyStruct32(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "x" : 0;
      H5T_STD_I32LE "y" : 4;
   }

julia> struct MyStructMixed
           x::Int32
           y::Int
       end

julia> datatype(MyStructMixed(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "x" : 0;
      H5T_STD_I64LE "y" : 8;
   }

julia> sizeof(MyStruct32)
8

julia> sizeof(MyStructMixed)
16
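The difference between packed and aligned layouts can be made explicit. Here is an illustrative sketch (not HDF5.jl code; the struct and helper names are made up) comparing offsets computed by simply summing field sizes against Julia's alignment-padded fieldoffset:

```julia
# Illustrative sketch: packed offsets (fields placed back-to-back) versus
# Julia's alignment-padded fieldoffset values.
struct MixedPair
    x::Int32
    y::Int64
end

# Packed offsets: each field starts right after the previous one.
packed_offsets(::Type{T}) where {T} =
    cumsum([0; [sizeof(fieldtype(T, i)) for i in 1:fieldcount(T)-1]])

packed  = packed_offsets(MixedPair)                 # [0, 4]
aligned = [fieldoffset(MixedPair, i) for i in 1:2]  # [0x0, 0x8]
```

Using the packed offsets for the HDF5 compound type would save the 4 padding bytes here, at the cost of no longer matching Julia's in-memory layout.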

As for the NamedTuple thing, perhaps people should write conversion methods to their types from a NamedTuple? It might be as simple as using Base.@kwdef:

julia> Base.@kwdef struct MyStructKW
           x::Int32
           y::Int
       end
MyStructKW

julia> MyStructKW(;(y = 1, x = 2)...)
MyStructKW(2, 1)

julia> Base.convert(::Type{MyStructKW}, nt::NamedTuple) = MyStructKW(; nt...)

julia> convert(MyStructKW, (y = 1, x = 2))
MyStructKW(2, 1)

@tamasgal
Contributor Author

You are right about the memory padding, I guess I will just use the actual field sizes as offsets then.

I'm not sure I understand the point about NamedTuple; at least in the tests they work fine. Or did I miss anything? ;)

@mkitti
Member

mkitti commented Oct 12, 2022

I'm talking about your comment:

It's not reversible, meaning that read will yield the usual NamedTuple

If they wrote a convert method they could just use a type assertion to get what they want.
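A minimal sketch of this suggestion (the struct and the NamedTuple are illustrative stand-ins for data read back from an HDF5 compound dataset, not HDF5.jl API):

```julia
# With a convert method from NamedTuple defined, a type assertion on the
# read result recovers the original struct type.
struct Point
    x::Int32
    y::Float64
end

Base.convert(::Type{Point}, nt::NamedTuple) = Point(nt.x, nt.y)

nt = (x = Int32(1), y = 2.0)     # what read() would yield
p  = convert(Point, nt)::Point   # back to the user type
```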

@mkitti
Member

mkitti commented Oct 12, 2022

You are right about the memory padding, I guess I will just use the actual sizes as offset then.

We would have to test the conversion closely I think. This may make memory mapping difficult for example.

@tamasgal
Contributor Author

Ah yes, understood :). I guess that should go into the docs then.

Btw., regarding the "Type Array does not have a definite size." error: I think it would make sense to simply use the length of the passed AbstractArray at the time of writing the dataset. Using StaticArrays is of course a viable option, but it might be user-friendly to also support the "obvious" use case. Or do you think we should require fixed-size sets?

@tamasgal
Contributor Author

You are right about the memory padding, I guess I will just use the actual sizes as offset then.

We would have to test the conversion closely I think. This may make memory mapping difficult for example.

I see. I don't know much about the memory mapping features of HDF5.jl but I'll check how to test it.

@tamasgal
Contributor Author

I am still struggling to find out where I hit the "Type Array does not have a definite size." error, since when I define datatype() for AbstractArrays it does not complain and uses the actual size of the array (read: everything works as expected). If I only define it for the struct (the very same function body for the method, just a different dispatch type), the error bubbles up from somewhere.

@mkitti
Member

mkitti commented Oct 12, 2022

What is the stack trace?

@tamasgal
Contributor Author

So with this it works (but of course not for writing single structs):

datatype(x::AbstractArray{T}) where {T} = HDF5.datatype(x, Val(isstructtype(T)))
function datatype(x::AbstractArray{T}, isstruct::Val{true}) where {T}
    dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
    for (idx, fn) in enumerate(fieldnames(T))
        API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
    end
    Datatype(dtype)
end

and if I change the signatures to:

datatype(x::T) where {T} = HDF5.datatype(x, Val(isstructtype(T)))
function datatype(x::T, isstruct::Val{true}) where {T}

and call with the same tests, like

    bars = [
        [Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
        [Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
    ]

    namedtuples = [(a=1, b=2.3), (a=4, b=5.6)]

    fn = tempname()
    h5open(fn, "w") do h5f
       write_dataset(h5f, "the/bars", bars)
       write_dataset(h5f, "the/namedtuples", namedtuples)
    end

I get "Type Array does not have a definite size." and I don't even find that line in the source code; it's most likely in HDF5_jll. I checked the dataspace() methods but I have not stepped through with the debugger.

  Got exception outside of a @test
  Type Array does not have a definite size.
  Stacktrace:
    [1] sizeof(x::Type)
      @ Base ./essentials.jl:473
    [2] datatype(x::Matrix{Bar}, isstruct::Val{true})
      @ HDF5 ~/Dev/HDF5.jl/src/typeconversions.jl:66
    [3] datatype(x::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/typeconversions.jl:64
    [4] create_dataset(parent::HDF5.File, name::String, data::Matrix{Bar}; pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:268
    [5] create_dataset(parent::HDF5.File, name::String, data::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:265
    [6] write_dataset(parent::HDF5.File, name::String, data::Matrix{Bar}; pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:282
    [7] write_dataset(parent::HDF5.File, name::String, data::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:279
    [8] (::var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}})(h5f::HDF5.File)
      @ Main ~/Dev/HDF5.jl/test/compound.jl:167
    [9] (::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, HDF5.File})()
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:96
   [10] task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, HDF5.File}, key::Symbol, val::HDF5.HDF5Context)
      @ Base ./task.jl:292
   [11] h5open(::var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, ::String, ::Vararg{String}; context::HDF5.HDF5Context, pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:91
   [12] h5open(::Function, ::String, ::String)
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:89
   [13] macro expansion
      @ ~/Dev/HDF5.jl/test/compound.jl:165 [inlined]
   [14] macro expansion
      @ ~/.julia/juliaup/julia-1.8.1+0.aarch64/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
   [15] top-level scope
      @ ~/Dev/HDF5.jl/test/compound.jl:157
   [16] include(fname::String)
      @ Base.MainInclude ./client.jl:476
   [17] macro expansion
      @ ~/Dev/HDF5.jl/test/runtests.jl:24 [inlined]
   [18] macro expansion
      @ ~/.julia/juliaup/julia-1.8.1+0.aarch64/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
   [19] top-level scope
      @ ~/Dev/HDF5.jl/test/runtests.jl:19
   [20] include(fname::String)
      @ Base.MainInclude ./client.jl:476
   [21] top-level scope
      @ none:6
   [22] eval
      @ ./boot.jl:368 [inlined]
   [23] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:276
   [24] _start()
      @ Base ./client.jl:52

@tamasgal
Contributor Author

I also checked with StaticArrays but then it hangs and uses 100% CPU (I waited for 5 minutes, it's stuck somewhere):

    bars = @SMatrix [
        [Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
        [Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
    ]

    namedtuples = @SVector [(a=1, b=2.3), (a=4, b=5.6)]

@tamasgal
Contributor Author

Writing a dataset with an array of mutable structs will fail silently, and just write garbage data.

I played around a bit, but the problem is bigger than I thought. It basically boils down to the low-level I/O, where pointers are written instead of the values of the primitive types.

I am not sure what the best approach is, but one quick solution would be to throw an error if !isimmutable(T) (we'd need something similar for Julia < 1.7). I don't think that mutable structs are a common thing in data dumped to HDF5, but I might be wrong. Note also that nested structures will not work either, so we might get away with throwing an error until someone really needs mutable struct support (or nested structures).

What do you think?
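A minimal sketch of such a guard (assumed names, not the PR's final code): only plain bits types have a stable in-memory layout that can be copied field-by-field into an HDF5 compound buffer, and isbitstype(T) is false for mutable structs and pointer-carrying fields.

```julia
struct Flat
    a::Int64
    b::Float64
end

mutable struct Boxed
    a::Int64
end

# Reject types whose layout cannot be copied directly into a compound buffer.
function check_compound(::Type{T}) where {T}
    isbitstype(T) ||
        throw(ArgumentError("$T is not a bits type and has no stable compound layout"))
    return true
end

check_compound(Flat)      # passes
# check_compound(Boxed)   # would throw ArgumentError
```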

@mkitti
Member

mkitti commented Oct 24, 2022

We may need to dispatch on isbitstype.

@nhz2
Member

nhz2 commented Oct 24, 2022

Yeah, I think it makes sense to error somewhere if the struct type isn't isbitstype. Also, if you update to Compat 3.40 you can use ismutabletype, but I'm not sure whether that would work, because an immutable struct may have mutable struct fields.
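The caveat can be illustrated with made-up types: an immutable struct that holds a mutable field is itself immutable, so a mutability check alone would miss it, while isbitstype catches it.

```julia
mutable struct Counter
    n::Int
end

struct Holder        # immutable wrapper around a mutable field
    c::Counter
end

# Holder instances are immutable, yet Holder is not a bits type because
# its field is a heap-allocated mutable reference.
immutable_but_not_bits = !ismutable(Holder(Counter(0))) && !isbitstype(Holder)
```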

@mkitti
Member

mkitti commented Oct 25, 2022

I have a solution for mutable types. Basically, we need to convert the mutable type to a NamedTuple. Fortunately, we already have HDF5.get_mem_compatible_jl_type, which describes that NamedTuple. I will push soon.
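The idea can be sketched generically (the helper name to_namedtuple is illustrative, not HDF5.jl API; the real machinery is HDF5.get_mem_compatible_jl_type): convert a mutable struct into a NamedTuple with the same fields, which is a bits type and safe to write.

```julia
mutable struct MutablePoint
    x::Int32
    y::Float64
end

# Build a NamedTuple with the same field names and values as the struct.
to_namedtuple(x::T) where {T} =
    NamedTuple{fieldnames(T)}(ntuple(i -> getfield(x, i), fieldcount(T)))

nt = to_namedtuple(MutablePoint(Int32(1), 2.0))
```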

@tamasgal
Contributor Author

Thanks for jumping in @mkitti ! :)

How should we proceed with the finalisation problem?

@mkitti
Member

mkitti commented Oct 25, 2022

Thanks for jumping in @mkitti ! :)

How should we proceed with the finalisation problem?

For now, I've changed the second argument to true in 8eacf5f with some reservation.

Perhaps we should close any compound datatypes that we create eagerly, and maybe even close any compound datatypes passed into these methods. The larger issue of threading and overall use of finalization should be addressed, but perhaps elsewhere.

@mkitti mkitti mentioned this pull request Nov 18, 2022
@simonbyrne
Collaborator

One thing we recently added in MPI.jl (which has a similar construction for custom datatypes) is an internal cache, which may be helpful when reading/writing multiple times:
https://github.com/JuliaParallel/MPI.jl/blob/eaa2599f4f79611852df21d7c86cd81da8b9f156/src/datatypes.jl#L132-L153
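A hedged sketch of that caching idea (not HDF5.jl code; build_compound_datatype is a placeholder for the real constructor): memoize the constructed datatype per Julia type so repeated reads/writes reuse one handle instead of rebuilding it.

```julia
# Cache of constructed datatypes, keyed by the Julia type.
const DATATYPE_CACHE = IdDict{DataType,Any}()

# Placeholder standing in for the real compound-datatype constructor.
build_compound_datatype(::Type{T}) where {T} = (:compound, T)

# get! builds and stores the datatype only on the first request.
cached_datatype(::Type{T}) where {T} =
    get!(() -> build_compound_datatype(T), DATATYPE_CACHE, T)
```

One wrinkle versus MPI.jl is lifetime: cached HDF5 datatype handles must not be closed eagerly while the cache still holds them.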

@mkitti
Member

mkitti commented Nov 28, 2022

Should this code be here or should it be in JLD.jl or a third package?

@tamasgal
Contributor Author

I am a bit confused about which actions need to be taken in order to merge this. Can I do anything else to make progress?

@mkitti
Member

mkitti commented Nov 28, 2022

I'm leaning towards the merge side, but I think @musm needs to review.

I think we should consider @simonbyrne's suggestion of a cache, but that could be another pull request.


# These will use finalizers. Close them eagerly to avoid issues.
datatype(::T) where {T} = Datatype(hdf5_type_id(T), true)
datatype(::Type{T}) where {T} = Datatype(hdf5_type_id(T), true)
Member

Isn't this definition redundant with the method right above it?

Member

No. The first works on instances of a type. The second works on the type itself.

Member

We don't seem to need this definition (on L67)

@@ -46,6 +46,12 @@ The default `Dataspace` used for representing a Julia object `data`:
- arrays: a simple `Dataspace`
- `nothing` or an `EmptyArray`: a null dataspace
"""
Member

Should we update the docs here to include the fact that the new default is a scalar Dataspace for struct types?

@musm
Member

musm commented Dec 2, 2022

LGTM sans minor comment.

Agree the cache sounds interesting, but probably for a separate PR.

@mkitti
Member

mkitti commented Dec 6, 2022

I will see if I can address @musm's comments and then merge.

@musm
Member

musm commented Dec 6, 2022

Sounds good!

Co-authored-by: Mustafa M <mus-m@outlook.com>
@musm musm merged commit 205518d into JuliaIO:master Dec 9, 2022
@tamasgal tamasgal deleted the write-compound branch December 9, 2022 09:01
@alhirzel

Great work! I am wondering if it is possible to do a release soon so that this can be consumed by general users without ]deving the package?

@musm
Member

musm commented Dec 20, 2022

done
