Add built-in support to write compound data types #1013

Merged
merged 25 commits on Dec 9, 2022

Conversation

tamasgal
Contributor

@tamasgal tamasgal commented Sep 27, 2022

This PR contains a utility function (EDIT: after some discussion it is now realised by adding a method to datatype()) which I carry around in my projects. I then discovered that there is already a related open issue, see #819

It is by no means fully fledged, but it works very well for structs with primitive field types. It's not reversible, meaning that read will yield the usual NamedTuple. One way to make it reversible would be to attach metadata to the dataset (e.g. the type name) and, when reading back, check whether the type is already defined in the namespace and, if not, create it dynamically. This is similar to what I do in https://github.com/JuliaHEP/UnROOT.jl, where generating types at runtime is mandatory.

I provided an example in the function docstring which demonstrates that it's fairly intuitive to use and covers the basic needs. Here it is:

julia> struct Foo
           x::Int32
           y::Float64
       end

julia> foos = [[Foo(1, 2) Foo(3, 4) Foo(5, 6)]; [Foo(7, 8) Foo(9, 10) Foo(11, 12)]]
2×3 Matrix{Foo}:
 Foo(1, 2.0)  Foo(3, 4.0)   Foo(5, 6.0)
 Foo(7, 8.0)  Foo(9, 10.0)  Foo(11, 12.0)

julia> h5open("foo.h5", "w") do h5f
           write_dataset(h5f, "the/foo", foos)
       end

julia> thefoo = h5open("foo.h5", "r") do file
           read(file, "the/foo")
       end
2×3 Matrix{NamedTuple{(:x, :y), Tuple{Int32, Float64}}}:
 (x = 1, y = 2.0)  (x = 3, y = 4.0)   (x = 5, y = 6.0)
 (x = 7, y = 8.0)  (x = 9, y = 10.0)  (x = 11, y = 12.0)

The data can be reinterpreted to the original data type using reinterpret():

julia> reinterpret(Foo, thefoo)
2×3 reinterpret(Foo, ::Matrix{NamedTuple{(:x, :y), Tuple{Int32, Float64}}}):
 Foo(1, 2.0)  Foo(3, 4.0)   Foo(5, 6.0)
 Foo(7, 8.0)  Foo(9, 10.0)  Foo(11, 12.0)

Member

@mkitti mkitti left a comment

At minimum I think this should be write_compound_dataset since one could also write a compound type to an attribute.

My preference would be to extend HDF5.datatype to work on arbitrary structs and named tuples so that we can just use the normal write_dataset interface. After enabling that, we should reevaluate whether a special convenience method is still warranted.

@tamasgal tamasgal changed the title Utility function to write compound data types Add built-in support to write compound data types Oct 10, 2022
@tamasgal
Contributor Author

Let me comment at the end since I changed a few things.

  • write_compound_dataset() is now removed
  • datatype() now has a new method, datatype(A::AbstractArray{T}) where {T}

so that write_dataset() now automatically works with an array of (flat) structs.

@mkitti I am still not sure about the type freedom, so let me know what you think!

@mkitti
Member

mkitti commented Oct 12, 2022

@tamasgal, in case you don't follow what I'm talking about in the comments, I mean the following:

julia> import HDF5: API, Datatype

julia> HDF5.datatype(x::T) where T = HDF5.datatype(x, Val(isstructtype(T)))

julia> function HDF5.datatype(x::T, isstruct::Val{true}) where T
           dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
           for (idx, fn) in enumerate(fieldnames(T))
               API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
           end
           Datatype(dtype)
       end

julia> datatype(MyStruct(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "x" : 0;
      H5T_STD_I64LE "y" : 8;
   }

julia> datatype((x=5, y=3.f0))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "x" : 0;
      H5T_IEEE_F32LE "y" : 8;
   }

@tamasgal
Contributor Author

Ah yes, I was struggling with that. I did not come up with the Val solution, thanks, I will try ;)

@mkitti
Member

mkitti commented Oct 12, 2022

People may still want to override this method for their specific types. Your implementation relies on Julia's memory-aligned layout, which may take up more space than needed.

julia> struct MyStruct32
           x::Int32
           y::Int32
       end

julia> datatype(MyStruct32(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "x" : 0;
      H5T_STD_I32LE "y" : 4;
   }

julia> struct MyStructMixed
           x::Int32
           y::Int
       end

julia> datatype(MyStructMixed(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "x" : 0;
      H5T_STD_I64LE "y" : 8;
   }

julia> sizeof(MyStruct32)
8

julia> sizeof(MyStructMixed)
16
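The difference between packed and aligned layouts can be made explicit. Here is an illustrative sketch (not HDF5.jl code; the struct and helper names are made up) comparing offsets computed by simply summing field sizes against Julia's alignment-padded fieldoffset:

```julia
# Illustrative sketch: packed offsets (fields placed back-to-back) versus
# Julia's alignment-padded fieldoffset values.
struct MixedPair
    x::Int32
    y::Int64
end

# Packed offsets: each field starts right after the previous one.
packed_offsets(::Type{T}) where {T} =
    cumsum([0; [sizeof(fieldtype(T, i)) for i in 1:fieldcount(T)-1]])

packed  = packed_offsets(MixedPair)                 # [0, 4]
aligned = [fieldoffset(MixedPair, i) for i in 1:2]  # [0x0, 0x8]
```

Using the packed offsets for the HDF5 compound type would save the 4 padding bytes here, at the cost of no longer matching Julia's in-memory layout.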

As for the NamedTuple thing, perhaps people should write conversion methods to their types from a NamedTuple? It might be as simple as using Base.@kwdef:

julia> Base.@kwdef struct MyStructKW
           x::Int32
           y::Int
       end
MyStructKW

julia> MyStructKW(;(y = 1, x = 2)...)
MyStructKW(2, 1)

julia> Base.convert(::Type{MyStructKW}, nt::NamedTuple) = MyStructKW(; nt...)

julia> convert(MyStructKW, (y = 1, x = 2))
MyStructKW(2, 1)

@tamasgal
Contributor Author

You are right about the memory padding, I guess I will just use the actual field sizes as offsets then.

I'm not sure I understand the point about NamedTuple; at least in the tests they work fine. Or did I miss anything? ;)

@mkitti
Member

mkitti commented Oct 12, 2022

I'm talking about your comment:

It's not reversible, meaning that read will yield the usual NamedTuple

If they wrote a convert method they could just use a type assertion to get what they want.
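A minimal sketch of this suggestion (the struct and the NamedTuple are illustrative stand-ins for data read back from an HDF5 compound dataset, not HDF5.jl API):

```julia
# With a convert method from NamedTuple defined, a type assertion on the
# read result recovers the original struct type.
struct Point
    x::Int32
    y::Float64
end

Base.convert(::Type{Point}, nt::NamedTuple) = Point(nt.x, nt.y)

nt = (x = Int32(1), y = 2.0)     # what read() would yield
p  = convert(Point, nt)::Point   # back to the user type
```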

@mkitti
Member

mkitti commented Oct 12, 2022

You are right about the memory padding, I guess I will just use the actual sizes as offset then.

We would have to test the conversion closely I think. This may make memory mapping difficult for example.

@tamasgal
Contributor Author

Ah yes, understood :). I guess that should go into the docs then.

Btw., regarding the "Type Array does not have a definite size." error: I think it would make sense to simply use the length of the passed AbstractArray at the time of writing the dataset. Using StaticArrays is of course a viable option, but it might be user-friendly to also support the "obvious" use case. Or do you think we should require fixed-size sets?

@tamasgal
Contributor Author

You are right about the memory padding, I guess I will just use the actual sizes as offset then.

We would have to test the conversion closely I think. This may make memory mapping difficult for example.

I see. I don't know much about the memory mapping features of HDF5.jl but I'll check how to test it.

@tamasgal
Contributor Author

I am still struggling to find out where I hit the "Type Array does not have a definite size." error, since when I define datatype() for AbstractArrays it does not complain and uses the actual size of the array (read: everything works as expected). If I only define it for the struct (the very same function body for the method, just a different dispatch type), the error bubbles up from somewhere.

@mkitti
Member

mkitti commented Oct 12, 2022

What is the stack trace?

@tamasgal
Contributor Author

So with this it works (but of course not for writing single structs):

datatype(x::AbstractArray{T}) where {T} = HDF5.datatype(x, Val(isstructtype(T)))
function datatype(x::AbstractArray{T}, isstruct::Val{true}) where {T}
    dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
    for (idx, fn) in enumerate(fieldnames(T))
        API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
    end
    Datatype(dtype)
end

and if I change the signatures to:

datatype(x::T) where {T} = HDF5.datatype(x, Val(isstructtype(T)))
function datatype(x::T, isstruct::Val{true}) where {T}

and call with the same tests, like

    bars = [
        [Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
        [Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
    ]

    namedtuples = [(a=1, b=2.3), (a=4, b=5.6)]

    fn = tempname()
    h5open(fn, "w") do h5f
       write_dataset(h5f, "the/bars", bars)
       write_dataset(h5f, "the/namedtuples", namedtuples)
    end

I get "Type Array does not have a definite size." and I don't even find that line in the source code; it's most likely in HDF5_jll. I checked the dataspace() methods but I have not stepped through with the debugger.

  Got exception outside of a @test
  Type Array does not have a definite size.
  Stacktrace:
    [1] sizeof(x::Type)
      @ Base ./essentials.jl:473
    [2] datatype(x::Matrix{Bar}, isstruct::Val{true})
      @ HDF5 ~/Dev/HDF5.jl/src/typeconversions.jl:66
    [3] datatype(x::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/typeconversions.jl:64
    [4] create_dataset(parent::HDF5.File, name::String, data::Matrix{Bar}; pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:268
    [5] create_dataset(parent::HDF5.File, name::String, data::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:265
    [6] write_dataset(parent::HDF5.File, name::String, data::Matrix{Bar}; pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:282
    [7] write_dataset(parent::HDF5.File, name::String, data::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:279
    [8] (::var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}})(h5f::HDF5.File)
      @ Main ~/Dev/HDF5.jl/test/compound.jl:167
    [9] (::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, HDF5.File})()
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:96
   [10] task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, HDF5.File}, key::Symbol, val::HDF5.HDF5Context)
      @ Base ./task.jl:292
   [11] h5open(::var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, ::String, ::Vararg{String}; context::HDF5.HDF5Context, pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:91
   [12] h5open(::Function, ::String, ::String)
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:89
   [13] macro expansion
      @ ~/Dev/HDF5.jl/test/compound.jl:165 [inlined]
   [14] macro expansion
      @ ~/.julia/juliaup/julia-1.8.1+0.aarch64/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
   [15] top-level scope
      @ ~/Dev/HDF5.jl/test/compound.jl:157
   [16] include(fname::String)
      @ Base.MainInclude ./client.jl:476
   [17] macro expansion
      @ ~/Dev/HDF5.jl/test/runtests.jl:24 [inlined]
   [18] macro expansion
      @ ~/.julia/juliaup/julia-1.8.1+0.aarch64/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
   [19] top-level scope
      @ ~/Dev/HDF5.jl/test/runtests.jl:19
   [20] include(fname::String)
      @ Base.MainInclude ./client.jl:476
   [21] top-level scope
      @ none:6
   [22] eval
      @ ./boot.jl:368 [inlined]
   [23] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:276
   [24] _start()
      @ Base ./client.jl:52

@tamasgal
Contributor Author

I also checked with StaticArrays but then it hangs and uses 100% CPU (I waited for 5 minutes, it's stuck somewhere):

    bars = @SMatrix [
        [Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
        [Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
    ]

    namedtuples = @SVector [(a=1, b=2.3), (a=4, b=5.6)]

@tamasgal
Contributor Author

Writing a dataset with an array of mutable structs will fail silently, and just write garbage data.

I played around a bit, but the problem is bigger than I thought. It basically boils down to the low-level I/O, where pointers are written instead of the values of the primitive types.

I am not sure what the best approach is, but one quick solution would be to throw an error if !isimmutable(T) (we'd need something similar for Julia < 1.7). I don't think that mutable structs are a common thing in data dumped to HDF5, but I might be wrong. Note also that nested structures will not work either, so we might get away with throwing an error until someone really needs mutable struct support (or nested structures).

What do you think?
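A minimal sketch of such a guard (assumed names, not the PR's final code): only plain bits types have a stable in-memory layout that can be copied field-by-field into an HDF5 compound buffer, and isbitstype(T) is false for mutable structs and pointer-carrying fields.

```julia
struct Flat
    a::Int64
    b::Float64
end

mutable struct Boxed
    a::Int64
end

# Reject types whose layout cannot be copied directly into a compound buffer.
function check_compound(::Type{T}) where {T}
    isbitstype(T) ||
        throw(ArgumentError("$T is not a bits type and has no stable compound layout"))
    return true
end

check_compound(Flat)      # passes
# check_compound(Boxed)   # would throw ArgumentError
```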

@mkitti
Member

mkitti commented Oct 24, 2022

We may need to dispatch on isbitstype.

@nhz2
Member

nhz2 commented Oct 24, 2022

Yeah, I think it makes sense to error somewhere if the struct type isn't isbitstype. Also, if you update to Compat 3.40 you can use ismutabletype, but I'm not sure whether that would work, because an immutable struct may have mutable struct fields.
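The caveat can be illustrated with made-up types: an immutable struct that holds a mutable field is itself immutable, so a mutability check alone would miss it, while isbitstype catches it.

```julia
mutable struct Counter
    n::Int
end

struct Holder        # immutable wrapper around a mutable field
    c::Counter
end

# Holder instances are immutable, yet Holder is not a bits type because
# its field is a heap-allocated mutable reference.
immutable_but_not_bits = !ismutable(Holder(Counter(0))) && !isbitstype(Holder)
```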

@mkitti
Member

mkitti commented Oct 25, 2022

I have a solution for mutable types. Basically, we need to convert the mutable type to a NamedTuple. Fortunately, we already have HDF5.get_mem_compatible_jl_type, which describes that NamedTuple. I will push soon.
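The idea can be sketched generically (the helper name to_namedtuple is illustrative, not HDF5.jl API; the real machinery is HDF5.get_mem_compatible_jl_type): convert a mutable struct into a NamedTuple with the same fields, which is a bits type and safe to write.

```julia
mutable struct MutablePoint
    x::Int32
    y::Float64
end

# Build a NamedTuple with the same field names and values as the struct.
to_namedtuple(x::T) where {T} =
    NamedTuple{fieldnames(T)}(ntuple(i -> getfield(x, i), fieldcount(T)))

nt = to_namedtuple(MutablePoint(Int32(1), 2.0))
```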

@tamasgal
Contributor Author

Thanks for jumping in @mkitti ! :)

How should we proceed with the finalisation problem?

@mkitti
Member

mkitti commented Oct 25, 2022

Thanks for jumping in @mkitti ! :)

How should we proceed with the finalisation problem?

For now, I've changed the second argument to true in 8eacf5f with some reservation.

Perhaps we should close any compound datatypes that we create eagerly, and maybe even close any compound datatypes passed into these methods. The larger issue of threading and overall use of finalization should be addressed, but perhaps elsewhere.

@mkitti mkitti mentioned this pull request Nov 18, 2022
@simonbyrne
Collaborator

One thing we recently added in MPI.jl (which has a similar construction for custom datatypes) is an internal cache, which may be helpful when reading/writing multiple times:
https://github.com/JuliaParallel/MPI.jl/blob/eaa2599f4f79611852df21d7c86cd81da8b9f156/src/datatypes.jl#L132-L153
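A hedged sketch of that caching idea (not HDF5.jl code; build_compound_datatype is a placeholder for the real constructor): memoize the constructed datatype per Julia type so repeated reads/writes reuse one handle instead of rebuilding it.

```julia
# Cache of constructed datatypes, keyed by the Julia type.
const DATATYPE_CACHE = IdDict{DataType,Any}()

# Placeholder standing in for the real compound-datatype constructor.
build_compound_datatype(::Type{T}) where {T} = (:compound, T)

# get! builds and stores the datatype only on the first request.
cached_datatype(::Type{T}) where {T} =
    get!(() -> build_compound_datatype(T), DATATYPE_CACHE, T)
```

One wrinkle versus MPI.jl is lifetime: cached HDF5 datatype handles must not be closed eagerly while the cache still holds them.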

@mkitti
Member

mkitti commented Nov 28, 2022

Should this code be here or should it be in JLD.jl or a third package?

@tamasgal
Contributor Author

I am a bit confused about which actions need to be taken in order to merge this. Can I do anything else to make progress?

@mkitti
Member

mkitti commented Nov 28, 2022

I'm leaning towards the merge side, but I think @musm needs to review.

I think we should consider @simonbyrne's suggestion of a cache, but that could be another pull request.


# These will use finalizers. Close them eagerly to avoid issues.
datatype(::T) where {T} = Datatype(hdf5_type_id(T), true)
datatype(::Type{T}) where {T} = Datatype(hdf5_type_id(T), true)
Member

Isn't this definition redundant with the method right above it?

Member

No. The first works on instances of a type. The second works on the type itself.

Member

We don't seem to need this definition (on L67)

@@ -46,6 +46,12 @@ The default `Dataspace` used for representing a Julia object `data`:
- arrays: a simple `Dataspace`
- `nothing` or an `EmptyArray`: a null dataspace
"""
Member

Should we update the docs here to include the fact that the new default is a scalar Dataspace for struct types?

@musm
Member

musm commented Dec 2, 2022

LGTM sans minor comment.

Agree the cache sounds interesting, but probably for a separate PR.

@mkitti
Member

mkitti commented Dec 6, 2022

I will see if I can address @musm's comments and then merge.

@musm
Member

musm commented Dec 6, 2022

Sounds good!

Co-authored-by: Mustafa M <mus-m@outlook.com>
@musm musm merged commit 205518d into JuliaIO:master Dec 9, 2022
@tamasgal tamasgal deleted the write-compound branch December 9, 2022 09:01
@alhirzel

Great work! I am wondering if it is possible to do a release soon so that this can be consumed by general users without ]deving the package?

@musm
Member

musm commented Dec 20, 2022

done
