Add built-in support to write compound data types #1013
Conversation
At minimum I think this should be `write_compound_dataset`, since one could also write a compound type to an attribute. My preference would be to extend `HDF5.datatype` to work on arbitrary structs and named tuples so that we can just use the normal `write_dataset` interface. After enabling that, we should reevaluate whether a special convenience method is still warranted.
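For context, a rough sketch of what that user-facing path could look like once `datatype` handles plain structs (the struct `Foo` and the file name are made up for illustration; this is not the PR's final API):

```julia
using HDF5

struct Foo
    x::Int64
    y::Float64
end

# With a compound `datatype` method in place, the normal write path should just work:
h5open(tempname() * ".h5", "w") do h5f
    write_dataset(h5f, "foos", [Foo(1, 1.0), Foo(2, 2.0)])
end
```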
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
force-pushed from f65a0dd to fe273a9
Let me comment at the end since I changed a few things.

@mkitti I am still not sure about the type freedom, so let me know what you think!
@tamasgal, in case you don't follow what I'm talking about in the comments, I mean the following:

```julia
julia> import HDF5: API, Datatype

julia> HDF5.datatype(x::T) where T = HDF5.datatype(x, Val(isstructtype(T)))

julia> function HDF5.datatype(x::T, isstruct::Val{true}) where T
           dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
           for (idx, fn) in enumerate(fieldnames(T))
               API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
           end
           Datatype(dtype)
       end

julia> datatype(MyStruct(3, 5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "x" : 0;
      H5T_STD_I64LE "y" : 8;
   }

julia> datatype((x=5, y=3.f0))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "x" : 0;
      H5T_IEEE_F32LE "y" : 8;
   }
```
Ah yes, I was struggling with that. I did not come up with that approach myself.
People may still want to override this method for their specific types. Your implementation relies on Julia's memory-aligned layout, which may take up more space than needed:

```julia
julia> struct MyStruct32
           x::Int32
           y::Int32
       end

julia> datatype(MyStruct32(3, 5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "x" : 0;
      H5T_STD_I32LE "y" : 4;
   }

julia> struct MyStructMixed
           x::Int32
           y::Int
       end

julia> datatype(MyStructMixed(3, 5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "x" : 0;
      H5T_STD_I64LE "y" : 8;
   }

julia> sizeof(MyStruct32)
8

julia> sizeof(MyStructMixed)
16
```

As for the `NamedTuple` conversion:

```julia
julia> Base.@kwdef struct MyStructKW
           x::Int32
           y::Int
       end
MyStructKW

julia> MyStructKW(; (y = 1, x = 2)...)
MyStructKW(2, 1)

julia> Base.convert(::Type{MyStructKW}, nt::NamedTuple) = MyStructKW(; nt...)

julia> convert(MyStructKW, (y = 1, x = 2))
MyStructKW(2, 1)
```
You are right about the memory padding; I guess I will just use the actual sizes as offsets then. I am not sure I understand the part about the `NamedTuple` conversion, though.
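For illustration, a packed variant of the datatype construction might look like the sketch below: it uses cumulative field sizes as offsets instead of `fieldoffset`, so the file datatype carries no padding. The function name is made up, and writing with such a packed file type would still require a matching memory representation, which is part of why relying on Julia's native layout is simpler:

```julia
import HDF5: API, Datatype, datatype

# Sketch: build a packed compound datatype (no alignment padding in the file layout).
function packed_datatype(::Type{T}) where {T}
    packed_size = sum(sizeof(fieldtype(T, i)) for i in 1:fieldcount(T))
    dtype = API.h5t_create(API.H5T_COMPOUND, packed_size)
    offset = 0
    for (idx, fn) in enumerate(fieldnames(T))
        ft = fieldtype(T, idx)
        API.h5t_insert(dtype, string(fn), offset, datatype(ft))
        offset += sizeof(ft)
    end
    Datatype(dtype)
end
```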
I'm talking about your comment:

> If they wrote a …
We would have to test the conversion closely, I think. This may make memory mapping difficult, for example.
Ah yes, understood. Btw, regarding the …
I see. I don't know much about the memory mapping features of …
I am still struggling to find out where I hit the error.
What is the stack trace?
So with this it works (but of course not for writing single structs):

```julia
datatype(x::AbstractArray{T}) where {T} = HDF5.datatype(x, Val(isstructtype(T)))

function datatype(x::AbstractArray{T}, isstruct::Val{true}) where {T}
    dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
    for (idx, fn) in enumerate(fieldnames(T))
        API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
    end
    Datatype(dtype)
end
```

and if I change the signatures to:

```julia
datatype(x::T) where {T} = HDF5.datatype(x, Val(isstructtype(T)))

function datatype(x::T, isstruct::Val{true}) where {T}
```

and call with the same tests, like

```julia
bars = [
    [Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
    [Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
]
namedtuples = [(a=1, b=2.3), (a=4, b=5.6)]

fn = tempname()
h5open(fn, "w") do h5f
    write_dataset(h5f, "the/bars", bars)
    write_dataset(h5f, "the/namedtuples", namedtuples)
end
```

I get …
I also checked with

```julia
bars = @SMatrix [
    [Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
    [Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
]
namedtuples = @SVector [(a=1, b=2.3), (a=4, b=5.6)]
```
I played around a bit but the problem is bigger than I thought. It basically boils down to the low-level I/O, where pointers are written instead of the values of the primitive types. I am not sure what the best approach is, but one quick solution would be to throw an error if the type is not `isbitstype`. What do you think?
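To make the pointer problem concrete: whether array elements are stored inline (so the buffer can be handed to the HDF5 C library directly) is exactly what `isbitstype` reports; the `MBar` type below is only an illustration:

```julia
julia> isbitstype(typeof((a = 1, b = 2.3)))   # NamedTuple of primitives: stored inline
true

julia> mutable struct MBar
           a::Int64
       end

julia> isbitstype(MBar)   # a Vector{MBar} stores pointers to heap objects instead
false
```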
We may need to dispatch on `isbitstype`.
Yeah, I think it makes sense to error somewhere if the struct type isn't `isbitstype`.
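A minimal sketch of such a check (the helper name and error type are hypothetical, not the PR's final code):

```julia
# Guard against element types that cannot be written as a contiguous compound buffer.
function check_compound_eltype(::Type{T}) where {T}
    isstructtype(T) || throw(ArgumentError("$T is not a struct type"))
    isbitstype(T) ||
        throw(ArgumentError("$T is not an `isbitstype`; mutable or pointer-carrying structs cannot be written directly"))
    return nothing
end
```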
I have a solution for mutable types. Basically, we need to convert the mutable type to an immutable, `isbits` representation first.
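One way to express that idea is the sketch below: build an isbits `NamedTuple` from each mutable instance before writing. The helper name is made up and this is only an illustration of the approach, not necessarily the exact implementation:

```julia
mutable struct MutableBar
    a::Int64
    b::Float64
end

# Copy the fields of a mutable struct into an isbits NamedTuple with the same names.
as_namedtuple(x::T) where {T} =
    NamedTuple{fieldnames(T)}(ntuple(i -> getfield(x, i), fieldcount(T)))

bars = [MutableBar(1, 1.1), MutableBar(2, 2.2)]
nts  = as_namedtuple.(bars)   # Vector of isbits NamedTuples, safe for the compound write path
```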
Thanks for jumping in, @mkitti! How should we proceed with the finalisation problem?
For now, I've changed the second argument to … Perhaps we should probably …
One thing we recently added in MPI.jl (which has a similar construction for custom datatypes) is an internal cache, which may be helpful when reading/writing multiple times.
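For illustration, such a cache could be as simple as the sketch below (all names made up; MPI.jl's actual mechanism differs in detail):

```julia
import HDF5: Datatype, datatype

# Hypothetical cache: construct the compound Datatype only once per Julia type.
const COMPOUND_DATATYPE_CACHE = Dict{DataType,Datatype}()

cached_datatype(::Type{T}) where {T} =
    get!(() -> datatype(T), COMPOUND_DATATYPE_CACHE, T)
```

Cached datatypes would then have to be excluded from the eager-close behaviour discussed later in the thread, which is probably part of why this fits better in a separate PR.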
Should this code be here, or should it be in JLD.jl or a third package?
I am a bit confused about which actions need to be taken in order to merge this. Can I do anything else to make progress?
I'm leaning towards the merge side, but I think @musm needs to review. I think we should consider @simonbyrne's suggestion of a cache, but that could be another pull request.
src/typeconversions.jl (outdated)

```julia
# These will use finalizers. Close them eagerly to avoid issues.
datatype(::T) where {T} = Datatype(hdf5_type_id(T), true)
datatype(::Type{T}) where {T} = Datatype(hdf5_type_id(T), true)
```
Isn't this definition redundant with the method right above it?
No. The first works on instances of a type. The second works on the type itself.
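That is, the two methods cover the two call forms (using `MyStruct` from earlier in the thread):

```julia
datatype(MyStruct(3, 5))   # called with an instance -> datatype(::T)
datatype(MyStruct)         # called with the type    -> datatype(::Type{T})
```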
We don't seem to need this definition (on L67)
```
@@ -46,6 +46,12 @@ The default `Dataspace` used for representing a Julia object `data`:
 - arrays: a simple `Dataspace`
 - `nothing` or an `EmptyArray`: a null dataspace
 """
```
Should we update the docs here to include the fact that the new default is a scalar `Dataspace` for struct types?
LGTM sans minor comment. Agree the cache sounds interesting, but probably for a separate PR.
I will see if I can address @musm's comments and then merge.
Sounds good!
Co-authored-by: Mustafa M <mus-m@outlook.com>
Great work! I am wondering if it is possible to do a release soon, so that this can be consumed by general users without having to use the unreleased development version.
Done.
This PR contains a utility function (EDIT: after some discussion it is now realised by adding a method to `datatype()`) which I have been carrying around in my projects, and I discovered that there is already a related open issue, see #819.

It is by no means fully fledged, but it works very well for `struct`s with primitive field types. It is not reversible, meaning that `read` will yield the usual `NamedTuple`s. One way to make it reversible would be to attach metadata to the dataset (e.g. the type name) and, when reading back, check whether the type is already defined in the namespace and, if not, create it dynamically. This is similar to what I do in https://github.com/JuliaHEP/UnROOT.jl, where generating types at runtime is mandatory.

I provided an example in the function docstring which demonstrates that it is fairly intuitive to use and covers the basic needs. Here it is:

The data can be reinterpreted to the original data type using `reinterpret()`:
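For readers of the thread, here is a hedged end-to-end illustration of the round trip described above; the struct, file name, and dataset name are made up, and the exact return container may differ from what the final implementation produces:

```julia
using HDF5

struct Bar
    a::Int64
    b::Float64
end

fn = tempname() * ".h5"

h5open(fn, "w") do h5f
    write_dataset(h5f, "bars", [Bar(1, 1.1), Bar(2, 2.2)])
end

nts = h5open(fn, "r") do h5f
    read(h5f["bars"])          # compound data reads back as NamedTuples
end

bars = reinterpret(Bar, nts)   # view the NamedTuples as the original struct type
```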