Description
Hello, I do not know if this is a bug report or a feature request.
The documentation states that strings are supported, but there is no example of how to do it. I have read this issue, but the implementation seems to fail with string fields in the struct.
I want to save structures like:
struct TestRow
    N::Int32
    V::Float64
    context::String
end
And the following code executes with no error:
using HDF5

table = [TestRow(x...) for x in [[3, 1.0, "one"], [17, 0.17625, "two"], [0, NaN, " "]]]
h5open("test.h5", "w") do h5f
    ds = create_dataset(h5f, "test", datatype(TestRow), dataspace(table))
    ds[:] = table
end
But the file does not contain the string fields properly. Here is the dump of the created file:
$ h5dump test.h5
HDF5 "test.h5" {
GROUP "/" {
DATASET "test" {
DATATYPE H5T_COMPOUND {
H5T_STD_I32LE "N";
H5T_IEEE_F64LE "V";
H5T_STRING {
STRSIZE 1;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "context";
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): {3,1,""},
(1): {17,0.17625,"8"},
(2): {0,nan,"\37777777740"}
}
}
}
}
The strings are clearly messed up, while the numeric data is intact.
Looking at the code, I noticed that the implementation of datatype appears to take the size of the fixed string without regard for the actual string in the struct. I do not know if that is the actual function being called, but examining the datatype...
> datatype(TestRow)
HDF5.Datatype: H5T_COMPOUND {
H5T_STD_I32LE "N" : 0;
H5T_IEEE_F64LE "V" : 8;
H5T_STRING {
STRSIZE 1;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "context" : 16;
}
...it seems that the string is of length 1. This is to be expected, as strings have no definite size, but there is a FixedString{N,PAD} datatype, which I have no idea how to use in my structs.
Also, I have found no way of modifying the resulting datatype to get fixed strings of any length other than 1. And even with strings of length 1, the data is still messed up.
Is there a way to define fixed-size (say, length 100) strings in my structure so that HDF5.jl can save them properly to the file?
Of course, I can write them as NTuple{100,UInt8} objects, but then I have to cast them back on each read, and tools like HDFView will not show them as strings.
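To make that workaround concrete, here is a minimal sketch of the conversion I would need in both directions. The helper names to_fixed and from_fixed are my own (hypothetical, not part of HDF5.jl), and I am assuming null padding and that the string never contains an embedded NUL byte:

```julia
# Hypothetical helpers, not part of HDF5.jl: pack a String into a
# fixed-size, null-padded byte tuple and unpack it again.

function to_fixed(s::String, ::Val{N}) where {N}
    bytes = codeunits(s)
    length(bytes) <= N || throw(ArgumentError("string needs more than $N bytes"))
    # Copy the bytes and fill the remainder with 0x00.
    return ntuple(i -> i <= length(bytes) ? bytes[i] : 0x00, Val(N))
end

function from_fixed(t::NTuple{N,UInt8}) where {N}
    bytes = collect(t)
    # Stop at the first null byte, or use all N bytes if there is none.
    stop = something(findfirst(iszero, bytes), N + 1)
    return String(bytes[1:stop-1])
end

packed = to_fixed("two", Val(100))   # NTuple{100,UInt8}
unpacked = from_fixed(packed)        # back to a String
```

This round-trips the strings, but every reader of the file has to know about the convention, which is exactly what I would like to avoid.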
EDIT: while I was writing this issue I discovered the package StaticStrings.jl. The conversion somehow works, but only for strings of length 1. The same example as before:
using HDF5
using StaticStrings

struct TestRowStatic
    N::Int32      # 4 bytes
    V::Float64    # 8 bytes
    context::StaticString{1}
end

table = [TestRowStatic(x...) for x in [[3, 1.0, "a"], [17, 0.17625, "t"], [0, NaN, " "]]]
h5open("test.h5", "w") do h5f
    ds = create_dataset(h5f, "test", datatype(TestRowStatic), dataspace(table))
    ds[:] = table
end
This produces the correct file, with "a", "t" and " " represented correctly.
But with any other size, for instance StaticString{10}, it throws an error (raised from _memtype on readwrite.jl#298):
Type size mismatch
sizeof(TestRowStatic) = 32
sizeof(HDF5.Datatype: H5T_COMPOUND {
H5T_STD_I32LE "N" : 0;
H5T_IEEE_F64LE "V" : 8;
H5T_STRING {
STRSIZE 1;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "context" : 16;
}) = 24
which may be related to this comment, but I think that the string size of 1 is what's causing this.
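The two sizes in the error message can be reproduced with plain Julia, which is why I suspect the string length. The sketch below uses NTuple{L,UInt8} as a stand-in for StaticString{L}; that the two share the same memory layout is my assumption, and RowLayout and layout are hypothetical names used only for illustration:

```julia
# Stand-in struct: NTuple{L,UInt8} is assumed to occupy the same space
# as StaticString{L}.
struct RowLayout{L}
    N::Int32
    V::Float64
    context::NTuple{L,UInt8}
end

# Field byte offsets and total size of a concrete struct type.
layout(T) = (offsets = [Int(fieldoffset(T, i)) for i in 1:fieldcount(T)],
             size = sizeof(T))

for L in (1, 10)
    info = layout(RowLayout{L})
    println("L = $L: offsets = $(info.offsets), sizeof = $(info.size)")
end
```

On a 64-bit machine I would expect offsets 0, 8 and 16 in both cases, with a total size of 24 for L = 1 (17 bytes padded to a multiple of 8) and 32 for L = 10 (26 bytes padded). These match the 24 and 32 in the error above, so it looks like the HDF5 datatype is built as if the string had length 1 while the Julia struct really holds 10 bytes.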
(Context: my reason for wanting a table like this is that I have on the order of 10k datasets of different sizes (N,4), and I want to filter them based on the size N and other properties, some of them categorical. I currently store all of those properties as attributes of each dataset, and that works with strings, but in order to filter and process the datasets I have to read the attributes of all of them, which is slow. I also have that information in a separate CSV file, but that kind of defeats the purpose of HDF5 being "self-describing".)