Skip to content

fix Enum and Empty attributes#102

Merged
valeriupredoi merged 7 commits intoNCAS-CMS:mainfrom
kmuehlbauer:enum-attributes
Oct 1, 2025
Merged

fix Enum and Empty attributes#102
valeriupredoi merged 7 commits intoNCAS-CMS:mainfrom
kmuehlbauer:enum-attributes

Conversation

@kmuehlbauer
Copy link
Collaborator

@kmuehlbauer kmuehlbauer commented Sep 28, 2025

Description

This fixes an issue with enum attributes, where instead of the value, the dtype was returned. It also fixes an issue where Empty attributes (with NULL dataspace) where not handled properly.

Closes #101
Closes #100

While doing this, I had to enhance the tests and according make/data file. Also, I had to fix some inconsistencies in that particular test file wrt numpy.

Before you get started

Checklist

  • This pull request has a descriptive title and labels
  • This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
  • Unit tests have been added (if codecov test fails)
  • Any changed dependencies have been added or removed correctly (if need be)
  • If you are working on the documentation, please ensure the current build passes
  • All tests pass

@codecov
Copy link

codecov bot commented Sep 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.73%. Comparing base (fefabba) to head (bc57cf4).
⚠️ Report is 190 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
+ Coverage   72.58%   72.73%   +0.14%     
==========================================
  Files          11       11              
  Lines        2499     2505       +6     
  Branches      379      382       +3     
==========================================
+ Hits         1814     1822       +8     
+ Misses        583      580       -3     
- Partials      102      103       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator Author

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments for clarification.

Comment on lines 83 to 89
def test_string_array_attr_datatypes():
with pyfive.File(ATTR_DATATYPES_HDF5_FILE) as hfile:
assert hfile.attrs['vlen_str_array'][0] == 'Hello'
assert hfile.attrs['vlen_str_array'][1] == 'World!'

assert hfile.attrs['vlen_str_array'].dtype == np.dtype('O')
assert hfile.attrs['vlen_str_array'].dtype.metadata == {'h5py_encoding': 'utf-8'}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads back now as strings. Because I had to fix this, I've taken the chance to move this to its own test function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think this is a different behaviour from h5py.

with h5py.File(ATTR_DATATYPES_HDF5_FILE) as hfile:
   ...:     x=hfile.attrs['vlen_str_array'][0]
   ...:     print(x)
b'Hello'

so I'm thinking this is not what we want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll double check this one.

Copy link
Collaborator Author

@kmuehlbauer kmuehlbauer Oct 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnlawrence This is obviously new behaviour of h5py introduced some time between 2017 and now. If we recreate the file now, the following line of code:

attrs['vlen_str_array'] = [b'Hello', b'World!']

translates to this HDF5 dump:

ATTRIBUTE "vlen_str_array" {
   DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
   DATA {
   (0): "Hello", "World!"
   }
}

whereas the old file translates to:

ATTRIBUTE "vlen_str_array" {
   DATATYPE  H5T_STRING {
      STRSIZE 6;
      STRPAD H5T_STR_NULLPAD;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
   DATA {
   (0): "Hello\000", "World!"
   }
}

So from my perspective this reveals a big flaw in our testing suite. Although we should be able to read ancient files, we should not expect that this gives the same output with current versions of h5py/libhdf5.

I've managed to create both of the above representations with recent h5py and added the two options for now to the tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And from looking at this, the old version doesn't seem to be variable length at all.

@kmuehlbauer kmuehlbauer changed the title fix enum attribute fix enum and Empty attributes Sep 29, 2025
Copy link
Collaborator Author

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a fix for #100 as it fits thematically in here. Please let me know, if you are OK with this.

@kmuehlbauer kmuehlbauer changed the title fix enum and Empty attributes fix Enum and Empty attributes Sep 29, 2025
@valeriupredoi
Copy link
Collaborator

valeriupredoi commented Sep 29, 2025

woohoo @kmuehlbauer you're an absolute legend, was bashing me brains about #100 on Friday, many thanks! Would you mind if I added an integration test here, with a real (but small) file that replicates that issue? 🍻

@kmuehlbauer
Copy link
Collaborator Author

Yes, please go ahead. 🚀

I'm currently working on ironing out some more issues with EnumType and other user types (eg. COMPOUND).

@valeriupredoi
Copy link
Collaborator

Yes, please go ahead. 🚀

I'm currently working on ironing out some more issues with EnumType and other user types (eg. COMPOUND).

brill! Should I just PR it to your fork? I still need to get the correct (small) file, I was banking on Iris to chop the original file - which I did, but Iris has fixed that NULL (which is interesting to see it doing that)

@kmuehlbauer
Copy link
Collaborator Author

@valeriupredoi Just push any relevant changes directly to my PR. No need to fiddle with PR's here.

@valeriupredoi
Copy link
Collaborator

OK @kmuehlbauer I've been trying to write a file with that NULL DATATSPACE for a wee while now - no success, netCDF4 is actually writing it but adds a point to the data, and if you try pass a Null/None, it complains - do you know how to create one? Got no idea how that file was created to have such a peculiar attribute/null value 🍻

Copy link
Collaborator

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixes #100 very nicely - I tested both locally (POSIX) and remotely (S3), very many thanks @kmuehlbauer 🍻

@kmuehlbauer
Copy link
Collaborator Author

OK @kmuehlbauer I've been trying to write a file with that NULL DATATSPACE for a wee while now - no success, netCDF4 is actually writing it but adds a point to the data, and if you try pass a Null/None, it complains - do you know how to create one? Got no idea how that file was created to have such a peculiar attribute/null value 🍻

I've added one attribute empty_string to attr_datatypes.hdf5 (and the respective make-file). I'm not sure, when this is triggered. as this attribute is just a global attribute, no attribute of a dataset. Not sure, if that makes a difference. But yes, you could check if this breaks without the fix.

@valeriupredoi
Copy link
Collaborator

@bnlawrence Kai was very kind with my HDF5-PTSD-ed nerves, and fixed the NULL issue here (and other bits too), tests pass, including the one I added to test with that CORDEX file, but it'd be useful if you also had a minute to have a look over it, and merge, please 🍺

Copy link
Collaborator

@bnlawrence bnlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly concerned that pyfive will now be doing different things with strings? (The bug fix is fine though!)

Comment on lines 83 to 89
def test_string_array_attr_datatypes():
with pyfive.File(ATTR_DATATYPES_HDF5_FILE) as hfile:
assert hfile.attrs['vlen_str_array'][0] == 'Hello'
assert hfile.attrs['vlen_str_array'][1] == 'World!'

assert hfile.attrs['vlen_str_array'].dtype == np.dtype('O')
assert hfile.attrs['vlen_str_array'].dtype.metadata == {'h5py_encoding': 'utf-8'}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think this is a different behaviour from h5py.

with h5py.File(ATTR_DATATYPES_HDF5_FILE) as hfile:
   ...:     x=hfile.attrs['vlen_str_array'][0]
   ...:     print(x)
b'Hello'

so I'm thinking this is not what we want?

Copy link
Collaborator Author

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnlawrence I've worked along your suggestions.

@kmuehlbauer
Copy link
Collaborator Author

Should we consider #108 (comment)?

@kmuehlbauer
Copy link
Collaborator Author

@bnlawrence According to our discussion in #108 I've reverted the changes of the ingested hdf5 file. Instead I only changed the according make file, taking now an argument destination file path.

For testing I introduced a fixture, which creates that file on demand and tests the necessary added features in new functions.

Is this a way forward? Slight changes to the make scripts and only some changes to the tests.

@bnlawrence
Copy link
Collaborator

Great. Thanks. @valeriupredoi do your thing.

@valeriupredoi
Copy link
Collaborator

Very many thanks @kmuehlbauer and @bnlawrence 🍺🍺

@valeriupredoi valeriupredoi merged commit 4512046 into NCAS-CMS:main Oct 1, 2025
6 checks passed
@kmuehlbauer kmuehlbauer deleted the enum-attributes branch October 2, 2025 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error Reading Enum Attributes Issue loading file due to buffer too small

3 participants