
Functionality enhancements to address lazy loading of chunked data, variable length strings, and other minor bug fixes #68

Merged
valeriupredoi merged 130 commits into master from wacasoft on Jun 3, 2025

Conversation

@bnlawrence
Collaborator

This pull request was originally prompted by #6, insofar as we needed lazy loading of chunked data, but it also addresses our need to a) have a pure Python backend reader for h5netcdf, and b) expose variable b-trees to other downstream software.

Our motivations included thread-safety and performance at scale in a cloud environment. To do this we have implemented versions of some more components of the h5py stack, in particular a version of the h5d.DatasetID class, which now holds all the code used for data access (as opposed to attribute access, which still lives in dataobjects). There are a couple of extra methods for exposing the chunk index directly (rather than via an iterator) and for accessing chunk info using the zarr indexing scheme rather than the h5py indexing scheme.
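To illustrate the difference between the two indexing schemes mentioned above: h5py's low-level chunk info reports a chunk by the element coordinates of its first element, while zarr addresses chunks by per-dimension chunk indices joined into a key like "0.2". The function below is a hypothetical sketch of that conversion (it is not pyfive's actual API; the name and signature are invented for illustration):

```python
# Illustrative sketch (not pyfive's real API): convert an h5py-style
# chunk offset (element coordinates of the chunk's first element) into
# a zarr-style chunk key (per-dimension chunk indices joined by ".").
def offset_to_zarr_key(chunk_offset, chunk_shape):
    """Map an h5py-style element offset to a zarr-style chunk key."""
    indices = [off // size for off, size in zip(chunk_offset, chunk_shape)]
    return ".".join(str(i) for i in indices)

# A chunk starting at element (0, 20) in an array chunked as (10, 10)
# is the chunk at grid position (0, 2), i.e. zarr key "0.2".
print(offset_to_zarr_key((0, 20), (10, 10)))  # -> 0.2
```

The conversion is a simple floor division per dimension, which is why exposing the chunk index directly makes it cheap to serve either scheme.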

The code also includes an implementation of what we have called pseudochunking, which is used for accessing a contiguous array that is larger than memory via S3. In essence, all this does is declare default chunks aligned with the array order on disk and use them for data access.
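A minimal sketch of the pseudochunking idea, under stated assumptions (the function name, the 4 MiB target, and the split-only-the-leading-axis heuristic are all invented for illustration; pyfive's actual strategy may differ): because the array is stored contiguously in row-major order, chunks that cover whole trailing dimensions and split only the leading axis each map to one contiguous byte range, which is exactly what you want for ranged S3 GETs.

```python
# Sketch of pseudochunking: pick a default chunk shape for a contiguous
# (unchunked) array so that each "pseudo" chunk is one contiguous byte
# range on disk, suitable for ranged reads from S3.
def pseudo_chunk_shape(shape, itemsize, target_bytes=4 * 1024 * 1024):
    """Cover whole trailing dimensions, splitting only the leading axis.
    (Heuristic for illustration; pyfive's real strategy may differ.)"""
    row_bytes = itemsize
    for n in shape[1:]:
        row_bytes *= n
    # How many leading-axis "rows" fit in the target chunk size (min 1).
    rows = max(1, target_bytes // row_bytes)
    return (min(rows, shape[0]),) + tuple(shape[1:])

# A 100000 x 1000 float64 array: each row is 8000 bytes, so a 4 MiB
# pseudo-chunk covers 524 leading-axis rows.
print(pseudo_chunk_shape((100000, 1000), 8))  # -> (524, 1000)
```

Because each pseudo-chunk is contiguous on disk, the normal chunked data-access path can then be reused unchanged for contiguous storage.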

There are many small bug fixes and optimisations to support cloud usage. The most important is that once a variable is instantiated (i.e. for an open pyfive.File instance f, when you do v = f['variable_name']), the attributes and b-tree are read, and it is then possible to close the parent file (f) but continue to use the variable (v). We have test coverage showing that this usage of v is thread-safe (the test which demonstrates this is slow, but it needs to be, as shorter tests were sporadically passing). The test harness now includes all the components necessary for testing pyfive accessing data via both POSIX and S3.
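The usage pattern described above can be sketched as follows, using an in-memory stand-in rather than a real pyfive.File (the class and method names here are hypothetical, purely to show the shape of the pattern): because the variable holds its own immutable copy of the metadata and index it needs, it stays usable after the parent file is closed, including concurrently from many threads.

```python
# Sketch (names hypothetical, not pyfive's API): a variable object that
# carries everything it needs, so the parent file can be closed and the
# variable can still be read concurrently from multiple threads.
from concurrent.futures import ThreadPoolExecutor

class Variable:
    """Stand-in for a pyfive variable: holds immutable data/metadata."""
    def __init__(self, data):
        self._data = bytes(data)  # immutable, safe to share across threads

    def read(self, start, stop):
        return self._data[start:stop]

var = Variable(bytes(range(256)))
# The parent "file" could be closed at this point; var is self-contained.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda i: var.read(i, i + 4), range(0, 256, 4)))

# Every thread saw a consistent view; the pieces reassemble exactly.
assert b"".join(results) == bytes(range(256))
print(len(results))  # -> 64
```

The key design point is that nothing mutable is shared between readers, so no locking is needed on the read path.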

As well as closing #6, this pull request would close: #41, #59, #60, and #64.

Bryan Lawrence and others added 30 commits February 22, 2024 12:20
…changes to actually using the filter pipeline. At this point is failing test_reference.
…lso remove list definition which breaks references.
… a pseudo chunked read. Lots of things to do around optimising that read, but let's test this more widely first.
Adding support for reading only chunks and various pieces of the H5Py lower level interface
@bnlawrence
Collaborator Author

I see we're failing the checks due to some build dependencies. Will sort that shortly.

@valeriupredoi
Collaborator

valeriupredoi commented Jan 30, 2025

> I see we're failing the checks due to some build dependencies. Will sort that shortly.

I've switched to pip install .[testing] in the GA workflow so we get them deps for testing (only) 👍

.idea
.DS_Store
test-reports/
<_io.Bytes*>

If filenames with < in them are generated, I'd like to see them.

Collaborator

@valeriupredoi left a comment

The reason will be displayed to describe this comment to others. Learn more.

@bnlawrence you've got three instances of match case - that is Python >= 3.10 syntax, see https://github.com/jjhelmus/pyfive/actions/runs/13996247036/job/39192033797

I would argue you just drop support for Python 3.9, since it will be fully retired in August anyway 🍺

@bnlawrence
Collaborator Author

Clearly I'd vote for dropping 3.9. Given that we don't think pyfive is in heavy use anywhere (yet), most folks using it are likely to be agile enough to move if they need to, particularly given they'd need to do so by August. If a tiny number of folks have to pin to an older version, then presumably they don't need all the new functionality anyway.

@bnlawrence
Collaborator Author

Hi Folks. We're rather desperate to announce this to the community, and I want to do that via a poster, which I will have to send for printing on Wednesday. Ideally we have links to this repository and the main ... what do we need to do to get this accepted? The code has been reviewed by three of us here at NCAS (and we will continue to maintain it). We'd prefer not to make changes to support Python 3.9 as that's only got months to live, and we think that few people will be using this in the wild.

@valeriupredoi
Collaborator

hi folks, just a heads up that @jjhelmus and @bnlawrence had a very constructive conversation, and the decision was taken to allow the move of this repo over to our NCAS-CMS organization https://github.com/NCAS-CMS where we will be taking very good care of Pyfive. I am inclined to have this merged before we do the move, and also, would like to ask the developers and contributors here if they'd be happy for me to add them to the moved repo (once we move it, and if the move missed any of them). Many thanks again @jjhelmus 🍺

@valeriupredoi
Collaborator

thanks for looking and confirming @kmuehlbauer and @jjhelmus - I am starting the operations now: first step is to retire support for the defunct Python 3.9, so we can merge this PR with no test failures (and obviously, repair anything that may be broken by a new generation of the environment here) 🍻

Collaborator

@valeriupredoi left a comment


tests pass, thorough review performed by @davidhassell (code review and operational) and myself (more operational). Very many thanks to all involved 🍻

@valeriupredoi valeriupredoi merged commit 917025b into NCAS-CMS:master Jun 3, 2025
4 checks passed


Development

Successfully merging this pull request may close these issues.

unable to read attributes when the size of attribute lists is relatively big

4 participants