Functionality enhancements to address lazy loading of chunked data, variable length strings, and other minor bug fixes#68
Conversation
…ven't got any tests around this yet.
…changes to actually using the filter pipeline. At this point is failling test_reference.
…lso remove list definition which breaks references.
… a pseudo chunked read. Lots of things to do around optimising that read, but let's test this more widely first.
Adding support for reading only chunks and various pieces of the H5Py lower level interface
|
(I see we're failing the checks due to some build dependencies. Will sort that shortly. |
I've switched to |
| .idea | ||
| .DS_Store | ||
| test-reports/ | ||
| <_io.Bytes*> |
There was a problem hiding this comment.
If filenames with < in them are generated, I'd like to see them.
valeriupredoi
left a comment
There was a problem hiding this comment.
@bnlawrence you got three instances of match case: - that is Python>=3.10 syntax, see https://github.com/jjhelmus/pyfive/actions/runs/13996247036/job/39192033797
I would argue you just drop support for Python 3.9 since 3.9 it will be fully retired in August anyway 🍺
|
Clearly I'd vote for dropping 3.9. Given that we don't think pyfive is in heavy use anywhere (yet), most folks using it are likely to be agile enough to move if they need to, particularly given they'd need to do so by August. If a tiny number of folks have to pin to an older version, then presumably they don't need all the new functionality anyway. |
|
Hi Folks. We're rather desperate to announce this to the community, and I want to do that via a poster, which I will have to send for printing on Wednesday. Ideally we have links to this repository and the main ... what do we need to do to get this accepted? The code has been reviewed by three of us here at NCAS (and we will continue to maintain it). We'd prefer not to make changes to support Python 3.9 as that's only got months to live, and we think that few people will be using this in the wild. |
|
hi folks, just a heads up that @jjhelmus and @bnlawrence had a very constructive conversation, and the decision was taken to allow the move of this repo over to our NCAS-CMS organization https://github.com/NCAS-CMS where we will be taking very good care of Pyfive. I am inclined to have this merged before we do the move, and also, would like to ask the developers and contributors here if they'd be happy for me to add them to the moved repo (once we move it, and if the move missed any of them). Many thanks again @jjhelmus 🍺 |
|
thanks for looking and confirming @kmuehlbauer and @jjhelmus - I am starting the operations now: first step is to retire support for the defunct Python 3.9, so we can merge this PR with no test failures (and obviously, repair anything may be caused by a new generation of the environment here) 🍻 |
valeriupredoi
left a comment
There was a problem hiding this comment.
tests pass, thorough review performed by @davidhassell (code review and operationsl) and myself (more operational). Very many thanks to all involved 🍻
This pull request was originally prompted by #6, insofar as we needed lazy loading of chunked data, but also our need to a) have a pure python backend reader for
h5netcdf, and b) which could expose variable b-trees to other downstream software.Our motivations included thread-safety and performance at scale in a cloud environment. To do this we have implemented versions of some more components of the
h5pystack, and in particular, a version of theh5d.DatasetIDclass, which is now holds all the code which is used for data access (as opposed to attribute access, which still lives indataobjects). There are a couple of extra methods for exposing the chunk index directly rather than via an iterator and to access chunk info using the zarr indexing scheme rather than theh5pyindexing scheme.The code also includes an implementation of what we have called
pseudochunkingwhich is used for accessing a contiguous array which is larger than memory via S3. In essence all this does is declare default chunks aligned with the array order on disk and use them for data access.There are many small bug fixes and optimisations to support cloud usage, the most important of which is that once a variable is instantiated (i.e. for an open
pyfive.Fileinstancef, when you dov=f['variable_name']) the attributes and b-tree are read, and it is then possible to close the parent file (f), but continue to use (v) - and we have test coverage that shows that this usage ofvis thread-safe (there is a test which demonstrates this, it's slow, but it needs to be as shorter tests were sporadically passing). (The test harness now includes all the components necessary for testing pyfive accessing data via both Posix and S3).As well as closing #6, this pull request would close: #41,#59,#60,#64.