
Functionality enhancements to address lazy loading of chunked data, variable length strings, and other minor bug fixes #68

Merged
valeriupredoi merged 130 commits into master from wacasoft on Jun 3, 2025

Conversation

@bnlawrence
Collaborator

This pull request was originally prompted by #6, insofar as we needed lazy loading of chunked data, but it also addresses our need to a) have a pure Python backend reader for h5netcdf, and b) expose variable b-trees to other downstream software.

Our motivations included thread-safety and performance at scale in a cloud environment. To do this we have implemented versions of some more components of the h5py stack, in particular a version of the h5d.DatasetID class, which now holds all the code used for data access (as opposed to attribute access, which still lives in dataobjects). There are a couple of extra methods for exposing the chunk index directly (rather than via an iterator) and for accessing chunk info using the zarr indexing scheme rather than the h5py indexing scheme.
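To illustrate the difference between the two indexing schemes mentioned above: h5py's low-level chunk info reports a chunk by the element coordinates of its first element, while zarr addresses chunks by per-dimension chunk indices joined into a key like "0.2". The function below is a hypothetical sketch of that conversion (it is not pyfive's actual API; the name and signature are invented for illustration):

```python
# Illustrative sketch (not pyfive's real API): convert an h5py-style
# chunk offset (element coordinates of the chunk's first element) into
# a zarr-style chunk key (per-dimension chunk indices joined by ".").
def offset_to_zarr_key(chunk_offset, chunk_shape):
    """Map an h5py-style element offset to a zarr-style chunk key."""
    indices = [off // size for off, size in zip(chunk_offset, chunk_shape)]
    return ".".join(str(i) for i in indices)

# A chunk starting at element (0, 20) in an array chunked as (10, 10)
# is the chunk at grid position (0, 2), i.e. zarr key "0.2".
print(offset_to_zarr_key((0, 20), (10, 10)))  # -> 0.2
```

The conversion is a simple floor division per dimension, which is why exposing the chunk index directly makes it cheap to serve either scheme.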

The code also includes an implementation of what we have called pseudochunking, which is used for accessing a contiguous array that is larger than memory via S3. In essence, all this does is declare default chunks aligned with the array order on disk and use them for data access.
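A minimal sketch of the pseudochunking idea, under stated assumptions (the function name, the 4 MiB target, and the split-only-the-leading-axis heuristic are all invented for illustration; pyfive's actual strategy may differ): because the array is stored contiguously in row-major order, chunks that cover whole trailing dimensions and split only the leading axis each map to one contiguous byte range, which is exactly what you want for ranged S3 GETs.

```python
# Sketch of pseudochunking: pick a default chunk shape for a contiguous
# (unchunked) array so that each "pseudo" chunk is one contiguous byte
# range on disk, suitable for ranged reads from S3.
def pseudo_chunk_shape(shape, itemsize, target_bytes=4 * 1024 * 1024):
    """Cover whole trailing dimensions, splitting only the leading axis.
    (Heuristic for illustration; pyfive's real strategy may differ.)"""
    row_bytes = itemsize
    for n in shape[1:]:
        row_bytes *= n
    # How many leading-axis "rows" fit in the target chunk size (min 1).
    rows = max(1, target_bytes // row_bytes)
    return (min(rows, shape[0]),) + tuple(shape[1:])

# A 100000 x 1000 float64 array: each row is 8000 bytes, so a 4 MiB
# pseudo-chunk covers 524 leading-axis rows.
print(pseudo_chunk_shape((100000, 1000), 8))  # -> (524, 1000)
```

Because each pseudo-chunk is contiguous on disk, the normal chunked data-access path can then be reused unchanged for contiguous storage.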

There are many small bug fixes and optimisations to support cloud usage. The most important is that once a variable is instantiated (i.e. for an open pyfive.File instance f, when you do v = f['variable_name']), the attributes and b-tree are read, and it is then possible to close the parent file (f) but continue to use the variable (v). We have test coverage showing that this usage of v is thread-safe (the test which demonstrates this is slow, but it needs to be, as shorter tests were sporadically passing). The test harness now includes all the components necessary for testing pyfive accessing data via both POSIX and S3.
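The usage pattern described above can be sketched as follows, using an in-memory stand-in rather than a real pyfive.File (the class and method names here are hypothetical, purely to show the shape of the pattern): because the variable holds its own immutable copy of the metadata and index it needs, it stays usable after the parent file is closed, including concurrently from many threads.

```python
# Sketch (names hypothetical, not pyfive's API): a variable object that
# carries everything it needs, so the parent file can be closed and the
# variable can still be read concurrently from multiple threads.
from concurrent.futures import ThreadPoolExecutor

class Variable:
    """Stand-in for a pyfive variable: holds immutable data/metadata."""
    def __init__(self, data):
        self._data = bytes(data)  # immutable, safe to share across threads

    def read(self, start, stop):
        return self._data[start:stop]

var = Variable(bytes(range(256)))
# The parent "file" could be closed at this point; var is self-contained.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda i: var.read(i, i + 4), range(0, 256, 4)))

# Every thread saw a consistent view; the pieces reassemble exactly.
assert b"".join(results) == bytes(range(256))
print(len(results))  # -> 64
```

The key design point is that nothing mutable is shared between readers, so no locking is needed on the read path.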

As well as closing #6, this pull request would close: #41, #59, #60, and #64.

Bryan Lawrence and others added 30 commits February 22, 2024 12:20
…changes to actually using the filter pipeline. At this point is failing test_reference.
…lso remove list definition which breaks references.
… a pseudo chunked read. Lots of things to do around optimising that read, but let's test this more widely first.
Adding support for reading only chunks and various pieces of the H5Py lower level interface
@bnlawrence
Collaborator Author

I see we're failing the checks due to some build dependencies. Will sort that shortly.

@valeriupredoi
Collaborator

valeriupredoi commented Jan 30, 2025

> I see we're failing the checks due to some build dependencies. Will sort that shortly.

I've switched to pip install .[testing] in the GA workflow so we get them deps for testing (only) 👍

.idea
.DS_Store
test-reports/
<_io.Bytes*>

If filenames with < in them are generated, I'd like to see them.

Collaborator

@valeriupredoi left a comment

The reason will be displayed to describe this comment to others. Learn more.

@bnlawrence you've got three instances of match case - that is Python >= 3.10 syntax, see https://github.com/jjhelmus/pyfive/actions/runs/13996247036/job/39192033797

I would argue you just drop support for Python 3.9, since it will be fully retired in August anyway 🍺

@bnlawrence
Collaborator Author

Clearly I'd vote for dropping 3.9. Given that we don't think pyfive is in heavy use anywhere (yet), most folks using it are likely to be agile enough to move if they need to, particularly given they'd need to do so by August. If a tiny number of folks have to pin to an older version, then presumably they don't need all the new functionality anyway.

@bnlawrence
Collaborator Author

Hi Folks. We're rather desperate to announce this to the community, and I want to do that via a poster, which I will have to send for printing on Wednesday. Ideally we have links to this repository and the main ... what do we need to do to get this accepted? The code has been reviewed by three of us here at NCAS (and we will continue to maintain it). We'd prefer not to make changes to support Python 3.9 as that's only got months to live, and we think that few people will be using this in the wild.

@valeriupredoi
Collaborator

hi folks, just a heads up that @jjhelmus and @bnlawrence had a very constructive conversation, and the decision was taken to allow the move of this repo over to our NCAS-CMS organization https://github.com/NCAS-CMS where we will be taking very good care of Pyfive. I am inclined to have this merged before we do the move, and also, would like to ask the developers and contributors here if they'd be happy for me to add them to the moved repo (once we move it, and if the move missed any of them). Many thanks again @jjhelmus 🍺

@valeriupredoi
Collaborator

thanks for looking and confirming @kmuehlbauer and @jjhelmus - I am starting the operations now: first step is to retire support for the defunct Python 3.9, so we can merge this PR with no test failures (and obviously, repair anything that may be broken by a new generation of the environment here) 🍻

Collaborator

@valeriupredoi left a comment


tests pass, thorough review performed by @davidhassell (code review and operational) and myself (more operational). Very many thanks to all involved 🍻

@valeriupredoi valeriupredoi merged commit 917025b into NCAS-CMS:master Jun 3, 2025
4 checks passed


Development

Successfully merging this pull request may close these issues.

unable to read attributes when the size of attribute lists is relatively big

4 participants