Optimise when we get access to b-tree by providing lazier view of datasets, access to b-tree location, and new p5dump by bnlawrence · Pull Request #138 · NCAS-CMS/pyfive

bnlawrence · 2025-10-27T14:03:36Z

Description

This pull request addresses three specific issues:

Closes Over agressive optimisation of b-tree reading. #135: previously, the b-tree had to be loaded as soon as you inspected any dataset attributes. That is still the default behaviour, but there is now an additional option to get a lazy view of the variable, which does not load the b-tree. This lazy support is propagated into visititems as well.
Closes Exposing b-tree layout information #130: we want information about the b-tree location to be available for inspection (to help diagnose performance problems when accessing files). This is a simple fix (a new attribute on DatasetID).
Closes A pyfive "ncdump" #134: introduces p5dump, so that we can inspect a file without any C code (and potentially inspect many files in parallel).
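
To make the eager/lazy distinction in the first item concrete, here is a toy sketch of deferred index construction. It is not pyfive's implementation: the class `ToyDatasetID`, the `load_index` callable, and the `index_built` property are hypothetical names invented for illustration. Only the idea comes from the description above: attributes stay available without reading the b-tree, and the b-tree is read on the first chunk query.

```python
class ToyDatasetID:
    """Toy stand-in for a dataset identifier with a lazily built chunk index."""

    def __init__(self, attrs, load_index, lazy=False):
        self.attrs = attrs              # cheap metadata, always available
        self._load_index = load_index   # callable standing in for a b-tree read
        self._index = None
        if not lazy:
            self._build_index()         # eager default: pay the cost up front

    @property
    def index_built(self):
        return self._index is not None

    def _build_index(self):
        if self._index is None:
            self._index = self._load_index()

    def get_num_chunks(self):
        self._build_index()             # first chunk query triggers the load
        return len(self._index)


reads = []


def fake_btree_read():
    reads.append("btree")               # record that the expensive read happened
    return {(0, 0): 4016, (0, 1): 4032}


lazy_id = ToyDatasetID({"units": "K"}, fake_btree_read, lazy=True)
units = lazy_id.attrs["units"]          # no b-tree read needed for attributes
built_before = lazy_id.index_built
n_chunks = lazy_id.get_num_chunks()     # now the index is built
built_after = lazy_id.index_built
```

In this sketch the index is read exactly once, and only when a chunk-level question is asked; with `lazy=False` the read would happen in the constructor instead.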

@zequihg50 Could you please have a look at this one too? Especially the b-tree range stuff and the related tests.

Checklist

This pull request has a descriptive title and labels
This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
Unit tests have been added (if codecov test fails)
Any changed dependencies have been added or removed correctly (if need be)
If you are working on the documentation, please ensure the current build passes
All tests pass

bnlawrence · 2025-10-27T14:06:22Z

I have not added any documentation yet; I figured I'd do that once we had agreed on the p5dump API and functionality, and on the btree range.

codecov · 2025-10-27T14:07:06Z

Codecov Report

❌ Patch coverage is 80.38278% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.21%. Comparing base (a2fa21b) to head (0080adf).
⚠️ Report is 184 commits behind head on main.

Files with missing lines	Patch %	Lines
pyfive/inspect.py	80.15%	16 Missing and 9 partials ⚠️
pyfive/h5d.py	72.91%	6 Missing and 7 partials ⚠️
pyfive/p5dump.py	90.00%	1 Missing and 1 partial ⚠️
pyfive/btree.py	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
+ Coverage   74.66%   76.21%   +1.54%     
==========================================
  Files          12       14       +2     
  Lines        2712     2867     +155     
  Branches      407      450      +43     
==========================================
+ Hits         2025     2185     +160     
+ Misses        576      561      -15     
- Partials      111      121      +10
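
The figures in the report can be cross-checked with a few lines of arithmetic. This is only a sketch of that check: the numbers are copied from the report above, the helper name `pct_truncated` is made up here, and the observation that the percentages match truncation (rather than rounding) to two decimals is an inference from these numbers, not a statement about how Codecov computes them. The implied patch size is likewise inferred from the patch percentage and the 41 missing lines.

```python
import math


def pct_truncated(hits, lines):
    """Coverage percentage truncated (not rounded) to two decimals."""
    return math.floor(10000 * hits / lines) / 100


base = {"hits": 2025, "misses": 576, "partials": 111, "lines": 2712}
head = {"hits": 2185, "misses": 561, "partials": 121, "lines": 2867}

# Hits, misses and partials should add up to the line count on each side.
consistent = all(
    side["hits"] + side["misses"] + side["partials"] == side["lines"]
    for side in (base, head)
)

base_pct = pct_truncated(base["hits"], base["lines"])
head_pct = pct_truncated(head["hits"], head["lines"])

# "Missing" plus "partials" per file, as listed in the report table.
missing_by_file = {
    "pyfive/inspect.py": 16 + 9,
    "pyfive/h5d.py": 6 + 7,
    "pyfive/p5dump.py": 1 + 1,
    "pyfive/btree.py": 1,
}
patch_missing = sum(missing_by_file.values())

# Patch size implied by the reported 80.38278% with patch_missing lines uncovered.
patch_lines = round(patch_missing / (1 - 80.38278 / 100))
patch_pct = round(100 * (patch_lines - patch_missing) / patch_lines, 5)
```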

zequihg50 · 2025-10-27T15:34:35Z

This could serve to validate using external tools, although I think I'm more in favor of checking with hardcoded values here, just to avoid a false positive if the external tool is wrong.

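import h5py
import pyfive

# StoreInfo and DATASET_CHUNKED_HDF5_FILE are assumed to be defined
# elsewhere in the test module.


def test_get_chunk_info_chunked():
    # start lazy, then go real

    with pyfive.File(DATASET_CHUNKED_HDF5_FILE) as hfile, \
            h5py.File(DATASET_CHUNKED_HDF5_FILE) as h5f, \
            open(DATASET_CHUNKED_HDF5_FILE, "rb") as f:

        # A lazy view must not have built the chunk index yet.
        ds = hfile.get_lazy_view('dataset1')
        assert not ds.id._DatasetID__index_built

        # Hardcoded expected values for the first chunk.
        si = StoreInfo((0, 0), 0, 4016, 16)
        info = ds.id.get_chunk_info(0)
        assert info == si

        # Cross-check against h5py as well as the hardcoded values.
        assert ds.id.get_num_chunks() == 88
        assert h5f["dataset1"].id.get_num_chunks() == 88
        assert h5f["dataset1"].id.get_chunk_info(0) == si

        # Both ends of the reported range should land on a b-tree node signature.
        assert ds.id.btree_range == (1072, 8680)
        f.seek(1072)
        assert f.read(4) == b"TREE"  # only v1 btrees
        f.seek(8680)
        assert f.read(4) == b"TREE"  # only v1 btrees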
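
The raw-byte check at the end of the test can be tried in isolation, without pyfive or h5py. The sketch below plants the 4-byte signature `b"TREE"` at two made-up offsets in a throwaway file and reads it back with `seek` and `read`. The helper name `signature_at` and the offsets 16 and 64 are arbitrary choices for illustration, not values from any real HDF5 file.

```python
import os
import tempfile


def signature_at(path, offset, length=4):
    """Return `length` bytes read from `path` starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Build a throwaway file with b"TREE" planted at two made-up offsets.
offsets = (16, 64)
blob = bytearray(128)
for off in offsets:
    blob[off:off + 4] = b"TREE"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(blob)
    path = tmp.name

try:
    found = [signature_at(path, off) for off in offsets]
    elsewhere = signature_at(path, 0)   # no signature planted here
finally:
    os.remove(path)
```

Checking the bytes on disk this way does not depend on any HDF5 library being right, which is the point of preferring hardcoded values over an external tool.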

bnlawrence · 2025-10-28T08:48:57Z

@zequihg50 Thanks. That's an excellent addition and makes me far happier. I'll push that up in a minute!

davidhassell

Hi Bryan,

Looks good. Some suggestions made, but nothing that's any problem ...

... apart from running p5dump on the console command line. Should that be possible? I couldn't make it so (best I got was a nasty circular import).

As far as I know, the usual thing is to put the command line script into pyfive/scripts, and then the script will get copied to somewhere in $PATH at install time (e.g. https://github.com/NCAS-CMS/cfdm/blob/main/scripts/cfdump) - but I might not be up to date on this!

From my perspective, if the command-line thing is an issue (might just have been me) then that can be sorted in another PR, rather than holding this up.

Thanks,
David
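
An alternative to copying a script into `$PATH` is a `main()` function registered as a console-script entry point in the package metadata. The sketch below is an illustration only: the function names, the positional `filename` argument, the `storage` destination, and the help strings are assumptions, and this thread does not show how pyfive actually wires the command up. The `-s` flag is borrowed from the commit message in this PR that mentions chunk information "via -s".

```python
import argparse


def build_parser():
    """Build the argument parser for a hypothetical p5dump-like command."""
    parser = argparse.ArgumentParser(
        prog="p5dump",
        description="Print an ncdump-like summary of a file (illustrative sketch).",
    )
    parser.add_argument("filename", help="path of the file to inspect")
    parser.add_argument(
        "-s", dest="storage", action="store_true",
        help="also report storage and chunk information",
    )
    return parser


def main(argv=None):
    """Entry point; returns a process exit code."""
    args = build_parser().parse_args(argv)
    # A real tool could import the library here, inside main(), so that
    # merely importing this module does not pull in the whole package.
    mode = "with storage info" if args.storage else "header only"
    print(f"would dump {args.filename} ({mode})")
    return 0
```

With a layout like this, a `[project.scripts]` entry in `pyproject.toml` of the form `p5dump = "some_package.some_module:main"` (module path hypothetical) makes the installer generate the executable. Deferring the library import into `main()` is one possible way to sidestep a circular import at module load time.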

pyfive/inspect.py

pyfive/h5d.py

pyfive/inspect.py

tests/test_chunk_index_options.py

pyfive/p5dump.py

kmuehlbauer

Can't say much about the changes to btree, but the p5dump works smoothly on my system.

.vscode/settings.json

bnlawrence · 2025-11-03T12:33:34Z

@valeriupredoi I think I've made all the requested changes. Can you please also have a look at the docs and make sure you're happy with the changes? Otherwise I think this is good to go.

valeriupredoi · 2025-11-03T12:50:20Z

> @valeriupredoi I think I've made all the requested changes. Can you please also have a look at the docs and make sure you're happy with the changes? Otherwise I think this is good to go.

awesome, many thanks @bnlawrence 🍻 Shall do, in a jiffy!

bnlawrence · 2025-11-03T12:57:35Z

Actually, it's not ready, I think I need to document the lazy evaluation stuff first. Tomorrow.

valeriupredoi · 2025-11-03T13:07:11Z

> Actually, it's not ready, I think I need to document the lazy evaluation stuff first. Tomorrow.

just lemme know when you're ready, and I'll have a close look. I'd like to test the executable functionality too, and I will add a small test for it in the GHA; but whenever you're ready, no rush 🍻

tests/test_mock_s3fs.py

bnlawrence · 2025-11-03T20:20:57Z

Ok, this time I think it's ready. I just pushed up some documentation. If you find issues with it, please just fix it :-)!

valeriupredoi

spelling! Brayn 😁

doc/p5dump.rst

valeriupredoi

many thanks to @bnlawrence and everyone else who pitched in here 🍻

Bryan Lawrence added 17 commits October 21, 2025 14:33

Calculate b-tree range

9bf85ba

Merge remote-tracking branch 'origin/main' into btreeloc

83515d0

Remove redundant chunk handling from btree.py (part of #131)

70ad54e

Just making a note that we need to be careful about chunk indexing should we have a v3 layout in the future.

cd9c038

First cut at supporting ncdump like behaviour. Doesn't have support for file or group attributes yet, or phony dimensions.

1c9bef4

Working implementation of lazy access to variables (#135) and partial implementation of p5dump functionality (#134). Unit tests are failing due to a desire to get closer to (but not exactly) what ncdump will do.

9c1b688

allow visititems to be lazy

2c3a3e3

p5dump works for the test cases

7827480

Fixed string handling in groups

481ca47

Better testing

5ab84bf

Support for chunk information in p5dump via -s. (Fixed btree_range as well I think)

a917d9e

p5dump -s includes storage type (from layout_class)

173d9d1

Edge case detection and bug fix

dae9eb1

Handling broken pipes more gracefully

aa4ca30

Why don't I run my tests before committing?

a84fea6

Cleaning up the DatasetID interface error handling for chunk queries on unchunked data and some tests to keep V happy.

b02083c

Merge remote-tracking branch 'origin/main' into optimise

6c0a6c8

bnlawrence requested review from davidhassell and kmuehlbauer October 27, 2025 14:03

Merge branch 'main' into optimise

f3eb83e

Better checking of chunk info testing answers, courtesy of @zequihg50

3040c3d

davidhassell requested changes Oct 31, 2025

kmuehlbauer reviewed Oct 31, 2025

bnlawrence and others added 3 commits November 3, 2025 09:38

Update pyfive/h5d.py

d168277

Add a simpler interface to first chunk in datasetid (courtesy of @davidhassell) Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>

Requested changes from review

1c9cb5e

Merge remote-tracking branch 'refs/remotes/origin/optimise' into optimise

82d2566

Bryan Lawrence added 2 commits November 3, 2025 09:55

IDE config in by mistake - removed

d4d538e

Added documentation for p5dump

a2a785e

valeriupredoi reviewed Nov 3, 2025

Improved documentation of the extra methods

40b3fa8

valeriupredoi added 5 commits November 4, 2025 12:50

add p5dump test in GHA

f7cc695

add test module for p5dump

d1432af

add one more test

bdc4752

change test name

e1d4979

flake8 correct

a5caf4d

valeriupredoi reviewed Nov 4, 2025

valeriupredoi added 5 commits November 4, 2025 14:03

Update doc/p5dump.rst

e6ae44a

Update doc/p5dump.rst

e05748d

Update doc/p5dump.rst

83e7e98

Update doc/p5dump.rst

9a2e74b

Update doc/p5dump.rst

0080adf

valeriupredoi approved these changes Nov 4, 2025

valeriupredoi merged commit bac7664 into main Nov 4, 2025
6 of 7 checks passed

valeriupredoi deleted the optimise branch November 4, 2025 14:21

zequihg50 mentioned this pull request Nov 4, 2025

Proposal for is_cloud_optimized #145

Merged
