Optimise when we get access to b-tree by providing lazier view of datasets, access to b-tree location, and new p5dump#138
Conversation
…ould we have a v3 layout in the future.
…or file or group attributes yet, or phony dimensions.
… implementation of p5dump functionality (#134). Unit tests are failing due to a desire to get closer to (but not exactly) what ncdump will do.
…on unchunked data and some tests to keep V happy.
|
I have not added any documentation yet, I figured I'd do that once we had agreed the p5dump API and functionality, and agreed on the btree range. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #138 +/- ##
==========================================
+ Coverage 74.66% 76.21% +1.54%
==========================================
Files 12 14 +2
Lines 2712 2867 +155
Branches 407 450 +43
==========================================
+ Hits 2025 2185 +160
+ Misses 576 561 -15
- Partials 111 121 +10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This could serve to validate using external tools, although I think I'm more in favor of checking with hardcoded values here, just to avoid a false positive if the external tool is wrong. def test_get_chunk_info_chunked():
# start lazy, then go real
with pyfive.File(DATASET_CHUNKED_HDF5_FILE) as hfile, \
h5py.File(DATASET_CHUNKED_HDF5_FILE) as h5f, \
open(DATASET_CHUNKED_HDF5_FILE, "rb") as f:
ds = hfile.get_lazy_view('dataset1')
assert ds.id._DatasetID__index_built == False
si = StoreInfo((0, 0), 0, 4016, 16)
info = ds.id.get_chunk_info(0)
assert info == si
assert ds.id.get_num_chunks() == 88
assert h5f["dataset1"].id.get_num_chunks() == 88
assert h5f["dataset1"].id.get_chunk_info(0) == si
assert ds.id.btree_range == (1072, 8680)
f.seek(1072)
assert f.read(4) == b"TREE" # only v1 btrees
f.seek(8680)
assert f.read(4) == b"TREE" # only v1 btrees |
|
@zequihg50 Thanks. That's an excellent addition and makes me far happier. I'll push that up in a minute! |
davidhassell
left a comment
There was a problem hiding this comment.
Hi Bryan,
Looks goods. Some suggestions made, but nothing that's any problem ...
... apart from running p5dump on the console command line. Should that be possible? I couldn't make it so (best I got was a nasty circular import).
As far as I know, the usual thing is to put the command line script into pyfive/scripts, and then the script will get copied to somewhere in $PATH at intall time (e.g. https://github.com/NCAS-CMS/cfdm/blob/main/scripts/cfdump) - but I might not be up to date on this!
From my perspective, if the command-line thing is an issue (might just have been me) then that can be sorted in another PR, rather than holding this up.
Thanks,
David
Add a simpler interface to first chunk in datasetid (courtesy of @davidhassell) Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
|
@valeriupredoi I think i've made all the requested changes. Can you please also have a look at the docs and make sure you're happy with the changes? Otherwise I think this is good to go. |
awesome, many thanks @bnlawrence 🍻 Shall dos, in a jiffy! |
|
Actually, it's not ready, I think I need to document the lazy evaluation stuff first. Tomorrow. |
just lemme know when you ready, and I'll have a close look, I'd like to test the executable functionality too, and I will add a small test for it in the GHA; but when you ready, no rush 🍻 |
|
Ok, this time I think it's ready. I just pushed up some documentation. If you find issues with it, please just fix it :-)! |
valeriupredoi
left a comment
There was a problem hiding this comment.
spelling! Brayn 😁
valeriupredoi
left a comment
There was a problem hiding this comment.
many thanks to @bnlawrence and everyone else who pitched in here 🍻
Description
This pull request addresses three specific issues:
@zequihg50 Could you please have a look at this one too? Especially the b-tree range stuff and the related tests.
Checklist