Merged

35 commits
9bf85ba
Calculate b-tree range
Oct 21, 2025
83515d0
Merge remote-tracking branch 'origin/main' into btreeloc
Oct 21, 2025
70ad54e
Remove redudant chunk handling from btree.py (part of #131)
Oct 22, 2025
cd9c038
Just making a note that we need to be careful about chunk indexing sh…
Oct 22, 2025
1c9bef4
First cut at supporting ncdump like behaviour. Doesn't have support f…
Oct 22, 2025
9c1b688
Working implementation of lazy access to variables (#135) and partial…
Oct 23, 2025
2c3a3e3
allow visititems to be lazy
Oct 24, 2025
7827480
p5dump works for the test cases
Oct 24, 2025
481ca47
Fixed string handling in groups
Oct 24, 2025
5ab84bf
Better testing
Oct 24, 2025
a917d9e
Support for chunk information in p5dump via -s. (Fixed btree_range as…
Oct 26, 2025
173d9d1
p5dump -s includes storage type (from layout_class)
Oct 27, 2025
dae9eb1
Edge case detection and bug fix
Oct 27, 2025
aa4ca30
Handling broken pipes more gracefully
Oct 27, 2025
a84fea6
Why don't I run my tests before committing?
Oct 27, 2025
b02083c
Cleaning up the DatasetID interface error handling for chunk queries …
Oct 27, 2025
6c0a6c8
Merge remote-tracking branch 'origin/main' into optimise
Oct 27, 2025
f3eb83e
Merge branch 'main' into optimise
valeriupredoi Oct 27, 2025
3040c3d
Better checking of chunk info testing answers, courtesy of @zequihg50
Oct 28, 2025
d168277
Update pyfive/h5d.py
bnlawrence Nov 3, 2025
1c9cb5e
Requested changes from review
Nov 3, 2025
82d2566
Merge remote-tracking branch 'refs/remotes/origin/optimise' into opti…
Nov 3, 2025
d4d538e
IDE config in by mistake - removed
Nov 3, 2025
a2a785e
Added documentation for p5dump
Nov 3, 2025
40b3fa8
Improved documentation of the extra methods
Nov 3, 2025
f7cc695
add p5dump test in GHA
valeriupredoi Nov 4, 2025
d1432af
add test module for p5dump
valeriupredoi Nov 4, 2025
bdc4752
add one more test
valeriupredoi Nov 4, 2025
e1d4979
change test name
valeriupredoi Nov 4, 2025
a5caf4d
flake8 correct
valeriupredoi Nov 4, 2025
e6ae44a
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
e05748d
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
83e7e98
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
9a2e74b
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
0080adf
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
5 changes: 5 additions & 0 deletions .github/workflows/pytest.yml
@@ -53,6 +53,11 @@ jobs:
run: |
conda list
pip list
- name: Test p5dump
run: |
which p5dump
p5dump tests/data/groups.hdf5
p5dump tests/data/issue23_A.nc
- name: Test with pytest
run: |
# pytest tries to split test_threadsafe_data_access.py
1 change: 1 addition & 0 deletions doc/_sidebar.rst.inc
@@ -10,4 +10,5 @@
API Reference <api_reference>
Additional API Features <additional>
Optimising Data Access Speed <optimising>
The p5dump utility <p5dump>
Change Log <changelog>
30 changes: 25 additions & 5 deletions doc/additional.rst
@@ -3,13 +3,33 @@ Additional API Features

In this section we highlight the additional API features and optimisations that ``pyfive`` provides beyond the standard ``h5py`` functionality.

Datasets (aka "variables") are actually implemented as ``pyfive.h5d.DatasetID`` objects. The ``pyfive.h5d.DatasetID`` class is designed to both support the same API as ``h5py.DatasetID`` and to
provide additional functionality.
Modifications to the File API
-----------------------------

The autogenerated documentation identifies both the methods and attributes that are part of the ``h5py`` API and those that are extensions provided by ``pyfive``.
When accessing a file, there are two modifications to the standard ``h5py`` API that can be used to optimise
performance: a new method (``get_lazy_view``) and an additional keyword argument (``noindex``) on ``visititems``.
Both support access to all dataset metadata without loading chunk indices. (Loading chunk indices at dataset
instantiation is usually a useful optimisation, but not if you have no intention of accessing the data itself.)
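The deferral described above can be sketched in plain Python: a dataset object that postpones reading its expensive chunk index until the data is actually needed. This is a hypothetical illustration of the pattern, not pyfive's actual implementation:

```python
class LazyDataset:
    """Illustrative sketch: defer loading an expensive chunk index."""

    def __init__(self, name, index_loader):
        self.name = name
        self._index_loader = index_loader  # callable that would read the b-tree
        self._chunk_index = None           # not loaded at instantiation

    @property
    def chunk_index(self):
        # Load the index only on first data access, then cache it
        if self._chunk_index is None:
            self._chunk_index = self._index_loader()
        return self._chunk_index


loads = []
ds = LazyDataset("tas", lambda: loads.append("read") or {"chunks": [0, 4096]})
# Metadata access does not touch the index:
assert loads == []
# First data access triggers exactly one index read:
_ = ds.chunk_index
assert loads == ["read"]
```

The same trade-off motivates ``noindex`` on ``visititems``: a metadata-only walk over a file never pays the cost of reading each dataset's chunk index.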

.. autoclass:: pyfive.h5d.DatasetID
:members:
The ``Group`` API is fully documented in the autogenerated API reference, but the additional methods and keyword arguments are highlighted here.
These methods are also available on the ``File`` class, since ``File`` is a subclass of ``Group``.

.. automethod:: pyfive.high_level.Group.get_lazy_view
.. automethod:: pyfive.high_level.Group.visititems
:noindex:

Modifications to the DatasetID API
----------------------------------

When accessing datasets, additional functionality is exposed via the ``pyfive.h5d.DatasetID`` class, which
implements the low-level data access methods for datasets (aka "variables").

The DatasetID API is fully documented in the autogenerated API reference, but the additional methods and attributes are highlighted here:

.. autoattribute:: pyfive.h5d.DatasetID.first_chunk
.. autoattribute:: pyfive.h5d.DatasetID.btree_range
.. automethod:: pyfive.h5d.DatasetID.set_pseudo_chunk_size




23 changes: 18 additions & 5 deletions doc/api_reference.rst
@@ -1,29 +1,42 @@
API Reference
*************


File
-------

.. autoclass:: pyfive.File
:members:
:noindex:

----

Group
--------

.. autoclass:: pyfive.Group
:members:
:noindex:

----
Dataset
--------

.. autoclass:: pyfive.Dataset
:members:
:noindex:

----

DatasetID
----------
.. autoclass:: pyfive.h5d.DatasetID
:members:
:noindex:

Datatype
--------

.. autoclass:: pyfive.Datatype
:members:
:noindex:

----

The h5t module
--------------
1 change: 1 addition & 0 deletions doc/gensidebar.py
@@ -62,6 +62,7 @@ def _header(project, text):
_write("pyfive", "API Reference", "api_reference")
_write("pyfive", "Additional API Features", "additional")
_write("pyfive", "Optimising Data Access Speed", "optimising")
_write("pyfive", "The p5dump utility", "p5dump")
_write("pyfive", "Change Log", "changelog")
# _write("pyfive", "Examples", "examples")
# _write("pyfive", "Contributing to the community", "community/index")
2 changes: 2 additions & 0 deletions doc/introduction.rst
@@ -35,6 +35,8 @@ once a variable is instantiated (i.e. for an open ``pyfive.File`` instance ``f``,
the attributes and b-tree (chunk index) are read, and it is then possible to close the parent file (``f``),
but continue to use (``v``).

The package includes a script ``p5dump`` which can be used to dump the contents of an HDF5 file to the terminal.

.. note::

We have test coverage that shows that the usage of ``v`` in this way is thread-safe - the test which demonstrates this is slow,
19 changes: 19 additions & 0 deletions doc/p5dump.rst
@@ -0,0 +1,19 @@
p5dump
******

``pyfive`` includes a command line tool ``p5dump`` which can be used to dump the contents of an HDF5 file to the
terminal. This is similar to the ``ncdump`` tool included with the NetCDF library, or the ``h5dump`` tool included
with the HDF5 library, but like the rest of pyfive, is implemented in pure Python without any dependencies on the
HDF5 C library.

It is not identical to either of these tools, though the default output is very close to that of ``ncdump``.
When called with ``-s`` (e.g. ``p5dump -s myfile.hdf5``), the output provides extra information for chunked
datasets, including the locations of the start and end of the chunk index b-tree
and the location of the first data chunk for that variable. This extra information is useful for understanding
the performance of data access for chunked variables, particularly when accessing data in object stores such as
S3. In general, if one finds that the b-tree index continues past the first data chunk, access
performance may be sub-optimal - in this situation, if you have control over the data, you might well
consider using the ``h5repack`` tool from the standard HDF5 distribution to make a copy of the file with the
chunk index and attributes stored contiguously. All tools which read HDF5 files will benefit from this.
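The heuristic described above can be expressed directly. Given the b-tree range and first-chunk offsets that ``p5dump -s`` reports, a simple comparison suffices (an illustrative helper, not part of pyfive):

```python
def index_precedes_data(btree_range, first_chunk):
    """Return True if the chunk index b-tree ends before the first data
    chunk, i.e. the layout favours sequential / object-store access.

    btree_range -- (start, end) byte offsets of the chunk index b-tree
    first_chunk -- byte offset of the first data chunk
    """
    start, end = btree_range
    return end < first_chunk


# A well-packed file: the index sits at the front, data after it.
assert index_precedes_data((2048, 6144), 8192)
# The index continues past the first chunk: repacking may help.
assert not index_precedes_data((2048, 100000), 8192)
```

When the check fails, the index and data chunks are interleaved, and a reader over high-latency storage may make extra round trips just to discover where the chunks are.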


1 change: 0 additions & 1 deletion doc/quickstart/index.rst
@@ -8,6 +8,5 @@ Getting started
Usage <usage>
Enumerations <enums>
Opaque Datasets <opaque>



1 change: 1 addition & 0 deletions pyfive/__init__.py
@@ -8,6 +8,7 @@
from pyfive.h5t import check_enum_dtype, check_string_dtype, check_dtype, opaque_dtype, check_opaque_dtype
from pyfive.h5py import Datatype, Empty
from importlib.metadata import version
from pyfive.inspect import p5ncdump

__version__ = '0.5.0.dev'

52 changes: 3 additions & 49 deletions pyfive/btree.py
@@ -22,6 +22,7 @@ def __init__(self, fh, offset):
self.offset = offset
self.depth = None
self.all_nodes = {}
self.last_offset = offset

self._read_root_node()
self._read_children()
@@ -53,6 +54,7 @@ def _read_node(self, offset, node_level):
node = self._read_node_header(offset, node_level)
node['keys'] = []
node['addresses'] = []
self.last_offset = max(offset, self.last_offset)
return node

def _read_node_header(self, offset):
@@ -149,57 +151,9 @@ def _read_node(self, offset, node_level):
addresses.append(chunk_address)
node['keys'] = keys
node['addresses'] = addresses
self.last_offset = max(offset, self.last_offset)
return node
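The ``last_offset`` bookkeeping added in this hunk amounts to tracking the largest node offset seen while walking the tree, so the b-tree's byte extent falls out of the traversal for free. A simplified standalone sketch of that idea (not the real traversal, which reads nodes from the file):

```python
def btree_extent(root_offset, node_offsets):
    """Track the byte range spanned by b-tree nodes: the root offset is
    the start, and the largest node offset visited is the end."""
    last_offset = root_offset
    for offset in node_offsets:
        last_offset = max(offset, last_offset)
    return (root_offset, last_offset)


# Nodes may be visited in any order; only the maximum matters.
assert btree_extent(512, [1024, 768, 4096]) == (512, 4096)
```

This is the quantity that ``btree_range`` exposes, and why no extra pass over the file is needed to compute it.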

def construct_data_from_chunks(
self, chunk_shape, data_shape, dtype, filter_pipeline):
""" Build a complete data array from chunks. """
if isinstance(dtype, tuple):
true_dtype = tuple(dtype)
dtype_class = dtype[0]
if dtype_class == 'REFERENCE':
size = dtype[1]
if size != 8:
raise NotImplementedError('Unsupported Reference type')
dtype = '<u8'
else:
raise NotImplementedError('datatype not implemented')
else:
true_dtype = None

# create array to store data
shape = [_padded_size(i, j) for i, j in zip(data_shape, chunk_shape)]
data = np.zeros(shape, dtype=dtype)

# loop over chunks reading each into the full data array
count = np.prod(chunk_shape)
itemsize = np.dtype(dtype).itemsize
chunk_buffer_size = count * itemsize
for node in self.all_nodes[0]:
for node_key, addr in zip(node['keys'], node['addresses']):
self.fh.seek(addr)
if filter_pipeline is None:
chunk_buffer = self.fh.read(chunk_buffer_size)
else:
chunk_buffer = self.fh.read(node_key['chunk_size'])
filter_mask = node_key['filter_mask']
chunk_buffer = self._filter_chunk(
chunk_buffer, filter_mask, filter_pipeline, itemsize)

chunk_data = np.frombuffer(chunk_buffer, dtype=dtype)
start = node_key['chunk_offset'][:-1]
region = [slice(i, i+j) for i, j in zip(start, chunk_shape)]
data[tuple(region)] = chunk_data.reshape(chunk_shape)

if isinstance(true_dtype, tuple):
if dtype_class == 'REFERENCE':
to_reference = np.vectorize(Reference)
data = to_reference(data)
else:
raise NotImplementedError('datatype not implemented')

non_padded_region = tuple([slice(i) for i in data_shape])
return data[non_padded_region]

@classmethod
def _filter_chunk(cls, chunk_buffer, filter_mask, filter_pipeline, itemsize):