Merged

35 commits
9bf85ba
Calculate b-tree range
Oct 21, 2025
83515d0
Merge remote-tracking branch 'origin/main' into btreeloc
Oct 21, 2025
70ad54e
Remove redudant chunk handling from btree.py (part of #131)
Oct 22, 2025
cd9c038
Just making a note that we need to be careful about chunk indexing sh…
Oct 22, 2025
1c9bef4
First cut at supporting ncdump like behaviour. Doesn't have support f…
Oct 22, 2025
9c1b688
Working implementation of lazy access to variables (#135) and partial…
Oct 23, 2025
2c3a3e3
allow visititems to be lazy
Oct 24, 2025
7827480
p5dump works for the test cases
Oct 24, 2025
481ca47
Fixed string handling in groups
Oct 24, 2025
5ab84bf
Better testing
Oct 24, 2025
a917d9e
Support for chunk information in p5dump via -s. (Fixed btree_range as…
Oct 26, 2025
173d9d1
p5dump -s includes storage type (from layout_class)
Oct 27, 2025
dae9eb1
Edge case detection and bug fix
Oct 27, 2025
aa4ca30
Handling broken pipes more gracefully
Oct 27, 2025
a84fea6
Why don't I run my tests before committing?
Oct 27, 2025
b02083c
Cleaning up the DatasetID interface error handling for chunk queries …
Oct 27, 2025
6c0a6c8
Merge remote-tracking branch 'origin/main' into optimise
Oct 27, 2025
f3eb83e
Merge branch 'main' into optimise
valeriupredoi Oct 27, 2025
3040c3d
Better checking of chunk info testing answers, courtesy of @zequihg50
Oct 28, 2025
d168277
Update pyfive/h5d.py
bnlawrence Nov 3, 2025
1c9cb5e
Requested changes from review
Nov 3, 2025
82d2566
Merge remote-tracking branch 'refs/remotes/origin/optimise' into opti…
Nov 3, 2025
d4d538e
IDE config in by mistake - removed
Nov 3, 2025
a2a785e
Added documentation for p5dump
Nov 3, 2025
40b3fa8
Improved documentation of the extra methods
Nov 3, 2025
f7cc695
add p5dump test in GHA
valeriupredoi Nov 4, 2025
d1432af
add test module for p5dump
valeriupredoi Nov 4, 2025
bdc4752
add one more test
valeriupredoi Nov 4, 2025
e1d4979
change test name
valeriupredoi Nov 4, 2025
a5caf4d
flake8 correct
valeriupredoi Nov 4, 2025
e6ae44a
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
e05748d
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
83e7e98
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
9a2e74b
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
0080adf
Update doc/p5dump.rst
valeriupredoi Nov 4, 2025
5 changes: 5 additions & 0 deletions .github/workflows/pytest.yml
@@ -53,6 +53,11 @@ jobs:
run: |
conda list
pip list
- name: Test p5dump
run: |
which p5dump
p5dump tests/data/groups.hdf5
p5dump tests/data/issue23_A.nc
- name: Test with pytest
run: |
# pytest tries to split test_threadsafe_data_access.py
1 change: 1 addition & 0 deletions doc/_sidebar.rst.inc
@@ -10,4 +10,5 @@
API Reference <api_reference>
Additional API Features <additional>
Optimising Data Access Speed <optimising>
The p5dump utility <p5dump>
Change Log <changelog>
30 changes: 25 additions & 5 deletions doc/additional.rst
@@ -3,13 +3,33 @@ Additional API Features

In this section we highlight the additional API features and optimisations that ``pyfive`` provides beyond the standard ``h5py`` functionality.

Datasets (aka "variables") are actually implemented as ``pyfive.h5d.DatasetID`` objects. The ``pyfive.h5d.DatasetID`` class is designed to both support the same API as ``h5py.DatasetID`` and to
provide additional functionality.
Modifications to the File API
-----------------------------

The autogenerated documentation identifies both the methods and attributes that are part of the ``h5py`` API and those that are extensions provided by ``pyfive``.
When accessing a file, there are two modifications to the standard ``h5py`` API that can be used to optimise
performance: a new method (``get_lazy_view``) and an additional keyword argument (``noindex``) on ``visititems``.
Both support access to all dataset metadata without loading chunk indices. (Loading chunk indices at dataset
instantiation is usually a useful optimisation, but not if you have no intention of accessing the data itself.)
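The deferral described above can be sketched in plain Python: a dataset object that postpones reading its expensive chunk index until the data is actually needed. This is a hypothetical illustration of the pattern, not pyfive's actual implementation:

```python
class LazyDataset:
    """Illustrative sketch: defer loading an expensive chunk index."""

    def __init__(self, name, index_loader):
        self.name = name
        self._index_loader = index_loader  # callable that would read the b-tree
        self._chunk_index = None           # not loaded at instantiation

    @property
    def chunk_index(self):
        # Load the index only on first data access, then cache it
        if self._chunk_index is None:
            self._chunk_index = self._index_loader()
        return self._chunk_index


loads = []
ds = LazyDataset("tas", lambda: loads.append("read") or {"chunks": [0, 4096]})
# Metadata access does not touch the index:
assert loads == []
# First data access triggers exactly one index read:
_ = ds.chunk_index
assert loads == ["read"]
```

The same trade-off motivates ``noindex`` on ``visititems``: a metadata-only walk over a file never pays the cost of reading each dataset's chunk index.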

.. autoclass:: pyfive.h5d.DatasetID
:members:
The ``Group`` API is fully documented in the autogenerated API reference, but the additional methods and keyword arguments are highlighted here.
These methods are also available on the ``File`` class, since ``File`` is a subclass of ``Group``.

.. automethod:: pyfive.high_level.Group.get_lazy_view
.. automethod:: pyfive.high_level.Group.visititems
:noindex:

Modifications to the DatasetID API
----------------------------------

When accessing datasets, additional functionality is exposed via the ``pyfive.h5d.DatasetID`` class, which
implements the low-level data access methods for datasets (aka "variables").

The DatasetID API is fully documented in the autogenerated API reference, but the additional methods and attributes are highlighted here:

.. autoattribute:: pyfive.h5d.DatasetID.first_chunk
.. autoattribute:: pyfive.h5d.DatasetID.btree_range
.. automethod:: pyfive.h5d.DatasetID.set_pseudo_chunk_size




23 changes: 18 additions & 5 deletions doc/api_reference.rst
@@ -1,29 +1,42 @@
API Reference
*************


File
-------

.. autoclass:: pyfive.File
:members:
:noindex:

----

Group
--------

.. autoclass:: pyfive.Group
:members:
:noindex:

----
Dataset
--------

.. autoclass:: pyfive.Dataset
:members:
:noindex:

----

DatasetID
----------
.. autoclass:: pyfive.h5d.DatasetID
:members:
:noindex:

Datatype
--------

.. autoclass:: pyfive.Datatype
:members:
:noindex:

----

The h5t module
--------------
1 change: 1 addition & 0 deletions doc/gensidebar.py
@@ -62,6 +62,7 @@ def _header(project, text):
_write("pyfive", "API Reference", "api_reference")
_write("pyfive", "Additional API Features", "additional")
_write("pyfive", "Optimising Data Access Speed", "optimising")
_write("pyfive", "The p5dump utility", "p5dump")
_write("pyfive", "Change Log", "changelog")
# _write("pyfive", "Examples", "examples")
# _write("pyfive", "Contributing to the community", "community/index")
2 changes: 2 additions & 0 deletions doc/introduction.rst
@@ -35,6 +35,8 @@ once a variable is instantiated (i.e. for an open ``pyfive.File`` instance ``f``,
the attributes and b-tree (chunk index) are read, and it is then possible to close the parent file (``f``),
but continue to use (``v``).

The package includes a script ``p5dump`` which can be used to dump the contents of an HDF5 file to the terminal.

.. note::

We have test coverage that shows that the usage of ``v`` in this way is thread-safe - the test which demonstrates this is slow,
19 changes: 19 additions & 0 deletions doc/p5dump.rst
@@ -0,0 +1,19 @@
p5dump
******

``pyfive`` includes a command line tool ``p5dump`` which can be used to dump the contents of an HDF5 file to the
terminal. This is similar to the ``ncdump`` tool included with the NetCDF library, or the ``h5dump`` tool included
with the HDF5 library, but like the rest of pyfive, is implemented in pure Python without any dependencies on the
HDF5 C library.

It is not identical to either of these tools, though the default output is very close to that of ``ncdump``.
When called with ``-s`` (e.g. ``p5dump -s myfile.hdf5``), the output provides extra information for chunked
datasets, including the locations of the start and end of the chunk index b-tree
and the location of the first data chunk for that variable. This extra information is useful for understanding
the performance of data access for chunked variables, particularly when accessing data in object stores such as
S3. In general, if one finds that the b-tree index continues past the first data chunk, access
performance may be sub-optimal - in this situation, if you have control over the data, you might well
consider using the ``h5repack`` tool from the standard HDF5 distribution to make a copy of the file with the
chunk index and attributes stored contiguously. All tools which read HDF5 files will benefit from this.
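The heuristic described above can be expressed directly. Given the b-tree range and first-chunk offsets that ``p5dump -s`` reports, a simple comparison suffices (an illustrative helper, not part of pyfive):

```python
def index_precedes_data(btree_range, first_chunk):
    """Return True if the chunk index b-tree ends before the first data
    chunk, i.e. the layout favours sequential / object-store access.

    btree_range -- (start, end) byte offsets of the chunk index b-tree
    first_chunk -- byte offset of the first data chunk
    """
    start, end = btree_range
    return end < first_chunk


# A well-packed file: the index sits at the front, data after it.
assert index_precedes_data((2048, 6144), 8192)
# The index continues past the first chunk: repacking may help.
assert not index_precedes_data((2048, 100000), 8192)
```

When the check fails, the index and data chunks are interleaved, and a reader over high-latency storage may make extra round trips just to discover where the chunks are.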


1 change: 0 additions & 1 deletion doc/quickstart/index.rst
@@ -8,6 +8,5 @@ Getting started
Usage <usage>
Enumerations <enums>
Opaque Datasets <opaque>



1 change: 1 addition & 0 deletions pyfive/__init__.py
@@ -8,6 +8,7 @@
from pyfive.h5t import check_enum_dtype, check_string_dtype, check_dtype, opaque_dtype, check_opaque_dtype
from pyfive.h5py import Datatype, Empty
from importlib.metadata import version
from pyfive.inspect import p5ncdump

__version__ = '0.5.0.dev'

52 changes: 3 additions & 49 deletions pyfive/btree.py
@@ -22,6 +22,7 @@ def __init__(self, fh, offset):
self.offset = offset
self.depth = None
self.all_nodes = {}
self.last_offset = offset

self._read_root_node()
self._read_children()
@@ -53,6 +54,7 @@ def _read_node(self, offset, node_level):
node = self._read_node_header(offset, node_level)
node['keys'] = []
node['addresses'] = []
self.last_offset = max(offset, self.last_offset)
return node

def _read_node_header(self, offset):
@@ -149,57 +151,9 @@ def _read_node(self, offset, node_level):
addresses.append(chunk_address)
node['keys'] = keys
node['addresses'] = addresses
self.last_offset = max(offset, self.last_offset)
return node
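The ``last_offset`` bookkeeping added in this hunk amounts to tracking the largest node offset seen while walking the tree, so the b-tree's byte extent falls out of the traversal for free. A simplified standalone sketch of that idea (not the real traversal, which reads nodes from the file):

```python
def btree_extent(root_offset, node_offsets):
    """Track the byte range spanned by b-tree nodes: the root offset is
    the start, and the largest node offset visited is the end."""
    last_offset = root_offset
    for offset in node_offsets:
        last_offset = max(offset, last_offset)
    return (root_offset, last_offset)


# Nodes may be visited in any order; only the maximum matters.
assert btree_extent(512, [1024, 768, 4096]) == (512, 4096)
```

This is the quantity that ``btree_range`` exposes, and why no extra pass over the file is needed to compute it.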

def construct_data_from_chunks(
self, chunk_shape, data_shape, dtype, filter_pipeline):
""" Build a complete data array from chunks. """
if isinstance(dtype, tuple):
true_dtype = tuple(dtype)
dtype_class = dtype[0]
if dtype_class == 'REFERENCE':
size = dtype[1]
if size != 8:
raise NotImplementedError('Unsupported Reference type')
dtype = '<u8'
else:
raise NotImplementedError('datatype not implemented')
else:
true_dtype = None

# create array to store data
shape = [_padded_size(i, j) for i, j in zip(data_shape, chunk_shape)]
data = np.zeros(shape, dtype=dtype)

# loop over chunks reading each into the full data array
count = np.prod(chunk_shape)
itemsize = np.dtype(dtype).itemsize
chunk_buffer_size = count * itemsize
for node in self.all_nodes[0]:
for node_key, addr in zip(node['keys'], node['addresses']):
self.fh.seek(addr)
if filter_pipeline is None:
chunk_buffer = self.fh.read(chunk_buffer_size)
else:
chunk_buffer = self.fh.read(node_key['chunk_size'])
filter_mask = node_key['filter_mask']
chunk_buffer = self._filter_chunk(
chunk_buffer, filter_mask, filter_pipeline, itemsize)

chunk_data = np.frombuffer(chunk_buffer, dtype=dtype)
start = node_key['chunk_offset'][:-1]
region = [slice(i, i+j) for i, j in zip(start, chunk_shape)]
data[tuple(region)] = chunk_data.reshape(chunk_shape)

if isinstance(true_dtype, tuple):
if dtype_class == 'REFERENCE':
to_reference = np.vectorize(Reference)
data = to_reference(data)
else:
raise NotImplementedError('datatype not implemented')

non_padded_region = tuple([slice(i) for i in data_shape])
return data[non_padded_region]

@classmethod
def _filter_chunk(cls, chunk_buffer, filter_mask, filter_pipeline, itemsize):