Add documentation for Pyfive #81
@@ -0,0 +1,15 @@

Additional API Features
***********************

In this section we highlight the additional API features and optimisations that ``pyfive`` provides beyond the standard ``h5py`` functionality.

Datasets (aka "variables") are actually implemented as ``pyfive.h5d.DatasetID`` objects. The ``pyfive.h5d.DatasetID`` class is designed both to support the same API as ``h5py.DatasetID`` and to provide additional functionality.

The autogenerated documentation identifies both the methods and attributes that are part of the ``h5py`` API and those that are extensions provided by ``pyfive``.

.. autoclass:: pyfive.h5d.DatasetID
   :members:
   :noindex:
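As a minimal sketch (assuming a local file ``data.h5`` containing a variable ``temp``, and that ``pyfive`` follows the ``h5py`` convention of exposing the low-level object through the dataset's ``.id`` attribute):

.. code-block:: python

    import pyfive

    with pyfive.File("data.h5", "r") as f:
        dset = f["temp"]   # a high-level dataset object
        dsid = dset.id     # the underlying pyfive.h5d.DatasetID object
        print(type(dsid))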
@@ -0,0 +1,24 @@

API Reference
*************

.. autoclass:: pyfive.File
   :members:
   :noindex:

----

.. autoclass:: pyfive.Group
   :members:
   :noindex:

----

.. autoclass:: pyfive.Dataset
   :members:
   :noindex:

----

.. autoclass:: pyfive.Datatype
   :members:
   :noindex:
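A minimal usage sketch of these classes (the file, group, and variable names are illustrative):

.. code-block:: python

    import pyfive

    with pyfive.File("data.h5", "r") as f:   # pyfive.File
        grp = f["model_output"]              # pyfive.Group
        dset = grp["temp"]                   # pyfive.Dataset
        print(dset.shape, dset.dtype)
        data = dset[...]                     # read the data into memory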
@@ -0,0 +1,124 @@

Optimising speed of data access
*******************************

HDF5 files can be large and complicated, with complex internal structures which can introduce significant overheads when accessing the data.

These complexities (and the overheads they introduce) can be mitigated by optimising how you access the data, but this requires an understanding of how the data is stored in the file and how the data access library (in this case ``pyfive``) works.

The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the file.
**Chunking**: HDF5 files can store data in chunks, which allows for more efficient access to large datasets.
However, this also means that the library needs to maintain an index (a "b-tree") which relates the position in coordinate space to where each chunk is stored in the file.
There is a b-tree index for each chunked variable, and this index can be scattered across the file, which can introduce overheads when accessing the data.

**Attributes**: HDF5 files can store attributes (metadata) associated with datasets and groups, and these attributes are stored in a separate section of the file.
Again, these can be scattered across the file.
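As a hedged sketch (assuming a file ``data.h5`` with a chunked variable ``temp``, and that ``pyfive`` mirrors the corresponding ``h5py`` dataset properties), both factors can be inspected directly:

.. code-block:: python

    import pyfive

    with pyfive.File("data.h5", "r") as f:
        dset = f["temp"]
        # ``chunks`` follows the h5py convention: a tuple of chunk
        # dimensions for chunked storage, or None for contiguous data.
        print("chunk shape:", dset.chunks)
        # attributes are exposed as a mapping, as in h5py
        print("attributes:", dict(dset.attrs))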
Optimising the files themselves
-------------------------------

Optimal access to data occurs when the data is chunked in a way that matches the access patterns of your application, and when the b-tree indexes and attributes are stored contiguously in the file.
Users of ``pyfive`` will always confront data files which have been created by other software, but if possible, it is worth exploring whether the `h5repack <https://docs.h5py.org/en/stable/special.html#h5repack>`_ tool can be used to make a copy of the file which is optimised for access, by using sensible chunks and by storing the attributes and b-tree indexes contiguously.
If that is possible, then all access will benefit from fewer calls to storage to get the necessary metadata, and the data access will be faster.
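For example, a hedged sketch of such a repack (the variable name, chunk sizes, and the 1 MiB metadata block size are illustrative; check the exact options against your ``h5repack`` version):

.. code-block:: console

    # Rechunk the variable "temp" into 256x256 chunks, and request a larger
    # metadata block size so file metadata is kept together on disk.
    h5repack -l temp:CHUNK=256x256 -M 1048576 input.h5 output.h5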
Avoiding Loading Information You Don't Need
-------------------------------------------

In general, the more information you load from the file, the slower the access will be. If you know which variables you need, don't iterate over the variables; instantiate them directly.

For example, instead of doing:

.. code-block:: python

    import pyfive

    with pyfive.File("data.h5", "r") as f:
        variables = {var: f[var] for var in f}  # instantiates every variable
        print("Variables in file:", list(variables))
        temp = variables['temp']
You can do:

.. code-block:: python

    import pyfive

    with pyfive.File("data.h5", "r") as f:
        temp = f['temp']

You might do the first when finding out what is in the file, but once you know what you need, it is much more efficient to access the variables directly.
That avoids a lot of loading of metadata and attributes that you don't need, and speeds up access to the data.
Parallel Data Access
--------------------

Unlike ``h5py``, ``pyfive`` is designed to be thread-safe, and it is possible to access the same file from multiple threads without contention.
This is particularly useful when working with large datasets, as it allows you to read data in parallel without blocking other threads.

For example, you can use the ``concurrent.futures`` module to read data from multiple variables in parallel:

.. code-block:: python

    import pyfive
    from concurrent.futures import ThreadPoolExecutor

    variable_names = ["var1", "var2", "var3"]

    with pyfive.File("data.h5", "r") as f:

        def get_min_of_variable(var_name):
            dset = f[var_name]
            data = dset[...]  # read the entire variable into memory
            return data.min()

        with ThreadPoolExecutor() as executor:
            results = list(executor.map(get_min_of_variable, variable_names))

        print("Results:", results)
**Author:** This works! But I have not noticed any time improvements. It's very possible that, since I had to cut the size of the data because my RAM is not big enough, what I was loading would be as fast, if not faster, in a single-threaded process.

**Collaborator:** As discussed, I'm slightly concerned by your RAM problems. It's probably OK to leave this (contrived) example in the docs, as all our real examples are too complex to use as exemplars, but we should do some due diligence on why this is happening.

**Author:** As I said, the issues I have with the buffer could well be because I am running out of both RAM and actual disk, so they may be very user-specific; that's why I said I need to look a lot closer at it.
You can do the same thing to parallelise manipulations within the variables, for example by using ``Dask``, but that is beyond the scope of this document.
Using pyfive with S3
--------------------

HDF5 was designed for use on POSIX file systems, where it makes sense to get specific ranges of bytes from files as they are needed.
For example, the extraction of a specific range of bytes from a variable with a statement like ``x = myvar[10:12]`` requires first the calculation of where that selection of data (10:12) sits in storage, and then the extraction (and perhaps decompression) of just the chunks of data needed to get that data. If the index needed to work out that location wasn't in memory, that would need to be read first. In practice ``pyfive`` tries to preload the index, but the net effect of all these operations is a lot of small reads from storage. Across a network, using S3, this would be prohibitive, so the ``s3fs`` middleware (used to make the remote file, which for HDF5 will be stored as one object, look like it is on a file system) tries to make fewer reads and to cache them in memory so that repeated reads can be more efficient. The optimal caching strategy depends on the file layout and the expected access pattern, so ``s3fs`` provides a lot of flexibility in how to configure that caching.
For ``pyfive``, the three most important variables to consider altering are the ``default_block_size`` number, the ``default_cache_type`` option, and the ``default_fill_cache`` boolean; a sketch of how they fit together follows this list.

- **default_block_size**
  This is the size (in bytes) of the blocks that ``s3fs`` will read in one transaction.
  The bigger this is, the fewer reads are undertaken, but the more memory and bandwidth are used.
  The default is 50 MB, which is a poor choice for most HDF5 files, where the metadata may be scattered across the file.
  In practice, a value of a small number of MB could be a good compromise for files which have not been repacked to store the metadata contiguously and/or where the data access pattern will be small random chunks.

- **default_cache_type**
  This is the type of caching that ``s3fs`` will use.
  Details of the available options for S3 are formally in the `fsspec documentation <https://filesystem-spec.readthedocs.io/en/latest/api.html#read-buffering>`_.
  Often the default of ``readahead`` is a good choice.

- **default_fill_cache**
  This is a boolean which determines whether ``s3fs`` will persistently cache the data that it reads.
  If this is set to ``True``, then the blocks are cached persistently in memory; if set to ``False``, then it only makes sense in conjunction with ``default_cache_type`` set to ``readahead`` or ``bytes``, to support streaming access to the data.
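As promised above, here is a minimal sketch of putting these options together (the bucket and object names are illustrative, anonymous access is assumed, and the block size of 2 MB is just one plausible compromise):

.. code-block:: python

    import pyfive
    import s3fs

    # Hypothetical public bucket and object; adjust credentials to suit.
    fs = s3fs.S3FileSystem(
        anon=True,
        default_block_size=2 * 1024 * 1024,  # small blocks for scattered metadata
        default_cache_type="readahead",
        default_fill_cache=True,
    )

    with fs.open("my-bucket/data.h5", "rb") as s3file:
        f = pyfive.File(s3file)
        temp = f["temp"]
        print(temp[0:10])

Because every uncached byte range becomes a network request, a smaller block size combined with ``readahead`` caching tends to balance the cost of scattered metadata reads against wasted bandwidth.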
This file was deleted.
This file was deleted.
This file was deleted.