Merged

29 commits
b0fc133
correct installation doc page
valeriupredoi Jul 17, 2025
7095b1b
Update doc/quickstart/installation.rst
valeriupredoi Aug 11, 2025
9234977
Improved intro, starting to scaffold usage and useful stuff
Aug 13, 2025
8da5631
Working with datasets
Aug 13, 2025
a2ed024
removing the original scaffolding
Aug 13, 2025
569eb3b
s3 access docs
Aug 13, 2025
1b58b2a
Emphasizing that there are differences from h5py
Aug 13, 2025
e89c8a5
add api reference
davidhassell Aug 13, 2025
effb4a9
Merge remote-tracking branch 'refs/remotes/origin/add_documentation' …
Aug 13, 2025
fbf5e16
Adding the h5d and additional API information
Aug 13, 2025
c8e39ac
Improving docstrings in h5d.py
Aug 13, 2025
0139693
Some material for the optimising section. Absolutely needs checking (…
Aug 13, 2025
425732e
Would have been smart to have run sphinx first
Aug 13, 2025
ea20d70
add no0index to api members
valeriupredoi Aug 13, 2025
f06dbac
Adding details on S3 config to the optimising section
Aug 14, 2025
933bffc
Merge remote-tracking branch 'refs/remotes/origin/add_documentation' …
Aug 14, 2025
72b0d68
Update doc/introduction.rst
bnlawrence Aug 15, 2025
658aa1e
Update doc/introduction.rst
bnlawrence Aug 15, 2025
a10833e
Update doc/optimising.rst
bnlawrence Aug 15, 2025
24f0eaa
Update doc/optimising.rst
bnlawrence Aug 15, 2025
db09a82
Update doc/optimising.rst
bnlawrence Aug 15, 2025
bfb4b8a
Update doc/optimising.rst
bnlawrence Aug 15, 2025
bec1fbd
Update doc/quickstart/usage.rst
bnlawrence Aug 15, 2025
e8a6c2d
Update doc/quickstart/usage.rst
bnlawrence Aug 15, 2025
45149cd
Update doc/quickstart/usage.rst
bnlawrence Aug 15, 2025
28c3a0c
Update doc/quickstart/usage.rst
bnlawrence Aug 15, 2025
1c1cdbc
Modified the DatasetID class docstring to better reflect what it is r…
Aug 15, 2025
bd9622a
:noindex:
davidhassell Aug 15, 2025
dd435af
add correct s3 example
valeriupredoi Aug 15, 2025
3 changes: 3 additions & 0 deletions doc/_sidebar.rst.inc
@@ -7,3 +7,6 @@

Introduction <introduction>
Getting started <quickstart/index>
API Reference <api_reference>
Additional API Features <additional>
Optimising Data Access Speed <optimising>
15 changes: 15 additions & 0 deletions doc/additional.rst
@@ -0,0 +1,15 @@
Additional API Features
***********************

In this section we highlight the additional API features and optimisations that ``pyfive`` provides beyond the standard ``h5py`` functionality.

Datasets (aka "variables") are actually implemented as ``pyfive.h5d.DatasetID`` objects. The ``pyfive.h5d.DatasetID`` class is designed both to support the same API as ``h5py.DatasetID`` and to
provide additional functionality.

The autogenerated documentation identifies both the methods and attributes that are part of the ``h5py`` API and those that are extensions provided by ``pyfive``.

.. autoclass:: pyfive.h5d.DatasetID
:members:
:noindex:


24 changes: 24 additions & 0 deletions doc/api_reference.rst
@@ -0,0 +1,24 @@
API Reference
*************

.. autoclass:: pyfive.File
:members:
:noindex:

----

.. autoclass:: pyfive.Group
:members:
:noindex:

----

.. autoclass:: pyfive.Dataset
:members:
:noindex:

----

.. autoclass:: pyfive.Datatype
:members:
:noindex:
3 changes: 3 additions & 0 deletions doc/gensidebar.py
@@ -59,6 +59,9 @@ def _header(project, text):
_header("pyfive", "Pyfive")
_write("pyfive", "Introduction", "introduction")
_write("pyfive", "Getting started", "quickstart/index")
_write("pyfive", "API Reference", "api_reference")
_write("pyfive", "Additional API Features", "additional")
_write("pyfive", "Optimising Data Access Speed", "optimising")
# _write("pyfive", "Examples", "examples")
# _write("pyfive", "Contributing to the community", "community/index")
# _write("pyfive", "Utilities", "utils")
38 changes: 34 additions & 4 deletions doc/introduction.rst
@@ -4,10 +4,40 @@ Introduction
About Pyfive
============

Pyfive provides a pure Python backend reader for ``h5netcdf``, it also exposes variable b-trees to other downstream software.
``pyfive`` provides a pure Python HDF5 reader designed as a thread-safe, drop-in replacement
for `h5py <https://github.com/h5py/h5py>`_ with no dependency on the HDF5 C library. It aims to support the same ``h5py`` API
for reading files. Cases where access to a file uses a feature that is supported by the high-level ``h5py`` interface but not by ``pyfive`` are considered bugs and
should be reported in our `Issues <https://github.com/NCAS-CMS/pyfive/issues>`_.
Writing HDF5 is not a goal of ``pyfive``, and portions of the ``h5py`` API which apply only to writing will not be
implemented.

Our motivations included thread-safety and performance at scale in a cloud environment. To do this we have implemented versions of some more components of the h5py stack, and in particular, a version of the h5d.DatasetID class, which is now holds all the code which is used for data access (as opposed to attribute access, which still lives in dataobjects). There are a couple of extra methods for exposing the chunk index directly rather than via an iterator and to access chunk info using the zarr indexing scheme rather than the h5py indexing scheme.
.. note::
While ``pyfive`` is designed to be a drop-in replacement for ``h5py``, the reverse may not be possible. It is possible to do things with ``pyfive``
that will not work with ``h5py``, and ``pyfive`` definitely includes *extensions* to the ``h5py`` API. This documentation makes clear which parts of
the API are extensions and where behaviour differs *by design* from ``h5py``.

The code also includes an implementation of what we have called pseudochunking which is used for accessing a contiguous array which is larger than memory via S3. In essence all this does is declare default chunks aligned with the array order on disk and use them for data access.
The motivations for ``pyfive`` development were many, but recent work has prioritised thread-safety, lazy loading, and
performance at scale in a cloud environment, both standalone
and as a backend for other software such as `cf-python <https://ncas-cms.github.io/cf-python/>`_, `xarray <https://docs.xarray.dev/en/stable/>`_, and `h5netcdf <https://h5netcdf.org/index.html>`_.

There are many small bug fixes and optimisations to support cloud usage, the most important of which is that once a variable is instantiated (i.e. for an open pyfive.File instance f, when you do ``v=f['variable_name']``) the attributes and b-tree are read, and it is then possible to close the parent file (f), but continue to use (v) - and we have test coverage that shows that this usage of v is thread-safe (there is a test which demonstrates this, it's slow, but it needs to be as shorter tests were sporadically passing). (The test harness now includes all the components necessary for testing pyfive accessing data via both Posix and S3).
As well as the high-level ``h5py`` API we have implemented a version of the ``h5d.DatasetID`` class, which now
holds all the code which is used for data access (as opposed to attribute access). We have also implemented
extra methods (beyond the ``h5py`` API) to expose the chunk index directly (as well as via an iterator) and
to access chunk info using the ``zarr`` indexing scheme rather than the ``h5py`` indexing scheme. This is useful for avoiding
the need for *a priori* use of ``kerchunk`` to make a ``zarr`` index for a file.

The code also includes an implementation of what we have called pseudochunking which is used for accessing
a contiguous array which is larger than memory via S3. In essence all this does is declare default chunks
aligned with the array order on disk and use them for data access.

There are optimisations to support cloud usage, the most important of which is that
once a variable is instantiated (i.e. for an open ``pyfive.File`` instance ``f``, when you do ``v = f['variable_name']``)
the attributes and b-tree (chunk index) are read, and it is then possible to close the parent file ``f``
but continue to use ``v``.

.. note::

We have test coverage that shows that the usage of ``v`` in this way is thread-safe - the test which demonstrates this is slow,
but it needs to be, since shorter tests did not always exercise expected failure modes.

The pyfive test suite includes all the components necessary for testing pyfive accessing data via both POSIX and S3.
124 changes: 124 additions & 0 deletions doc/optimising.rst
@@ -0,0 +1,124 @@
Optimising speed of data access
*******************************

HDF5 files can be large and complicated, with complex internal structures which can introduce significant overheads when accessing the data.

These complexities (and the overheads they introduce) can be mitigated by optimising how you access the data, but this requires an understanding of
how the data is stored in the file and how the data access library (in this case ``pyfive``) works.

The data storage complexities arise from two main factors: the use of chunking, and the way attributes are stored in the files.

**Chunking**: HDF5 files can store data in chunks, which allows for more efficient access to large datasets.
However, this also means that the library needs to maintain an index (a "b-tree") which relates the position in coordinate space to where each chunk is stored in the file.
There is a b-tree index for each chunked variable, and this index can be scattered across the file, which can introduce overheads when accessing the data.

**Attributes**: HDF5 files can store attributes (metadata) associated with datasets and groups, and these attributes are stored in a separate section of the file.
Again, these can be scattered across the files.


Optimising the files themselves
-------------------------------

Optimal access to data occurs when the data is chunked in a way that matches the access patterns of your application, and when the
b-tree indexes and attributes are stored contiguously in the file.


Users of ``pyfive`` will usually be working with data files which have been created by other software, but where possible it is worth exploring whether
the `h5repack <https://docs.h5py.org/en/stable/special.html#h5repack>`_ tool can
be used to make a copy of the file which is optimised for access, using sensible chunks and storing the attributes and b-tree indexes contiguously.
If that is possible, then all access will benefit from fewer calls to storage to get the necessary metadata, and the data access will be faster.


Avoiding Loading Information You Don't Need
-------------------------------------------

In general, the more information you load from the file, the slower the access will be. If you know which variables you need, then don't iterate
over the variables; instantiate them directly.

For example, instead of doing:

.. code-block:: python

import pyfive

with pyfive.File("data.h5", "r") as f:
variables = [var for var in f]
print("Variables in file:", variables)
temp = f['temp']

You can do:

.. code-block:: python

import pyfive
with pyfive.File("data.h5", "r") as f:
temp = f['temp']

You might do the first when finding out what is in the file, but once you know what you need, it is much more efficient to access the variables directly.
That avoids a lot of loading of metadata and attributes that you don't need, and speeds up the access to the data.


Parallel Data Access
--------------------

Unlike ``h5py``, ``pyfive`` is designed to be thread-safe, and it is possible to access the same file from multiple threads without contention.
This is particularly useful when working with large datasets, as it allows you to read data in parallel without blocking other threads.

For example, you can use the ``concurrent.futures`` module to read data from multiple variables in parallel:

.. code-block:: python

import pyfive
from concurrent.futures import ThreadPoolExecutor

variable_names = ["var1", "var2", "var3"]

with pyfive.File("data.h5", "r") as f:

def get_min_of_variable(var_name):
dset = f[var_name]
data = dset[...] # Read the entire variable
return data.min()

with ThreadPoolExecutor() as executor:
results = list(executor.map(get_min_of_variable, variable_names))
Collaborator Author: This works! But I have not noticed any time improvements. It's very possible that, since I had to cut the size of the data because my RAM is not big enough, what I was loading would be as fast, if not faster, in a single-threaded process.

Collaborator: As discussed, I'm slightly concerned by your RAM problems. It's probably OK to leave this (contrived) example in the docs, as all our real examples are too complex to use as exemplars, but we should do some due diligence on why this is happening.

Collaborator Author: This works, as I said. The issues I have with the buffer could well be because I am running out of both RAM and actual disk, so those may be very user-specific; that's why I said I need to look a lot closer at it.

print("Results:", results)


You can do the same thing to parallelise manipulations within the variables, by, for example, using ``Dask``, but that is beyond the scope of this document.


Using pyfive with S3
--------------------

HDF5 was designed for use on POSIX file systems, where it makes sense to get specific ranges of bytes from files as they are needed.
For example, the extraction of a specific range of bytes from a variable with a statement like ``x = myvar[10:12]`` would require
first the calculation of where that selection of data (10:12) sits in storage, and then the extraction (and perhaps decompression)
of just the chunks of data needed to get that data. If the index needed to work out that location wasn't in memory, that would need to
be read first. In practice ``pyfive`` tries to preload the index, but the net effect of all these operations is a lot of
small reads from storage. Across a network, using S3, this would be prohibitive, so the ``s3fs`` middleware (used to make the remote
file, which for HDF5 will be stored as one object, look like it is on a file system) tries to make fewer reads and cache them in
memory so that repeated reads are more efficient. The optimal caching strategy depends on the file layout
and the expected access pattern, so ``s3fs`` provides a lot of flexibility in how to configure that caching.

For ``pyfive`` the three most important variables to consider altering are the
``default_block_size`` number, the ``default_cache_type`` option and the ``default_fill_cache`` boolean.

- **default_block_size**
This is the size (in bytes) of the blocks that ``s3fs`` will read in one transaction.
The bigger this is, the fewer reads that are undertaken, but the more memory and bandwidth are used.
The default is 50 MB, which is a poor choice for most HDF5 files where the metadata may be scattered across the files.
In practice, a value of a small number of MB could be a good compromise for files which have not been repacked to store the metadata contiguously and/or where the data access pattern will be small random chunks.

- **default_cache_type**
This is the type of caching that ``s3fs`` will use.
Details of the available options for S3 are formally in the `fsspec documentation <https://filesystem-spec.readthedocs.io/en/latest/api.html#read-buffering>`_.
Often the default of ``readahead`` is a good choice.

- **default_fill_cache**
This is a boolean which determines whether ``s3fs`` will persistently cache the data that it reads.
If set to ``True``, blocks are cached persistently in memory; setting it to ``False`` only makes sense in conjunction with a ``default_cache_type`` of ``readahead`` or ``bytes``, to support streaming access to the data.
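Putting these options together, a sketch of opening a remote file with ``pyfive`` over S3. The endpoint, bucket, object, and variable names are all placeholders, and anonymous access is assumed; running this requires network access to a real object store, so treat it as a configuration template rather than a runnable recipe:

```python
import pyfive
import s3fs

# All names here are illustrative; adjust for your own object store.
fs = s3fs.S3FileSystem(
    anon=True,
    client_kwargs={"endpoint_url": "https://s3.example.org"},
    default_block_size=2 * 1024 * 1024,  # 2 MB blocks rather than the default
    default_cache_type="readahead",
    default_fill_cache=False,            # stream rather than cache persistently
)

with fs.open("my-bucket/data.h5", "rb") as s3file:
    with pyfive.File(s3file) as f:
        temp = f["temp"]
        print(temp[0:10])
```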




9 changes: 0 additions & 9 deletions doc/quickstart/configuration.rst

This file was deleted.

5 changes: 2 additions & 3 deletions doc/quickstart/index.rst
@@ -5,6 +5,5 @@ Getting started
:maxdepth: 1

Installation <installation>
Configuration <configuration>
Running <running>
Output <output>
Usage <usage>

34 changes: 26 additions & 8 deletions doc/quickstart/installation.rst
@@ -4,8 +4,27 @@
Installation
************

Conda-mamba environment
-----------------------
Installation from conda-forge
-----------------------------

``pyfive`` is on conda-forge and can be installed with either ``conda`` or ``mamba`` (``mamba`` is now the
default solver for ``conda``, so you might as well just use ``conda``):

.. code-block:: bash

conda install -c conda-forge pyfive

Installation from PyPI
----------------------

``pyfive`` can be installed from PyPI:

.. code-block:: bash

pip install pyfive

Install from source: conda-mamba environment
--------------------------------------------

Use a Miniconda/Miniforge3 installer to create an environment using
our conda ``environment.yml`` file; download the latest Miniconda3 for Linux installer from
@@ -16,16 +35,15 @@ install it, then create and activate the Pyfive environment:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
(base) conda env create -n activestorage -f environment.yml
(base) conda activate activestorage
(base) conda env create -n pyfive -f environment.yml
(base) conda activate pyfive

.. note::

Our dependencies are all from ``conda-forge`` so there is no issue related
to the buggy (and paid-for) Anaconda main/defaults channel!
Our dependencies are all from conda-forge, ensuring a smooth and reliable installation process.

Installing Pyfive
--------------------------
Installing Pyfive from source
-----------------------------

Installation can then proceed with ``pip``, installing the ``all`` extra (i.e. the development and test dependencies):
7 changes: 0 additions & 7 deletions doc/quickstart/output.rst

This file was deleted.

7 changes: 0 additions & 7 deletions doc/quickstart/running.rst

This file was deleted.
