HDF5 I/O optimizations #1129

Merged
merged 11 commits into from
Oct 27, 2021
6 changes: 6 additions & 0 deletions .rodare.json
@@ -108,6 +108,12 @@
"name": "Schnetter, Erik",
"orcid": "0000-0002-4518-9017",
"type": "Other"
},
{
"affiliation": "Lawrence Berkeley National Laboratory",
"name": "Bez, Jean Luca",
"orcid": "0000-0002-3915-1135",
"type": "Other"
}
],
"title": "C++ & Python API for Scientific I/O with openPMD",
2 changes: 2 additions & 0 deletions README.md
@@ -413,6 +413,8 @@ Further thanks go to improvements and contributions from:
Dask guidance & reviews
* [Erik Schnetter (PITP)](https://github.com/eschnett):
C++ API bug fixes
* [Jean Luca Bez (LBNL)](https://github.com/jeanbez):
HDF5 performance tuning

### Grants

50 changes: 38 additions & 12 deletions docs/source/backends/hdf5.rst
@@ -21,31 +21,57 @@ Backend-Specific Controls

The following environment variables control HDF5 I/O behavior at runtime.

===================================== ========= ===========================================================================================================
Environment variable                  Default   Description
===================================== ========= ===========================================================================================================
``OPENPMD_HDF5_INDEPENDENT``          ``ON``    Sets the MPI-parallel transfer mode to collective (``OFF``) or independent (``ON``).
``OPENPMD_HDF5_ALIGNMENT``            ``1``     Tuning parameter for parallel I/O: choose an alignment that is a multiple of the disk block size.
``OPENPMD_HDF5_CHUNKS``               ``auto``  Defaults for ``H5Pset_chunk``: ``"auto"`` (heuristic) or ``"none"`` (no chunking).
``H5_COLL_API_SANITY_CHECK``          unset     Debug: Set to ``1`` to perform an ``MPI_Barrier`` inside each meta-data operation.
``HDF5_USE_FILE_LOCKING``             ``TRUE``  Work-around: Set to ``FALSE`` in case you are on an HPC or network file system that hangs in open for reads.
``OMPI_MCA_io``                       unset     Work-around: Disable OpenMPI's I/O implementation for older releases by setting this to ``^ompio``.
===================================== ========= ===========================================================================================================
======================================== ============ ===========================================================================================================
Environment variable                     Default      Description
======================================== ============ ===========================================================================================================
``OPENPMD_HDF5_INDEPENDENT``             ``ON``       Sets the MPI-parallel transfer mode to collective (``OFF``) or independent (``ON``).
``OPENPMD_HDF5_ALIGNMENT``               ``1``        Tuning parameter for parallel I/O: choose an alignment that is a multiple of the disk block size.
``OPENPMD_HDF5_THRESHOLD``               ``0``        Tuning parameter for parallel I/O, where ``0`` aligns all requests and other values act as a threshold.
``OPENPMD_HDF5_CHUNKS``                  ``auto``     Defaults for ``H5Pset_chunk``: ``"auto"`` (heuristic) or ``"none"`` (no chunking).
``OPENPMD_HDF5_COLLECTIVE_METADATA``     ``ON``       Sets the MPI-parallel transfer mode for metadata operations to collective (``ON``) or independent (``OFF``).
``OPENPMD_HDF5_PAGED_ALLOCATION``        ``ON``       Tuning parameter for parallel I/O in HDF5 to enable paged allocation.
``OPENPMD_HDF5_PAGED_ALLOCATION_SIZE``   ``33554432`` Size of the page, in bytes, if HDF5 paged allocation optimization is enabled.
``OPENPMD_HDF5_DEFER_METADATA``          ``ON``       Tuning parameter for parallel I/O in HDF5 to enable deferred HDF5 metadata operations.
``OPENPMD_HDF5_DEFER_METADATA_SIZE``     ``33554432`` Size of the buffer, in bytes, if HDF5 deferred metadata optimization is enabled.
``H5_COLL_API_SANITY_CHECK``             unset        Debug: Set to ``1`` to perform an ``MPI_Barrier`` inside each meta-data operation.
``HDF5_USE_FILE_LOCKING``                ``TRUE``     Work-around: Set to ``FALSE`` in case you are on an HPC or network file system that hangs in open for reads.
``OMPI_MCA_io``                          unset        Work-around: Disable OpenMPI's I/O implementation for older releases by setting this to ``^ompio``.
======================================== ============ ===========================================================================================================

``OPENPMD_HDF5_INDEPENDENT``: by default, we implement MPI-parallel data ``storeChunk`` (write) and ``loadChunk`` (read) calls as `non-collective MPI operations <https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node87.htm#Node87>`_.
Attribute writes are always collective in parallel HDF5.
We default to non-collective (independent) mode for ease of use, but be advised that it may incur performance penalties, depending heavily on the use case.
For independent parallel I/O, consider a modern version of the MPICH implementation; in particular, prefer ROMIO over OpenMPI's ompio implementation.
Please refer to the `HDF5 manual, function H5Pset_dxpl_mpio <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_dxpl_mpio.htm>`_ for more details.
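
As a rough illustration of how such an ``ON``/``OFF`` switch can be read at runtime, the sketch below maps the variable to a transfer mode. The helper ``envFlag``, the enum ``TransferMode`` and the handling of unrecognized values are assumptions made for this example, not the openPMD-api implementation; the enum merely stands in for HDF5's ``H5FD_MPIO_INDEPENDENT`` and ``H5FD_MPIO_COLLECTIVE``.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of openPMD-api): interpret an ON/OFF
// environment variable, falling back to a default when it is unset.
bool envFlag(char const *name, bool defaultValue)
{
    assert(name != nullptr);
    char const *raw = std::getenv(name);
    if (raw == nullptr)
        return defaultValue;
    std::string const value(raw);
    if (value == "ON")
        return true;
    if (value == "OFF")
        return false;
    // Assumption for this sketch: unrecognized text keeps the default.
    return defaultValue;
}

// Illustrative stand-in for the HDF5 transfer modes
// H5FD_MPIO_INDEPENDENT and H5FD_MPIO_COLLECTIVE.
enum class TransferMode
{
    Independent,
    Collective
};

// Documented default is ON, which selects independent transfers.
TransferMode transferModeFromEnvironment()
{
    return envFlag("OPENPMD_HDF5_INDEPENDENT", true)
        ? TransferMode::Independent
        : TransferMode::Collective;
}
```

In a real handler, the selected mode would then be passed to ``H5Pset_dxpl_mpio`` on the dataset transfer property list.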

``OPENPMD_HDF5_ALIGNMENT`` This sets the alignment in Bytes for writes via the ``H5Pset_alignment`` function.
``OPENPMD_HDF5_ALIGNMENT``: this sets the alignment in Bytes for writes via the ``H5Pset_alignment`` function.
According to the `HDF5 documentation <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_alignment.htm>`_:
*For MPI IO and other parallel systems, choose an alignment which is a multiple of the disk block size.*
On Lustre filesystems, according to the `NERSC documentation <https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?start=5>`_, it is advised to set this to the Lustre stripe size.
In addition, ORNL Summit GPFS users are advised to set the alignment value to 16777216 (16 MiB).

``OPENPMD_HDF5_CHUNKS`` This sets defaults for data chunking via `H5Pset_chunk <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_chunk.htm>`__.
``OPENPMD_HDF5_THRESHOLD``: this sets the threshold for the alignment of HDF5 operations via the ``H5Pset_alignment`` function.
Setting it to ``0`` will force all requests to be aligned.
Any file object whose size is greater than or equal to the threshold (in bytes) will be aligned on an address which is a multiple of ``OPENPMD_HDF5_ALIGNMENT``.
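
The interplay of threshold and alignment can be modeled with a little arithmetic. The function below is only a sketch of the rule as stated above; ``placeObject`` and its parameters are hypothetical names for this example, and it does not reproduce HDF5's actual file-space allocator.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model of the alignment rule: objects of at least
// `threshold` bytes start at the next multiple of `alignment`;
// smaller objects are placed at the next free address unchanged.
std::uint64_t placeObject(
    std::uint64_t nextFree,
    std::uint64_t objectSize,
    std::uint64_t threshold,
    std::uint64_t alignment)
{
    assert(alignment > 0);
    if (objectSize < threshold)
        return nextFree; // below the threshold: no alignment applied
    std::uint64_t const remainder = nextFree % alignment;
    return remainder == 0 ? nextFree : nextFree + (alignment - remainder);
}
```

With a threshold of ``0`` the size comparison never excludes an object, so every request is aligned; with the default alignment of ``1`` every address is already a valid multiple and nothing moves.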

``OPENPMD_HDF5_CHUNKS``: this sets defaults for data chunking via `H5Pset_chunk <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_chunk.htm>`__.
Chunking generally improves performance and only needs to be disabled in corner-cases, e.g. when heavily relying on independent, parallel I/O that non-collectively declares data records.

``OPENPMD_HDF5_COLLECTIVE_METADATA``: this is an option to enable collective MPI calls for HDF5 metadata operations via `H5Pset_all_coll_metadata_ops <https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetAllCollMetadataOps>`__ and `H5Pset_coll_metadata_write <https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetCollMetadataWrite>`__.
By default, this optimization is enabled as it has proven to provide performance improvements.
This option is only available from HDF5 1.10.0 onwards. For earlier versions, it will fall back to independent MPI calls.

``OPENPMD_HDF5_PAGED_ALLOCATION``: this option enables paged allocation for HDF5 operations via `H5Pset_file_space_strategy <https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFileSpaceStrategy>`__.
The page size can be controlled by the ``OPENPMD_HDF5_PAGED_ALLOCATION_SIZE`` option.

``OPENPMD_HDF5_PAGED_ALLOCATION_SIZE``: this option configures the size of the page if ``OPENPMD_HDF5_PAGED_ALLOCATION`` optimization is enabled via `H5Pset_file_space_page_size <https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFileSpacePageSize>`__.
Values are expressed in bytes. The default is 33554432 bytes (32 MiB).

``OPENPMD_HDF5_DEFER_METADATA``: this option enables deferred HDF5 metadata operations.
The metadata buffer size can be controlled by the ``OPENPMD_HDF5_DEFER_METADATA_SIZE`` option.

``OPENPMD_HDF5_DEFER_METADATA_SIZE``: this option configures the size of the buffer if ``OPENPMD_HDF5_DEFER_METADATA`` optimization is enabled via `H5Pset_mdc_config <https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetMdcConfig>`__.
Values are expressed in bytes. The default is 33554432 bytes (32 MiB).
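
Both size options take a plain byte count. A minimal sketch of reading such a value, assuming that anything other than a non-negative decimal integer falls back to the default, could look as follows. ``envByteCount`` and ``defaultBufferBytes`` are hypothetical names for this example and the fallback policy is an assumption, not documented openPMD-api behavior.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>

// 32 MiB, the default quoted for the page size and the metadata buffer.
constexpr unsigned long long defaultBufferBytes = 33554432ULL;
static_assert(
    defaultBufferBytes == 32ULL * 1024ULL * 1024ULL,
    "33554432 bytes is 32 MiB");

// Hypothetical helper (not part of openPMD-api): read a byte count from
// an environment variable. Unset, empty, non-numeric or out-of-range
// values yield the fallback (an assumption made for this sketch).
unsigned long long envByteCount(char const *name, unsigned long long fallback)
{
    assert(name != nullptr);
    char const *raw = std::getenv(name);
    if (raw == nullptr)
        return fallback;
    std::string const text(raw);
    if (text.empty() ||
        text.find_first_not_of("0123456789") != std::string::npos)
        return fallback;
    errno = 0;
    unsigned long long const parsed =
        std::strtoull(text.c_str(), nullptr, 10);
    return errno == ERANGE ? fallback : parsed;
}
```

Note that under this sketch a value with a unit suffix such as ``32MB`` is rejected rather than interpreted, which matches the documented convention that values are given in bytes.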

``H5_COLL_API_SANITY_CHECK``: this is a HDF5 control option for debugging parallel I/O logic (API calls).
Debugging a parallel program with that option enabled can help to spot bugs such as collective MPI-calls that are not called by all participating MPI ranks.
Do not use in production: this will slow down parallel I/O operations.
3 changes: 3 additions & 0 deletions include/openPMD/IO/HDF5/HDF5IOHandlerImpl.hpp
@@ -70,6 +70,9 @@ namespace openPMD

hid_t m_datasetTransferProperty;
hid_t m_fileAccessProperty;
hid_t m_fileCreateProperty;

hbool_t m_hdf5_collective_metadata = 1;

// h5py compatible types for bool and complex
hid_t m_H5T_BOOL_ENUM;