## Background information

### What version of Open MPI are you using?
3.1.3
Update: this affects the 2.x and 3.x release series as well as 4.0.0.
### Describe how Open MPI was installed

From source via Spack, or as modules built from source on various HPC systems.
### Please describe the system on which you are running
- Operating system/version: Debian 9.6 and derivatives such as Ubuntu
- Computer hardware: laptops and HPC systems
- Network type: local and remote (Ethernet and InfiniBand)
## Details of the problem
We are reporting two parallel HDF5 issues: one crashes on write and the other corrupts data. Both occur only with Open MPI; they do not reproduce with MPICH 3.3, which we used for comparison.
We (@ax3l, @psychocoderHPC) are not the upstream authors of HDF5, but we want to make you aware of these issues and connect the projects, since they might be rooted somewhere in the Open MPI I/O layer.
Links to the HDF5 parallel I/O write issues with reproducers:

- Crash with some special-size struct types: https://forum.hdfgroup.org/t/cannot-write-more-than-512-mb-in-1d/5118
  -> addressed by "common/ompio: fix a floating point division problem" #6286 and "common/ompio: possible rounding issue" #6287 (a specific Open MPI I/O issue); see the sketch after this list.
- Data corruption: https://forum.hdfgroup.org/t/hdf5bug-h5fd-mpio-collective-chunking/5279
  -> addressed by "opal/datatype: fix opal_convertor_raw()" #6295, "Provide a better fix for #6285." #6326, and "Improve LOOP datatype exchanges performances for sizes above the eage…" #6172 (a general Open MPI datatype issue)
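
For orientation, here is a minimal sketch of the kind of collective 1D parallel HDF5 write that the first thread exercises. It is not the exact reproducer from the forum post (that one involves compound/struct datatypes); the file name, dataset name, and per-rank size below are illustrative assumptions, chosen only so that each rank pushes more than 512 MB through the MPI-IO layer:

```c
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Illustrative size: ~600 MB of doubles per rank, past the 512 MB mark.
     * Each rank also needs this much host memory for the buffer. */
    const hsize_t local_n  = (hsize_t)600 * 1024 * 1024 / sizeof(double);
    const hsize_t global_n = local_n * (hsize_t)nranks;
    const hsize_t offset   = local_n * (hsize_t)rank;

    double *buf = malloc(local_n * sizeof(double));
    for (hsize_t i = 0; i < local_n; ++i)
        buf[i] = (double)(offset + i);

    /* Open the file through the MPI-IO virtual file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("big1d.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One global 1D dataset; each rank owns a contiguous hyperslab. */
    hid_t filespace = H5Screate_simple(1, &global_n, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t memspace = H5Screate_simple(1, &local_n, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                        &local_n, NULL);

    /* Collective write: the code path that goes through OMPIO. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Dclose(dset);
    H5Sclose(filespace);
    H5Fclose(file);
    H5Pclose(fapl);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Compile with the parallel HDF5 wrapper and run under MPI, e.g. `h5pcc repro.c -o repro && mpirun -np 2 ./repro`. Whether this simplified pattern triggers the crash depends on the datatype details discussed in the linked thread, so please use the original reproducers there to verify any fixes.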