mca_io_ompio_file_write_at_all() failed during parallel write through the PnetCDF library #10297

Open
@dqwu

Description

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed via Spack.

Please describe the system on which you are running

  • Operating system/version: CentOS 8
  • Computer hardware: AMD Epyc 7532 processors (32 cores per CPU, 2.4 GHz)
  • Network type: N.A.

Details of the problem

This issue occurs on a machine used by E3SM (e3sm.org):
https://e3sm.org/model/running-e3sm/supported-machines/chrysalis-anl

Modules used: gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7

The test that reproduces this issue was run with 512 MPI tasks on 8 nodes (64 tasks per node). The issue is also reproducible with modules built with the Intel compiler.

With the same test, the issue is not reproducible with Intel MPI on the same machine.
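
For context, the failing path goes through PnetCDF's nonblocking interface: the queued puts are flushed by ncmpi_wait_all(), which ends up in MPI_File_write_at_all() on the OMPIO side, as the backtrace below shows. The following is a minimal sketch of that call pattern, not the actual pioperformance_rearr test; the file name, variable layout, and sizes are placeholders, and error checks are omitted for brevity:

```c
/* Minimal sketch of the PnetCDF call path that fails (placeholder sizes,
 * not the actual pioperformance_rearr test): queue a nonblocking put,
 * then flush it collectively with ncmpi_wait_all(). */
#include <stdlib.h>
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid, varid, req, status;
    MPI_Offset start[1], count[1];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* One contiguous slab of doubles per rank (placeholder size). */
    const MPI_Offset nelems = 65536;
    double *buf = malloc(nelems * sizeof(double));
    for (MPI_Offset i = 0; i < nelems; i++) buf[i] = (double)rank;

    ncmpi_create(MPI_COMM_WORLD, "testfile.nc", NC_CLOBBER | NC_64BIT_DATA,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "n", nelems * nprocs, &dimid);
    ncmpi_def_var(ncid, "var", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    start[0] = rank * nelems;
    count[0] = nelems;
    ncmpi_iput_vara_double(ncid, varid, start, count, buf, &req);

    /* Collective flush: this is where the backtrace enters
     * PMPI_File_write_at_all() and the UCX errors appear. */
    ncmpi_wait_all(ncid, 1, &req, &status);

    ncmpi_close(ncid);
    free(buf);
    MPI_Finalize();
    return 0;
}
```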

Error messages and backtrace in the output log file:

[1650397594.556260] [chr-0496:1587709:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=1587708 length=524288) returned -1: Bad address
[1650397596.163612] [chr-0500:609796:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=609795 length=524288) returned -1: No such process
...
[1650397606.028410] [chr-0496:1587764:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=1587763 length=524288) returned -1: No such process
srun: error: chr-0494: tasks 0,9,11,25: Killed
...
[1650397612.452233] [chr-0500:609772:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=609771 length=333312) returned -1: No such process
==== backtrace (tid:3089585) ====
 0 0x0000000000055969 ucs_debug_print_backtrace()  ???:0
 1 0x00000000000200a9 uct_ib_mlx5_completion_with_err()  ???:0
 2 0x0000000000042972 uct_dc_mlx5_ep_handle_failure()  ???:0
 3 0x0000000000024c1a ucp_worker_progress()  ???:0
 4 0x0000000000232b77 mca_pml_ucx_send_nbr()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:923
 5 0x0000000000232b77 mca_pml_ucx_send_nbr()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:923
 6 0x0000000000232b77 mca_pml_ucx_send()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:944
 7 0x00000000000d7c32 ompi_coll_base_sendrecv_actual()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/base/coll_base_util.c:58
 8 0x00000000000d707b ompi_coll_base_sendrecv()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/base/coll_base_util.h:133
 9 0x000000000010ced0 ompi_coll_tuned_allgatherv_intra_dec_fixed()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:1363
10 0x000000000016697a mca_fcoll_vulcan_file_write_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/fcoll/vulcan/fcoll_vulcan_file_write_all.c:418
11 0x00000000000c2b39 mca_common_ompio_file_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/common/ompio/common_ompio_file_write.c:452
12 0x00000000001aff57 mca_io_ompio_file_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/io/ompio/io_ompio_file_write.c:174
13 0x00000000000aaaae PMPI_File_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mpi/c/profile/pfile_write_at_all.c:75
14 0x000000000016fbc2 ncmpio_read_write()  ???:0
15 0x000000000016a7f6 mgetput()  ncmpio_wait.c:0
16 0x00000000001682bc req_aggregation()  ncmpio_wait.c:0
17 0x0000000000169e40 wait_getput()  ncmpio_wait.c:0
18 0x00000000001661a4 req_commit()  ncmpio_wait.c:0
19 0x0000000000166a0c ncmpio_wait()  ???:0
20 0x00000000000b727a ncmpi_wait_all()  ???:0
21 0x000000000046c5f9 flush_output_buffer()  ???:0
22 0x000000000042dc5e sync_file()  pio_file.c:0
23 0x000000000042df88 PIOc_closefile()  ???:0
24 0x0000000000414355 __piolib_mod_MOD_closefile()  ???:0
25 0x000000000040dd1c pioperformance_rearrtest.4019()  pioperformance_rearr.F90:0
26 0x000000000040ad15 MAIN__()  pioperformance_rearr.F90:0
27 0x0000000000411b49 main()  ???:0
28 0x00000000000237b3 __libc_start_main()  ???:0
29 0x000000000040a48e _start()  ???:0
=================================

A lock file was generated in addition to the expected output .nc file:

pioperf-rearr-2-ncomptasks-00512-niotasks-00512-stride-00001-iotype-1-nframes-00001-nvars-00500.nc
pioperf-rearr-2-ncomptasks-00512-niotasks-00512-stride-00001-iotype-1-nframes-00001-nvars-00500.nc-2399076352-875649.lock
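
In case it helps triage, the same OMPIO collective write path can be exercised without PnetCDF/PIO by calling MPI_File_write_at_all() directly. Below is a minimal sketch (the file name and buffer size are arbitrary placeholders); whether it triggers the same UCX CMA errors at this scale has not been verified:

```c
/* Minimal sketch that hits mca_io_ompio_file_write_at_all() directly,
 * without PnetCDF/PIO (file name and sizes are placeholders). */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nelems = 65536;               /* doubles per rank */
    double *buf = malloc(nelems * sizeof(double));
    for (int i = 0; i < nelems; i++) buf[i] = (double)rank;

    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its slab at a disjoint offset; this is the
     * collective entry point seen at frame 13 of the backtrace. */
    MPI_Offset offset = (MPI_Offset)rank * nelems * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, nelems, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```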
