SST with MPI backend hangs on own machine with recent MPICH #3377

Closed
@franzpoeschel

Description

MPI-based SST streaming works (mostly) on Crusher, which is where we need it, but on other systems with plain MPICH the reader hangs. The writer only finishes cleanly once the reader is killed.

Describe the bug

Writer SstVerbose log:

Sst set to use sockets as a Control Transport
Considering DataPlane "evpath" for possible use, priority is 1
Considering DataPlane "mpi" for possible use, priority is 100
Selecting DataPlane "mpi", priority 100 for use
MpiInitWriter initialized addr=0x162ba70
Opening Stream "stream"
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsAllToAll
Param -   DataTransport=mpi
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Stream "stream" waiting for 1 readers
Beginning writer-side reader open protocol
MPI dataplane WriterPerReader to be initialized
Setting SpeculativePreload ON for new reader
My oldest timestep was 0, global oldest timestep was 0
Finish writer-side reader open protocol for reader 0x162b4d0, reader ready response pending
(PID 1924, TID 7f505db8e780) Waiting for Reader ready on WSR 0x162b4d0.
Reader Activate message received for Stream 0x162b4d0.  Setting state to Established.
Parent stream reader count is now 1.
Reader ready on WSR 0x162b4d0, Stream established, Starting 0 LastProvided 0.
Finish opening Stream "stream"
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 1
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Sent timestep 0 to reader cohort 0
ADDING timestep 0 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 0 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 0 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 1
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 2
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 1 (ref count 1), one to each reader
Sent timestep 1 to reader cohort 0
ADDING timestep 1 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 1 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 1 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 1
QueueMaintenance, smallest last released = -1, count = 2
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 1
QueueMaintenance, smallest last released = -1, count = 3
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 2 (ref count 1), one to each reader
Sent timestep 2 to reader cohort 0
ADDING timestep 2 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 2 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 2 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 2
QueueMaintenance, smallest last released = -1, count = 3
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 2
QueueMaintenance, smallest last released = -1, count = 4
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 3 (ref count 1), one to each reader
Sent timestep 3 to reader cohort 0
ADDING timestep 3 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 3 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 3 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 3
QueueMaintenance, smallest last released = -1, count = 4
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 3
QueueMaintenance, smallest last released = -1, count = 5
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 4 (ref count 1), one to each reader
Sent timestep 4 to reader cohort 0
ADDING timestep 4 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 4 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 4 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 4
QueueMaintenance, smallest last released = -1, count = 5
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 4
QueueMaintenance, smallest last released = -1, count = 6
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 5 (ref count 1), one to each reader
Sent timestep 5 to reader cohort 0
ADDING timestep 5 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 5 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 5 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 5
QueueMaintenance, smallest last released = -1, count = 6
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 5
QueueMaintenance, smallest last released = -1, count = 7
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 6 (ref count 1), one to each reader
Sent timestep 6 to reader cohort 0
ADDING timestep 6 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 6 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 6 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 6
QueueMaintenance, smallest last released = -1, count = 7
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 6
QueueMaintenance, smallest last released = -1, count = 8
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 7 (ref count 1), one to each reader
Sent timestep 7 to reader cohort 0
ADDING timestep 7 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 7 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 7 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 7
QueueMaintenance, smallest last released = -1, count = 8
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 7
QueueMaintenance, smallest last released = -1, count = 9
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 8 (ref count 1), one to each reader
Sent timestep 8 to reader cohort 0
ADDING timestep 8 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 8 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 8 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 8
QueueMaintenance, smallest last released = -1, count = 9
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 8
QueueMaintenance, smallest last released = -1, count = 10
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 9 (ref count 1), one to each reader
Sent timestep 9 to reader cohort 0
ADDING timestep 9 to sent list for reader cohort 0, READER 0x162b4d0, reference count is now 2
PRELOADMODE for timestep 9 non-default for reader , active at timestep 0, mode 1
Sending a message to reader 0 (0x241cec0)
SubRef : Writer-side Timestep 9 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 9
QueueMaintenance, smallest last released = -1, count = 10
Removing dead entries
QueueMaintenance complete
SstWriterClose, Sending Close at Timestep 9, one to each reader
Working on reader cohort 0
Sending a message to reader 0 (0x241cec0)
Reader 0 status Established has last released 4294967295, last sent 9
QueueMaintenance, smallest last released = -1, count = 10
Removing dead entries
QueueMaintenance complete
MpiReadRequestHandler:read request from reader=0,ts=0,off=136,len=40
MpiReadRequestHandler: Replying reader=0 with MPI port name=tag#0$connentry#02008D25C0A801650000000000000000$
Waiting for timesteps to be released in WriterClose
IN TS WAIT, ENTRIES are Timestep 9 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 8 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 7 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 6 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 5 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 4 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 3 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 2 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 1 (exp 0, Prec 0, Ref 1), Count now 10
IN TS WAIT, ENTRIES are Timestep 0 (exp 0, Prec 0, Ref 1), Count now 10
The timesteps still queued are: 9 8 7 6 5 4 3 2 1 0 
Reader Count is 1
Reader [0] status is Established
MpiReadRequestHandler: Accepted client, Link.CohortSize=1
Writer-side Rank received a connection-close event during normal operations, peer likely failed
In PeerFailCloseWSReader, releasing sent timesteps
Dereferencing all timesteps sent to reader 0x162b4d0
Checking on timestep 9
Reader sent timestep list 0x1636300, trying to release 9
Reader considering sent timestep 0,trying to release 9
Reader considering sent timestep 1,trying to release 9
Reader considering sent timestep 2,trying to release 9
Reader considering sent timestep 3,trying to release 9
Reader considering sent timestep 4,trying to release 9
Reader considering sent timestep 5,trying to release 9
Reader considering sent timestep 6,trying to release 9
Reader considering sent timestep 7,trying to release 9
Reader considering sent timestep 8,trying to release 9
Reader considering sent timestep 9,trying to release 9
SubRef : Writer-side Timestep 9 now has reference count 0, expired 0, precious 0
Checking on timestep 8
Reader sent timestep list 0x1636300, trying to release 8
Reader considering sent timestep 0,trying to release 8
Reader considering sent timestep 1,trying to release 8
Reader considering sent timestep 2,trying to release 8
Reader considering sent timestep 3,trying to release 8
Reader considering sent timestep 4,trying to release 8
Reader considering sent timestep 5,trying to release 8
Reader considering sent timestep 6,trying to release 8
Reader considering sent timestep 7,trying to release 8
Reader considering sent timestep 8,trying to release 8
SubRef : Writer-side Timestep 8 now has reference count 0, expired 0, precious 0
Checking on timestep 7
Reader sent timestep list 0x1636300, trying to release 7
Reader considering sent timestep 0,trying to release 7
Reader considering sent timestep 1,trying to release 7
Reader considering sent timestep 2,trying to release 7
Reader considering sent timestep 3,trying to release 7
Reader considering sent timestep 4,trying to release 7
Reader considering sent timestep 5,trying to release 7
Reader considering sent timestep 6,trying to release 7
Reader considering sent timestep 7,trying to release 7
SubRef : Writer-side Timestep 7 now has reference count 0, expired 0, precious 0
Checking on timestep 6
Reader sent timestep list 0x1636300, trying to release 6
Reader considering sent timestep 0,trying to release 6
Reader considering sent timestep 1,trying to release 6
Reader considering sent timestep 2,trying to release 6
Reader considering sent timestep 3,trying to release 6
Reader considering sent timestep 4,trying to release 6
Reader considering sent timestep 5,trying to release 6
Reader considering sent timestep 6,trying to release 6
SubRef : Writer-side Timestep 6 now has reference count 0, expired 0, precious 0
Checking on timestep 5
Reader sent timestep list 0x1636300, trying to release 5
Reader considering sent timestep 0,trying to release 5
Reader considering sent timestep 1,trying to release 5
Reader considering sent timestep 2,trying to release 5
Reader considering sent timestep 3,trying to release 5
Reader considering sent timestep 4,trying to release 5
Reader considering sent timestep 5,trying to release 5
SubRef : Writer-side Timestep 5 now has reference count 0, expired 0, precious 0
Checking on timestep 4
Reader sent timestep list 0x1636300, trying to release 4
Reader considering sent timestep 0,trying to release 4
Reader considering sent timestep 1,trying to release 4
Reader considering sent timestep 2,trying to release 4
Reader considering sent timestep 3,trying to release 4
Reader considering sent timestep 4,trying to release 4
SubRef : Writer-side Timestep 4 now has reference count 0, expired 0, precious 0
Checking on timestep 3
Reader sent timestep list 0x1636300, trying to release 3
Reader considering sent timestep 0,trying to release 3
Reader considering sent timestep 1,trying to release 3
Reader considering sent timestep 2,trying to release 3
Reader considering sent timestep 3,trying to release 3
SubRef : Writer-side Timestep 3 now has reference count 0, expired 0, precious 0
Checking on timestep 2
Reader sent timestep list 0x1636300, trying to release 2
Reader considering sent timestep 0,trying to release 2
Reader considering sent timestep 1,trying to release 2
Reader considering sent timestep 2,trying to release 2
SubRef : Writer-side Timestep 2 now has reference count 0, expired 0, precious 0
Checking on timestep 1
Reader sent timestep list 0x1636300, trying to release 1
Reader considering sent timestep 0,trying to release 1
Reader considering sent timestep 1,trying to release 1
SubRef : Writer-side Timestep 1 now has reference count 0, expired 0, precious 0
Checking on timestep 0
Reader sent timestep list 0x1636300, trying to release 0
Reader considering sent timestep 0,trying to release 0
SubRef : Writer-side Timestep 0 now has reference count 0, expired 0, precious 0
Waiting for timesteps to be released in WriterClose
IN TS WAIT, ENTRIES are Timestep 9 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 8 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 7 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 6 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 5 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 4 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 3 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 2 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 1 (exp 0, Prec 0, Ref 0), Count now 10
IN TS WAIT, ENTRIES are Timestep 0 (exp 0, Prec 0, Ref 0), Count now 10
The timesteps still queued are: 9 8 7 6 5 4 3 2 1 0 
Reader Count is 1
Reader [0] status is PeerFailed
DONE DEREFERENCING
Moving Reader stream 0x162b4d0 to status PeerFailed
Reader 0 status PeerFailed has last released 4294967295, last sent 9
QueueMaintenance, smallest last released = LONG_MAX, count = 10
Writer tagging timestep 9 as expired
Releasing timestep 9
Writer tagging timestep 8 as expired
Releasing timestep 8
Writer tagging timestep 7 as expired
Releasing timestep 7
Writer tagging timestep 6 as expired
Releasing timestep 6
Writer tagging timestep 5 as expired
Releasing timestep 5
Writer tagging timestep 4 as expired
Releasing timestep 4
Writer tagging timestep 3 as expired
Releasing timestep 3
Writer tagging timestep 2 as expired
Releasing timestep 2
Writer tagging timestep 1 as expired
Releasing timestep 1
Writer tagging timestep 0 as expired
Releasing timestep 0
Removing dead entries
Remove queue Entries removing Timestep 9 (exp 1, Prec 0, Ref 0), Count now 9
Remove queue Entries removing Timestep 8 (exp 1, Prec 0, Ref 0), Count now 8
Remove queue Entries removing Timestep 7 (exp 1, Prec 0, Ref 0), Count now 7
Remove queue Entries removing Timestep 6 (exp 1, Prec 0, Ref 0), Count now 6
Remove queue Entries removing Timestep 5 (exp 1, Prec 0, Ref 0), Count now 5
Remove queue Entries removing Timestep 4 (exp 1, Prec 0, Ref 0), Count now 4
Remove queue Entries removing Timestep 3 (exp 1, Prec 0, Ref 0), Count now 3
Remove queue Entries removing Timestep 2 (exp 1, Prec 0, Ref 0), Count now 2
Remove queue Entries removing Timestep 1 (exp 1, Prec 0, Ref 0), Count now 1
Remove queue Entries removing Timestep 0 (exp 1, Prec 0, Ref 0), Count now 0
QueueMaintenance complete
Reader 0 status PeerFailed has last released 4294967295, last sent 9
QueueMaintenance, smallest last released = LONG_MAX, count = 0
Removing dead entries
QueueMaintenance complete
Got an unexpected connection close event
Writer-side Rank received a connection-close event in unexpected state PeerFailed

Stream "stream" (0x15c7ec0) summary info:
	Duration (secs) = 1.88709
	Timesteps Created = 10
	Timesteps Delivered = 10

All timesteps are released in WriterClose
Destroying stream 0x15c7ec0, name stream
Reference count now zero, Destroying process SST info cache
Freeing LastCallList
SstStreamDestroy successful, returning
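
Every "Reader 0 status" line in the writer log reports `last released 4294967295`, while the QueueMaintenance line right after it prints `smallest last released = -1`. 4294967295 is 2^32 - 1, which is what -1 looks like when printed as an unsigned 32-bit value, so both lines presumably show the same "nothing released yet" sentinel: the reader never released a single timestep. A minimal sketch of that reading; the helper names are hypothetical and not part of ADIOS2:

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper (not an ADIOS2 function): map the unsigned value
// printed in the log back to a signed timestep, treating 2^32 - 1 as -1.
std::int64_t decode_last_released(std::uint32_t raw)
{
    return raw == std::numeric_limits<std::uint32_t>::max()
               ? std::int64_t{-1}
               : static_cast<std::int64_t>(raw);
}

// Hypothetical helper: turn a "last released" value into a readable status.
std::string describe_last_released(std::uint32_t raw)
{
    const std::int64_t step = decode_last_released(raw);
    if (step < 0)
    {
        return "no timestep released yet";
    }
    return "released up to timestep " + std::to_string(step);
}
```

With this reading, `describe_last_released(4294967295u)` yields "no timestep released yet", which matches the writer still holding all ten timesteps at close.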

Reader SstVerbose log:

Sst set to use sockets as a Control Transport
Looking for writer contact in file stream.sst, with timeout 60 secs
Waiting for writer DPResponse message in SstReadOpen("stream")
finished wait writer DPresponse message in read_open, WRITER is using "mpi" DataPlane
Prefered dataplane name is "mpi"
Considering DataPlane "evpath" for possible use, priority is 1
Considering DataPlane "mpi" for possible use, priority is 100
Selecting DataPlane "mpi" (preferred) for use
MPI dataplane reader initialized, reader rank 0
Waiting for writer response message in SstReadOpen("stream")
finished wait writer response message in read_open
Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsAllToAll
Param -   DataTransport=mpi
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=mpi
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer is doing BP-based marshalling
Writer is using Minimum Connection Communication pattern (min)
Sending Reader Activate messages to writer
Finish opening Stream "stream", starting with Step number 0
Wait for next metadata after last timestep -1
Waiting for metadata for a Timestep later than TS -1
(PID 192e, TID 7f5b7d4d9780) Stream status is Established
Received a Timestep metadata message for timestep 0, signaling condition
Received a Timestep metadata message for timestep 1, signaling condition
Received a Timestep metadata message for timestep 2, signaling condition
Received a Timestep metadata message for timestep 3, signaling condition
Received a Timestep metadata message for timestep 4, signaling condition
Received a Timestep metadata message for timestep 5, signaling condition
Examining metadata for Timestep 0
Returning metadata for Timestep 0
Setting TSmsg to Rootentry value
SstAdvanceStep returning Success on timestep 0
Received a Timestep metadata message for timestep 6, signaling condition
Received a Timestep metadata message for timestep 7, signaling condition
Received a Timestep metadata message for timestep 8, signaling condition
Received a Timestep metadata message for timestep 9, signaling condition
Received a writer close message. Timestep 9 was the final timestep.
Reader (rank 0) requesting to read remote memory for TimeStep 0 from Rank 0, StreamWPR =0x162a740, Offset=136, Length=40
ReadRemoteMemory: Send to server, Link.CohortSize=1
Waiting for completion of memory read to rank 0, condition 4,timestep=0, is_local=0
MpiReadReplyHandler: Read recv from rank=0,condition=4,size=40
MpiReadReplyHandler: Connecting to MPI Server

Note that the last line says "Connecting to MPI Server", but I think this call returns cleanly. The call that actually hangs is inside MpiWaitForCompletion, i.e. during data loading:

Reader backtrace:

#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555556494d0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55555564f480, cond=0x5555556494a8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5555556494a8, mutex=0x55555564f480) at pthread_cond_wait.c:647
#3  0x00007ffff29bf55d in INT_CMCondition_wait (cm=0x55555564f410, condition=4) at /home/franzpoeschel/git-repos/ADIOS2/thirdparty/EVPath/EVPath/cm_control.c:299
#4  0x00007ffff29cca44 in CMCondition_wait (cm=0x55555564f410, condition=4) at /home/franzpoeschel/singularity_build/ADIOS2_build/thirdparty/EVPath/EVPath/cm_interface.c:85
#5  0x00007ffff4ba044b in MpiWaitForCompletion (Svcs=0x7ffff4e9f2a0 <Svcs>, Handle_v=0x555555662780) at /home/franzpoeschel/git-repos/ADIOS2/source/adios2/toolkit/sst/dp/mpi_dp.c:604
#6  0x00007ffff4b80bf7 in SstWaitForCompletion (Stream=0x55555564dbd0, handle=0x555555662780) at /home/franzpoeschel/git-repos/ADIOS2/source/adios2/toolkit/sst/cp/cp_reader.c:2423
#7  0x00007ffff4aa918f in adios2::core::engine::SstReader::PerformGets (this=0x55555560b4c0) at /home/franzpoeschel/git-repos/ADIOS2/source/adios2/engine/sst/SstReader.cpp:787
#8  0x00007ffff4a9dcb7 in adios2::core::engine::SstReader::EndStep (this=0x55555560b4c0) at /home/franzpoeschel/git-repos/ADIOS2/source/adios2/engine/sst/SstReader.cpp:502
#9  0x00007ffff7df0ae3 in adios2::Engine::EndStep (this=0x7fffffff0b48) at /home/franzpoeschel/git-repos/ADIOS2/bindings/CXX11/adios2/cxx11/Engine.cpp:109
#10 0x0000555555559a82 in main ()
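
Frame #1 shows `abstime=0x0`, i.e. the reader sits in a condition wait without a deadline, so if condition 4 is never signaled the process blocks forever. The sketch below is not ADIOS2 or EVPath code; it only illustrates the same pattern in standard C++ and how a deadline would turn a silent hang into a reportable timeout. `Completion`, `signal` and `wait_for` are hypothetical names.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a completion handle; not an ADIOS2/EVPath type.
class Completion
{
public:
    // Called by whichever thread delivers the requested data.
    void signal()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_done = true;
        }
        m_cond.notify_all();
    }

    // Returns true if signal() happened before the deadline, false on timeout.
    // An untimed wait would instead block forever if signal() is never called.
    bool wait_for(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        return m_cond.wait_for(lock, timeout, [this] { return m_done; });
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    bool m_done = false;
};
```

A caller could then log which read request timed out instead of blocking, which would make it easier to tell a lost reply from a slow one.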

To Reproduce
Minimal stream writer and reader.

Writer:

#include <mpi.h>

#include <adios2.h>
#include <string>
#include <vector>

int main(int argsc, char **argsv)
{
    // `provided` is an output argument; pass a valid int rather than nullptr.
    int provided = 0;
    MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &provided);

    std::string engine_type = "sst";

    adios2::ADIOS adios{MPI_COMM_SELF};
    adios2::IO IO = adios.DeclareIO("IO");
    IO.SetEngine(engine_type);
    adios2::Engine engine = IO.Open("stream", adios2::Mode::Write);
    std::vector<int> v(10, 17);
    auto var = IO.DefineVariable<int>("var", {10}, {0}, {10});

    for (unsigned step = 0; step < 10; ++step)
    {
        engine.BeginStep();
        engine.Put(var, v.data());
        engine.EndStep();
    }
    engine.Close();
    MPI_Finalize();
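}

Reader:

#include <mpi.h>

#include <adios2.h>
#include <iostream>
#include <string>
#include <vector>

int main(int argsc, char **argsv)
{
    // `provided` is an output argument; pass a valid int rather than nullptr.
    int provided = 0;
    MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &provided);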

    std::string engine_type = "sst";

    adios2::ADIOS adios{MPI_COMM_SELF};
    adios2::IO IO = adios.DeclareIO("IO");
    IO.SetEngine(engine_type);
    adios2::Engine engine = IO.Open("stream", adios2::Mode::Read);

    std::vector<int> v(10);

    while (engine.BeginStep() == adios2::StepStatus::OK)
    {
        auto var = IO.InquireVariable<int>("var");
        var.SetSelection({{0}, {10}});
        engine.Get(var, v.data());
        engine.EndStep();
        std::cout << "In Step " << engine.CurrentStep() << ": ";
        for(auto val : v)
        {
            std::cout << val << ", ";
        }
        std::cout << std::endl;
    }
    engine.Close();

    MPI_Finalize();
}
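
Because the reader never returns, it can help to run the blocking part under a watchdog so that the hang shows up as a failure instead of a stuck process. A minimal sketch using only the standard library; `finishes_within` is a hypothetical helper, and wrapping the reader's `BeginStep`/`EndStep` loop in it is an assumption that is untested with ADIOS2.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical helper: run `work` on a detached thread and report whether it
// finished within `timeout`. The thread is detached because joining it (or
// destroying a std::async future) would itself block on a hung call.
bool finishes_within(std::function<void()> work,
                     std::chrono::milliseconds timeout)
{
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();
    std::thread([work = std::move(work), done]() {
        work();
        done->set_value(); // never reached if work() hangs
    }).detach();
    return finished.wait_for(timeout) == std::future_status::ready;
}
```

If the call returns false, the process can print a diagnostic and exit with a non-zero code, which is handy when the reproducer runs in CI.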

Expected behavior
The reader receives all 10 steps and exits without hanging.

Desktop:

  • I tested with MPICH 4.0.2 and 4.1a1
  • Recent ADIOS2 master branch, tag c9335fd
  • Noticed on openSUSE Leap 15.4 as well as on Ubuntu 20.04
  • Build: both builds used g++ 11; the error occurs with Debug and Release builds

Additional context
ping @pnorbert

Following up
Was the issue fixed? Please report back.
