Out-of-sequence (OOS) messages #2067

Closed
@bosilca

Short version: Out-of-sequence messages exist in a single-link scenario.

Cause: Intermediary buffering at different layers of the software stack allows messages to be delivered out of sequence.

Target: Most of the BTLs, with a particular emphasis on vader and IB.

Long version: This issue was raised during the discussion about the performance degradation seen between 1.8 and what will eventually become 3.x. While we identified the builtin atomics as the main culprit, it turns out that enabling multi-threading raised a set of additional issues that are not necessarily visible outside this particular usage.

Having multiple threads inject messages into the PML in the context of a single communicator leads to a significant number of out-of-sequence messages. The reason is that the per-peer sequence number is taken very early in the software stack (an optimization that makes sense for single-threaded scenarios). Thus, between the moment a thread acquires the sequence number and the moment its message is pushed into the network, there are many opportunities for another thread to bypass it and reach the network first. From the receiver's perspective this is seen as an out-of-sequence message, which is kept in linear structures and copied multiple times before it becomes in-sequence and can be delivered to the matching logic. There are multiple ways to mitigate this, but that discussion is outside the scope of this particular issue.
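To make the receiver-side cost concrete, here is a minimal sketch in C of the behavior described above. It is not Open MPI code: `peer_state_t`, `peer_receive`, and the fixed-size `pending` array are hypothetical names chosen for illustration, and the real PML structures differ. It only shows how an arrival whose sequence number is not the expected one gets parked in a linear list and is released once the gap is filled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64

/* Hypothetical per-peer receiver state; not Open MPI's actual structures. */
typedef struct {
    uint16_t expected;             /* next sequence number that may be delivered */
    uint16_t pending[MAX_PENDING]; /* linear list of parked out-of-sequence arrivals */
    size_t   npending;
    size_t   oos_count;            /* arrivals that had to be parked */
    size_t   delivered;            /* messages handed to the matching logic */
} peer_state_t;

static void peer_init(peer_state_t *p)
{
    p->expected = 0;
    p->npending = 0;
    p->oos_count = 0;
    p->delivered = 0;
}

/* Linear search for the expected sequence number among parked messages.
 * Returns 1 and removes the entry if found, 0 otherwise. */
static int take_pending(peer_state_t *p)
{
    for (size_t i = 0; i < p->npending; i++) {
        if (p->pending[i] == p->expected) {
            p->pending[i] = p->pending[--p->npending];
            return 1;
        }
    }
    return 0;
}

/* Handle one arrival carrying sequence number `seq`. */
static void peer_receive(peer_state_t *p, uint16_t seq)
{
    if (seq != p->expected) {
        /* Out of sequence: park it until the gap is filled. */
        assert(p->npending < MAX_PENDING);
        p->pending[p->npending++] = seq;
        p->oos_count++;
        return;
    }
    /* In sequence: deliver it, then drain anything it unblocked. */
    p->expected++;
    p->delivered++;
    while (take_pending(p)) {
        p->expected++;
        p->delivered++;
    }
}
```

For example, if thread A takes sequence number 0 and thread B takes 1, but B reaches the network first, the receiver sees `peer_receive(&p, 1)` followed by `peer_receive(&p, 0)`: the first call parks a message and delivers nothing, and the second delivers both.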

More worrisome is the fact that we observe out-of-sequence messages when using a single link and supposedly ordered BTLs, even when each thread uses a unique communicator. Logically, no out-of-sequence message should be seen in this case. At this point we assume that the immediate send optimization, without a proper implementation in the BTLs, allows messages to bypass other messages waiting in the PML/BTL queues.
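The suspected bypass can be illustrated with a small C model. This is only a sketch of the hypothesis stated above, not the actual BTL code: `link_t`, `send_queued`, `send_immediate`, `progress`, and the `respect_queue` flag are all made-up names. It assumes a fast path that writes straight to the link, and contrasts a version that ignores already queued fragments with one that checks the queue first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QCAP 16

/* Hypothetical single-link sender; none of these names come from Open MPI. */
typedef struct {
    uint16_t queue[QCAP]; /* fragments waiting for resources, in FIFO order */
    size_t   qlen;
    uint16_t wire[QCAP];  /* order in which fragments actually reach the link */
    size_t   wlen;
} link_t;

static void link_init(link_t *l)
{
    l->qlen = 0;
    l->wlen = 0;
}

static void wire_put(link_t *l, uint16_t seq)
{
    assert(l->wlen < QCAP);
    l->wire[l->wlen++] = seq;
}

/* Regular path when resources are unavailable: park the fragment. */
static void send_queued(link_t *l, uint16_t seq)
{
    assert(l->qlen < QCAP);
    l->queue[l->qlen++] = seq;
}

/* Fast path. With respect_queue == 0 the fragment goes straight to the
 * wire even if older fragments are still queued, which is the suspected
 * source of reordering. With respect_queue != 0 it falls back to the
 * queue whenever older fragments are waiting. */
static void send_immediate(link_t *l, uint16_t seq, int respect_queue)
{
    if (respect_queue && l->qlen > 0) {
        send_queued(l, seq);
        return;
    }
    wire_put(l, seq);
}

/* Progress loop: drain queued fragments in FIFO order. */
static void progress(link_t *l)
{
    for (size_t i = 0; i < l->qlen; i++) {
        wire_put(l, l->queue[i]);
    }
    l->qlen = 0;
}
```

In this model, queuing fragment 0 and then calling `send_immediate` for fragment 1 with `respect_queue == 0` puts 1 on the wire before 0, even though there is a single link and the link itself preserves order. With `respect_queue != 0` the order stays 0, 1.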
