
A problem about 2subs 1pub #119

Closed
xu3stones opened this issue Feb 22, 2019 · 5 comments

Comments

@xu3stones

Hi,
Recently I ran into a strange problem: when I run 2 subscriber HelloWorld apps and 1 publisher HelloWorld app, they cannot communicate (in the same environment, 1 sub and 1 pub works fine).

From my analysis, the SPDP step seems fine: each of the 3 apps can find the other 2.
The problem appears in the SEDP step:
In my test, sub1 starts first, then sub2, and the pub starts last.
Reading the log gives the following clue:
pub ---> sub1 (pub sends SEDP info to sub1 and sub1 goes into handle_SEDP)
pub ---> sub2 (pub sends SEDP info to sub2 and sub2 goes into handle_SEDP)
sub2 ---> sub1 (sub2 sends SEDP info to sub1 and sub1 goes into handle_SEDP)
but all the others are lost.

Apparently the app that starts first cannot send SEDP info to the ones that start later.
This confuses me; I'm not familiar with the WHC mechanism or the timeout mechanism.
Could you suggest some possible causes, to help me check what triggers this problem in my
test environment?

@xu3stones
Author

If I run 3 sub apps and 1 pub app, then I get:
pub->sub1
pub->sub2
pub->sub3
sub3->sub2
sub3->sub1
sub2->sub1
the others are lost.

The app that starts first seems unable to send SEDP info to the ones that start later.
This phenomenon really confuses me. There must be some problem in my test environment that triggers a WHC timeout or something else.

Could you give me some pointers?

@eboasson
Contributor

Hi @xu3stones,

It vaguely reminds me of a long-since fixed problem caused by improper use of the REUSEADDR option, with the result of a later process "stealing" ports — if I recall correctly that one had it the other way around, where you could only reach the process that was started later, but still, there is similarity. I just double checked (by looking at the code and by trying a few combinations of options on macOS) and I don't have this problem, so some additional factor has to be at play here.

I'll first try to summarize how the discovery works — you probably know all or most of it already, judging by your question, but if not it could be useful in understanding this — and then give some suggestions for things to look at.

The way discovery works is that on startup, each process starts sending SPDP messages to all configured addresses (by default the 239.255.0.1 multicast address and any unicast addresses you have listed as peers — loopback on Linux doesn't advertise multicast capability so it falls back to unicast with "localhost" as a peer). Any process receiving an SPDP message from a not-yet known participant adds that participant and responds (via unicast).
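
In case it is useful, here is a rough sketch of how you would set up that peer-based unicast discovery through the configuration. This is untested and written from memory: the element and attribute names are taken from the Cyclone DDS configuration documentation and may differ slightly between versions, and the file name and "localhost" peer are just placeholders:

# Sketch only: write a config that disables multicast, lists explicit unicast peers,
# and enables tracing, then point the HelloWorld processes at it via CYCLONEDDS_URI.
# Element/attribute names may differ slightly between Cyclone DDS versions.
cat > cdds.xml <<'EOF'
<CycloneDDS>
  <Domain>
    <General>
      <AllowMulticast>false</AllowMulticast>
    </General>
    <Discovery>
      <ParticipantIndex>auto</ParticipantIndex>
      <Peers>
        <Peer address="localhost"/>
      </Peers>
    </Discovery>
    <Tracing>
      <Verbosity>finest</Verbosity>
      <OutputFile>cdds.log</OutputFile>
    </Tracing>
  </Domain>
</CycloneDDS>
EOF
export CYCLONEDDS_URI="file://$PWD/cdds.xml"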

SPDP messages advertise addresses and indicate which discovery built-in readers/writers exist, so that both sides can match these built-in readers/writers immediately, without any further traffic, and use these to do the SEDP phase. These readers and writers are transient-local ones, and the way that transient-local works is that the readers need to request the historical data (describing any existing endpoints) by requesting a retransmit in exactly the same manner as they do to recover from packet loss.

To know what retransmits to request, a reader needs a Heartbeat message from a writer that tells the reader which sequence numbers exist. So, on matching a new reader/writer pair, in Cyclone the reader starts sending out "no-op" AckNack messages (ack 0 samples, request retransmit of 0 samples) that do, however, request a response in the form of a Heartbeat; and the writer starts out sending Heartbeats periodically. Some flags will toggle based on what they receive so it quiets down once all is well. The result is that the reader learns what data to request, regardless of how the connection came into existence — including cases where you had a temporary disconnection in one direction only.

The Heartbeats then trigger the "real" AckNack messages requesting retransmits, those then trigger retransmits and further Heartbeats and the cycle repeats until everything has been received — here you are dealing with so little data that I would expect it to be handled in one round. The addressing of retransmits in Cyclone is a complicated affair, because in the default behaviour it can be unicast or multicast, depending on the details. In the case of recovery of historical data by a new reader, the retransmits will typically be sent unicast.

In all this, the WHC merely stores the data that is available for retransmit, and there is no reason to suspect that that one would be involved.

So, with all that said, it seems to fit a pattern where multicast works fine, but unicast does not. Each process should have its own unique port numbers for unicasting, and if they don't (the REUSEADDR issue from the very beginning of this message) I wouldn't be surprised to see misbehaviour like what you describe.

From your message I gather you turned on the tracing, which is exactly the right thing to do, but it is notoriously hard to read … Still a few things are easily extracted:

  • At the beginning of the trace (following the config settings), it gives port numbers and some mode settings. If there is overlap in port numbers for sockets used for receiving unicast, there's a problem;
  • Lines matching /SPDP.*NEW/ are triggered by the discovery of a new remote participant, and those list the IP:PORT pairs associated with the discovered participant;
  • The first 3 words of the participant GUID are also used as the first 3 words of the endpoint GUIDs for that participant, so it is easy to see if there are HEARTBEAT and ACKNACK (in capitals) lines originating from that participant (so if the participant GUID is 123:456:def:1c1, search for /(HEARTBEAT|ACKNACK).*123:456:def:/). That could potentially include receiving your own looped-back multicast packets, but as these lines have two GUIDs in the form "source -> destination", that's not so hard to deal with;
  • Finally, anything transmitted by a process ends up in a line starting with "nn_xpack_send", and with the set of IP addresses at the end of that line between '[' and ']'. Searching for /\[ [0-9]/ is generally a good strategy to see where things are being sent to (see the example commands below).
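
To make that a bit more concrete, these searches look roughly like the following grep commands against the trace file (the file name and the 123:456:def GUID prefix are just placeholders):

# Port numbers assigned to each process's unicast sockets; overlapping ports
# between processes would indicate a problem.
grep 'ddsi_udp_create_conn' cdds.log

# New remote participants and the IP:PORT pairs they advertise.
grep -E 'SPDP.*NEW' cdds.log

# HEARTBEAT/ACKNACK lines originating from a given participant, using the
# first 3 words of its GUID (123:456:def is only an example).
grep -E '(HEARTBEAT|ACKNACK).*123:456:def:' cdds.log

# Everything actually transmitted, with the destination addresses between '[' and ']'.
grep 'nn_xpack_send' cdds.log
grep -E '\[ [0-9]' cdds.log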

Of course you can use Wireshark instead, but while Wireshark makes it really easy to decode packets, it doesn't make it at all easy to understand how the packets relate to each other, nor why some are sent. So I pretty much rely on the traces.

Perhaps this will help you a bit in understanding what is going on. If you would like me to, I'd be happy to have a look, too. This is the kind of problem that must be understood and dealt with quickly; I don't want people new to Cyclone to be immediately disappointed …

@xu3stones
Author

xu3stones commented Feb 23, 2019

Hi @eboasson,

Thanks for your explanation. After enabling the trace log, I still cannot find the reason.
Could you spend a little time helping me read my log and give me some pointers?
Here is my trace log (you may need to paste the URL into your browser manually):
https://github.com/xu3stones/cyclonedds/blob/master/1pub2sub_trace1.txt

@eboasson
Contributor

I have spotted a few things that I think have something to do with it — but there's a bit of guesswork involved: the first guess is that you're using some RTOS or similar 👍🏻, and I think that may be part of the explanation.

As expected, each process creates a pair of sockets:

[U]8.578931/    (anon): p70  ddsi_udp_create_conn unicast socket 128 port 7410
[U]8.599279/    (anon): p70  ddsi_udp_create_conn unicast socket 129 port 7411
[U]9.999012/    (anon): p76  ddsi_udp_create_conn unicast socket 132 port 7412
[U]10.028503/    (anon): p77  ddsi_udp_create_conn unicast socket 133 port 7414
[U]10.047328/    (anon): p76  ddsi_udp_create_conn unicast socket 134 port 7413
[U]10.057797/    (anon): p77  ddsi_udp_create_conn unicast socket 135 port 7415

By the spec, the even one is for discovery traffic, and the odd one for data. Cyclone pretty much couldn't care less about that distinction as it really just looks at the contents, but it does try to play by the rules. So what you get is that the SPDP messages are sent to the even port numbers (there's add_peer_addresses where it adds them to the discovery address set during startup, and there is the tev thread that sends them; look for things like handle_xevk_spdp and the nn_xpack_send line (for a 236-byte packet in this case) that comes not much later, followed by a bunch of IP addresses spread over multiple lines). That's the pinging of discovery data.

The curious thing here is that these SPDP messages should list the even port number for the discovery data and the odd one for the application data. However, when it traces the creation of the participant, the "meta" and "data" addresses are identical. I don't quite see how that can be the case if there are two sockets. It isn't that I'd expect it to break things — as mentioned, it doesn't care about port numbers as long as it receives the data — but it is odd, and I wonder whether that may somehow be related to the platform you're using.

The other thing is that in the default configuration, it creates multiple threads for accepting incoming data: recv, recvMC and recvUC. The first is the catch-all for the ones not handled by other threads, the second is used for incoming multicast data traffic (so typically the socket bound to port 7401), and the third for incoming unicast data traffic (so typically the socket bound to a random port number if multicast works, or 7411/7413/7415/… if it doesn't). The reason it uses multiple threads is to reduce the time spent in the select call (or equivalent function) when multiplexing many sockets in a single thread.

In your case, there is no multicast and recvMC doesn't exist. There are simply two sockets: in p70, 7410 and 7411; in p76, 7412 and 7413, &c. — I'll just use the p70 ports for convenience, but it is essentially the same for the others. Therefore recv handles just traffic arriving on port 7410 (but using select, or equivalent), and recvUC handles just the traffic on port 7411. (By the way, you can configure it to handle all incoming data on the same recv thread if you'd like to have fewer threads.)
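
As a sketch of that last point, assuming the setting is Internal/MultipleReceiveThreads (please check the configuration documentation for your version; I'm writing this from memory and the exact name or location may differ), it would look something like:

# Sketch only: route all incoming traffic through the single recv thread.
# Assumes an Internal/MultipleReceiveThreads setting; verify against your version's config docs.
cat > cdds-single-recv.xml <<'EOF'
<CycloneDDS>
  <Domain>
    <Internal>
      <MultipleReceiveThreads>false</MultipleReceiveThreads>
    </Internal>
  </Domain>
</CycloneDDS>
EOF
export CYCLONEDDS_URI="file://$PWD/cdds-single-recv.xml"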

When you look at when which thread receives data, you'll see that at first it receives exclusively on the recv thread. This makes sense: nothing is sent to a process' unicast sockets until after it has been discovered by the others, and the SPDP ones are sent to the even-numbered ports. But after that, it only receives data on recvUC — and yet, there is data being sent to the even numbered ports later on as well (it periodically sends SPDP messages, you can find the tev: lines where it sends them). So how come those don't show up anymore? (Note: you do see SPDP messages arriving on recvUC, but that is because of the odd port numbers in the SPDP message causing them also to be sent to those odd ports.)

So it looks a bit like the recv thread receives nothing once the recvUC thread starts receiving things. You might want to investigate that a bit further; I suspect that's where the answer to the riddle lies. Finally, there might be an innocent explanation why it is as if p76 and p77 can't do anything at the same time, but that, too, is a little strange.

In short … no smoking gun, but a few things worth looking into in more detail. Hope this helps you a bit in the investigation.

@eboasson
Contributor

eboasson commented May 1, 2019

Hopefully the problem went away ...

@eboasson eboasson closed this as completed May 1, 2019