Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ros2 topic info -v prints incorrect QoS info for macOS + CycloneDDS #525

Closed
prajakta-gokhale opened this issue Jun 3, 2020 · 38 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@prajakta-gokhale
Copy link

Bug report

Required Info:

  • Operating System:
    • macOS 10.14 Mojave
  • Installation type:
  • Version or commit hash:
    • Foxy prerelease
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • N/A

Steps to reproduce issue

ros2 topic info -v does not print correct QoS information for a topic.

~ ros2 topic pub /talker --qos-durability volatile std_msgs/String "data: Hello World volatile"~ ros2 topic info -v /talker
Type: std_msgs/msg/String

Publisher count: 1

Node name: _CREATED_BY_BARE_DDS_APP_
Node namespace: _CREATED_BY_BARE_DDS_APP_
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: a3.6b.10.01.32.8a.d9.a4.bb.0c.3b.46.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL
  Lifespan: 2147483651294967295 nanoseconds
  Deadline: 2147483651294967295 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 2147483651294967295 nanoseconds

Subscription count: 0

Expected behavior

Durability value printed should be:
Durability: RMW_QOS_POLICY_DURABILITY_VOLATILE

Actual behavior

Durability value printed is:
Durability: RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL

Additional information

These commands appear to work as expected on Linux and Windows for CycloneDDS.

@ahcorde
Copy link
Contributor

ahcorde commented Jun 23, 2020

I have runned this with the Fat binary from https://ci.ros2.org/view/packaging/ and compiling from sources and in both cases I can see the right durability.

  Durability: RMW_QOS_POLICY_DURABILITY_VOLATILE

can you double checked this?

@sloretz
Copy link
Contributor

sloretz commented Jun 25, 2020

@prajakta-gokhale It looks like this was reported in a Foxy prerelease. Do you still see the issue now that Foxy is released? It looks like @ahcorde doesn't see it.

@prajakta-gokhale
Copy link
Author

Thanks for looking! I'll try again and provide an update.

@fujitatomoya
Copy link
Collaborator

I got the same issue on Ubuntu with source build ros2/ros2@0ee1ca2, only rmw_cyclonedds can reproduce this issue.

  • rmw_fastrtps_cpp
# ros2 topic pub /talker --qos-durability volatile std_msgs/String "data: Hello World volatile"
publisher: beginning loop
publishing #1: std_msgs.msg.String(data='Hello World volatile')
...<snip>

# ros2 topic info -v /talker
Type: std_msgs/msg/String

Publisher count: 1

Node name: _ros2cli_140
Node namespace: /
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: 01.0f.eb.7d.8c.00.00.00.01.00.00.00.00.00.05.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_VOLATILE
  Lifespan: 2147483651294967295 nanoseconds
  Deadline: 2147483651294967295 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 2147483651294967295 nanoseconds

Subscription count: 0

  • rmw_cyclonedds_cpp
# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
# ros2 topic pub /talker --qos-durability volatile std_msgs/String "data: Hello World volatile"
publisher: beginning loop
publishing #1: std_msgs.msg.String(data='Hello World volatile')
...<snip>

# ros2 topic info -v /talker
Type: std_msgs/msg/String

Publisher count: 1

Node name: _CREATED_BY_BARE_DDS_APP_
Node namespace: _CREATED_BY_BARE_DDS_APP_
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: ee.68.10.01.32.dd.ec.d6.7f.cb.d7.e6.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL
  Lifespan: 2147483651294967295 nanoseconds
  Deadline: 2147483651294967295 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 2147483651294967295 nanoseconds

Subscription count: 0

@ivanpauno
Copy link
Member

Closing!
@prajakta-gokhale if this is still a bug, let me know and I will reopen.

@ivanpauno ivanpauno reopened this Aug 27, 2020
@ivanpauno
Copy link
Member

ivanpauno commented Aug 27, 2020

I didn't see this comment, so this seem to still be a bug.
@ahcorde did you try with cyclonedds?

@ivanpauno ivanpauno added the bug Something isn't working label Aug 27, 2020
@fujitatomoya
Copy link
Collaborator

As far as i see, this still happens.

And when we use cyclonedds rmw_publisher_get_actual_qos says RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL after setting RMW_QOS_POLICY_DURABILITY_VOLATILE via rmw_create_publisher.

@eboasson friendly ping, if you could share your thought on this?

@eboasson
Copy link

eboasson commented Sep 4, 2020

@fujitatomoya that's rather interesting, which in this case means it doesn't ring any bell and I don't see this problem when I try it ("Debug" source build on macOS, mid august). It could be some missing initialization somewhere that acts up in some build types and configurations and not in others ...

In any case, I don't quite see how:

https://github.com/ros2/rmw_cyclonedds/blob/a07553e620eb92321949d7951b08161a65559302/rmw_cyclonedds_cpp/src/rmw_node.cpp#L1631
https://github.com/ros2/rmw_cyclonedds/blob/a07553e620eb92321949d7951b08161a65559302/rmw_cyclonedds_cpp/src/rmw_node.cpp#L1758

could get the conversion wrong. The functions to create publishers and subscriptions, as well as those to retrieve the actual QoS settings, rely on them.

Given the absence of an obvious culprit, to me it would make the most sense to look at the DDS level directly to learn a bit more about where the origin of the problem is. A Cyclone trace can be used to for this (there are other ways, but I find the trace files often the easiest — that may be a deformation 🤷). Would you be able to give this a try?

# ros2 daemon stop
# CYCLONEDDS_URI='<Tr><V>fine</><Out>cdds.log.${CYCLONEDDS_PID}</>'
# ros2 topic pub /talker --qos-durability volatile std_msgs/String "data: Hello World volatile" & sleep 2 ; ros2 topic info -v /talker ; kill % ; wait
<output snipped>
# grep -E '(SEDP|WRITER).*topic_name="rt/talker"' cdds.log.*

That should give you 3 lines of output, one for each process (the daemon is the 3rd). In my case, where everything works fine, they all show durability=0. If the writer really is transient-local, it'll say durability=1. If there's a disagreement between the three lines, there's an even more worrisome problem. E.g.:

cdds.log.3549:1599209872.469863 [0]    1036121: WRITER 15291001:2f38f033:2da8884e:803 QOS={user_data=0<>,topic_name="rt/talker",type_name="std_msgs::msg::dds_::String_",topic_data=0<>,group_data=0<>,durability=0,durability_service=0:0:1:-1:-1:-1,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=1:9223372036854775807,lifespan=9223372036854775807,destination_order=0,history=0:1,resource_limits=-1:-1:-1,ownership=0,ownership_strength=0,presentation=0:0:0,partition={},transport_priority=0,adlink_entity_factory=1,adlink_writer_data_lifecycle=0}
cdds.log.3553:1599209873.942993 [0] dq.builtin: SEDP ST0 15291001:2f38f033:2da8884e:803 reliable volatile writer: (default).rt/talker/std_msgs::msg::dds_::String_ p(open) NEW (as udp/239.255.0.1:7401 udp/127.0.0.1:52592 ssm=0) QOS={user_data=0<>,topic_name="rt/talker",type_name="std_msgs::msg::dds_::String_",topic_data=0<>,group_data=0<>,durability=0,durability_service=0:0:1:-1:-1:-1,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=1:9223372036854775807,lifespan=9223372036854775807,destination_order=0,history=0:1,resource_limits=-1:-1:-1,ownership=0,ownership_strength=0,presentation=0:0:0,partition={},transport_priority=0,adlink_entity_factory=1,adlink_writer_data_lifecycle=0}
cdds.log.3555:1599209874.185813 [0] dq.builtin: SEDP ST0 15291001:2f38f033:2da8884e:803 reliable volatile writer: (default).rt/talker/std_msgs::msg::dds_::String_ p(open) NEW (as udp/239.255.0.1:7401 udp/127.0.0.1:52592 ssm=0) QOS={user_data=0<>,topic_name="rt/talker",type_name="std_msgs::msg::dds_::String_",topic_data=0<>,group_data=0<>,durability=0,durability_service=0:0:1:-1:-1:-1,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=1:9223372036854775807,lifespan=9223372036854775807,destination_order=0,history=0:1,resource_limits=-1:-1:-1,ownership=0,ownership_strength=0,presentation=0:0:0,partition={},transport_priority=0,adlink_entity_factory=1,adlink_writer_data_lifecycle=0}

I'm thinking this should at least tell us whether we need to be looking at the creation of the publisher or at the retrieving of the actual QoS.

@fujitatomoya
Copy link
Collaborator

@eboasson

appreciate for your help!!! yes it is weird that some people say it works okay but some don't. maybe i miss something here... 😭 anyway i have reproduction environment, so i will try to get trace information by procedure you provided.

@fujitatomoya
Copy link
Collaborator

Reproducible Procedure

  1. start daemon with fastdds
# export RMW_IMPLEMENTATION=
# ros2 daemon stop
# ros2 daemon start -d
  1. start publication via ros2cli with cyclonedds
# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
# ros2 topic pub /talker --qos-durability volatile std_msgs/String "data: Hello World volatile"
<snip...>
  1. check topic information
# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
# ros2 topic info -v /talker
<snip...>
  • Note
    • using cyclonedds for ros2daemon, and topic pub&sub with fastdds, it DOES NOT have the issue.
    • I think that user is allowed to use different implementation from ros2daemon as long as it is dds, so it should be working.

could someone try to confirm above procedure to reproduce the issue just in case?

@fujitatomoya
Copy link
Collaborator

@eboasson

# CYCLONEDDS_URI='<Tr><V>fine</><Out>cdds.log.${CYCLONEDDS_PID}</>'
# ros2 topic pub /talker --qos-durability volatile std_msgs/String "data: Hello World volatile" & sleep 2 ; ros2 topic info -v /talker ; kill % ; wait
# grep -E '(SEDP|WRITER).*topic_name="rt/talker"' cdds.log.*
grep: cdds.log.*: No such file or directory

I cannot get the log file cdds.log...did i miss something?

@fujitatomoya
Copy link
Collaborator

after restart container, I am no longer able to reproduce this issue with #525 (comment). i don't know what to tell, but sorry 😢 I'll try to find another environment.

@eboasson
Copy link

eboasson commented Sep 7, 2020

@eboasson

# CYCLONEDDS_URI='<Tr><V>fine</><Out>cdds.log.${CYCLONEDDS_PID}</>'
# ros2 topic pub /talker --qos-durability volatile std_msgs/String "data: Hello World volatile" & sleep 2 ; ros2 topic info -v /talker ; kill % ; wait
# grep -E '(SEDP|WRITER).*topic_name="rt/talker"' cdds.log.*
grep: cdds.log.*: No such file or directory

I cannot get the log file cdds.log...did i miss something?

Foolish me ... trying to be helpful giving copy-paste-ready lines and then forgetting the export CYCLONEDDS_URI part ...

after restart container, I am no longer able to reproduce this issue with #525 (comment). i don't know what to tell, but sorry 😢 I'll try to find another environment.

Hopefully you can find one without much effort 🤞

@iuhilnehc-ynos
Copy link

@fujitatomoya

could someone try to confirm above procedure to reproduce the issue just in case?

Yes, I can reproduce this issue by your steps. (I agree we can use a different RMW to get information from others)
I can also use the following steps to reproduce it in a container.

  • terminal 1
$ export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
$ ros2 topic pub /talker1234 --qos-durability volatile  std_msgs/String "data: Hello World volatile" > /dev/null 2>&1
  • terminal 2
$ ros2 daemon stop
$ export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

$ ros2 topic info -v /talker1234

1599459365.320665 [0]       ros2: using network interface eno1 (udp/192.168.0.61) selected arbitrarily from: eno1, docker0
Type: std_msgs/msg/String

Publisher count: 1

Node name: _ros2cli_783
Node namespace: /
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: 49.a5.10.01.7d.f9.95.aa.aa.a2.1a.b4.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_VOLATILE                     # good, it's correct.
  Lifespan: 9223372036854775807 nanoseconds
  Deadline: 9223372036854775807 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 9223372036854775807 nanoseconds

Subscription count: 0


$ ros2 daemon stop
$ export RMW_IMPLEMENTATION=rmw_fastrtps_cpp

$ ros2 topic info -v /talker1234
Type: std_msgs/msg/String

Publisher count: 1

Node name: _CREATED_BY_BARE_DDS_APP_
Node namespace: _CREATED_BY_BARE_DDS_APP_
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: 49.a5.10.01.7d.f9.95.aa.aa.a2.1a.b4.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL             # bad.
  Lifespan: 2147483651294967295 nanoseconds
  Deadline: 2147483651294967295 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 2147483651294967295 nanoseconds

Subscription count: 0

@fujitatomoya
Copy link
Collaborator

@iuhilnehc-ynos

thanks for checking!

@eboasson
Copy link

eboasson commented Sep 8, 2020

@iuhilnehc-ynos this is getting weirder by the comment! So now Cyclone DDS shows the correct durability setting and Fast-RTPS shows the wrong one!

One thing it proves: we really need to know what the publisher process advertises over discovery.

I would appreciate it very much if could you run it with the CYCLONEDDS_URI bit I mentioned earlier to get log files from Cyclone and run Wireshark to collect the packets, and get them to me somehow. One never knows where things go wrong, but that gives a ton of information, including the raw data that's on the wire, and really should help in pinpointing it.

@eboasson
Copy link

eboasson commented Sep 8, 2020

@iuhilnehc-ynos perhaps you need not do what I asked just now: after I published the comment, I realised that I could try this particular experiment of mixing a Cyclone DDS-based publisher with a Fast-RTPS-based ros2 cli. I get exactly the same result. That's good news.

I've checked with Wireshark:
image

Cyclone only publishes QoS settings that are different from the default. In section 9.6.2.2.1 the DDSI-RTPS specification states that for parameters (e.g., durability QoS, a.k.a. PID_DURABILITY) missing from a discovery message, the default should be applied, and then references the DDS specification for the default

image

which is volatile per section 2.1.3 of the DCPS spec, so Cyclone DDS is entirely correct:

image

and Fast-RTPS or its RMW layer is misinterpreting the discovery data.

I suspect that when @fujitatomoya reproduced it, the ros2 daemon was running using Fast-RTPS. That one then stores the incorrect discovery data and regurgitates it to "ros2 topic info" even if that one is running Cyclone. Restarting the machine/container could easily have resulted in using Cyclone DDS for the daemon.

And if instead you use Cyclone DDS for the daemon, even Fast-RTPS gets it right:

$ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 topic pub /talker1234 --qos-durability volatile  std_msgs/String "data: Hello World volatile" > /dev/null 2>&1 &

$ ros2 daemon stop
$ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 topic info -v /talker1234

Type: std_msgs/msg/String

Publisher count: 1

Node name: _ros2cli_28883
Node namespace: /
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: 5c.18.10.01.85.70.a9.07.7d.e5.91.87.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_VOLATILE               # good
  Lifespan: 9223372036854775807 nanoseconds
  Deadline: 9223372036854775807 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 9223372036854775807 nanoseconds

Subscription count: 0

$ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 topic info -v /talker1234

Type: std_msgs/msg/String

Publisher count: 1

Node name: _ros2cli_28883
Node namespace: /
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: 5c.18.10.01.85.70.a9.07.7d.e5.91.87.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_VOLATILE               # good
  Lifespan: 9223372036854775807 nanoseconds
  Deadline: 9223372036854775807 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 9223372036854775807 nanoseconds

Subscription count: 0

$ ros2 daemon stop
The daemon has been stopped
$ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 topic info -v /talker1234  

Type: std_msgs/msg/String

Publisher count: 1

Node name: _CREATED_BY_BARE_DDS_APP_
Node namespace: _CREATED_BY_BARE_DDS_APP_
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: 5c.18.10.01.85.70.a9.07.7d.e5.91.87.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL               # bad
  Lifespan: 2147483651294967295 nanoseconds
  Deadline: 2147483651294967295 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 2147483651294967295 nanoseconds

Subscription count: 0

$ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 topic info -v /talker1234

Type: std_msgs/msg/String

Publisher count: 1

Node name: _CREATED_BY_BARE_DDS_APP_
Node namespace: _CREATED_BY_BARE_DDS_APP_
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: 5c.18.10.01.85.70.a9.07.7d.e5.91.87.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL               # bad
  Lifespan: 2147483651294967295 nanoseconds
  Deadline: 2147483651294967295 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 2147483651294967295 nanoseconds

Subscription count: 0

@iuhilnehc-ynos
Copy link

@eboasson

Fast-RTPS or its RMW layer is misinterpreting the discovery data.

I used another export RMW_IMPLEMENTATION=rmw_connext_cpp for testing, but daemon with fastrtps could get correct information.

@eboasson
Copy link

eboasson commented Sep 8, 2020

I used another export RMW_IMPLEMENTATION=rmw_connext_cpp for testing, but daemon with fastrtps could get correct information.

Thanks for checking, my employer requires me to avoid touching Connext because of some less than pleasant interactions in the past ...

Presumably Connext then also includes the durability QoS when it is volatile (you can fairly easily check with Wireshark: record the startup, then look for DATA(w) submessages and look through for the topic name in the expanded parameter list).

By the way, you can make Cyclone publish all QoS settings all the time with an option:

$ export CYCLONEDDS_URI='<Compatibility><ExplicitlyPublishQosSetToDefault>true</></>'

and when you do that, Fast-RTPS suddenly gets it right. So there's a work around, but it relies on wasting network bandwidth.

I also know from completely unrelated situations that are no problems interoperating with Connext if you leave out the durability QoS (or many others) from projects using both Connext and OpenSplice (which has the same behaviour with respect to publish QoS settings as Cyclone). So it really is the Fast-RTPS side that is broken.

@clalancette
Copy link
Contributor

Thanks for all of the debugging on this issue, it is really appreciated.

@iuhilnehc-ynos @fujitatomoya Since it seems like this is a bug in Fast-DDS, would one of you mind summarizing the problem and opening a new issue on https://github.com/eProsima/Fast-DDS/issues ?

@iuhilnehc-ynos
Copy link

iuhilnehc-ynos commented Sep 8, 2020

@clalancette @eboasson

I don't think this bug is in Fast-DDS. IMO, it's because ros2_daemon(rmw_fastrtps) not received QoS information from ros2cli pub with rmw_cyclonedds_cpp, so ros2 topic -v can't get QoS information from ros2_daemon.

NOTE: I believe it belongs to cyclonedds more.

@clalancette
Copy link
Contributor

I don't think this bug is in Fast-DDS. IMO, it's because ros2_daemon(rmw_fastrtps) not received QoS information from ros2cli pub with rmw_cyclonedds_cpp, so ros2 topic -v can't get QoS information from ros2_daemon.

NOTE: I believe it belongs to cyclonedds more.

OK, I misunderstood. If @eboasson agrees, then I think we'll want to open up an issue there. Thanks.

@eboasson
Copy link

eboasson commented Sep 8, 2020

@clalancette and @iuhilnehc-ynos, it is a bug in Fast-DDS: Cyclone DDS relies on the receivers of the discovery data to implement required behaviour and fill in the correct values for QoS settings not explicitly included in the discovery message. The bug here is that Fast-DDS uses the wrong default value (at least for durability).

@fujitatomoya
Copy link
Collaborator

@clalancette @eboasson @iuhilnehc-ynos

I am not sure where exactly this problem belongs to yet 🤕 , i just need more time make it clear. I think what we see here is some misunderstanding of syntax between different dds implementation, that will lead the discovery process broken then wrong QoS. that is as far as i can tell. (i guess that we all agree on this.)

Problem

ros2 topic info -v receives wrong QoS profile from ros2 daemon

Reproducible Procedure

Symptom

  • Problem only happens when ros2 daemon is started with rmw_fastdds as experience.
  • When problem happens, publication is registered as _CREATED_BY_BARE_DDS_APP_, Not a ros node.

after all, maybe it would be suitable to create move this issue rmw_fastrtps to check the discovery process, find out why publication is registered as _CREATED_BY_BARE_DDS_APP_. I believe that would lead to the root cause.

what would you think?

@iuhilnehc-ynos
Copy link

iuhilnehc-ynos commented Sep 9, 2020

@eboasson

Sorry, I think it's my mistake.

FastDDS does have a problem. If WriterProxyData::readFromCDRMessage not received durability, its QoS will be WriterQos m_qos;

WriterQos::WriterQos()
{
    this->m_reliability.kind = RELIABLE_RELIABILITY_QOS;
    this->m_durability.kind = TRANSIENT_LOCAL_DURABILITY_QOS;    // not use default value here ?
}

But I can't find the information in DDSI-RTPS specification about
In section 9.6.2.2.1 the DDSI-RTPS specification states that for parameters (e.g., durability QoS, a.k.a. PID_DURABILITY) missing from a discovery message, the default should be applied

or Let me put it another way, why cyclonedds send PID_RELIABILITY even if it's the default value ?

@iuhilnehc-ynos
Copy link

iuhilnehc-ynos commented Sep 9, 2020

@fujitatomoya

Thanks for your summary.

When problem happens, publication is registered as CREATED_BY_BARE_DDS_APP, Not a ros node.

I noticed this. I think it's also a problem. (About _CREATED_BY_BARE_DDS_APP_, it's located at rmw_dds_common, after finishing DURABILITY issue, I'll continue to look into it.)

@eboasson
Copy link

eboasson commented Sep 9, 2020

No worries, @iuhilnehc-ynos, there're just too many details to be on top of them all.

FastDDS does have a problem. If WriterProxyData::readFromCDRMessage not received durability, its QoS will be WriterQos m_qos;

WriterQos::WriterQos()
{
    this->m_reliability.kind = RELIABLE_RELIABILITY_QOS;
    this->m_durability.kind = TRANSIENT_LOCAL_DURABILITY_QOS;    // not use default value here ?
}

Yes, that totally explains it

But I can't find the information in DDSI-RTPS specification about
In section 9.6.2.2.1 the DDSI-RTPS specification states that for parameters (e.g., durability QoS, a.k.a. PID_DURABILITY) missing from a discovery message, the default should be applied

It is this paragraph:

For backwards compatibility, both subspaces are subdivided again. If a ParameterId is expected, but not present, the protocol will assume the default value. Similarly, if a ParameterId is present but not recognized, the protocol will either skip and ignore the parameter or treat the parameter as an incompatible QoS. The actual behavior depends on the ParameterId value, see Table 9.11.

(Page 169 of DDSI-RTPS 2.3, emphasis mine)

Table 9.11 then says "see DDS specification", which is the big table in the DDS 1.4 spec starting on page 2-104.

or Let me put it another way, why cyclonedds send PID_RELIABILITY even if it's the default value ?

It is because it isn't actually set to the default value: the reliability QoS is a pair of kind (best-effort or reliable) and a "max blocking time" that says for how long the writer should block when resource limits prevent it from completing the write operation. The DDS default for "max blocking time" is 100ms, but the Cyclone RMW layer sets it to ∞ and that's the reason it is included.

I now see that Wireshark doesn't print the "max blocking time" in the table. It doesn't affect reader/writer matching, so in that respect it is not very important, but it is a required part of the wire format so it is kinda odd that Wireshark omits it. (Required because the DDSI-RTPS spec says the type of the "reliability" parameter in the discovery is ReliabilityQosPolicy, in turn defined in the DDS spec, although with a footnote that "The encoding of DDS::ReliabilityQoSPolicy::kind is defined by RTPS::ReliabilityKind_t (9.3.2)".

image

It's the "ff ff ff 7f ff ff ff ff" bit in the highlighted bytes.

@fujitatomoya
Copy link
Collaborator

When the CREATED_BY_BARE_DDS_APP problem is observed, process_discovery_info RTPSParticipantKey() is all zero (confirmed). It should be something like rmw_gid_t 1 64 16 1 -46 123 40 76 -73 127 -106 -118 0 0 1 -63 0 0 0 0 0 0 0 0 which is the same rmw_gid_t with node_listener msg.gid.data.

and the questions is why? @MiguelCompany could you kindly share your thought? this is really getting into dds implementation.

@iuhilnehc-ynos
Copy link

iuhilnehc-ynos commented Sep 9, 2020

FastDDS expected that it will get fastdds::dds::PID_PARTICIPANT_GUID message while getting DATA(w) information, but CycloneDDS only send fastdds::dds::PID_PARTICIPANT_GUID at DATA(p) not DATA(w).
(It seems there is no such item 'must', 'optional' for these message-ids in the spec.)

@iuhilnehc-ynos
Copy link

@eboasson

It is because it isn't actually set to the default value

I appreciate your explanation.

@eboasson
Copy link

eboasson commented Sep 9, 2020

@iuhilnehc-ynos

FastDDS expected that it will get fastdds::dds::PID_PARTICIPANT_GUID message while getting DATA(w) information, but CycloneDDS only send fastdds::dds::PID_PARTICIPANT_GUID at DATA(p) not DATA(w).
(It seems there is no such item 'must', 'optional' for these message-ids in the spec.)

This is an interesting. Quoting from section 9.6.2.2 in the DDSI-RTPS spec:

For optimization, implementations of the protocol shall not include a parameter in the Data submessage if it contains information that is redundant with other parameters already present in that same Data submessage. As a result of this optimization an implementation shall omit the serialization of the parameters listed in Table 9.10.

The participant GUID is entirely redundant because it is always the same as the reader/writer GUID with the entity id component replaced by 0x1c1. So in my view the normative text says the participant GUID should be omitted and the referenced table is therefore simply an incomplete list of "forbidden" items.

I guess one could hold a different opinion and argue that the intent behind that paragraph is that only those listed in table 9.10 should be left. But the fact of the matter is that neither OpenSplice nor Cyclone DDS has ever published it in the reader/writer information for a decade without it causing any interoperability issues, and I have furthermore checked some packet captures related to investigating interoperability issues in that decade that show Connext and CoreDX also leave it out. (That also means I would expect the same all-zero GUID to show when using Connext instead of Cyclone in this experiment.)

Finally, if eProsima is of the opinion that the participant GUID must be present in the reader/writer information, they should reject the discover data as invalid — maybe I should try fuzzing it? — and not use a nonsensical default value instead.

@iuhilnehc-ynos
Copy link

@clalancette

If @eboasson agrees, then I think we'll want to open up an issue there. Thanks.

Anyway, I think it's time to pull eProsima in. I have created an issue on Fast-DDS.

@fujitatomoya
Copy link
Collaborator

@eboasson

I see what you mentioned in this thread makes perfect sense.
thanks for the insights, you actually really helped us about specification.

@MiguelCompany
Copy link

Regarding missing durability info, this is a bug which is being fixed by eProsima/Fast-DDS#1384

Regarding PID_PARTICIPANT_GUID, I've been reading DDSI-RTPS v2.3 once more, and I am interpreting it the following way:

Section 9.6.2.2 indicates that DiscoveredWriterData is defined with the following IDL:

struct DiscoveredWriterData {
 DDS::PublicationBuiltinTopicData ddsPublicationData;
 WriterProxy mWriterProxy;
};

It then says on Table 9.10 that WriterProxy::remoteWriterGuid does not need to be serialized, as it can be obtained from PublicationBuiltinTopicData::key

Mapping of PID_PARTICIPANT_GUID is specified later on Table 9.14 - ParameterId mapping and default values

image

This means that field PublicationBuiltinTopicData::participant_key from the DDS v1.4 standard is serialized using PID_PARTICIPANT_GUID. On the default value it says N/A, meaning there is no default value for it, so it should always be sent.

I have created PR eProsima/Fast-DDS#1382 to handle the case of a missing PID_PARTICIPANT_GUID and in that case default it to using the GUID prefix coming on PID_ENDPOINT_GUID and the default participant entity id

@fujitatomoya
Copy link
Collaborator

@clalancette @ahcorde

let me summarize current status in quick,

Problems

  1. Node name is ros2cli{pid}, not _CREATED_BY_BARE_DDS_APP_
  2. Namespace is /, not _CREATED_BY_BARE_DDS_APP_
  3. Durability is RMW_QOS_POLICY_DURABILITY_VOLATILE (run ros2 with durability(volatile) )

Current Fix

i confirmed the fix but it would be better if someone does double check in macOS to make sure since this is registered for macOS at 1st.

Note

there's been still discussion about DDSI-RTPS specification.

@iuhilnehc-ynos
Copy link

@fujitatomoya

if someone does double check in macOS to make sure since this is registered for macOS at 1st.

I have confirmed all these issues will be fixed in macOS after eProsima/Fast-DDS#1384 is merged.
(eProsima/Fast-DDS#1382 is already merged.)

chenlh@lihuideMacBook-Pro ROS2 % ros2 topic info -v /talker1234
Type: std_msgs/msg/String

Publisher count: 1

Node name: _ros2cli_2279
Node namespace: /
Topic type: std_msgs/msg/String
Endpoint type: PUBLISHER
GID: 9c.d0.10.01.ef.00.1e.2b.04.29.a2.c8.00.00.08.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RMW_QOS_POLICY_RELIABILITY_RELIABLE
  Durability: RMW_QOS_POLICY_DURABILITY_VOLATILE
  Lifespan: 2147483651294967295 nanoseconds
  Deadline: 2147483651294967295 nanoseconds
  Liveliness: RMW_QOS_POLICY_LIVELINESS_AUTOMATIC
  Liveliness lease duration: 2147483651294967295 nanoseconds

Subscription count: 0

@fujitatomoya
Copy link
Collaborator

@prajakta-gokhale
CC: @ahcorde @clalancette

this is taken care by eProsima/Fast-DDS#1380, could you kindly confirm?

@clalancette
Copy link
Contributor

I just verified that on Galactic, this problem is solved. I'm going to go ahead and close this one out, thanks all for the debugging and fixes here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

9 participants