Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gcoap: Process CON responses #14178

Merged

Conversation

chrysn
Copy link
Member

@chrysn chrysn commented May 30, 2020

This generalizes the existing code for answering CoAP pings into general
message-layer responses. Such responses are now also sent as a reaction
to CON responses, which can otherwise follow the same code path as
existing other responses.

As a side effect, issues that would crop up when responding to odd empty
requests that have token length set are resolved.

Testing procedure

[edit: further down on #14178 (comment) ]

Issues/PRs references

Fixes #14169 (eventually; currently it only addresses the first parts).

@chrysn chrysn added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) State: WIP State: The PR is still work-in-progress and its code is not in its final presentable form yet Area: CoAP Area: Constrained Application Protocol implementations labels May 30, 2020
@chrysn
Copy link
Member Author

chrysn commented May 30, 2020

Note to self: There's a quite odd empty_as_response goto location sitting in the code; I presume that was the location where the response was originally envisioned to be sent, but as now we need to track some local state in the function anyway (which type to set), this is easier done in an if at the end, and the goto can go away.

@chrysn
Copy link
Member Author

chrysn commented Jun 3, 2020

A nice side effect of this is that servers that take longer to create a response than they want to keep the gcoap thread waiting can now use gcoap_send_req to send their response as a CON. (Phrasing that in the documentation in a non-confusing way will be hard, and maybe we'd like to introduce a thin wrapper function just to make the docs easier).

This shows an open bug with the current PR (that CON requests without a response callback could have their memo freed immediately when an empty ACK arrives), but that's a missing optimization there anyway.

@chrysn
Copy link
Member Author

chrysn commented Jun 29, 2020

Coming back to this, I think it'll be easiest not to cancel the timeout, but just to set a flag that "yes this was a CON but stop transmitting it's already ACKed". That would work around the race condition, and while this incurs the runtime penalty of having a few timers fire needlessly, it keeps the code in the timers smaller (set a flag vs. compute the total timeout and fiddle with timers).

Next steps on my side are addressing Cenk's comment and implementing that to replace the current WIP head.

@chrysn
Copy link
Member Author

chrysn commented Jun 29, 2020

This should be reviewable now that I pushed an non-WIP final commit and the...

Testing procedure

As there are no delayed servers in gcoap yet this needs to be tested against an external application, for example aiocoap:

$ git clone https://github.com/chrysn/aiocoap
$ cd aiocoap
$ ./server.py

[edit: added]
As the /other/separate resource this will be tested on is rather large, you'll need to set CONFIG_GCOAP_PDU_BUF_SIZE to about 600. Otherwise, the eventual response CON will be silently dropped by gcoap as it can't receive it in parts, and thus the server will keep transmitting the CON responses. See also #14167.
[/edit]

Watch the network traffic on the bridge with Wireshark (filtering for CoAP) and run RIOT:

$ make -C examples/gcoap all term
> coap get -c fe80::176b:fd74:a58f:ff97%6 5683 /time
gcoap_cli: sending msg ID 14102, 11 bytes
> gcoap: response Success, code 2.05, 16 bytes
00000000  32  30  32  30  2D  30  36  2D  32  39  20  31  37  3A  35  35
> coap get -c fe80::176b:fd74:a58f:ff97%6 5683 /other/separate
gcoap_cli: sending msg ID 14103, 21 bytes
> gcoap: response Success, code 2.05, 189 bytes
00000000  54  68  72  65  65  20  72  69  6E  67  73  20  66  6F  72  20
...
>

The first response arrives immediately and shows as a CON and an ACK (before and after).

The second request shows as a CON GET (almost) immediately followed by an empty ACK. Then almost 3s later, a CON 2.05 and an empty ACK are shown.

Note that:

  • The response is correctly matched to the request
  • Gcoap sends an empty ACK for otherwise the server would keep sending
  • Gcoap does not send a second CON GET, even though the response takes longer than the default retransmission timeout of 2s. (The incoming ACK indicates that the request was received, and that the server takes responsibility for any further retransmission.)

It'd also be nice to test that if the final empty ACK is lost and the server sends another CON response, an RST is generated. I'm not perfectly sure it does right now, and either way I'd like to test it. Do we have provisions for automatically testing against external software, and in particular for introducing package loss at a defined pattern or sequence?

@cgundogan
Copy link
Member

This should be reviewable now that I pushed an non-WIP final commit

nice, will have a look at it (:

Copy link
Member

@cgundogan cgundogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my first batch of comments. In general, the changes look good. IMO, we should find a better solution for handling the request timeout after receiving an empty response. I'll test the code now as outlined in the PR description #14178 (comment).

sys/net/application_layer/gcoap/gcoap.c Outdated Show resolved Hide resolved
sys/net/application_layer/gcoap/gcoap.c Outdated Show resolved Hide resolved
sys/net/application_layer/gcoap/gcoap.c Outdated Show resolved Hide resolved
sys/net/application_layer/gcoap/gcoap.c Outdated Show resolved Hide resolved
@@ -256,6 +278,12 @@ static void _on_resp_timeout(void *arg) {
#endif
event_timeout_set(&memo->resp_evt_tmout, timeout);

if (memo->state == GCOAP_MEMO_WAIT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, handling the timeout this way is IMO suboptimal as it depends on the number of configured retransmissions (and how many of them are still left to fire, imagine we are in the last iteration when the empty response arrives). The random component of the backoff calculation also makes it quite indeterministic.

Another solution could be to disarm the timer in _cease_retransmission() and set a new one with a constant (or maybe slightly jittered) offset, like e.g. done with CONFIG_GCOAP_NON_TIMEOUT.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with disarming timers is that that'd open up to a race condition described in #14390 -- that condition already exists, but I'd prefer not to add another case.

I've toyed around a bit with such solutions, but they were invariably more complex than just accepting that a timer fires a few more times than absolutely necessary without doing anything than decrement a counter.

Yes, there's some random component (and one might consider trimming that off the decrement-only case); the way it is phrased now, the CON timeout will happen just when it would have happened if there had not been any ACK. (Conceptually, the application and retransmission timeouts are different things anyway, but it's a practical choice widespread in implementations to conflate them. Thus, sticking with what would happen if there was no ACK is consistent.).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean you could just not add the random jitter in the memo->state == GCOAP_MEMO_WAIT case, if that is desirable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could, but I doubt it'd make it better. The backup calculation code is already hard to read as it was, I don't expect that skipping the jitter will make the ROM size shrink, and at runtime it'd save the random number generation of an unlikely event at cost of another jump.

When the general raciness is addressed, there'll be a code path that explicitly calculates the remaining timeout and that can then be done bypassing the jitter as it happens at a completely different point, but that's for #14390.

@cgundogan
Copy link
Member

The second request shows as a CON GET (almost) immediately followed by an empty ACK. Then almost 3s later, a CON 2.05 and an empty ACK are shown.

I followed the test description in #14178 (comment). It seems to work properly on the RIOT side, I was a little bit confused about the unnecessary retransmission that originate from ./server.py. Is that a problem with aiocoap?

chrysn added a commit to chrysn-pull-requests/RIOT that referenced this pull request Sep 2, 2020
See RIOT-OS#14178 (comment) and
code guidelines point about parentheses in complex statements.
chrysn added a commit to chrysn-pull-requests/RIOT that referenced this pull request Sep 2, 2020
See RIOT-OS#14178 (comment) and
code guidelines point about parentheses in complex statements.
@chrysn chrysn force-pushed the gcoap-handle-con-response branch from d3a218a to 094e9eb Compare September 2, 2020 13:20
@chrysn
Copy link
Member Author

chrysn commented Sep 2, 2020

Hm, my own tests confirm what you've observed, that the client's ACK is not sent, and thus the server keeps retransmitting.

Ah, found it: The typical response of an aiocoap server to the test resource is too long to fit in the recv. That is already tracked in #14167 ; I'm updating the testing instructions to increase CONFIG_GCOAP_PDU_BUF_SIZE.

@chrysn
Copy link
Member Author

chrysn commented Oct 14, 2020

@cgundogan, is there anything you'd like to have addressed before I rebase this onto the current master? It's been sitting a while, I'd like to get the proxy parts upstreamed, and this is part of the ground work.

@benpicco
Copy link
Contributor

I'd say just rebase 😉

@chrysn chrysn force-pushed the gcoap-handle-con-response branch from 094e9eb to 9eaa959 Compare October 20, 2020 22:48
@chrysn
Copy link
Member Author

chrysn commented Oct 20, 2020

Rebased and tested again here. Ran into #14167 again but at least this time I was faster seeing it :-)

@chrysn chrysn added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Oct 21, 2020
Copy link
Contributor

@benpicco benpicco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the WIP label still valid?

sys/net/application_layer/gcoap/gcoap.c Show resolved Hide resolved
* The current implementation does not touch the timers, but merely sets the
* memo's state to GCOAP_MEMO_WAIT. This approach needs less complex code at
* the cost of the remaining `send_limit` timers firing and some memory not
* being freed until the actual response arrives.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the response never arrives?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as if no ACK were sent: When the retransmission timer fires but the retransmission counter has reached zero, the application is informed of the failure. By not touching the rest of the mechanism, things stay exactly as it would have been.

@chrysn chrysn removed the State: WIP State: The PR is still work-in-progress and its code is not in its final presentable form yet label Oct 22, 2020
Copy link
Contributor

@benpicco benpicco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good.
The CON timeout being jittery should not be an issue and if it keeps the code simpler that is preferable.

This generalizes the existing code for answering CoAP pings into general
message-layer responses. Such responses are now also sent as a reaction
to CON responses, which can otherwise follow the same code path as
existing other responses.

As a side effect, issues that would crop up when responding to odd empty
requests that have token length set are resolved.

Contributes-To: RIOT-OS#14169
Simplify the code path and give consistent debug messages.
The actual implementation will follow in a separate commit, this does
the groundwork and sets the intention.
This introduces an additional state to the COAP_MEMO_* series to avoid
enlarging the memo struct needlessly. While they are documented
publicly, practically only the COAP_MEMO_TIMEOUT and COAP_MEMO_RESPONSE
are used in communication with the application, as a
gcoap_request_memo_t is only handed out in that state.
@chrysn chrysn force-pushed the gcoap-handle-con-response branch from d2011c8 to 462f886 Compare October 22, 2020 17:09
@benpicco benpicco merged commit b5cd19e into RIOT-OS:master Oct 23, 2020
@chrysn chrysn deleted the gcoap-handle-con-response branch October 23, 2020 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: CoAP Area: Constrained Application Protocol implementations CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gcoap does not handle separate responses
3 participants