-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gcoap: Process CON responses #14178
gcoap: Process CON responses #14178
Conversation
Note to self: There's a quite odd |
A nice side effect of this is that servers that take longer to create a response than they want to keep the gcoap thread waiting can now use This shows an open bug with the current PR (that CON requests without a response callback could have their memo freed immediately when an empty ACK arrives), but that's a missing optimization there anyway. |
Coming back to this, I think it'll be easiest not to cancel the timeout, but just to set a flag that "yes this was a CON but stop transmitting it's already ACKed". That would work around the race condition, and while this incurs the runtime penalty of having a few timers fire needlessly, it keeps the code in the timers smaller (set a flag vs. compute the total timeout and fiddle with timers). Next steps on my side are addressing Cenk's comment and implementing that to replace the current WIP head. |
741d253
to
32e51c0
Compare
This should be reviewable now that I pushed an non-WIP final commit and the... Testing procedureAs there are no delayed servers in gcoap yet this needs to be tested against an external application, for example aiocoap: $ git clone https://github.com/chrysn/aiocoap
$ cd aiocoap
$ ./server.py [edit: added] Watch the network traffic on the bridge with Wireshark (filtering for CoAP) and run RIOT: $ make -C examples/gcoap all term
> coap get -c fe80::176b:fd74:a58f:ff97%6 5683 /time
gcoap_cli: sending msg ID 14102, 11 bytes
> gcoap: response Success, code 2.05, 16 bytes
00000000 32 30 32 30 2D 30 36 2D 32 39 20 31 37 3A 35 35
> coap get -c fe80::176b:fd74:a58f:ff97%6 5683 /other/separate
gcoap_cli: sending msg ID 14103, 21 bytes
> gcoap: response Success, code 2.05, 189 bytes
00000000 54 68 72 65 65 20 72 69 6E 67 73 20 66 6F 72 20
...
> The first response arrives immediately and shows as a CON and an ACK (before and after). The second request shows as a CON GET (almost) immediately followed by an empty ACK. Then almost 3s later, a CON 2.05 and an empty ACK are shown. Note that:
It'd also be nice to test that if the final empty ACK is lost and the server sends another CON response, an RST is generated. I'm not perfectly sure it does right now, and either way I'd like to test it. Do we have provisions for automatically testing against external software, and in particular for introducing package loss at a defined pattern or sequence? |
nice, will have a look at it (: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my first batch of comments. In general, the changes look good. IMO, we should find a better solution for handling the request timeout after receiving an empty response. I'll test the code now as outlined in the PR description #14178 (comment).
@@ -256,6 +278,12 @@ static void _on_resp_timeout(void *arg) { | |||
#endif | |||
event_timeout_set(&memo->resp_evt_tmout, timeout); | |||
|
|||
if (memo->state == GCOAP_MEMO_WAIT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, handling the timeout this way is IMO suboptimal as it depends on the number of configured retransmissions (and how many of them are still left to fire, imagine we are in the last iteration when the empty response arrives). The random component of the backoff calculation also makes it quite indeterministic.
Another solution could be to disarm the timer in _cease_retransmission()
and set a new one with a constant (or maybe slightly jittered) offset, like e.g. done with CONFIG_GCOAP_NON_TIMEOUT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with disarming timers is that that'd open up to a race condition described in #14390 -- that condition already exists, but I'd prefer not to add another case.
I've toyed around a bit with such solutions, but they were invariably more complex than just accepting that a timer fires a few more times than absolutely necessary without doing anything than decrement a counter.
Yes, there's some random component (and one might consider trimming that off the decrement-only case); the way it is phrased now, the CON timeout will happen just when it would have happened if there had not been any ACK. (Conceptually, the application and retransmission timeouts are different things anyway, but it's a practical choice widespread in implementations to conflate them. Thus, sticking with what would happen if there was no ACK is consistent.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean you could just not add the random jitter in the memo->state == GCOAP_MEMO_WAIT
case, if that is desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could, but I doubt it'd make it better. The backup calculation code is already hard to read as it was, I don't expect that skipping the jitter will make the ROM size shrink, and at runtime it'd save the random number generation of an unlikely event at cost of another jump.
When the general raciness is addressed, there'll be a code path that explicitly calculates the remaining timeout and that can then be done bypassing the jitter as it happens at a completely different point, but that's for #14390.
I followed the test description in #14178 (comment). It seems to work properly on the RIOT side, I was a little bit confused about the unnecessary retransmission that originate from |
See RIOT-OS#14178 (comment) and code guidelines point about parentheses in complex statements.
See RIOT-OS#14178 (comment) and code guidelines point about parentheses in complex statements.
d3a218a
to
094e9eb
Compare
Hm, my own tests confirm what you've observed, that the client's ACK is not sent, and thus the server keeps retransmitting. Ah, found it: The typical response of an aiocoap server to the test resource is too long to fit in the recv. That is already tracked in #14167 ; I'm updating the testing instructions to increase CONFIG_GCOAP_PDU_BUF_SIZE. |
@cgundogan, is there anything you'd like to have addressed before I rebase this onto the current master? It's been sitting a while, I'd like to get the proxy parts upstreamed, and this is part of the ground work. |
I'd say just rebase 😉 |
094e9eb
to
9eaa959
Compare
Rebased and tested again here. Ran into #14167 again but at least this time I was faster seeing it :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the WIP label still valid?
* The current implementation does not touch the timers, but merely sets the | ||
* memo's state to GCOAP_MEMO_WAIT. This approach needs less complex code at | ||
* the cost of the remaining `send_limit` timers firing and some memory not | ||
* being freed until the actual response arrives. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the response never arrives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as if no ACK were sent: When the retransmission timer fires but the retransmission counter has reached zero, the application is informed of the failure. By not touching the rest of the mechanism, things stay exactly as it would have been.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good.
The CON timeout being jittery should not be an issue and if it keeps the code simpler that is preferable.
This generalizes the existing code for answering CoAP pings into general message-layer responses. Such responses are now also sent as a reaction to CON responses, which can otherwise follow the same code path as existing other responses. As a side effect, issues that would crop up when responding to odd empty requests that have token length set are resolved. Contributes-To: RIOT-OS#14169
Simplify the code path and give consistent debug messages.
The actual implementation will follow in a separate commit, this does the groundwork and sets the intention.
This introduces an additional state to the COAP_MEMO_* series to avoid enlarging the memo struct needlessly. While they are documented publicly, practically only the COAP_MEMO_TIMEOUT and COAP_MEMO_RESPONSE are used in communication with the application, as a gcoap_request_memo_t is only handed out in that state.
d2011c8
to
462f886
Compare
This generalizes the existing code for answering CoAP pings into general
message-layer responses. Such responses are now also sent as a reaction
to CON responses, which can otherwise follow the same code path as
existing other responses.
As a side effect, issues that would crop up when responding to odd empty
requests that have token length set are resolved.
Testing procedure
[edit: further down on #14178 (comment) ]
Issues/PRs references
Fixes #14169 (eventually; currently it only addresses the first parts).