Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some commands can take longer in sync execution time than the MRP timeout #19132

Closed
bzbarsky-apple opened this issue Jun 2, 2022 · 7 comments · Fixed by #20461
Closed

Some commands can take longer in sync execution time than the MRP timeout #19132

bzbarsky-apple opened this issue Jun 2, 2022 · 7 comments · Fixed by #20461

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

Some commands take longer than the MRP timeout to execute. Their execution then blocks standalone acks from being sent, and the sender keeps resending.

We have run into this with AddNOC/UpdateNOC (due to the crypto bits they have to do).

I believe @bluebin14 has also run into it for some commands that touched storage when storage was being (due to a bug) very slow. But in general, we don't have a good idea of how long sync storage might take.

Proposed Solution

For commands that involve known slow crypto, we should probably just add manual FlushAcks() calls.

For others, not clear.

@bzbarsky-apple
Copy link
Contributor Author

Also, the newly added IM timeout of 2 seconds + MRP bits is probably too low for these too... @mrjerryjohns @erjiaqing

@tcarmelveilleux
Copy link
Contributor

@turon

@erjiaqing
Copy link
Contributor

Also, the newly added IM timeout of 2 seconds + MRP bits is probably too low for these too... @mrjerryjohns @erjiaqing

The caller can still set the timeout for each command they send.

The 2s timeout should cover 99% cases. Since only cryptographic bits and some async commands are using longer time.

@bzbarsky-apple
Copy link
Contributor Author

bzbarsky-apple commented Jun 3, 2022

The caller can still set the timeout for each command they send.

Yes, but it's not trivial. And in fact, the 2s change broke commissioning... Anyway, that's a separate thing from what we do about MRP for these commands.

@bluebin14
Copy link
Contributor

Although the storage fragmentation bug has been fixed, back then only an MRP of 5s+2s on both client and server could make it reliably past commissioning (changing it on one side had no effect). Also, RemoveFabric was just as slow as AddNOC/UpdateNOC because it has to do a lot of cleanup.

@bzbarsky-apple
Copy link
Contributor Author

This is going to somewhat change what's happening on the wire (not at the protocol level, but in terms of actual messages sent), when fixed...

@woody-apple
Copy link
Contributor

SDK Review: Given this will make the test harness more resilient to a variety of DUTs, we believe this is appropriate for the SVE branch.

tcarmelveilleux added a commit to tcarmelveilleux/connectedhomeip that referenced this issue Jul 7, 2022
- UpdateNOC did not clear session of previous fabric like spec intended
  (project-chip#20379)
- All OpCreds cluster slow commands did not try to early-ack to avoid
  MRP timeouts (project-chip#19132)

Fixes project-chip#20379
Fixes project-chip#19132

This PR:

- Fixes UpdateNOC expiring all sessions for the updated node
- Adds `FlushAcksRightAwayOnSlowCommand` to CommandHandler to flush acks
  on slow commands
- Adds Python tests for UpdateNOC behavior of session expiring
- Adds `ExpireSessions` to Python for testing

Testing done:

- Unit tests all pass
- Cert tests pass
- With the session clearing, previous Python tests failed, until
  I fixed them with the new `ExpireSessions` API
- Observed standalone acks immediately sent on opcreds cluster commands
tcarmelveilleux added a commit that referenced this issue Jul 8, 2022
* Fix UpdateNOC session invalidation and opcreds timing

- UpdateNOC did not clear session of previous fabric like spec intended
  (#20379)
- All OpCreds cluster slow commands did not try to early-ack to avoid
  MRP timeouts (#19132)

Fixes #20379
Fixes #19132

This PR:

- Fixes UpdateNOC expiring all sessions for the updated node
- Adds `FlushAcksRightAwayOnSlowCommand` to CommandHandler to flush acks
  on slow commands
- Adds Python tests for UpdateNOC behavior of session expiring
- Adds `ExpireSessions` to Python for testing

Testing done:

- Unit tests all pass
- Cert tests pass
- With the session clearing, previous Python tests failed, until
  I fixed them with the new `ExpireSessions` API
- Observed standalone acks immediately sent on opcreds cluster commands

* Restyled

* Removed leftover debug
github-actions bot pushed a commit that referenced this issue Jul 8, 2022
* Fix UpdateNOC session invalidation and opcreds timing

- UpdateNOC did not clear session of previous fabric like spec intended
  (#20379)
- All OpCreds cluster slow commands did not try to early-ack to avoid
  MRP timeouts (#19132)

Fixes #20379
Fixes #19132

This PR:

- Fixes UpdateNOC expiring all sessions for the updated node
- Adds `FlushAcksRightAwayOnSlowCommand` to CommandHandler to flush acks
  on slow commands
- Adds Python tests for UpdateNOC behavior of session expiring
- Adds `ExpireSessions` to Python for testing

Testing done:

- Unit tests all pass
- Cert tests pass
- With the session clearing, previous Python tests failed, until
  I fixed them with the new `ExpireSessions` API
- Observed standalone acks immediately sent on opcreds cluster commands

* Restyled

* Removed leftover debug
andy31415 pushed a commit that referenced this issue Jul 9, 2022
* Fix UpdateNOC session invalidation and opcreds timing

- UpdateNOC did not clear session of previous fabric like spec intended
  (#20379)
- All OpCreds cluster slow commands did not try to early-ack to avoid
  MRP timeouts (#19132)

Fixes #20379
Fixes #19132

This PR:

- Fixes UpdateNOC expiring all sessions for the updated node
- Adds `FlushAcksRightAwayOnSlowCommand` to CommandHandler to flush acks
  on slow commands
- Adds Python tests for UpdateNOC behavior of session expiring
- Adds `ExpireSessions` to Python for testing

Testing done:

- Unit tests all pass
- Cert tests pass
- With the session clearing, previous Python tests failed, until
  I fixed them with the new `ExpireSessions` API
- Observed standalone acks immediately sent on opcreds cluster commands

* Restyled

* Removed leftover debug

Co-authored-by: Tennessee Carmel-Veilleux <tennessee.carmelveilleux@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants