Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DeviceCommissioner does not handle async operations across commissioning stop/start correctly #21120

Open
bzbarsky-apple opened this issue Jul 22, 2022 · 3 comments
Assignees

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

DeviceCommissioner kicks off various un-cancelable async operations (device attestation verification, NOC issuance, commands sent via the InteractionModel.h/CHIPCluster.h APIs). If the consumer calls StopPairing and then starts a new commissioning process during one of these async operations, when the operation result comes back it will be applied to the wrong commissioning process. This means in practice StopPairing is not really usable.

And shutting down the DeviceCommissioner during any of those async bits is liable to crash with use-after-free.

Proposed Solution

What we should probably do is that instead of using the DeviceCommissioner itself as the callback context for async operations
we should use some per-commissioning-process or per-async-operation object that lives until the async operation completes and has a back-pointer to the DeviceCommissioner. The DeviceCommissioner can null out that backpointer as needed (when it's shut down, or on StopPairing, etc), which will make the async callback get ignored when it happens.

This is sort of reimplementing Callback to some extent, but the problem is that in practice the async bits are not using Callback and changing them to use it would be fairly difficult.

@mrjerryjohns @cecille @tcarmelveilleux @tehampson @vivien-apple @msandstedt

@woody-apple
Copy link
Contributor

Issue Review: Assigning to @bzbarsky-apple to take a look to see if this can be done for 1.0

@Emill
Copy link

Emill commented Nov 17, 2023

Has there been any thoughts/work on improving this?

I'm also trying to use the StopPairing method but noticed pretty quickly that it seems seriously broken and in even more situations than you describe.

In my case, the administrator/commissioner runs on the same device, only on-network resolve and using the built-in
default AutoCommisioner.

How broken it is and what happens depends on the commissioning state. Here are a few situations what happens for some different current states when StopPairing is called:

  1. No device has been found yet. In this case StopPairing appears to work but there is no kind of complete callback being triggered.
  2. The SPAKE2+ protocol is running (which can take multiple seconds on slow devices). In this case the StopPairing appears to do nothing. When PASE later is established after a few seconds, the commissioning continues as normal until completion. The SetUpCodePairer::StopPairing seems broken here since it does not actually stop the attempt, but just resets some internal pointers/delegates and stops the browsing process. It does not prevent the commissioning process inside the DeviceCommissioner from starting once the PASE session has been established.
  3. kReadCommissioningInfo is in progress. Appears to somewhat work in case no new device starts pairing shortly after StopPairing returns. The StopPairing call triggers the ArmFailSafe command request to immediate expiration and when a response is received (or at timeout), the DisarmDone function will clean up everything and finally call the user's complete callbacks with the error "cancelled". The read response will arrive at some point but void DeviceCommissioner::OnDone(app::ReadClient *) will do nothing in case of the wrong state, which is why I wrote "in no case no new device starts pairing shortly after StopPairing returns".
  4. For other states, the situation is more broken. As in 3., the ArmFailSafe command will be sent and the pending state is set to kCleanup, but the other complete handlers (e.g. OnSetRegulatoryConfigResponse) just assume we are in the corresponding state and therefore calls CommissioningStageComplete which will, together with CommissioningStepFinished handle the success of kCleanup which will just set a bunch of pointers to null and immediately return. Since the DisarmDone returns immediately if mDeviceBeingCommissioned is null (which it is now since the success of kCleanup has already been handled), nothing will call the complete callbacks and free the commissionee device from the mCommissioneeDevicePool. The state will also remain forever in kCleanup. To be clear, the packet order will be TX regulatory config request, [stop pairing triggered], TX ArmFailSafe command, RX regulatory config request, RX ArmFailSafe response.

Apart from being broken, I also miss proper API documentation for the StopPairing method which answers the following questions:

  1. Will there be any callback indicating the pairing has stopped, and if so, which callback and with what parameters?
  2. Will/can that callback arrive before StopPairing has returned or a long time after?
  3. Can I expect other callbacks before a possible complete callback arrives, in case the StopPairing would be asynchronous and in-progress steps complete before the pairing stops?
  4. At what point can I safely again call PairDevice?
  5. Or is it possible to have multiple pairing operations in progress? The API documentation (i.e. the header file) doesn't really say anything about this.

Without this being in place, it's hard to review PRs such as #28939 and #28881 if they behave correctly or not, with respect to expected callbacks. Do those PRs for example correctly fix #25724?

I agree with the proposed solution that there should not be a "global" delegate for the commissioner. Another way that I think would work is to use the feature of the interaction model objects (CommandSender, ReadClient etc.) that whenever one of these objects are destroyed, the corresponding operation is also immediately cancelled, which means there will not be any stray callbacks. That way we could correctly cancel the pairing operation with no fear of getting stray callbacks (not sure if we should wait for disarm done though). However, the ClusterBase::InvokeCommand wrappers etc., which the commissioner currently use, unfortunately "fires and forgets" the object, so the operation cannot be cancelled.

In my opinion, each PairDevice operation should result in a separate object with its own callback delegate so that each pairing operation is entirely separate, which would also naturally support multiple parallel commissioning operations.

@bzbarsky-apple
Copy link
Contributor Author

@Emill I've got a plan for what to do here, but have not had a chance to do it. If someone else steps up to work on this, I would be very happy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Development

No branches or pull requests

5 participants