Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] OTA Provider / BDX state is erroneous #23573

Closed
fessehaeve opened this issue Nov 10, 2022 · 2 comments · Fixed by #23642
Closed

[BUG] OTA Provider / BDX state is erroneous #23573

fessehaeve opened this issue Nov 10, 2022 · 2 comments · Fixed by #23642
Assignees

Comments

@fessehaeve
Copy link
Contributor

Reproduction steps

I believe there is a bug on the OTA provider/BDX implementation. This is how to reproduce it:
Start an OTA upgrade by announcing the provider to the requestor. 
The requestor ignores the update since the version is the same/not higher. This puts the provider in a not clean state and this is the cause of the issue.
Announce the provider again to the requestor and this time the requestor proceeds to BDX. But provider isn’t in a good state and can’t reply to ReceiveInit properly. (It doesn’t send out ReceiveAccept.

Bug prevalence

Always after an unsuccessful OTA upgrade.

GitHub hash of the SDK that was being used

0b17bce

Platform

nrf, raspi, core

Platform Version(s)

No response

Anything else?

Please ignore the actual value of the software version you see in the logs.
OTA Requestor log

16:51:18.047   5855.853        6  0 I: 5299307 [DMG]Received Command Response Data, Endpoint=0 Cluster=0x0000_0029 Command=0x0000_0001
16:51:18.056   5855.862        9  0 D: 5299316 [SWU]QueryImageResponse:
16:51:18.056   5855.862        0  0 D: 5299319 [SWU]  status: 0
16:51:18.062   5855.868        6  0 D: 5299322 [SWU]  imageURI: bdx://0000000000000064//home/ubuntu/eve_6000-matter.bin
16:51:18.068   5855.874        6  0 D: 5299331 [SWU]  softwareVersion: 6000
16:51:18.075   5855.881        7  0 D: 5299335 [SWU]  softwareVersionString: x.x.0
16:51:18.075   5855.881        0  0 D: 5299339 [SWU]  updateToken: 32
16:51:18.081   5855.887        6  0 D: 5299342 [SWU]  userConsentNeeded: 0
16:51:18.081   5855.887        0  0 D: 5299346 [SWU]  metadataForRequestor: 0
16:51:18.089   5855.895        8  0 D: 5299349 [SWU]Update available from version 6000 to 6000
16:51:18.093   5855.899        4  0 D: 5299355 [DMG]ICR moving to [AwaitingDe]
16:51:18.098   5855.904        5  0 I: 5299359 [DL]OpenThread SED interval set to 2500ms
16:51:18.104   5855.910        6  0 I: 5299364 [EM]<<< [E:33892i M:21335215 (Ack:2450850)] (S) Msg TX to 1:0000000000000064 [501B] --- Type 0000:10 (SecureChannel:StandaloneAck)
16:51:18.116   5855.922       12  0 I: 5299377 [IN](S) Sending msg 21335215 on secure session with LSID: 5465
16:51:18.125   5855.931        9  0 D: 5299385 [EM]Flushed pending ack for MessageCounter:2450850 on exchange 33892i
16:51:18.129   5855.935        4  0 D: 5299393 [DMG]Endpoint 0, Cluster 0x0000_002A update version to 810e63c3
16:51:18.140   5855.946       11  0 D: 5299401 [EVL]Copy Event to next buffer with priority 1
16:51:18.145   5855.951        5  0 D: 5299406 [EVL]LogEvent event number: 0x000000000007000B priority: 1, endpoint id:  0x0 cluster id: 0x0000_002A event id: 0x0 Sys timestamp: 0x000000000050DCC8
16:51:18.160   5855.966       15  0 D: 5299420 [SWU]Establishing session to provider node ID 0x0000000000000064 on fabric index 1
16:51:18.169   5855.975        9  0 D: 5299429 [CSM]FindOrEstablishSession: PeerId = [1:0000000000000064]
16:51:18.175   5855.981        6  0 D: 5299435 [CSM]FindOrEstablishSession: No existing OperationalSessionSetup instance found
16:51:18.182   5855.988        7  0 I: 5299443 [CTL]Found an existing secure session to [1:0000000000000064]!
16:51:18.189   5855.995        7  0 D: 5299450 [CTL]OperationalSessionSetup[1:0000000000000064]: State change 1 --> 5
16:51:18.194   5856.000        5  0 D: 5299457 [SWU]BDX::SendMessage
16:51:18.200   5856.006        6  0 I: 5299461 [EM]<<< [E:33893i M:21335216] (S) Msg TX to 1:0000000000000064 [501B] --- Type 0002:04 (BDX:ReceiveInit)
16:51:18.210   5856.016       10  0 I: 5299471 [IN](S) Sending msg 21335216 on secure session with LSID: 5465
16:51:18.218   5856.024        8  0 I: 5299479 [DL]OpenThread SED interval set to 300ms
16:51:18.617   5856.423      399  0 I: 5299876 [EM]>>> [E:33893i M:2450851 (Ack:21335216)] (S) Msg RX from 1:0000000000000064 [501B] --- Type 0000:10 (SecureChannel:StandaloneAck)
16:51:18.629   5856.435       12  0 D: 5299889 [EM]Found matching exchange: 33893i, Delegate: 0x20005e20
16:51:18.635   5856.441        6  0 D: 5299895 [EM]Rxd Ack; Removing MessageCounter:21335216 from Retrans Table on exchange 33893i

OTA Provider log

[1668009078.043638][15238:15238] CHIP:DMG: AccessControl: checking f=1 a=c s=0x0000000000000032 t= c=0x0000_0029 e=0 p=o
[1668009078.043800][15238:15238] CHIP:DMG: AccessControl: allowed
[1668009078.043943][15238:15238] CHIP:DMG: Received command for Endpoint=0 Cluster=0x0000_0029 Command=0x0000_0000
[1668009078.044014][15238:15238] CHIP:ZCL: OTA Provider received QueryImage
[1668009078.044155][15238:15238] CHIP:ZCL:   VendorID: 0x5555
[1668009078.044198][15238:15238] CHIP:ZCL:   ProductID: 55
[1668009078.044584][15238:15238] CHIP:ZCL:   SoftwareVersion: 6000
[1668009078.044631][15238:15238] CHIP:ZCL:   ProtocolsSupported: [
[1668009078.044680][15238:15238] CHIP:ZCL:     0
[1668009078.044841][15238:15238] CHIP:ZCL:   ]
[1668009078.044885][15238:15238] CHIP:ZCL:   HardwareVersion: 1
[1668009078.044927][15238:15238] CHIP:ZCL:   Location: XX
[1668009078.044968][15238:15238] CHIP:ZCL:   RequestorCanConsent: 0
[1668009078.046081][15238:15238] CHIP:SWU: Generated updateToken: 8441D97DDB6C983509D7D78D685B34
[1668009078.046490][15238:15238] CHIP:DMG: Command handler moving to [ Preparing]
[1668009078.046767][15238:15238] CHIP:DMG: Command handler moving to [AddingComm]
[1668009078.047039][15238:15238] CHIP:DMG: Command handler moving to [AddedComma]
[1668009078.047391][15238:15238] CHIP:DMG: Decreasing reference count for CommandHandler, remaining 0
[1668009078.047737][15238:15238] CHIP:EM: <<< [E:33892r M:2450850 (Ack:21335214)] (S) Msg TX to 1:0000000000000032 [501B] --- Type 0001:09 (IM:InvokeCommandResponse)
[1668009078.048026][15238:15238] CHIP:IN: (S) Sending msg 2450850 on secure session with LSID: 27844
[1668009078.048744][15238:15238] CHIP:DMG: Command handler moving to [CommandSen]
[1668009078.049058][15238:15238] CHIP:DMG: Command handler moving to [AwaitingDe]
[1668009078.480301][15238:15238] CHIP:EM: >>> [E:33892r M:21335215 (Ack:2450850)] (S) Msg RX from 1:0000000000000032 [501B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1668009078.480357][15238:15238] CHIP:EM: Found matching exchange: 33892r, Delegate: (nil)
[1668009078.480399][15238:15238] CHIP:EM: Rxd Ack; Removing MessageCounter:2450850 from Retrans Table on exchange 33892r
[1668009078.577341][15238:15238] CHIP:EM: >>> [E:33893r M:21335216] (S) Msg RX from 1:0000000000000032 [501B] --- Type 0002:04 (BDX:ReceiveInit)
[1668009078.577467][15238:15238] CHIP:EM: Handling via exchange: 33893r, Delegate: 0xaaaad2e07840
[1668009078.577556][15238:15238] CHIP:BDX: OnMessageReceived: message 0x4 protocol (0, 2)
[1668009078.577616][15238:15238] CHIP:atm: Handling received BDX Message
[1668009078.577684][15238:15238] CHIP:atm: ReceiveInit
[1668009078.577736][15238:15238] CHIP:atm:   Proposed Transfer Control: 0x20
[1668009078.577787][15238:15238] CHIP:atm:   Range Control: 0x0
[1668009078.577874][15238:15238] CHIP:atm:   Proposed Max Block Size: 1024
[1668009078.577923][15238:15238] CHIP:atm:   Start Offset: 0x0000000000000000
[1668009078.577974][15238:15238] CHIP:atm:   Proposed Max Length: 0x0000000000000000
[1668009078.578044][15238:15238] CHIP:atm:   File Designator Length: 49
[1668009078.578093][15238:15238] CHIP:atm:   File Designator: /home/ubuntu/eve_6000-matter.bin
[1668009078.777683][15238:15238] CHIP:EM: <<< [E:33893r M:2450851 (Ack:21335216)] (S) Msg TX to 1:0000000000000032 [501B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1668009078.777806][15238:15238] CHIP:IN: (S) Sending msg 2450851 on secure session with LSID: 27844
@samadDotDev
Copy link
Contributor

This may be due to the chip::bdx::TransferFacilitator::mStopPolling not being set back to false after it is set as true during BdxOtaSender::Reset() call from BdxOtaSender::InitializeTransfer (when there's stale initialization done earlier for the same node):



Also, an observation from the logs, I'm assuming that you're using an instrumented version of the code on the requestor to actually fall into this state as otherwise it wouldn't fall into this arm.

[SWU]Update available from version 6000 to 6000

But yeah in general, it can be any other failed BDX too that should have the same symptoms given the improper state the new transfer gets initialized with.

@carol-apple
Copy link
Contributor

Sending it to @harsha-rajendran to take an initial look

nivi-apple added a commit to nivi-apple/connectedhomeip that referenced this issue Feb 3, 2023
…polling again

- This covers the use case where we have an image available but the requestor rejects
  the transfer for some reason. On the next query image the requestor is able to
  continue with the BDX transfer. The provider state was not being reset for this scenario
  due to the polling flag not being reset.

- Fixes project-chip#23573
bzbarsky-apple added a commit that referenced this issue Feb 3, 2023
#24777)

* Add a timer to track whether we have received BDX init after a query image was successful

- Currently if an ota requester has a successful query image and an image it available but
  if for any reason, the ota requester doesn't send BDX init, the provider will be stuck until
  we reboot the resident. In order to have a fail safe, we are adding a timer that starts after
  query image returns image available and waits for a BDX init to come. In case BDX init doesn't
  come, it times out and resets state

- Add code to reset state if any API fails on the provider once we prepare for BDX transfer

- Stop polling when BDX transfer reset is called

- Return QueryImageResponse status busy instead of general failure if the sdk is busy and
  gets a sexond query image so accessory can handle the error correctly and retry until the
  sdk is done

- When we send a message, handle the case where if an error happens in sending the message
  close the exchange context and reset state. If the message is a status report and it
  succeeds, null out the exchange and reset state

* Update src/darwin/Framework/CHIP/MTROTAProviderDelegateBridge.mm

Co-authored-by: Boris Zbarsky <bzbarsky@apple.com>

* Update src/darwin/Framework/CHIP/MTROTAProviderDelegateBridge.mm

Co-authored-by: Boris Zbarsky <bzbarsky@apple.com>

* Reset mStopPolling to false when we call PrepareForTransfer to start polling again

- This covers the use case where we have an image available but the requestor rejects
  the transfer for some reason. On the next query image the requestor is able to
  continue with the BDX transfer. The provider state was not being reset for this scenario
  due to the polling flag not being reset.

- Fixes #23573

* Addressed review comments

* Fix the handling of the use case when hasUpdate is false we need to send the response as is

- Log the error if the BDX timer fails to start

---------

Co-authored-by: Boris Zbarsky <bzbarsky@apple.com>
lecndav pushed a commit to lecndav/connectedhomeip that referenced this issue Mar 22, 2023
project-chip#24777)

* Add a timer to track whether we have received BDX init after a query image was successful

- Currently if an ota requester has a successful query image and an image it available but
  if for any reason, the ota requester doesn't send BDX init, the provider will be stuck until
  we reboot the resident. In order to have a fail safe, we are adding a timer that starts after
  query image returns image available and waits for a BDX init to come. In case BDX init doesn't
  come, it times out and resets state

- Add code to reset state if any API fails on the provider once we prepare for BDX transfer

- Stop polling when BDX transfer reset is called

- Return QueryImageResponse status busy instead of general failure if the sdk is busy and
  gets a sexond query image so accessory can handle the error correctly and retry until the
  sdk is done

- When we send a message, handle the case where if an error happens in sending the message
  close the exchange context and reset state. If the message is a status report and it
  succeeds, null out the exchange and reset state

* Update src/darwin/Framework/CHIP/MTROTAProviderDelegateBridge.mm

Co-authored-by: Boris Zbarsky <bzbarsky@apple.com>

* Update src/darwin/Framework/CHIP/MTROTAProviderDelegateBridge.mm

Co-authored-by: Boris Zbarsky <bzbarsky@apple.com>

* Reset mStopPolling to false when we call PrepareForTransfer to start polling again

- This covers the use case where we have an image available but the requestor rejects
  the transfer for some reason. On the next query image the requestor is able to
  continue with the BDX transfer. The provider state was not being reset for this scenario
  due to the polling flag not being reset.

- Fixes project-chip#23573

* Addressed review comments

* Fix the handling of the use case when hasUpdate is false we need to send the response as is

- Log the error if the BDX timer fails to start

---------

Co-authored-by: Boris Zbarsky <bzbarsky@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants