Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Matter Thread Accessory Crash Due to Corner Case(TTL=0) #31031

Closed
fuxiaoming-lumi opened this issue Dec 15, 2023 · 11 comments · Fixed by #31227
Closed

[BUG] Matter Thread Accessory Crash Due to Corner Case(TTL=0) #31031

fuxiaoming-lumi opened this issue Dec 15, 2023 · 11 comments · Fixed by #31227
Labels
bug Something isn't working needs triage openthread

Comments

@fuxiaoming-lumi
Copy link
Contributor

fuxiaoming-lumi commented Dec 15, 2023

Reproduction steps

When the accessory performs DNS-SD during operational discovery, it will receive the dns result.
The TTL maybe ZERO in a specific environment. And this will cause the device to crash.
We can 100% reproduce this issue base on lighting-app by Silabs efr32 and NXP K32W.

IEEE 802.15.4 Data, Dst: 0xc83a, Src: 0xc800
6LoWPAN
Internet Protocol Version 6, Src: fd57:a6c0:7758:693b:f72e:7188:4f38:146c, Dst: fd57:a6c0:7758:693b:fd73:7a92:aa1a:3584
User Datagram Protocol, Src Port: 53, Dst Port: 49153
Domain Name System (response)
    Transaction ID: 0x0430
    Flags: 0x8000 Standard query response, No error
    Questions: 2
    Answer RRs: 2
    Authority RRs: 0
    Additional RRs: 0
    Queries
        BD80AB754C4E2BA1-0000000000000003._matter._tcp.default.service.arpa: type SRV, class IN
        BD80AB754C4E2BA1-0000000000000003._matter._tcp.default.service.arpa: type TXT, class IN
    Answers
        BD80AB754C4E2BA1-0000000000000003._matter._tcp.default.service.arpa: type SRV, class IN, priority 0, weight 0, port 5540, target E45F019B70A60000.default.service.arpa
            Service: BD80AB754C4E2BA1-0000000000000003
            Protocol: _matter
            Name: _tcp.default.service.arpa
            Type: SRV (Server Selection) (33)
            Class: IN (0x0001)
            Time to live: 0 (0 seconds)
            Data length: 45
            Priority: 0
            Weight: 0
            Port: 5540
            Target: E45F019B70A60000.default.service.arpa
        BD80AB754C4E2BA1-0000000000000003._matter._tcp.default.service.arpa: type TXT, class IN
            Name: BD80AB754C4E2BA1-0000000000000003._matter._tcp.default.service.arpa
            Type: TXT (Text strings) (16)
            Class: IN (0x0001)
            Time to live: 0 (0 seconds)
            Data length: 30
            TXT Length: 7
            TXT: SII=300
            TXT Length: 8
            TXT: SAI=2000
            TXT Length: 8
            TXT: SAT=4000
            TXT Length: 3
            TXT: T=1
    [Request In: 12525]
    [Time: 5.059264000 seconds]

We get an illegal value(mTextEntrySize:662774400) causing the device to crash.
https://github.com/project-chip/connectedhomeip/blob/master/src/lib/dnssd/Discovery_ImplPlatform.cpp#L325

The key point is here: https://github.com/project-chip/connectedhomeip/blob/master/src/platform/OpenThread/GenericThreadStackManagerImpl_OpenThread.hpp#L2622

 DnsResult * dnsResult = Platform::New<DnsResult>(aContext, MapOpenThreadError(aError));

When the TTL is 0, it is important to ensure that the variable is properly initialized before being passed to the upper layer. Failing to initialize the variable may result in unpredictable parameter values being propagated upwards.
We need to sort out this part of the logic to deal with the situation where TTL is 0 for Matter and define how this part of the behavior should be done correctly.

openthread: https://github.com/openthread/openthread/blob/main/src/core/net/dns_client.cpp#L346

Bug prevalence

100%

GitHub hash of the SDK that was being used

52e4a64

Platform

efr32, k32w

Platform Version(s)

N/A

Anything else?

Device Log:
Uploading C5DC8C3E-CDE4-42b1-BDB6-39536ADD7B57.txt…
DIS:
matter-efr32-lighting-example.zip

@fuxiaoming-lumi fuxiaoming-lumi added bug Something isn't working needs triage labels Dec 15, 2023
@bzbarsky-apple
Copy link
Contributor

@Damian-Nordic

@fuxiaoming-lumi
Copy link
Contributor Author

fuxiaoming-lumi commented Dec 19, 2023

https://github.com/project-chip/connectedhomeip/blob/master/src/platform/OpenThread/GenericThreadStackManagerImpl_OpenThread.hpp#L2670

According to this case, is it necessary to continue querying for the AAAA address if first query response only include SRV + TXT and the TTL is ZERO? What's the best practise here.

Small change for OTBR to easy reproduce this issue. https://github.com/openthread/openthread/blob/main/src/core/net/dnssd_server.cpp#L618

Error Server::Response::AppendTxtRecord(const void *aTxtData, uint16_t aTxtLength, uint32_t aTtl)
{
    Error error = kErrorNone;
    TxtRecord txtRecord;
    uint8_t emptyTxt = 0;
    
    if (aTxtLength == 0)
    {
      aTxtData = &emptyTxt;
      aTxtLength = sizeof(emptyTxt);
    }
    aTtl = 0;
    txtRecord.Init();
    txtRecord.SetTtl(aTtl);
    txtRecord.SetLength(aTxtLength);
    
    SuccessOrExit(error = Name::AppendPointerLabel(mOffsets.mInstanceName, *mMessage));
    SuccessOrExit(error = mMessage->Append(txtRecord));
    SuccessOrExit(error = mMessage->AppendBytes(aTxtData, aTxtLength));
    
    IncResourceRecordCount();
    
    exit:
    return error;
}

After configuring this special otbr, let device perform operational discovery to reproduce.
Following OTA guide and implement chip-tool otasoftwareupdaterequestor announce-otaprovider 1 0 0 0 2 0.

@doru91
Copy link
Contributor

doru91 commented Dec 19, 2023

TTL of 0 in a DNS query is valid for GoodBye packets, when certain resource record data is about to become invalid: https://datatracker.ietf.org/doc/html/rfc6762#section-10.1.

I am still trying to understand how https://github.com/project-chip/connectedhomeip/blob/master/src/lib/dnssd/Discovery_ImplPlatform.cpp#L325 - mTextEntrySize - gets an invalid value.

@fuxiaoming-lumi
Copy link
Contributor Author

fuxiaoming-lumi commented Dec 19, 2023

@doru91

  1. https://github.com/project-chip/connectedhomeip/blob/master/src/platform/OpenThread/GenericThreadStackManagerImpl_OpenThread.hpp#L2622
    We initialize the following variable here but there is no initial value assigned.
    DnsResult * dnsResult = Platform::New<DnsResult>(aContext, MapOpenThreadError(aError));

  2. https://github.com/openthread/openthread/blob/main/src/core/net/dns_client.cpp#L650
    In the case we end up in Client::ServiceResponse::GetServiceInfo here and it will have aServiceInfo.mTxtDataSize = 0 since the TTL is set to 0.

  3. https://github.com/project-chip/connectedhomeip/blob/master/src/platform/OpenThread/GenericThreadStackManagerImpl_OpenThread.hpp#L2388
    Since the aServiceInfo.mTxtDataSize = 0 when we call: OnDnsResolveResult -> FromOtDnsResponseToMdnsData we don't load data into the variable mdnsService.mTextEntrySize properly

  4. https://github.com/project-chip/connectedhomeip/blob/master/src/lib/dnssd/Discovery_ImplPlatform.cpp#L325
    This becomes an issue and leads to the crash due to not initializing the dnsResult upon creation.

@doru91
Copy link
Contributor

doru91 commented Dec 19, 2023

@fuxiaoming-lumi , thx for the explanations, in the end I also discovered the failing flow. For sure we need to fix code that is handling a GoodBye Packet on the device side and maybe set mTextEntrySize == 0 on the else path for https://github.com/project-chip/connectedhomeip/blob/master/src/platform/OpenThread/GenericThreadStackManagerImpl_OpenThread.hpp#L2388.

Now, the question is on whether the GoodBye packet should reach the Matter device. My understanding is that the DNS client is sending a query for getting the IPv6 address of the OTA Provider. This query reaches the DNS server (running on the BR) which first consults the SRP cache and it can't find any entry. Then the DNS server is asking the mDNS proxy to forward the query on LAN. If the mDNS proxy receives the Goodbye Packet from the OTA Provider then probably it shouldn't forward it back to the Matter device but only delete its cache (if the mDNS proxy has a cache - this is optional):

Queriers receiving a Multicast DNS response with a TTL of zero SHOULD
NOT immediately delete the record from the cache, but instead record
a TTL of 1 and then delete the record one second later.  In the case
of multiple Multicast DNS responders on the network described in
[Section 6.6](https://datatracker.ietf.org/doc/html/rfc6762#section-6.6) above, 
if one of the responders shuts down and
incorrectly sends goodbye packets for its records, it gives the other
cooperating responders one second to send out their own response to
"rescue" the records before they expire and are deleted.

@Damian-Nordic
Copy link
Contributor

Damian-Nordic commented Jan 3, 2024

I second @doru91 - it's not clear to me after glancing at RFCs if receiving RR with TTL of zero from the discovery proxy is valid. The original unicast DNS RFC allows RRs with TTL of zero:

Zero values are interpreted to mean that the RR can only be used for the transaction in progress, and should not be cached.

Though given that in multicast DNS TTL of zero means the goodbye packet, I'm not sure in which cases the discovery proxy is allowed to return that as an answer to the standard DNS query.

@fuxiaoming-lumi Which border router did you use to reproduce the original issue?

@fuxiaoming-lumi
Copy link
Contributor Author

fuxiaoming-lumi commented Jan 3, 2024

@Damian-Nordic @doru91 Actually, we can reproduce this issue on SVE2 Raspberry Pi BR. And I agree with doru that we can set mTextEntrySize == 0 on the else path for https://github.com/project-chip/connectedhomeip/blob/master/src/platform/OpenThread/GenericThreadStackManagerImpl_OpenThread.hpp#L2388. This is our workaround for our product or our product willl crash when encountering this problem. I didn't see the Matter Spec describing how we should deal with this situation (when TTL=0). Although TTL=0 is effective, this situation can cause the device to crash based on our Matter SDK(TOT).

@Damian-Nordic
Copy link
Contributor

@fuxiaoming-lumi Thanks, yeah, let me post a PR with this change.

I just wondered if any changes in OpenThread are needed as well.

@abtink Do you maybe remember why OpenThread's DNS client discards TXT records with TTL 0 (see the code below), yet it returns SRV with TTL 0? I wonder how ResolveServiceAndHostAddress should behave in such a case - should it still query for AAAA?

    if (aServiceInfo.mTxtDataTtl == 0)
    {
        aServiceInfo.mTxtDataSize = 0;
    }

@fuxiaoming-lumi
Copy link
Contributor Author

@Damian-Nordic I am very grateful for this. I am not an expert in DNS-SD, so I am not familiar with the correct behavior as a DNS client for situations where TTL=0. I am also very curious whether it is necessary to inquire about the AAAA address, or when we receive TTL=0, we need to determine that this operational discovery has failed. https://github.com/project-chip/connectedhomeip/blob/master/src/platform/OpenThread/GenericThreadStackManagerImpl_OpenThread.hpp#L2670

@Damian-Nordic
Copy link
Contributor

@fuxiaoming-lumi PTAL #31227. I tested it with your OTBR modification, but feel free to verify it on your own :).

@abtink
Copy link

abtink commented Jan 3, 2024

@abtink Do you maybe remember why OpenThread's DNS client discards TXT records with TTL 0 (see the code below), yet it returns SRV with TTL 0? I wonder how ResolveServiceAndHostAddress should behave in such a case - should it still query for AAAA?

A TTL zero indicate TXT record is removed (no longer valid), so OT DNS client does sets mTxtDataSize to zero to indicate no TXT data info. This all seems correct to me.

@mergify mergify bot closed this as completed in #31227 Jan 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage openthread
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants