Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LoRaWan node ISR stack overflowed #14962

Open
MyCodeMyMoney opened this issue Sep 7, 2020 · 19 comments · Fixed by #15355
Open

LoRaWan node ISR stack overflowed #14962

MyCodeMyMoney opened this issue Sep 7, 2020 · 19 comments · Fixed by #15355
Labels
Area: LoRa Area: LoRa radio support Area: network Area: Networking Area: pkg Area: External package ports Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@MyCodeMyMoney
Copy link
Contributor

Description

LoRaWan node ISR stack overflowed, when receiving error

Steps to reproduce the issue

Set the LoRaWan node to sleep mode, wake up the node every 20s, join and send data after wake-up, and then sleep. If the node repeats this process for a long time, When LoRaWan receives error, it will cause ISR stack overflowed

Actual results

This is the context information after node failure

[semtech-loramac] get dr
[semtech-loramac] MCPS request: confirmed TX
[semtech-loramac] MCPS request: OK
[semtech-loramac] Transmission completed
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] RX started
[semtech-loramac] RX CRC error

Context before hardfault:
r0: 0x00000033
r1: 0x20012eab
r2: 0x200032c0
r3: 0x00000000
r12: 0x0493e000
lr: 0x08005347
pc: 0x0800538c
psr: 0x01000000

FSR/FAR:
CFSR: 0x00008200
HFSR: 0x40000000
DFSR: 0x00000000
AFSR: 0x00000000
BFAR: 0x20012ea9
Misc
EXC_

Cause Analysis

I found that in the following code, the length of the array LoRaMacRxPayload[LORAMAC_PHY_MAXPAYLOAD] is 255, but the parameter size may have an error value greater than 255

static uint8_t LoRaMacRxPayload[LORAMAC_PHY_MAXPAYLOAD];
...
static void OnRadioRxDone( uint8_t *payload, uint16_t size, int16_t rssi, int8_t snr )
{
    ...
        micRx |= ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN];
        micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 1] << 8 );
        micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 2] << 16 );
        micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 3] << 24 );
}

So I added the following code, when the function is called.

if (size > 255)
{
	size = 255
}

But when I continued to test under the above conditions, I found that when running the test for a long time, my LoRaWan node sometimes fell into an infinite loop. The abnormal phenomenon is that the sleep mode cannot be entered again, and there is no debugging information output.

Versions

RIOT Branch

2020.01-branch

Operating System Environment

     Operating System: "Ubuntu" "20.04.1 LTS (Focal Fossa)"
               Kernel: Linux 5.4.0-45-generic x86_64 x86_64
         System shell: /usr/bin/dash (probably dash)
         make's shell: /usr/bin/dash (probably dash)

Installed compiler toolchains

           native gcc: gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
    arm-none-eabi-gcc: arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 9-2019-q4-major) 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599]
              avr-gcc: missing
     mips-mti-elf-gcc: missing
           msp430-gcc: missing
 riscv-none-embed-gcc: missing
 xtensa-esp32-elf-gcc: missing

xtensa-esp8266-elf-gcc: missing
clang: missing

Installed compiler libs

 arm-none-eabi-newlib: "3.1.0"
  mips-mti-elf-newlib: missing

riscv-none-embed-newlib: missing
xtensa-esp32-elf-newlib: missing
xtensa-esp8266-elf-newlib: missing
avr-libc: missing (missing)

Installed development tools

               ccache: missing
                cmake: missing
             cppcheck: missing
              doxygen: missing
                  git: git version 2.25.1
                 make: GNU Make 4.2.1
              openocd: missing
               python: Python 3.8.2
              python2: Python 2.7.18rc1
              python3: Python 3.8.2
               flake8: 3.8.3 (mccabe: 0.6.1, pycodestyle: 2.6.0, pyflakes: 2.2.0) CPython 3.8.2 on
           coccinelle: missing
@aabadie
Copy link
Contributor

aabadie commented Sep 7, 2020

Cause Analysis

The code you are mentioning is from Loramac-node, right ?

So I added the following code, when the function is called.

Where do you add that code ? in $RIOTBASE/pkg/semtech-loramac/contrib/semtech-loramac.c ?
It seems to me that this is a bug in the loramac-node code. Note that RIOT is still using version 4.4.1 of this code and the package needs some refresh (the current version is 4.4.4 from may 2020).

@MyCodeMyMoney
Copy link
Contributor Author

Where do you add that code ? in $RIOTBASE/pkg/semtech-loramac/contrib/semtech-loramac.c ?

I did add the code in $RIOTBASE/pkg/semtech-loramac/contrib/semtech-loramac.c

It seems to me that this is a bug in the loramac-node code. Note that RIOT is still using version 4.4.1 of this code and the package needs some refresh (the current version is 4.4.4 from may 2020).

Can I use LoRaMac-node version 4.4.4 code in RIOT?I modified PKG_VERSION to 4.4.4, but the compilation failed.The branch I use is 2020.01-branch.

@aabadie
Copy link
Contributor

aabadie commented Sep 8, 2020

Can I use LoRaMac-node version 4.4.4 code in RIOT?I modified PKG_VERSION to 4.4.4, but the compilation failed.The branch I use is 2020.01-branch.

Adapting the package to loramac-node 4.4.4 is quite some work since this project doesn't use a standard numbering for their release (4.4.1 => 4.4.2 reorganizes a lot of files, refactor some large part of the code, etc). So they are not patch versions...

@MyCodeMyMoney
Copy link
Contributor Author

Adapting the package to loramac-node 4.4.4 is quite some work since this project doesn't use a standard numbering for their release (4.4.1 => 4.4.2 reorganizes a lot of files, refactor some large part of the code, etc). So they are not patch versions...

Since my node needs to enter the sleep mode during the working process, I cannot solve the above problems through watchdog. Do you have any good suggestions for the information I provided above?

@aabadie
Copy link
Contributor

aabadie commented Sep 8, 2020

What is the platform you are using ? STM32,SAM0 ?

@MyCodeMyMoney
Copy link
Contributor Author

What is the platform you are using ? STM32,SAM0 ?

I am using stm32l151cc platform

@aabadie
Copy link
Contributor

aabadie commented Sep 8, 2020

@fjmolinas has a good knowledge on L1 so maybe we has an intuition on what could be the cause ?

@fjmolinas
Copy link
Contributor

2020.01-branch

Can you test in a newer branch to see if the issue is still present?

But when I continued to test under the above conditions, I found that when running the test for a long time, my LoRaWan node sometimes fell into an infinite loop. The abnormal phenomenon is that the sleep mode cannot be entered again, and there is no debugging information output.

Can you elaborate?

I found that in the following code, the length of the array LoRaMacRxPayload[LORAMAC_PHY_MAXPAYLOAD] is 255, but the parameter size may have an error value greater than 255

That means the issue is upstream right? @aabadie you mention updating the pkg is not a short term option, can you confirm this bug? Should we provide a patch in the meantime?

@MyCodeMyMoney
Copy link
Contributor Author

Can you test in a newer branch to see if the issue is still present?

I tested 2020.01-branch and 2020.04-branch, and they both have the same problem.

Can you elaborate?

There are the following fragments in /bin/pkg/l151-lorawan/semtech-loramac/src/mac/LoRaMac.c.
This function calls LoRaMacRxPayload[size-LORAMAC_MFR_LEN], there is a hidden danger of crossing the boundary, because the length of the array definition is 255

#define LORAMAC_PHY_MAXPAYLOAD 255
...
static uint8_t LoRaMacRxPayload[LORAMAC_PHY_MAXPAYLOAD];
...
static void OnRadioRxDone( uint8_t *payload, uint16_t size, int16_t rssi, int8_t snr )
{
...
micRx |= ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN];
micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 1] << 8 );
micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 2] << 16 );
micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 3] << 24 );
}

There are the following fragments in $RIOTBASE/pkg/semtech-loramac/contrib/semtech-loramac.c
The function OnRadioRxDone is called here, and the length passed to LoRaMacRxPayload [LORAMAC_PHY_MAXPAYLOAD] is still at risk of greater than the maximum length of 255.I found this problem in the actual application caused ISR stack overflowed error in my code.But when I limit the length of len to 255 here, when an exception occurs, my code seems to fall into an infinite loop, unable to enter sleep mode, and no debugging information is output.

static void _semtech_loramac_event_cb(netdev_t *dev, netdev_event_t event)
{
netdev_lora_rx_info_t packet_info;
msg_t msg;
msg.content.ptr = dev
switch (event) {
...
case NETDEV_EVENT_RX_COMPLETE:
{
size_t len;
uint8_t radio_payload[SX127X_RX_BUFFER_SIZE];
len = dev->driver->recv(dev, NULL, 0, 0);
dev->driver->recv(dev, radio_payload, len, &packet_info);
semtech_loramac_radio_events.RxDone(radio_payload,
len, packet_info.rssi,
packet_info.snr);
break;
}
...
}

@fjmolinas fjmolinas added Area: LoRa Area: LoRa radio support Area: network Area: Networking Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Sep 9, 2020
@jia200x
Copy link
Member

jia200x commented Sep 9, 2020

@MyCodeMyMoney

Could you please try the following patch (untested)?

diff --git a/pkg/semtech-loramac/contrib/semtech_loramac.c b/pkg/semtech-loramac/contrib/semtech_loramac.c
index 0cd3a46b4f..368287f517 100644
--- a/pkg/semtech-loramac/contrib/semtech_loramac.c
+++ b/pkg/semtech-loramac/contrib/semtech_loramac.c
@@ -539,10 +539,16 @@ static void _semtech_loramac_event_cb(netdev_t *dev, netdev_event_t event)
 
         case NETDEV_EVENT_RX_COMPLETE:
         {
-            size_t len;
+            int len;
             uint8_t radio_payload[SX127X_RX_BUFFER_SIZE];
-            len = dev->driver->recv(dev, NULL, 0, 0);
-            dev->driver->recv(dev, radio_payload, len, &packet_info);
+            len = dev->driver->recv(dev, radio_payload, SX127X_RX_BUFFER_SIZE, &packet_info);
+            if (len < 0) {
+                /* the RxError call will be handled by the NETDEV_EVENT_CRC_ERROR called inside the recv function */
+                break;
+            }
             semtech_loramac_radio_events.RxDone(radio_payload,
                                                 len, packet_info.rssi,
                                                 packet_info.snr);

Let me know if that works.

@jia200x
Copy link
Member

jia200x commented Sep 9, 2020

I actually found the issue. We have 2 problems:

  1. The NETDEV_EVENT_CRC_ERROR is called inside the dev->driver->recv() function of netdev. As expected, the netdev recv function is called on NETDEV_EVENT_RX_COMPLETE (see here)

The semtech pkg expects the NETDEV_EVENT_RX_COMPLETE event to be "a frame was received and the CRC was OK", and the NETDEV_EVENT_CRC_ERROR to be "a frame was received but the CRC failed. So, the glue code should either call semtech_loramac_radio_events.RxDone or semtech_loramac_radio_events.RxError. Calling both might have undefined consequences.

  1. The recv function return -EBADMSG after a CRC error. This means, the size_t len variable will get very high value that will make the buffer overflow.

Overall, the snippet in #14962 (comment) should solve the issue. For mid term, we should probably move out the NETDEV_EVENT_CRC_ERROR from the recv function.

@jia200x
Copy link
Member

jia200x commented Sep 9, 2020

Still, the RxDone function of the current semtech pkg is unsafe as hell... It might be possible to send a malformed packet and make it crash if the first byte is a JoinAccept frame and the CRC still passes...
Maybe this was fixed in the upstream

@MyCodeMyMoney
Copy link
Contributor Author

@jia200x
I tested this patch file with two boards and they worked for about four days. One of them worked fine, but the other one was working abnormally. The exception log is as follows, LoRaMac repeatedly prints [semtech-loramac] MAC timer timeout forever.

[sx127x] Set sleep
[sx127x] Set op mode: SLEEP
[sx127x] Change state: IDLE
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[sx127x] Set channel: 486700000
[sx127x] set modem: 1
[sx127x] already using modem: 1
[sx127x] Set freq hop: 0
[sx127x] Set bandwidth: 0
[sx127x] Set coding rate: 1
[sx127x] Set spreading factor: 7
[sx127x] Set CRC: 1
[sx127x] Set freq hop: 0
[sx127x] Set Hop period: 0
[sx127x] Set fixed header length: 0
[sx127x] Set IQ invert: 0
[sx127x] Set payload len: 0
[sx127x] Set power: 17
[sx127x] Set preamble length: 8
[sx127x] Set RX single: 0
[sx127x] Set TX timeout: 3000000
[sx127x] Set max payload len: 22
[sx127x] Set payload len: 22
[sx127x] Set standby
[sx127x] Set op mode: STANDBY
[sx127x] Change state: IDLE
[sx127x] Change state: TX
[sx127x] Set op mode: TRANSMITTER
[sx127x] Change state: IDLE
[sx127x] Set sleep
[sx127x] Set op mode: SLEEP
[sx127x] Change state: IDLE
[sx127x] Set sleep
[sx127x] Set op mode: SLEEP
[sx127x] Change state: IDLE
[semtech-loramac] Transmission completed
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[sx127x] Set channel: 507100000
[sx127x] set modem: 1
[sx127x] already using modem: 1
[sx127x] Set bandwidth: 0
[sx127x] Set spreading factor: 7
[sx127x] Set coding rate: 1
[sx127x] Set preamble length: 8
[sx127x] Set fixed header length: 0
[sx127x] Set payload len: 0
[sx127x] Set CRC: 0
[sx127x] Set freq hop: 0
[sx127x] Set Hop period: 0
[sx127x] Set IQ invert: 1
[sx127x] Set symbol timeout: 114
[sx127x] Set RX single: 1
[sx127x] Set max payload len: 235
[sx127x] Set RX timeout: 3000000
[sx127x] Set RX
[sx127x] Change state: RX
[sx127x] Set op mode: RECEIVER SINGLE
[semtech-loramac] RX started
[sx127x] Change state: IDLE
[sx127x] Set sleep
[sx127x] Set op mode: SLEEP
[sx127x] Change state: IDLE
[semtech-loramac] MAC timer timeout
[semtech-loramac] MCPS indication event
[semtech-loramac] MCPS indication msg received
[semtech-loramac] MCPS indication Proprietary
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout

@jia200x
Copy link
Member

jia200x commented Sep 14, 2020

@jia200x
I tested this patch file with two boards and they worked for about four days. One of them worked fine, but the other one was working abnormally. The exception log is as follows, LoRaMac repeatedly prints [semtech-loramac] MAC timer timeout forever.

It seems the MAC got stuck when receiving a Proprietary frame (or something went wrong that made the stack believe that there was a Proprietary frame). I will dig deeper.

FYI, RIOT also has it's own implementation of LoRAWAN (GNRC LoRaWAN). @MichelRottleuthner has been running a low power application for several months with that one. Maybe you could try that one in the meantime?
Check examples/gnrc_lorawan.

Anyway, I will investigate this problem.

@MyCodeMyMoney
Copy link
Contributor Author

@jia200x
Okay, if you make any progress, I would be happy to assist you with the test.

@aabadie
Copy link
Contributor

aabadie commented Oct 7, 2020

@jia200x any progress on this one ?

@jia200x
Copy link
Member

jia200x commented Nov 2, 2020

We should still keep this one open

@jia200x
Copy link
Member

jia200x commented Nov 2, 2020

@jia200x any progress on this one ?

Not yet, I will try to go back to this ASAP

@aabadie aabadie added the Area: pkg Area: External package ports label May 20, 2021
@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jun 21, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
@maribu
Copy link
Member

maribu commented May 17, 2023

Ping?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: LoRa Area: LoRa radio support Area: network Area: Networking Area: pkg Area: External package ports Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
6 participants