Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using this code renders I2C unstable on ESP32/arduino? #4

Closed
marcmerlin opened this issue Mar 14, 2017 · 84 comments
Closed

Using this code renders I2C unstable on ESP32/arduino? #4

marcmerlin opened this issue Mar 14, 2017 · 84 comments

Comments

@marcmerlin
Copy link

Thanks for the code, it works great, except that as soon as I use it on my project, talking to I2C causes a crash:
0x400d65ee: i2cRead at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/esp32-hal-i2c.c line 379
0x400d3e3e: TwoWire::requestFrom(unsigned char, unsigned int, bool) at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 143
0x400d3e98: TwoWire::requestFrom(int, int) at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 143
0x400d44e6: IoTuz::i2cexp_read() at /home/merlin/Arduino/libraries/IoTuz/IoTuz.cpp line 130

If I disable your code, then my I2C read which has been working for over a month, works again.

Are you using the same part of the chip, or is there some kind of incompatibility between using RMT and I2C?

@MartyMacGyver
Copy link
Owner

Do you have any sample code? What pins are you assigning for I2C and for the ws2812 output? I won't be able to test this til after this weekend, but more info will allow me to see if I can reproduce the problem. If I2C and RMT.dont play well together that'll be disappointing... I have enough trouble with I2C on the ESP32 as it is (in terms of timing and reliability).

@marcmerlin
Copy link
Author

Yeah, I've been trying to narrow it down to something simple that causes a crash on top of your demo1 code, but so far no luck. And yet, the more I move the LED code inside my code (that does a lot more), I get all kinds of Wire/I2C crashes.
I need to work on this more and will get back to you.
3 examples of different crashes below:

0x40083a21: xQueueGenericReceive at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/freertos/./queue.c line 1998
0x400d638a: i2cRead at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/esp32-hal-i2c.c line 379
0x400d3bda: TwoWire::requestFrom(unsigned char, unsigned int, bool) at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 143
0x400d3c34: TwoWire::requestFrom(int, int) at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 143
0x400d4282: IoTuz::i2cexp_read() at /home/merlin/Arduino/libraries/IoTuz/IoTuz.cpp line 130
0x400d42a4: IoTuz::read_encoder_button() at /home/merlin/Arduino/libraries/IoTuz/IoTuz.cpp line 160
0x400d1856: setup() at /home/merlin/prg/arduino/ESP32/IoTuz/examples/fulldemo/fulldemo.ino line 448
0x400e2482: loopTask(void*) at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/main.cpp line 11 (discriminator 1)

0x400d61c8: i2cWrite at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/esp32-hal-i2c.c line 379
0x400d3c0c: TwoWire::endTransmission(unsigned char) at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 143
0x400d3c40: TwoWire::endTransmission() at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 143
0x400d3fa5: IoTuz::pcf8574_write_(unsigned char) at /home/merlin/Arduino/libraries/IoTuz/IoTuz.cpp line 104
0x400d3fc3: IoTuz::i2cexp_clear_bits(unsigned char) at /home/merlin/Arduino/libraries/IoTuz/IoTuz.cpp line 115
0x400d4006: IoTuz::get_touch() at /home/merlin/Arduino/libraries/IoTuz/IoTuz.cpp line 200
0x400d16ee: get_selection() at /home/merlin/prg/arduino/ESP32/IoTuz/examples/fulldemo/fulldemo.ino line 448
0x400d393b: loop() at /home/merlin/prg/arduino/ESP32/IoTuz/examples/fulldemo/fulldemo.ino line 634
0x400e247d: loopTask(void*) at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/main.cpp line 11 (discriminator 1)

0x400d65b3: i2cSetFrequency at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/esp32-hal-i2c.c line 379
0x400d3b51: TwoWire::begin(int, int, unsigned int) at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 143
0x400d5a0c: Adafruit_ADXL345_Unified::begin() at /home/merlin/Arduino/libraries/Adafruit_ADXL345/Adafruit_ADXL345_U.cpp line 291
0x400d41ab: IoTuz::begin() at /home/merlin/Arduino/libraries/IoTuz/IoTuz.cpp line 325
0x400d17ab: setup() at /home/merlin/prg/arduino/ESP32/IoTuz/examples/fulldemo/fulldemo.ino line 448
0x400e249e: loopTask(void*) at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/main.cpp line 11 (discriminator 1)

@marcmerlin
Copy link
Author

@MartyMacGyver so I think I'm hitting a compiler bug or something, adding your code causes crashes in I2C later reliably. Those crashes go away if I disable your code, or if I comment out 1 lines of simple code:
pixel_x = constrain(pixel_x + move_x, 2, 318);
#if 0
pixel_y = constrain(pixel_y + move_y, 2, 238);
#endif
that does not get run at all before the crash happens way way earlier in setup()

@marcmerlin
Copy link
Author

Will be discussing this on https://gitter.im/espressif/arduino-esp32 and will report back

@MartyMacGyver
Copy link
Owner

I'm not sure what constrain() is but subtle problems like this sometimes point to code bugs like uninitialized pointers (in your code, my code, or perhaps the base Arduino-ESP32 code...). A compiler bug is possible but IMHO less likely. This code seems pretty straightforward.

@marcmerlin
Copy link
Author

Thanks. Good point on the uninitialized pointer suggestion.
Is there a way to compile ESP32 code with the equivalent of eletric fence (http://elinux.org/Electric_Fence) or something more fancy like valgrind?
Having the code crash in the place where the bug actually is sure would make things a lot easier :)

@MartyMacGyver
Copy link
Owner

I've never tried that on this platform.

@marcmerlin
Copy link
Author

Ok, so I did more debugging and basically my code crashes soon after I run ws2812_setColors(NUMPIXELS, pixels);
(with NUMPIXELS=2)
If move that statement around, it crashes after the first call.
If I did the function to return early (before copyToRmtBlock_half() )
then the code never crashes
I just don't understand what that code does very well, so it's harder to debug from there.
If you have suggestion let me know, if not I'll look further and try to understand what the code does.

@MartyMacGyver
Copy link
Owner

I can't test this on hardware til next week, but on what schedule are you reading your I2C device? Polled or interrupt driven? The RMT scheme uses interrupts and I wonder if that is a factor here.

@marcmerlin
Copy link
Author

I'm talking to the ESP32 Wire library, stuff like this:
Wire.beginTransmission(I2C_EXPANDER);
Wire.write(dt);
error = Wire.endTransmission();
I'm not sure how it works behind the scenes, @me-no-dev might know, but I'm going to assume that it's interrupt driven.

@MartyMacGyver
Copy link
Owner

It doesn't sound like you're doing interrupt processing there (e.g. a sensor that signals data readiness by triggering an interrupt pin), just normal polling and writing. But it might be happening at a lower level in the HAL.

Let us know what gpio pins you have things wired to. It might matter.

@marcmerlin
Copy link
Author

marcmerlin commented Mar 17, 2017

Right, I'm not having ISRs on top to talk to I2C.
Mmmh, actually my only ISR reads simple IO pins for a rotary encoder:
https://github.com/marcmerlin/IoTuz/blob/master/IoTuz.cpp#L70
My pin mapping is here:
https://github.com/marcmerlin/IoTuz/blob/master/IoTuz.h#L50

@marcmerlin
Copy link
Author

Ok, did a bit more narrowing down and basically it got down to 3 lines of code in this function
If I comment out the RMT lines and the SemaphoreTake, my code continues to work.
If I uncomment the semaphore take (without RMT being enabled), code hangs there.
If I uncomment the RMT lines and the semaphore take, then the LEDs change color and the code crashes soon after that. Mmmh, now and I'm not sure why, it seems to hang after the semaphore take too.
As one last test, I tried removing all the semaphore lines, but the code still hangs soon after the RMT commands are run.
Hope this helps.

void ws2812_setColors(uint16_t length, rgbVal *array) {
(...)
ws2812_sem = xSemaphoreCreateBinary();
//RMT.conf_ch[RMTCHANNEL].conf1.mem_rd_rst = 1;
//RMT.conf_ch[RMTCHANNEL].conf1.tx_start = 1;
//xSemaphoreTake(ws2812_sem, portMAX_DELAY);
vSemaphoreDelete(ws2812_sem);
ws2812_sem = NULL;

@marcmerlin
Copy link
Author

marcmerlin commented Mar 17, 2017

I'm just confused as to why even the single xSemaphoreTake(ws2812_sem, portMAX_DELAY); line causes the code to hang when the RMT lines are also commented out.
And still, I can re-enable all that code, comment out a single line of code in a totally different place (code that does not get run) and then everything works:
//pixel_y = constrain(pixel_y + move_y, 2, 238);

Also, with the pixel line above commented out (and LEDs working) if I comment out the 2 RMT lines, then the code hangs in your setColors function on
xSemaphoreTake(ws2812_sem, portMAX_DELAY);
looks like this:
ws2812_sem = xSemaphoreCreateBinary();
//RMT.conf_ch[RMTCHANNEL].conf1.mem_rd_rst = 1;
//RMT.conf_ch[RMTCHANNEL].conf1.tx_start = 1;
xSemaphoreTake(ws2812_sem, portMAX_DELAY);
vSemaphoreDelete(ws2812_sem);
ws2812_sem = NULL;

cannot make sense of that either :(

@MartyMacGyver
Copy link
Owner

Sounds like a race condition...

@marcmerlin
Copy link
Author

between which components? (assuming you have a guess)

@marcmerlin
Copy link
Author

I did another round of tests, because this is driving me crazy :)
At this point I think you're right and there seems to be a problem with how the semaphore works. Sadly I don't quite understand the semaphore code.
Please look at those tests below, hopefully they will put you in the right direction

I've done 3 tests to figure out what's going on, and maybe with the results, you can venture a guess better than me.
My code does this:
Serial.println("Before colors");
ws2812_setColors(NUMPIXELS, pixels);
Serial.println("After colors");

Your code either works, hangs on the semaphore, or succeeds but causes I2C in my code to crash later, or even Serial.println to hang half way through a print. Here are the 3 combinations. Does that give you a better hint of what's going on?

---------------- Test #1
ws2812_sem = xSemaphoreCreateBinary();
RMT.conf_ch[RMTCHANNEL].conf1.mem_rd_rst = 1;
RMT.conf_ch[RMTCHANNEL].conf1.tx_start = 1;
// xSemaphoreTake(ws2812_sem, portMAX_DELAY);

Before colors
Af <- hang (note that it sets the LED, and then causes a hang in the middle of Serial.print later

---------------- Test #2
RMT.conf_ch[RMTCHANNEL].conf1.mem_rd_rst = 1;
RMT.conf_ch[RMTCHANNEL].conf1.tx_start = 1;
xSemaphoreTake(ws2812_sem, portMAX_DELAY);

Before colors
After colors
LEDs turned on, setting up Accelerometer next
/Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/freertos/./queue.c:1444 (xQueueGenericReceive)- assert failed!
0x40083a21: xQueueGenericReceive at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/freertos/./queue.c line 1998
0x400d68bb: i2cSetFrequency at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/esp32-hal-i2c.c line 379
0x400d4969: TwoWire::begin(int, int, unsigned int) at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 143

---------------- Test #3
// RMT.conf_ch[RMTCHANNEL].conf1.mem_rd_rst = 1;
// RMT.conf_ch[RMTCHANNEL].conf1.tx_start = 1;
xSemaphoreTake(ws2812_sem, portMAX_DELAY);

Before colors
<= hang likely on xSemaphoreTake, never returns

@projectgus
Copy link

projectgus commented Mar 22, 2017

This code may be the problem (or part of it):
https://github.com/MartyMacGyver/ESP32-Digital-RGB-LED-Drivers/blob/master/arduino-esp32/demo1/ws2812.cpp#L284

I'm guessing the intention is that ws2812_buffer is not freed until after the RMT interrupt has fully copied it into the RMT peripheral memory?

As implemented, it has at least one race condition causing a use-after-free. This may may lead to the kind of crashes you're seeing. The race is - if the RMT interrupt happens after vSemaphoreDelete, but before ws2812_sem is nulled out, then xQueueGiveFromISR in the ISR will access freed memory.

Rather than create and delete the semaphore each time, I'd suggest creating it once for the life of the driver, and use xSemaphoreTake(ws2812_sem, 0) (ignoring any failure) at the point where you currently re-create it.

Although you may also want to think about whether you can structure this code without the semaphore at all - using a queue of RMT items or something similar.

@MartyMacGyver
Copy link
Owner

MartyMacGyver commented Mar 22, 2017

Interesting - this is where I'm leaning on @FozzTexx's original code (https://github.com/FozzTexx/ws2812-demo) so it's a latent defect. I will look at how to fix this. At the very least getting rid of the race condition will be the first step.

When you say a queue of RMT items, do you have an example or reference for how that might work?

@me-no-dev
Copy link

This is an example where raw udp packet is pushed to a queue so it can be polled later. And this is where that happens. I hope it helps :)

@marcmerlin
Copy link
Author

@projectgus a huge thank you for the analysis and finding the issue
@me-no-dev thanks for the pointers on better ways to do this.
For what it's worth, after I upgraded to the latest esp-idf, my code that I had tweaked to randomly not trigger the bug, started failing instantly.
@MartyMacGyver looking forward to trying your new code and thanks again for writing that library to start with.

@MartyMacGyver
Copy link
Owner

Thank @FozzTexx... he wrote it, I just took it and improved upon it. (That said, whatever I do to make it work right will need to be fed back to his as well, for completeness.) First I need to be able to reproduce the issue on my own setup, so I can at least see if I can make the problem go away with said changes. Will work on it some tonight.

* And there is no small irony that your code is what led me to find this code which in turn you ended up using. I thought I recognized your name here... :-D

@marcmerlin
Copy link
Author

@MartyMacGyver yes I just realized 2 days ago that you were the same person on the adafruit issue saying that my patch wasn't so good (which indeed is true) :)
On reproducing, how about you send proposed code my way? I can reproduce at will here.

@MartyMacGyver
Copy link
Owner

@marcmerlin Yeah, I didn't mean to hate on your code - it's just that bit-banging becomes bang-head-on-wall-ing with the ESP32. If it wasn't for @FozzTexx's original code to use the RMT peripheral I might've given up.

(As it is, I'm dealing with fundamental I2C issues on this platform that are really vexing, but that's another matter - out of curiosity, have you had any issues with I2C devices dropping offline, glitching/sending corrupt data, or otherwise acting up versus other devices like the Teensy?)

@marcmerlin
Copy link
Author

marcmerlin commented Mar 23, 2017

@MartyMacGyver no worries, no offence taken, you and me-no-dev were both right that it was a dead end route, just the quickest one I could take for now, and still the one I'm using until I can use your code again.
For I2C, my board has an I/O expander I'm currently writing a lot to, to do PWM on my LCD's backlight (behind the I2C port expander) and it seems solid so far.

@MartyMacGyver
Copy link
Owner

@marcmerlin While I can't reproduce the errant behavior on my setup (probably because I'm not doing I2C in an ISR), I did make the changes suggested by @projectgus (but I'm still using semaphores for now). If you try to take the semaphore before the RMT.conf_ch[RMTCHANNEL].conf1.tx_start = 1; part it doesn't work at all, so I left it as it is except I'm only creating the semaphore once and never delete it (though at some point I intend to make this a class and put the sem delete and the buffer free in the destructor).

Give branch bugfix_i2c_crash a try and let me know if it works for you - if it does, I'll integrate that change as well as a small LED timing fix (better timing compliance to the spec for the WS2812B, though I suspect it won't have any visible effect whatsoever).

MartyMacGyver added a commit that referenced this issue Mar 23, 2017
@marcmerlin
Copy link
Author

marcmerlin commented Mar 23, 2017

Thanks for the attempt. Sadly I get the same result:.
Guru Meditation Error of type LoadProhibited occurred on core 1. Exception was unhandled.
Register dump:
PC : 0x400d88a6 PS : 0x00060230 A0 : 0x800d4f45 A1 : 0x3ffc7a50
A2 : 0x3ffc11d0 A3 : 0x00000190 A4 : 0x00000064 A5 : 0x000000c8
A6 : 0x00ff0000 A7 : 0xff000000 A8 : 0x800d889b A9 : 0xffffc000
A10 : 0x00000000 A11 : 0x00000000 A12 : 0x3ffc4134 A13 : 0x00000000
A14 : 0x00000000 A15 : 0xff000000 SAR : 0x00000001 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000 LBEG : 0x400014fd LEND : 0x4000150d LCOUNT : 0xffffffff

Decoding 9 results
0x400d88a6: i2cSetFrequency at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/esp32-hal-i2c.c line 379
0x400d88a6: i2cSetFrequency at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/esp32-hal-i2c.c line 379
0x400d4f45: TwoWire::begin(int, int, unsigned int) at /home/merlin/Arduino/hardware/espressif/esp32/libraries/Wire/src/Wire.cpp line 136
0x400d633c: Adafruit_ADXL345_Unified::begin() at /home/merlin/Arduino/libraries/Adafruit_ADXL345/Adafruit_ADXL345_U.cpp line 291
0x400d4833: IoTuz::begin() at /home/merlin/Arduino/libraries/IoTuz/IoTuz.cpp line 350
0x400d1eb7: setup() at /home/merlin/prg/arduino/ESP32/IoTuz/examples/fulldemo/fulldemo.ino line 445
0x400e5b12: loopTask(void*) at /home/merlin/Arduino/hardware/espressif/esp32/cores/esp32/main.cpp line 17

Then I go back to the adafruit bit bang, and the I2C crash goes away

@MartyMacGyver
Copy link
Owner

MartyMacGyver commented Mar 25, 2017

IoTuz_wip.tar.gz

Attached is a greatly reduced example that demonstrates the problem. If Wire.begin() happens before ws2812_setColors(), a subsequent call to Wire.endTransmission() will cause a crash. However, if the little block of constrain()-using code in loop() (which comes after all that) is removed... it works fine.

The crash is slightly different now, though strategic placement of extra superfluous blocks of code can change the characteristics of the error slightly it's always connected to i2cWrite() in the hal.

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0008,len:8
load:0x3fff0010,len:2016
load:0x40078000,len:7780
ho 0 tail 12 room 4
load:0x40080000,len:252
entry 0x40080034
Serial Begin
Talking to I2C device
  -- Wire.begin()
Turn on LEDs
Talking to I2C device (continued)
  -- Begin transmission
  -- Write data
  -- End transmission
/Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/freertos/./queue.c:1444 (xQueueGenericReceive)- assert failed!
abort() was called at PC 0x40083759
Guru Meditation Error: Core  1 panic'ed (abort)

Backtrace: 0x40008155:0x3ffc72a0 0x40007d16:0x3ffc72c0 0x400d10dc:0x3ffc7300 0x400d0e18:0x3ffc7350 0x400d0e4c:0x3ffc7370 0x400d07be:0x3ffc7390 0x400d8f9a:0x3ffc73b0

CPU halted.

Decoding 11 results
0x40080000: _WindowOverflow4 at ?? line ?
0x40080034: _WindowOverflow4 at ?? line ?
0x40083759: xQueueGenericReceive at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/freertos/./queue.c line 1998
0x400d10dc: i2cWrite at hardware\espressif\esp32\cores\esp32/esp32-hal-i2c.c line 379
0x400d0e18: TwoWire::endTransmission(unsigned char) at hardware\espressif\esp32\libraries\Wire\src/Wire.cpp line 143
0x400d0e4c: TwoWire::endTransmission() at hardware\espressif\esp32\libraries\Wire\src/Wire.cpp line 143
0x400d07be: setup() at IoTuz_wip\fulldemo/fulldemo.ino line 97
0x400d8f9a: loopTask(void*) at hardware\espressif\esp32\cores\esp32/main.cpp line 17 (discriminator 1)

At this point it's either a compiler bug, a code bug, or an interaction with system semaphores.

@marcmerlin If you get a moment, build this code as-is and see if it causes a crash for you too. It should. Then remove the constrain() code as noted code comment and let me know if it works again.

@marcmerlin
Copy link
Author

@MartyMacGyver good job taking the code way down still. Yes, I can confirm it crashes for me still.
Note that as said earlier, you don't need constrain. It crashes with this:
zzza = zzza+1;
zzzb = zzzb+1;
Then, you can make the code work by simply commenting out the single line:
zzzb = zzzb+1;

@MartyMacGyver
Copy link
Owner

I kept sprinkling in constrain() because it was quick and effective. I kind of doubt it's a problem itself.

I decided to create a crashing and non-crashing build and disassemble each with xtensa-esp32-elf-objdump.exe -d fulldemo.ino.elf. The results are attached:

good_and_bad.tar.gz

Here's what I changed, along with the relevant dump:

Version 1 (latest version from earlier upload, crashes)

Relevant code:

void loop() {
    Serial.println("Constrain something");

    // Remove this block though, and it all works again...
    {
      static float zzza = 5;
      static float zzzb = 5;
      zzza = constrain(zzza, 1, 9);
      zzzb = constrain(zzzb, 1, 9);
    }
    
    Serial.println("Constraint complete");
    Serial.println("Halting");
    while(1) {}

}

Disassembly of loop():

400d0690 <_Z4loopv>:
400d0690:	004136        	entry	a1, 32
400d0693:	fe6421        	l32r	a2, 400d0024 <_stext+0xc>
400d0696:	fe62b1        	l32r	a11, 400d0020 <_stext+0x8>
400d0699:	02ad      	mov.n	a10, a2
400d069b:	0127a5        	call8	400d1914 <_ZN5Print7printlnEPKc>
400d069e:	fe6281        	l32r	a8, 400d0028 <_stext+0x10>
400d06a1:	fe5d31        	l32r	a3, 400d0018 <_stext>
400d06a4:	000813        	lsi	f1, a8, 0
400d06a7:	fa0350        	wfr	f0, a3
400d06aa:	4b0100        	olt.s	b0, f1, f0
400d06ad:	111076        	bt	b0, 400d06c2 <_Z4loopv+0x32>
400d06b0:	fe5b31        	l32r	a3, 400d001c <_stext+0x4>
400d06b3:	fa2350        	wfr	f2, a3
400d06b6:	4b0210        	olt.s	b0, f2, f1
400d06b9:	0b1076        	bt	b0, 400d06c8 <_Z4loopv+0x38>
400d06bc:	0002c6        	j	400d06cb <_Z4loopv+0x3b>
400d06bf:	000000        	ill
400d06c2:	fa1350        	wfr	f1, a3
400d06c5:	000086        	j	400d06cb <_Z4loopv+0x3b>
400d06c8:	fa1350        	wfr	f1, a3
400d06cb:	004813        	ssi	f1, a8, 0
400d06ce:	fe5781        	l32r	a8, 400d002c <_stext+0x14>
400d06d1:	000813        	lsi	f1, a8, 0
400d06d4:	4b0100        	olt.s	b0, f1, f0
400d06d7:	0e1076        	bt	b0, 400d06e9 <_Z4loopv+0x59>
400d06da:	fe5031        	l32r	a3, 400d001c <_stext+0x4>
400d06dd:	fa0350        	wfr	f0, a3
400d06e0:	4b0010        	olt.s	b0, f0, f1
400d06e3:	021076        	bt	b0, 400d06e9 <_Z4loopv+0x59>
400d06e6:	000086        	j	400d06ec <_Z4loopv+0x5c>
400d06e9:	fa1000        	mov.s	f1, f0
400d06ec:	fe51b1        	l32r	a11, 400d0030 <_stext+0x18>
400d06ef:	20a220        	or	a10, a2, a2
400d06f2:	004813        	ssi	f1, a8, 0
400d06f5:	0121e5        	call8	400d1914 <_ZN5Print7printlnEPKc>
400d06f8:	fe4fb1        	l32r	a11, 400d0034 <_stext+0x1c>
400d06fb:	20a220        	or	a10, a2, a2
400d06fe:	012165        	call8	400d1914 <_ZN5Print7printlnEPKc>
400d0701:	ffff06        	j	400d0701 <_Z4loopv+0x71>

Version 2 (slight code change, doesn't crash)

Relevant Code:

void loop() {
    Serial.println("Constrain something");

    // Remove this block though, and it all works again...
    {
      static float zzza = 5;
      static float zzzb = 5;
      zzza = constrain(zzzb, 1, 9); // <<-- used the other var
      zzzb = constrain(zzza, 1, 9); // <<-- used the other var
    }
    
    Serial.println("Constraint complete");
    Serial.println("Halting");
    while(1) {}
}

Disassembly of loop():

400d068c <_Z4loopv>:
400d068c:	004136        	entry	a1, 32
400d068f:	fe6521        	l32r	a2, 400d0024 <_stext+0xc>
400d0692:	fe63b1        	l32r	a11, 400d0020 <_stext+0x8>
400d0695:	02ad      	mov.n	a10, a2
400d0697:	012665        	call8	400d18fc <_ZN5Print7printlnEPKc>
400d069a:	fe6381        	l32r	a8, 400d0028 <_stext+0x10>
400d069d:	fe5e31        	l32r	a3, 400d0018 <_stext>
400d06a0:	000803        	lsi	f0, a8, 0
400d06a3:	fa2350        	wfr	f2, a3
400d06a6:	4b0020        	olt.s	b0, f0, f2
400d06a9:	371076        	bt	b0, 400d06e4 <_Z4loopv+0x58>
400d06ac:	fe5c31        	l32r	a3, 400d001c <_stext+0x4>
400d06af:	fa1350        	wfr	f1, a3
400d06b2:	4b0100        	olt.s	b0, f1, f0
400d06b5:	111076        	bt	b0, 400d06ca <_Z4loopv+0x3e>
400d06b8:	4b0020        	olt.s	b0, f0, f2
400d06bb:	251076        	bt	b0, 400d06e4 <_Z4loopv+0x58>
400d06be:	4b0100        	olt.s	b0, f1, f0
400d06c1:	051076        	bt	b0, 400d06ca <_Z4loopv+0x3e>
400d06c4:	000146        	j	400d06cd <_Z4loopv+0x41>
400d06c7:	000000        	ill
400d06ca:	fa0100        	mov.s	f0, f1
400d06cd:	fe57b1        	l32r	a11, 400d002c <_stext+0x14>
400d06d0:	02ad      	mov.n	a10, a2
400d06d2:	004803        	ssi	f0, a8, 0
400d06d5:	012265        	call8	400d18fc <_ZN5Print7printlnEPKc>
400d06d8:	fe56b1        	l32r	a11, 400d0030 <_stext+0x18>
400d06db:	02ad      	mov.n	a10, a2
400d06dd:	0121e5        	call8	400d18fc <_ZN5Print7printlnEPKc>
400d06e0:	ffff06        	j	400d06e0 <_Z4loopv+0x54>
400d06e3:	020000        	andb	b0, b0, b0
400d06e6:	86fa      	add.n	a8, a6, a15
400d06e8:	fff8      	l32i.n	a15, a15, 60

That said, most of the disassembly is different. A lot is due to small shifts in the addresses of things, as would be expected. But as we see above, a small change has a surprisingly big effect, though that's the nature of compilers and optimizers... Regardless, somewhere in all of this something is going very wrong and I'd really like to know what it is.

@marcmerlin
Copy link
Author

Thanks, 2 thoughts:

  1. you said you got get it to crash with compiler optimizations turned off? If so, try that
  2. by the same vein, as per my previous comment, replace with a = a +1 which still crashes, and then comment only one of the 2 lines, which is enough to affect the crash
    the two above should give you s much smaller assembly diff hopefully

@marcmerlin
Copy link
Author

By the way, you should probably hang in https://gitter.im/espressif/arduino-esp32 so that we can @ you on relevant stuff :)

@MartyMacGyver
Copy link
Owner

I'll give the no-optimization thing a try again, just to see if it's more informative. I'd like to reduce this further, get rid of the redundant unused functions... but it doesn't want to break then.

I'm in the channel on gitter now, but responses there may be a little delayed.

@marcmerlin
Copy link
Author

Right, I think you can leave the other code around though if you're diffing the assembly output. If you can reduce the crash vs no crash to a simple y = y +1 being commented out, or not, that should give a much smaller diff hopefully.
I did get the code to crash by changing the constrain to var = var + 1 and then made the crash go away by commenting just that single line out.
This should be a smaller diff, right?

@MartyMacGyver
Copy link
Owner

MartyMacGyver commented Mar 25, 2017

Well, I'm also trying to tighten it up some. It's not clear to me exactly why dead code is having any effect, but obviously it is.

I'll post more info later, but (optimization unchanged for now) after making those constrain() changes and trimming things a bit I still get the same xQueueGenericReceive() crash stemming from i2c.

fulldemo.ino.txt

But... rgbLedFade() (renamed to asubfoo4() now - renaming had no effect, just cleaning up) certainly changes things if you remove it. It crashes... in the UART hal.

Turn on LEDs
Guru Meditation Error of type LoadProhibited occurred on core  1. Exception was unhandled.

Decoding 12 results
0x40080000: _WindowOverflow4 at ?? line ?
0x40080034: _WindowOverflow4 at ?? line ?
0x400d1be2: uartWriteBuf at hardware\espressif\esp32\cores\esp32/esp32-hal-uart.c line 279 (discriminator 1)
0x400d1be2: uartWriteBuf at hardware\espressif\esp32\cores\esp32/esp32-hal-uart.c line 279 (discriminator 1)
0x400d17a0: HardwareSerial::write(unsigned char const*, unsigned int) at hardware\espressif\esp32\cores\esp32/HardwareSerial.cpp line 93
0x400d18b9: Print::write(char const*) at hardware\espressif\esp32\cores\esp32/Print.cpp line 85
0x400d18e0: Print::print(char const*) at hardware\espressif\esp32\cores\esp32/Print.cpp line 85
:  (inlined by) Print::println(char const*) at hardware\espressif\esp32\cores\esp32/Print.cpp line 175
0x400d074d: setup() at IoTuz_wip\fulldemo/fulldemo.ino line 85
0x400d8f5a: loopTask(void*) at hardware\espressif\esp32\cores\esp32/main.cpp line 17 (discriminator 1)

That's new - and that suggests that this may not be about I2C after all. Investigating...

@MartyMacGyver
Copy link
Owner

So I used -O0 again instead of -Os. While it makes the objdump output more readable (along with the "-d -s -S" option set) it's still not clear why any of this should affect things.

One observation though: "asubfoo1()" gets converted to code. None of the others unused functions do that.... so that's weird. The .ino is attached (notations within assume -Os, the "remove one line and it crashes differently" thing doesn't seem to occur with -O0 - it crashes just the same as with that extra one line. Remove the block and it stops crashing.

fulldemo.ino.txt

Actually, one other observation - the UART and I2C code is later in the elf file than the setup and loop code. If something is stomping on memory or otherwise leading it to get corrupted, this could be important.

One last thing: all things being equal, if you remove the "Wire.h" import it doesn't crash (at least, I'm not yet seeing it crash in that config). But clearly I'm never calling anything from Wire.h.... so it's back to being either my code or some compiler/linking issue.

@MartyMacGyver
Copy link
Owner

I removed all reference to semaphores in my code and it's all still crashing on the UART hal (but only when the optimization option is -O0 NOT -Os....). If I remove the useless block in loop() it works just fine again. I'll probably put the semaphores back before trying other things.

@MartyMacGyver
Copy link
Owner

Reverted semaphore removal mostly. If I comment out esp_intr_alloc() (and xSemaphoreTake() since it will pause forever otherwise) this doesn't crash no matter -O0 or -Os.

Is there a problem with esp_intr_alloc(ETS_RMT_INTR_SOURCE, 0, ws2812_handleInterrupt, NULL, &rmt_intr_handle)?

@MartyMacGyver
Copy link
Owner

@projectgus , @me-no-dev Do you have any thoughts on this? Similarly random problem in #7 and I'm beginning to wonder if it's interrupt-related rather than semaphores, given the tests I've done so far, but I have no idea what might be going on.

Is there any possibility the RMT has a problem in itself, something like the "random data loss" bug in the errata that requires changing the base address of the registers?

@FozzTexx
Copy link

@projectgus The buffer won't be freed until after the ISR is done with it. ws2812_setColors will stop and wait at xSemaphoreTake until the semaphore is released which doesn't happen until after all of the data has been sent to the RMT.

@MartyMacGyver
Copy link
Owner

Thank you for the comment (and the original code) @FozzTexx! Even ripping out most of the semaphore stuff didn't change the problem, but it's good to know the code is solid (but if it ends up needing the changes @projectgus described or similar, I'll get on that. I doubt this is the cause of the problem though - I suspect it's all triggering some latent core defect.)

And as we discussed elsewhere, this problem appeared before for you, then disappeared. So it's probably not any of the code I added/changed either. Could be there's some weird bug in the code, but it may well be symptomatic of something in the SDK/chipset.

One new thought, especially in light of the I2C thing recently fixed on the Arduino side, is that maybe some critical part of Arduino-ESP32 is out of sync versus esp-idf, leading to this.

That all said, I still wonder if there's something wonky with interrupts on the ESP32 in multiplicity, or if it's a manifestation of a chipset bug known or unknown (like the base register thing).

I await Espressif's input here.

@projectgus
Copy link

projectgus commented Mar 27, 2017

@FozzTexx Thanks, this is good to know. I apologise for only glancing at the code before I first replied. The check for "&& ws2812_sem", and setting ws2812_sem to null after freeing it is what made me suspicious that this interrupt may happen multiple times. However I see now that the whole series of events is intended to happen within the context of a single ws2812_setColors() call, which makes this a non-issue and maybe the null-check & null are unnecessary.

It may still be worth trying an esp_intr_enable(rmt_intr_handle) / esp_intr_disable(rmt_intr_handle) inside the body of the ws2812_setColor though (enable interrupts after creating the semaphore, disable after taking the semaphore but before deleting it). You can also pass a flag to esp_intr_alloc() to disable the interrupt when it is first allocated. This all makes extra certain there isn't a spurious RMT interrupt happening somewhere when either ws2812_sem or ws2812_buffer are invalid. Although ws2812_buffer seems to be only read from the ISR, so that can't be directly responsible for the memory corruption here.

Another potential weirdness that just occured to me may be because Arduino ESP32 HAL functions don't use the esp_intr_alloc() interrupt allocation API yet. Which means that the same interrupt number that is assigned dynamically via esp_intr_alloc() may also be one which is assigned statically to a peripheral in the Arduino HAL. This may be worth checking on, although I'd expect the RMT functionality to never work at all if this was the case.

That all said, I still wonder if there's something wonky with interrupts on the ESP32 in multiplicity, or if it's a manifestation of a chipset bug known or unknown (like the base register thing).

Everything you're describing (LoadStoreErrors randomly coming and going depending on moving things around or on the compiler optimisation level) sounds to me like a memory corruption bug. Something is writing to memory in a place that it shouldn't, sometimes this matters and sometimes it doesn't depending on combinations of execution flow, timing and how things are laid out in memory by the compiler, linker, and runtime order.

Specifically, the LoadStoreErrors with the low EXCVADDR (0x38) that @marcmerlin describes above are an indication of a null pointer dereference - something is pointing to NULL which shouldn't, and the calling code is dereferencing the null pointer as it tries to read a field of a pointed-to-structure at offset 0x38.

I would love to look at this more closely but I haven't had time yet, sorry. If you have any strong indication that it's an IDF-specific bug then I can prioritise it in my work schedule, but at the moment that seems like an open question.

Angus

@MartyMacGyver
Copy link
Owner

@projectgus I appreciate the feedback - it would help if the compiler didn't include random bits of dead code as it's doing (further muddying the isue) - but yes, it does sort-of seem like a pointer problem somewhere. It'll be nice if it's in my code here, but if not then it could be anywhere.

However... if that was the case I'd expect the crashes to be far more random in nature: deterministic perhaps for a given build, but random in how they manifest. Here, the kind of crash is always the same even if it's in different subsystems. I2C crashes on write... UART crashes on write. It's fully repeatable but it's always something that's dealing with interrupts or mutexes, as if an interrupt is returning to the wrong place or interrupts are interfering with each other..

If interrupts are clobbering each other, that would explain one helluva lot. It was unclear to me how many interrupts - UART, I2C, RMT, timers, etc. - can actually co-exist per the specs... but it's probably time to bring Arduino-ESP32 into parity with esp-idf in this important regard.

@marcmerlin
Copy link
Author

@projectgus note that when I use the RMT code, it also seems to break analog read by shifting values by 2x as per #7 but then, sadly, if I try this in a short constrained example, analogread continues to work ok after I enable RMT.
So, they seem to interact, but only depending on what code is around (again). At least in this case I don't get a crash, and analogread is reliably giving me a value that's off by 2x, but it's consistent.

@marcmerlin
Copy link
Author

@MartyMacGyver were the replies from FozzTexx and projectgus helpful to you, or you're still stuck with my example code that crashes when I2C is called?

@MartyMacGyver
Copy link
Owner

MartyMacGyver commented Apr 2, 2017

I was busy with other things this week but hope to look at this and more tonight and tomorrow. If it's the interrupts clobbering each other then I'm not sure yet what the next step should be.

And of course it's not really an I2C issue at this point...

@MartyMacGyver
Copy link
Owner

MartyMacGyver commented Apr 3, 2017

@marcmerlin I see the debug functionality in my code wasn't working, so I've been fixing that tonight. It leads me to wonder if this is some level of Extern Hell wherein the externs necessary to make this thing work with C++ are causing issues. Probably not but who knows?

If I had the hardware and the ESP32 handling for it was reliable, JTAG debugging would be an option.

@projectgus As I search the thread for EXCVADDR with a value of 0x38 I'm not coming up with anything in the original dumps. It's not clear how much that would tell me, given that this bug is wandering anyway, but I'm curious where you saw that particular value.

Edit: I refactored away the constant malloc/free stuff, and made sure to only call xSemaphoreCreateBinary() once. Same error... but before or after refactoring, if I never call xSemaphoreGiveFromISR() and xSemaphoreTake() (one requires the other) it seems to work and doesn't crash (but I'd think that breaks regular functionality in some way). I'd even wager that all of this is directly related to those two calls... but how?

Edit2: Actually, even that didn't help... one thing that catches my eye is the use of RMTMEM (from extern rmt_mem_t RMTMEM in rmt_struct.h) to write to RMT memory directly. Could using this be causing problems?

@projectgus
Copy link

projectgus commented Apr 4, 2017

@projectgus As I search the thread for EXCVADDR with a value of 0x38 I'm not coming up with anything in the original dumps. It's not clear how much that would tell me, given that this bug is wandering anyway, but I'm curious where you saw that particular value.

Sorry, I miscopied - it was 0x18 not 38 in your post here. It's not particularly useful except that these very low EXCVADDR values for LoadStoreError almost always indicate a null pointer dereference - it's saying "load the field at PTR + 0x18" and PTR is unexpectedly zero.

@MartyMacGyver
Copy link
Owner

MartyMacGyver commented Apr 4, 2017

@projectgus Thanks for the clarification.

It's still not clear what the hell is going on or how this is walking on memory or the stack or whatever it's doing to cause this weird behavior. The current code is attached - I've enabled more debugging messages to help trace my way through the code. As it is, it crashes in uartWriteBuf() currently. Remove the unused asubfoo4() definition and it won't crash, but clearly it's fragile. I wonder if the writes to the RMT and RMTMEM structures are involved here - otherwise, I'm just not sure - even removing the semaphores (as noted above) didn't help.

current_code.tar.gz

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0008,len:8
load:0x3fff0010,len:2016
load:0x40078000,len:7780
ho 0 tail 12 room 4
load:0x40080000,len:252
entry 0x40080034
Serial Begin
Turn on LEDs
DEBUG: (1) ws2812_init(); initRMTChannel(); intr_alloc; Driver initted 
init done
Guru Meditation Error of type LoadProhibited occurred on core  1. Exception was unhandled.
Register dump:
PC      : 0x400d13aa  PS      : 0x00060430  A0      : 0x800d0f10  A1      : 0x3ffc76d0  
A2      : 0x3ffc11d0  A3      : 0x3f400ef0  A4      : 0x00000008  A5      : 0x3ffc5960  
A6      : 0x00000003  A7      : 0x00060023  A8      : 0x800d1384  A9      : 0x00000008  
A10     : 0x00000001  A11     : 0x0000007e  A12     : 0x3ffc79d4  A13     : 0x00000000  
A14     : 0x00000000  A15     : 0x3ffc1110  SAR     : 0x00000018  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x0000001d  LBEG    : 0x400014fd  LEND    : 0x4000150d  LCOUNT  : 0xfffffffd  

Backtrace: 0x400d13aa:0x3ffc76d0 0x400d0f10:0x3ffc76f0 0x400d1029:0x3ffc7710 0x400d103c:0x3ffc7730 0x400d06ea:0x3ffc7750 0x400d077f:0x3ffc7770 0x400dbd5a:0x3ffc7790

CPU halted.

Decoding 13 results
0x40080000: _WindowOverflow4 at ?? line ?
0x40080034: _WindowOverflow4 at ?? line ?
0x400d13aa: uartWriteBuf at hardware\espressif\esp32\cores\esp32/esp32-hal-uart.c line 279 (discriminator 1)
0x400d13aa: uartWriteBuf at hardware\espressif\esp32\cores\esp32/esp32-hal-uart.c line 279 (discriminator 1)
0x400d0f10: HardwareSerial::write(unsigned char const*, unsigned int) at hardware\espressif\esp32\cores\esp32/HardwareSerial.cpp line 93
0x400d1029: Print::write(char const*) at hardware\espressif\esp32\cores\esp32/Print.cpp line 227
0x400d103c: Print::print(char const*) at hardware\espressif\esp32\cores\esp32/Print.cpp line 227
0x400d06ea: dumpDebugBuffer(int, char*) at IoTuz_wip\fulldemo/fulldemo.ino line 52
0x400d077f: setup() at IoTuz_wip\fulldemo/fulldemo.ino line 76
0x400dbd5a: loopTask(void*) at hardware\espressif\esp32\cores\esp32/main.cpp line 17 (discriminator 1)

@projectgus
Copy link

@MartyMacGyver I'm not able to reproduce this crash. With current arduino-esp32 and the code you attached above, I get this output: https://gist.githubusercontent.com/projectgus/fc3c595b79b3363a62e35a97fa6b57c5/raw/dbf215084a12fe2f602bf683d4a5def6c3430291/rmt_output.log

Can you confirm the code is the same? If so, can you please attach the ELF & BIN files from the Arduino build directory, that correspond to the crashing code?

@MartyMacGyver
Copy link
Owner

Attaching the code (the same as in the comment above) as well as the bin and elf. It crashes here.

current_code_with_bins.tar.gz

@projectgus
Copy link

I get the same crash you're reporting when building with arduino-esp32 commit 90322ae7, but not with current master (2b075f3204).

After poking around a bit more I don't think this is coincidence, as a fix was just merged in IDF where the CPU1 ISR stack space was located incorrectly. Arduino sketches run on CPU1, and I think this is what was clobbering memory. Will confirm with @me-no-dev that this fix is in the latest arduino-esp32.

@marcmerlin
Copy link
Author

So, I just upgraded to arduino-esp32 master, and the crash is gone for me too.
At the same time, the analogread being wrong by 2x has also been fixed ( espressif/arduino-esp32#290 )
Thank you!

@MartyMacGyver
Copy link
Owner

Yesterday there were just a few commits since my last update, none related. Today... lots more.

I'll update and give it all a try later tonight. I'll be a happy camper if this is working now. :-)

@MartyMacGyver
Copy link
Owner

Fingers crossed, but the IDF update (espressif/arduino-esp32@8032231) appears to have resolved this!

Thanks to @marcmerlin for reporting this, @projectgus and @me-no-dev for help debugging this, and @me-no-dev for updating Arduino-ESP32.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants