esp_timer_start_once - LoadProhibited on the second timer_armed call 1 in a million times

jcsbanks
Posts: 305
Joined: Tue Mar 28, 2017 8:03 pm

esp_timer_start_once - LoadProhibited on the second timer_armed call 1 in a million times

Postby jcsbanks » Tue Jul 30, 2019 6:35 pm

Code: Select all

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC      : 0x40089a1b  PS      : 0x00060130  A0      : 0x80081ee3  A1      : 0x3ffaf950
0x40089a1b: timer_armed at C:/msys32/home/jcsba/esp/esp-idf/components/esp32/esp_timer.c:443

A2      : 0x00000000  A3      : 0x000007e0  A4      : 0x3ffaf960  A5      : 0x00000008
A6      : 0x2616bdb6  A7      : 0x2616bdb6  A8      : 0x3ffafa80  A9      : 0x00000001
A10     : 0x00000000  A11     : 0x00000000  A12     : 0x00060120  A13     : 0x00000001
A14     : 0x00060123  A15     : 0x00000000  SAR     : 0x00000008  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000  LBEG    : 0x4008e7b9  LEND    : 0x4008e7ed  LCOUNT  : 0xffffffff
0x4008e7b9: memcpy at /home/jeroen/esp8266/esp32/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/machine/xten
sa/../../../../.././newlib/libc/machine/xtensa/memcpy.S:291

0x4008e7ed: memcpy at /home/jeroen/esp8266/esp32/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/machine/xten
sa/../../../../.././newlib/libc/machine/xtensa/memcpy.S:324


ELF file SHA256: 3faba7506ae8f3a2274420df831d2d67cb00ec1d93b56dcdc10a065e478ad173

Backtrace: 0x40089a1b:0x3ffaf950 0x40081ee0:0x3ffaf970 0x400db723:0x3ffaf990 0x400d32a4:0x3ffaf9b0 0x400d330b:0x3ffaf9d0 0x40095285:0x
3ffaf9f0
0x40089a1b: timer_armed at C:/msys32/home/jcsba/esp/esp-idf/components/esp32/esp_timer.c:443

0x40081ee0: esp_timer_start_once at C:/msys32/home/jcsba/esp/esp-idf/components/esp32/esp_timer.c:443

0x400db723: continue_send at C:/Users/jcsba/Documents/GitHub/new1/main/isotp.c:281

0x400d32a4: timer_process_alarm at C:/msys32/home/jcsba/esp/esp-idf/components/esp32/esp_timer.c:443

0x400d330b: timer_task at C:/msys32/home/jcsba/esp/esp-idf/components/esp32/esp_timer.c:443

0x40095285: vPortTaskWrapper at C:/msys32/home/jcsba/esp/esp-idf/components/freertos/port.c:435


Rebooting...
In a timer callback, every 350us or so, I restart the timer. About 1 in a million times (not reproducible) I get a crash like above.

I take precautions to try to not have race conditions, but could not get on with critical sections around deleting and setting of timers or it would produce interrupt watchdog timeouts about 1 in 100000 times.

I'm looking at esp_timer.c to try to work out if it is something I could be doing wrong, but it is the second timer_armed call that does it. I do check the pointer passed to esp_timer_start once and cannot see anything else I'm doing wrong, but both my and the timer code look to have opportunities for tasks on different cores to fight each other where checks are done outside a criticial section.

Previously some similar timer delete bugs were resolved by changing to unicore, but the performance suffered and I am presently in dual core this is my last known remaining bug. I have critical sections and state variables protecting what I can but have to exit them to do timer activities or the WDT timeout (even when extended from 300ms).

esp-idf release 3.3 (beta 3) with WROVER 16MB/8MB.

I have not applied the PSRAM patch as could not work out the details on how it worked in Windows, but prob not related because only big mallocs will go in PSRAM by default, but I heavily use PSRAM with no evident crashes from that so far.

Any hints further where to look appreciated. It is a difficult one to solve.

I cannot produce a minimal example, the project is over 10000 lines of code with many tasks running. I can run things continuously for a long time with no signs of memory leaks, corrupt data, heap corruption, heap fragmentation etc. Then this randomly, it seems like a race condition between the cores.

I might be able to solve it by pinning certain tasks to the same core, but would like to solve it "properly" before doing kludges.

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: esp_timer_start_once - LoadProhibited on the second timer_armed call 1 in a million times

Postby ESP_Angus » Fri Aug 16, 2019 2:57 am

Hi jcsbanks,

Sorry noone got back to you earlier. A LoadProhibitedError with an EXCVADDR of 0x00000000 suggests that a null pointer was dereferenced. The code for timer_armed(), where the crash occurs, is very simple:

Code: Select all

static IRAM_ATTR bool timer_armed(esp_timer_handle_t timer)
{
    return timer->alarm > 0;
}
And as "alarm" is the first field in the esp_timer_handle_t structure, we can infer this function is called with timer == NULL.

The call stack implies that timer_process_alarm() is calling "continue_send" (implemented in your project isotp.c file) and then this function is calling esp_timer_start_once(NULL, ...) - ie the timer handle parameter is set to NULL.

We should add a NULL check in this function, but adding this check will just make the function return an error rather than necessarily make it work as you expect. You can add the NULL check in continue_send() if necessary, although this might not be a solution depending on the root cause.

Without seeing the implementation of continue_send() and the rest of the code it's hard to guess exactly what happens here, but if I had to guess then I would guess the root cause is one of two things:
  • There's a race condition where some other task is setting the timer handle to NULL at the same time continue_send() is being called.
  • There's memory corruption and the timer handle memory address is being incorrectly overwritten with NULL from somewhere else. Hopefully it's not this as this can be very hard to debug, as any other piece of code could be corrupting memory.

jcsbanks
Posts: 305
Joined: Tue Mar 28, 2017 8:03 pm

Re: esp_timer_start_once - LoadProhibited on the second timer_armed call 1 in a million times

Postby jcsbanks » Fri Aug 16, 2019 8:30 am

Thanks ESP_Angus. I had solved it by deleting and making a new timer instead of reusing it. Whilst this seems wasteful it meant that fewer critical sections and rechecks were needed in between timer alarm and start once and so far it is working well with heavy testing.

Your input was useful to understand what was going on and as always most appreciated. Thanks!

Who is online

Users browsing this forum: Google [Bot] and 144 guests