WDT Timeout on Core 0

berlinetta
Posts: 22
Joined: Tue May 21, 2019 8:33 pm

WDT Timeout on Core 0

Postby berlinetta » Fri Jul 17, 2020 2:09 pm

Hello,

I have a project which utilizes both the BLE and WiFi stacks, where the WiFi operations are restricted solely to the occasional use of AP mode for large data transfers to a mobile application. I am developing under PlatformIO using the latest release of the ESP IDF framework. I have configured the project to pin both stacks to core 0, leaving my application code to execute on core 1.

Our mobile app typically communicates with our device via BLE, but can request the activation of WiFi AP mode when a larger amount of data (~250KB) needs to be transferred via the esp http server. We are utilizing UART2 to receive the data from a second micro-controller in the system at a baud rate of 500Kbps. We transfer data to the mobile app (web-server client) in chunks of 3300 bytes. Although data is being received from the secondary micro at 500Kbps, the transfer is not continuous but at a rate closer to 12K bytes per second.

On most occasions, the operations work fine. However, there are instances where we encounter a WDT Timeout on Core 0. Since this core is reserved for the “black box” code, it would appear there is an issue with how the stacks (and/or the interrupt handler for the UART) are managing their time.

Given the spotty occurrence of the problem, is there any advice for how we may be able to further diagnose the root cause of this issue?

Are there any modifications I can make to the Core 0 configuration which may help mitigate the problem?

Thanks in advance!
Mark

ESP_Sprite
Posts: 3937
Joined: Thu Nov 26, 2015 4:08 am

Re: WDT Timeout on Core 0

Postby ESP_Sprite » Fri Jul 17, 2020 2:29 pm

Do you have the serial output for this? There are multiple WDTs and you're not specifying which one you're having issues with.

berlinetta
Posts: 22
Joined: Tue May 21, 2019 8:33 pm

Re: WDT Timeout on Core 0

Postby berlinetta » Mon Jul 20, 2020 3:39 pm

Hello ESP_Sprite,

Below is the diagnostic output from one of these events...

The application firmware was in the process of uploading the 250KB data file to the mobile application in 3300 byte chunks. In this case, the WDT timeout event occurred while the application firmware was attempting to send the 39th chunk (blocks 3801 - 3900). Each of my diagnostic print statements is prefaced with a local zero-based millisecond timestamp. The successful conclusion of a transfer is indicated with the block count and the status of an internal circular buffer (read pointer, write pointer, remaining buffer space).

62476: MeasureCb: Sending next chunk...
62476: 3500 blocks of 3500 sent (Rd: 0, Wr: 0, 23099 rem buffer)
62726: MeasureCb: Sending next chunk...
62726: 3600 blocks of 3600 sent (Rd: 3300, Wr: 3300, 23099 rem buffer)
62976: MeasureCb: Sending next chunk...
62976: 3700 blocks of 3700 sent (Rd: 6600, Wr: 6600, 23099 rem buffer)
63236: MeasureCb: Sending next chunk...
63236: 3800 blocks of 3800 sent (Rd: 9900, Wr: 9900, 23099 rem buffer)
63476: MeasureCb: Sending next chunk...
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0)
Core 0 register dump:
PC : 0x400811c8 PS : 0x00060034 A0 : 0x400829c8 A1 : 0x3ffbe5c0
A2 : 0x3ffcfa70 A3 : 0x3ff6e000 A4 : 0x80121756 A5 : 0x40097180
A6 : 0x00000000 A7 : 0x00000001 A8 : 0x00000000 A9 : 0x00000000
A10 : 0x00000000 A11 : 0x00000001 A12 : 0x00000000 A13 : 0x00000000
A14 : 0x00000000 A15 : 0x3ffbcdb0 SAR : 0x00000019 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000
Core 0 was running in ISR context:
EPC1 : 0x401aa6e2 EPC2 : 0x00000000 EPC3 : 0x401aa6e2 EPC4 : 0x400811c8

ELF file SHA256: ac814cb30ca1b3fd

Backtrace: 0x400811c5:0x3ffbe5c0 0x400829c5:0x3ffbe5e0 0x401aa6df:0x3ffbcde0 0x40121dc2:0x3ffbce00 0x40098b85:0x3ffbce20 0x40096f55:0x3ffbce40

Core 1 register dump:
PC : 0x401aa6e2 PS : 0x00060034 A0 : 0x80121dc5 A1 : 0x3ffbd550
A2 : 0x00000000 A3 : 0x00000000 A4 : 0x00000001 A5 : 0x80000001
A6 : 0x00000003 A7 : 0x00060023 A8 : 0x80121756 A9 : 0x3ffbd520
A10 : 0x00000000 A11 : 0x00000000 A12 : 0x8009715a A13 : 0x3ffd9640
A14 : 0x00000003 A15 : 0x00060023 SAR : 0x00000000 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000

ELF file SHA256: ac814cb30ca1b3fd

Backtrace: 0x401aa6df:0x3ffbd550 0x40121dc2:0x3ffbd570 0x40098b85:0x3ffbd590 0x40096f55:0x3ffbd5b0

Rebooting...

ESP_Sprite
Posts: 3937
Joined: Thu Nov 26, 2015 4:08 am

Re: WDT Timeout on Core 0

Postby ESP_Sprite » Tue Jul 21, 2020 7:51 am

Can you decode that backtrace?

berlinetta
Posts: 22
Joined: Tue May 21, 2019 8:33 pm

Re: WDT Timeout on Core 0

Postby berlinetta » Tue Jul 21, 2020 2:25 pm

How can I do that for you?

berlinetta
Posts: 22
Joined: Tue May 21, 2019 8:33 pm

Re: WDT Timeout on Core 0

Postby berlinetta » Wed Jul 22, 2020 1:49 pm

OK, based on my mapping file, here is what I believe you were asking for... Please let me know if this is what you needed!

Backtrace: 0x400811c5:0x3ffbe5c0 0x400829c5:0x3ffbe5e0 0x401aa6df:0x3ffbcde0 0x40121dc2:0x3ffbce00 0x40098b85:0x3ffbce20 0x40096f55:0x3ffbce40

0x400811c5:
.iram1.6 0x40080f94 0x326 .pio\build\release\esp-idf\src\esp_uart.c.o
*fill* 0x400812ba 0x2


0x400829c5:
.iram1 0x40082640 0x4f8 .pio\build\release\esp-idf\freertos\libfreertos.a(xtensa_vectors.S.o)
0x40082640 _xt_panic
0x40082760 _xt_user_exit
0x40082a74 _xt_medint2_exit
0x40082b24 _xt_medint3_exit


0x401aa6df:
.text.esp_pm_impl_waiti
0x401aa6dc 0x8 .pio\build\release\esp-idf\esp32\libesp32.a(pm_esp32.c.o)
0x401aa6dc esp_pm_impl_waiti
*fill* 0x401aa6e4 0x0


0x40121dc2:
.text.esp_vApplicationIdleHook
0x40121d90 0x37 .pio\build\release\esp-idf\esp_common\libesp_common.a(freertos_hooks.c.o)
0x40121d90 esp_vApplicationIdleHook
*fill* 0x40121dc7 0x1


0x40098b85:
.text.prvIdleTask
0x40098b7c 0xf .pio\build\release\esp-idf\freertos\libfreertos.a(tasks.c.o)

0x12 (size before relaxing)
*fill* 0x40098b8b 0x1


0x40096f55:
.text.vPortTaskWrapper
0x40096f50 0x26 .pio\build\release\esp-idf\freertos\libfreertos.a(port.c.o)

0x32 (size before relaxing)
*fill* 0x40096f76 0x2



Backtrace: 0x401aa6df:0x3ffbd550 0x40121dc2:0x3ffbd570 0x40098b85:0x3ffbd590 0x40096f55:0x3ffbd5b0

0x401aa6df:
.text.esp_pm_impl_waiti
0x401aa6dc 0x8 .pio\build\release\esp-idf\esp32\libesp32.a(pm_esp32.c.o)
0x401aa6dc esp_pm_impl_waiti
*fill* 0x401aa6e4 0x0


0x40121dc2:
.text.esp_vApplicationIdleHook
0x40121d90 0x37 .pio\build\release\esp-idf\esp_common\libesp_common.a(freertos_hooks.c.o)
0x40121d90 esp_vApplicationIdleHook
*fill* 0x40121dc7 0x1

0x40098b85:
.text.prvIdleTask
0x40098b7c 0xf .pio\build\release\esp-idf\freertos\libfreertos.a(tasks.c.o)

0x12 (size before relaxing)
*fill* 0x40098b8b 0x1


0x40096f55:
.text.vPortTaskWrapper
0x40096f50 0x26 .pio\build\release\esp-idf\freertos\libfreertos.a(port.c.o)

0x32 (size before relaxing)
*fill* 0x40096f76 0x2

ESP_Sprite
Posts: 3937
Joined: Thu Nov 26, 2015 4:08 am

Re: WDT Timeout on Core 0

Postby ESP_Sprite » Thu Jul 23, 2020 8:16 am

That's the hard way to do it I guess... the easier way is to use addr2line or use 'idf.py monitor' as your serial terminal program, as it automatically decodes the backtrace; that will also give you the source line within the function.

From the backtrace, I can tell that something probably is looping in esp-idf\src\esp_uart.c, and as this is called in the ISR, the interrupt watchdog resets the system. Suggest you use addr2line to figure out what statement in that file exactly was running when this happens: my guess is that either there is a while() loop that never exits in some scenario, or an interrupt clear bit that never gets set.

berlinetta
Posts: 22
Joined: Tue May 21, 2019 8:33 pm

Re: WDT Timeout on Core 0

Postby berlinetta » Mon Jul 27, 2020 4:07 pm

OK... I've used the addr2line utility as you suggested to decode the actual line of code at the crash point.

BTW - I am utilizing the PlatformIO interface for my project, so I am not sure if the idf.py monitor is a legitimate option for me.

The reported line of code from the backtrace is within our UART interrupt handler, but it is not within a while loop. The code had just completed servicing the RX FIFO and was attempting to execute an IF statement to determine if the UART configuration included interrupt-driven management of the transmitter. In the case of this UART, we are NOT managing the transmit operations with interrupts, so the logic would have failed and we would have exited the interrupt handler - we were almost done. If I were to capture more failure events, I suspect the location reported by the backtrace would vary.

I am concerned that something else is happening on Core 0 which puts us very close to the WDT time limit before the UART interrupt hits. When I receive an RX FIFO interrupt, I extract the number of bytes from the FIFO as indicated by the rx_cnt and rxfifo_cnt registers. I read these register values at the entry of the ISR when the interrupt status register indicates an RX FIFO FULL or RX FIFO TIMEOUT condition. I then enter a while loop which extracts only the reported number of bytes available from the FIFO and places them into a local circular buffer where they can be processed later.

Please remember that this issue arises very infrequently, and only when a user triggers a measurement request via the web-server. In this scenario, Core 0 must manage the UART interrupts while it is also managing the BLE and WiFi stacks. The BLE is typically connected, but no activity is occurring during the measurement. The WiFi stack is being used during the measurement to send the 250KB collected measurement data back to the web-server client. Note that if I manually trigger a measurement on our device to start the flow of UART traffic, I never have an issue with the WDT timeout (web-server is not managing a client transfer in that scenario).

Is there a way to trigger the WDT reset when I enter the ISR to prevent a timeout during servicing?

Best Regards,
Mark

berlinetta
Posts: 22
Joined: Tue May 21, 2019 8:33 pm

Re: WDT Timeout on Core 0

Postby berlinetta » Mon Jul 27, 2020 6:56 pm

I just ran a series of tests to capture more failure scenarios... I was able to capture two more events. As I suspected, these two failures each occurred at different locations within the UartInterruptHandler function of ours. In one of those scenarios, the watchdog timeout occurred just as we were entering the ISR!

This problem appears to be a result of something on Core 0 failing to service the watchdog frequently enough. Perhaps we are encountering multiple interrupts between the UART and the WiFi stack which starve the WDT kick?

Any ideas? Can I adjust the WDT timeout on Core 0 to help mitigate the problem?

ESP_Sprite
Posts: 3937
Joined: Thu Nov 26, 2015 4:08 am

Re: WDT Timeout on Core 0

Postby ESP_Sprite » Tue Jul 28, 2020 1:35 pm

So, note that the *interrupt* watchdog is not called that for no reason: it kicks in when your program spends too long in an *interrupt*. (Well, technically it kicks in when interrupts are disabled for too long, but as ISRs run with interrupts disabled, it triggers on that as well). This means that WiFi/BLE probably have nothing to do with it, as the bulk of the processing there is done in a task (so with interrupts enabled). I think it is more likely that for some reason, an interrupt happens and you're not setting the interrupt clear flag for that interrupt somehow, causing the interrupt handler to be called again as soon as it exits as the interrupt line still is asserted. Maybe you can take a good look at that?

If you want to play around, in esp-idf you can change the interrupt watchdog timeout in menuconfig, I don't know how it would work in platformio.

Who is online

Users browsing this forum: Baidu [Spider], iotdeveloperfw, Majestic-12 [Bot] and 40 guests