WROVER-B Flash Config Corrupted In Field

vonnieda
Posts: 145
Joined: Tue Nov 07, 2017 3:42 pm

WROVER-B Flash Config Corrupted In Field

Postby vonnieda » Fri Mar 13, 2020 10:19 pm

Hi all,

I'm experiencing an issue on a small percentage of boards where the flash config of a board appears to become corrupt in the field which essentially bricks the device.

When this happens, the console starts looping with this message:

Code: Select all

rst:0x10 (RTCWDT_RTC_RESET),boot:0x3b (SPI_FAST_FLASH_BOOT)
flash read err, 1000
ets_main.c 371 
ets Jun  8 2016 00:22:57
Reflashing the board (as with `make flash`) fixes the problem in all cases.

I pulled a complete flash image off a device displaying this behavior using esptool and saved it. I can share this with Espressif if needed. It contains proprietary information, of course, so that will need to be done privately.

I have tried just flashing individual sections including the ota_data, the partition table, and the bootloader, and this does not fix the issue.

I then performed a `make flash` and that fixed the issue. Reflashing the original "bad" image with esptool.py puts the board back into the bricked state - so this does appear to be due to something stored in flash.

I suspect this has to do with the "flash config" structure, but I have not had any luck finding more information about that. It seems that it is referenced from ROM, perhaps?

I believe this situation appears when the board has been power cycled repeatedly, but I don't have definitive proof of that.

My firmware does not use any flash writing features except for OTA and NVS.

So, my questions are:

1. Any ideas of how this could happen in the field? Could power cycling the module repeatedly cause some kind of flash corruption?
2. Would providing my "broken" image to Espressif help me debug this issue?
3. Are there things I can look for in the broken image that might lead me to understand what went wrong?

Thanks,
Jason

WiFive
Posts: 3529
Joined: Tue Dec 01, 2015 7:35 am

Re: WROVER-B Flash Config Corrupted In Field

Postby WiFive » Sat Mar 14, 2020 8:34 am

boot:0x3b
For wrover-b it should be booting in 3.3v mode but the strapping pins are showing 1.8v mode, did you set efuse to fix this?

vonnieda
Posts: 145
Joined: Tue Nov 07, 2017 3:42 pm

Re: WROVER-B Flash Config Corrupted In Field

Postby vonnieda » Sat Mar 14, 2020 1:17 pm

WiFive wrote:
Sat Mar 14, 2020 8:34 am
boot:0x3b
For wrover-b it should be booting in 3.3v mode but the strapping pins are showing 1.8v mode, did you set efuse to fix this?
Thanks for the response - yes, 3.3v efuse is set during provisioning and verified in firmware at boot.

I should note also that in these cases the devices have been in service sometimes for months.

Thanks,
Jason

WiFive
Posts: 3529
Joined: Tue Dec 01, 2015 7:35 am

Re: WROVER-B Flash Config Corrupted In Field

Postby WiFive » Sun Mar 15, 2020 3:44 am

Suggest that you narrow it down to a partition by selectively reflashing the bad image or doing a binary diff to a working image. Also post the log from a cold boot if it is any different.

vonnieda
Posts: 145
Joined: Tue Nov 07, 2017 3:42 pm

Re: WROVER-B Flash Config Corrupted In Field

Postby vonnieda » Sun Mar 15, 2020 4:05 pm

WiFive wrote:
Sun Mar 15, 2020 3:44 am
Suggest that you narrow it down to a partition by selectively reflashing the bad image or doing a binary diff to a working image. Also post the log from a cold boot if it is any different.
I did try to selectively reflash partitions, but I didn't try every single partition, and I didn't try them isolated, one at a time - so I will try that next.

And good idea on the cold boot log - will try that in the morning.

Thanks,
Jason

vonnieda
Posts: 145
Joined: Tue Nov 07, 2017 3:42 pm

Re: WROVER-B Flash Config Corrupted In Field

Postby vonnieda » Mon Mar 16, 2020 8:49 pm

WiFive wrote:
Sun Mar 15, 2020 3:44 am
Suggest that you narrow it down to a partition by selectively reflashing the bad image or doing a binary diff to a working image. Also post the log from a cold boot if it is any different.
Okay, I've done a bunch of further testing and have some interesting results and more questions:

1. The cold boot log was the same, so nothing interesting there.
2. I reflashed the original image I pulled off the device and then flashed a new bootloader - same error.
3. I then realised I had logging in the bootloader turned off, so I changed that setting, built a new one and flashed it. That lead me to a new error: E (56) flash_parts: partition 0 invalid magic number 0x5000
4. So I then flashed the partition table. That lead me to more new errors:

Code: Select all

E (162) esp_image: image at 0x10000 has invalid magic byte
E (168) boot_comm: mismatch chip ID, expected 0, found 65535
E (174) boot_comm: can't run on lower chip revision, expected 1, found 255
W (182) esp_image: image at 0x10000 has invalid SPI mode 255
W (188) esp_image: image at 0x10000 has invalid SPI size 15
E (194) boot: OTA app partition slot 0 is not bootable
E (200) esp_image: image at 0x1f0000 has invalid magic byte
E (206) boot_comm: mismatch chip ID, expected 0, found 65535
5. Then flashed the application image at 0x10000 and it finally booted.

This lead me to take a closer look at the original image in a hex editor. What I found is that the data from the start (0x0) all the way into the second application partition is pure trash. There's no recognizable patterns or human readable strings. Bootloader, partition table, first application partition, and part of the second are all just trash.

So then I erased the flash and flashed the binaries that *should* have been on this device - I keep a copy of every version of every binary we push, including the bootloader and partition table - and then read the flash into a file. Looking at this file, as expected, the bootloader has plenty of human readable strings, the partition table aligns on 32 bytes boundaries and includes partition names, the first app looks like an app image, etc.

Of particular interest: In the bad image, the first 0x1000 bytes are trash. In the good image these are 0xff - they are erased. I don't know if the ESP stores anything in this region before the bootloader but it doesn't appear so - so why is there trash data there in the bad image?

Another point of interest: The bad image shows entopy of 6.12 bits per byte, while the good image shows 3.66 using `ent`.

So, to me it looks like more than half the flash somehow got overwritten with random data, in the field, with no functions accessing flash except NVS and OTA.


Anyone have any thoughts or suggestions? I'm stumped.

Thanks,
Jason

Who is online

Users browsing this forum: No registered users and 91 guests