NVMe unstable with Argon ONE V3 case

Hi,

I recently purchased an RPi5 and an Argon ONE V3 M.2 NVMe PCIe case to use as a home automation server. Installation and boot from the NVMe went smoothly. After 2 days of uptime, the server suddenly became unresponsive. After rebooting, I found several kernel messages saying “controller is down; will reset” (see example below):

Apr 09 01:00:07 iotserver kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Apr 09 01:00:07 iotserver kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Apr 09 01:00:07 iotserver kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Apr 09 01:00:07 iotserver kernel: nvme 0000:01:00.0: enabling device (0000 -> 0002)
Apr 09 01:00:07 iotserver kernel: nvme nvme0: 4/0/0 default/read/poll queues
Apr 09 01:04:19 iotserver kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Apr 09 01:04:19 iotserver kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Apr 09 01:04:19 iotserver kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Apr 09 01:04:19 iotserver kernel: nvme 0000:01:00.0: enabling device (0000 -> 0002)
Apr 09 01:04:19 iotserver kernel: nvme nvme0: 4/0/0 default/read/poll queues
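
To see how often these resets occur, the kernel log can be scanned for the message above. The following is only a minimal Python sketch: the function name `count_resets` and the regular expression are illustrative assumptions, and the sample text is an excerpt of the log lines quoted in this post.

```python
import re
from collections import Counter

# Matches the kernel message quoted in the log above.
RESET_RE = re.compile(r"nvme (nvme\d+): controller is down; will reset")


def count_resets(log_text: str) -> Counter:
    """Count 'controller is down' events per NVMe controller name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Excerpt of the log lines from this post.
sample = """\
Apr 09 01:00:07 iotserver kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Apr 09 01:00:07 iotserver kernel: nvme nvme0: 4/0/0 default/read/poll queues
Apr 09 01:04:19 iotserver kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
"""

print(dict(count_resets(sample)))  # {'nvme0': 2}
```

In practice the text would come from the saved output of `journalctl -k` rather than a hard-coded sample.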

After hooking the server up to a screen, I was able to capture the error messages printed during a later crash. They never made it into the logs because the file system was no longer writable (hand-typed from a photo, may contain some “spelling errors”):

[118555.548795] EXT4-fs error (device nvme0n1p2) in ext4_reserve_inode_write:5764: Journal has aborted
[118555.548807] EXT4-fs error (device nvme0n1p2) in ext4_reserve_inode_write:5764: Journal has aborted
[118555.550145] EXT4-fs error (device nvme0n1p2): ext4_dirty_inode:5968: inode #120009: comm systemd-journal: mark_inode_dirty error
[118555.551476] EXT4-fs error (device nvme0n1p2): ext4_dirty_inode:5968: inode #545366: comm python3: mark_inode_dirty error
[118555.552829] EXT4-fs error (device nvme0n1p2) in ext4_dirty_inode:5969: Journal has aborted
[118555.552927] EXT4-fs error (device nvme0n1p2): ext4_journal_check_starting:84: comm systemd-journal: Detected aborted journal
[118555.617595] EXT4-fs (nvme0n1p2): Remounting filesystem read-only

I use the “official” RPi 5 power adapter. The NVMe card is a Kingston NV2 M.2 500GB. I followed the instructions for setting up the EEPROM config and installing the Argon scripts.

Sometimes the error happens after a few days, other times within 10 seconds of booting. The error does not seem related to the temperature of the device: it has happened within 30 seconds of booting after the device had been turned off for 30 minutes.

I have tried the recommended “nvme_core.default_ps_max_latency_us=0 pcie_aspm=off” kernel parameters without success.
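
Before ruling these parameters out, it can help to confirm that the running kernel actually received them. The sketch below compares the kernel command line, which Linux exposes at `/proc/cmdline`, against the two parameters from the kernel hint. The helper name `missing_params` is an illustrative assumption.

```python
from pathlib import Path

# The two parameters suggested by the kernel message.
WANTED = ("nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off")


def missing_params(cmdline: str, wanted=WANTED) -> list:
    """Return the wanted parameters that are absent from the command line."""
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    if path.exists():
        # An empty list means both parameters are active.
        print("missing:", missing_params(path.read_text()))
```

On Raspberry Pi OS the command line is normally set in `cmdline.txt`, where everything has to stay on a single line, so a parameter added on a second line would show up as missing here.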

Having spent an inordinate amount of time troubleshooting the device, I’m seriously regretting not paying €50 more and getting a NUC at this point :cry:

I hope someone here might be able to give me some advice before it all goes into the trash.

Thank you so much in advance!

Have you tried ‘dtparam=pciex1_gen=2’ in your config.txt to get it stable? “…gen=3” isn’t officially supported by the Pi5.

Hi HarryH,

Thank you for your response.

I am currently running without any dtparam=pciex1_gen=xxx parameter in config.txt. My understanding is that it defaults to gen2? I can try to set it explicitly to see if it makes a difference.

I remember seeing some rather counter-intuitive posts somewhere on the internet where someone had managed to get an NVMe card working by changing from gen2 to gen3. Will try that as well.

Lately, I have not been able to boot the RPi at all. The problem seems to be getting worse, considering it used to run for 24-48 hours before crashing. I had to pull out the NVMe card and change the boot order in order to boot and check the config.txt file.

In the meantime, any other ideas or tips that I could try out?

Again, thank you so much for your help and input!

Yes, dtparam=pciex1_gen=2 is the default if you don’t specify this line, so it should make no difference.
Additionally, it should be possible to slow down the PCIe bus with dtparam=pciex1_gen=1 for troubleshooting purposes.
dtparam=pciex1_gen=3 is a kind of overclocking, but you can’t lose more than you already have. Normally this speed should be more critical because of the higher frequencies on the flexible PCB. But if the firmware of your NVMe has trouble with lower speeds, perhaps it helps.
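
After changing the dtparam and rebooting, the negotiated link speed can be read back from sysfs to confirm the setting took effect. This is a rough Python sketch under two assumptions: the drive sits at PCI address 0000:01:00.0 (taken from the kernel log earlier in the thread), and the helper name `link_gen` is made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Per-lane transfer rate in GT/s for PCIe generations 1 to 3.
GEN_BY_GTS = {2.5: 1, 5.0: 2, 8.0: 3}


def link_gen(speed_text: str) -> Optional[int]:
    """Map a sysfs speed string such as '8.0 GT/s PCIe' to a PCIe generation."""
    try:
        gts = float(speed_text.split()[0])
    except (IndexError, ValueError):
        return None  # empty or unparseable string
    return GEN_BY_GTS.get(gts)


if __name__ == "__main__":
    # Assumption: device address as shown in the kernel log above.
    attr = Path("/sys/bus/pci/devices/0000:01:00.0/current_link_speed")
    if attr.exists():
        text = attr.read_text().strip()
        print(text, "-> gen", link_gen(text))
    else:
        print("no such PCI device:", attr)
```

If the reported generation does not match the value in config.txt, the link may have trained down to a lower speed, which would itself point to a signal problem on the cable.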

Because it’s rather fiddly, have you reseated the flexible PCB to make sure it fits properly? The system ran for 2 days, so trying another power supply could also be an option.
Do you know if a firmware update is available for your Kingston NVMe?
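
To compare against whatever the vendor offers, the currently installed firmware revision can be read from sysfs. A small Python sketch, assuming the controller appears as `nvme0` (as in the kernel log earlier in the thread). The function name `nvme_identity` is hypothetical.

```python
from pathlib import Path


def nvme_identity(ctrl_dir: Path) -> dict:
    """Read model and firmware revision from an NVMe controller's sysfs directory."""
    info = {}
    for name in ("model", "firmware_rev"):
        attr = ctrl_dir / name
        if attr.exists():
            info[name] = attr.read_text().strip()
    return info


if __name__ == "__main__":
    # Assumption: the drive is the first controller, nvme0.
    print(nvme_identity(Path("/sys/class/nvme/nvme0")))
```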

Hi again!

And thanks for the ultra-fast reply.

I managed to boot with gen3. Seeing that this is the first NVMe boot in more than 20 attempts that reached the login prompt without crashing, it is a good start. But let’s see. It’s been running for 10 minutes now … :slight_smile:

While counter-intuitive, it could very well be that my Kingston NV2 actually struggles with the lower PCIe speeds. I’ve seen some reviews indicating that the NV2 cards are really low-end, with cheap components and even different controllers and NAND flash from drive to drive. I would not be surprised if the card does not properly support low-speed PCIe generations that are no longer used in mainstream setups.

However, if the firmware is the issue (rather than the hardware), it might be worth checking if the firmware can be upgraded. Will check that as well.

If gen3 turns out to be as unstable as gen2 and a firmware update does not help, I will try gen1 as well.

Crossing my fingers …

Again. Thank you!

Quick update: The RPi has been running without issues with gen3 for 72 hours now. Way too early to call it a win, but a good sign nevertheless. Still crossing my fingers … :slight_smile:
