I bought a Huawei MateBook D16 RLEF-X laptop, installed Arch Linux on it and ran into the following problem:
- Suspend the system (sudo systemctl suspend)
- Close the laptop lid
- Wait a few minutes
- Open the lid again
- Run mount and see that the disk is read-only. In fact, it is only reported as read-only; in practice it is simply broken. No zsh history, no executables, and no way to power off other than a long press of the power button.
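The last step can be automated. The following is a minimal sketch, not part of my original debugging: it assumes the usual /proc/mounts layout (device, mount point, filesystem type, options) and the function name is my own. It only reports what the kernel claims; as noted above, the disk behind the "ro" flag may be completely unusable.

```python
def readonly_mounts(mounts_text):
    """Return (device, mount_point) pairs whose mount options contain 'ro'.

    mounts_text is expected to look like the contents of /proc/mounts:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        # Options are comma-separated; compare whole words so that
        # e.g. "errors=remount-ro" is not mistaken for "ro".
        if "ro" in options.split(","):
            result.append((device, mount_point))
    return result
```

On a live system the input would be the text of /proc/mounts, for example readonly_mounts(open("/proc/mounts").read()).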
Abridged inxi output:
System:
Host: archlinux Kernel: 6.4.2-arch1-1-linux arch: x86_64 bits: 64
Desktop: i3 v: 4.22 Distro: Arch Linux
Machine:
Type: Laptop System: HUAWEI product: RLEF-XX v: M1010
serial: <superuser required>
Mobo: HUAWEI model: RLEF-XX-PCB v: M1010 serial: <superuser required>
UEFI: HUAWEI v: 1.26 date: 01/30/2023
....
Drives:
Local Storage: total: 476.94 GiB used: 63.92 GiB (13.4%)
ID-1: /dev/nvme0n1 model: PCIe-8 SSD 512GB size: 476.94 GiB
My first step was to search for something like "huawei matebook disk read-only after suspend".
To my surprise, I found this thread, created in 2023, which describes exactly the case above.
The recommendation in the thread is to switch the IOMMU to soft mode.
Unfortunately, adding the kernel option iommu=soft in GRUB did not change anything.
(I did not regenerate grub.cfg; I just edited the command line in the GRUB menu and booted it.)
I searched further and found threads about a similar issue on an ASUS laptop and on an IdeaPad. I skipped the first one at the time (I would arrive at the same idea a bit later). The fix from the second one seems to work, but it is a bit... expensive. Disassembling the laptop and replacing the NVMe after a single day of use was not what I had in mind :)
Another idea, from here, is to add the kernel option acpiphp.disable=1.
Sadly, this brought no positive results either.
Meanwhile, I found mentions of a Wi-Fi and NVMe conflict and of turning off the TPM, but playing with the BIOS settings achieved nothing.
My next step was to run more tests and capture more information about the situation than just "the disk is broken after suspend".
I booted the Arch Linux ISO from a USB drive, so that the running OS did not depend on the NVMe.
Next, I mounted one of the NVMe's partitions (the Linux root partition).
Then I suspended the laptop and repeated the steps described at the top.
After waking from the "laptop anabiosis", the live USB OS worked fine, but the disk was again read-only and broken.
I checked dmesg and found the root of the problem:
nvme 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
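For anyone who wants to check their own log for the same message, here is a small sketch. The regular expression is modelled only on the single line above, so other kernel versions may word the message differently; the function name is hypothetical.

```python
import re

# Pattern modelled on the one dmesg line quoted above (an assumption:
# other kernels may phrase it differently).
POWER_STATE_ERROR = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<address>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"can't change power state from (?P<old>D\w+) to (?P<new>D\w+)"
)


def find_power_state_errors(dmesg_text):
    """Return one dict per matching message: driver, PCI address, old and new state."""
    return [m.groupdict() for m in POWER_STATE_ERROR.finditer(dmesg_text)]
```

The PCI address it returns (0000:01:00.0 in my case) is the one used in the sysfs path and in the kernel quirk later on.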
Honorable mention: after "breaking" the mounted NVMe from the live USB OS, the laptop's BIOS lost GRUB. This was solved by regenerating the GRUB config with grub-mkconfig.
Now my monkey-googling query could be made more specific. I was firmly convinced that this more precisely described problem was well known and surely already resolved. In fact, I just sank deeper into the swamp of Linux problems.
The Google results I found fall into two categories:
- "The graveyard" of 2018-* threads with the same or an extremely similar problem
- Email dumps of conversations about kernel changes (or something else... you know, those sites that are just plain text with incomprehensible context and obscure pieces of C code)
A typical result, or the last message in a graveyard thread, looks like:
- Oh, I will switch to Windows
- Just changed the NVMe and now it works
- I have tried YYY and it did not help. Any more ideas? (*message created 4 years ago*)
- Yet another kernel parameter that does not work
After digging up the graves, I purely by chance tried to read one of the second-category results, which describes a kernel patch that disables D3cold for a specific PCI device. Luckily, the patch was not complicated, so I set aside the idea of doing something similar for later.
My last resort (short of a kernel patch) was to change /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed from 1 to 0.
As I expected, the attempt failed (the cause is described below).
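For completeness, this is roughly what that attempt amounts to, written as a sketch rather than as something that helps: the helper names and the sysfs_root parameter are my own additions, the latter so the code can be tried against a fake directory tree instead of the real /sys. Writing to the real file requires root.

```python
import tempfile
from pathlib import Path


def d3cold_allowed_path(pci_address, sysfs_root="/sys"):
    """Build the path of the d3cold_allowed attribute for one PCI device."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / pci_address / "d3cold_allowed"


def read_d3cold_allowed(pci_address, sysfs_root="/sys"):
    """Return True if the attribute currently reads '1'."""
    return d3cold_allowed_path(pci_address, sysfs_root).read_text().strip() == "1"


def write_d3cold_allowed(pci_address, allowed, sysfs_root="/sys"):
    """Write '1' or '0' to the attribute (needs root on a real system)."""
    d3cold_allowed_path(pci_address, sysfs_root).write_text("1" if allowed else "0")


def demo_on_fake_sysfs():
    """Flip the flag from 1 to 0 inside a throwaway directory, not the real /sys."""
    with tempfile.TemporaryDirectory() as root:
        path = d3cold_allowed_path("0000:01:00.0", root)
        path.parent.mkdir(parents=True)
        path.write_text("1\n")
        before = read_d3cold_allowed("0000:01:00.0", root)
        write_d3cold_allowed("0000:01:00.0", False, root)
        after = read_d3cold_allowed("0000:01:00.0", root)
        return before, after
```

On my laptop the flag changed from 1 to 0 without complaint, and the disk still broke after suspend.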
No more resorts, no more suggestions. The only way left was to patch the kernel.
Briefly summarized links:
- 2023 Thread with the same issue
- IOMMU=soft
- Same problem with IdeaPad
- Similar issue on ASUS laptop
- acpiphp.disable=1
- The kernel patch used for inspiration
- Find the PCI Vendor and PCI Class of the NVMe
lspci -vvvvvnn
...
01:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device [126f:1001] (rev 03) (prog-if 02 [NVM Express])
...
126f is the vendor ID, 0108 is the class ID.
- Set up the kernel build system. I am using Arch Linux (btw), and the guide on the Arch Linux Wiki has exhaustive information about the setup, except for one small point, 2.1 Avoid creating the doc. The patch provided there does not apply to the 6.4.2 kernel, so I removed make _htmldocs and "$pkgbase-docs" manually. I also modified the _make function to add -j$(nproc) to the make command.
- Optional step, which I used just to check the build system: build the kernel without any changes to the sources. Start makepkg -s and leave the laptop for a while (roughly 30 minutes to 1 hour).
- Insert the declaration that disables D3cold on the specified device somewhere near the DECLARE_PCI_FIXUP_CLASS_EARLY entries for legacy ATA devices.
DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SILICON_POWER, PCI_ANY_ID, 0x0108, 8, quirk_no_ata_d3);
// PCI_VENDOR_ID_SILICON_POWER is my define that equals to 0x126f
Optionally, I added a few debug prints.
My full patch diff looks like this:
diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/drivers/pci/pci.c src.new/archlinux-linux/drivers/pci/pci.c
--- src/archlinux-linux/drivers/pci/pci.c 2023-07-09 18:07:45.873293132 +0300
+++ src.new/archlinux-linux/drivers/pci/pci.c 2023-07-09 18:06:52.939961065 +0300
@@ -1445,6 +1445,7 @@
* This device is quirked not to be put into D3, so don't put it in
* D3
*/
+ pci_info(dev, "dev->dev_flags %x\n", dev->dev_flags);
if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
return 0;
diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/drivers/pci/quirks.c src.new/archlinux-linux/drivers/pci/quirks.c
--- src/archlinux-linux/drivers/pci/quirks.c 2023-07-09 18:07:45.873293132 +0300
+++ src.new/archlinux-linux/drivers/pci/quirks.c 2023-07-09 18:06:52.939961065 +0300
@@ -1340,6 +1340,7 @@
/* Some ATA devices break if put into D3 */
static void quirk_no_ata_d3(struct pci_dev *pdev)
{
+ pci_info(pdev, "quirk_no_ata_d3 called\n");
pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
}
/* Quirk the legacy ATA devices only. The AHCI ones are ok */
@@ -1355,6 +1356,10 @@
DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+/* Do not suspend NVMe */
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SILICON_POWER, PCI_ANY_ID,
+ 0x0108, 8, quirk_no_ata_d3);
+
/*
* This was originally an Alpha-specific thing, but it really fits here.
* The i82375 PCI/EISA bridge appears as non-classified. Fix that.
diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/include/linux/pci_ids.h src.new/archlinux-linux/include/linux/pci_ids.h
--- src/archlinux-linux/include/linux/pci_ids.h 2023-07-09 18:07:45.883293132 +0300
+++ src.new/archlinux-linux/include/linux/pci_ids.h 2023-07-09 18:07:05.963294086 +0300
@@ -3120,4 +3120,6 @@
#define PCI_VENDOR_ID_NCUBE 0x10ff
+#define PCI_VENDOR_ID_SILICON_POWER 0x126f
+
#endif /* _LINUX_PCI_IDS_H */
- Compile the kernel (if you completed step 3, the compilation will finish faster) and install it.
- Regenerate grub.cfg, reboot the laptop and check dmesg. You should see messages about the quirk.
sudo dmesg | grep quirk
[ 0.337952] pci 0000:01:00.0: quirk_no_ata_d3 called
[ 1.939159] nvme 0000:01:00.0: platform quirk: setting simple suspend
The first one is my debug message, the second one already exists in Linux.
- Test suspend as described at the top.
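To make the arguments of the DECLARE_PCI_FIXUP_CLASS_EARLY line less magic, here is a simplified model of how I understand the matching to work. It is a sketch, not kernel code: the function names are hypothetical, PCI_ANY_ID is only a stand-in for the kernel's wildcard, and the 24-bit class value 0x010802 is assembled from the class [0108] and prog-if 02 shown in the lspci line from step 1.

```python
import re

PCI_ANY_ID = 0xFFFFFFFF  # stand-in wildcard, used only inside this model

# Matches the "lspci -nn" summary line: slot, [class], [vendor:device]
LSPCI_LINE = re.compile(
    r"^(?P<slot>\S+) .*?\[(?P<class>[0-9a-f]{4})\]: "
    r".*\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def parse_lspci_nn(line):
    """Extract slot, class, vendor and device IDs from one lspci -nn line."""
    m = LSPCI_LINE.search(line)
    if m is None:
        return None
    return {
        "slot": m.group("slot"),
        "class": int(m.group("class"), 16),
        "vendor": int(m.group("vendor"), 16),
        "device": int(m.group("device"), 16),
    }


def fixup_matches(dev_vendor, dev_device, dev_class24,
                  fix_vendor, fix_device, fix_class, class_shift):
    """Model of the quirk match: the device's 24-bit class is shifted right
    by class_shift before being compared with the class in the declaration."""
    class_ok = fix_class == PCI_ANY_ID or fix_class == (dev_class24 >> class_shift)
    vendor_ok = fix_vendor == PCI_ANY_ID or fix_vendor == dev_vendor
    device_ok = fix_device == PCI_ANY_ID or fix_device == dev_device
    return class_ok and vendor_ok and device_ok
```

With a shift of 8 the prog-if byte is dropped, so 0x010802 becomes 0x0108 and the declaration from step 4 matches any Silicon Motion NVMe controller regardless of device ID.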
The function quirk_no_ata_d3 sets pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;.
The sysfs attribute d3cold_allowed modifies the dev->d3cold_allowed field.
d3cold_allowed is used in the pci_dev_check_d3cold function, which in turn is used only in the bridge update function pci_bridge_d3_update.
The PCI_DEV_FLAGS_NO_D3 flag, however, is checked in the pci_set_power_state function.
That function has no d3cold_allowed check (or at least I cannot see one), so changing d3cold_allowed in sysfs is useless in the context of the described problem.
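The difference can be shown with a toy model. It mirrors only my reading above and the early return visible in the pci.c hunk of the patch; it is not the real kernel logic, the class and function names are made up, the flag's bit value is illustrative, and code outside pci_set_power_state may well consult d3cold_allowed in ways this model ignores.

```python
# Power states in the order the kernel numbers them: D0 < D1 < D2 < D3hot < D3cold
PCI_D0, PCI_D1, PCI_D2, PCI_D3hot, PCI_D3cold = range(5)
PCI_DEV_FLAGS_NO_D3 = 1 << 1  # illustrative bit value for this model


class FakePciDev:
    """Minimal stand-in for struct pci_dev with only the fields discussed above."""

    def __init__(self, dev_flags=0, d3cold_allowed=True):
        self.dev_flags = dev_flags
        self.d3cold_allowed = d3cold_allowed
        self.current_state = PCI_D0


def set_power_state(dev, state):
    """Model of the check from the patch: NO_D3 blocks D3hot and D3cold,
    while d3cold_allowed is never looked at in this function."""
    if state >= PCI_D3hot and (dev.dev_flags & PCI_DEV_FLAGS_NO_D3):
        return 0  # request ignored, device stays in its current state
    dev.current_state = state
    return 0
```

In this model a device with d3cold_allowed set to False still ends up in D3cold, whereas one carrying PCI_DEV_FLAGS_NO_D3 stays in D0, which matches what I observed.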
This solution seems to be the only way to fix the problem. Its biggest caveat is that every kernel update via pacman turns into a recompilation headache.
I hope this post helps someone finally fix this annoying NVMe issue. Having run into such problems, I am sincerely glad to realize that the share of laptops running desktop Linux is still not below zero.
I've just found here that the problem can be fixed using the kernel parameter nvme_core.default_ps_max_latency_us=0
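Whichever kernel parameter you try, it is worth confirming that it actually reached the running kernel, especially when editing the command line by hand in the GRUB menu. A minimal sketch, assuming the usual space-separated key=value layout of /proc/cmdline (quoted values containing spaces are not handled, and the function names are my own):

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def has_param(cmdline, name, expected):
    """Return True if the parameter is present with exactly the expected value."""
    return parse_cmdline(cmdline).get(name) == expected
```

For example, has_param(open("/proc/cmdline").read(), "nvme_core.default_ps_max_latency_us", "0") should be True after a reboot with the parameter set.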
Same issue with:
Disabled suspend and it works for now.
Have you submitted a pull request to the Linux kernel yet? I am using Debian 12 and don't want to compile the kernel myself. I hope this gets released in a few weeks.