Motherboard: Asus Pro WS WRX80E-SAGE SE WIFI
Card: Asus HYPER M.2 X16 GEN 4 CARD
NVMe: 4x Samsung SSD 980 PRO 1TB
OS: Linux fedora 5.16.12-200.fc35.x86_64
AER (Advanced Error Reporting) logs excessively:
dmesg
nvme 0000:44:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
nvme 0000:44:00.0: [ 0] RxErr (First)
nvme 0000:44:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
nvme 0000:44:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
nvme 0000:44:00.0: [ 0] RxErr (First)
nvme 0000:44:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
{2085}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
{2085}[Hardware Error]: It has been corrected by h/w and requires no further action
{2085}[Hardware Error]: event severity: corrected
{2085}[Hardware Error]: Error 0, type: corrected
{2085}[Hardware Error]: section_type: PCIe error
{2085}[Hardware Error]: port_type: 0, PCIe end point
{2085}[Hardware Error]: version: 0.2
{2085}[Hardware Error]: command: 0x0406, status: 0x0010
{2085}[Hardware Error]: device_id: 0000:44:00.0
{2085}[Hardware Error]: slot: 0
{2085}[Hardware Error]: secondary_bus: 0x00
{2085}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa80a
{2085}[Hardware Error]: class_code: 010802
{2085}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
Note the device id in the logs. In this case it's 0000:44:00.0. There are similar logs for all four NVMe disks on the same card, with respective device ids 0000:43:00.0, 0000:42:00.0 and 0000:41:00.0.
Then, for each device id (for example, 0000:44:00.0), turn off correctable error reporting if it is enabled. The enable bit is bit 0 of the Device Control register, which sits at offset 0x8 in the PCI Express capability (CAP_EXP). Read the current value of that register and XOR it with 0x1 to toggle the bit.
setpci -v -s 0000:44:00.0 CAP_EXP+0x8.w
0000:44:00.0 (cap 10 @70) @78 = 2937
So, the bit is set... toggle: 0x2937 XOR 0x1 = 0x2936
setpci -v -s 0000:44:00.0 CAP_EXP+0x8.w=0x2936
0000:44:00.0 (cap 10 @70) @78 2936
The device ids and register values may differ on other systems.
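The read, toggle and write steps above can be repeated for all four disks with a small script. This is only a minimal sketch, not a tested tool: the helper names clear_bit0 and disable_cer and the --apply switch are made up for illustration, and the device list is the one from the logs in this post. It masks bit 0 off with AND rather than XOR, so running it twice does not turn the bit back on.

```shell
#!/usr/bin/env bash
# Sketch: clear bit 0 (Correctable Error Reporting Enable) of the Device
# Control register at CAP_EXP+0x8 on each listed device.
# Needs root and setpci (pciutils).

# Device ids taken from the dmesg output in this post; adjust for your system.
DEVICES="0000:41:00.0 0000:42:00.0 0000:43:00.0 0000:44:00.0"

# clear_bit0 (hypothetical helper): take a hex word as printed by setpci
# (e.g. 2937) and print it with bit 0 cleared (e.g. 2936).
clear_bit0() {
    printf '%04x\n' $(( 0x$1 & ~0x1 ))
}

# disable_cer (hypothetical helper): read the register, mask bit 0,
# and write the value back only if it actually changed.
disable_cer() {
    local dev=$1 cur new
    cur=$(setpci -s "$dev" CAP_EXP+0x8.w) || return 1
    new=$(clear_bit0 "$cur")
    [ "$cur" = "$new" ] || setpci -s "$dev" CAP_EXP+0x8.w="$new"
}

# Only touch hardware when explicitly asked to (hypothetical --apply switch).
if [ "${1:-}" = "--apply" ]; then
    for dev in $DEVICES; do
        disable_cer "$dev" || echo "failed: $dev" >&2
    done
fi
```

Saved under any name and run as root with --apply, it walks the list once. Without the switch it does nothing, which makes it safe to source when checking the arithmetic.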
Thank you so much for this!
It needed to be run on every boot so I threw it into a .service based on this -
Pretty sure WantedBy is wrong, but I'm not totally proficient in systemd services so I'll have to do some digging. Works for now.
Update:
I was still getting errors and realized only the last ExecStart is run. Here is an updated service file -