Two changes were required for the GRUB bootloader.
Adding a 15-second boot menu delay to the GRUB config solves the Thunderbolt boot failure.
The delay is set in /etc/default/grub:
$ cat /etc/default/grub
...
GRUB_TIMEOUT_STYLE=menu
GRUB_TIMEOUT=15
...
$ sudo update-grub
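To confirm that update-grub picked up the new values, the generated config can be inspected. This is only a minimal sketch: it assumes the generated file is /boot/grub/grub.cfg (the usual Debian/Ubuntu location), and the `grub_timeouts` helper name is made up for illustration.

```shell
# Print the timeout-related lines from a generated GRUB config.
# Assumption: update-grub writes to /boot/grub/grub.cfg; pass another path to override.
grub_timeouts() {
    grep -E '^[[:space:]]*set timeout(_style)?=' "${1:-/boot/grub/grub.cfg}"
}

# Usage (may need sudo if the file is not world-readable):
#   grub_timeouts
```

If the edit took effect, the output should include a `set timeout=15` line.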
Not an ideal solution, but it seems to work around some other underlying issue.
The failure looks something like this (transcribed by hand from a screenshot):
Workqueue: thunderbolt0 icm_handle_notification [thunderbolt]
RIP: 0010: switch_find_xdomain+0x11/0x160 [thunderbolt]
Code: aa bb cc dd ee
RSP: ... EFLAGS...
RAX ... RBX ... RCX ...
RDX ... RSI ... RDI ...
RBP ... R08 ... R09 ...
...
Call Trace:
<TASK>
tb_xdomain_find_by_link_depth
icm_fr_device_connected
icm_handle_notification
process_one_work
worker_thread
? process_one_work
kthread
? set_kthread_struct
ret_from_fork
</TASK>
At some point I also added pci=realloc to the kernel command line configuration:
$ cat /etc/default/grub
...
#GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi=off"
GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc"
...
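The new command line only applies after running sudo update-grub and rebooting. A small sketch for checking whether the running kernel actually received the parameter; the `has_kernel_param` helper is hypothetical and simply reads /proc/cmdline.

```shell
# Succeed if the given parameter appears as a whole word on the kernel command line.
# The optional second argument defaults to /proc/cmdline and exists mainly for testing.
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage:
#   has_kernel_param pci=realloc && echo "pci=realloc is active"
```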
egpu-switcher - https://github.com/hertg/egpu-switcher
This script is a systemd service that checks whether the eGPU is present before starting Xorg. Depending on the result, /etc/X11/xorg.conf is linked to the appropriate device configuration, so that only one display device is defined for Xorg. This improves performance; otherwise Xorg seems to use 10-50% CPU while just idling with Firefox.
egpu-switcher didn't work out of the box on my Spectre machine. On inspection, /etc/X11/xorg.conf.internal was empty. Changing the file to the following contents helped it boot:
$ cat /etc/X11/xorg.conf.internal
Section "Device"
Identifier "Intel Graphics"
Driver "intel"
Option "TearFree" "true"
EndSection
The content above was originally in /etc/X11/xorg.conf.d/20-intel.conf. I'm not sure how it got there. That file is now deleted and the configuration remains only in /etc/X11/xorg.conf.internal.
$ journalctl -u egpu.service
-- Reboot --
Jul 22 20:35:32 spectre systemd[1]: Starting EGPU Service...
Jul 22 20:35:32 spectre egpu-switcher[937]: [info] Automatically detecting if egpu is con>
Jul 22 20:35:34 spectre egpu-switcher[937]: [info] EGPU is disconnected.
Jul 22 20:35:34 spectre egpu-switcher[937]: [info] Create symlink /etc/X11/xorg.conf -> />
Jul 22 20:35:34 spectre systemd[1]: egpu.service: Succeeded.
Jul 22 20:35:34 spectre systemd[1]: Finished EGPU Service.
-- Reboot --
Jul 22 20:41:12 spectre systemd[1]: Starting EGPU Service...
Jul 22 20:41:12 spectre egpu-switcher[1100]: [info] Automatically detecting if egpu is co>
Jul 22 20:41:13 spectre egpu-switcher[1100]: [info] EGPU is connected.
Jul 22 20:41:13 spectre egpu-switcher[1100]: [info] Create symlink /etc/X11/xorg.conf -> >
Jul 22 20:41:13 spectre systemd[1]: egpu.service: Succeeded.
Jul 22 20:41:13 spectre systemd[1]: Finished EGPU Service.
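journalctl truncates the symlink target in the log lines above, so it can be easier to ask the filesystem directly which configuration is currently active. A minimal sketch; `xorg_conf_target` is a made-up helper name.

```shell
# Print the file that an Xorg config symlink currently resolves to.
# Defaults to /etc/X11/xorg.conf, the link that the log above reports being created.
xorg_conf_target() {
    readlink -f "${1:-/etc/X11/xorg.conf}"
}

# Usage:
#   xorg_conf_target
```

With the eGPU disconnected, this would be expected to print /etc/X11/xorg.conf.internal.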
If Thunderbolt and PCI devices enumerate correctly but nvidia-smi still fails, the module may have been blacklisted by prime-select intel or prime-select on-demand. The blacklist lives in the automatically generated file /lib/modprobe.d/blacklist-nvidia.conf.
The gpu-manager log shows whether the module is blacklisted:
$ cat /var/log/gpu-manager.log
...
Is nvidia blacklisted? yes
...
The blacklisting can be removed with
sudo prime-select nvidia
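The blacklist check can be scripted so it is easy to repeat before and after switching profiles. A hedged sketch: it looks for the exact log line shown above, and the `nvidia_blacklisted` helper name is hypothetical.

```shell
# Succeed if gpu-manager reported the nvidia module as blacklisted.
# The log path defaults to /var/log/gpu-manager.log; pass another path to override it.
nvidia_blacklisted() {
    grep -qF 'Is nvidia blacklisted? yes' "${1:-/var/log/gpu-manager.log}"
}

# Usage:
#   nvidia_blacklisted && echo "nvidia is blacklisted"
```

The log is presumably rewritten at boot, so a reboot is likely needed before the result changes.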
Not 100% sure, but the DKMS modules may get deleted and not recompiled after running sudo apt reinstall nvidia-driver-515 and sudo apt purge nvidia-* several times.
The NVIDIA DKMS modules can be rebuilt with dpkg-reconfigure:
sudo dpkg-reconfigure nvidia-dkms-515
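Whether the rebuild worked can be checked from the output of dkms status. The filter below is a rough sketch only: the `nvidia_dkms_installed` name is made up, and the line format it matches is an assumption that may differ between dkms versions.

```shell
# Read `dkms status` output on stdin and succeed if an nvidia module is installed.
# Assumption: lines look like "nvidia/<version>, <kernel>, x86_64: installed"
# (newer dkms) or "nvidia, <version>, <kernel>, x86_64: installed" (older dkms).
nvidia_dkms_installed() {
    grep -Eq '^nvidia[/,].*: installed'
}

# Usage:
#   dkms status | nvidia_dkms_installed || echo "nvidia module is not built"
```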
Possible issue with future kernels: see NVIDIA/open-gpu-kernel-modules#256 (comment).