@eduncan911
Last active June 22, 2024 20:37
Fixing Thermal Throttling on Thinkpad P1 and X1 Extreme - Linux Edition

Lenovo messed up with the X1E and P1 Gen 1 versions (and maybe later generations): the system boots with a thermal limit (aka Tjunction or tjmax) set to 82C (some report 80C). This means that, regardless of power draw or undervolting settings, when your CPU hits 82C it will drop the frequency down to the "Configurable TDP-down" frequency, or even lower. It may also limit the system power draw.

Thermal Paste and Stress Testing

First, note that I have already replaced the thermal paste on my P1's CPU and GPU with Noctua NT-H2 thermal compound (affiliate link). This immediately made a very noticeable difference in idle temps: the laptop stayed cool on my lap, and the keyboard no longer got hot to the touch.

For stress testing under Linux, I used the s-tui application to dig into the details for all testing below.

How to Fix It

The fix is really two steps:

  • Set the Tjunction higher, say, 3C under your CPU's rated Tjunction value.
  • Undervolt the CPU, Cache, Uncore, and iGPU to maximize your performance.

Windows has a "driver" fix

Lenovo released a software update that effectively sets the Tjunction back up to 97C. However, this is only for Windows, and there are many posts reporting that Hyper-V negates the setting. I am not sure, but perhaps Lenovo has fixed this with newer drivers since others reported it back in Q1 2018.

Linux instructions

For Linux, we are left to fend for ourselves. Therefore, here's how to verify your system is affected, and how to fix it.

Verify your system is affected

Two different ways to do this.

Use msr-tools

You can install the msr-tools utility.

sudo apt install msr-tools
sudo modprobe msr 

Then, read the field and convert it to a decimal number:

$ sudo rdmsr --bitfield 29:24 -d 0x1a2
18

This means your system is set to 18C under your Tjunction max, which for my Xeon E-2176M is 100C. So, that would be 100 - 18, which is 82C max. (Bits 23:16 of the same register hold the Tjunction max itself, so --bitfield 23:16 prints 100 here.)
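
To see how the raw register value maps to the limit, here is a minimal sketch. It assumes the layout Intel documents for MSR 0x1a2 (bits 23:16 hold the Tjunction max, bits 29:24 hold the offset below it). The function name is made up for illustration and is not part of msr-tools.

```shell
# Decode a raw MSR 0x1a2 value into Tjunction max, offset, and effective limit.
# Assumed layout: bits 23:16 = Tjunction max, bits 29:24 = offset below it.
decode_temp_target() {
    raw=$(( $1 ))                       # accepts a 0x-prefixed hex value
    tjmax=$(( (raw >> 16) & 0xFF ))     # bits 23:16
    offset=$(( (raw >> 24) & 0x3F ))    # bits 29:24
    echo "tjmax=${tjmax} offset=${offset} limit=$(( tjmax - offset ))"
}

# Example with a made-up raw value that encodes tjmax=100 and offset=18:
decode_temp_target 0x12640000
```

On a real system the raw value would come from something like "0x$(sudo rdmsr 0x1a2)", assuming rdmsr prints hex without a prefix by default.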

Use the undervolt utility

Current install instructions are on the github:

https://github.com/georgewhewell/undervolt

But in short, install it via pip as root (I know, installing with pip as root goes against Python best practice, but this tool needs root to access the CPU's model-specific registers).

sudo pip install undervolt

Now, you can read the Tjunction directly (called temperature target):

$ sudo undervolt --read
temperature target: -18 (82C)
core: 0.0 mV
gpu: 0.0 mV
cache: 0.0 mV
uncore: 0.0 mV
analogio: 0.0 mV
powerlimit: 78.0W (short: 0.00244140625s - enabled) / 45.0W (long: 96.0s - enabled)

As you can see, mine is set to 82C.

1. Set Tjunction to proper setting

Go look up your CPU on Intel's Ark site and find its Tjunction value. My E-2176M has a max of 100C. You do NOT want to hit this 100C, ever! So we are going to set it to 97C instead, to leave a little headroom, as CPU temps sometimes spike 1C or 2C higher than your target temp while waiting on the fans to ramp up. If you do hit your Tjunction max, your system will shut down out of safety.
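
The arithmetic is simple, but a small guard makes it harder to set a target at or above the Tjunction max by accident. This is only a sketch: temp_target is a hypothetical helper, and the default 3C headroom is just the value I chose above.

```shell
# Print a temperature target that sits a few degrees below the Tjunction max.
# Usage: temp_target TJMAX [HEADROOM]   (HEADROOM defaults to 3)
temp_target() {
    tjmax=$1
    headroom=${2:-3}
    # Refuse a headroom that would put the target at/above TjMax or at/below 0.
    if [ "$headroom" -lt 1 ] || [ "$headroom" -ge "$tjmax" ]; then
        echo "headroom must be between 1 and TjMax-1" >&2
        return 1
    fi
    echo $(( tjmax - headroom ))
}

temp_target 100    # prints 97
```

The result can then be passed along, for example: sudo undervolt --temp "$(temp_target 100)"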

Armed with target temp, mine being 97C, we can use the undervolt utility listed under the Verify section above.

sudo undervolt --temp 97

We can check it now:

$ sudo undervolt --read
temperature target: -3 (97C)
core: 0.0 mV
gpu: 0.0 mV
cache: 0.0 mV
uncore: 0.0 mV
analogio: 0.0 mV
powerlimit: 78.0W (short: 0.00244140625s - enabled) / 45.0W (long: 96.0s - enabled)

2. Undervolting

Now that my CPU ramps up to 97C, I went from 2700MHz to 3400MHz across all cores! However, this is still a far cry from its rated 4.4GHz turbo setting. And it only lasts about 10 seconds before it throttles pretty quickly down to 1500MHz, then climbs back up to 3400MHz again. The reason is that our CPU is running at full voltage, which is hot. Intel processors run with more voltage than they need in order to account for unstable or inaccurate system voltage regulation.
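
If you want a quick number for the all-core frequency without opening s-tui, a sketch like the following may help. It assumes the "cpu MHz" lines that x86 Linux kernels print in /proc/cpuinfo; avg_mhz is a hypothetical helper name.

```shell
# Average the per-core "cpu MHz" values read from stdin (cpuinfo format).
avg_mhz() {
    awk -F: '/^cpu MHz/ { sum += $2; n++ }
             END { if (n) printf "%.0f\n", sum / n }'
}

# Typical use while a stress test is running (only if the file exists):
[ -r /proc/cpuinfo ] && avg_mhz < /proc/cpuinfo
```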

To address this, I used undervolt to find a safe setting for undervolting. Here are my settings I found to be stable for the E-2176M:

sudo undervolt --temp 97 --core -150 --cache -150 --gpu -100 --uncore -100

And checking it's all set correctly:

$ sudo undervolt --read
temperature target: -3 (97C)
core: -150.39 mV
gpu: -99.61 mV
cache: -150.39 mV
uncore: -99.61 mV
analogio: 0.0 mV
powerlimit: 78.0W (short: 0.00244140625s - enabled) / 45.0W (long: 96.0s - enabled)
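
Note that the values read back (-150.39 and -99.61) are not exactly what was requested. A plausible explanation, and an assumption on my part, is that the offsets are stored in steps of 1/1024 V, so a request gets rounded to the nearest step. The sketch below reproduces the numbers above under that assumption; quantize_mv is a made-up name.

```shell
# Round a requested offset in mV to the nearest 1/1024 V step and print
# the resulting offset in mV (assumed storage granularity: 1/1024 V).
quantize_mv() {
    awk -v mv="$1" 'BEGIN {
        x = mv * 1.024                                   # mV -> 1/1024 V steps
        steps = (x < 0) ? int(x - 0.5) : int(x + 0.5)    # round to nearest
        printf "%.2f\n", steps / 1.024                   # steps -> mV
    }'
}

quantize_mv -150    # prints -150.39
quantize_mv -100    # prints -99.61
```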

With these settings, I am connected to two Thunderbolt 3 docking stations, 3 1080p monitors, 5 USB external accessories, Brave browser open with about 29 tabs, and a couple of terminals on Pop!_OS.

I ran the s-tui stress test for about 3 hours straight, while using the Brave browser, watching YouTube, and general surfing. Zero issues.

All cores now hover around 3900MHz to 4000MHz, much closer to that 4.4GHz turbo, at 35W of usage. It would still drop after a minute or two, but it only drops to 2200MHz or 2400MHz now, which is much better than the previous low.

Your mileage may vary. Adjust the voltages 20mV at a time.
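
One way to organize that search is to list the offsets you intend to try and stress test at each one before moving on. This is a minimal sketch, not a tuning tool; undervolt_steps is a hypothetical helper and the goal of -150 is simply my own stable value.

```shell
# Print the offsets (in mV) to try, one per line, stepping down to the goal.
# Usage: undervolt_steps GOAL [STEP]   (GOAL is negative, STEP defaults to 20)
undervolt_steps() {
    goal=$1
    step=${2:-20}
    mv=0
    # Step down while the next step is still short of the goal.
    while [ $(( mv - step )) -gt "$goal" ]; do
        mv=$(( mv - step ))
        echo "$mv"
    done
    echo "$goal"    # always finish exactly on the goal
}

undervolt_steps -150
```

At each printed value you would apply it (for example with sudo undervolt --core and --cache set to that value), run a stress test, and back off to the last good value if the system becomes unstable.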

Persist it all across reboots

You'll want to read up on Undervolt's github site for how to persist it with a systemd service. While I do use it, and my undervolting remains, my max temp isn't sticking yet across all reboots. It's hit or miss, more likely a race condition with another service on startup. I'll set up the timer as described in the Undervolt instructions later.
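
Until then, one possible workaround for the race is to re-apply the target and check that it stuck. The sketch below is untested against the actual boot race and is only an idea: both function names are hypothetical, it must run as root, and the parser assumes the "temperature target: -3 (97C)" line format shown in the output above.

```shell
# Extract the absolute temperature target (e.g. 97) from `undervolt --read`
# output supplied on stdin.
parse_temp_target() {
    sed -n 's/^temperature target: -[0-9]* (\([0-9]*\)C)$/\1/p'
}

# Apply the target and verify it by reading it back, retrying a few times.
# Usage: ensure_temp_target TARGET [TRIES]   (TRIES defaults to 5)
ensure_temp_target() {
    want=$1
    tries=${2:-5}
    while [ "$tries" -gt 0 ]; do
        undervolt --temp "$want"
        sleep 2
        have=$(undervolt --read | parse_temp_target)
        [ "$have" = "$want" ] && return 0
        tries=$(( tries - 1 ))
    done
    return 1
}
```

A service or timer could then call ensure_temp_target 97 some time after boot instead of calling undervolt once.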

Enjoy!

@jdchristensen

@pglpm rdmsr also gives 100 on my X1E4, but undervolt --read gives

# undervolt --read
temperature target: -4 (96C)
...

which seems good.

@jdchristensen

By the way, after upgrading to Ubuntu 22.04, I'm getting much better thermal behaviour. I'm running tlp, and I added the setting RUNTIME_PM_ON_AC=auto so that power management of devices (including the Nvidia GPU) happens on AC like when on battery. With this setting, all of the powertop tunables are in a good state. I also run thinkfan. The result is that my fan very rarely comes on at all when browsing the web, doing email, etc, and the machine stays very cool!

@pglpm

pglpm commented Jun 22, 2022

Thank you @jdchristensen , very valuable information! I'm new to this X1E4 and also to a Linux OS, so I'm struggling a little to understand settings and how to change them. Unfortunately much information on the web seems either outdated or above my head. My problem has been the opposite of thermal throttling: the laptop gets extremely hot (especially around the lower-right side of the keyboard) when on AC. On the other hand it's cool and silent when on battery, and yet still powerful. I'll try the thermal repasting that you mentioned above (though a bit worried as I've never done it before). But in the meantime I'll try to lower the AC performance a little to see what happens. I didn't know that tlp could do that, thank you for the tip, that's great!

May I ask if you did any special tweaks with thinkfan or just used the default settings?

@jdchristensen

I highly customized thinkfan, but I don't want to share my file as I think the best settings will depend on the thermal characteristics of your laptop. One thing I did is have thinkfan control the fans in just three steps: level 0 (off), level 1 (lowest) and level auto (let the BIOS do what it thinks is best). That way it's unlikely that using thinkfan will result in overheating or bad performance. (I also patched my kernel so that level 1 turns on only the right fan, but that's a hack that I don't want to share.)

@pglpm

pglpm commented Jun 22, 2022

I set RUNTIME_PM_ON_AC=auto and also RUNTIME_PM_DRIVER_DENYLIST="mei_me" as was suggested in the tlp explanations of the configuration entries, and the heat problem has gone! It had bothered me for a month. Thank you jdchristensen and eduncan911 for the tips and for hosting this useful Readme!

@jdchristensen

Glad to hear it! I'm curious why you needed to set RUNTIME_PM_DRIVER_DENYLIST. The default is mei_me nouveau radeon, so unless you are using nouveau or radeon, it shouldn't make a difference.

@pglpm

pglpm commented Jun 23, 2022

Simple reason: my ignorance. Indeed that setting doesn't matter in my case. My ignorance is the reason why I try to gather as much information as possible and ask in places like this before making changes.

I want to confirm that RUNTIME_PM_ON_AC=auto has completely solved that excessive heat problem; I've been testing this for a day now.
Just wonderful!
