Thermal issues when compiling big things #9
Replies: 24 comments
-
@jglathe But the thermal sensors don't work. Any progress on this?
-
I found that the best cooling option is to turn it upside down, as the heated air can then leave the case through the ventilation holes on the bottom side.
-
It appears to be intermittent; no real handle on it yet. It seems to be a timing issue connected to the userspace services and the coprocessors. Sometimes it will cool diligently as soon as the load goes up, starting silent and gradually increasing over time. Sometimes it won't, and does only the emergency blasts, which may not be enough. Documentation is still not really there.
-
@jglathe you wrote that the temp sensors work fine in kernel version 6.5. And maybe we can also enable USB MP.
-
Both these commits are in the 6.6 tree, too. Without the latter, there are no USB-A ports on the wdk.
-
I know that both commits are in the tree, but the channel renaming isn't in sc8280xp-microsoft-dev-kit-2023.dts; this one I did check.
-
So it looks like a bug in the Makefile dependency for CONFIG_QCOM_TSENS.
-
And it is really stubborn about ending up not set on 6.6.
-
Okay, I'll amend this.
-
@jglathe
-
Yeah, I've seen that, and adding it just gives "unexpected data" but no change. Removing the dependency on NVMEM_QCOM_QFPROM from Kconfig actually helps. And this QFPROM option appears to be independent of it, or to have vanished from menuconfig, too.
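For anyone who wants to check quickly whether the option survived a config update, here is a minimal sketch in Python. It is not taken from this thread: the helper name `config_state` and the sample config text are made up for illustration, and it only inspects config text, it does not explain why the option gets dropped.

```python
import re


def config_state(config_text: str, option: str) -> str:
    """Return the value assigned to a Kconfig option, or 'not set'.

    Matches lines of the form OPTION=value at the start of a line, so a
    longer option name sharing the same prefix is not picked up by accident.
    """
    pattern = re.compile(rf"^{re.escape(option)}=(.*)$", re.MULTILINE)
    match = pattern.search(config_text)
    if match:
        return match.group(1).strip()
    return "not set"


# Hypothetical sample, only to show both outcomes.
sample = """\
CONFIG_NVMEM_QCOM_QFPROM=m
# CONFIG_QCOM_TSENS is not set
"""

for opt in ("CONFIG_QCOM_TSENS", "CONFIG_NVMEM_QCOM_QFPROM"):
    print(opt, "->", config_state(sample, opt))
```

On a real system the text would come from the `.config` in the build tree, or from wherever the distribution stores the running kernel's config (often a `config-<version>` file under `/boot`; that location is an assumption about the distro, not something confirmed here).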
-
@xlazom00 thank you for the debugging. Looks like it's working now.
Will test a bit, then probably close it. We could try to get ath11k_hwmon to do something useful, too, but I guess soundwire over DP would be of greater benefit. Or the VM support with gunyah (if possible at all).
-
ath11k_hwmon? What is it a thermal monitor for?
-
Docker works, yes. I thought I could get /dev/kvm somehow, but that's unlikely. Although if you need a VM for (whatever), it would be cool to have it. It's research, like this whole thing.
-
ath11k_hwmon is for the QCNFA765 wireless adapter. On Windows it loads a whole bunch of stuff to behave, among other things a "thermal mitigation driver". I guess if you hammer it with bandwidth you can get it pretty hot.
-
Hmm, I've done my test with sdbox2. It's running kernel 6.6.2 with Lunar (23.04) now, from a SATA SSD. So, a state-of-the-art Lunar setup, enough to test a large compile load. And the fan is scaling smoothly: lots of reserves, no emergency blasts. Now, any ideas on how to go about debugging this on 23.10? I mean, we got our QCOM_TSENS devices back, that's something. But I'd like to understand this further. On 23.10, we also have qrtr-ns and pd-mapper running as services. Any help or ideas would be appreciated.
-
Oh look, the ath11k temp also works on 23.04.
-
maybe related to this
-
Nah. That's my debug message 😃 It says your board-2.bin doesn't contain the calibration profile for this combination. I hacked up a board file that contains the X13s data as these… will be overwritten by linux-firmware when it updates
-
A strange tale of recovery images, internal SSDs and power management behaviour on Linux
What a weekend. I think I narrowed down the odd behaviour of my wdk regarding cooling. One (the first one bought, and by now looking thoroughly used) has been opened to replace the internal SSD with larger models (a 2TB Micron 2400 as of recently). The other one, sdbox2, was bought this summer via ebay and looked quite pristine despite being sold as used in non-original packaging. The SSD statistics confirmed that it was barely used. Anyway, sdbox2 got a recovery image treatment and has never been opened to replace the SSD yet.
Linux is usually booted from an external USB SSD enclosure, with GRUB on the external SSD, which works nicely and can also boot the Windows on the internal SSD. On sdbox2, the fan behaviour is smooth, and cooling is efficient and mostly silent, with enough headroom for peaks (I guess). No full blasts.
Findings
On the root cause I can only speculate; I guess it's some sort of signing / hardware signature thing. But there definitely is a connection between the SSD contents as created by the recovery image and the operation of the power management.
So, beware. The first step after an internal SSD change seems to be the recovery treatment, to ensure you get working power management on Linux. Fascinating.
-
You know what affects
-
Update re 23.10 and thermal management: the same applies as described in the cautionary tale of recovery images. 23.10 on USB boot, with core temps thanks to CONFIG_QCOM_TSENS being enabled, behaves well now.
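To illustrate what "core temps" means in practice: once the thermal drivers are loaded, the kernel exposes each sensor as a thermal zone in sysfs, with a `type` file (the zone name) and a `temp` file (millidegrees Celsius). The following is a rough sketch that lists them, not a script from this thread. The function name `read_thermal_zones` is made up, and which zone names show up on the wdk is not something this sketch assumes.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return a list of (zone_type, degrees_celsius) for all readable zones.

    The kernel reports `temp` in millidegrees Celsius. Zones that cannot be
    read (sensor not ready, permission problems) are skipped.
    """
    readings = []
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue
        readings.append((name, millideg / 1000.0))
    return readings


if __name__ == "__main__":
    for name, celsius in read_thermal_zones():
        print(f"{name:24s} {celsius:6.1f} C")
```

If the list comes back empty, that would be consistent with the sensor driver not being built or not probing, which is the situation discussed earlier in the thread.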
-
Update to add: If you fail to mount the EFI partition in Linux for whatever reason, you're fscked. The hardware binding is broken at that point. Still no idea what the reason is, but the WDK has an EC from Microsoft, and it might check a thing or two. To restore the binding you need to boot into Windows, which appears to repair or reset it.
-
For a quick check of the temperatures, I use this script:
-
Not the first time I've seen this, but it's a stability issue: the wdk thermal profile appears not to be attuned to the wdk. Of course, it's taken from the x13s and not amended yet. Usually you won't notice, but when compiling bigger packages like the Linux kernel or llvm (for the gunyah docker image), you get crashes with a read-only filesystem and other ones. Usually no hard crashes, but it's not far from that.
To avoid these issues, my recommendation for now is to reduce the number of cores/threads used for build work to 4. This has the additional effect of using only the high performance cores of the SoC. Of course it increases build times, but the box stays stable that way.
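As a small sketch of that recommendation, here is one way to cap the job count when driving a build from a script. The helper names `build_jobs` and `make_command` are made up for illustration, and the `all` target is just a placeholder. Only `make -jN` itself is standard.

```python
import os


def build_jobs(limit: int = 4) -> int:
    """Return the number of parallel build jobs, capped at `limit`."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(available, limit))


def make_command(target: str = "all", limit: int = 4) -> list[str]:
    """Build an argument list for make with the capped job count."""
    return ["make", f"-j{build_jobs(limit)}", target]


if __name__ == "__main__":
    # Prints the command instead of running it, e.g. for pasting into a shell.
    print(" ".join(make_command()))
```

Note that `-j` only limits how many jobs run at once; it does not by itself pin them to particular cores. Pinning would need something like `taskset` or `os.sched_setaffinity`, and which CPU numbers map to the performance cores is not stated in this thread, so that part is left out.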