Laptop freezes when starting X11 and discrete graphics are OFF #764
Comments
If you run:
What is the selected config? It should be /usr/lib/nvidia/bumblebee. Does the same problem happen if you choose /usr/lib/mesa-diverted instead? Finally, do you have another DE to try (Gnome would be best) to help narrow it down? |
I've been using |
/usr/lib/nvidia/bumblebee is the right one (default) when having bumblebee, I wanted to see if removing all traces of nvidia from the path helped. It is really strange that X is affected by bumblebee when not running through it. Can you get to another TTY when the screen is frozen? Don't bother with GDM for now if it's a hassle, was just trying to narrow it down. I'll install xfce on my sid partition and see what happens. |
I think this is an issue specific to my hardware setup (as discrete graphics cannot be forced on, Optimus must be used). When I say 'the screen is frozen', the TTY I am in (I'm manually starting a display manager) stops responding (the cursor stops blinking). I can't switch to another TTY. Even the keyboard caps lock/numlock lights no longer change when I press them, and the SysRq keys no longer work either. The system has to be force powered off. |
I just double checked, but ssh sessions freeze too when this occurs. |
A kernel hard-lock then, that's a pain. Have you tried nouveau? |
Maybe nouveau is already loaded and causes the hang because something doesn't work and Xorg freezes due to a messed-up modesetting DDX? |
With the bumblebee-nvidia package nouveau is blacklisted, so it can't be loaded. |
and I hope nvidia is also blacklisted, but Xorg freezes and that usually happens for a bad reason. My guess is: X loads the nvidia DDX, which autoloads the nvidia kernel driver. |
Yes, all the kernel modules are blacklisted. And the nvidia libraries are out of the path (hence my question earlier about update-alternatives). |
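One way to double-check the blacklisting claim is to scan the modprobe configuration directly. The following is a minimal sketch; the function name `list_blacklisted` and the default directory list are illustrative assumptions, not something bumblebee ships.

```shell
#!/bin/sh
# Sketch: report which kernel modules are blacklisted in modprobe config.
# list_blacklisted and its default directories are hypothetical choices.

list_blacklisted() {
    # $@: directories to scan; fall back to the usual modprobe.d locations
    [ "$#" -gt 0 ] || set -- /etc/modprobe.d /usr/lib/modprobe.d /lib/modprobe.d
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # Print module names from uncommented "blacklist <module>" lines
        cat "$dir"/*.conf 2>/dev/null |
            awk '$1 == "blacklist" { print $2 }'
    done | sort -u
}

list_blacklisted | grep -E '^(nouveau|nvidia)' ||
    echo "no nvidia/nouveau blacklist entries found"
```

A blacklist entry only prevents automatic loading, so `lsmod | grep -E 'nvidia|nouveau'` is still worth running to see what is actually loaded.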
I've dealt with so many users where something was messed up that I wouldn't rely on anything here. And that nvidia gets loaded also explains why turning the GPU off helps. In fact, for that the nvidia libraries don't need to be in the path, because the nvidia DDX is already enough, and different paths are used for it. Anyhow, without logs it will be painful to debug this. |
I've tried with nouveau and I still see the same issue (but with the workaround (which worked under nouveau) I started to see some weird behavior like some CPU cores sticking at 100%). Also when running optirun I got some permission denied errors with nouveau. I'm not sure if this will help though. Just to clarify, simply turning the discrete video card ON with bbswitch before starting X11 fixes my issue (but it is a hassle to deal with every time). I'm not sure if there are any ways for me to get logs in this situation, but if there are let me know. When I run startx, the screen freezes before any errors come up, so I'm not sure if there is much I can do. bumblebee blacklists all the nvidia/nouveau modules by default, and I have nvidia set in bumblebee.conf, so I think nouveau isn't conflicting? If there is any way to test this I would be happy to do so! |
Well, you don't use bumblebee with nouveau, and that support should be removed from bumblebee |
@jgkamat what really would help would be the dmesg output. Maybe you can do "dmesg -w" through ssh while you start X and see if you get enough useful output this way. |
If dmesg can write it, so will journalctl. If you haven't, enable persistent journal (create /var/log/journal) and then after the freeze reboot and check the previous boot journal with journalctl -b -1 |
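The persistent-journal suggestion above can be sketched as follows. The helper name `enable_persistent_journal` is hypothetical, and it takes the directory as an argument only so it can be tried against a scratch path; on a real system the directory is `/var/log/journal`, and journald may need a restart (or a reboot) before it starts writing there.

```shell
#!/bin/sh
# Sketch: create the directory that makes the systemd journal persistent.
# enable_persistent_journal is an illustrative helper, not a systemd tool.

enable_persistent_journal() {
    # $1: journal directory (the real one is /var/log/journal)
    if [ -d "$1" ]; then
        echo "persistent journal already enabled: $1"
    else
        mkdir -p "$1" && echo "created $1"
    fi
}

# As root:
#   enable_persistent_journal /var/log/journal
# After the next freeze and reboot, read the previous boot's kernel messages:
#   journalctl -b -1 -k
```

As the following comments point out, a hard lock can stop the kernel before anything reaches the disk, so an empty previous-boot journal does not rule out a kernel problem.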
@bluca His machine crashes completely. And on a crash, error logs usually can't be written anymore, because the kernel has stopped doing anything. dmesg -w could help us because it immediately displays messages (even before they get written to disk), but if the network dies too fast, he wouldn't get this either and would need to set up netconsole, although this also requires a working network. @jgkamat maybe you have something inside pstore (/sys/fs/pstore); check here for pstore information: https://lwn.net/Articles/434821/ |
I tried setting up a netconsole (and dmesg -w over ssh) and that doesn't seem to give me any logs before the freeze either. I don't have anything currently inside pstore as far as I can tell. I'm starting to think that this is some sort of race condition where bumblebee tries to turn on the nvidia driver before X starts, but X manages to start before the nvidia card comes online, leading to a lockout (or maybe my hardware can't deal with xorg starting without the nvidia card being on). (running |
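Checking pstore after a forced reboot could look like the sketch below. The function name `show_pstore` and its optional directory argument are assumptions made for illustration; whether any records exist depends on the firmware and on a pstore backend being enabled in the kernel.

```shell
#!/bin/sh
# Sketch: list crash records left in pstore by a previous boot.
# show_pstore is a hypothetical helper; the default path is from the thread.

show_pstore() {
    dir="${1:-/sys/fs/pstore}"
    if [ ! -d "$dir" ]; then
        echo "pstore not available at $dir"
        return 0
    fi
    count=0
    for f in "$dir"/*; do
        [ -e "$f" ] || continue   # skip the unexpanded glob when the directory is empty
        count=$((count + 1))
        echo "== $f"
        head -n 20 "$f"           # first lines of each record
    done
    echo "$count pstore record(s) found"
}

show_pstore
```

Reading `/sys/fs/pstore` normally requires root, so an unprivileged run may report zero records even when some exist.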
@jgkamat could you add a xorg.conf file in /etc/X11 with this content and start X while the gpu is off? https://gist.github.com/karolherbst/1f1bdd1a3822df74097f and check if your nvidia card also has the 01:00.0 address in lspci. If this works, that means something is loaded which makes your kernel unhappy. |
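To confirm the bus address asked about above, the `lspci` output can be filtered for the NVIDIA display device. This is a sketch only; `nvidia_gpu_address` is a made-up helper name. Note that later comments in this thread report hard locks when `lspci` runs while the card is off, so it is safer to run it with the card powered on.

```shell
#!/bin/sh
# Sketch: print the PCI address of the NVIDIA GPU from `lspci` output on stdin.
# nvidia_gpu_address is an illustrative name, not part of pciutils.

nvidia_gpu_address() {
    # Display-class lines read "VGA compatible controller" or "3D controller"
    awk '/NVIDIA/ && (/VGA compatible controller/ || /3D controller/) { print $1 }'
}

# Only query the hardware when lspci is installed (and the card is ON).
if command -v lspci >/dev/null 2>&1; then
    addr=$(lspci | nvidia_gpu_address)
    echo "NVIDIA GPU address: ${addr:-not found}"
fi
```

If the printed address is not `01:00.0`, the BusID in the test xorg.conf would need to be adjusted to match.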
Unfortunately, I'm still seeing the same issue with this config. Just to be sure, I created a new xorg.conf file (as the docs say that none should be present) with that config. My Nvidia card is on that bus. Here's the output of lspci, if that helps:
Should that file have gone in |
I have a Clevo P650RA/P651RA (and also access to a Clevo P670RA/P671RA) which both have GTX 965M cards as well. This issue could be related to Bumblebee-Project/bbswitch#115 In my case an infinite loop would occur in ACPI. See Bumblebee-Project/bbswitch#115 (comment) for more details if you are interested. |
I'm not seeing any issues with suspend to the best of my knowledge (the video card is off before/after a sleep, according to bbswitch, and that works fine for me). These issues could be related though. I'm honestly pretty stoked at how well this performs (with this workaround in place), but I'm worried that a slight change could break it more. I'm happy to provide any more information if that would help! EDIT: My laptop is a CLEVO N155RF (sager just rebrands them?) |
I've been having the exact same issue with my MSI GE62. If I start X11 with the 960M turned off it will do a hard lock. But if I turn it on first and then start X11, it works fine. I should also note that with Gnome, GDM will start fine with the 960M turned off. But once I enter my password to log in to Gnome, it will do a hard lock. I presume this is because GDM is using Wayland? |
@jkehler : I'm having the exact same behavior with the same model, except I have a 970M |
Actually I had just realized I had never actually tried starting Gnome with Wayland instead of X11 to see if it hard freezes. I just tried it now and when using Wayland it worked fine with the 960M turned off. So it definitely appears to just be an issue with X11. |
I've had a couple of random freezes too. Most of the time, they are triggered by some 'low level' operations, or things involving the graphics card (e.g. starting steam, modprobes, even lspci once). This is usually accompanied by some audio garbling for some reason (before hard faulting). If I enable the discrete graphics card via bbswitch then I never have this issue, however. This is my xorg version, if that helps. I've never tried out wayland, and I don't have the time to test this right now, but if I ever do, I'll post an update here. Isn't wayland supposed to eliminate the need for bumblebee? I'm still fuzzy on that topic though...
|
I think X is not much of an issue, but a trigger. Can you switch to a TTY (Ctrl-Alt-F2), log in and try to power off/on the card manually using bbswitch? Repeat this twice to see if it makes a difference.
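The manual power-cycling test can be scripted roughly as below. This is a sketch under assumptions: `cycle_dgpu` is a hypothetical helper, and `BBSWITCH` is made overridable only so the loop can be tried against a scratch file. The `/proc/acpi/bbswitch` path and the `ON`/`OFF` values are the ones used elsewhere in this thread.

```shell
#!/bin/sh
# Sketch: toggle the discrete GPU through bbswitch and print the state after
# each write. cycle_dgpu is an illustrative name, not a bbswitch tool.
BBSWITCH="${BBSWITCH:-/proc/acpi/bbswitch}"

cycle_dgpu() {
    rounds="${1:-2}"
    i=1
    while [ "$i" -le "$rounds" ]; do
        for state in OFF ON; do
            echo "$state" > "$BBSWITCH" || return 1
            echo "round $i, wrote $state, now: $(cat "$BBSWITCH")"
        done
        i=$((i + 1))
    done
}

# As root, from a TTY (Ctrl-Alt-F2), with no GPU driver loaded:
#   cycle_dgpu 2
```

If the reported state does not follow the value written, a loaded GPU kernel module is a likely reason, since bbswitch will not power the card down while a driver is bound to it.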
If that still does not hang, try this (exact output does not matter, only whether it hangs or not):
My guess is that trying to access some PCI configuration registers too fast results in failure. Why exactly this happens is something I have been trying for a week to figure out on a Clevo P651RA/GTX965M. Current key words: PCIe link training failure. |
Hello @Lekensteyn However, I've found that if I disable the discrete card at boot with bbswitch, the system won't boot properly; on loading gnome, it freezes; visual artifacts may appear in the console at the instant of the freeze, and nothing but the power button responds. All this while being on the integrated intel card. Warp |
@Lekensteyn I finally got around to trying what you had suggested above. Switching to a TTY and repeatedly turning the GPU on and off did not result in any sort of hard lock for me. But when I ran your second set of commands the first one outputted the following.
The 2nd command didn't output anything. But then I ran the first command a 2nd time and it resulted in a hard-lock for me. |
How to do it. I'm just new to this business. Even with the correct spelling everything hangs. P.S. I'm Russian, by the way. I don't understand at all why it freezes at the login window. I would like to make |
Pretty obvious. |
Same issue here - Lenovo ThinkPad P53 with NVIDIA Quadro RTX 3000. I tried both |
Anybody up for trying this patch? https://lists.freedesktop.org/archives/dri-devel/2019-October/240521.html In case it doesn't help, I'd like to see your "lspci -nn" and "lspci -tv" output |
Hi @karolherbst, I'll check if your patch already made it to archlinux; in the meantime, this is the output of the commands you asked for:
# lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec4] (rev 0d)
00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 0d)
00:02.0 VGA compatible controller [0300]: Intel Corporation UHD Graphics 630 (Mobile) [8086:3e9b] (rev 02)
00:04.0 Signal processing controller [1180]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903] (rev 0d)
00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model [8086:1911]
00:12.0 Signal processing controller [1180]: Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379] (rev 10)
00:14.0 USB controller [0c03]: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d] (rev 10)
00:14.2 RAM memory [0500]: Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f] (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368] (rev 10)
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 [8086:a369] (rev 10)
00:16.0 Communication controller [0780]: Intel Corporation Cannon Lake PCH HECI Controller [8086:a360] (rev 10)
00:1b.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 [8086:a340] (rev f0)
00:1c.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 [8086:a338] (rev f0)
00:1c.5 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 [8086:a33d] (rev f0)
00:1c.7 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 [8086:a33f] (rev f0)
00:1d.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 [8086:a330] (rev f0)
00:1e.0 Communication controller [0780]: Intel Corporation Device [8086:a328] (rev 10)
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a30e] (rev 10)
00:1f.3 Audio device [0403]: Intel Corporation Cannon Lake PCH cAVS [8086:a348] (rev 10)
00:1f.4 SMBus [0c05]: Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323] (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller [8086:a324] (rev 10)
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (7) I219-LM [8086:15bb] (rev 10)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU106GLM [Quadro RTX 3000 Mobile / Max-Q] [10de:1f36] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation TU106 High Definition Audio Controller [10de:10f9] (rev a1)
01:00.2 USB controller [0c03]: NVIDIA Corporation TU106 USB 3.1 Host Controller [10de:1ada] (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU106 USB Type-C Port Policy Controller [10de:1adb] (rev a1)
02:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
04:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
05:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
05:01.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
05:02.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
05:04.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
06:00.0 System peripheral [0880]: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] [8086:15eb] (rev 06)
2c:00.0 USB controller [0c03]: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] [8086:15ec] (rev 06)
2d:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] [8086:15ef] (rev 06)
2e:02.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] [8086:15ef] (rev 06)
2e:04.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] [8086:15ef] (rev 06)
2f:00.0 USB controller [0c03]: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018] [8086:15f0] (rev 06)
52:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6 AX200 [8086:2723] (rev 1a)
54:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a] (rev 01)
55:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
# lspci -tv
-[0000:00]-+-00.0 Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers
+-01.0-[01]--+-00.0 NVIDIA Corporation TU106GLM [Quadro RTX 3000 Mobile / Max-Q]
| +-00.1 NVIDIA Corporation TU106 High Definition Audio Controller
| +-00.2 NVIDIA Corporation TU106 USB 3.1 Host Controller
| \-00.3 NVIDIA Corporation TU106 USB Type-C Port Policy Controller
+-02.0 Intel Corporation UHD Graphics 630 (Mobile)
+-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem
+-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model
+-12.0 Intel Corporation Cannon Lake PCH Thermal Controller
+-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller
+-14.2 Intel Corporation Cannon Lake PCH Shared SRAM
+-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0
+-15.1 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1
+-16.0 Intel Corporation Cannon Lake PCH HECI Controller
+-1b.0-[02]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
+-1c.0-[04-51]----00.0-[05-51]--+-00.0-[06]----00.0 Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018]
| +-01.0-[07-2b]--
| +-02.0-[2c]----00.0 Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018]
| \-04.0-[2d-51]----00.0-[2e-51]--+-02.0-[2f]----00.0 Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018]
| \-04.0-[30-51]--
+-1c.5-[52]----00.0 Intel Corporation Wi-Fi 6 AX200
+-1c.7-[54]----00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader
+-1d.0-[55]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
+-1e.0 Intel Corporation Device a328
+-1f.0 Intel Corporation Device a30e
+-1f.3 Intel Corporation Cannon Lake PCH cAVS
+-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller
+-1f.5 Intel Corporation Cannon Lake PCH SPI Controller
\-1f.6 Intel Corporation Ethernet Connection (7) I219-LM |
yeah.. that sounds like a system which is affected according to my current theory. What laptop is that? |
I'm running on Lenovo ThinkPad P53, I slowly start regretting this choice... |
@karolherbst I have a laptop with Kaby Lake + Pascal so I'd be up to test that patch. But what bug does it fix exactly ? Do you have reproduction steps ? |
it fixes D3cold with the Intel 0x1901 pcie bridge controller, so that the GPU can be powered on again. |
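A quick way to see whether a machine has the bridge in question is to look for the `8086:1901` ID in `lspci -nn` output, as in the listing posted above. The sketch below assumes a hypothetical helper name, `has_1901_bridge`; finding the bridge only shows that the hardware matches, not that the patch will help.

```shell
#!/bin/sh
# Sketch: detect the Intel PCIe bridge with ID 8086:1901 in `lspci -nn`
# output read from stdin. has_1901_bridge is an illustrative name.

has_1901_bridge() {
    grep -q 'PCI bridge.*\[8086:1901\]'
}

# Query the live system only when lspci is installed (card powered ON).
if command -v lspci >/dev/null 2>&1; then
    if lspci -nn | has_1901_bridge; then
        echo "8086:1901 bridge present"
    else
        echo "8086:1901 bridge not found"
    fi
fi
```

A saved listing can be checked the same way, for example `has_1901_bridge < lspci-nn.txt`, which avoids a PCI scan on machines that lock up when the card is off.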
Seems to be the same issue here: laptop freezes on powering down for suspend and waking up from it. Sometimes it unfreezes after 3-4 minutes, but most of the time it's permanent. MSI gl65 9sdk |
Welcome to a reality which has been a nightmare for many since a while sadly, Intel + NVidia is really not fun sometimes when using GNU/Linux. |
@karolherbst I've read your patch and it sounds like my system is affected by the D3 error. |
I wouldn't include this fix unless someone can be sure it doesn't break anything; there are some upstream discussions going on. And maybe we get something merged for 5.5. In the end it's up to you to make a decision on that though. |
@karolherbst So it's part of the kernel, which means we have to build a custom one to try the patch? |
@karolherbst However, I don't have another laptop to test if this works on others. |
@Leo1003 Do you think your fix can be easily reproduced on other laptops? |
My XPS 15 9570 freezes completely after starting X.Org (even in a LiveCD environment). Tried all the different distros, Ubuntu, Fedora, Arch, even GParted LiveCD, all exhibit the same behaviour. Would this patch still be relevant in my case since my laptop is not Skylake-based? (it's 8th gen) |
I don't have time to write a detailed tutorial right now.
1st duplicate path
2nd duplicate path
P.S. Are these some OEM patches??? The two ACPI methods with the same path are located in two different SSDT tables. Try to merge them into one ACPI method. The AML code is different on each laptop, so make sure you use the code extracted from your own laptop. Also, it may change after a BIOS update, so be sure to check it! Update [11/28 03:11 GMT]: Correct the path of the duplicate methods [off-topic]: Power management of the card no longer works because of a Linux kernel 5.x bug that keeps the NVIDIA audio controller powered on. I am very frustrated... |
Will try to inject this SSDT in OpenCore and boot to Linux |
I have had problems with getting Bumblebee to work on Asus GL753VE related to system fan becoming out of control and requiring a full system poweroff after bbswitch powers down the GPU ( same problem as @cdbrendel ). Now after many failures and research I have a solution with which I am able to get bumblebee & primusrun to work as usual/as expected on this hardware using Debian Unstable amd64, so I will document it here in case anyone needs it. Solving the issue consisted of understanding and overcoming 4 sub-problems:
So to get bumblebee to work:
Additionally to prevent conflicts the nouveau xorg driver was uninstalled and the nouveau kernel module blacklisted. My full working configs: /etc/X11/xorg.conf.d/20-intel-gpu.conf
/etc/bumblebee/bumblebee.conf (the kernel driver name may be debian-specific,
update-alternatives --config glx
relevant packages:
Result:
No kernel boot options required, no problems with touchpad/freezing/etc. And most importantly no uncontrollable fan problems. I am going to stick with this setup until xorg with patches to make nvidia's official offloading work is released ( as described in https://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/primerenderoffload.html ). UPDATE 2020-05-14 |
@x-qq Thank you for sharing your experience and solution (it doesn't affect me as I have a different setup, but it should help others). Just wanted to add that the nvidia's xorg patches are actually in the already released xorg 1.20.6 version. This is the current version for Fedora 31 (before that, Fedora actually added these patches in the default build, so it was working anyway). I have no clue about Debian though. |
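For reference, NVIDIA's render offload (described in the README linked above) is driven by two environment variables. The wrapper below is a minimal sketch: the name `prime_run` is an illustrative choice, and it assumes a driver and X server new enough to support offloading.

```shell
#!/bin/sh
# Sketch: run a single program on the NVIDIA GPU via PRIME render offload.
# prime_run is a hypothetical wrapper name; the two variables are the ones
# documented in NVIDIA's PRIME render offload README.

prime_run() {
    env __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"
}

# Example (needs glxinfo installed):
#   prime_run glxinfo | grep "OpenGL vendor"
```

Running the same `glxinfo` command without the wrapper should report the integrated GPU's vendor, which gives a simple before/after comparison.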
@karolherbst I've applied your patch on a custom lts kernel according to the official documentation and booted on it.
Many things could have gone wrong: I can give you any other output if needed. |
Hi All, Looks like we can add: cd /sys/class/dmi/id && grep . bios_* I have tried every combination in #764 (comment) and it looks like any kind of PCI scan like lspci or even logging out of KDE or an SDDM shutdown (after first login) results in a hard freeze when the card is off. I'm able to boot Debian Buster, load and unload all kernel modules, turn off/on Nvidia GPU and start and stop all processes successfully with this workaround: But the card must be ON to shutdown or run lspci (more than once). Interestingly the default bumblebee-nvidia install in Debian Buster pulls in nvidia-persistenced so bumblebee is not able to unload drivers or turn the card off, regardless of any bumblebee config. It turns out that this is not such a bad setup since nvidia-persistenced drops the card to its lowest power mode (after a short time) and at a tty the laptop is averaging around 7.7 W compared to 6.5 W with everything unloaded and the card off. Plus this setup is safer since the card is on and anything triggering a PCI scan will not result in a freeze for this laptop. I'll write more on this elsewhere when I've done some more testing. |
Adding |
Adding |
Adding It has created another problem though. I can no longer turn off the card. There are no errors in |
After installing optimus-manager, the ASUS x560-ud laptop with Nvidia GTX 1050 froze on every shutdown and after getting past the login screen.
None of the solutions posted above helped. P.S. In windows, a while ago after buying this device I had noticed that battery report utility issues warnings about the graphics card not supporting so-called "Link state power management". Update 02 Aug 2020: |
I have a ThinkPad P52 with the NVIDIA Quadro P2000 GPU. I tried Power control of the dGPU works fine. I can use nvidia-xrun. But when using xfce's display settings it locks up on any change (rearranging displays, disabling displays, mirror, etc.). Otherwise it works fine. This happens with hybrid mode, or dedicated mode in the BIOS. EDIT: After further investigation, this doesn't look like it's a bumblebee issue. More like Nvidia driver. |
On Lenovo ThinkPad P1 I had no problems switching to nVidia until recently. I solved it with: acpi_osi="!Windows 2015" |
[edit by @Lekensteyn]
This issue affects newer laptops (from about 2015-2016) with Skylake and GTX 9xxM/10xx cards.
A workaround exists for some laptops, see #764 (comment)
[/edit]
I'm having a weird issue, and I'm not sure what kind of debug information is necessary, but let me know what to give and I'll supply anything you need.
When I start my graphics (lxdm), I get a freeze (keyboard stops working, no response on monitor at all, even log files stop working), but I can work around this by enabling the graphics card before starting graphics.
System (installed with bumblebee-nvidia in debian testing repos):
Optirun --version:
My laptop seems to not work without optimus, the intel drivers work fine, but trying to run w/o the intel drivers (nvidia only) seems to result in a frozen screen. Using the workaround works perfectly for me, however.
Steps to Reproduce:
systemctl start bumblebeed
systemctl start lxdm
Workaround:
systemctl start bumblebeed
echo "ON" >/proc/acpi/bbswitch
systemctl start lxdm
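The workaround steps above can be wrapped so the display manager only starts once bbswitch reports the card as ON. This is a sketch under assumptions: the helper names `dgpu_state` and `ensure_dgpu_on` are made up for illustration, and `BBSWITCH` is overridable only so the logic can be tried against a scratch file.

```shell
#!/bin/sh
# Sketch of the workaround: make sure the discrete GPU is ON before the
# display manager starts. dgpu_state and ensure_dgpu_on are hypothetical.
BBSWITCH="${BBSWITCH:-/proc/acpi/bbswitch}"

dgpu_state() {
    # bbswitch reports a line such as "0000:01:00.0 ON"; keep the last field
    awk '{ print $NF }' "$BBSWITCH"
}

ensure_dgpu_on() {
    if [ "$(dgpu_state)" != "ON" ]; then
        echo ON > "$BBSWITCH" || return 1
    fi
    # Succeed only if the card really reports ON afterwards
    [ "$(dgpu_state)" = "ON" ]
}

# As root, after `systemctl start bumblebeed`:
#   ensure_dgpu_on && systemctl start lxdm
```

Chaining with `&&` means lxdm is not started if the card fails to come up, which avoids walking straight into the freeze described above.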
Unfortunately, any X11 log files don't seem to survive after my system freezes (they show everything completed successfully, probably from the previous successful boot). If you know any way of retrieving them I'd be happy to supply them though! (When the system freezes, even my shell history file gets corrupted).
I did have to make some changes to my config files to get things to work in my situation though, I'll post anything I remember changing below. Let me know if you need any more information, I am happy to supply it! Without bumblebee, my laptop would be unusable 👍
bumblebee.conf
xorg.conf.nvidia