View Issue Details

ID: 0001070
Project: channel: elrepo/el8
Category: kmod-nvidia
View Status: public
Last Update: 2021-01-30 14:30
Reporter: spekbukkem
Assigned To: pperry
Priority: normal
Severity: major
Reproducibility: random
Status: assigned
Resolution: open
Summary: 0001070: Black screen during GDM and manual startx in multi-user mode
Description: Starting with CentOS 8, I'm facing a very strange issue with the nvidia drivers in combination with one of my PCs. I already mentioned this issue in report 1064, but at that time I thought it was solved by removing the dkms package. In the end, that does not seem to have been the cause.

I will try to better describe my issue:
1. Almost always, when the PC is booted (literally cold), GDM does not start (only a black screen is displayed). After a few minutes, I'm able to switch to a virtual console. Most of the time, I'm also not able to start GNOME manually (using startx). It crashes with the error:
(EE) Screen(s) found, but none have a usable configuration
(EE) no screens found(EE)
(EE)
2. However, I'm eventually able to start GNOME by executing startx in an infinite while true... loop. The time before gnome-shell pops up varies, but it can take 1 to 10 minutes. Restarting GDM manually with systemctl also often works after this step. Often (but not always), a reboot of the PC then makes GDM start automatically without this workaround.
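
The retry loop from step 2 could be sketched roughly as follows. This is a minimal sketch, not the exact command that was used: the helper name `retry_until_success` and the delay argument are hypothetical, and it assumes startx exits with a non-zero status when the X server fails to find a usable screen. Unlike a plain `while true` loop, it stops once the command succeeds.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until it exits successfully.
# $1 = seconds to wait between attempts, remaining args = command to run.
retry_until_success() {
    delay="$1"; shift
    attempts=0
    until "$@"; do
        attempts=$((attempts + 1))
        echo "attempt $attempts failed, retrying in ${delay}s" >&2
        sleep "$delay"
    done
}

# Usage (assumption: startx returns non-zero when X fails to start):
# retry_until_success 5 startx
```

Logging each failed attempt to stderr makes it easier to see afterwards how long the GPU needed before initialisation succeeded.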

When the GDM/gnome-shell does not start, I also noticed the following errors in the kernel log:
NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1239)
NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
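
To check whether these messages appear on a given boot, the kernel log can be filtered for them. The following is a small sketch; the function name `nvrm_failures` is hypothetical, and the pattern only covers the two messages quoted above.

```shell
#!/bin/sh
# Hypothetical filter: keep only NVRM adapter-init failures from
# kernel log text supplied on stdin.
nvrm_failures() {
    grep -E 'NVRM: .*(RmInitAdapter|rm_init_adapter) failed'
}

# Usage: kernel messages of the current boot
# journalctl -k -b | nvrm_failures
# or, alternatively:
# dmesg | nvrm_failures
```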

Normally, I would suspect a hardware issue with the video card (it looks like it has to "warm up" before it is able to work). But I do not have this issue when using the nouveau driver. I also got the same issue with an older nvidia video card (one that uses the 390xx driver). I also reinstalled the PC a couple of times. It works perfectly until I install the nvidia driver. I suspect issues with the motherboard, but again: why does it work with the nouveau driver? (Unless the closed-source nvidia driver requires a call to the hardware that the open-source driver does not need.)

I tried a few kernel options, that could be related to the issue. Some that I still remember I've tried:
mem_encrypt=off (a known incompatibility in combination with AMD hardware, although I never had to use this option with CentOS 7)
acpi=off (because I also noticed some ACPI errors, but from what I've read these should not be related).
None of these solved the issue.
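
When testing boot options like these, it helps to confirm that the option really reached the running kernel. A minimal sketch, assuming the options are set with grubby on CentOS 8; the helper name `has_kernel_arg` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact argument $1 is present in
# the kernel command line file $2 (defaults to /proc/cmdline).
has_kernel_arg() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qx -- "$1"
}

# Usage: add the option for all installed kernels, reboot, then verify.
# grubby --update-kernel=ALL --args="mem_encrypt=off"
# has_kernel_arg mem_encrypt=off && echo "option is active"
```

Matching the whole argument exactly means a misspelled option is reported as absent rather than silently accepted.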

It looks like there is some delay before the hardware works, but I'm not sure whether this is hardware or software related. Does anyone have any idea what kernel options or driver options I could try to debug this issue?

Thank you in advance!
Tags: No tags attached.

Activities

pperry

2021-01-29 06:40

administrator   ~0007413

What happens when you uninstall our nvidia driver packages and install the nvidia driver .run package directly from nvidia? Does it work then, or do you experience the same issues? Please test the same driver version for a direct comparison.

spekbukkem

2021-01-29 06:58

reporter   ~0007414

Hi Pperry,

Thank you for your update: I will check. I will let you know the outcome.

spekbukkem

2021-01-29 13:58

reporter   ~0007415

I've tried the manual installation of the same version supplied by NVIDIA, but the issue is exactly the same. I've additionally tried the latest long-term and short-term supported drivers, both with the same result. I've also tried CentOS 7.9 again, on a spare SSD, ensured that it was fully up to date, and installed the latest elrepo kmod-nvidia. This still worked as expected. Next to that, I also tried a different distribution together with the latest nvidia driver available for that distro (to ensure it was not related to a newer kernel), and this also worked correctly: I could not reproduce the issue there either. So at least it does not seem to be a hardware defect.

I'm currently looking into the differences in kernel options between CentOS 7 and 8 (and any documentation about removed support). Hopefully I can find something that could be the cause and that is specific to my hardware. Any suggestions are welcome of course :)

pperry

2021-01-29 18:19

administrator   ~0007416

I am not able to offer a solution but do have some observations. I have come across one other such report (privately) so I don't think you're alone here, but equally we have plenty of other reports that the drivers work as expected so it's clearly not a widespread issue, but rather something fairly unique to your situation. I agree it's unlikely it's hardware related given the number of other scenarios in which the hardware works fine. Definitely something specific to el8 and the nvidia drivers.

My advice would be to file a bug report with NVIDIA and include the output from nvidia-bug-report.sh. Maybe base your report on an installation from their .run package to avoid complications. If you ever find out the cause, I'm eager to hear.

spekbukkem

2021-01-30 05:02

reporter   ~0007417

Thank you for your reply. I will first try to compile the latest CentOS 8 kernel using the config file of CentOS 7 to see if this makes any difference. I will also report this issue to NVIDIA, including whatever I learn from this attempt. I will certainly keep you updated. Thank you for your insights and feedback. This is certainly appreciated!

pperry

2021-01-30 07:08

administrator   ~0007418

Just to confirm, you were using the stock CentOS 8 kernel (not Stream kernel)?

spekbukkem

2021-01-30 09:36

reporter   ~0007419

Yes, I was indeed using the stock CentOS 8 kernel. In fact, I just installed the kernel-lts from elrepo together with a manual installation of the driver, and this also works perfectly. I have the idea it is something related to the AMD APU that I'm using. Debugging this issue myself is very time consuming unfortunately, so I will send the debug log to NVIDIA. In my case, it is just my daughter's PC, so it is not business critical to use the stock kernel. Otherwise, I hope that I (or someone reporting similar issues) will find the real cause. I will let you know if NVIDIA is able to find a possible cause for the issue....

spekbukkem

2021-01-30 14:30

reporter   ~0007420

One step closer: older stock kernel (4.18.0-80.1.2) also works perfectly. So something changed in one of the releases....
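
With one known-good stock kernel and a failing newer one, the remaining candidates could be narrowed down by bisecting over the releases in between. A minimal sketch, assuming GNU sort and awk are available; the function name `bisect_midpoint` is hypothetical, and the list of untested versions has to be supplied by the user.

```shell
#!/bin/sh
# Hypothetical helper: read kernel version strings from stdin, sort them
# in version order, and print the middle one to test next.
bisect_midpoint() {
    sort -V | awk '{ v[NR] = $0 } END { if (NR) print v[int((NR + 1) / 2)] }'
}

# Usage: feed it only the versions that have not been tested yet.
# bisect_midpoint < untested-kernels.txt
```

If the printed version works, the change is in a later release; if it fails, the change is in that release or an earlier one, so the list can be halved each round.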

Issue History

Date Modified Username Field Change
2021-01-29 05:59 spekbukkem New Issue
2021-01-29 05:59 spekbukkem Status new => assigned
2021-01-29 05:59 spekbukkem Assigned To => pperry
2021-01-29 06:40 pperry Note Added: 0007413
2021-01-29 06:58 spekbukkem Note Added: 0007414
2021-01-29 13:58 spekbukkem Note Added: 0007415
2021-01-29 18:19 pperry Note Added: 0007416
2021-01-30 05:02 spekbukkem Note Added: 0007417
2021-01-30 07:08 pperry Note Added: 0007418
2021-01-30 09:36 spekbukkem Note Added: 0007419
2021-01-30 14:30 spekbukkem Note Added: 0007420