View Issue Details

ID: 0000842
Project: channel: elrepo/el7
Category: kmod-nvidia
View Status: public
Last Update: 2018-05-04 15:08
Reporter: dafrye2
Assigned To: pperry
Priority: normal
Severity: crash
Reproducibility: always
Status: resolved
Resolution: no change required
Summary: 0000842: kmod-nvidia 390.48 results in seg fault in RHEL 7.5
Description: Installing kmod-nvidia (390.48) causes X to seg fault in RHEL 7.5. It looks like an NVIDIA issue.

There is a beta released (396.18) that seems to fix this issue and could be tested.

Alternatively, there is a patch to 390.48 that fixes the issue.
https://devtalk.nvidia.com/default/topic/1030082/linux/kernel-4-16-rc1-breaks-latest-drivers-unknown-symbol-swiotlb_map_sg_attrs-/post/5243252/#5243252
Tags: No tags attached.
Reported upstream

Activities

toracat

2018-04-25 09:47

administrator   ~0005781

Is that kmod-nvidia-390.48-2.el7_5.elrepo that you tested? If not, please try the .el7_5 one from the testing repo.

dafrye2

2018-04-25 10:33

reporter   ~0005782

I tried that driver as well. Kernel panic as soon as I reboot, same as with el7_4.

pperry

2018-04-25 13:10

administrator   ~0005783

I'm unable to replicate the problem and the latest package in testing (kmod-nvidia-390.48-2.el7_5.elrepo.x86_64) is working fine for me.

Further, I can't see where the above referenced symbol is required by kmod-nvidia:

rpm -q --requires kmod-nvidia | grep -i swiotlb

Furthermore, the symbol in question is exported by the kernel:

rpm -q --provides kernel-3.10.0-862.el7.x86_64 | grep swiotlb_map_sg_attrs
kernel(swiotlb_map_sg_attrs) = 0xd27d1dbf
kernel(xen_swiotlb_map_sg_attrs) = 0x85291cd7

so I think it unlikely the thread you linked above is related to your issue.
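
Note that the plain grep above also matches xen_swiotlb_map_sg_attrs. A minimal sketch of a stricter check, assuming the `kernel(symbol) = checksum` line format shown in the output above; the function name has_kernel_symbol is hypothetical:

```shell
#!/bin/sh
# has_kernel_symbol SYMBOL
# Reads the output of `rpm -q --provides <kernel package>` on stdin and
# succeeds only if exactly this symbol is listed. Matching on "kernel(NAME) "
# keeps xen_swiotlb_map_sg_attrs from satisfying a query for
# swiotlb_map_sg_attrs.
has_kernel_symbol() {
    grep -q -F -- "kernel($1) "
}
```

It could be used as `rpm -q --provides kernel-3.10.0-862.el7.x86_64 | has_kernel_symbol swiotlb_map_sg_attrs && echo exported`.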

dafrye2

2018-04-25 13:38

reporter   ~0005785

I am able to replicate it on multiple computers. Install default RHEL 7.5 from the ISO. Update it fully. No other changes are made to the computer.

Install kmod-nvidia 390.48 and it immediately kernel panics on reboot and will not complete the boot process.

This happened on existing machines running kmod-nvidia 390.48 as soon as they were upgraded to 7.5 as well.

The only other thing it might be is the nvidia-x11-drv that also installs. Maybe there is an issue there? I am running GNOME3. Both versions (elrepo and elrepo testing) use nvidia-x11-drv from elrepo.

It may not be the symbols issue, but that seemed to align with my issue fairly well; figured it might provide a jumping off point.

pperry

2018-04-25 19:19

administrator   ~0005786

Last edited: 2018-04-26 00:24

Have you tried the nvidia installer to establish if it's a problem with the drivers or a problem with our packaging of the drivers?

Please show:

rpm -qa kernel 'kmod*'
find /lib/modules -name 'nvidia*'
nvidia-detect -v

Thanks

dafrye2

2018-04-26 07:46

reporter   ~0005791

I am fairly certain it is an nvidia issue. I brought it up here for visibility to potentially get the beta driver repackaged in elrepo-testing. You all do good work and I prefer to use elrepo versus direct from nvidia.

As far as your request, what set up would you like for that information? Meaning, what version, if any, of kmod-nvidia do you want installed? Or do you want that information from a vanilla machine with the nvidia direct drivers installed (and if so, 390.48 or beta 396.18)?

pperry

2018-04-26 15:07

administrator   ~0005792

OK, thanks.

We don't normally build beta drivers for RHEL. If you want to test the beta driver to see if it fixes your issue, please feel free to try the nvidia installer.

Regarding the information above, I'm just looking for any clues as to why it's failing, so please collect it from a RHEL 7.5 box with the latest elrepo drivers installed from the testing repository (kmod-nvidia-390.48-2.el7_5.elrepo.x86_64, nvidia-x11-drv-390.48-1.el7_4.elrepo.x86_64).

I've updated from 7.4 to 7.5 and things are working as expected. Other users have reported similar results, so it's definitely something specific to your systems. I don't have hardware available to test a fresh install of 7.5. Just looking for clues at this point as I'm unable to replicate and have no idea what your issue may be.

dafrye2

2018-04-27 10:11

reporter   ~0005795

RHEL 7.5 out of the box, fully updated from RHN.

Switchable Graphics enabled (disabling switchable graphics causes kernel panic when the computer locks or restarts)

Here is the information you requested:

kernel-3.10.0-862.el7.x86_64
kmod-20-21.el7.x86_64
kmod-nvidia-390.48-2.el7_5.elrepo.x86_64
kmod-kvdo-6.1.0.153-15.el7.x86_64
kmod-libs-20-21.el7.x86_64

/lib/modules/3.10.0-862.el7.x86_64/extra/nvidia
/lib/modules/3.10.0-862.el7.x86_64/extra/nvidia/nvidia-drm.ko
/lib/modules/3.10.0-862.el7.x86_64/extra/nvidia/nvidia-modeset.ko
/lib/modules/3.10.0-862.el7.x86_64/extra/nvidia/nvidia-uvm.ko
/lib/modules/3.10.0-862.el7.x86_64/extra/nvidia/nvidia.ko

Probing for supported NVIDIA devices...
[10de:1bb7] NVIDIA Corporation GP104GLM [Quadro P4000 Mobile]
This device requires the current 390.48 NVIDIA driver kmod-nvidia
[8086:591b] Intel Corporation device 591b
An Intel display controller was also detected


Here is some additional information:

X crashes once I install the nvidia drivers via elrepo. If I grep "error" in /var/log/messages, I get the following results:

org.a11y.atspi.Registry: XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
journal: abrt: Fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
gdm: GdmLocalDisplayFactory: maximum number of X display failures reached. check X server log for errors.

/var/log/Xorg.0.log ends with "server terminated with error (1). closing log file"
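
The lines before that final message usually say more about why the server gave up. A minimal sketch for pulling them out, assuming the usual Xorg log convention of marking errors with "(EE)" after an optional bracketed timestamp; the function name xorg_errors is hypothetical:

```shell
#!/bin/sh
# xorg_errors LOGFILE
# Prints only the lines the X server itself flagged as errors. Anchoring
# "(EE)" at the start of the line (after an optional "[  12.345] "
# timestamp) skips the marker legend near the top of the log, which also
# contains the text "(EE)".
xorg_errors() {
    grep -E -- '^(\[[ 0-9.]+\] )?\(EE\)' "$1"
}
```

For example: `xorg_errors /var/log/Xorg.0.log`.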

After removing kmod-nvidia, I installed 390.48 directly from nvidia (letting nvidia manage X settings).

X still does not load with 390.48 directly from nvidia, so it is definitely an issue with the nvidia drivers themselves, not elrepo builds.

As a further test, I installed the beta drivers after completely purging 390.48

X still crashes.


The only thing the two NVIDIA drivers have in common is this line in /var/log/messages: "kernel: ACPI: Marking method CLPS as Serialized because of AE_ALREADY_EXISTS error".



If I disable Switchable Graphics, then both NVIDIA drivers (390.48 and 396.18) boot to X successfully.

Kmod-nvidia 390.48-2.el7-5 produces a kernel panic as soon as it is installed and I reboot the system. Switchable Graphics is still turned off, so that isn't the issue. If I hard reset, X will load, and things seem to work fine.

Switchable Graphics must be the cause. With it enabled, X doesn't load using any driver. I don't want to install bumblebee as the users need to use the nvidia card exclusively...

I hope that helps some...if there is even an issue. Literally nothing changed on the machines except the update to RHEL 7.5, after which X started to crash.

dafrye2

2018-04-27 11:28

reporter   ~0005797

Also, to add to everything: as soon as I uninstall Kmod-nvidia 390.48-2.el7-5, I get a kernel panic every time the screen goes idle. It isn't just X crashing; the whole machine panics.

If I re-enable Switchable Graphics, I don't get the kernel panic issue.

I also noticed, with the 862 kernel, if I try to install Kmod-nvidia 390.48-2.el7-4, it now says the 693 kernel is a dependency. I hadn't noticed that before.

With Switchable Graphics enabled, I reinstalled Kmod-nvidia 390.48-2.el7-5 and X does not load at all.

pperry

2018-04-27 16:53

administrator   ~0005798

I agree this is most likely related to the fact that you have switchable graphics (Optimus) hardware. To me this is a support issue about how best to configure your system. Unfortunately I have zero experience with Optimus hardware, so I would recommend you post to the mailing lists (elrepo and/or CentOS), where you will hopefully get more coverage and are more likely to receive useful assistance. You probably have a useful target audience of zero here, given I'm unable to offer any meaningful help.

dafrye2

2018-05-03 10:18

reporter   ~0005823

It is definitely an Optimus/Switchable graphics issue, like we have determined.

I have a computer where the BIOS does not allow turning off switchable graphics and it will not boot RHEL 7.5 with the 390.48 drivers (they require CUDA, so I need bumblebee/nvidia drivers, etc).

I did notice that the current official release, 396.24, is available. How long does it usually take for current official releases to get into elrepo testing at least? I'd be game to test it and report whether it works with switchable graphics (basically the same test I did above).

Again, thank you for all that you all do.

toracat

2018-05-03 11:13

administrator   ~0005824

Last edited: 2018-05-03 11:26

Regarding the Optimus graphics, you may want to use bumblebee:

http://elrepo.org/tiki/bumblebee

Please note that it has not been tested for 7.5.

dafrye2

2018-05-03 11:31

reporter   ~0005825

I do have bumblebee installed on this particular system. But with 7.5 and Kmod-nvidia 390.48-2.el7-5, I get no gui in X.

It goes straight to text based login, which won't work in my particular situation.

pperry

2018-05-03 12:43

administrator   ~0005827

With regard to your question about 396.24, our policy is that we only release the long lived branch releases:

http://www.nvidia.com/object/unix.html

I update them as soon as I can after release, normally within a day or so. I have no idea when NVIDIA next plans to update the long lived branch.

I would only consider packaging the short lived branch if you have evidence it fixes a specific issue. I don't see anything in the changelog to suggest your issue has been fixed in the 396.24 release. Please feel free to try the drivers from NVIDIA, and if you are able to confirm they do fix your issue then I would consider packaging them, but only for release in the testing repository.

toracat

2018-05-03 12:45

administrator   ~0005828

Could you check to see if you have kmod-bbswitch-0.8-5.el7_5.elrepo.x86_64.rpm installed? And if the module can be loaded without an error?

dafrye2

2018-05-03 13:45

reporter   ~0005829

yes, kmod-bbswitch is installed:

Installed Packages
Name : kmod-bbswitch
Arch : x86_64
Version : 0.8
Release : 5.el7_5.elrepo
Size : 47 k
Repo : installed
From repo : elrepo
Summary : bbswitch kernel module(s)
URL : https://github.com/Bumblebee-Project/bbswitch
License : GPLv2
Description : This package provides the bbswitch kernel module(s) built
            : for the Linux kernel using the x86_64 family of processors.

Is this what you are looking for when you say 'module'?

bumblebeed
modprobe: FATAL: Module bbswitch not found.
[ 1472.110079] [ERROR]Module bbswitch could not be loaded (timeout?)
[ 1472.110105] [WARN]No switching method available. The dedicated card will always be on.
[ 1472.113417] [ERROR]Cannot open or write pidfile /var/run/bumblebeed.pid.

toracat

2018-05-03 15:56

administrator   ~0005830

Can you try:

# modprobe bbswitch

and check the output from dmesg?

dafrye2

2018-05-04 07:03

reporter   ~0005831

modprobe bbswitch returns the following:

modprobe: FATAL: Module bbswitch not found.

dmesg doesn't mention anything that I can glean.

toracat

2018-05-04 07:21

administrator   ~0005832

"bbswitch not found" is very strange if kmod-bbswitch has been correctly installed. Please show us the output from:

uname -mr

and

ls -l `find /lib/modules -name bbswitch.ko`

dafrye2

2018-05-04 07:27

reporter   ~0005833

uname -mr

3.10.0-862.el7.x86_64 x86_64

ls -l `find /lib/modules -name bbswitch.ko`

-rw-r--r--. 1 root root 2236 Apr 11 19:17 /lib/modules/3.10.0-862.el7.x86_64/extra/bbswitch/bbswitch.ko


I do understand that bbswitch hasn't been tested with 7.5, but I am hoping we can figure out what changed to cause it to not work.

toracat

2018-05-04 07:42

administrator   ~0005834

I don't have proper hardware (Optimus) to test this, but on my system the modprobe command at least tries to insert the bbswitch module before failing. So your "not found" error is something I do not understand. My output is as follows:

$ sudo modprobe bbswitch
modprobe: ERROR: could not insert 'bbswitch': No such device

dmesg has the following:

[1274233.475345] bbswitch: loading out-of-tree module taints kernel.
[1274233.475460] bbswitch: module verification failed: signature and/or required key missing - tainting kernel
[1274233.475784] bbswitch: version 0.8
[1274233.475790] bbswitch: Found discrete VGA device 0000:00:02.0: \_SB_.PCI0.S2__
[1274233.475803] bbswitch: failed to evaluate ... (snip)

What if you specify the module file directly, like so:

insmod /lib/modules/3.10.0-862.el7.x86_64/extra/bbswitch/bbswitch.ko
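
Unlike insmod, modprobe resolves module names through the modules.dep index that depmod generates, so a .ko file that is on disk but missing from the index can produce "not found". That is only one possibility and is not confirmed for this system. A minimal sketch of the check, with a hypothetical function name in_modules_dep:

```shell
#!/bin/sh
# in_modules_dep MODULE DEPFILE
# Succeeds if DEPFILE (normally /lib/modules/<kernel release>/modules.dep)
# has an entry for MODULE. Entries look like "extra/bbswitch/bbswitch.ko:"
# and may carry a compression suffix such as ".xz" before the colon.
in_modules_dep() {
    grep -q -E -- "(^|/)$1\.ko(\.[a-z0-9]+)?:" "$2"
}
```

For example: `in_modules_dep bbswitch "/lib/modules/$(uname -r)/modules.dep" || echo "not indexed; try depmod -a"`.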

dafrye2

2018-05-04 08:06

reporter   ~0005835

could not insert module. File exists.

I do see the following in dmesg:

bbswitch: loading out-of-tree module taints kernel.
bbswitch: module verification failed: signature and/or required key missing - tainting kernel
bbswitch: version 0.8
bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.GFX0
bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.PEG0.PEGP
ACPI Warning: \_SB_.PCI0.PEG0._DSM: Argument #4 type mismatch - Found [buffer], ACPI requires [Package] (20130517/nsarguments-95)
bbswitch: detected an Optimus _DSM function
pci 0000:01:00.0: enabled device (006 -> 007)
bbswitch: disabling discrete graphics
ACPI Warning: \_SB_.PCI0.PEG0._DSM: Argument #4 type mismatch - Found [buffer], ACPI requires [Package] (20130517/nsarguments-95)
bbswitch: Successfully loaded. Discrete card 0000:01:00.0 is off
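
insmod reports "File exists" when a module with that name is already loaded, which fits the "Successfully loaded" line above. A minimal sketch for confirming that directly, assuming the standard /proc/modules layout (one loaded module per line, name in the first field); the function name module_loaded is hypothetical:

```shell
#!/bin/sh
# module_loaded MODULE [MODULES_FILE]
# Succeeds if MODULE appears as the first field of a line in MODULES_FILE,
# which defaults to /proc/modules. The optional second argument exists only
# so the check can be tried against a saved copy of that file.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

For example: `module_loaded bbswitch && echo "bbswitch is already loaded"`.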

toracat

2018-05-04 08:15

administrator   ~0005836

"bbswitch: Successfully loaded." is a good sign.

Things still do not work after that?

dafrye2

2018-05-04 08:27

reporter   ~0005837

Nope.

When I boot, it comes up to the CLI. If I log in and run 'startx', I get errors. It can't find a display driver to use with xorg. It has an intel VGA card as well.

At this point, to you, does it look like bbswitch is functioning correctly, but RHEL 7.5 and the intel VGA are not?

dafrye2

2018-05-04 09:18

reporter   ~0005838

Figured it out.

At some point in the upgrade to 7.5, gdm had gone missing.
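
A quick check for this kind of loss after an upgrade can be sketched as follows, assuming a list of installed package names (one per line) is fed in on stdin; the function name missing_packages and the choice of packages to check are illustrative only:

```shell
#!/bin/sh
# missing_packages NAME...
# Reads installed package names on stdin, one per line, and prints each
# NAME that is not among them. The match is on the whole line, so an
# installed "gdm-libs" does not count as "gdm".
missing_packages() {
    installed=$(cat)
    for pkg in "$@"; do
        printf '%s\n' "$installed" | grep -q -x -F -- "$pkg" || printf '%s\n' "$pkg"
    done
}
```

For example: `rpm -qa --qf '%{NAME}\n' | missing_packages gdm xorg-x11-server-Xorg`.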

toracat

2018-05-04 15:08

administrator   ~0005839

OK. Looks like there is no ELRepo bug here. Closing.

Issue History

Date Modified Username Field Change
2018-04-25 08:21 dafrye2 New Issue
2018-04-25 08:21 dafrye2 Status new => assigned
2018-04-25 08:21 dafrye2 Assigned To => pperry
2018-04-25 09:47 toracat Note Added: 0005781
2018-04-25 10:33 dafrye2 Note Added: 0005782
2018-04-25 13:10 pperry Note Added: 0005783
2018-04-25 13:38 dafrye2 Note Added: 0005785
2018-04-25 19:19 pperry Note Added: 0005786
2018-04-26 00:24 pperry Note Edited: 0005786
2018-04-26 07:46 dafrye2 Note Added: 0005791
2018-04-26 15:07 pperry Note Added: 0005792
2018-04-27 10:11 dafrye2 Note Added: 0005795
2018-04-27 11:28 dafrye2 Note Added: 0005797
2018-04-27 16:53 pperry Note Added: 0005798
2018-05-03 10:18 dafrye2 Note Added: 0005823
2018-05-03 11:13 toracat Note Added: 0005824
2018-05-03 11:26 burakkucat Note Edited: 0005824
2018-05-03 11:31 dafrye2 Note Added: 0005825
2018-05-03 12:43 pperry Note Added: 0005827
2018-05-03 12:45 toracat Note Added: 0005828
2018-05-03 13:45 dafrye2 Note Added: 0005829
2018-05-03 15:56 toracat Note Added: 0005830
2018-05-04 07:03 dafrye2 Note Added: 0005831
2018-05-04 07:21 toracat Note Added: 0005832
2018-05-04 07:27 dafrye2 Note Added: 0005833
2018-05-04 07:42 toracat Note Added: 0005834
2018-05-04 08:06 dafrye2 Note Added: 0005835
2018-05-04 08:15 toracat Note Added: 0005836
2018-05-04 08:27 dafrye2 Note Added: 0005837
2018-05-04 09:18 dafrye2 Note Added: 0005838
2018-05-04 15:08 toracat Note Added: 0005839
2018-05-04 15:08 toracat Status assigned => resolved
2018-05-04 15:08 toracat Resolution open => no change required