View Issue Details

ID: 0001131
Project: channel: elrepo/el8
Category: kmod-nvidia
View Status: public
Last Update: 2021-09-21 06:05
Reporter: ohw0571
Assigned To: pperry
Priority: high
Severity: major
Reproducibility: always
Status: assigned
Resolution: open
Summary: 0001131: kmod-nvidia fails to load with updated RHCK in Oracle Linux
Description: Symptoms are as in 0001125; output of "dnf upgrade":

dracut-install: Failed to find module 'nvidia'
dracut: FAILED: /usr/lib/dracut/dracut-install -D /var/tmp/dracut.Fx46k0/initramfs -N nouveau --kerneldir /lib/modules/4.18.0-305.17.1.el8_4.x86_64/ -m nvidia

The kernel is RHCK and should thus correspond to the kernel in other EL8 clones.
Steps To Reproduce: Update kernel from 4.18.0-305.12.1.el8_4.x86_64 to 4.18.0-305.17.1.el8_4.x86_64;
Observe messages; new kernel does not boot into GUI.
Additional Information: Completely removing and re-installing kmod-nvidia (and its dependencies) appears to solve the issue,
but this should not be necessary.
Tags: No tags attached.

Activities

pperry

2021-09-09 13:19

administrator   ~0007814

You will need to provide more information to help troubleshoot.
If it happens again, please provide full copy and paste from the 'dnf upgrade' process as well as a full copy of the output to /var/log/messages for the relevant time period.

pperry

2021-09-09 13:20

administrator   ~0007815

I don't have an el8 install on which I can test, so I'd appreciate feedback from anyone who can confirm the issue or offer a "works for me".

pperry

2021-09-16 03:19

administrator   ~0007829

As part of your testing, are you able to test on the equivalent RHEL or CentOS kernel to confirm it's not an issue specific to the Oracle kernel?

The CentOS kernel is freely available on public mirror sites, and the RHEL kernel can be downloaded from Red Hat if you have a RHEL subscription.

Thanks

ohw0571

2021-09-16 05:32

reporter   ~0007830

As outlined above, the kmod from elrepo is not per se incompatible with the 4.18.0-305.17.1 kernel from Oracle, because the module is successfully integrated upon re-installing the driver.
The failure to "find" the module while updating the kernel seems to indicate that the files are not automatically copied into the modules directory ... What is the actual trigger here?

pperry

2021-09-16 07:35

administrator   ~0007832

Thanks for the feedback.

Please can you provide output from:

1. rpm -qa kernel\* | sort
2. find /lib/modules -name nvidia\*

The nvidia module is installed to:

/lib/modules/{kernel_version}/extra/nvidia/nvidia.ko

where {kernel_version} is the kernel version the module is built against (in this case 4.18.0-305.el8.x86_64)

nvidia modules should also show up in /lib/modules/{kernel_version}/weak-updates/nvidia/ for compatible kernels where the module 'weak-links'.

The weak linking is determined by weak-modules, called from the %post scriptlet
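
The layout described above can be checked per kernel with a small helper. This is only a sketch: the function name kmod_status and its optional root-directory argument are made up for illustration, and it looks at nvidia.ko alone rather than the companion modules.

```shell
# kmod_status [MODROOT]
# For every kernel tree under MODROOT (default /lib/modules), report whether
# nvidia.ko is installed natively (extra/), weak-linked (weak-updates/),
# or not present at all.
kmod_status() {
    modroot="${1:-/lib/modules}"
    for kdir in "$modroot"/*/; do
        [ -d "$kdir" ] || continue          # skip the unexpanded glob if nothing matched
        kver=$(basename "$kdir")
        if [ -e "$kdir/extra/nvidia/nvidia.ko" ]; then
            echo "$kver: extra"
        elif [ -e "$kdir/weak-updates/nvidia/nvidia.ko" ]; then
            # -e follows symlinks, so a dangling weak link counts as missing
            echo "$kver: weak-updates"
        else
            echo "$kver: missing"
        fi
    done
}
```

Running kmod_status with no argument on an affected system should show "missing" for the kernel that dracut complained about.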

The error in the original post:

dracut-install: Failed to find module 'nvidia'
dracut: FAILED: /usr/lib/dracut/dracut-install -D /var/tmp/dracut.Fx46k0/initramfs -N nouveau --kerneldir /lib/modules/4.18.0-305.17.1.el8_4.x86_64/ -m nvidia

is as it suggests - dracut couldn't find /lib/modules/4.18.0-305.17.1.el8_4.x86_64/nvidia.ko, most likely because it isn't there. The output from 'find' above should confirm that for us.

Why wouldn't it be there? Either because that kernel is not kABI compatible, hence the module does not weak link, or because weak-modules has not run successfully, or because there is a bug in weak-modules.
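
One way to tell the first cause (kABI mismatch) apart from the others is to compare the symbol CRCs the module needs against those the target kernel exports. The sketch below assumes the module's requirements were dumped with `modprobe --dump-modversions` and that the kernel's exports come from an uncompressed copy of its symvers file (on EL8 typically /boot/symvers-{kernel_version}.gz). The function name abi_mismatches is hypothetical, and this is a simplification of whatever weak-modules does internally.

```shell
# abi_mismatches MODVERSIONS SYMVERS
# MODVERSIONS: lines of "CRC<TAB>symbol" as printed by modprobe --dump-modversions
# SYMVERS:     lines of "CRC<TAB>symbol<TAB>..." from the kernel's symvers file
# Prints every symbol the module needs that the kernel does not export
# or exports with a different CRC. No output means no mismatch was found.
abi_mismatches() {
    awk '
        FILENAME == ARGV[1] { crc[$2] = $1; next }   # first file: kernel exports
        !($2 in crc)        { print $2 ": not exported"; next }
        crc[$2] != $1       { print $2 ": CRC differs" }
    ' "$2" "$1"
}

# Assumed usage:
#   modprobe --dump-modversions /lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia.ko > /tmp/mod.ver
#   zcat /boot/symvers-4.18.0-305.17.1.el8_4.x86_64.gz > /tmp/kernel.ver
#   abi_mismatches /tmp/mod.ver /tmp/kernel.ver
```

Symbols provided by another out-of-tree module (for example, those nvidia-uvm.ko takes from nvidia.ko) will be listed as "not exported" here, so the check is most meaningful for nvidia.ko itself.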

pperry

2021-09-16 08:58

administrator   ~0007834

Great, thank you. So the nvidia module has not weak linked against the 4.18.0-305.17.1.el8_4 kernel. We need to try to establish why not.

Am I correct in understanding that if you reinstall kmod-nvidia, it works and now shows up on the 'find'?
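
If reinstalling helps only because it re-runs the package's %post scriptlet, the same effect might be achievable by feeding the module paths to weak-modules by hand. The sketch below rests on that assumption: the helper name list_kmod_files is invented for illustration, and the exact contents of the elrepo scriptlet have not been checked here.

```shell
# list_kmod_files KVER [MODROOT]
# Print every .ko file installed under extra/nvidia for the build kernel KVER,
# one path per line, sorted. MODROOT defaults to /lib/modules.
list_kmod_files() {
    kver="$1"
    modroot="${2:-/lib/modules}"
    find "$modroot/$kver/extra/nvidia" -type f -name '*.ko' 2>/dev/null | sort
}

# Assumed usage (as root); weak-modules --add-modules reads module paths on stdin:
#   list_kmod_files 4.18.0-305.el8.x86_64 | /usr/sbin/weak-modules --add-modules
```

If the weak links appear after this, the modules are compatible and the open question is why weak-modules did not create the links during the kernel update.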

ohw0571

2021-09-16 09:31

reporter   ~0007835

In my case (on OL8) it has worked exactly like this.

ohw0571

2021-09-21 05:09

reporter   ~0007864

As far as I can see, the problem is NOT solved with the 4.18.0-305.19.1 kernel:
analogous messages as before; the /lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates directory is empty.

ohw0571

2021-09-21 05:21

reporter   ~0007865

Btw, the same error occurs for the Oracle kernel (5.4.17-2102.205.7.2) which is installed in parallel to the Red Hat kernel;
in this case the message is probably expected since the newer kernel is not supported by the elrepo module.
Is there a way to prevent the system from even trying to integrate the module into an incompatible kernel?

pperry

2021-09-21 05:55

administrator   ~0007867

I am unable to reproduce this bug on a genuine RHEL8 system.

$ rpm -qa kernel
kernel-4.18.0-305.19.1.el8_4.x86_64
kernel-4.18.0-240.el8.x86_64
kernel-4.18.0-305.17.1.el8_4.x86_64
kernel-4.18.0-305.el8.x86_64

$ rpm -qa | grep nvidia
kmod-nvidia-470.63.01-1.el8_4.elrepo.x86_64
nvidia-x11-drv-libs-470.63.01-1.el8_4.elrepo.x86_64
nvidia-x11-drv-470.63.01-1.el8_4.elrepo.x86_64

$ find /lib/modules -name nvidia\*
/lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia.ko
/lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia-peermem.ko
/lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia.ko
/lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia-peermem.ko
/lib/modules/4.18.0-305.el8.x86_64/extra/nvidia
/lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia.ko
/lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia-peermem.ko

If I uninstall a kernel and reinstall it to simulate updating a kernel, I see no issues.

Please can you test to see if you can reproduce the issue on RHEL8. If you can give me a reliable reproducer on RHEL8, we can investigate further. Otherwise it looks like it may be something specific to Oracle Linux.

pperry

2021-09-21 06:05

administrator   ~0007868

@kademlia I hope you don't mind - I'm going to delete your posts from this bug as your issue is unrelated and just adds confusion to the case here. If you still have an issue, please feel free to open a separate new bug and we will be more than happy to look at it.

Issue History

Date Modified Username Field Change
2021-09-09 07:08 ohw0571 New Issue
2021-09-09 07:08 ohw0571 Status new => assigned
2021-09-09 07:08 ohw0571 Assigned To => pperry
2021-09-09 09:50 burakkucat Reproducibility have not tried => always
2021-09-09 13:19 pperry Note Added: 0007814
2021-09-09 13:20 pperry Note Added: 0007815
2021-09-16 03:19 pperry Note Added: 0007829
2021-09-16 05:32 ohw0571 Note Added: 0007830
2021-09-16 07:35 pperry Note Added: 0007832
2021-09-16 08:58 pperry Note Added: 0007834
2021-09-16 09:31 ohw0571 Note Added: 0007835
2021-09-21 05:09 ohw0571 Note Added: 0007864
2021-09-21 05:21 ohw0571 Note Added: 0007865
2021-09-21 05:55 pperry Note Added: 0007867
2021-09-21 06:05 pperry Note Added: 0007868