View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001131 | channel: elrepo/el8 | kmod-nvidia | public | 2021-09-09 07:08 | 2021-09-21 06:05 |
Reporter | ohw0571 | Assigned To | pperry | ||
Priority | high | Severity | major | Reproducibility | always |
Status | assigned | Resolution | open | ||
Summary | 0001131: kmod-nvidia fails to load with updated RHCK in Oracle Linux | ||||
Description | Symptoms are as in 0001125; output of "dnf upgrade": dracut-install: Failed to find module 'nvidia' dracut: FAILED: /usr/lib/dracut/dracut-install -D /var/tmp/dracut.Fx46k0/initramfs -N nouveau --kerneldir /lib/modules/4.18.0-305.17.1.el8_4.x86_64/ -m nvidia The kernel is RHCK and should thus correspond to the kernel in other EL8 clones. | ||||
Steps To Reproduce | Update kernel from 4.18.0-305.12.1.el8_4.x86_64 to 4.18.0-305.17.1.el8_4.x86_64; Observe messages; new kernel does not boot into GUI. | ||||
Additional Information | Completely removing and re-installing kmod-nvidia (and its dependencies) appears to solve the issue, but this should not be necessary. | ||||
Tags | No tags attached. | ||||
|
You will need to provide more information to help troubleshoot. If it happens again, please provide full copy and paste from the 'dnf upgrade' process as well as a full copy of the output to /var/log/messages for the relevant time period. |
|
I don't have an el8 install on which I can test, so I'd appreciate feedback from anyone who can confirm the issue or offer a "works for me". |
|
As part of your testing, are you able to test on the equivalent RHEL or CentOS kernel to confirm it's not an issue specific to the Oracle kernel? The CentOS kernel is freely available on public mirror sites, and the RHEL kernel can be downloaded from Red Hat if you have a RHEL subscription. Thanks. |
|
As outlined above, the kmod from elrepo is not per se incompatible with the 4.18.0-305.17.1 kernel from Oracle, because the module is successfully integrated upon re-installing the driver. The failure to "find" the module while updating the kernel seems to indicate that the files are not automatically copied into the modules directory ... What is the actual trigger here? |
|
Thanks for the feedback. Please can you provide output from: 1. rpm -qa kernel\* | sort 2. find /lib/modules -name nvidia\* The nvidia module is installed to: /lib/modules/{kernel_version}/extra/nvidia/nvidia.ko where {kernel_version} is the kernel version the module is built against (in this case 4.18.0-305.el8.x86_64). nvidia modules should also show up in /lib/modules/{kernel_version}/weak-updates/nvidia/ for compatible kernels where the module 'weak-links'. The weak linking is determined by weak-modules, called from the %post scriptlet. The error in the original post: dracut-install: Failed to find module 'nvidia' dracut: FAILED: /usr/lib/dracut/dracut-install -D /var/tmp/dracut.Fx46k0/initramfs -N nouveau --kerneldir /lib/modules/4.18.0-305.17.1.el8_4.x86_64/ -m nvidia is as it suggests: dracut couldn't find nvidia.ko under /lib/modules/4.18.0-305.17.1.el8_4.x86_64/, most likely because it isn't there. The output from 'find' above should confirm that for us. Why wouldn't it be there? Either because that kernel is not kABI compatible and hence the module does not weak link, or because weak-modules has not been run successfully, or because there is a bug in weak-modules. |
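The layout described in the note above (module files in extra/nvidia/ for the kernel the kmod was built against, and in weak-updates/nvidia/ for compatible kernels) can be checked per kernel with a small helper. This is only a sketch: the function name module_report and its arguments are made up for illustration, and it assumes the module file is an uncompressed nvidia.ko as in the paths quoted above.

```shell
# Hypothetical helper (not part of kmod-nvidia or weak-modules).
# For every kernel directory under a modules root, report where the
# module file was found: "extra", "weak-updates", or "missing".
# Usage: module_report [modules_root] [module_file]
module_report() {
    root="${1:-/lib/modules}"   # defaults to the standard modules tree
    mod="${2:-nvidia.ko}"       # assumed uncompressed module file name
    for kdir in "$root"/*/; do
        [ -d "$kdir" ] || continue          # skip if the glob matched nothing
        kver=$(basename "$kdir")
        if [ -e "$kdir/extra/nvidia/$mod" ]; then
            echo "$kver extra"
        elif [ -e "$kdir/weak-updates/nvidia/$mod" ]; then
            echo "$kver weak-updates"
        else
            echo "$kver missing"
        fi
    done
}
```

Calling module_report with no arguments inspects /lib/modules. A kernel reported as "missing" is one for which dracut would be expected to fail with "Failed to find module 'nvidia'".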
|
Great, thank you. So the nvidia module has not weak linked against the 4.18.0-305.17.1.el8_4 kernel. We need to try to establish why not. Am I correct in understanding that if you reinstall kmod-nvidia, it works and now shows up on the 'find'? |
|
In my case (on OL8) it has worked exactly like this. |
|
As far as I can see, the problem is NOT solved with the 4.18.0-305.19.1 kernel: the messages are analogous to those seen before, and the /lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates directory is empty. |
|
Btw, the same error occurs for the Oracle kernel (5.4.17-2102.205.7.2), which is installed in parallel to the Red Hat kernel; in this case the message is probably expected since the newer kernel is not supported by the elrepo module. Is there a way to prevent the system from even trying to integrate the module into an incompatible kernel? |
|
I am unable to reproduce this bug on a genuine RHEL8 system. $ rpm -qa kernel kernel-4.18.0-305.19.1.el8_4.x86_64 kernel-4.18.0-240.el8.x86_64 kernel-4.18.0-305.17.1.el8_4.x86_64 kernel-4.18.0-305.el8.x86_64 $ rpm -qa | grep nvidia kmod-nvidia-470.63.01-1.el8_4.elrepo.x86_64 nvidia-x11-drv-libs-470.63.01-1.el8_4.elrepo.x86_64 nvidia-x11-drv-470.63.01-1.el8_4.elrepo.x86_64 $ find /lib/modules -name nvidia\* /lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia /lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia-uvm.ko /lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia-modeset.ko /lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia-drm.ko /lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia.ko /lib/modules/4.18.0-305.17.1.el8_4.x86_64/weak-updates/nvidia/nvidia-peermem.ko /lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia /lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia-uvm.ko /lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia-modeset.ko /lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia-drm.ko /lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia.ko /lib/modules/4.18.0-305.19.1.el8_4.x86_64/weak-updates/nvidia/nvidia-peermem.ko /lib/modules/4.18.0-305.el8.x86_64/extra/nvidia /lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia-uvm.ko /lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia-modeset.ko /lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia-drm.ko /lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia.ko /lib/modules/4.18.0-305.el8.x86_64/extra/nvidia/nvidia-peermem.ko If I uninstall a kernel and reinstall it to simulate updating a kernel, I see no issues. Please can you test to see if you can reproduce the issue on RHEL8. If you can give me a reliable reproducer on RHEL8, we can investigate further. Otherwise it looks like it may be something specific to Oracle Linux. |
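To compare a working system with a failing one, it may help to see what each entry under weak-updates/nvidia/ actually resolves to. The sketch below assumes that weak-modules creates these entries as symbolic links into the extra/ directory of the kernel the module was built against; the function name weak_link_targets is hypothetical and the helper only reads the filesystem.

```shell
# Hypothetical helper: list each *.ko entry in a kernel's
# weak-updates/nvidia directory with the file it resolves to.
# Usage: weak_link_targets /lib/modules/<kernel_version>
weak_link_targets() {
    dir="$1/weak-updates/nvidia"
    for f in "$dir"/*.ko; do
        # Skip the literal pattern when nothing matched, but keep
        # dangling symlinks (-e is false for them, -L is true).
        [ -e "$f" ] || [ -L "$f" ] || continue
        name=$(basename "$f")
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "$name -> BROKEN"              # link target no longer exists
        elif [ -L "$f" ]; then
            echo "$name -> $(readlink -f "$f")" # canonical path of the target
        else
            echo "$name (regular file)"
        fi
    done
}
```

For example, weak_link_targets /lib/modules/4.18.0-305.19.1.el8_4.x86_64 prints nothing when the weak-updates directory is empty, which matches the symptom reported for Oracle Linux earlier in this issue.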
|
@kademlia I hope you don't mind - I'm going to delete your posts from this bug as your issue is unrelated and just adds confusion to the case here. If you still have an issue, please feel free to open a separate new bug and we will be more than happy to look at it. |
Date Modified | Username | Field | Change |
---|---|---|---|
2021-09-09 07:08 | ohw0571 | New Issue | |
2021-09-09 07:08 | ohw0571 | Status | new => assigned |
2021-09-09 07:08 | ohw0571 | Assigned To | => pperry |
2021-09-09 09:50 | burakkucat | Reproducibility | have not tried => always |
2021-09-09 13:19 | pperry | Note Added: 0007814 | |
2021-09-09 13:20 | pperry | Note Added: 0007815 | |
2021-09-16 03:19 | pperry | Note Added: 0007829 | |
2021-09-16 05:32 | ohw0571 | Note Added: 0007830 | |
2021-09-16 07:35 | pperry | Note Added: 0007832 | |
2021-09-16 08:58 | pperry | Note Added: 0007834 | |
2021-09-16 09:31 | ohw0571 | Note Added: 0007835 | |
2021-09-21 05:09 | ohw0571 | Note Added: 0007864 | |
2021-09-21 05:21 | ohw0571 | Note Added: 0007865 | |
2021-09-21 05:55 | pperry | Note Added: 0007867 | |
2021-09-21 06:05 | pperry | Note Added: 0007868 |