View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001046 | channel: elrepo/el8 | kmod-nvidia | public | 2020-10-21 12:53 | 2020-10-21 18:10 |
Reporter | mroche | Assigned To | pperry | ||
Priority | normal | Severity | minor | Reproducibility | N/A |
Status | resolved | Resolution | fixed | ||
Summary | 0001046: Weak-modules/Dracut issue with 4.18.0-193.28.1 | ||||
Description | I'm not sure if this is a problem that needs to be rectified on the ELRepo side of things or if Red Hat introduced a breaking change with their tooling somehow. I upgraded to the latest 8.2 kernel last night (4.18.0-193.28.1) and the weak-modules system used by the kmod-nvidia package group is not having a good time with it: ==== kernel-core rpm posttrans script ==== # /usr/sbin/weak-modules --add-kernel 4.18.0-193.28.1.el8_2.x86_64 dracut-install: Failed to find module 'nvidia' dracut: FAILED: /usr/lib/dracut/dracut-install -D /var/tmp/dracut.JBh4Vv/initramfs -N nouveau --kerneldir /lib/modules/4.18.0-193.28.1.el8_2.x86_64/ -m nvidia ==== I tried manually symlinking the components into the weak-updates directory and running dracut, but it didn't take either. I'll try again, and maybe make some modifications to the dracut.conf.d file, adding in the other NVIDIA subcomponents just to see. I'm thinking of opening a bug report on bugzilla, but I wanted to give a heads up here on the issue first. Booting off the 4.18.0-193.19.1 kernel still works just fine. | ||||
Steps To Reproduce | Install 4.18.0-193.28.1.el8_2.x86_64 kernel Reboot | ||||
Additional Information | Kernel: 4.18.0-193.28.1.el8_2.x86_64 Driver: 450.80.02 (also tested 450.66) | ||||
Tags | No tags attached. | ||||
|
Acknowledged. It sounds like something has happened to break kABI compatibility for the latest el8 kernel and the weak links are not being created for the new kernel due to incompatibility. If so, we will need to rebuild against the new kernel. Let me see if I can fire up an el8 system and take a look. To help me out, please can you show the output from the following two commands: rpm -qa kernel\* | sort and find /lib/modules -name \*nvidia\* (preferably after you have removed any symlinks you manually created). |
|
OK, this is what I see:
$ rpm -q kmod-nvidia
kmod-nvidia-450.80.02-1.el8_2.elrepo.x86_64
with these kernels installed:
$ rpm -qa kernel | sort
kernel-4.18.0-193.14.3.el8_2.x86_64
kernel-4.18.0-193.19.1.el8_2.x86_64
kernel-4.18.0-193.28.1.el8_2.x86_64
kernel-4.18.0-193.el8.x86_64
but our nvidia modules do not weak link against the latest 4.18.0-193.28.1.el8_2.x86_64 kernel update:
$ find /lib/modules -name \*nvidia\*
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia.ko
I will rebuild against the latest kernel update and upload for you to test. |
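The check done by eye above (which kernel directories have no nvidia.ko under either extra/ or weak-updates/) could be scripted roughly as follows. This is only a sketch: the function name kernels_missing_nvidia is made up for illustration, and it assumes uncompressed nvidia.ko files in the same layout as the find output in this issue.

```shell
#!/bin/sh
# Hypothetical helper (not part of weak-modules or kmod-nvidia).
# Prints every kernel version directory under a modules root that has
# no nvidia.ko in either extra/nvidia/ or weak-updates/nvidia/.
# $1: modules root, defaults to /lib/modules
kernels_missing_nvidia() {
    modroot="${1:-/lib/modules}"
    for kdir in "$modroot"/*/; do
        # Skip the literal pattern if the glob matched nothing.
        [ -d "$kdir" ] || continue
        kver=$(basename "$kdir")
        # -e follows symlinks, so a dangling weak-updates link counts as missing.
        if [ ! -e "${kdir}extra/nvidia/nvidia.ko" ] && \
           [ ! -e "${kdir}weak-updates/nvidia/nvidia.ko" ]; then
            printf '%s\n' "$kver"
        fi
    done
}
```

Called with no argument it inspects /lib/modules. On the system described in this issue one would expect it to report the 4.18.0-193.28.1.el8_2.x86_64 kernel, along with any other leftover directories under /lib/modules that never had the driver.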
|
The following package has been uploaded to the elrepo-testing repository: kmod-nvidia-450.80.02-2.el8_2.elrepo.x86_64.rpm Please test with: dnf --enablerepo=elrepo\* update kmod-nvidia The new package should work with the new el8.2 kernel, but may not be backward compatible with older kernels (I've not tested backward compatibility). Please can you report back how you get on. Thanks, Phil |
|
Pre-package upgrade info:
# rpm -q kmod-nvidia
kmod-nvidia-450.80.02-1.el8_2.elrepo.x86_64
# rpm -qa kernel | sort
kernel-4.18.0-193.14.3.el8_2.x86_64
kernel-4.18.0-193.19.1.el8_2.x86_64
kernel-4.18.0-193.28.1.el8_2.x86_64
# find /lib/modules -name \*nvidia\*
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia.ko
I'll give the new package a spin and see what happens! |
|
Updated package worked a treat, thanks for that Phil! I agree, Red Hat did something to break the kABI that was in use by the NVIDIA module. As for what, I have no idea, it's beyond my realm of comprehension ;) Think it's worth filing a bug with Red Hat to report this break? There's a non-trivial number of changes between 193.19.1 and 193.28.1: https://access.redhat.com/downloads/content/rhel---8/x86_64/7416/kernel/4.18.0-193.28.1.el8_2/x86_64/fd431d51/package-changelog Mike |
|
Thanks for the feedback - I'll move the new package to the main repo now. No point filing a bug. Red Hat only guarantees to maintain the kABI of symbols on their whitelist. If a driver uses symbols that are not on the whitelist (and almost all 3rd party drivers will), AND one of those symbols happens to have a change that breaks the ABI, then breakage happens. Generally we only observe this at point releases and it is extremely uncommon to see it between point releases, and is almost certainly due to a required security fix. It's easy enough for us to rebuild against a newer kernel and fix, just not something we are used to doing regularly, and of course more importantly an inconvenience for yourself when the driver does not work with a new kernel. Again, thanks for reporting and for the prompt feedback. |
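For anyone wanting to see which symbols actually changed, the comparison described above can be approximated by hand. The sketch below is an illustration only, not the logic weak-modules itself runs: kabi_mismatches is a hypothetical helper name, and it assumes the module was built with symbol versioning so that modprobe --dump-modversions prints one "CRC symbol" pair per line, and that the kernel's symvers file lists "CRC symbol ..." per line.

```shell
#!/bin/sh
# Hypothetical helper: report symbols a module needs whose checksum is
# absent from, or different in, a kernel's exported symbol list.
# $1: text file holding `modprobe --dump-modversions <module.ko>` output
# $2: text file holding the kernel's uncompressed symvers
kabi_mismatches() {
    awk 'NR == FNR     { crc[$2] = $1; next }       # first file: kernel symvers
         !($2 in crc)  { print $2 " missing"; next } # symbol not exported at all
         crc[$2] != $1 { print $2 " changed" }       # exported, but CRC differs
        ' "$2" "$1"
}
```

A possible way to feed it, assuming the el8 layout where each kernel ships a compressed symvers file (the exact paths here are assumptions):

    modprobe --dump-modversions /lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia.ko > modvers.txt
    zcat /lib/modules/4.18.0-193.28.1.el8_2.x86_64/symvers.gz > symvers.txt
    kabi_mismatches modvers.txt symvers.txt

Any line of output would point at a symbol whose ABI differs between the kernel the module was built against and the new one.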
Date Modified | Username | Field | Change |
---|---|---|---|
2020-10-21 12:53 | mroche | New Issue | |
2020-10-21 12:53 | mroche | Status | new => assigned |
2020-10-21 12:53 | mroche | Assigned To | => pperry |
2020-10-21 14:04 | pperry | Note Added: 0007253 | |
2020-10-21 14:11 | pperry | Note Added: 0007254 | |
2020-10-21 14:35 | pperry | Note Added: 0007256 | |
2020-10-21 14:40 | mroche | Note Added: 0007257 | |
2020-10-21 15:52 | mroche | Note Added: 0007258 | |
2020-10-21 17:50 | pperry | Note Added: 0007259 | |
2020-10-21 18:10 | pperry | Status | assigned => resolved |
2020-10-21 18:10 | pperry | Resolution | open => fixed |