View Issue Details

ID: 0001046
Project: channel: elrepo/el8
Category: kmod-nvidia
View Status: public
Last Update: 2020-10-21 18:10
Reporter: mroche
Assigned To: pperry
Priority: normal
Severity: minor
Reproducibility: N/A
Status: resolved
Resolution: fixed
Summary: 0001046: Weak-modules/Dracut issue with 4.18.0-193.28.1

Description:
I'm not sure if this is a problem that needs to be rectified on the ELRepo side of things or if Red Hat introduced a breaking change with their tooling somehow. I upgraded to the latest 8.2 kernel last night (4.18.0-193.28.1) and the weak-modules system used by the kmod-nvidia package group is not having a good time with it:

==== kernel-core rpm posttrans script ====
# /usr/sbin/weak-modules --add-kernel 4.18.0-193.28.1.el8_2.x86_64
dracut-install: Failed to find module 'nvidia'
dracut: FAILED: /usr/lib/dracut/dracut-install -D /var/tmp/dracut.JBh4Vv/initramfs -N nouveau --kerneldir /lib/modules/4.18.0-193.28.1.el8_2.x86_64/ -m nvidia
====

I tried manually symlinking the components into the weak-updates directory and running dracut, but that didn't take either. I'll try again, and maybe make some modifications to the dracut.conf.d file, adding in the other NVIDIA subcomponents just to see. I'm thinking of opening a bug report on bugzilla, but I wanted to give a heads-up here on the issue first. Booting off the 4.18.0-193.19.1 kernel still works just fine.

Steps To Reproduce:
Install 4.18.0-193.28.1.el8_2.x86_64 kernel
Reboot

Additional Information:
Kernel: 4.18.0-193.28.1.el8_2.x86_64
Driver: 450.80.02 (also tested 450.66)

Tags: No tags attached.

Activities

pperry

2020-10-21 14:04

administrator   ~0007253

Acknowledged.

It sounds like something has happened to break kABI compatibility for the latest el8 kernel, so the weak links are not being created for the new kernel. If that is the case, we will need to rebuild against the new kernel.
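
For readers who want to confirm a kABI break of this kind themselves, one approach is to compare the symbol CRCs a module was built against with the CRCs the new kernel exports. The following is only a minimal sketch and was not part of the original exchange: `kabi_mismatches` is a hypothetical helper name, and the `/boot/symvers-<version>.gz` location and the identical `0x`-prefixed CRC format in both files are assumptions about a stock el8 system.

```shell
#!/bin/sh
# kabi_mismatches MODULE_CRCS KERNEL_SYMVERS   (hypothetical helper)
#
# MODULE_CRCS:    text file with "CRC symbol" lines, for example saved from
#                   modprobe --dump-modversions /path/to/nvidia.ko > nvidia.crc
# KERNEL_SYMVERS: uncompressed symvers file with "CRC symbol ..." lines,
#                 for example (path is an assumption):
#                   zcat /boot/symvers-<version>.gz > kernel.symvers
#
# Prints one line per symbol the module needs that the kernel either does
# not list ("missing") or lists with a different CRC ("crc-changed").
kabi_mismatches() {
    awk '
        # First file on the command line: the kernel symvers table.
        FILENAME == ARGV[1] { kernel_crc[$2] = $1; next }
        # Second file: the CRCs recorded in the module.
        !($2 in kernel_crc)  { print $2, "missing"; next }
        kernel_crc[$2] != $1 { print $2, "crc-changed" }
    ' "$2" "$1"
}
```

Empty output would suggest the module is compatible with that kernel as far as symbol versions go. Symbols provided by other out-of-tree modules (nvidia-uvm relying on nvidia.ko, say) may show up as "missing" because they are not in the kernel's own table, so those lines need to be read with care.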

Let me see if I can fire up an el8 system and take a look. To help me out, please can you show the output from the following...

rpm -qa kernel\* | sort
find /lib/modules -name \*nvidia\*

preferably after you have removed any symlinks you manually created.

pperry

2020-10-21 14:11

administrator   ~0007254

OK, this is what I see:

$ rpm -q kmod-nvidia
kmod-nvidia-450.80.02-1.el8_2.elrepo.x86_64

with these kernels installed:
$ rpm -qa kernel | sort
kernel-4.18.0-193.14.3.el8_2.x86_64
kernel-4.18.0-193.19.1.el8_2.x86_64
kernel-4.18.0-193.28.1.el8_2.x86_64
kernel-4.18.0-193.el8.x86_64

but our nvidia modules do not weak link against the latest 4.18.0-193.28.1.el8_2.x86_64 kernel update:
$ find /lib/modules -name \*nvidia\*
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia.ko

I will rebuild against the latest kernel update and upload for you to test.
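
The manual comparison above (installed kernels versus the `find` output) can be scripted. This is a rough sketch under stated assumptions rather than anything ELRepo ships: `kernels_missing_module` is a hypothetical helper, and it assumes that a real kernel module tree always contains a `kernel` subdirectory, which is used here to skip leftover directories that only hold `extra/` or `weak-updates/` content.

```shell
#!/bin/sh
# kernels_missing_module MODULES_ROOT MODULE_FILE   (hypothetical helper)
#
# Prints the name of every kernel directory under MODULES_ROOT that holds
# neither a copy of MODULE_FILE nor a weak-updates symlink to it.
kernels_missing_module() {
    root=$1
    module=$2
    for kdir in "$root"/*; do
        # Assumption: only directories with a "kernel" subtree belong to an
        # installed kernel; anything else is skipped.
        [ -d "$kdir/kernel" ] || continue
        # find matches regular files and symlinks by name without
        # following the links.
        if [ -z "$(find "$kdir" -name "$module" 2>/dev/null | head -n 1)" ]; then
            basename "$kdir"
        fi
    done
}

# Example call (left commented out so the block only defines the function):
# kernels_missing_module /lib/modules nvidia.ko
```

On the system shown above, a call like the commented example would be expected to report only the 4.18.0-193.28.1.el8_2.x86_64 kernel, since the other installed kernels have either the original module or weak-updates links.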

pperry

2020-10-21 14:35

administrator   ~0007256

The following package has been uploaded to the elrepo-testing repository:

kmod-nvidia-450.80.02-2.el8_2.elrepo.x86_64.rpm

Please test with:

dnf --enablerepo=elrepo\* update kmod-nvidia

The new package should work with the new el8.2 kernel, but may not be backward compatible with older kernels (I've not tested backward compatibility).

Please can you report back how you get on.

Thanks,

Phil

mroche

2020-10-21 14:40

reporter   ~0007257

Pre-package upgrade info:

# rpm -q kmod-nvidia
kmod-nvidia-450.80.02-1.el8_2.elrepo.x86_64

# rpm -qa kernel | sort
kernel-4.18.0-193.14.3.el8_2.x86_64
kernel-4.18.0-193.19.1.el8_2.x86_64
kernel-4.18.0-193.28.1.el8_2.x86_64

# find /lib/modules -name \*nvidia\*
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.14.3.el8_2.x86_64/weak-updates/nvidia/nvidia.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.el8.x86_64/extra/nvidia/nvidia.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-uvm.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-drm.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia-modeset.ko
/lib/modules/4.18.0-193.19.1.el8_2.x86_64/weak-updates/nvidia/nvidia.ko

I'll give the new package a spin and see what happens!

mroche

2020-10-21 15:52

reporter   ~0007258

Updated package worked a treat, thanks for that Phil!

I agree, Red Hat did something to break the kABI that was in use by the NVIDIA module. As for what, I have no idea, it's beyond my realm of comprehension ;) Think it's worth filing a bug with Red Hat to report this break?

There's a non-trivial number of changes between 193.19.1 and 193.28.1: https://access.redhat.com/downloads/content/rhel---8/x86_64/7416/kernel/4.18.0-193.28.1.el8_2/x86_64/fd431d51/package-changelog

Mike

pperry

2020-10-21 17:50

administrator   ~0007259

Thanks for the feedback - I'll move the new package to the main repo now.

No point filing a bug. Red Hat only guarantees to maintain the kABI of symbols on their whitelist. If a driver uses symbols that are not on the whitelist (and almost all 3rd party drivers will), AND one of those symbols happens to have a change that breaks the ABI, then breakage happens. Generally we only observe this at point releases; it is extremely uncommon to see it between point releases, and when it does happen it is almost certainly due to a required security fix. It's easy enough for us to rebuild against a newer kernel to fix it. It is just not something we are used to doing regularly and, more importantly, it is an inconvenience for you when the driver does not work with a new kernel.
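
To see how exposed a given driver is to this, the symbols it imports can be checked against the whitelist. The sketch below is illustrative only: `symbols_off_whitelist` is a hypothetical helper, and the whitelist layout it expects (bracketed section headers followed by one symbol name per line) as well as the idea that the file comes from the kernel-abi-whitelists package are assumptions that may not hold on every release.

```shell
#!/bin/sh
# symbols_off_whitelist MODULE_CRCS WHITELIST   (hypothetical helper)
#
# MODULE_CRCS: text file with "CRC symbol" lines, for example saved from
#                modprobe --dump-modversions /path/to/nvidia.ko > nvidia.crc
# WHITELIST:   kABI whitelist file; assumed format is "[section]" header
#              lines plus one (possibly indented) symbol name per line.
#
# Prints every symbol the module imports that the whitelist does not list,
# i.e. the symbols whose ABI is not guaranteed to stay stable.
symbols_off_whitelist() {
    awk '
        # First file on the command line: the whitelist.
        FILENAME == ARGV[1] {
            if (NF > 0 && $1 !~ /^\[/) listed[$1] = 1
            next
        }
        # Second file: symbols recorded in the module.
        !($2 in listed) { print $2 }
    ' "$2" "$1"
}
```

Any symbol this prints is one where a change in a kernel update could require a rebuild of the driver, as happened here.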

Again, thanks for reporting and for the prompt feedback.

Issue History

Date Modified Username Field Change
2020-10-21 12:53 mroche New Issue
2020-10-21 12:53 mroche Status new => assigned
2020-10-21 12:53 mroche Assigned To => pperry
2020-10-21 14:04 pperry Note Added: 0007253
2020-10-21 14:11 pperry Note Added: 0007254
2020-10-21 14:35 pperry Note Added: 0007256
2020-10-21 14:40 mroche Note Added: 0007257
2020-10-21 15:52 mroche Note Added: 0007258
2020-10-21 17:50 pperry Note Added: 0007259
2020-10-21 18:10 pperry Status assigned => resolved
2020-10-21 18:10 pperry Resolution open => fixed