View Issue Details

ID: 0001494
Project: channel: elrepo/el9
Category: kmod-ib_qib
View Status: public
Last Update: 2024-11-28 15:52
Reporter: nscfreny
Assigned To: toracat
Priority: normal
Severity: crash
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: 5.14.0-503.14.1.el9_5.x86_64
OS: Rocky Linux
OS Version: 9.5
Summary: 0001494: Kernel panic loading ib_qib with EL 9.5 kernel
Description: Kernel panic when loading ib_qib from kmod-ib_qib-1.11-12.el9_5.elrepo.x86_64.rpm with Rocky Linux 9.5 kernel.

See console log here: https://rpa.st/UW5Q

(Failed to attach as text file here? Or perhaps timeout?)
Steps To Reproduce: Install kmod-ib_qib, then modprobe ib_qib:

dnf install https://elrepo.org/linux/elrepo/el9/x86_64/RPMS/kmod-ib_qib-1.11-12.el9_5.elrepo.x86_64.rpm
modprobe ib_qib
Additional Information: Information about HCA:

lspci -vnn | grep -A1 "InfiniBand \["
02:00.0 InfiniBand [0c06]: QLogic Corp. IBA7322 QDR InfiniBand HCA [1077:7322] (rev 02)
        Subsystem: QLogic Corp. IBA7322 QDR InfiniBand HCA [1077:7322]
Tags: No tags attached.

Activities

nscfreny

2024-11-26 09:46

reporter   ~0010207

Works with Rocky 9.5 by downgrading kernel + kmod-ib_qib to latest 9.4 versions.

toracat

2024-11-26 10:46

administrator   ~0010208

Acknowledged.

toracat

2024-11-26 12:28

administrator   ~0010210

We cannot reproduce the issue, most likely because we do not have the proper hardware to test on.

Can you try installing the current kernel-ml package? That way, you'll be testing the latest code available from kernel.org.

$ sudo dnf --enablerepo=elrepo-kernel install kernel-ml

nscfreny

2024-11-26 14:31

reporter   ~0010211

We don't boot from disk but I did manage to kexec kernel-ml, and ib_qib seems to work.

# Starting point is a stripped compute node image with kernel-5.14.0-503.14.1.el9_5.x86_64 without kmod-ib_qib.
dnf install elrepo-release
dnf --enablerepo=elrepo-kernel install kernel-ml
kexec -l /boot/vmlinuz-6.12.1-1.el9.elrepo.x86_64 --append='ro root=/dev/sda1 console=ttyS1,57600' --initrd=/boot/initramfs-6.12.1-1.el9.elrepo.x86_64.img
kexec -e
# Wait...

uname -r
6.12.1-1.el9.elrepo.x86_64

ibstat
CA 'qib0'
        CA type: InfiniPath_QLE7340
        Number of ports: 1
        Firmware version:
        Hardware version: 2
        Node GUID: 0x00117500006f85e8
        System image GUID: 0x00117500006f85e8
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 10
                LMC: 0
                SM lid: 1
                Capability mask: 0x07610868
                Port GUID: 0x00117500006f85e8
                Link layer: InfiniBand

toracat

2024-11-26 21:54

administrator   ~0010212

Thank you for running the test. It's good to know that the in-kernel ib_qib module in kernel-ml works fine.

So,
EL 9.4 : works
EL 9.5 : does not work
kernel-ml 6.12.1: works

It looks as if the source code for the ib_qib driver in EL 9.5 was bumped to newer code that may not work in the current environment. I have rebuilt the driver using the source from kernel 5.14.21 and released it to the elrepo-testing repository.

kmod-ib_qib-1.11-13.el9_5.elrepo.x86_64.rpm

You can install it by running:

$ sudo dnf --enablerepo=elrepo-testing install kmod-ib_qib

Could you give it a try?

nscfreny

2024-11-27 01:28

reporter   ~0010213

Thanks, still crashing though.

# Just to make sure it's the right one before running modprobe:
uname -r
5.14.0-503.14.1.el9_5.x86_64
find /lib/modules -name "*qib*"
/lib/modules/5.14.0-503.14.1.el9_5.x86_64/weak-updates/ib_qib
/lib/modules/5.14.0-503.14.1.el9_5.x86_64/weak-updates/ib_qib/ib_qib.ko
/lib/modules/5.14.0-503.11.1.el9_5.x86_64/extra/ib_qib
/lib/modules/5.14.0-503.11.1.el9_5.x86_64/extra/ib_qib/ib_qib.ko
rpm -qf /lib/modules/5.14.0-503.11.1.el9_5.x86_64/extra/ib_qib/ib_qib.ko
kmod-ib_qib-1.11-13.el9_5.elrepo.x86_64
dnf list installed kmod-ib_qib
Installed Packages
kmod-ib_qib.x86_64 1.11-13.el9_5.elrepo @elrepo-testing

# Ran "modprobe ib_qib" on serial console
https://rpa.st/YOMA
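
The find output above shows the module for the running kernel living under weak-updates, while the file owned by the RPM sits under the extra directory of another kernel. Entries under weak-updates are normally symlinks, so one way to double-check which real file a given kernel would pick up is to resolve them. This is only a sketch: the function name resolve_kmod and its optional second argument are made up here, and it assumes GNU find and readlink.

```shell
# Hypothetical helper (not part of ELRepo or kmod tooling).
# For one kernel release, print every ib_qib.ko found in its modules tree
# together with the real file the entry resolves to.
resolve_kmod() {
    krel=$1
    root=${2:-/lib/modules}   # second argument exists only to ease testing
    find "$root/$krel" -name 'ib_qib.ko*' | sort |
    while read -r f; do
        # readlink -f follows symlinks and prints the final target
        printf '%s -> %s\n' "$f" "$(readlink -f "$f")"
    done
}
```

Running resolve_kmod "$(uname -r)" and then rpm -qf on the right-hand path should confirm which package actually supplies the module before modprobe is run.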

nscfreny

2024-11-27 02:20

reporter   ~0010214

Looks like "WARNING: CPU: 0 PID: 21 at kernel/dma/mapping.c:551 dma_alloc_attrs+0x40/0x60" is from:

EL9 kernel changelog:
* Mon Mar 11 2024 Lucas Zampieri <lzampier@redhat.com> [5.14.0-429.el9]
- Reapply "dma-mapping: reject __GFP_COMP in dma_alloc_attrs" (Chris Leech) [RHEL-26081]

CentOS stream commit:
https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/commit/3e2dc0427aa4d4130ac67e50b489437c056a81bf

nscfreny

2024-11-27 02:48

reporter   ~0010215

Upstream commit "2fce26a15f17 RDMA/qib: don't pass bogus GFP_ flags to dma_alloc_coherent"?
https://patchwork.kernel.org/project/netdevbpf/patch/20221113163535.884299-4-hch@lst.de/#25091339
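
Since the EL9 kernel now rejects __GFP_COMP in dma_alloc_attrs, a quick way to look for candidate call sites in an unpacked driver source tree might be a recursive grep. A rough sketch only: the function name find_gfp_comp is made up for illustration, GNU grep is assumed, and a hit is just a candidate that still has to be read by hand to see whether the flags really reach dma_alloc_coherent.

```shell
# Hypothetical helper (not part of any kernel or ELRepo tooling).
# List every line in C sources and headers under a directory that
# mentions __GFP_COMP, in "file:line:text" form.
find_gfp_comp() {
    srcdir=$1
    grep -rn --include='*.c' --include='*.h' '__GFP_COMP' "$srcdir"
}
```

The exit status follows grep: 0 when at least one line matched, 1 when nothing was found.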

nscfreny

2024-11-27 05:16

reporter   ~0010216

Rebuilt kmod-ib_qib-1.11-12.el9_5.elrepo.src.rpm with upstream commit 2fce26a15f17 and it seems to work.

[root@n6 ~]# uname -r
5.14.0-503.14.1.el9_5.x86_64
[root@n6 ~]# ibstat
CA 'qib0'
        CA type: InfiniPath_QLE7340
        Number of ports: 1
        Firmware version:
        Hardware version: 2
        Node GUID: 0x00117500006f85e8
        System image GUID: 0x00117500006f85e8
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 10
                LMC: 0
                SM lid: 1
                Capability mask: 0x07610868
                Port GUID: 0x00117500006f85e8
                Link layer: InfiniBand
[root@n6 ~]# qperf n3 conf
conf:
    loc_node = n6
    loc_cpu = 32 Cores: Mixed CPUs
    loc_os = Linux 5.14.0-503.14.1.el9_5.x86_64
    loc_qperf = 0.4.9
    rem_node = n3
    rem_cpu = 32 Cores: Mixed CPUs
    rem_os = Linux 5.14.0-427.42.1.el9_4.x86_64
    rem_qperf = 0.4.9
[root@n6 ~]# qperf -v -i qib0 n3 rc_bi_bw
rc_bi_bw:
    bw = 5.99 GB/sec
    msg_rate = 91.4 K/sec
    id = qib0
    loc_cpus_used = 406 % cpus
    rem_cpus_used = 144 % cpus

toracat

2024-11-27 12:15

administrator   ~0010217

Thanks for the excellent debugging work. We will re-package our kmod with the patch.

nscfreny

2024-11-27 17:26

reporter   ~0010218

Thanks.

I checked the 1.11-13 srpm and it looks like there may have been a mistake: the patch replacement appears only in the changelog?

ls SOURCES/
elrepo-ib_qib_9_1.patch ib_qib-1.11.tar.gz
GPL-v2.0.txt ib_qib-elrepo-bug1390.patch

## From SPECS/kmod-ib_qib.spec:

# Source code patches
Patch0: elrepo-ib_qib_9_1.patch
Patch1: ib_qib-elrepo-bug1390.patch

%changelog
* Wed Nov 27 2024 Akemi Yagi <toracat@elrepo.org> - 1.11-13
- Add ib_qib-elrepo-bug1494.patch
- Remove ib_qib-elrepo-bug1390.patch
  [https://elrepo.org/bugs/view.php?id=1494]

Yes, it crashed ;)

Regards / Fredrik Nyström

nscfreny

2024-11-27 17:51

reporter   ~0010219

Looks like ib_qib-elrepo-bug1390.patch is still needed (removed according to changelog).
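
A mismatch like this (changelog says a patch was added, but no PatchN: line declares it) can be caught mechanically before building. The following is a minimal sketch under some assumptions: the function name undeclared_patches is invented here, and it only understands changelog lines of the exact form "- Add <name>.patch" as used in this spec.

```shell
# Hypothetical helper (not part of rpmbuild or ELRepo tooling).
# Print patch file names that the %changelog claims were added
# but that no "PatchN:" line in the same spec file declares.
undeclared_patches() {
    awk '
        /^Patch[0-9]+:/               { declared[$2] = 1 }   # "Patch1: foo.patch"
        /^- Add .*\.patch[ \t]*$/     { added[$3] = 1 }      # "- Add foo.patch"
        END { for (p in added) if (!(p in declared)) print p }
    ' "$1" | sort
}
```

Against the 1.11-13 spec quoted above, undeclared_patches SPECS/kmod-ib_qib.spec would be expected to list ib_qib-elrepo-bug1494.patch. It does not check the matching %patchN lines in %prep, nor "Remove" entries.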

toracat

2024-11-27 17:53

administrator   ~0010220

OH NO! $^@%##%^$@(xx

nscfreny

2024-11-27 18:05

reporter   ~0010221

## Diff for test build I did 15 minutes ago:
--- SPECS/kmod-ib_qib.spec.orig 2024-11-27 18:37:24.000000000 +0100
+++ SPECS/kmod-ib_qib.spec 2024-11-27 23:49:00.142054218 +0100
@@ -2,13 +2,13 @@
 %define kmod_name ib_qib
 
 # If kmod_kernel_version isn't defined on the rpmbuild line, define it here.
-%{!?kmod_kernel_version: %define kmod_kernel_version 5.14.0-503.11.1.el9_5}
+%{!?kmod_kernel_version: %define kmod_kernel_version 5.14.0-503.15.1.el9_5}
 
 %{!?dist: %define dist .el9}
 
 Name: kmod-%{kmod_name}
 Version: 1.11
-Release: 13%{?dist}
+Release: 13.nsc2%{?dist}
 Summary: %{kmod_name} kernel module(s)
 Group: System Environment/Kernel
 License: GPLv2
@@ -34,6 +34,7 @@
 # Source code patches
 Patch0: elrepo-ib_qib_9_1.patch
 Patch1: ib_qib-elrepo-bug1390.patch
+Patch2: ib_qib-elrepo-bug1494.patch
 
 %define findpat %( echo "%""P" )
 %define __find_requires /usr/lib/rpm/redhat/find-requires.ksyms
@@ -85,6 +86,7 @@
 # Apply patch(es)
 %patch0 -p1
 %patch1 -p1
+%patch2 -p5
 
 %build
 %{__make} -C %{kernel_source} %{?_smp_mflags} V=1 modules M=$PWD CONFIG_INFINIBAND_QIB=m


## And:
wget -O SOURCES/ib_qib-elrepo-bug1494.patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=2fce26a15f1709090ca70f4c7da017424b3b78b3
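
On the -p5 in the diff above: a patch taken straight from the upstream tree names files with the full in-tree path, prefixed by a/ or b/, for example a/drivers/infiniband/hw/qib/qib_init.c. Assuming the kmod source tarball keeps the driver files at the top level of the build directory (my reading of why -p5 is used, not something stated in the spec), all leading directories have to be stripped. A small sketch for working out the number, with strip_level being a made-up name:

```shell
# Hypothetical helper: given a path as it appears in a patch header,
# print the -pN strip level that leaves only the bare file name.
strip_level() {
    # Number of leading components = number of "/"-separated fields minus one
    printf '%s\n' "$1" | awk -F/ '{ print NF - 1 }'
}
```

For the example path above, strip_level a/drivers/infiniband/hw/qib/qib_init.c prints 5, which matches the %patch2 -p5 line, whereas the two ELRepo patches in this spec are applied with -p1.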

toracat

2024-11-27 18:54

administrator   ~0010222

Thanks for that. But before I saw your post, I had pushed kmod-ib_qib-1.11-14.el9_5.elrepo and removed kmod-ib_qib-1.11-13.el9_5.elrepo.

nscfreny

2024-11-27 20:49

reporter   ~0010223

Thanks, kmod-ib_qib-1.11-14.el9_5.elrepo looks good.

toracat

2024-11-27 21:03

administrator   ~0010224

\o/
Finally.

toracat

2024-11-27 21:14

administrator   ~0010225

All -12 files removed.
A DUD image for the -14 pushed to mirrors.

Issue History

Date Modified Username Field Change
2024-11-26 09:36 nscfreny New Issue
2024-11-26 09:36 nscfreny Status new => assigned
2024-11-26 09:36 nscfreny Assigned To => tqhoang
2024-11-26 09:46 nscfreny Note Added: 0010207
2024-11-26 10:46 toracat Status assigned => acknowledged
2024-11-26 10:46 toracat Note Added: 0010208
2024-11-26 12:28 toracat Note Added: 0010210
2024-11-26 14:31 nscfreny Note Added: 0010211
2024-11-26 21:54 toracat Note Added: 0010212
2024-11-27 01:28 nscfreny Note Added: 0010213
2024-11-27 02:20 nscfreny Note Added: 0010214
2024-11-27 02:48 nscfreny Note Added: 0010215
2024-11-27 05:16 nscfreny Note Added: 0010216
2024-11-27 12:15 toracat Assigned To tqhoang => toracat
2024-11-27 12:15 toracat Status acknowledged => assigned
2024-11-27 12:15 toracat Note Added: 0010217
2024-11-27 17:26 nscfreny Note Added: 0010218
2024-11-27 17:51 nscfreny Note Added: 0010219
2024-11-27 17:53 toracat Note Added: 0010220
2024-11-27 18:05 nscfreny Note Added: 0010221
2024-11-27 18:54 toracat Note Added: 0010222
2024-11-27 20:49 nscfreny Note Added: 0010223
2024-11-27 21:03 toracat Note Added: 0010224
2024-11-27 21:14 toracat Note Added: 0010225
2024-11-28 15:52 toracat Status assigned => resolved
2024-11-28 15:52 toracat Resolution open => fixed