View Issue Details

ID: 0001275
Project: channel: elrepo/el9
Category: kmod-mlx4
View Status: public
Last Update: 2022-11-15 05:52
Reporter: torkil
Assigned To: pperry
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
Summary: 0001275: Kmod-mlx4 for RHEL 9 / Mellanox Technologies MT25408A0-FCC-QI ConnectX
Description:

Hi

I got the following from Red Hat:

"
Unfortunately the "Mellanox Technologies MT25408A0-FCC-QI ConnectX" card is no longer supported as of RHEL8:

# egrep Mellanox lspci
08:00.0 Network controller [0280]: Mellanox Technologies MT25408A0-FCC-QI ConnectX, Dual Port 40Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
        Subsystem: Mellanox Technologies HP InfiniBand 4X QDR CX-2 PCI-e G2 Dual Port HCA [15b3:0021]

# egrep :08: sos_commands/kernel/dmesg | head -n1
[ 0.351192] pci 0000:08:00.0: [15b3:673c] type 00 class 0x028000

It's supported on RHEL7:

[root@rhel7 ~]# modinfo mlx4_core | grep -i 15b3d | grep -i 673c
alias: pci:v000015B3d0000673Csv*sd*bc*sc*i*

[root@rhel7 ~]# uname -a
Linux rhel7 3.10.0-1160.76.1.el7.x86_64 #1 SMP Tue Jul 26 14:15:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

But starting from RHEL8 the adapter has been removed from the mlx4_core driver:

[root@rhel8 ~]# modinfo mlx4_core | grep -i 15b3 | grep -i 673c
[root@rhel8 ~]# uname -a
Linux rhel8 4.18.0-372.26.1.el8_6.x86_64 #1 SMP Sat Aug 27 02:44:20 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

[root@rhel9 ~]# modinfo mlx4_core | grep -i 15b3 | grep -i 673c
[root@rhel9 ~]# uname -a
Linux rhel9 5.14.0-70.26.1.el9_0.x86_64 #1 SMP PREEMPT Fri Sep 2 16:07:40 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
"
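The modinfo checks in the quoted reply can be sketched in Python. This is only an illustration, not Red Hat's tooling: the function names are hypothetical, and the check only recognises aliases that spell out the vendor and device IDs in full (a driver that matched the card through a wildcard alias would be missed).

```python
def pci_modalias_prefix(vendor_id: str, device_id: str) -> str:
    """Build the leading part of a PCI module alias from hex IDs.

    For vendor 15b3 and device 673c this gives pci:v000015B3d0000673C,
    the form seen in the RHEL7 modinfo output quoted above.
    """
    vendor = int(vendor_id, 16)
    device = int(device_id, 16)
    return f"pci:v{vendor:08X}d{device:08X}"


def driver_claims_device(modinfo_output: str, vendor_id: str, device_id: str) -> bool:
    """Return True if any 'alias:' line names this vendor/device pair explicitly."""
    prefix = pci_modalias_prefix(vendor_id, device_id).upper()
    for line in modinfo_output.splitlines():
        key, _, value = line.partition(":")
        # Only 'alias' fields are of interest; compare case-insensitively.
        if key.strip() == "alias" and value.strip().upper().startswith(prefix):
            return True
    return False


if __name__ == "__main__":
    rhel7_line = "alias: pci:v000015B3d0000673Csv*sd*bc*sc*i*"
    print(driver_claims_device(rhel7_line, "15b3", "673c"))
```

To check a live system, the text passed in would be the output of `modinfo mlx4_core`; empty grep output, as on the RHEL8 and RHEL9 hosts above, corresponds to a False result here.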

I've used kmod-mlx4 on RHEL 8 for the cards, can that be built for RHEL 9 also?

Thanks
Tags: No tags attached.

Activities

toracat

2022-10-05 14:43

administrator   ~0008687

Acknowledged.

In the meantime, you may want to (test-)install kernel-ml for el9 which has the mlx4 driver enabled.

torkil

2022-10-05 16:03

reporter   ~0008688

Thanks, works like a charm with kernel-ml

toracat

2022-10-05 18:09

administrator   ~0008689

That's great news. We will get to the kmod package as soon as we are able.

pperry

2022-10-06 08:47

administrator   ~0008692

Last edited: 2022-10-06 08:48

The following package has been built for rhel9 and uploaded to the main elrepo repository:

kmod-mlx4-4.0-1.el9_0.elrepo.x86_64.rpm

It should be available on our mirror sites to test shortly.

Please note - our kmod packages are only compatible with the RHEL distro kernel. They do not work with our own kernel-ml packages.

To test, please install the kmod package and reboot into a RHEL 9 distro kernel (not our kernel-ml). The device should then work as expected with the distro kernel(s).
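As a rough aid for confirming which kernel flavour is running before testing, the release string can be inspected. The sketch below rests on an assumption: that ELRepo kernel-ml releases carry an "elrepo" component (for example the hypothetical 6.0.0-1.el9.elrepo.x86_64), while distro kernels such as 5.14.0-70.26.1.el9_0.x86_64 do not. The function name is illustrative only.

```python
import platform


def is_elrepo_kernel(release: str) -> bool:
    """Guess whether a kernel release string belongs to an ELRepo kernel.

    Assumption: ELRepo kernels embed 'elrepo' as a dot-separated component
    of the release string; RHEL distro kernels do not.
    """
    return "elrepo" in release.split(".")


if __name__ == "__main__":
    release = platform.release()  # same value as `uname -r`
    if is_elrepo_kernel(release):
        print(f"{release}: looks like an ELRepo kernel; kmod packages are not meant for it")
    else:
        print(f"{release}: no 'elrepo' marker found; probably a distro kernel")
```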

Thanks

torkil

2022-10-06 14:50

reporter   ~0008693

Hi

Wow, that was fast =)

It doesn't quite work though. I see this in dmesg:

"
[ 10.854188] mlx4_core 0000:08:00.0: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[ 11.167136] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[ 11.168241] <mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
[ 11.168244] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[ 11.202932] infiniband mlx4_0: Couldn't register device with driver model
[ 11.228829] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
"

[root@g79 ~]# uname -a
Linux g79.drcmr 5.14.0-70.26.1.el9_0.x86_64 #1 SMP PREEMPT Fri Sep 2 16:07:40 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

Host is freshly rebuilt with distro kernel.
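For anyone wanting to check their own logs for the same symptom, here is a small hedged sketch. It simply searches dmesg text for the failure message shown above; the function name is hypothetical and the message string is copied from this report, so other kernel versions may word it differently.

```python
from __future__ import annotations

# Message text taken from the dmesg excerpt in this report.
FAILURE_MARKER = "Couldn't register device with driver model"


def ib_registration_failures(dmesg_text: str) -> list[str]:
    """Return the dmesg lines reporting a failed InfiniBand device registration."""
    return [
        line
        for line in dmesg_text.splitlines()
        if "infiniband" in line and FAILURE_MARKER in line
    ]


if __name__ == "__main__":
    sample = (
        "[ 11.168244] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0\n"
        "[ 11.202932] infiniband mlx4_0: Couldn't register device with driver model\n"
    )
    for hit in ib_registration_failures(sample):
        print(hit)
```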

pperry

2022-10-06 18:14

administrator   ~0008695

Thanks for the feedback. We are going to have to do a bit more work to fix this, I think.

Originally, on RHEL8, Red Hat simply didn't enable support for older hardware, so all we had to do was rebuild the RHEL drivers/net/ethernet/mellanox/mlx4 source code with -DCONFIG_MLX4_CORE_GEN2 to switch support back on for Gen2 cards, and that is what the first version above did.

This bug looks like it may be the issue you have reported:
https://bugzilla.redhat.com/show_bug.cgi?id=2014094

and the patch is here:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/infiniband/hw/mlx4?h=v5.15.72&id=0bccc44a54e8d68be5ed02b7985a869cc2df6444

The driver is actually split into two parts, an InfiniBand part and an Ethernet part, and the issue here is that the bug and the patch apply to the InfiniBand module: drivers/infiniband/hw/mlx4/main.c

I think we will need to backport the patch to the /drivers/infiniband/hw/mlx4/mlx4_ib.ko module, and also build and ship this module in our kmod package (I didn't originally realise this driver was split into two disparate modules). It's late now, but I can take a look at that tomorrow for you, and hopefully get a v2 package out for you to test. I will update here as soon as I have something available for you.

torkil

2022-10-07 04:38

reporter   ~0008696

Sounds good, thanks.

Kind regards,

Torkil

pperry

2022-10-07 12:10

administrator   ~0008697

An updated package has been released to the main repository:

kmod-mlx4-4.0-2.el9_0.elrepo.x86_64.rpm

As discussed above, I have also built the mlx4_ib infiniband module, and have backported the upstream patch: mlx4: Do not fail the registration on port stats

Hopefully that fixes the issue for you. If you could please test and provide feedback, that would be great.

Many thanks.

torkil

2022-10-07 15:49

reporter   ~0008700

Seems to work, if a little noisy:

"
[Fri Oct 7 21:44:22 2022] mlx4_core: Mellanox ConnectX core driver v4.0-0
[Fri Oct 7 21:44:22 2022] ------------[ cut here ]------------
[Fri Oct 7 21:44:22 2022] WARNING: CPU: 0 PID: 289 at net/core/devlink.c:10134 devlink_param_register+0x1b3/0x1d0
[Fri Oct 7 21:44:22 2022] Modules linked in: mlx4_core(OE+) tls rfkill ib_uverbs ib_core sunrpc intel_rapl_msr intel_rapl_common iTCO_wdt iTCO_vendor_support ipmi_ssif mgag200 sb_edac drm_kms_helper x86_pkg_temp_thermal intel_powerclamp syscopyarea sysfillrect coretemp sysimgblt fb_sys_fops rapl intel_cstate cec intel_uncore drm pcspkr acpi_ipmi ipmi_si ipmi_devintf lpc_ich acpi_power_meter ipmi_msghandler hpilo ioatdma fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul ahci crc32_pclmul crc32c_intel libahci mpt3sas ghash_clmulni_intel libata igb serio_raw i2c_algo_bit hpwdt dca raid_class scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mlx4_core]
[Fri Oct 7 21:44:22 2022] CPU: 0 PID: 289 Comm: kworker/0:8 Kdump: loaded Tainted: G W OE --------- --- 5.14.0-70.26.1.el9_0.x86_64 #1
[Fri Oct 7 21:44:22 2022] Hardware name: HP ProLiant SL230s Gen8 /, BIOS P75 05/24/2019
[Fri Oct 7 21:44:22 2022] Workqueue: events work_for_cpu_fn
[Fri Oct 7 21:44:22 2022] RIP: 0010:devlink_param_register+0x1b3/0x1d0
[Fri Oct 7 21:44:22 2022] Code: ff ff ff 0f 0b 49 8b 6c 24 08 e9 05 ff ff ff 0f 0b e9 54 ff ff ff 0f 0b e9 2b ff ff ff 49 83 7c 24 28 00 75 a4 e9 40 ff ff ff <0f> 0b 49 8b 6c 24 08 e9 de fe ff ff 0f 0b e9 68 fe ff ff b8 f4 ff
[Fri Oct 7 21:44:22 2022] RSP: 0018:ffffa34847cf7d98 EFLAGS: 00010246
[Fri Oct 7 21:44:22 2022] RAX: 000000000000000e RBX: ffffffffc08f3968 RCX: 0000000000000001
[Fri Oct 7 21:44:22 2022] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8b7c86b91c00
[Fri Oct 7 21:44:22 2022] RBP: ffffffffc090ec91 R08: 0000000000000230 R09: 0000000004000000
[Fri Oct 7 21:44:22 2022] R10: 0000000000000000 R11: 0000000000000010 R12: ffffffffc08f3968
[Fri Oct 7 21:44:22 2022] R13: ffff8b7c86730000 R14: 0000000000000005 R15: ffff8b8b7f62e90d
[Fri Oct 7 21:44:22 2022] FS: 0000000000000000(0000) GS:ffff8b8b7f600000(0000) knlGS:0000000000000000
[Fri Oct 7 21:44:22 2022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Oct 7 21:44:22 2022] CR2: 000055a3ff2f84a0 CR3: 000000118166c001 CR4: 00000000000606f0
[Fri Oct 7 21:44:22 2022] Call Trace:
[Fri Oct 7 21:44:22 2022] devlink_params_register+0x50/0xb0
[Fri Oct 7 21:44:22 2022] mlx4_init_one+0x111/0x2a0 [mlx4_core]
[Fri Oct 7 21:44:22 2022] local_pci_probe+0x45/0x80
[Fri Oct 7 21:44:22 2022] work_for_cpu_fn+0x16/0x20
[Fri Oct 7 21:44:22 2022] process_one_work+0x1e8/0x3c0
[Fri Oct 7 21:44:22 2022] worker_thread+0x1da/0x3b0
[Fri Oct 7 21:44:22 2022] ? rescuer_thread+0x370/0x370
[Fri Oct 7 21:44:22 2022] kthread+0x149/0x170
[Fri Oct 7 21:44:22 2022] ? set_kthread_struct+0x40/0x40
[Fri Oct 7 21:44:22 2022] ret_from_fork+0x22/0x30
[Fri Oct 7 21:44:22 2022] ---[ end trace 10dcc546735bafc5 ]---
[Fri Oct 7 21:44:22 2022] mlx4_core: Initializing 0000:08:00.0
[Fri Oct 7 21:44:25 2022] mlx4_core 0000:08:00.0: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[Fri Oct 7 21:44:25 2022] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[Fri Oct 7 21:44:25 2022] <mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
[Fri Oct 7 21:44:25 2022] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[Fri Oct 7 21:44:25 2022] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
[Fri Oct 7 21:44:25 2022] mlx4_core 0000:08:00.0 ib1: "NetworkManager" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info.
[Fri Oct 7 21:44:25 2022] mlx4_core 0000:08:00.0 ibs2d1: renamed from ib1
[Fri Oct 7 21:44:25 2022] Loading iSCSI transport class v2.0-870.
[Fri Oct 7 21:44:25 2022] mlx4_core 0000:08:00.0 ibs2: renamed from ib0
[Fri Oct 7 21:44:25 2022] iscsi: registered transport (iser)
[Fri Oct 7 21:44:25 2022] Rounding down aligned max_sectors from 4294967295 to 4294967288
[Fri Oct 7 21:44:25 2022] db_root: cannot open: /etc/target
[Fri Oct 7 21:44:25 2022] RPC: Registered rdma transport module.
[Fri Oct 7 21:44:25 2022] RPC: Registered rdma backchannel transport module.
[Fri Oct 7 21:44:44 2022] IPv6: ADDRCONF(NETDEV_CHANGE): ibs2: link becomes ready
"
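To pull the essentials out of a trace like the one above, the WARNING line can be parsed with a regular expression. This is a sketch that assumes the line layout seen in this log (CPU, PID, source file and line, then symbol+offset); the function name is hypothetical and other kernel versions may format the line differently.

```python
from __future__ import annotations

import re

# Layout assumed from the log above:
# WARNING: CPU: <n> PID: <n> at <file>:<line> <symbol>+<offset>/<size>
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<symbol>\S+)"
)


def parse_kernel_warning(log_line: str) -> dict | None:
    """Extract CPU, PID, source location and function from a kernel WARNING line."""
    match = WARNING_RE.search(log_line)
    if match is None:
        return None
    return {
        "cpu": int(match["cpu"]),
        "pid": int(match["pid"]),
        "file": match["file"],
        "line": int(match["line"]),
        # Drop the +offset/size suffix to leave just the function name.
        "function": match["symbol"].split("+", 1)[0],
    }


if __name__ == "__main__":
    line = (
        "[Fri Oct 7 21:44:22 2022] WARNING: CPU: 0 PID: 289 at "
        "net/core/devlink.c:10134 devlink_param_register+0x1b3/0x1d0"
    )
    print(parse_kernel_warning(line))
```

For the trace in this report, that points at devlink_param_register in net/core/devlink.c, reached from mlx4_init_one according to the call trace.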

Thanks a lot =)

Kind regards,

Torkil

pperry

2022-10-07 17:01

administrator   ~0008701

Thanks for the feedback.

I'll mark as resolved for now - if you get any issues or anything actionable we can improve, please do not hesitate to let us know.

pperry

2022-11-15 05:52

administrator   ~0008738

The patch in #8695 has now been applied in RHEL 9.1.

Issue History

Date Modified Username Field Change
2022-10-05 14:36 torkil New Issue
2022-10-05 14:36 torkil Status new => assigned
2022-10-05 14:36 torkil Assigned To => toracat
2022-10-05 14:43 toracat Note Added: 0008687
2022-10-05 14:44 toracat Assigned To toracat => pperry
2022-10-05 14:48 burakkucat Project channel: kernel/el9 => channel: elrepo/el9
2022-10-05 14:48 burakkucat Category --kernel--request-for-enhancement-- => General
2022-10-05 14:50 burakkucat Category General => --elrepo--request-for-enhancement--
2022-10-05 16:03 torkil Note Added: 0008688
2022-10-05 18:09 toracat Note Added: 0008689
2022-10-05 18:48 burakkucat Category --elrepo--request-for-enhancement-- => kmod-mlx4
2022-10-06 08:47 pperry Note Added: 0008692
2022-10-06 08:47 pperry Status assigned => feedback
2022-10-06 08:48 pperry Note Edited: 0008692
2022-10-06 14:50 torkil Note Added: 0008693
2022-10-06 14:50 torkil Status feedback => assigned
2022-10-06 18:14 pperry Note Added: 0008695
2022-10-07 04:38 torkil Note Added: 0008696
2022-10-07 12:10 pperry Note Added: 0008697
2022-10-07 12:10 pperry Status assigned => feedback
2022-10-07 15:49 torkil Note Added: 0008700
2022-10-07 15:49 torkil Status feedback => assigned
2022-10-07 17:01 pperry Note Added: 0008701
2022-10-07 17:01 pperry Status assigned => resolved
2022-10-07 17:01 pperry Resolution open => fixed
2022-11-15 05:52 pperry Note Added: 0008738