View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001480 | channel: elrepo/el8 | --elrepo--request-for-enhancement-- | public | 2024-09-09 07:58 | 2024-09-16 13:31 |
Reporter | smcgrat | Assigned To | tqhoang | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | closed | Resolution | fixed | ||
Summary | 0001480: mthca driver update request | ||||
Description | Sorry if this is the wrong place for this. We have hardware with the following InfiniBand HCA that is not fully operational in Rocky Linux 8: 0a:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev a0) I don't see that identifier listed on the device IDs page though: https://elrepo.org/wiki/doku.php?id=deviceids With stock Rocky Linux 8.10 installed, the InfiniBand hardware is not detected. If the mainline kernel is used, the hardware is visible but not fully operational. In Scientific Linux 7 the Mellanox MT25204 HCA uses the 'mthca' driver, as per https://access.redhat.com/discussions/5628911 and as per https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/considerations_in_adopting_rhel_8/hardware-enablement_considerations-in-adopting-rhel-8#removed-device-drivers_hardware-enablement that driver has been removed from RHEL. Is it possible to get this driver updated, please? If any more information is required, please let me know. | ||||
Steps To Reproduce | Case 1 - stock RL8.10: Install Rocky Linux 8.10 and run: `yum group install "Infiniband Support"` `yum install rdma-core-devel` `reboot` `ibstat` `ibv_devices` device node GUID ------ ---------------- `ibv_devinfo` No IB devices found Case 2 - Mainline kernel: Install Rocky Linux 8.10 and run: `rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org` `yum install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm -y` `yum --enablerepo=elrepo-kernel install kernel-ml -` `reboot` `ibv_devinfo` No IB devices found `ibv_devices` device node GUID ------ ---------------- `ibstat mthca0` CA 'mthca0' CA type: MT25204 Number of ports: 1 Firmware version: 1.2.0 Hardware version: a0 Node GUID: 0x0002c9020027b208 System image GUID: 0x0002c9020027b20b Port 1: State: Active Physical state: LinkUp Rate: 10 Base lid: 1 LMC: 0 SM lid: 1 Capability mask: 0x02590a6a Port GUID: 0x0002c9020027b209 Link layer: InfiniBand | ||||
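For anyone repeating the reproduction steps above on several nodes, the header-only `ibv_devices` result can be detected mechanically. The following is a minimal sketch, not part of the original report; the function name `count_ib_devices` is hypothetical, and it assumes the two-line header (column titles, then dashes) shown in the output above.

```shell
# Count the devices listed in `ibv_devices` output read from stdin.
# Assumption: the first two lines are the header (titles, then dashes),
# and every following non-empty line describes one device.
count_ib_devices() {
    awk 'NR > 2 && NF >= 2 { n++ } END { print n + 0 }'
}
```

On an affected host this would be used as `ibv_devices | count_ib_devices`; a result of 0 corresponds to the header-only output in both cases above.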
Tags | No tags attached. | ||||
|
We have pushed the following packages to our mirrors. Please let us know if everything is working ok. For RHEL 8.10 installation: dd-ib_mthca-1.0.20080404-1.el8_10.elrepo.iso dd-ib_mthca-1.0.20080404-1.el8_10.elrepo.SHA256SUM.asc For RHEL 8.10 GA kernel: kmod-ib_mthca-1.0.20080404-1.el8_10.elrepo.x86_64.rpm kmod-ib_mthca-1.0.20080404-1.el8_10.elrepo.src.rpm For RHEL 8.10 errata kernel: kmod-ib_mthca-1.0.20080404-2.el8_10.elrepo.x86_64.rpm kmod-ib_mthca-1.0.20080404-2.el8_10.elrepo.src.rpm For RHEL 8.10 libibverbs-utils plugin (ibv_devices, ibv_devinfo, etc.): ib_mthca-ibverbs-48.0-1.el8_10.elrepo.x86_64.rpm ib_mthca-ibverbs-48.0-1.el8_10.elrepo.src.rpm -- On a side note, we made the same packages available for RHEL 9.4. |
|
Thanks. Unfortunately though, after installing the package and rebooting (`yum localinstall kmod-ib_mthca-1.0.20080404-2.el8_10.elrepo.x86_64.rpm -y && reboot`), two test systems kernel panicked, as per the attached image. I don't have a kdump, sorry; kdump is disabled on these systems. If you want one, let me know and I'll try to generate it. |
|
Thanks for the feedback. A kdump would be helpful for debugging. In the meantime, I will take a look at the function triggering the page fault. Could you also check if the "ib_mthca-ibverbs" package allows the ibv_devinfo & ibv_devices apps to work with the kernel-ml? |
|
Thanks for your help. Won't be able to get a kdump until the week after next because I'm on leave most of next week. Sorry. Yes, the "ib_mthca-ibverbs" package allows the ibv_devinfo & ibv_devices apps to work with the kernel-ml as follows. [root@crusher-n001:~]# uname -r 6.10.9-1.el8.elrepo.x86_64 [root@crusher-n001:~]# ibv_devinfo No IB devices found [root@crusher-n001:~]# ibv_devices device node GUID ------ ---------------- [root@crusher-n001:~]# yum install -y ib_mthca-ibverbs [...] [root@crusher-n001:~]# echo $? 0 [root@crusher-n001:~]# ibv_devinfo hca_id: mthca0 transport: InfiniBand (0) fw_ver: 1.2.0 node_guid: 0002:c902:0027:b208 sys_image_guid: 0002:c902:0027:b20b vendor_id: 0x02c9 vendor_part_id: 25204 hw_ver: 0xA0 board_id: MT_03B0140001 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 1 port_lid: 1 port_lmc: 0x00 link_layer: InfiniBand hca_id: mthca1 transport: InfiniBand (0) fw_ver: 1.2.0 node_guid: 0002:c902:0027:b234 sys_image_guid: 0002:c902:0027:b237 vendor_id: 0x02c9 vendor_part_id: 25204 hw_ver: 0xA0 board_id: MT_03B0140001 phys_port_cnt: 1 port: 1 state: PORT_DOWN (1) max_mtu: 2048 (4) active_mtu: 512 (2) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: InfiniBand [root@crusher-n001:~]# ibv_devices device node GUID ------ ---------------- mthca0 0002c9020027b208 mthca1 0002c9020027b234 |
|
I found a patch from upstream that appears to fix the crash. https://github.com/torvalds/linux/commit/dc52aadbc1849cbe3fcf6bc54d35f6baa396e0a1 Here are some updated packages for you to test. Note that these are unsigned while we're debugging this issue. https://elrepo.org/people/tqhoang/bug-1480/ The "-1.1" packages are for the RHEL 8.10 GA kernel. The "-2.1" packages are for the RHEL 8.10 errata kernels. |
|
Thank you. The package installs OK: $ uname -a Linux crusher-n002 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 8 17:47:08 UTC 2024 x86_64 x86_64 x86_64 GNU/Linu $ yum localinstall kmod-ib_mthca-1.0.20080404-2.1.el8_10.elrepo.x86_64.rpm -y And the server comes back OK after reboot, but the InfiniBand device is not visible: [root@crusher-n002:~]# ibv_devices device node GUID ------ ---------------- [root@crusher-n002:~]# ibv_devinfo No IB devices found [root@crusher-n002:~]# ibstat CA 'mthca0' CA type: MT25204 Number of ports: 1 Firmware version: 1.2.0 Hardware version: a0 Node GUID: 0x0002c9020027b0bc System image GUID: 0x0002c9020027b0bf Port 1: State: Initializing Physical state: LinkUp Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02590a68 Port GUID: 0x0002c9020027b0bd Link layer: InfiniBand CA 'mthca1' CA type: MT25204 Number of ports: 1 Firmware version: 1.2.0 Hardware version: a0 Node GUID: 0x0002c9020027b210 System image GUID: 0x0002c9020027b213 Port 1: State: Down Physical state: Polling Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02590a68 Port GUID: 0x0002c9020027b211 Link layer: InfiniBand Do you still want a kdump for the other driver version that caused a kernel panic? Again, many thanks for all your help here. |
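In the `ibstat` output above, mthca0 reports State: Initializing with Physical state: LinkUp and SM lid: 0. On InfiniBand this combination usually means the link has trained but no subnet manager has configured the port yet, which is a separate matter from the driver loading. The sketch below is illustrative only; the function name `ports_awaiting_sm` is hypothetical, and it assumes the `ibstat` layout shown in this thread.

```shell
# List CAs from `ibstat` output (stdin) whose port is physically up
# but still in the Initializing state.
# Assumption: "State:" precedes "Physical state:" for each port,
# as in the ibstat output quoted in this issue.
ports_awaiting_sm() {
    awk '
        $1 == "CA" && NF == 2 { ca = $2; gsub(/\047/, "", ca) }  # CA header line, strip quotes
        $1 == "State:"        { state = $2 }
        $1 == "Physical" && $2 == "state:" {
            if (state == "Initializing" && $3 == "LinkUp") print ca
        }'
}
```

Run as `ibstat | ports_awaiting_sm`; for the output above it would be expected to list only mthca0, since mthca1 is Down and Polling.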
|
That's great to hear. No need for the kdump. Regarding the ibv_devices/ibv_devinfo, did you install our "ib_mthca-ibverbs" package on this box? I'm curious if it's not working with our kmod still or you just didn't install it on this one. |
|
Thank you, ibv_devices and ibv_devinfo do work after installing the ib_mthca-ibverbs-48.0-1.el8_10.elrepo.x86_64.rpm and kmod-ib_mthca-1.0.20080404-2.1.el8_10.elrepo.x86_64.rpm packages. Apologies, I'd missed that requirement. [root@crusher-n002:~]# ibv_devices device node GUID ------ ---------------- mthca0 0002c9020027b0bc mthca1 0002c9020027b210 [root@crusher-n002:~]# ibv_devinfo hca_id: mthca0 transport: InfiniBand (0) fw_ver: 1.2.0 node_guid: 0002:c902:0027:b0bc sys_image_guid: 0002:c902:0027:b0bf vendor_id: 0x02c9 vendor_part_id: 25204 hw_ver: 0xA0 board_id: MT_03B0140001 phys_port_cnt: 1 port: 1 state: PORT_INIT (2) max_mtu: 2048 (4) active_mtu: 512 (2) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: InfiniBand hca_id: mthca1 transport: InfiniBand (0) fw_ver: 1.2.0 node_guid: 0002:c902:0027:b210 sys_image_guid: 0002:c902:0027:b213 vendor_id: 0x02c9 vendor_part_id: 25204 hw_ver: 0xA0 board_id: MT_03B0140001 phys_port_cnt: 1 port: 1 state: PORT_DOWN (1) max_mtu: 2048 (4) active_mtu: 512 (2) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: InfiniBand |
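Since the missing piece here was simply that one of the two packages had not been installed, a quick check along these lines may save a round trip. This is a sketch, not something from the thread; the function name `missing_packages` is hypothetical, and it relies on the English "package NAME is not installed" message that `rpm -q` prints.

```shell
# Print the names of packages that `rpm -q` reports as missing.
# Reads `rpm -q` output on stdin.
# Assumption: messages are in English (e.g. run rpm with LANG=C).
missing_packages() {
    awk '$1 == "package" && /is not installed$/ { print $2 }'
}
```

For example, `LANG=C rpm -q kmod-ib_mthca ib_mthca-ibverbs | missing_packages` would print nothing when both packages are present.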
|
Thanks for confirming that. FWIW, the "ib_mthca-ibverbs" package is the mthca plugin for the libibverbs package. We have it listed as recommended by the kmod-ib_mthca package. When it's installed from our mirrors via dnf, it should automatically get pulled in as a weak dependency. I'll get the updated kmod packages rebuilt & signed and drop a note here when they're ready. |
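Whether dnf pulls in a recommended package depends on its `install_weak_deps` option, which is enabled by default. A minimal sketch for checking that setting follows; the function name `weak_deps_enabled` is hypothetical, the default path `/etc/dnf/dnf.conf` is the usual location, and the sketch does not look at command-line overrides or which section the option appears in.

```shell
# Return success if dnf would install weak dependencies (Recommends:),
# judging only by a dnf.conf-style file. A missing file or missing
# option is treated as the default (enabled).
weak_deps_enabled() {
    conf="${1:-/etc/dnf/dnf.conf}"
    value="$(awk -F= '
        { gsub(/[ \t]/, "", $1) }                 # tolerate spaces around the key
        $1 == "install_weak_deps" { v = $2 }      # last occurrence wins
        END { gsub(/[ \t]/, "", v); print tolower(v) }' "$conf" 2>/dev/null)"
    case "$value" in
        false|0|no|off) return 1 ;;
        *)              return 0 ;;
    esac
}
```

Usage would look like `weak_deps_enabled && echo "weak deps on"`. The recommendation itself can be inspected with `rpm -q --recommends kmod-ib_mthca` once the kmod is installed.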
|
Great, many thanks again. |
|
The following updated packages are syncing to our mirrors. For RHEL 8.10 GA kernel (e.g. installation): dd-ib_mthca-1.0.20080404-1.1.el8_10.elrepo.SHA256SUM.asc dd-ib_mthca-1.0.20080404-1.1.el8_10.elrepo.iso kmod-ib_mthca-1.0.20080404-1.1.el8_10.elrepo.x86_64.rpm kmod-ib_mthca-1.0.20080404-1.1.el8_10.elrepo.src.rpm For RHEL 8.10 errata kernel: kmod-ib_mthca-1.0.20080404-2.1.el8_10.elrepo.x86_64.rpm kmod-ib_mthca-1.0.20080404-2.1.el8_10.elrepo.src.rpm Reminder that this other package is required for libibverbs support (ibv_devinfo, ibv_devices, etc): ib_mthca-ibverbs-48.0-1.el8_10.elrepo.x86_64.rpm I'm going to close this issue as resolved now. If you have any further issues, please open a new ticket. |
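To tell which of the two kmod builds above applies to a given host, the running kernel release can be compared against the GA kernel string. The sketch below is an illustration only: the function name `kmod_release_for_kernel` is hypothetical, and it assumes the RHEL 8.10 GA kernel is `4.18.0-553.el8_10.<arch>`, with errata kernels carrying extra digits after `553` (as in the `4.18.0-553.16.1.el8_10.x86_64` host earlier in this thread).

```shell
# Map a kernel release string (as printed by `uname -r`) to the kmod
# release suffix named in this issue.
# Assumption: GA kernel is 4.18.0-553.el8_10.<arch>; any
# 4.18.0-553.<digits>.el8_10.<arch> kernel is an errata kernel.
kmod_release_for_kernel() {
    case "$1" in
        4.18.0-553.el8_10.*)   echo "1.1" ;;       # RHEL 8.10 GA kernel
        4.18.0-553.*.el8_10.*) echo "2.1" ;;       # RHEL 8.10 errata kernels
        *)                     echo "unknown" ;;   # e.g. kernel-ml or another release
    esac
}
```

Calling `kmod_release_for_kernel "$(uname -r)"` on the test host above would give 2.1, matching the package that was installed there.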
Date Modified | Username | Field | Change |
---|---|---|---|
2024-09-09 07:58 | smcgrat | New Issue | |
2024-09-09 07:58 | smcgrat | Status | new => assigned |
2024-09-09 07:58 | smcgrat | Assigned To | => toracat |
2024-09-09 09:27 | tqhoang | Assigned To | toracat => tqhoang |
2024-09-10 08:57 | tqhoang | Status | assigned => feedback |
2024-09-10 08:57 | tqhoang | Note Added: 0010098 | |
2024-09-11 13:25 | tqhoang | Note Edited: 0010098 | |
2024-09-12 04:27 | smcgrat | Note Added: 0010103 | |
2024-09-12 04:27 | smcgrat | File Added: IMG_20240912_083653_HDR.jpg | |
2024-09-12 04:27 | smcgrat | Status | feedback => assigned |
2024-09-12 10:54 | tqhoang | Note Added: 0010104 | |
2024-09-13 05:02 | smcgrat | Note Added: 0010106 | |
2024-09-15 13:36 | tqhoang | Note Added: 0010107 | |
2024-09-15 13:36 | tqhoang | Status | assigned => feedback |
2024-09-15 13:45 | tqhoang | Note Edited: 0010107 | |
2024-09-16 04:58 | smcgrat | Note Added: 0010108 | |
2024-09-16 04:58 | smcgrat | Status | feedback => assigned |
2024-09-16 07:24 | tqhoang | Note Added: 0010109 | |
2024-09-16 07:38 | smcgrat | Note Added: 0010110 | |
2024-09-16 08:38 | tqhoang | Note Added: 0010111 | |
2024-09-16 09:54 | smcgrat | Note Added: 0010112 | |
2024-09-16 13:31 | tqhoang | Status | assigned => closed |
2024-09-16 13:31 | tqhoang | Resolution | open => fixed |
2024-09-16 13:31 | tqhoang | Note Added: 0010113 |