View Issue Details

ID: 0001480
Project: channel: elrepo/el8
Category: --elrepo--request-for-enhancement--
View Status: public
Last Update: 2024-09-16 13:31
Reporter: smcgrat
Assigned To: tqhoang
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: fixed
Summary: 0001480: mthca driver update request

Description

Sorry if this is the wrong place for this.

We have hardware with the following InfiniBand HCA that is not fully operational in Rocky Linux 8:

0a:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev a0)

I don't see that identifier listed on the device IDs page, though: https://elrepo.org/wiki/doku.php?id=deviceids
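
To compare a card against that page, the vendor:device pair can be pulled out of the `lspci -nn` line. A minimal sketch: the function name `pci_ids` is made up for illustration, and it assumes the ID is the last bracketed `xxxx:xxxx` group on the line, as in the output above.

```shell
# Print the last [vvvv:dddd] vendor:device pair found on each input line.
# Lines without such a pair produce no output.
pci_ids() {
    sed -n 's/.*\[\([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\)].*/\1/p'
}

# Possible usage on a live system (not run here):
#   lspci -nn | grep -i infiniband | pci_ids
```

For the line quoted above this yields 15b3:6274, which is the string to look for on the device IDs page.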

With stock Rocky Linux 8.10 installed, the InfiniBand hardware is not detected. If the mainline kernel is used, the device is visible but not fully operational.

In Scientific Linux 7 the Mellanox MT25204 HCA uses the 'mthca' driver (see https://access.redhat.com/discussions/5628911). That driver has been removed in RHEL 8 (see https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/considerations_in_adopting_rhel_8/hardware-enablement_considerations-in-adopting-rhel-8#removed-device-drivers_hardware-enablement).

Is it possible to get this driver updated please?

If any more information is required, please let me know.

Steps To Reproduce

Case 1 - stock RL8.10
Install Rocky Linux 8.10 and run:
yum group install "Infiniband Support"
yum install rdma-core-devel
reboot
ibstat
ibv_devices
    device node GUID
    ------ ----------------
ibv_devinfo
No IB devices found

Case 2 - Mainline kernel
Install Rocky Linux 8.10 and run:
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm -y
yum --enablerepo=elrepo-kernel install kernel-ml -y
reboot
ibv_devinfo
No IB devices found
ibv_devices
    device node GUID
    ------ ----------------
ibstat mthca0
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.2.0
        Hardware version: a0
        Node GUID: 0x0002c9020027b208
        System image GUID: 0x0002c9020027b20b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x02590a6a
                Port GUID: 0x0002c9020027b209
                Link layer: InfiniBand
Tags: No tags attached.

Activities

tqhoang

2024-09-10 08:57

manager   ~0010098

Last edited: 2024-09-11 13:25

We have pushed the following packages to our mirrors. Please let us know if everything is working ok.

For RHEL 8.10 installation:
dd-ib_mthca-1.0.20080404-1.el8_10.elrepo.iso
dd-ib_mthca-1.0.20080404-1.el8_10.elrepo.SHA256SUM.asc

For RHEL 8.10 GA kernel:
kmod-ib_mthca-1.0.20080404-1.el8_10.elrepo.x86_64.rpm
kmod-ib_mthca-1.0.20080404-1.el8_10.elrepo.src.rpm

For RHEL 8.10 errata kernel:
kmod-ib_mthca-1.0.20080404-2.el8_10.elrepo.x86_64.rpm
kmod-ib_mthca-1.0.20080404-2.el8_10.elrepo.src.rpm

For RHEL 8.10 libibverbs-utils plugin (ibv_devices, ibv_devinfo, etc.):
ib_mthca-ibverbs-48.0-1.el8_10.elrepo.x86_64.rpm
ib_mthca-ibverbs-48.0-1.el8_10.elrepo.src.rpm

--

On a side note, we made the same packages available for RHEL 9.4.

smcgrat

2024-09-12 04:27

reporter   ~0010103

Thanks. Unfortunately, after installing the package and rebooting (`yum localinstall kmod-ib_mthca-1.0.20080404-2.el8_10.elrepo.x86_64.rpm -y && reboot`), two test systems kernel panicked, as shown in the attached image. I don't have a kdump, sorry; kdump is disabled on these systems. If you want one, let me know and I'll try to generate it.
IMG_20240912_083653_HDR.jpg (3,494,643 bytes)

tqhoang

2024-09-12 10:54

manager   ~0010104

Thanks for the feedback. A kdump would be helpful to debug. In the meantime, I will take a look at the function triggering the page fault.

Could you also check if the "ib_mthca-ibverbs" package allows the ibv_devinfo & ibv_devices apps to work with the kernel-ml?

smcgrat

2024-09-13 05:02

reporter   ~0010106

Thanks for your help.

Won't be able to get a kdump until the week after next because I'm on leave most of next week. Sorry.

Yes, the "ib_mthca-ibverbs" package allows the ibv_devinfo & ibv_devices apps to work with the kernel-ml as follows.

[root@crusher-n001:~]# uname -r
6.10.9-1.el8.elrepo.x86_64
[root@crusher-n001:~]# ibv_devinfo
No IB devices found
[root@crusher-n001:~]# ibv_devices
    device node GUID
    ------ ----------------
[root@crusher-n001:~]# yum install -y ib_mthca-ibverbs
[...]
[root@crusher-n001:~]# echo $?
0
[root@crusher-n001:~]# ibv_devinfo
hca_id: mthca0
        transport: InfiniBand (0)
        fw_ver: 1.2.0
        node_guid: 0002:c902:0027:b208
        sys_image_guid: 0002:c902:0027:b20b
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_03B0140001
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
                        sm_lid: 1
                        port_lid: 1
                        port_lmc: 0x00
                        link_layer: InfiniBand

hca_id: mthca1
        transport: InfiniBand (0)
        fw_ver: 1.2.0
        node_guid: 0002:c902:0027:b234
        sys_image_guid: 0002:c902:0027:b237
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_03B0140001
        phys_port_cnt: 1
                port: 1
                        state: PORT_DOWN (1)
                        max_mtu: 2048 (4)
                        active_mtu: 512 (2)
                        sm_lid: 0
                        port_lid: 0
                        port_lmc: 0x00
                        link_layer: InfiniBand
[root@crusher-n001:~]# ibv_devices
    device node GUID
    ------ ----------------
    mthca0 0002c9020027b208
    mthca1 0002c9020027b234
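
When checking several nodes, the `ibv_devinfo` output above can be reduced to one line per port. This is only a sketch: `ibv_port_states` is a hypothetical helper name, and it assumes the `hca_id:`, `port:` and `state:` labels appear exactly as in the paste above.

```shell
# Read `ibv_devinfo` output on stdin and print "<hca> <port> <state>"
# for every port, e.g. "mthca0 1 PORT_ACTIVE".
ibv_port_states() {
    awk '
        $1 == "hca_id:" { hca = $2 }             # remember the current HCA
        $1 == "port:"   { port = $2 }            # remember the current port
        $1 == "state:"  { print hca, port, $2 }  # emit one summary line
    '
}

# Possible usage on a live system (not run here):
#   ibv_devinfo | ibv_port_states
```

A port that reports PORT_INIT with sm_lid 0 usually has a physical link but has not yet been configured by a subnet manager, so the state column is worth checking before suspecting the driver.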

tqhoang

2024-09-15 13:36

manager   ~0010107

Last edited: 2024-09-15 13:45

I found a patch from upstream that appears to fix the crash.
https://github.com/torvalds/linux/commit/dc52aadbc1849cbe3fcf6bc54d35f6baa396e0a1

Here are some updated packages for you to test. Note that these are unsigned while we're debugging this issue.
https://elrepo.org/people/tqhoang/bug-1480/

The "-1.1" packages are for the RHEL 8.10 GA kernel.
The "-2.1" packages are for the RHEL 8.10 errata kernels.
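
The choice between the two releases can be scripted from `uname -r`. The sketch below rests on assumptions that are not stated in this thread: that the RHEL 8.10 GA kernel is 4.18.0-553.el8_10 and that every errata kernel carries extra numeric fields after 553 (the thread only confirms 4.18.0-553.16.1.el8_10 with "-2.1"). The function name `kmod_release_for_kernel` is made up for illustration.

```shell
# Map a kernel release string (as printed by `uname -r`) to the kmod
# release suffix mentioned in this thread.
# ASSUMPTION: GA kernel = 4.18.0-553.el8_10.*, errata = 4.18.0-553.N....el8_10.*
kmod_release_for_kernel() {
    case "$1" in
        4.18.0-553.el8_10.*)   echo "1.1" ;;                 # GA kernel
        4.18.0-553.*.el8_10.*) echo "2.1" ;;                 # errata kernel
        *)                     echo "unknown"; return 1 ;;   # e.g. kernel-ml
    esac
}

# Possible usage on a live system (not run here):
#   kmod_release_for_kernel "$(uname -r)"
```

Anything that is not an el8_10 kernel, such as kernel-ml, is reported as unknown rather than guessed.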

smcgrat

2024-09-16 04:58

reporter   ~0010108

Thank you.

Package installs OK:
$ uname -a
Linux crusher-n002 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 8 17:47:08 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ yum localinstall kmod-ib_mthca-1.0.20080404-2.1.el8_10.elrepo.x86_64.rpm -y

The server comes back OK after reboot, but the InfiniBand device is not visible:

[root@crusher-n002:~]# ibv_devices
    device node GUID
    ------ ----------------
[root@crusher-n002:~]# ibv_devinfo
No IB devices found
[root@crusher-n002:~]# ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.2.0
        Hardware version: a0
        Node GUID: 0x0002c9020027b0bc
        System image GUID: 0x0002c9020027b0bf
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02590a68
                Port GUID: 0x0002c9020027b0bd
                Link layer: InfiniBand
CA 'mthca1'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.2.0
        Hardware version: a0
        Node GUID: 0x0002c9020027b210
        System image GUID: 0x0002c9020027b213
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02590a68
                Port GUID: 0x0002c9020027b211
                Link layer: InfiniBand

Do you still want a kdump for the other driver version that caused a kernel panic?

Again, many thanks for all your help here.

tqhoang

2024-09-16 07:24

manager   ~0010109

That's great to hear. No need for the kdump.

Regarding ibv_devices/ibv_devinfo, did you install our "ib_mthca-ibverbs" package on this box? I'm curious whether it still isn't working with our kmod, or whether you just didn't install it on this one.

smcgrat

2024-09-16 07:38

reporter   ~0010110

Thank you. ibv_devices and ibv_devinfo do work after installing the ib_mthca-ibverbs-48.0-1.el8_10.elrepo.x86_64.rpm and kmod-ib_mthca-1.0.20080404-2.1.el8_10.elrepo.x86_64.rpm packages. Apologies, I'd missed that requirement.

[root@crusher-n002:~]# ibv_devices
    device node GUID
    ------ ----------------
    mthca0 0002c9020027b0bc
    mthca1 0002c9020027b210
[root@crusher-n002:~]# ibv_devinfo
hca_id: mthca0
        transport: InfiniBand (0)
        fw_ver: 1.2.0
        node_guid: 0002:c902:0027:b0bc
        sys_image_guid: 0002:c902:0027:b0bf
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_03B0140001
        phys_port_cnt: 1
                port: 1
                        state: PORT_INIT (2)
                        max_mtu: 2048 (4)
                        active_mtu: 512 (2)
                        sm_lid: 0
                        port_lid: 0
                        port_lmc: 0x00
                        link_layer: InfiniBand

hca_id: mthca1
        transport: InfiniBand (0)
        fw_ver: 1.2.0
        node_guid: 0002:c902:0027:b210
        sys_image_guid: 0002:c902:0027:b213
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_03B0140001
        phys_port_cnt: 1
                port: 1
                        state: PORT_DOWN (1)
                        max_mtu: 2048 (4)
                        active_mtu: 512 (2)
                        sm_lid: 0
                        port_lid: 0
                        port_lmc: 0x00
                        link_layer: InfiniBand

tqhoang

2024-09-16 08:38

manager   ~0010111

Thanks for confirming that. FWIW, the "ib_mthca-ibverbs" package is the mthca plugin for the libibverbs package. We have it listed as recommended by the kmod-ib_mthca package. When it's installed from our mirrors via dnf, it should automatically get pulled in as a weak dependency.
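
The symptom seen earlier in this ticket (ibstat lists the CAs while ibv_devices is empty) can be turned into a quick check. A rough sketch only: the helper names are hypothetical, and the interpretation is inferred from this thread rather than from any official diagnostic.

```shell
# Count CA entries in `ibstat` output (unindented lines starting with "CA ").
ibstat_ca_count() {
    awk '/^CA / { n++ } END { print n + 0 }'
}

# Count devices in `ibv_devices` output, skipping its two header lines.
ibv_device_count() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# $1 = CA count from ibstat, $2 = device count from ibv_devices.
diagnose_mthca() {
    if [ "$1" -eq 0 ]; then
        echo "no CA visible: kernel driver (kmod-ib_mthca) likely not loaded"
    elif [ "$2" -eq 0 ]; then
        echo "CA visible but no verbs device: ib_mthca-ibverbs likely missing"
    else
        echo "kernel driver and verbs provider both present"
    fi
}

# Possible usage on a live system (not run here):
#   diagnose_mthca "$(ibstat | ibstat_ca_count)" "$(ibv_devices | ibv_device_count)"
```

If the second message appears on a machine that installed the kmod through dnf, it may be worth checking whether weak dependencies are disabled there (the `install_weak_deps` option in dnf.conf).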

I'll get the updated kmod packages rebuilt & signed and drop a note here when they're ready.

smcgrat

2024-09-16 09:54

reporter   ~0010112

Great, many thanks again.

tqhoang

2024-09-16 13:31

manager   ~0010113

The following updated packages are syncing to our mirrors.

For RHEL 8.10 GA kernel (e.g. installation):
dd-ib_mthca-1.0.20080404-1.1.el8_10.elrepo.SHA256SUM.asc
dd-ib_mthca-1.0.20080404-1.1.el8_10.elrepo.iso
kmod-ib_mthca-1.0.20080404-1.1.el8_10.elrepo.x86_64.rpm
kmod-ib_mthca-1.0.20080404-1.1.el8_10.elrepo.src.rpm

For RHEL 8.10 errata kernel:
kmod-ib_mthca-1.0.20080404-2.1.el8_10.elrepo.x86_64.rpm
kmod-ib_mthca-1.0.20080404-2.1.el8_10.elrepo.src.rpm

Reminder that this other package is required for libibverbs support (ibv_devinfo, ibv_devices, etc):
ib_mthca-ibverbs-48.0-1.el8_10.elrepo.x86_64.rpm

I'm going to close this issue as resolved now. If you have any further issues, please open a new ticket.

Issue History

Date Modified Username Field Change
2024-09-09 07:58 smcgrat New Issue
2024-09-09 07:58 smcgrat Status new => assigned
2024-09-09 07:58 smcgrat Assigned To => toracat
2024-09-09 09:27 tqhoang Assigned To toracat => tqhoang
2024-09-10 08:57 tqhoang Status assigned => feedback
2024-09-10 08:57 tqhoang Note Added: 0010098
2024-09-11 13:25 tqhoang Note Edited: 0010098
2024-09-12 04:27 smcgrat Note Added: 0010103
2024-09-12 04:27 smcgrat File Added: IMG_20240912_083653_HDR.jpg
2024-09-12 04:27 smcgrat Status feedback => assigned
2024-09-12 10:54 tqhoang Note Added: 0010104
2024-09-13 05:02 smcgrat Note Added: 0010106
2024-09-15 13:36 tqhoang Note Added: 0010107
2024-09-15 13:36 tqhoang Status assigned => feedback
2024-09-15 13:45 tqhoang Note Edited: 0010107
2024-09-16 04:58 smcgrat Note Added: 0010108
2024-09-16 04:58 smcgrat Status feedback => assigned
2024-09-16 07:24 tqhoang Note Added: 0010109
2024-09-16 07:38 smcgrat Note Added: 0010110
2024-09-16 08:38 tqhoang Note Added: 0010111
2024-09-16 09:54 smcgrat Note Added: 0010112
2024-09-16 13:31 tqhoang Status assigned => closed
2024-09-16 13:31 tqhoang Resolution open => fixed
2024-09-16 13:31 tqhoang Note Added: 0010113