View Issue Details

ID: 0001485
Project: channel: elrepo/el8
Category: kmod-ib_mthca
View Status: public
Last Update: 2024-11-22 10:22
Reporter: smcgrat
Assigned To: tqhoang
Priority: normal
Severity: crash
Reproducibility: always
Status: assigned
Resolution: open
Platform: x86_64
OS: Rocky Linux
OS Version: 8.10
Summary: 0001485: RDMA connection causes kernel panic
Description: Apologies if this is the wrong place for this. The mthca drivers were kindly updated in request 0001480. However, if I try to connect over RDMA to a machine with the drivers installed, that machine kernel panics.

Steps To Reproduce: Install Rocky Linux 8.10 on two servers with the following HCAs installed:

0a:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev a0)

Install the ib_mthca-ibverbs and kmod-ib_mthca packages.

Start a qperf server on one of the servers.

On the other server, try to connect over RDMA with qperf to the server running qperf. Example command: qperf -v -i mthca0 172.16.36.3 rc_bi_bw

This causes the machine that is running the qperf server to kernel panic.
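
The reproduction steps above can be sketched as a small shell helper. This is only a dry-run sketch, not the reporter's actual procedure: the function names and the DRY_RUN variable are hypothetical, the lspci check is an added convenience, and the install line assumes the ELRepo repository is already enabled. The package names, PCI ID, interface, address, and qperf test are taken from the report.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them (hypothetical switch, not from the report).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Both hosts: confirm the HCA is visible (PCI ID 15b3:6274 from the report).
check_hca() { run lspci -nn -d 15b3:6274; }

# Both hosts: install the packages named in the report
# (assumes the ELRepo repository is enabled).
install_pkgs() { run yum install ib_mthca-ibverbs kmod-ib_mthca; }

# Server host: qperf with no arguments listens for clients.
start_server() { run qperf; }

# Client host: RC bidirectional bandwidth test.
# $1 = local IB device (e.g. mthca0), $2 = server address.
run_client() { run qperf -v -i "$1" "$2" rc_bi_bw; }
```

For example, `run_client mthca0 172.16.36.3` prints the client command from the report; with DRY_RUN=0 it would run it, which is the step that panics the server.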
Additional Information: I have generated a kdump, but it is quite large (233 MB). Let me know if you want a copy of it and how to share it with you.

Many thanks in advance.
Tags: No tags attached.

Activities

tqhoang

2024-09-26 10:58

manager   ~0010135

Can you get a screen shot of the kernel panic? Make sure it shows the entire stack trace.

smcgrat

2024-09-26 11:03

reporter   ~0010136

Is the attached any use?
IMG_20240926_144227_HDR.jpg (3,455,956 bytes)

tqhoang

2024-09-27 18:08

manager   ~0010139

Thanks, I'll see if I can make do with the screen capture.

tqhoang

2024-09-29 19:11

manager   ~0010140

It looks like I'll need the kdump file.

Also please let me know the exact kernel version you were running too.

smcgrat

2024-09-30 04:15

reporter   ~0010141

OK. Does the forum have direct messaging? If so, can you direct message me your email address please, and I'll send you a link for it through our file sending service.

smcgrat

2024-09-30 10:08

reporter   ~0010143

Link sent, please let me know if it doesn't arrive.

tqhoang

2024-09-30 11:38

manager   ~0010144

I haven't received it yet. There isn't anything in my spam folder either.

tqhoang

2024-09-30 13:59

manager   ~0010146

Ok, I received it now.

tqhoang

2024-10-24 17:21

manager   ~0010157

Following up to say that I have not made any progress on the ib_mthca crash. I looked at the kdump file and compared our source code to upstream; nothing seems suspicious.

I'll keep investigating, but it's kind of hard without any hardware in-hand to test against.

smcgrat

2024-10-25 04:08

reporter   ~0010159

Thanks for the update and the efforts. They are much appreciated. I can ask if you can be given access to some of our systems so you have access to the hardware if that works for you?

tqhoang

2024-10-28 08:21

manager   ~0010162

Thanks, but I'm not sure if this would help. I would kind of need physical access to the cards because I'd anticipate the box crashing a lot as I try a lot of build & test cycles.

tqhoang

2024-10-28 08:57

manager   ~0010163

If you're in the USA, any chance you'd be willing to send 2 cards and cables? I'd like to get this resolved but I can't justify buying them myself (even a used set on eBay).

smcgrat

2024-10-29 06:56

reporter   ~0010175

We're in Ireland unfortunately. It would need to be approved, but I could see if you could be given root permissions on some of the servers and IPMI access to power cycle them when they break. Would that meet your needs?

tqhoang

2024-11-03 17:49

manager   ~0010179

Last edited: 2024-11-04 10:19

Thanks for the offer, but I'd prefer not to do something internationally for liability reasons.

But I do have something new to try. I decided to port the abandoned driver in the RHEL 8.10 kernel instead.
https://elrepo.org/people/tqhoang/bug-1485/

For RHEL 8.10 GA kernel:
kmod-ib_mthca-1.0.20080404-3.el8_10.elrepo.x86_64.rpm
kmod-ib_mthca-1.0.20080404-3.el8_10.elrepo.src.rpm

For the latest RHEL 8.10 errata kernel:
kmod-ib_mthca-1.0.20080404-3.1.el8_10.elrepo.x86_64.rpm
kmod-ib_mthca-1.0.20080404-3.1.el8_10.elrepo.src.rpm

UPDATE: Hold off on trying this. I want to add some upstream patches from the LTS branch Linux 4.19.y

tqhoang

2024-11-04 10:20

manager   ~0010180

@smcgrat - I uploaded the refresh with the upstream patches from the LTS branch Linux 4.19.y. Please let me know how it goes.

tqhoang

2024-11-13 19:10

manager   ~0010181

@smcgrat - Will you have a chance to test the new kmods for ib_mthca?

smcgrat

2024-11-14 06:51

reporter   ~0010182

Sorry for the delay.

Had the 4.18.0-553.27.1.el8_10.x86_64 kernel installed and installed the package with:

yum localinstall kmod-ib_mthca-1.0.20080404-3.el8_10.elrepo.x86_64.rpm

That installed kernel-core-4.18.0-553.8.1.el8_10.x86_64, and after the reboot the machine is booted from 4.18.0-553.8.1.el8_10.x86_64 and there are no network devices showing, including no Ethernet devices. Only the loopback device is visible from ip a.

When I run ibv_devices or ibv_devinfo I get this error:
Failed to get IB devices list: Function not implemented

Have I missed something here? Sorry if so.

tqhoang

2024-11-14 09:10

manager   ~0010183

The "-3" kmod is meant for the stock RHEL 8.10 GA kernel (kernel-4.18.0-553.el8_10). It's only meant for fresh installs with either the full or minimal DVD ISO.

The "-3.1" kmod is the one you want for RHEL 8.10 errata kernels. At the time, kernel-4.18.0-553.22.1.el8_10 was the latest when I built the RPM. But it should weak-link against the current kernel-4.18.0-553.27.1.el8_10 just fine.

I suspect that when you did the local install with the "-3" kmod, dnf didn't install kernel-modules-4.18.0-553.8.1.el8_10, which might explain why you have no other networking devices.

So long story short, update to the "-3.1" kmod and use the latest kernel 4.18.0-553.27.1.el8_10. Please let me know how it goes.
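
The kernel-to-kmod mapping described above could be checked with a short shell sketch before installing. The function names pick_kmod and has_kernel_modules are hypothetical. Treating every 4.18.0-553.x errata kernel as a match for the "-3.1" build is an assumption: this note only mentions 553.22.1 (the build kernel) and 553.27.1 (expected to weak-link).

```shell
#!/bin/sh
# Print the test kmod RPM that matches a kernel release string (as from `uname -r`).
# Assumption: any 4.18.0-553.<errata> kernel maps to the "-3.1" build.
pick_kmod() {
    case "$1" in
        4.18.0-553.el8_10.x86_64)
            # Stock RHEL 8.10 GA kernel
            echo "kmod-ib_mthca-1.0.20080404-3.el8_10.elrepo.x86_64.rpm" ;;
        4.18.0-553.*.el8_10.x86_64)
            # RHEL 8.10 errata kernels
            echo "kmod-ib_mthca-1.0.20080404-3.1.el8_10.elrepo.x86_64.rpm" ;;
        *)
            echo "no test build known for kernel $1" >&2
            return 1 ;;
    esac
}

# Succeeds if the kernel-modules package for the given kernel release is installed.
# Without it, most network drivers are missing after boot.
has_kernel_modules() {
    rpm -q "kernel-modules-$1" >/dev/null 2>&1
}
```

A possible use on the affected node is `pick_kmod "$(uname -r)"` followed by `has_kernel_modules "$(uname -r)" || echo "kernel-modules missing"`.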

smcgrat

2024-11-14 11:48

reporter   ~0010184

Thanks, sorry for missing that.
I installed that package:

yum localinstall kmod-ib_mthca-1.0.20080404-3.1.el8_10.elrepo.x86_64.rpm

Unfortunately, the node now kernel panics.

Do you want a kdump?

Sorry for the mess.
IMG_20241114_164032_HDR.jpg (3,308,650 bytes)

tqhoang

2024-11-16 16:33

manager   ~0010185

Yes, if you have the kdump that might help.

tqhoang

2024-11-22 10:22

manager   ~0010199

Thanks, I downloaded the kdump. I'll try to check it out this weekend.

Issue History

Date Modified Username Field Change
2024-09-26 10:51 smcgrat New Issue
2024-09-26 10:51 smcgrat Status new => assigned
2024-09-26 10:51 smcgrat Assigned To => toracat
2024-09-26 10:57 tqhoang Assigned To toracat => tqhoang
2024-09-26 10:58 tqhoang Note Added: 0010135
2024-09-26 11:03 smcgrat Note Added: 0010136
2024-09-26 11:03 smcgrat File Added: IMG_20240926_144227_HDR.jpg
2024-09-27 18:08 tqhoang Note Added: 0010139
2024-09-29 19:11 tqhoang Note Added: 0010140
2024-09-30 04:15 smcgrat Note Added: 0010141
2024-09-30 10:08 smcgrat Note Added: 0010143
2024-09-30 11:38 tqhoang Note Added: 0010144
2024-09-30 13:59 tqhoang Note Added: 0010146
2024-10-24 17:21 tqhoang Note Added: 0010157
2024-10-25 04:08 smcgrat Note Added: 0010159
2024-10-28 08:21 tqhoang Note Added: 0010162
2024-10-28 08:57 tqhoang Note Added: 0010163
2024-10-28 16:48 tqhoang Category --elrepo--OTHER-- => kmod-ib_mthca
2024-10-29 06:56 smcgrat Note Added: 0010175
2024-11-03 17:49 tqhoang Status assigned => feedback
2024-11-03 17:49 tqhoang Note Added: 0010179
2024-11-04 09:02 tqhoang Note Edited: 0010179
2024-11-04 10:19 tqhoang Note Edited: 0010179
2024-11-04 10:20 tqhoang Note Added: 0010180
2024-11-13 19:10 tqhoang Note Added: 0010181
2024-11-14 06:51 smcgrat Note Added: 0010182
2024-11-14 06:51 smcgrat Status feedback => assigned
2024-11-14 09:10 tqhoang Note Added: 0010183
2024-11-14 11:48 smcgrat Note Added: 0010184
2024-11-14 11:48 smcgrat File Added: IMG_20241114_164032_HDR.jpg
2024-11-16 16:33 tqhoang Note Added: 0010185
2024-11-22 10:22 tqhoang Note Added: 0010199