View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001485 | channel: elrepo/el8 | kmod-ib_mthca | public | 2024-09-26 10:51 | 2024-11-16 16:33 |
Reporter | smcgrat | Assigned To | tqhoang | ||
Priority | normal | Severity | crash | Reproducibility | always |
Status | assigned | Resolution | open | ||
Platform | x86_64 | OS | Rocky Linux | OS Version | 8.10 |
Summary | 0001485: RDMA connection causes kernel panic | ||||
Description | Apologies if this is the wrong place for this. The mthca drivers were kindly updated in request 0001480. However, if I try to connect over RDMA to a machine with the drivers installed, the machine kernel panics. | ||||
Steps To Reproduce | 1. Install Rocky Linux 8.10 on two servers with the following HCAs installed: 0a:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev a0) 2. Install the ib_mthca-ibverbs and kmod-ib_mthca packages. 3. Start a qperf server on one of the servers. 4. On the other server, try to connect over RDMA with qperf to the server running qperf. E.g. command: qperf -v -i mthca0 172.16.36.3 rc_bi_bw Result: the machine that is running the qperf server kernel panics. | ||||
Additional Information | Have generated a kdump, it is quite large though, 233MB. Let me know if you want a copy of it and how to share it with you. Many thanks in advance. | ||||
Tags | No tags attached. | ||||
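The reproduction steps above can be sketched as a small shell helper that only prints the commands for each host, so nothing is run by accident on a machine that may panic. The function name `repro_commands` is hypothetical, the `dnf install` line assumes the ELRepo repository is already enabled on both hosts, and the address and device defaults are simply the ones quoted in the report.

```shell
# Print (do not run) the reproduction commands for one side of the test.
# Usage: repro_commands server|client [server_ip] [device]
# Defaults are the address and device quoted in the report.
repro_commands() {
    role="$1"
    server_ip="${2:-172.16.36.3}"
    device="${3:-mthca0}"

    case "$role" in
        server) qperf_cmd="qperf" ;;
        client) qperf_cmd="qperf -v -i $device $server_ip rc_bi_bw" ;;
        *)
            echo "usage: repro_commands server|client [server_ip] [device]" >&2
            return 1
            ;;
    esac

    # Assumption: ELRepo is enabled, so both packages resolve by name.
    echo "dnf install -y kmod-ib_mthca ib_mthca-ibverbs"
    echo "$qperf_cmd"
}
```

Running `repro_commands server` on one host and `repro_commands client` on the other lists what to type on each; per the report, the panic occurs on the host running the qperf server once the client connects.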
|
Can you get a screen shot of the kernel panic? Make sure it shows the entire stack trace. |
|
Is the attached any use? |
|
Thanks, I'll see if I can make do with the screen capture. |
|
It looks like I'll need the kdump file. Also please let me know the exact kernel version you were running too. |
|
OK. Does the forum have dm? If so, can you direct message me your email address please and I'll send you a link for it through our file sending service. |
|
Link sent, please let me know if it doesn't arrive. |
|
I haven't received it yet. There isn't anything in my spam folder either. |
|
Ok, I received it now. |
|
Following up to say that I have not made any progress on the ib_mthca crash. I looked at the kdump file and compared our source code to upstream, and nothing seems suspicious. I'll keep investigating, but it's kind of hard without any hardware in hand to test against. |
|
Thanks for the update and the efforts. They are much appreciated. I can ask if you can be given access to some of our systems so you have access to the hardware if that works for you? |
|
Thanks, but I'm not sure if this would help. I would kind of need physical access to the cards because I'd anticipate the box crashing a lot as I try a lot of build & test cycles. |
|
If you're in the USA, any chance you'd be willing to send 2 cards and cables? I'd like to get this resolved but I can't justify buying them myself (even a used set on eBay). |
|
We're in Ireland unfortunately. It would need to be approved but I could see if you could be given root permissions to some of the servers and ipmi access to powercycle when they break. Would that meet your needs? |
|
Thanks for the offer, but I'd prefer not to do something internationally for liability reasons. But I do have something new to try. I decided to port the abandoned driver in the RHEL 8.10 kernel instead. https://elrepo.org/people/tqhoang/bug-1485/ For the RHEL 8.10 GA kernel: kmod-ib_mthca-1.0.20080404-3.el8_10.elrepo.x86_64.rpm kmod-ib_mthca-1.0.20080404-3.el8_10.elrepo.src.rpm For the latest RHEL 8.10 errata kernel: kmod-ib_mthca-1.0.20080404-3.1.el8_10.elrepo.x86_64.rpm kmod-ib_mthca-1.0.20080404-3.1.el8_10.elrepo.src.rpm UPDATE: Hold off on trying this. I want to add some upstream patches from the LTS branch Linux 4.19.y. |
|
@smcgrat - I uploaded the refresh with the upstream patches from the LTS branch Linux 4.19.y. Please let me know how it goes. |
|
@smcgrat - Will you have a chance to test the new kmods for ib_mthca? |
|
Sorry for the delay. I had the 4.18.0-553.27.1.el8_10.x86_64 kernel installed and installed the package with: yum localinstall kmod-ib_mthca-1.0.20080404-3.el8_10.elrepo.x86_64.rpm That installed kernel-core-4.18.0-553.8.1.el8_10.x86_64, and after the reboot the machine booted from 4.18.0-553.8.1.el8_10.x86_64 with no network devices showing, including no Ethernet devices. Only the loopback device is visible from ip a. When I run ibv_devices or ibv_devinfo I get this error: Failed to get IB devices list: Function not implemented Have I missed something here? Sorry if so. |
|
The "-3" kmod is meant for the stock RHEL 8.10 GA kernel (kernel-4.18.0-553.el8_10). It's only meant for fresh installs with either the full or minimal DVD ISO. The "-3.1" kmod is the one you want for RHEL 8.10 errata kernels. When I built the RPM, kernel-4.18.0-553.22.1.el8_10 was the latest, but it should weak-link against the current kernel-4.18.0-553.27.1.el8_10 just fine. I suspect that when you did the local install with the "-3" kmod, dnf didn't install kernel-modules-4.18.0-553.8.1.el8_10, which might explain why you have no other networking devices. So long story short, update to the "-3.1" kmod and use the latest kernel, 4.18.0-553.27.1.el8_10. Please let me know how it goes. |
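As a rough illustration of the rule in the note above, the sketch below maps a kernel release string to the kmod release it should be paired with: "3" for the stock GA kernel and "3.1" for any 8.10 errata kernel. The function name `pick_kmod_release` is hypothetical, and the patterns assume that errata kernels always carry extra numeric fields between `553` and `el8_10`, as the versions quoted in this thread do.

```shell
# Map a kernel release string to the matching kmod-ib_mthca release.
# Prints "3" for the RHEL 8.10 GA kernel, "3.1" for 8.10 errata kernels,
# and "unknown" (with a non-zero exit status) for anything else.
pick_kmod_release() {
    case "$1" in
        # GA kernel: nothing between "553" and "el8_10".
        4.18.0-553.el8_10*) echo "3" ;;
        # Errata kernels: e.g. 4.18.0-553.27.1.el8_10.x86_64.
        4.18.0-553.*.el8_10*) echo "3.1" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Example (run on the affected host):
#   pick_kmod_release "$(uname -r)"
```

On the reporter's 4.18.0-553.27.1.el8_10.x86_64 kernel this prints `3.1`, matching the advice to use kmod-ib_mthca-1.0.20080404-3.1.el8_10.elrepo.x86_64.rpm.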
|
Thanks, sorry for missing that. I installed that package: yum localinstall kmod-ib_mthca-1.0.20080404-3.1.el8_10.elrepo.x86_64.rpm Now the node kernel panics unfortunately. Do you want a kdump? Sorry for the mess. |
|
Yes, if you have the kdump that might help. |
Date Modified | Username | Field | Change |
---|---|---|---|
2024-09-26 10:51 | smcgrat | New Issue | |
2024-09-26 10:51 | smcgrat | Status | new => assigned |
2024-09-26 10:51 | smcgrat | Assigned To | => toracat |
2024-09-26 10:57 | tqhoang | Assigned To | toracat => tqhoang |
2024-09-26 10:58 | tqhoang | Note Added: 0010135 | |
2024-09-26 11:03 | smcgrat | Note Added: 0010136 | |
2024-09-26 11:03 | smcgrat | File Added: IMG_20240926_144227_HDR.jpg | |
2024-09-27 18:08 | tqhoang | Note Added: 0010139 | |
2024-09-29 19:11 | tqhoang | Note Added: 0010140 | |
2024-09-30 04:15 | smcgrat | Note Added: 0010141 | |
2024-09-30 10:08 | smcgrat | Note Added: 0010143 | |
2024-09-30 11:38 | tqhoang | Note Added: 0010144 | |
2024-09-30 13:59 | tqhoang | Note Added: 0010146 | |
2024-10-24 17:21 | tqhoang | Note Added: 0010157 | |
2024-10-25 04:08 | smcgrat | Note Added: 0010159 | |
2024-10-28 08:21 | tqhoang | Note Added: 0010162 | |
2024-10-28 08:57 | tqhoang | Note Added: 0010163 | |
2024-10-28 16:48 | tqhoang | Category | --elrepo--OTHER-- => kmod-ib_mthca |
2024-10-29 06:56 | smcgrat | Note Added: 0010175 | |
2024-11-03 17:49 | tqhoang | Status | assigned => feedback |
2024-11-03 17:49 | tqhoang | Note Added: 0010179 | |
2024-11-04 09:02 | tqhoang | Note Edited: 0010179 | |
2024-11-04 10:19 | tqhoang | Note Edited: 0010179 | |
2024-11-04 10:20 | tqhoang | Note Added: 0010180 | |
2024-11-13 19:10 | tqhoang | Note Added: 0010181 | |
2024-11-14 06:51 | smcgrat | Note Added: 0010182 | |
2024-11-14 06:51 | smcgrat | Status | feedback => assigned |
2024-11-14 09:10 | tqhoang | Note Added: 0010183 | |
2024-11-14 11:48 | smcgrat | Note Added: 0010184 | |
2024-11-14 11:48 | smcgrat | File Added: IMG_20241114_164032_HDR.jpg | |
2024-11-16 16:33 | tqhoang | Note Added: 0010185 |