View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001514 | channel: elrepo/el9 | kmod-aacraid | public | 2025-02-24 02:34 | 2025-02-25 22:23 |
Reporter | ogumemura | Assigned To | pperry | ||
Priority | high | Severity | crash | Reproducibility | random |
Status | assigned | Resolution | open | ||
Platform | Linux Server | OS | Alma Linux 9 | OS Version | 9.5 |
Summary | 0001514: kmod-aacraid Issue Resurfaces with ASR71605 on AlmaLinux 9.5 | ||||
Description | The issue described in "0001444: kmod-aacraid problem when accessing the physical/logical device" is believed to be ongoing. --- I have been struggling with the same issue in the AlmaLinux9 and ASR-71605 environment even after applying the following procedure: https://elrepo.org/bugs/view.php?id=1444 Revert "scsi: aacraid: Reply queue mapping to CPUs based on IRQ affinity" *It is possible that this was only applied to the ELrepo RPM for the 8 series. Since then, there has been no significant progress, and I thought it was a specific issue occurring only in my environment. However, there have been additional reports and responses in the following upstream kernel thread, which has caught my attention: https://bugzilla.kernel.org/show_bug.cgi?id=217599 In the exchanges after January 2, 2025, in this thread, I wanted to test the patch provided by Sagar. I attempted to port it to the AlmaLinux9 Kernel 5.14 environment based on the SRPM of kmod-aacraid. However, Kernel 5.14 does not support blk_mq_map_hw_queues, and although I considered alternative methods, I couldn't resolve it with my skills and knowledge (GPT suggested a software-based solution, but I couldn't decide whether to adopt it). I also considered testing with Kernel 6.x, but due to my situation, it ultimately needs to work effectively with the standard Kernel 5.14 package, so I am at a standstill. Additionally, although I haven't been able to confirm it due to lack of environment, I believe the same issue might be persisting in the EL8 series as well. | ||||
Steps To Reproduce | 1. Install an AlmaLinux 9.5 server. 2. Download the package from https://mirror.rackspace.com/elrepo/elrepo/el8/x86_64/RPMS/kmod-aacraid-1.2.1-9.el8_9.elrepo.x86_64.rpm 3. Install it with the rpm command. 4. Access the logical device under high I/O load. (???) | ||||
Additional Information | I examined the SRPM of kmod-aacraid-1.2.1-9.el9_5 and found no evidence of patches being applied. It appears to be identical to the source included in 9.5. The sources for kmod-aacraid-1.2.1-11.1.el8_10 and el9_5 seem to have many inherent differences.
#define AAC_DRIVER_BUILD 50877 <= el8
#define AAC_DRIVER_BUILD 50983 <= el9
---
My system specs are:
---
OS - Alma Linux 9.5
Kernel - 5.14.0-503.23.2.el9_5.x86_64 (believed to be ongoing in all versions after 5.14.0-362.8.1.el9_3)
Harddisk - HDD*8 + SSD*4 installed with Hardware RAID Adaptec Series 7 - ASR-71605
---
# lspci -knn
12:00.0 RAID bus controller [0104]: Adaptec Series 7 6G SAS/PCIe 3 [9005:028c] (rev 01)
Subsystem: Adaptec Series 7 - ASR-71605 - 16 internal 6G SAS Port/PCIe 3.0 [9005:0501]
Kernel driver in use: aacraid
Kernel modules: aacraid
---
The dmesg output during the failure varies slightly each time, but it generally looks like this:
kernel: aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,0,0,0):
kernel: aacraid: Host bus reset request. SCSI hang?
kernel: aacraid 0000:12:00.0: outstanding cmd: midlevel-0
kernel: aacraid 0000:12:00.0: outstanding cmd: lowlevel-0
kernel: aacraid 0000:12:00.0: outstanding cmd: error handler-0
kernel: aacraid 0000:12:00.0: outstanding cmd: firmware-1
kernel: aacraid 0000:12:00.0: outstanding cmd: kernel-0
kernel: aacraid 0000:12:00.0: Controller reset type is 3
kernel: aacraid 0000:12:00.0: Issuing IOP reset
kernel: aacraid 0000:12:00.0: IOP reset succeeded
kernel: aacraid: Comm Interface type2 enabled
kernel: aacraid 0000:12:00.0: Scheduling bus rescan
systemd[1]: Started Session 30895 of User root.
systemd[1]: session-30895.scope: Deactivated successfully.
kernel: sd 0:0:1:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
kernel: sd 0:0:3:0: [sdd] Very big device. Trying to use READ CAPACITY(16).
systemd[1]: Started Session 30896 of User root.
systemd[1]: session-30896.scope: Deactivated successfully.
kernel: aac_read: aac_fib_send failed with status: -12.
kernel: aac_read: aac_fib_send failed with status: -12.
| ||||
Tags | No tags attached. | ||||
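A note on reading the last two dmesg lines in the report: kernel functions conventionally return negated errno values, and on Linux errno 12 is ENOMEM, so `status: -12` reads as an out-of-memory style failure reported by `aac_fib_send`. The sketch below is a small user-space helper for decoding such values; the function name `kernel_status_text` is hypothetical and is not part of the driver.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Assumption: Linux errno numbering, where ENOMEM is 12. */
static_assert(ENOMEM == 12, "Linux errno numbering assumed");

/*
 * Hypothetical helper: translate a kernel-style negative status
 * (for example -12 from the log) into the matching errno text.
 * Returns NULL for non-negative values, which are not errors.
 */
static const char *kernel_status_text(int status)
{
	if (status >= 0)
		return NULL;
	return strerror(-status);
}
```

Calling `kernel_status_text(-12)` yields the C library's message for ENOMEM. This only decodes the number; it does not say why the driver could not obtain the resource.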
|
I apologize for the mistake. I registered it as channel: elrepo/el8 instead of channel: elrepo/el9. Based on the original report, it seems that the same issue persists with el8. If it's inappropriate, I will re-register it as channel: elrepo/el9, but if it's acceptable, please let it remain as is. |
|
No problem. Moved to elrepo/el9. |
|
Hi, toracat. Thank you very much. One more thing, I would like to correct the initial description as follows (kernel version notation error. I apologize for the repeated mistakes) ---- In the exchanges after January 2, 2025, in this thread, I wanted to test the patch provided by Sagar. I attempted to port it to the AlmaLinux9 Kernel 5.14 environment based on the SRPM of kmod-aacraid. However, Kernel 5.14 does not support blk_mq_map_hw_queues, and although I considered alternative methods, I couldn't resolve it with my skills and knowledge (GPT suggested a software-based solution, but I couldn't decide whether to adopt it). |
|
Correction done. |
|
@ogumemura I have reviewed the latest kmod-aacraid package for el9 (kmod-aacraid-1.2.1-9.el9_5.elrepo.x86_64.rpm). It does not contain the code that was reverted/removed in bug 0001444 for the el8 variant, so there is nothing to patch in the el9 driver. This suggests that the issues you are experiencing are not wholly caused by the code that the el8 patch reverted. If you can provide or point me towards a patch you believe may address the issues you are experiencing, I will be happy to see if I can backport / apply it to our code base. |
|
@pperry Thank you for your response. (Thank you for the comment correction > toracat) First, we did not verify how applying—and later removing—the patch on EL8 actually affected the production environment. Furthermore, after reviewing the EL8-specific process provided by toracat and checking the EL9 elrepo package as well as the standard AlmaLinux9 packages, we confirmed that, as you mentioned, the EL9 packages do not show the kinds of modifications seen in the EL8 package. There were also no differences compared to the standard AlmaLinux9 drivers. Since this investigation occurred after the patch removal timing for Debian/Upstream Kernel (around late 2023 to early 2024?) and after the adjustments made in ELrepo8, we were unable to determine whether the absence of changes was due to there never having been any modifications or because countermeasures had already been implemented upstream in RHEL9 or AlmaLinux9. This uncertainty is why the initial post was phrased in that manner. Reference: [Bugzilla Comment #55]( https://bugzilla.kernel.org/show_bug.cgi?id=217599#c55 ) I believed that the withdrawn patch was related to the issue I reported—or one very similar to it—namely, the handling for "scsi: aacraid: Reply queue mapping to CPUs based on IRQ affinity." When I observed that this patch was eventually withdrawn because it triggered another problem, some doubts remained about how the original issue was being handled. However, since no further incident reports were received from others for a while, I thought that perhaps it was a different issue occurring only in my environment. That assumption was later questioned when I noticed new reports and developments, which prompted me to reach out for further assistance. For details on the new report and the follow-up actions, please refer to the conversation from comment 64 onward. 
Reference: [Bugzilla Comment #64 and later]( https://bugzilla.kernel.org/show_bug.cgi?id=217599#c64 ) At this point, I attempted to port the following patch to the EL9 environment: https://marc.info/?l=linux-scsi&m=173825819000502&w=2 However, as noted above—and considering that I have no prior experience in handling kernel driver code—my knowledge and skillset were insufficient to complete the adaptation; specifically, I encountered difficulty in handling the code in the static void aac_map_queues function. So far, no feedback has been received regarding the effectiveness of the patch provided by Sagar Biradar. If it is deemed inappropriate to proceed before the patch is officially merged, we may need to wait a bit longer. (I do have a spare 71605 available, so theoretically, a test environment could be set up if a reliable method to reproduce the issue were identified. However, because any test using Kernel 6.4 and this patch would differ from the conditions of the current problem environment, my objectives would not be fully met. Additionally, the time it takes for the issue to manifest is a hurdle—often occurring 10 to 14 days after boot in my EL9 real-world environment, for reasons still unknown. This is why I have not conducted tests with Kernel 6.4 and the provided patch) |
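For readers following the `aac_map_queues` difficulty described above: the job of a queue-mapping helper is to fill a per-CPU table so that each CPU submits I/O to the hardware queue whose interrupt vector is affine to that CPU. The sketch below is a user-space model of that idea only, not kernel code. The struct and function names (`model_queue_map`, `model_map_queues`) and the 64-CPU limit are hypothetical. Upstream kernels that predate `blk_mq_map_hw_queues` provided a PCI-specific helper, `blk_mq_pci_map_queues`, for this purpose; whether the el9 5.14 kernel exposes it with a signature that fits this patch is an assumption that would need checking against the installed kernel headers.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_CPUS 64 /* hypothetical limit for this model */

/* Hypothetical user-space stand-in for the kernel's queue map. */
struct model_queue_map {
	unsigned int mq_map[MODEL_MAX_CPUS]; /* CPU index -> hardware queue */
	unsigned int nr_queues;
};

/*
 * affinity[q] is a bitmask of the CPUs served by the interrupt vector
 * of queue q. Every CPU in that mask is pointed at queue q.
 * Returns 0 on success, or -1 if a queue has no affinity mask, in
 * which case a caller would need some fallback mapping.
 */
static int model_map_queues(struct model_queue_map *qmap,
			    const uint64_t *affinity, unsigned int nr_cpus)
{
	unsigned int q, cpu;

	assert(nr_cpus <= MODEL_MAX_CPUS);

	for (q = 0; q < qmap->nr_queues; q++) {
		if (affinity[q] == 0)
			return -1;
		for (cpu = 0; cpu < nr_cpus; cpu++) {
			if (affinity[q] & (1ULL << cpu))
				qmap->mq_map[cpu] = q;
		}
	}
	return 0;
}
```

With two queues whose masks are 0x3 and 0xC on a four-CPU system, CPUs 0 and 1 map to queue 0 and CPUs 2 and 3 map to queue 1. The patch's `qmap->mq_map[raw_smp_processor_id()]` lookup reads exactly this kind of table.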
|
Since I lack the necessary knowledge, I created this patch through trial and error with GPT. There may therefore be inaccuracies in the changes, and I'm not sure whether it will be helpful. Just in case, I'm attaching the patch I made. It is a modified version of https://marc.info/?l=linux-scsi&m=173825819000502&w=2, but it does not resolve the issue with aac_map_queues(), which remains unchanged. aacraid-ml-20250130-elrepo.el9.patch (5,173 bytes)
diff -u aacraid-1.2.1/aachba.c aacraid-1.2.1-patched/aachba.c
--- aacraid-1.2.1/aachba.c 2024-09-30 22:55:22.000000000 +0900
+++ aacraid-1.2.1-patched/aachba.c 2025-02-23 22:57:22.879818201 +0900
@@ -328,6 +328,12 @@
 	"\t1 - Array Meta Data Signature (default)\n"
 	"\t2 - Adapter Serial Number");
 
+int aac_cpu_offline_feature;
+module_param_named(aac_cpu_offline_feature, aac_cpu_offline_feature, int, 0644);
+MODULE_PARM_DESC(aac_cpu_offline_feature,
+	"This enables CPU offline feature and may result in IO performance drop in some cases:\n"
+	"\t0 - Disable (default)\n"
+	"\t1 - Enable");
 
 static inline int aac_valid_context(struct scsi_cmnd *scsicmd,
 		struct fib *fibptr)
 {
diff -u aacraid-1.2.1/aacraid.h aacraid-1.2.1-patched/aacraid.h
--- aacraid-1.2.1/aacraid.h 2024-09-30 22:55:22.000000000 +0900
+++ aacraid-1.2.1-patched/aacraid.h 2025-02-23 22:57:22.879818201 +0900
@@ -1677,6 +1677,7 @@
 	u32 handle_pci_error;
 	bool init_reset;
 	u8 soft_reset_support;
+	u8 use_map_queue;
 };
 
 #define aac_adapter_interrupt(dev) \
@@ -2769,4 +2770,5 @@
 extern int check_interval;
 extern int aac_check_reset;
 extern int aac_fib_dump;
+extern int aac_cpu_offline_feature;
 #endif
diff -u aacraid-1.2.1/commsup.c aacraid-1.2.1-patched/commsup.c
--- aacraid-1.2.1/commsup.c 2024-09-30 22:55:22.000000000 +0900
+++ aacraid-1.2.1-patched/commsup.c 2025-02-23 22:58:57.000000000 +0900
@@ -223,8 +223,14 @@
 struct fib *aac_fib_alloc_tag(struct aac_dev *dev, struct scsi_cmnd *scmd)
 {
 	struct fib *fibptr;
-
-	fibptr = &dev->fibs[scmd->request->tag];
+	u32 blk_tag;
+	int i;
+	if (aac_cpu_offline_feature == 1) {
+		blk_tag = blk_mq_unique_tag(scmd->request);
+		i = blk_mq_unique_tag_to_tag(blk_tag);
+		fibptr = &dev->fibs[i];
+	} else
+		fibptr = &dev->fibs[scmd->request->tag];
 	/*
 	 * Null out fields that depend on being zero at the start of
 	 * each I/O
diff -u aacraid-1.2.1/linit.c aacraid-1.2.1-patched/linit.c
--- aacraid-1.2.1/linit.c 2024-09-30 22:55:22.000000000 +0900
+++ aacraid-1.2.1-patched/linit.c 2025-02-23 23:01:04.000000000 +0900
@@ -507,6 +507,23 @@
 }
 
 /**
+ * aac_map_queues - Map hardware queues for the SCSI host
+ * @shost: SCSI host structure
+ *
+ * Maps the default hardware queue for the given SCSI host to the
+ * corresponding PCI device and enables mapped queue usage.
+ */
+
+static void aac_map_queues(struct Scsi_Host *shost)
+{
+	struct aac_dev *aac = (struct aac_dev *)shost->hostdata;
+
+	blk_mq_map_hw_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+			     &aac->pdev->dev, 0);
+	aac->use_map_queue = true;
+}
+
+/**
  * aac_change_queue_depth - alter queue depths
  * @sdev: SCSI device we are considering
  * @depth: desired queue depth
@@ -1485,6 +1502,7 @@
 	.bios_param = aac_biosparm,
 	.shost_attrs = aac_attrs,
 	.slave_configure = aac_slave_configure,
+	.map_queues = aac_map_queues,
 	.change_queue_depth = aac_change_queue_depth,
 	.sdev_attrs = aac_dev_attrs,
 	.eh_abort_handler = aac_eh_abort,
@@ -1771,6 +1789,11 @@
 	shost->max_lun = AAC_MAX_LUN;
 
 	pci_set_drvdata(pdev, shost);
+	if (aac_cpu_offline_feature == 1) {
+		shost->nr_hw_queues = aac->max_msix;
+		shost->can_queue = aac->vector_cap;
+		shost->host_tagset = 1;
+	}
 
 	error = scsi_add_host(shost, &pdev->dev);
 	if (error)
@@ -1902,6 +1925,7 @@
 	struct aac_dev *aac = (struct aac_dev *)shost->hostdata;
 
 	aac_cancel_rescan_worker(aac);
+	aac->use_map_queue = false;
 	scsi_remove_host(shost);
 
 	__aac_shutdown(aac);
diff -u aacraid-1.2.1/src.c aacraid-1.2.1-patched/src.c
--- aacraid-1.2.1/src.c 2024-09-30 22:55:22.000000000 +0900
+++ aacraid-1.2.1-patched/src.c 2025-02-23 22:57:22.880818200 +0900
@@ -493,6 +493,10 @@
 #endif
 
 	u16 vector_no;
+	struct scsi_cmnd *scmd;
+	u32 blk_tag;
+	struct Scsi_Host *shost = dev->scsi_host_ptr;
+	struct blk_mq_queue_map *qmap;
 
 	atomic_inc(&q->numpending);
 
@@ -505,8 +509,28 @@
 		if ((dev->comm_interface == AAC_COMM_MESSAGE_TYPE3)
 			&& dev->sa_firmware)
 			vector_no = aac_get_vector(dev);
-		else
-			vector_no = fib->vector_no;
+		else {
+			if (aac_cpu_offline_feature == 1) {
+				if (!fib->vector_no || !fib->callback_data) {
+					if (shost && dev->use_map_queue) {
+						qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+						vector_no = qmap->mq_map[raw_smp_processor_id()];
+					}
+					/*
+					 * We hardcode the vector_no for
+					 * reserved commands as a valid shost is
+					 * absent during the init
+					 */
+					else
+						vector_no = 0;
+				} else {
+					scmd = (struct scsi_cmnd *)fib->callback_data;
+					blk_tag = blk_mq_unique_tag(scmd->request);
+					vector_no = blk_mq_unique_tag_to_hwq(blk_tag);
+				}
+			} else
+				vector_no = fib->vector_no;
+		}
 
 		if (native_hba) {
 			if (fib->flags & FIB_CONTEXT_FLAG_NATIVE_HBA_TMF) {
diff -u aacraid-1.2.1/src.c aacraid-1.2.1-patched/src.c
--- aacraid-1.2.1/src.c 2025-02-26 12:10:33.820406083 +0900
+++ aacraid-1.2.1-patched/src.c 2025-02-26 12:14:39.000000000 +0900
@@ -28,6 +28,7 @@
 #include <linux/time.h>
 #include <linux/interrupt.h>
 #include <scsi/scsi_host.h>
+#include <scsi/scsi_cmnd.h>
 
 #include "aacraid.h"
 
|
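The patch above relies on `blk_mq_unique_tag`, `blk_mq_unique_tag_to_hwq` and `blk_mq_unique_tag_to_tag`. As background, the block layer packs the hardware queue number into the upper 16 bits of the unique tag and the per-queue tag into the lower 16 bits. The sketch below is a user-space model of that encoding so the arithmetic can be checked outside the kernel; the `model_*` names are hypothetical, and the 16-bit split mirrors `BLK_MQ_UNIQUE_TAG_BITS` in `<linux/blk-mq.h>`.

```c
#include <assert.h>
#include <stdint.h>

/* Model of BLK_MQ_UNIQUE_TAG_BITS and BLK_MQ_UNIQUE_TAG_MASK. */
#define MODEL_UNIQUE_TAG_BITS 16
#define MODEL_UNIQUE_TAG_MASK ((1u << MODEL_UNIQUE_TAG_BITS) - 1)

/* Pack a hardware queue number and a tag into one 32-bit value. */
static uint32_t model_unique_tag(uint16_t hwq, uint16_t tag)
{
	return ((uint32_t)hwq << MODEL_UNIQUE_TAG_BITS) |
	       (tag & MODEL_UNIQUE_TAG_MASK);
}

/* Upper half: the hardware queue, used in the patch as vector_no. */
static uint16_t model_unique_tag_to_hwq(uint32_t unique_tag)
{
	return (unique_tag >> MODEL_UNIQUE_TAG_BITS) & MODEL_UNIQUE_TAG_MASK;
}

/* Lower half: the tag, used in the patch as the index into dev->fibs. */
static uint16_t model_unique_tag_to_tag(uint32_t unique_tag)
{
	return unique_tag & MODEL_UNIQUE_TAG_MASK;
}
```

For hardware queue 3 and tag 42 the packed value is 0x3002A, and the two decoders return 3 and 42 again. In the patch, the decoded tag selects the fib in `aac_fib_alloc_tag` and the decoded queue number selects the reply vector in `src.c`.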
Date Modified | Username | Field | Change |
---|---|---|---|
2025-02-24 02:34 | ogumemura | New Issue | |
2025-02-24 02:34 | ogumemura | Status | new => assigned |
2025-02-24 02:34 | ogumemura | Assigned To | => pperry |
2025-02-24 09:22 | ogumemura | Note Added: 0010322 | |
2025-02-24 13:09 | toracat | Project | channel: elrepo/el8 => channel: elrepo/el9 |
2025-02-24 13:10 | toracat | Note Added: 0010326 | |
2025-02-25 07:10 | ogumemura | Note Added: 0010334 | |
2025-02-25 11:40 | toracat | Description Updated | |
2025-02-25 11:41 | toracat | Note Added: 0010335 | |
2025-02-25 13:14 | pperry | Note Added: 0010336 | |
2025-02-25 13:15 | pperry | Status | assigned => feedback |
2025-02-25 21:31 | ogumemura | Note Added: 0010337 | |
2025-02-25 21:31 | ogumemura | Status | feedback => assigned |
2025-02-25 22:23 | ogumemura | Note Added: 0010338 | |
2025-02-25 22:23 | ogumemura | File Added: aacraid-ml-20250130-elrepo.el9.patch |