View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001488 | channel: elrepo/el9 | drbd9x-utils | public | 2024-10-28 15:06 | 2024-10-31 10:28 |
Reporter | anenni | Assigned To | toracat | ||
Priority | normal | Severity | major | Reproducibility | always |
Status | acknowledged | Resolution | open | ||
Platform | Linux | OS | rhel | OS Version | 9.4 |
Summary | 0001488: pacemaker-schedulerd warning: Unexpected result (error: Resource agent did not complete within 1m40s) | ||||
Description | When restarting the DRBD standby node, the Pacemaker stop procedure always fails, leaving the node unusable for a long time, with this message repeated in the logs: pacemaker-schedulerd warning: Unexpected result (error: Resource agent did not complete within 1m40s). After many minutes it finally reboots, unless of course you configure fencing that kills the node, but that is not acceptable as a routine measure. | ||||
Steps To Reproduce | Just reboot the DRBD unpromoted (slave) server. By contrast, stopping Pacemaker first and then rebooting the node works flawlessly. | ||||
Additional Information | Using drbd9x-utils-9.28.0-1.el9.elrepo.x86_64; tried kmod-drbd9x-9.1.22, 9.1.21 and 9.1.20. SELinux is permissive. The resource config follows the latest LINBIT RHEL 9/DRBD 9 docs: Clone: users_drbd-clone Meta Attributes: users_drbd-clone-meta_attributes clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1 Resource: users_drbd (class=ocf provider=linbit type=drbd) Attributes: users_drbd-instance_attributes drbd_resource=users Operations: demote: users_drbd-demote-interval-0s interval=0s timeout=90 monitor: users_drbd-monitor-interval-29s interval=29s timeout=20s role=Promoted monitor: users_drbd-monitor-interval-31s interval=31s timeout=20s role=Unpromoted notify: users_drbd-notify-interval-0s interval=0s timeout=90 promote: users_drbd-promote-interval-0s interval=0s timeout=90 reload: users_drbd-reload-interval-0s interval=0s timeout=30 start: users_drbd-start-interval-0s interval=0s timeout=240 stop: users_drbd-stop-interval-0s interval=0s timeout=100 | ||||
Tags | No tags attached. | ||||
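
The 1m40s in the warning appears to match the stop operation's timeout=100 from the resource config above, assuming a bare number is read as seconds. A minimal sketch of that arithmetic follows; `format_duration` is a hypothetical helper written for this note, not part of Pacemaker or pcs.

```python
def format_duration(seconds: int) -> str:
    """Render seconds in the compact XmYs style seen in the log message."""
    minutes, secs = divmod(seconds, 60)
    parts = []
    if minutes:
        parts.append(f"{minutes}m")
    # Always print seconds when they are non-zero or when nothing else was printed.
    if secs or not parts:
        parts.append(f"{secs}s")
    return "".join(parts)


# Operation timeouts copied from the resource config in this report,
# assumed here to be expressed in seconds.
timeouts = {
    "demote": 90,
    "notify": 90,
    "promote": 90,
    "reload": 30,
    "start": 240,
    "stop": 100,
}

for op, value in timeouts.items():
    print(f"{op}: {format_duration(value)}")
```

Only the stop operation maps to 1m40s, which suggests that it is the stop action of the users_drbd resource that times out during reboot.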
|
These are the logs from the system console while trying to reboot (attachment: drbdreboot.jpg). |
|
We have a very similar setup on a RHEL 8.9 system, with slightly older versions, that has no problems at all: drbd90-utils-9.27.0-1.el8.elrepo.x86_64, kmod-drbd90-9.1.19-1.el8_9.elrepo.x86_64 |
|
Acknowledged. |
|
In fact, after downgrading to this almost identical combination, everything started to work as expected: kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64, drbd9x-utils-9.27.0-1.el9.elrepo.x86_64 |
|
A lot has changed in the latest utils: https://github.com/LINBIT/drbd-utils/blob/master/ChangeLog 9.28.0 ----------- * events2: set may_promote:no promotion_score:0 while force-io-failure:yes * drbdsetup,v9: show TLS in connection status * drbdsetup,v9: add udev command * 8.3: remove * crm-fence-peer.9.sh: fixes for pacemaker 2.1.7 * events2: improved out of order message handling |
|
I ended up picking a kmod-drbd9x version older than drbd9x-utils-9.28.0-1.el9.elrepo.x86_64, just to be on the safe side with a fairly common combination. drbd9x-utils-9.27.0-1.el9.elrepo.x86_64.rpm 2023-12-23 12:41 1.0M drbd9x-utils-9.28.0-1.el9.elrepo.x86_64.rpm 2024-05-11 20:36 886K kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64.rpm 2024-05-01 17:03 400K kmod-drbd9x-9.1.20-1.el9_4.elrepo.x86_64.rpm 2024-05-13 18:59 402K kmod-drbd9x-9.1.21-1.el9_4.elrepo.x86_64.rpm 2024-06-08 18:33 402K kmod-drbd9x-9.1.22-1.el9_4.elrepo.x86_64.rpm 2024-08-12 18:48 403K |
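
The selection rule in the comment above (take only packages published before drbd9x-utils-9.28.0) can be sketched as follows. This is an illustration only: the file names and timestamps are copied from the listing, while `older_than` and `parse_timestamp` are hypothetical helpers, and comparing by listing date is an assumption about how the choice was made.

```python
from datetime import datetime

# (file name, modification time) pairs copied from the listing in the comment.
LISTING = [
    ("drbd9x-utils-9.27.0-1.el9.elrepo.x86_64.rpm", "2023-12-23 12:41"),
    ("drbd9x-utils-9.28.0-1.el9.elrepo.x86_64.rpm", "2024-05-11 20:36"),
    ("kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64.rpm", "2024-05-01 17:03"),
    ("kmod-drbd9x-9.1.20-1.el9_4.elrepo.x86_64.rpm", "2024-05-13 18:59"),
    ("kmod-drbd9x-9.1.21-1.el9_4.elrepo.x86_64.rpm", "2024-06-08 18:33"),
    ("kmod-drbd9x-9.1.22-1.el9_4.elrepo.x86_64.rpm", "2024-08-12 18:48"),
]


def parse_timestamp(text: str) -> datetime:
    """Parse the 'YYYY-MM-DD HH:MM' timestamps used in the listing."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M")


def older_than(listing, reference_name):
    """Return the file names published strictly before the reference file."""
    cutoff = parse_timestamp(dict(listing)[reference_name])
    return [name for name, stamp in listing if parse_timestamp(stamp) < cutoff]


older = older_than(LISTING, "drbd9x-utils-9.28.0-1.el9.elrepo.x86_64.rpm")
for name in older:
    print(name)
```

Under that assumption, the only candidates are drbd9x-utils-9.27.0-1 and kmod-drbd9x-9.1.19-2, which is the combination reported as working earlier in this thread.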
|
It seems a lot has changed in 9.28.0 .... https://lists.linbit.com/pipermail/drbd-announce/2024-May/000728.html "In contrast to the recent releases this one contains a bit more exciting news:" |
|
Thank you. I was about to suggest trying drbd9x-utils-9.27.0. By the way 9.29.0 is on its way. The changelog is: 9.29.0-rc.1 ----------- * drbdmeta: fix initialization for external md * build: allow disabling keyutils * tests: export sanitized environment * drbdmon: various improvements * build: add cyclonedx * drbsetup,v9: fix multiple paths drbdsetup show --json strictly speaking breaking change, but maily used internally * events2: expose if device is open * drbdadm: fix undefined behavior that triggered on amd64 * shared: fix out-of-bounds access in parsing * drbsetup,v9: event consistency with peer devices * drbdadm: fix parsing of v8.4 configs for compatibility * drbdmeta: fix segfault for check-resize on intentionally diskless * drbd-promote@.service: check if ExecCondition is available |
|
Thank you, but I have to go to production, so if it proves stable, I'll stop here for a while. It could also be something RHEL 9 specific. Regards. |
|
Understood. Best to stay with what is proven to work. |
|
I'll leave the tests for the next cluster. In the meantime, maybe it's better to retire this version. |
|
Also, RHEL 9.5 should be out soon, so some testing will be necessary. |
|
drbd9x-utils-9.29.0-1.el9.elrepo.x86_64.rpm is out. |
Date Modified | Username | Field | Change |
---|---|---|---|
2024-10-28 15:06 | anenni | New Issue | |
2024-10-28 15:06 | anenni | Status | new => assigned |
2024-10-28 15:06 | anenni | Assigned To | => toracat |
2024-10-28 15:10 | anenni | Note Added: 0010165 | |
2024-10-28 15:10 | anenni | File Added: drbdreboot.jpg | |
2024-10-28 15:13 | anenni | Note Added: 0010166 | |
2024-10-28 15:19 | toracat | Status | assigned => acknowledged |
2024-10-28 15:19 | toracat | Note Added: 0010167 | |
2024-10-28 15:34 | anenni | Note Added: 0010168 | |
2024-10-28 15:39 | anenni | Note Added: 0010169 | |
2024-10-28 15:56 | anenni | Note Added: 0010170 | |
2024-10-28 16:09 | anenni | Note Added: 0010171 | |
2024-10-28 16:24 | toracat | Note Added: 0010172 | |
2024-10-28 16:32 | anenni | Note Added: 0010173 | |
2024-10-28 17:56 | toracat | Note Added: 0010174 | |
2024-10-29 13:50 | anenni | Note Added: 0010176 | |
2024-10-29 14:37 | anenni | Note Added: 0010177 | |
2024-10-31 10:28 | toracat | Note Added: 0010178 |