View Issue Details
| Field | Value |
|---|---|
| ID | 0001488 |
| Project | channel: elrepo/el9 |
| Category | drbd9x-utils |
| View Status | public |
| Date Submitted | 2024-10-28 15:06 |
| Last Update | 2025-01-27 14:23 |
| Reporter | anenni |
| Assigned To | toracat |
| Priority | normal |
| Severity | major |
| Reproducibility | always |
| Status | feedback |
| Resolution | open |
| Platform | Linux |
| OS | rhel |
| OS Version | 9.4 |
Summary: 0001488: pacemaker-schedulerd warning: Unexpected result (error: Resource agent did not complete within 1m40s)
Description: When restarting the DRBD standby node, the pacemaker stop procedure always fails, making the node unusable for a long time, with this message appearing multiple times in the logs: "pacemaker-schedulerd warning: Unexpected result (error: Resource agent did not complete within 1m40s)". After many minutes the node finally reboots, unless of course you configure fencing that kills it, but that is not acceptable as a routine measure.
Steps To Reproduce: Just reboot the DRBD unpromoted (secondary) node. By contrast, stopping pacemaker first and then rebooting the node works flawlessly.
Additional Information: Using drbd9x-utils-9.28.0-1.el9.elrepo.x86_64; tried kmod-drbd9x-9.1.22, .21 and .20. SELinux is permissive. The resource config is as per LINBIT's latest RHEL 9 / DRBD 9 docs:

```
Clone: users_drbd-clone
  Meta Attributes: users_drbd-clone-meta_attributes
    clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1
  Resource: users_drbd (class=ocf provider=linbit type=drbd)
    Attributes: users_drbd-instance_attributes
      drbd_resource=users
    Operations:
      demote: users_drbd-demote-interval-0s interval=0s timeout=90
      monitor: users_drbd-monitor-interval-29s interval=29s timeout=20s role=Promoted
      monitor: users_drbd-monitor-interval-31s interval=31s timeout=20s role=Unpromoted
      notify: users_drbd-notify-interval-0s interval=0s timeout=90
      promote: users_drbd-promote-interval-0s interval=0s timeout=90
      reload: users_drbd-reload-interval-0s interval=0s timeout=30
      start: users_drbd-start-interval-0s interval=0s timeout=240
      stop: users_drbd-stop-interval-0s interval=0s timeout=100
```
Tags: No tags attached.
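For readers wanting to reproduce a similar setup: a promotable DRBD clone with the attributes listed above would typically be created with pcs along these lines. This is a minimal sketch assuming the pcs CLI; it is not a command taken from this report, and the resource name, drbd_resource and timeouts simply mirror the attributes shown in Additional Information.

```sh
# Sketch only: one way to create a promotable DRBD clone roughly matching
# the attributes listed above (pcs auto-names the clone users_drbd-clone).
pcs resource create users_drbd ocf:linbit:drbd drbd_resource=users \
    op monitor interval=29s timeout=20s role=Promoted \
    op monitor interval=31s timeout=20s role=Unpromoted \
    op start timeout=240 op stop timeout=100 \
    promotable meta promoted-max=1 promoted-node-max=1 \
        clone-max=2 clone-node-max=1 notify=true
```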
|
These are the logs from the system console while trying to reboot (see the attached drbdreboot.jpg).
|
We have a very similar setup on a RHEL 8.9 system, with slightly older versions, that has no problem at all: drbd90-utils-9.27.0-1.el8.elrepo.x86_64 and kmod-drbd90-9.1.19-1.el8_9.elrepo.x86_64.
|
Acknowledged.
|
In fact, after downgrading to this almost identical combination everything started to work as expected: kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64 and drbd9x-utils-9.27.0-1.el9.elrepo.x86_64.
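For reference, that downgrade can be done with dnf roughly like this (a sketch; it assumes the older packages are still published in the enabled elrepo repository):

```sh
# Sketch: roll back to the working combination mentioned above.
dnf downgrade drbd9x-utils-9.27.0-1.el9.elrepo \
              kmod-drbd9x-9.1.19-2.el9_4.elrepo
```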
|
A lot has changed in the latest utils: https://github.com/LINBIT/drbd-utils/blob/master/ChangeLog

```
9.28.0
-----------
 * events2: set may_promote:no promotion_score:0 while force-io-failure:yes
 * drbdsetup,v9: show TLS in connection status
 * drbdsetup,v9: add udev command
 * 8.3: remove
 * crm-fence-peer.9.sh: fixes for pacemaker 2.1.7
 * events2: improved out of order message handling
```
|
I ended up picking a kmod-drbd9x version older than drbd9x-utils-9.28.0-1.el9.elrepo.x86_64, just to be on the safe side with a somewhat common combination:

```
drbd9x-utils-9.27.0-1.el9.elrepo.x86_64.rpm   2023-12-23 12:41  1.0M
drbd9x-utils-9.28.0-1.el9.elrepo.x86_64.rpm   2024-05-11 20:36  886K
kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64.rpm  2024-05-01 17:03  400K
kmod-drbd9x-9.1.20-1.el9_4.elrepo.x86_64.rpm  2024-05-13 18:59  402K
kmod-drbd9x-9.1.21-1.el9_4.elrepo.x86_64.rpm  2024-06-08 18:33  402K
kmod-drbd9x-9.1.22-1.el9_4.elrepo.x86_64.rpm  2024-08-12 18:48  403K
```
|
It seems a lot has changed in 9.28.0: https://lists.linbit.com/pipermail/drbd-announce/2024-May/000728.html ("In contrast to the recent releases this one contains a bit more exciting news:")
|
Thank you. I was about to suggest trying drbd9x-utils-9.27.0. By the way, 9.29.0 is on its way. The changelog is:

```
9.29.0-rc.1
-----------
 * drbdmeta: fix initialization for external md
 * build: allow disabling keyutils
 * tests: export sanitized environment
 * drbdmon: various improvements
 * build: add cyclonedx
 * drbsetup,v9: fix multiple paths drbdsetup show --json
   strictly speaking breaking change, but maily used internally
 * events2: expose if device is open
 * drbdadm: fix undefined behavior that triggered on amd64
 * shared: fix out-of-bounds access in parsing
 * drbsetup,v9: event consistency with peer devices
 * drbdadm: fix parsing of v8.4 configs for compatibility
 * drbdmeta: fix segfault for check-resize on intentionally diskless
 * drbd-promote@.service: check if ExecCondition is available
```
|
Thank you, but I have to go to production, so if this proves stable I'll stop here for a while. It could also be something RHEL 9 specific. Regards.
|
Understood. Best to stay with what is proven to work.
|
I'll leave further tests for the next cluster. In the meantime, maybe it's better to retire this version.
|
Also, RHEL 9.5 should be out soon, so some testing will be necessary.
|
drbd9x-utils-9.29.0-1.el9.elrepo.x86_64.rpm is out.
|
Hello. Well, I tried drbd9x-utils-9.29.0-1.el9.elrepo.x86_64, also updating to RHEL 9.5 and kmod-drbd9x-9.1.23-1.el9_5.elrepo.x86_64, but I still see the system being unable to shut down cleanly:

```
Node List:
  * Node plutone2.cnmca.meteoam.it: UNCLEAN (online)
  * Online: [ plutone1.cnmca.meteoam.it ]

Active Resources:
  * Clone Set: users_drbd-clone [users_drbd] (promotable):
    * users_drbd    (ocf:linbit:drbd):           FAILED plutone2.cnmca.meteoam.it
    * Promoted: [ plutone1.cnmca.meteoam.it ]
  * Resource Group: users_group (3 members inactive):
    * users_fs      (ocf:heartbeat:Filesystem):  Started plutone1.cnmca.meteoam.it

Failed Resource Actions:
  * users_drbd stop on plutone2.cnmca.meteoam.it could not be executed (Timed Out: Resource agent did not complete within 1m40s) at Wed Jan 22 19:17:49 2025 after 1m40.002s
```
|
Stop of the drbd resource seems to complete correctly, but pacemaker seems to think that it didn't complete:

```
Jan 22 19:17:49 plutone2 kernel: drbd users: Preparing cluster-wide state change 3527246773: 1->0 conn( Disconnecting )
Jan 22 19:17:49 plutone2 kernel: drbd users: State change 3527246773: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFE
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Cluster is now split
Jan 22 19:17:49 plutone2 kernel: drbd users: Committing cluster-wide state change 3527246773 (1ms)
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: conn( Connected -> Disconnecting ) peer( Primary -> Unknown ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( UpToDate -> Outdated ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1 plutone1: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: Enabling local AL-updates
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: ack_receiver terminated
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating ack_recv thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating sender thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Starting sender thread (peer-node-id 0)
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Requesting local execution of notify operation for users_drbd on plutone2.cnmca.meteoam.it
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Connection closed
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: helper command: /sbin/drbdadm disconnected
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: helper command: /sbin/drbdadm disconnected exit code 0
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: conn( Disconnecting -> StandAlone ) [disconnected]
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating receiver thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating sender thread
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Result of notify operation for users_drbd on plutone2.cnmca.meteoam.it: ok
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Requesting local execution of stop operation for users_drbd on plutone2.cnmca.meteoam.it
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( Outdated -> Detaching ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( Detaching -> Diskless ) [go-diskless]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: drbd_bm_resize called with capacity == 0
Jan 22 19:17:49 plutone2 kernel: drbd /unregistered/users: Terminating worker thread
Jan 22 19:17:49 plutone2 systemd[1]: drbd-graceful-shutdown.service: Deactivated successfully.
Jan 22 19:19:29 plutone2 pacemaker-controld[1873]: error: Result of stop operation for users_drbd on plutone2.cnmca.meteoam.it: Timed Out after 1m40s (Resource agent did not complete within 1m40s)
Jan 22 19:19:29 plutone2 pacemaker-attrd[1871]: notice: Setting last-failure-users_drbd#stop_0[plutone2.cnmca.meteoam.it] in instance_attributes: (unset) -> 1737573569
Jan 22 19:19:29 plutone2 pacemaker-attrd[1871]: notice: Setting fail-count-users_drbd#stop_0[plutone2.cnmca.meteoam.it] in instance_attributes: (unset) -> INFINITY
```
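For completeness, the last-failure/fail-count attributes recorded above can be cleared once the node is back, for example with a resource cleanup (a sketch assuming the pcs CLI; this only clears the failure record, it does not fix the underlying timeout):

```sh
# Sketch: clear the recorded stop failure for the DRBD clone.
pcs resource cleanup users_drbd
```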
|
drbd-utils v9.30.0-rc.1 was announced a few days ago. As soon as the GA version is out, we will publish it. Changelog:

```
9.30.0-rc.1
-----------
 * drbdmon: various improvements
 * drbdadm: preparations for drbd-proxy v4
 * ocf: explicitly timeout crm_master IPC early
 * ocf: account for "recent" name change to "Promoted"
 * windrbd: various improvements
 * doc: switch to drbd9 manpages by default
 * init: reverse dependency for drbd-graceful-shutdown.service
   drbd@.service now has an After dependency for it.
```
|
It seems quite strange that such a fundamental issue went uncaught for so many versions, and I don't see it explicitly reported anywhere.
|
Is there any way to report this to LINBIT? I'll try to report the issue to Red Hat as well, even though it's not a supported product. Regards.
|
I strongly suspect that this is a Red Hat only problem. LINBIT only mentions AlmaLinux in its 9.x docs, and the cluster start and stop procedures are quite different: Alma autostarts the corosync and pacemaker services via systemd, while Red Hat uses pcsd. That could be why stopping pacemaker by hand before rebooting works.
|
The issue seems to be that with a standard shutdown systemd kicks in before pacemaker:

```
Jan 22 19:17:49 plutone2 systemd[1]: drbd-graceful-shutdown.service: Deactivated successfully.
Jan 22 19:19:29 plutone2 pacemaker-controld[1873]: error: Result of stop operation for users_drbd on plutone2.cnmca.meteoam.it: Timed Out after 1m40s (Resource agent did not complete within 1m40s)
```

But drbd is not under systemd in any way:

```
[root@NEWplutone2 ~]# systemctl status drbd.service
○ drbd.service - DRBD -- please disable. Unless you are NOT using a cluster manager.
     Loaded: loaded (/usr/lib/systemd/system/drbd.service; disabled; preset: disabled)
     Active: inactive (dead)
```
|
Aha... drbd-graceful-shutdown.service: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-drbd-graceful-shutdown

5.7. Shutting Down Gracefully By Using a systemd Service
"Included with drbd-utils versions since 9.26.0, there is a “graceful shutdown” service, drbd-graceful-shutdown.service. This service ensures that the DRBD “last man standing” behavior applies to your cluster when shutting down nodes."
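To see what ordering that unit actually declares on an affected system, something like this can be used (a sketch using only standard systemctl commands):

```sh
# Sketch: inspect the shipped unit and its declared ordering; in the affected
# versions there is no Before= ordering against pacemaker or the resource agents.
systemctl cat drbd-graceful-shutdown.service
systemctl show -p Before -p After drbd-graceful-shutdown.service
```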
|
Someone should tell LINBIT that this new silent gift doesn't work with pacemaker and causes quite some trouble. There's no mention of it at all in the admin docs.
|
And it's not even possible to disable it cleanly:

```
[root@NEWplutone2 ~]# systemctl disable drbd-graceful-shutdown.service
The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled or disabled using systemctl.

Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.
```
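As a blunt local workaround (not suggested in this thread), the unit can still be masked even though it cannot be disabled; note that this switches off the graceful-shutdown / last-man-standing behavior entirely:

```sh
# Sketch: masking works even for units without an [Install] section,
# but it disables the DRBD graceful-shutdown behavior altogether.
systemctl mask drbd-graceful-shutdown.service
```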
|
drbd9x-utils-9.30.0-1.el9.elrepo.x86_64.rpm has just been released.
|
The bug is still there; drbd9x-utils-9.30.0-1.el9.elrepo doesn't resolve this. See: https://lists.linbit.com/pipermail/drbd-user/2025-January/026693.html
|
You can very simply patch drbd-graceful-shutdown.service following their indications.
|
I saw this post on the drbd mailing list: https://lists.linbit.com/pipermail/drbd-user/2025-January/026692.html Looks like the fix is coming?
|
Yes, it's the original post by Red Hat following my bug report. But it will probably take months, since in the reply LINBIT says it's now queued together with all the other systemd things.
|
And LINBIT's suggestion is only a single line to add: "So apparently we forgot to add a "Before=pacemaker.service" or even better "Before=resource-agents-deps.target" to the graceful shutdown service."
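Until a fixed package is available, that one line can be added locally as a systemd drop-in instead of editing the packaged unit file, so it survives package updates (a sketch using the Before= target LINBIT suggested above):

```sh
# Sketch: add the suggested ordering via a drop-in rather than patching the unit file.
mkdir -p /etc/systemd/system/drbd-graceful-shutdown.service.d
cat > /etc/systemd/system/drbd-graceful-shutdown.service.d/10-ordering.conf <<'EOF'
[Unit]
Before=resource-agents-deps.target
EOF
systemctl daemon-reload
```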
|
OK, we will build and release a patched version to the testing repo soon, as an interim solution.
|
The following package has just been released to the elrepo-testing repository and is currently syncing to mirrors: drbd9x-utils-9.30.0-2.el9.elrepo.x86_64.rpm

This is the patch applied:

```
$ cat elrepo-drbd-graceful-shutdown.patch
--- a/scripts/drbd-graceful-shutdown.service.in	2025-01-23 12:26:45.605658196 -0800
+++ b/scripts/drbd-graceful-shutdown.service.in	2025-01-27 10:30:29.268233167 -0800
@@ -8,6 +8,7 @@ Description=ensure all DRBD resources sh
 Wants=network-online.target
 After=network-online.target
 Before=drbd.service
+Before=resource-agents-deps.target
 
 [Service]
 Type=oneshot
```
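After installing the patched package (or adding an equivalent drop-in), the new ordering can be checked with something like:

```sh
# Sketch: resource-agents-deps.target should now appear in the unit's Before= list.
systemctl show -p Before drbd-graceful-shutdown.service
```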
Date Modified | Username | Field | Change |
---|---|---|---|
2024-10-28 15:06 | anenni | New Issue | |
2024-10-28 15:06 | anenni | Status | new => assigned |
2024-10-28 15:06 | anenni | Assigned To | => toracat |
2024-10-28 15:10 | anenni | Note Added: 0010165 | |
2024-10-28 15:10 | anenni | File Added: drbdreboot.jpg | |
2024-10-28 15:13 | anenni | Note Added: 0010166 | |
2024-10-28 15:19 | toracat | Status | assigned => acknowledged |
2024-10-28 15:19 | toracat | Note Added: 0010167 | |
2024-10-28 15:34 | anenni | Note Added: 0010168 | |
2024-10-28 15:39 | anenni | Note Added: 0010169 | |
2024-10-28 15:56 | anenni | Note Added: 0010170 | |
2024-10-28 16:09 | anenni | Note Added: 0010171 | |
2024-10-28 16:24 | toracat | Note Added: 0010172 | |
2024-10-28 16:32 | anenni | Note Added: 0010173 | |
2024-10-28 17:56 | toracat | Note Added: 0010174 | |
2024-10-29 13:50 | anenni | Note Added: 0010176 | |
2024-10-29 14:37 | anenni | Note Added: 0010177 | |
2024-10-31 10:28 | toracat | Note Added: 0010178 | |
2025-01-22 14:31 | anenni | Note Added: 0010265 | |
2025-01-22 14:35 | anenni | Note Added: 0010266 | |
2025-01-22 14:50 | toracat | Note Added: 0010267 | |
2025-01-22 14:59 | anenni | Note Added: 0010268 | |
2025-01-22 15:46 | anenni | Note Added: 0010269 | |
2025-01-23 12:24 | anenni | Note Added: 0010270 | |
2025-01-23 15:18 | anenni | Note Added: 0010271 | |
2025-01-23 15:26 | anenni | Note Added: 0010272 | |
2025-01-23 15:38 | anenni | Note Added: 0010273 | |
2025-01-23 15:46 | anenni | Note Added: 0010274 | |
2025-01-23 19:43 | toracat | Note Added: 0010275 | |
2025-01-27 08:54 | anenni | Note Added: 0010283 | |
2025-01-27 09:59 | anenni | Note Added: 0010284 | |
2025-01-27 12:33 | toracat | Note Added: 0010285 | |
2025-01-27 12:38 | anenni | Note Added: 0010286 | |
2025-01-27 12:40 | anenni | Note Added: 0010287 | |
2025-01-27 13:44 | toracat | Note Added: 0010288 | |
2025-01-27 14:23 | toracat | Note Added: 0010289 | |
2025-01-27 14:23 | toracat | Status | acknowledged => feedback |