View Issue Details
| Field | Value |
|---|---|
| ID | 0001488 |
| Project | channel: elrepo/el9 |
| Category | drbd9x-utils |
| View Status | public |
| Date Submitted | 2024-10-28 15:06 |
| Last Update | 2025-01-27 14:23 |
| Reporter | anenni |
| Assigned To | toracat |
| Priority | normal |
| Severity | major |
| Reproducibility | always |
| Status | feedback |
| Resolution | open |
| Platform | Linux |
| OS | rhel |
| OS Version | 9.4 |
Summary: 0001488: pacemaker-schedulerd warning: Unexpected result (error: Resource agent did not complete within 1m40s)
Description: When restarting the DRBD standby node, the pacemaker stop procedure always fails, making the node unusable for a long time, with this message appearing multiple times in the logs: "pacemaker-schedulerd warning: Unexpected result (error: Resource agent did not complete within 1m40s)". After many minutes the node finally reboots, unless of course you configure fencing that kills it, but that is not acceptable as a routine measure.
Steps To Reproduce: Just reboot the DRBD unpromoted (secondary) node. By contrast, stopping pacemaker first and then rebooting the node works flawlessly.
Additional Information: Using drbd9x-utils-9.28.0-1.el9.elrepo.x86_64; tried kmod-drbd9x-9.1.22, .21 and .20. SELinux is permissive. The resource config is as per LINBIT's latest RHEL 9 / DRBD 9 docs:

```
Clone: users_drbd-clone
  Meta Attributes: users_drbd-clone-meta_attributes
    clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1
  Resource: users_drbd (class=ocf provider=linbit type=drbd)
    Attributes: users_drbd-instance_attributes
      drbd_resource=users
    Operations:
      demote: users_drbd-demote-interval-0s interval=0s timeout=90
      monitor: users_drbd-monitor-interval-29s interval=29s timeout=20s role=Promoted
      monitor: users_drbd-monitor-interval-31s interval=31s timeout=20s role=Unpromoted
      notify: users_drbd-notify-interval-0s interval=0s timeout=90
      promote: users_drbd-promote-interval-0s interval=0s timeout=90
      reload: users_drbd-reload-interval-0s interval=0s timeout=30
      start: users_drbd-start-interval-0s interval=0s timeout=240
      stop: users_drbd-stop-interval-0s interval=0s timeout=100
```
Tags: No tags attached.
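For readers wanting to reproduce a similar setup: a promotable DRBD clone with the attributes listed above would typically be created with pcs along these lines. This is a minimal sketch assuming the pcs CLI; it is not a command taken from this report, and the resource name, drbd_resource and timeouts simply mirror the attributes shown in Additional Information.

```sh
# Sketch only: one way to create a promotable DRBD clone roughly matching
# the attributes listed above (pcs auto-names the clone users_drbd-clone).
pcs resource create users_drbd ocf:linbit:drbd drbd_resource=users \
    op monitor interval=29s timeout=20s role=Promoted \
    op monitor interval=31s timeout=20s role=Unpromoted \
    op start timeout=240 op stop timeout=100 \
    promotable meta promoted-max=1 promoted-node-max=1 \
        clone-max=2 clone-node-max=1 notify=true
```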
|
These are the logs from the system console while trying to reboot (see the attached drbdreboot.jpg).
|
We have a very similar setup on a RHEL 8.9 system, with slightly older versions, that has no problem at all: drbd90-utils-9.27.0-1.el8.elrepo.x86_64 and kmod-drbd90-9.1.19-1.el8_9.elrepo.x86_64.
|
Acknowledged.
|
In fact, after downgrading to this almost identical combination everything started to work as expected: kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64 and drbd9x-utils-9.27.0-1.el9.elrepo.x86_64.
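For reference, that downgrade can be done with dnf roughly like this (a sketch; it assumes the older packages are still published in the enabled elrepo repository):

```sh
# Sketch: roll back to the working combination mentioned above.
dnf downgrade drbd9x-utils-9.27.0-1.el9.elrepo \
              kmod-drbd9x-9.1.19-2.el9_4.elrepo
```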
|
A lot has changed in the latest utils: https://github.com/LINBIT/drbd-utils/blob/master/ChangeLog

```
9.28.0
-----------
 * events2: set may_promote:no promotion_score:0 while force-io-failure:yes
 * drbdsetup,v9: show TLS in connection status
 * drbdsetup,v9: add udev command
 * 8.3: remove
 * crm-fence-peer.9.sh: fixes for pacemaker 2.1.7
 * events2: improved out of order message handling
```
|
I ended up picking a kmod-drbd9x version older than drbd9x-utils-9.28.0-1.el9.elrepo.x86_64, just to be on the safe side with a somewhat common combination:

```
drbd9x-utils-9.27.0-1.el9.elrepo.x86_64.rpm   2023-12-23 12:41  1.0M
drbd9x-utils-9.28.0-1.el9.elrepo.x86_64.rpm   2024-05-11 20:36  886K
kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64.rpm  2024-05-01 17:03  400K
kmod-drbd9x-9.1.20-1.el9_4.elrepo.x86_64.rpm  2024-05-13 18:59  402K
kmod-drbd9x-9.1.21-1.el9_4.elrepo.x86_64.rpm  2024-06-08 18:33  402K
kmod-drbd9x-9.1.22-1.el9_4.elrepo.x86_64.rpm  2024-08-12 18:48  403K
```
|
It seems a lot has changed in 9.28.0: https://lists.linbit.com/pipermail/drbd-announce/2024-May/000728.html ("In contrast to the recent releases this one contains a bit more exciting news:")
|
Thank you. I was about to suggest trying drbd9x-utils-9.27.0. By the way, 9.29.0 is on its way. The changelog is:

```
9.29.0-rc.1
-----------
 * drbdmeta: fix initialization for external md
 * build: allow disabling keyutils
 * tests: export sanitized environment
 * drbdmon: various improvements
 * build: add cyclonedx
 * drbsetup,v9: fix multiple paths drbdsetup show --json
   strictly speaking breaking change, but maily used internally
 * events2: expose if device is open
 * drbdadm: fix undefined behavior that triggered on amd64
 * shared: fix out-of-bounds access in parsing
 * drbsetup,v9: event consistency with peer devices
 * drbdadm: fix parsing of v8.4 configs for compatibility
 * drbdmeta: fix segfault for check-resize on intentionally diskless
 * drbd-promote@.service: check if ExecCondition is available
```
|
Thank you, but I have to go to production, so if this proves stable I'll stop here for a while. It could also be something RHEL 9 specific. Regards.
|
Understood. Best to stay with what is proven to work.
|
I'll leave further tests for the next cluster. In the meantime, maybe it's better to retire this version.
|
Also, RHEL 9.5 should be out soon, so some testing will be necessary.
|
drbd9x-utils-9.29.0-1.el9.elrepo.x86_64.rpm is out.
|
Hello. Well, I tried drbd9x-utils-9.29.0-1.el9.elrepo.x86_64, also updating to RHEL 9.5 and kmod-drbd9x-9.1.23-1.el9_5.elrepo.x86_64, but I still see the system being unable to shut down cleanly:

```
Node List:
  * Node plutone2.cnmca.meteoam.it: UNCLEAN (online)
  * Online: [ plutone1.cnmca.meteoam.it ]

Active Resources:
  * Clone Set: users_drbd-clone [users_drbd] (promotable):
    * users_drbd    (ocf:linbit:drbd):           FAILED plutone2.cnmca.meteoam.it
    * Promoted: [ plutone1.cnmca.meteoam.it ]
  * Resource Group: users_group (3 members inactive):
    * users_fs      (ocf:heartbeat:Filesystem):  Started plutone1.cnmca.meteoam.it

Failed Resource Actions:
  * users_drbd stop on plutone2.cnmca.meteoam.it could not be executed (Timed Out: Resource agent did not complete within 1m40s) at Wed Jan 22 19:17:49 2025 after 1m40.002s
```
|
Stop of the drbd resource seems to complete correctly, but pacemaker seems to think that it didn't complete:

```
Jan 22 19:17:49 plutone2 kernel: drbd users: Preparing cluster-wide state change 3527246773: 1->0 conn( Disconnecting )
Jan 22 19:17:49 plutone2 kernel: drbd users: State change 3527246773: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFE
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Cluster is now split
Jan 22 19:17:49 plutone2 kernel: drbd users: Committing cluster-wide state change 3527246773 (1ms)
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: conn( Connected -> Disconnecting ) peer( Primary -> Unknown ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( UpToDate -> Outdated ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1 plutone1: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: Enabling local AL-updates
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: ack_receiver terminated
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating ack_recv thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating sender thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Starting sender thread (peer-node-id 0)
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Requesting local execution of notify operation for users_drbd on plutone2.cnmca.meteoam.it
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Connection closed
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: helper command: /sbin/drbdadm disconnected
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: helper command: /sbin/drbdadm disconnected exit code 0
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: conn( Disconnecting -> StandAlone ) [disconnected]
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating receiver thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating sender thread
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Result of notify operation for users_drbd on plutone2.cnmca.meteoam.it: ok
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Requesting local execution of stop operation for users_drbd on plutone2.cnmca.meteoam.it
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( Outdated -> Detaching ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( Detaching -> Diskless ) [go-diskless]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: drbd_bm_resize called with capacity == 0
Jan 22 19:17:49 plutone2 kernel: drbd /unregistered/users: Terminating worker thread
Jan 22 19:17:49 plutone2 systemd[1]: drbd-graceful-shutdown.service: Deactivated successfully.
Jan 22 19:19:29 plutone2 pacemaker-controld[1873]: error: Result of stop operation for users_drbd on plutone2.cnmca.meteoam.it: Timed Out after 1m40s (Resource agent did not complete within 1m40s)
Jan 22 19:19:29 plutone2 pacemaker-attrd[1871]: notice: Setting last-failure-users_drbd#stop_0[plutone2.cnmca.meteoam.it] in instance_attributes: (unset) -> 1737573569
Jan 22 19:19:29 plutone2 pacemaker-attrd[1871]: notice: Setting fail-count-users_drbd#stop_0[plutone2.cnmca.meteoam.it] in instance_attributes: (unset) -> INFINITY
```
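For completeness, the last-failure/fail-count attributes recorded above can be cleared once the node is back, for example with a resource cleanup (a sketch assuming the pcs CLI; this only clears the failure record, it does not fix the underlying timeout):

```sh
# Sketch: clear the recorded stop failure for the DRBD clone.
pcs resource cleanup users_drbd
```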
|
drbd-utils v9.30.0-rc.1 was announced a few days ago. As soon as the GA version is out, we will publish it. Changelog:

```
9.30.0-rc.1
-----------
 * drbdmon: various improvements
 * drbdadm: preparations for drbd-proxy v4
 * ocf: explicitly timeout crm_master IPC early
 * ocf: account for "recent" name change to "Promoted"
 * windrbd: various improvements
 * doc: switch to drbd9 manpages by default
 * init: reverse dependency for drbd-graceful-shutdown.service
   drbd@.service now has an After dependency for it.
```
|
It seems quite strange that such a fundamental issue went uncaught for so many versions, and I don't see it explicitly reported anywhere.
|
Is there any way to report this to LINBIT? I'll try to report the issue to Red Hat as well, even though it's not a supported product. Regards.
|
I strongly suspect that this is a Red Hat only problem. LINBIT only mentions AlmaLinux in its 9.x docs, and the cluster start and stop procedures are quite different: Alma autostarts the corosync and pacemaker services via systemd, while Red Hat uses pcsd. That could be why stopping pacemaker by hand before rebooting works.
|
The issue seems to be that with a standard shutdown systemd kicks in before pacemaker:

```
Jan 22 19:17:49 plutone2 systemd[1]: drbd-graceful-shutdown.service: Deactivated successfully.
Jan 22 19:19:29 plutone2 pacemaker-controld[1873]: error: Result of stop operation for users_drbd on plutone2.cnmca.meteoam.it: Timed Out after 1m40s (Resource agent did not complete within 1m40s)
```

But drbd is not under systemd in any way:

```
[root@NEWplutone2 ~]# systemctl status drbd.service
○ drbd.service - DRBD -- please disable. Unless you are NOT using a cluster manager.
     Loaded: loaded (/usr/lib/systemd/system/drbd.service; disabled; preset: disabled)
     Active: inactive (dead)
```
|
Aha... drbd-graceful-shutdown.service: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-drbd-graceful-shutdown

5.7. Shutting Down Gracefully By Using a systemd Service
"Included with drbd-utils versions since 9.26.0, there is a “graceful shutdown” service, drbd-graceful-shutdown.service. This service ensures that the DRBD “last man standing” behavior applies to your cluster when shutting down nodes."
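To see what ordering that unit actually declares on an affected system, something like this can be used (a sketch using only standard systemctl commands):

```sh
# Sketch: inspect the shipped unit and its declared ordering; in the affected
# versions there is no Before= ordering against pacemaker or the resource agents.
systemctl cat drbd-graceful-shutdown.service
systemctl show -p Before -p After drbd-graceful-shutdown.service
```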
|
Someone should tell LINBIT that this new silent gift doesn't work with pacemaker and causes quite some trouble. There's no mention of it at all in the admin docs.
|
And it's not even possible to disable it cleanly:

```
[root@NEWplutone2 ~]# systemctl disable drbd-graceful-shutdown.service
The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled or disabled using systemctl.

Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.
```
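As a blunt local workaround (not suggested in this thread), the unit can still be masked even though it cannot be disabled; note that this switches off the graceful-shutdown / last-man-standing behavior entirely:

```sh
# Sketch: masking works even for units without an [Install] section,
# but it disables the DRBD graceful-shutdown behavior altogether.
systemctl mask drbd-graceful-shutdown.service
```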
|
drbd9x-utils-9.30.0-1.el9.elrepo.x86_64.rpm has just been released.
|
The bug is still there; drbd9x-utils-9.30.0-1.el9.elrepo doesn't resolve this. See: https://lists.linbit.com/pipermail/drbd-user/2025-January/026693.html
|
You can very simply patch drbd-graceful-shutdown.service following their indications.
|
I saw this post on the drbd mailing list: https://lists.linbit.com/pipermail/drbd-user/2025-January/026692.html Looks like the fix is coming?
|
Yes, it's the original post by Red Hat following my bug report. But it will probably take months, since in the reply LINBIT says it's now queued together with all the other systemd things.
|
And LINBIT's suggestion is only a single line to add: "So apparently we forgot to add a "Before=pacemaker.service" or even better "Before=resource-agents-deps.target" to the graceful shutdown service."
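Until a fixed package is available, that one line can be added locally as a systemd drop-in instead of editing the packaged unit file, so it survives package updates (a sketch using the Before= target LINBIT suggested above):

```sh
# Sketch: add the suggested ordering via a drop-in rather than patching the unit file.
mkdir -p /etc/systemd/system/drbd-graceful-shutdown.service.d
cat > /etc/systemd/system/drbd-graceful-shutdown.service.d/10-ordering.conf <<'EOF'
[Unit]
Before=resource-agents-deps.target
EOF
systemctl daemon-reload
```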
|
OK, we will build and release a patched version to the testing repo soon, as an interim solution.
|
The following package has just been released to the elrepo-testing repository and is currently syncing to mirrors: drbd9x-utils-9.30.0-2.el9.elrepo.x86_64.rpm

This is the patch applied:

```
$ cat elrepo-drbd-graceful-shutdown.patch
--- a/scripts/drbd-graceful-shutdown.service.in	2025-01-23 12:26:45.605658196 -0800
+++ b/scripts/drbd-graceful-shutdown.service.in	2025-01-27 10:30:29.268233167 -0800
@@ -8,6 +8,7 @@ Description=ensure all DRBD resources sh
 Wants=network-online.target
 After=network-online.target
 Before=drbd.service
+Before=resource-agents-deps.target
 
 [Service]
 Type=oneshot
```
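After installing the patched package (or adding an equivalent drop-in), the new ordering can be checked with something like:

```sh
# Sketch: resource-agents-deps.target should now appear in the unit's Before= list.
systemctl show -p Before drbd-graceful-shutdown.service
```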
Date Modified | Username | Field | Change |
---|---|---|---|
2024-10-28 15:06 | anenni | New Issue | |
2024-10-28 15:06 | anenni | Status | new => assigned |
2024-10-28 15:06 | anenni | Assigned To | => toracat |
2024-10-28 15:10 | anenni | Note Added: 0010165 | |
2024-10-28 15:10 | anenni | File Added: drbdreboot.jpg | |
2024-10-28 15:13 | anenni | Note Added: 0010166 | |
2024-10-28 15:19 | toracat | Status | assigned => acknowledged |
2024-10-28 15:19 | toracat | Note Added: 0010167 | |
2024-10-28 15:34 | anenni | Note Added: 0010168 | |
2024-10-28 15:39 | anenni | Note Added: 0010169 | |
2024-10-28 15:56 | anenni | Note Added: 0010170 | |
2024-10-28 16:09 | anenni | Note Added: 0010171 | |
2024-10-28 16:24 | toracat | Note Added: 0010172 | |
2024-10-28 16:32 | anenni | Note Added: 0010173 | |
2024-10-28 17:56 | toracat | Note Added: 0010174 | |
2024-10-29 13:50 | anenni | Note Added: 0010176 | |
2024-10-29 14:37 | anenni | Note Added: 0010177 | |
2024-10-31 10:28 | toracat | Note Added: 0010178 | |
2025-01-22 14:31 | anenni | Note Added: 0010265 | |
2025-01-22 14:35 | anenni | Note Added: 0010266 | |
2025-01-22 14:50 | toracat | Note Added: 0010267 | |
2025-01-22 14:59 | anenni | Note Added: 0010268 | |
2025-01-22 15:46 | anenni | Note Added: 0010269 | |
2025-01-23 12:24 | anenni | Note Added: 0010270 | |
2025-01-23 15:18 | anenni | Note Added: 0010271 | |
2025-01-23 15:26 | anenni | Note Added: 0010272 | |
2025-01-23 15:38 | anenni | Note Added: 0010273 | |
2025-01-23 15:46 | anenni | Note Added: 0010274 | |
2025-01-23 19:43 | toracat | Note Added: 0010275 | |
2025-01-27 08:54 | anenni | Note Added: 0010283 | |
2025-01-27 09:59 | anenni | Note Added: 0010284 | |
2025-01-27 12:33 | toracat | Note Added: 0010285 | |
2025-01-27 12:38 | anenni | Note Added: 0010286 | |
2025-01-27 12:40 | anenni | Note Added: 0010287 | |
2025-01-27 13:44 | toracat | Note Added: 0010288 | |
2025-01-27 14:23 | toracat | Note Added: 0010289 | |
2025-01-27 14:23 | toracat | Status | acknowledged => feedback |