View Issue Details

IDProjectCategoryView StatusLast Update
0001488channel: elrepo/el9drbd9x-utilspublic2025-01-27 14:23
Reporteranenni Assigned Totoracat  
PrioritynormalSeveritymajorReproducibilityalways
Status feedbackResolutionopen 
PlatformLinuxOSrhelOS Version9.4
Summary0001488: pacemaker-schedulerd warning: Unexpected result (error: Resource agent did not complete within 1m40s)
Descriptionwhen restarting drbd standby node, pacemaker stop procedure always fails making node unusable for a lot of time, with this message multiple in logs:

pacemaker-schedulerd warning: Unexpected result (error: Resource agent did not complete within 1m40s)

after a lot of minutes it finally reboots, unless obviously you configure fencing that kills the node, but this is not acceptable as a routine measure.

Steps To Reproducejust reboot the drbd unpromoted (slave) server.

Instead, stopping pacemaker before and then rebooting the node goes flawlessly.
Additional Informationusing drbd9x-utils-9.28.0-1.el9.elrepo.x86_64
tried kmod-drbd9x-9.1.22 and .21 and .20

selinux is permissive

resource config is as per linbit latest rhel9/drbd9 docs.

Clone: users_drbd-clone
  Meta Attributes: users_drbd-clone-meta_attributes
    clone-max=2
    clone-node-max=1
    notify=true
    promotable=true
    promoted-max=1
    promoted-node-max=1
  Resource: users_drbd (class=ocf provider=linbit type=drbd)
    Attributes: users_drbd-instance_attributes
      drbd_resource=users
    Operations:
      demote: users_drbd-demote-interval-0s
        interval=0s timeout=90
      monitor: users_drbd-monitor-interval-29s
        interval=29s timeout=20s role=Promoted
      monitor: users_drbd-monitor-interval-31s
        interval=31s timeout=20s role=Unpromoted
      notify: users_drbd-notify-interval-0s
        interval=0s timeout=90
      promote: users_drbd-promote-interval-0s
        interval=0s timeout=90
      reload: users_drbd-reload-interval-0s
        interval=0s timeout=30
      start: users_drbd-start-interval-0s
        interval=0s timeout=240
      stop: users_drbd-stop-interval-0s
        interval=0s timeout=100
TagsNo tags attached.

Activities

anenni

2024-10-28 15:10

reporter   ~0010165

this is logs from system console while trying to reboot
drbdreboot.jpg (222,900 bytes)   
drbdreboot.jpg (222,900 bytes)   

anenni

2024-10-28 15:13

reporter   ~0010166

we have a very similar setup on a rhel8.9 system with slightly lower versions that has no problem at all.

drbd90-utils-9.27.0-1.el8.elrepo.x86_64
kmod-drbd90-9.1.19-1.el8_9.elrepo.x86_64

toracat

2024-10-28 15:19

administrator   ~0010167

Acknowledged.

anenni

2024-10-28 15:34

reporter   ~0010168

infact after downgrading to this almost identycal combination all started to work as expected:

kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64
drbd9x-utils-9.27.0-1.el9.elrepo.x86_64

anenni

2024-10-28 15:39

reporter   ~0010169

a lot has changed in latest utils:

https://github.com/LINBIT/drbd-utils/blob/master/ChangeLog

9.28.0
-----------
 * events2: set may_promote:no promotion_score:0 while
   force-io-failure:yes
 * drbdsetup,v9: show TLS in connection status
 * drbdsetup,v9: add udev command
 * 8.3: remove
 * crm-fence-peer.9.sh: fixes for pacemaker 2.1.7
 * events2: improved out of order message handling

anenni

2024-10-28 15:56

reporter   ~0010170

I finished picking a kmod-drbd9x version older then drbd9x-utils-9.28.0-1.el9.elrepo.x86_64, just to be on the safe side with a somewhat common combination.

    drbd9x-utils-9.27.0-1.el9.elrepo.x86_64.rpm 2023-12-23 12:41 1.0M
     drbd9x-utils-9.28.0-1.el9.elrepo.x86_64.rpm 2024-05-11 20:36 886K

     kmod-drbd9x-9.1.19-2.el9_4.elrepo.x86_64.rpm 2024-05-01 17:03 400K
     kmod-drbd9x-9.1.20-1.el9_4.elrepo.x86_64.rpm 2024-05-13 18:59 402K
     kmod-drbd9x-9.1.21-1.el9_4.elrepo.x86_64.rpm 2024-06-08 18:33 402K
     kmod-drbd9x-9.1.22-1.el9_4.elrepo.x86_64.rpm 2024-08-12 18:48 403K

anenni

2024-10-28 16:09

reporter   ~0010171

It seems a lot has changed in 9.28.0 ....

https://lists.linbit.com/pipermail/drbd-announce/2024-May/000728.html

"In contrast to the recent releases this one contains a bit more exciting
news:"

toracat

2024-10-28 16:24

administrator   ~0010172

Thank you. I was about to suggest trying drbd9x-utils-9.27.0.

By the way 9.29.0 is on its way. The changelog is:

9.29.0-rc.1
-----------
 * drbdmeta: fix initialization for external md
 * build: allow disabling keyutils
 * tests: export sanitized environment
 * drbdmon: various improvements
 * build: add cyclonedx
 * drbsetup,v9: fix multiple paths drbdsetup show --json
   strictly speaking breaking change, but maily used internally
 * events2: expose if device is open
 * drbdadm: fix undefined behavior that triggered on amd64
 * shared: fix out-of-bounds access in parsing
 * drbsetup,v9: event consistency with peer devices
 * drbdadm: fix parsing of v8.4 configs for compatibility
 * drbdmeta: fix segfault for check-resize on intentionally diskless
 * drbd-promote@.service: check if ExecCondition is available

anenni

2024-10-28 16:32

reporter   ~0010173

Thank you, but I have to go production so if it proves stable, I'll stop here for a while.

And it could also be something rhel9 specific.

Regards.

toracat

2024-10-28 17:56

administrator   ~0010174

Understood. Best to stay with what is proven to work.

anenni

2024-10-29 13:50

reporter   ~0010176

i'll leave tests for the next cluster
in the meantime maybe it's better to retire this version

anenni

2024-10-29 14:37

reporter   ~0010177

Also, rhel 9.5 should be near now, so some testing will be necessary

toracat

2024-10-31 10:28

administrator   ~0010178

drbd9x-utils-9.29.0-1.el9.elrepo.x86_64.rpm is out.

anenni

2025-01-22 14:31

reporter   ~0010265

hello.

well, I tried drbd9x-utils-9.29.0-1.el9.elrepo.x86_64, updating also to rhel 9.5 and kmod-drbd9x-9.1.23-1.el9_5.elrepo.x86_64

but I still observe system not being able to cleanly shutdown:

Node List:
  * Node plutone2.cnmca.meteoam.it: UNCLEAN (online)
  * Online: [ plutone1.cnmca.meteoam.it ]

Active Resources:
  * Clone Set: users_drbd-clone [users_drbd] (promotable):
    * users_drbd (ocf:linbit:drbd): FAILED plutone2.cnmca.meteoam.it
    * Promoted: [ plutone1.cnmca.meteoam.it ]
  * Resource Group: users_group (3 members inactive):
    * users_fs (ocf:heartbeat:Filesystem): Started plutone1.cnmca.meteoam.it

Failed Resource Actions:
  * users_drbd stop on plutone2.cnmca.meteoam.it could not be executed (Timed Out: Resource agent did not complete within 1m40s) at Wed Jan 22 19:17:49 2025 after 1m40.002s

anenni

2025-01-22 14:35

reporter   ~0010266

stop of drbd resource seems to complete correctly, but pacemaker seems to think that it didn't complete:

Jan 22 19:17:49 plutone2 kernel: drbd users: Preparing cluster-wide state change 3527246773: 1->0 conn( Disconnecting )
Jan 22 19:17:49 plutone2 kernel: drbd users: State change 3527246773: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFE
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Cluster is now split
Jan 22 19:17:49 plutone2 kernel: drbd users: Committing cluster-wide state change 3527246773 (1ms)
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: conn( Connected -> Disconnecting ) peer( Primary -> Unknown ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( UpToDate -> Outdated ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1 plutone1: pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: Enabling local AL-updates
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: ack_receiver terminated
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating ack_recv thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating sender thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Starting sender thread (peer-node-id 0)
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Requesting local execution of notify operation for users_drbd on plutone2.cnmca.meteoam.it
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Connection closed
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: helper command: /sbin/drbdadm disconnected
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: helper command: /sbin/drbdadm disconnected exit code 0
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: conn( Disconnecting -> StandAlone ) [disconnected]
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating receiver thread
Jan 22 19:17:49 plutone2 kernel: drbd users plutone1: Terminating sender thread
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Result of notify operation for users_drbd on plutone2.cnmca.meteoam.it: ok
Jan 22 19:17:49 plutone2 pacemaker-controld[1873]: notice: Requesting local execution of stop operation for users_drbd on plutone2.cnmca.meteoam.it
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( Outdated -> Detaching ) [down]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: disk( Detaching -> Diskless ) [go-diskless]
Jan 22 19:17:49 plutone2 kernel: drbd users/0 drbd1: drbd_bm_resize called with capacity == 0
Jan 22 19:17:49 plutone2 kernel: drbd /unregistered/users: Terminating worker thread
Jan 22 19:17:49 plutone2 systemd[1]: drbd-graceful-shutdown.service: Deactivated successfully.
Jan 22 19:19:29 plutone2 pacemaker-controld[1873]: error: Result of stop operation for users_drbd on plutone2.cnmca.meteoam.it: Timed Out after 1m40s (Resource agent did not complete within 1m40s)
Jan 22 19:19:29 plutone2 pacemaker-attrd[1871]: notice: Setting last-failure-users_drbd#stop_0[plutone2.cnmca.meteoam.it] in instance_attributes: (unset) -> 1737573569
Jan 22 19:19:29 plutone2 pacemaker-attrd[1871]: notice: Setting fail-count-users_drbd#stop_0[plutone2.cnmca.meteoam.it] in instance_attributes: (unset) -> INFINITY

toracat

2025-01-22 14:50

administrator   ~0010267

drbd-utils v9.30.0-rc.1 was announced a few days ago. As soon as the GA version is out, we will publish it.

Changelog:

9.30.0-rc.1
-----------
 * drbdmon: various improvements
 * drbdadm: preparations for drbd-proxy v4
 * ocf: explicitly timeout crm_master IPC early
 * ocf: account for "recent" name change to "Promoted"
 * windrbd: various improvements
 * doc: switch to drbd9 manpages by default
 * init: reverse dependency for drbd-graceful-shutdown.service
   drbd@.service now has an After dependency for it.

anenni

2025-01-22 14:59

reporter   ~0010268

it seems quite strange that this primary issue got uncaught for so many versions, and I don't see it explicitly reported

anenni

2025-01-22 15:46

reporter   ~0010269

is there any way to report this to linbit?

I'll try to report the issue to redhat also, even if it's not a supported product.

regards

anenni

2025-01-23 12:24

reporter   ~0010270

I strongly suspect that this is a red hat only problem.
Linbit only mentions AlmaLinux in 9,x docs, and cluster start and stop procedures are quite different.
alma uses corosync and pacemaker services in autostart in systemd, while redhat uses pcsd.

that could be why stopping pacemaker by hand before rebooting works.

anenni

2025-01-23 15:18

reporter   ~0010271

issue seems to be that with a standard shutdown systemd kicks in before pacemaker

Jan 22 19:17:49 plutone2 systemd[1]: drbd-graceful-shutdown.service: Deactivated successfully.
Jan 22 19:19:29 plutone2 pacemaker-controld[1873]: error: Result of stop operation for users_drbd on plutone2.cnmca.meteoam.it: Timed Out after 1m40s (Resource agent did not complete within 1m40s)

but drbd is not under systemd in any way

[root@NEWplutone2 ~]# systemctl status drbd.service
○ drbd.service - DRBD -- please disable. Unless you are NOT using a cluster manager.
     Loaded: loaded (/usr/lib/systemd/system/drbd.service; disabled; preset: disabled)
     Active: inactive (dead)

anenni

2025-01-23 15:26

reporter   ~0010272

Aha... drbd-graceful-shutdown.service

https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-drbd-graceful-shutdown

5.7. Shutting Down Gracefully By Using a systemd Service

Included with drbd-utils versions since 9.26.0, there is a “graceful shutdown” service, drbd-graceful-shutdown.service. This service ensures that the DRBD “last man standing” behavior applies to your cluster when shutting down nodes.

anenni

2025-01-23 15:38

reporter   ~0010273

Someone shoudl tell Linbit that this new silent gift doesn't work with pacemaker and makes quite some trouble.

there's no mention at all in admin docs.

anenni

2025-01-23 15:46

reporter   ~0010274

and it's not even possible to disable it cleanly

[root@NEWplutone2 ~]# systemctl disable drbd-graceful-shutdown.service
The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled or disabled using systemctl.

Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.

toracat

2025-01-23 19:43

administrator   ~0010275

drbd9x-utils-9.30.0-1.el9.elrepo.x86_64.rpm has just been released.

anenni

2025-01-27 08:54

reporter   ~0010283

bug is still in drbd9x-utils-9.30.0-1.el9.elrepo doesn't resolve this
see:
https://lists.linbit.com/pipermail/drbd-user/2025-January/026693.html

anenni

2025-01-27 09:59

reporter   ~0010284

you can verysimply patch drbd-graceful-shutdown.service following their indications

toracat

2025-01-27 12:33

administrator   ~0010285

I saw this post on the drbd mailing list:

https://lists.linbit.com/pipermail/drbd-user/2025-January/026692.html

Looks like the fix is comming?

anenni

2025-01-27 12:38

reporter   ~0010286

yes, it's the original post by redhat following my bug report.

but it will probably take months, since in the reply linbit says that it's now queued togethere with all other systemd thingies

anenni

2025-01-27 12:40

reporter   ~0010287

and linbit suggestion is only a line to add:

So apparently we forgot to add a "Before=pacemaker.service"
or even better "Before=resource-agents-deps.target"
to the graceful shutdown service.

toracat

2025-01-27 13:44

administrator   ~0010288

OK, we will build and release a patched version to the testing repo soon as an interim solution.

toracat

2025-01-27 14:23

administrator   ~0010289

The following package has just been released to the elrepo-testing repository and is currently syncing to mirrors:

drbd9x-utils-9.30.0-2.el9.elrepo.x86_64.rpm

This is the patch applied:

$ cat elrepo-drbd-graceful-shutdown.patch
--- a/scripts/drbd-graceful-shutdown.service.in 2025-01-23 12:26:45.605658196 -0800
+++ b/scripts/drbd-graceful-shutdown.service.in 2025-01-27 10:30:29.268233167 -0800
@@ -8,6 +8,7 @@ Description=ensure all DRBD resources sh
 Wants=network-online.target
 After=network-online.target
 Before=drbd.service
+Before=resource-agents-deps.target
 
 [Service]
 Type=oneshot

Issue History

Date Modified Username Field Change
2024-10-28 15:06 anenni New Issue
2024-10-28 15:06 anenni Status new => assigned
2024-10-28 15:06 anenni Assigned To => toracat
2024-10-28 15:10 anenni Note Added: 0010165
2024-10-28 15:10 anenni File Added: drbdreboot.jpg
2024-10-28 15:13 anenni Note Added: 0010166
2024-10-28 15:19 toracat Status assigned => acknowledged
2024-10-28 15:19 toracat Note Added: 0010167
2024-10-28 15:34 anenni Note Added: 0010168
2024-10-28 15:39 anenni Note Added: 0010169
2024-10-28 15:56 anenni Note Added: 0010170
2024-10-28 16:09 anenni Note Added: 0010171
2024-10-28 16:24 toracat Note Added: 0010172
2024-10-28 16:32 anenni Note Added: 0010173
2024-10-28 17:56 toracat Note Added: 0010174
2024-10-29 13:50 anenni Note Added: 0010176
2024-10-29 14:37 anenni Note Added: 0010177
2024-10-31 10:28 toracat Note Added: 0010178
2025-01-22 14:31 anenni Note Added: 0010265
2025-01-22 14:35 anenni Note Added: 0010266
2025-01-22 14:50 toracat Note Added: 0010267
2025-01-22 14:59 anenni Note Added: 0010268
2025-01-22 15:46 anenni Note Added: 0010269
2025-01-23 12:24 anenni Note Added: 0010270
2025-01-23 15:18 anenni Note Added: 0010271
2025-01-23 15:26 anenni Note Added: 0010272
2025-01-23 15:38 anenni Note Added: 0010273
2025-01-23 15:46 anenni Note Added: 0010274
2025-01-23 19:43 toracat Note Added: 0010275
2025-01-27 08:54 anenni Note Added: 0010283
2025-01-27 09:59 anenni Note Added: 0010284
2025-01-27 12:33 toracat Note Added: 0010285
2025-01-27 12:38 anenni Note Added: 0010286
2025-01-27 12:40 anenni Note Added: 0010287
2025-01-27 13:44 toracat Note Added: 0010288
2025-01-27 14:23 toracat Note Added: 0010289
2025-01-27 14:23 toracat Status acknowledged => feedback