View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000788 | channel: elrepo/el7 | kmod-drbd84 | public | 2017-10-01 08:03 | 2017-10-26 12:42 |
Reporter | slyce | Assigned To | pperry | ||
Priority | normal | Severity | block | Reproducibility | always |
Status | assigned | Resolution | open | ||
Summary | 0000788: Deadlocks on all filesystems using kmod-drbd84 8.4.10 with kernel 3.10.0-693.el7 | ||||
Description |

Hi,

I'm maintaining two NFS-HA clusters based on Pacemaker, drbd84 and XFS filesystems. Report http://elrepo.org/bugs/view.php?id=781 says that drbd 8.4.10 is available for that kernel, but even though that ticket is marked "fixed", there are a LOT of issues with this module.

We upgraded the kernel, and therefore the drbd module (for security reasons), on the first cluster (LVM over DRBD + 300 logical volumes, XFS and ext4, Pacemaker with hundreds of resources), and very quickly we found a blocking issue: every time we did some I/O on a filesystem, it caused a huge number of deadlocks across the whole system. All filesystem I/O was stuck, blocked until the primary writer/reader finished its task.

The biggest consequence for us was the loss of the exportfs (NFS export) resources on our Pacemaker cluster, as the monitoring of the NFS exports was running into timeouts (300 shares, monitored every 30s, with a 10s timeout). Note that our servers have 192 GB RAM, 32 cores and 6 TB of RAID5 storage.

I attached a picture of the monitoring (Centreon). The 100+ load average was the consequence of a tar czf command running for less than 10 minutes. During this time, we lost nine NFS exports (Pacemaker timeouts), and the remaining ones were blocked on the client side, with some bad consequences. Every hour or two, some cronjobs run our backups; each time we had load-average peaks with deadlocks, and we lost between one and five NFS exports.

Pacemaker samples:

    * exportfs_nfsv4_pv-nfsha-prod-00279_monitor_30000 on ipe 'unknown error' (1): call=10620, status=Timed Out, exitreason='none', last-rc-change='Tue Sep 26 18:01:21 2017', queued=0ms, exec=0ms
    * exportfs_nfsv4_pv-nfsha-prod-00216_monitor_30000 on ipe 'unknown error' (1): call=11203, status=Timed Out, exitreason='none', last-rc-change='Tue Sep 26 18:03:25 2017', queued=0ms, exec=0ms

`ps -efl | grep ' D'` samples:

    1 D root 449061 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-221]
    1 D root 449132 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-222]
    1 D root 449631 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-223]
    1 D root 449827 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-224]
    1 D root 449834 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-198]
    1 D root 450384 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-225]
    1 D root 450717 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-227]
    1 D root 451633 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-230]
    1 D root 452237 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-232]
    1 D root 452954 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-200]
    1 D root 453222 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-201]
    1 D root 453325 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-202]
    ...

We were able to reproduce the issue in another environment (two VirtualBox VMs) using the same testing protocol with only one filesystem:

* -693 + drbd 8.4.10 module from elrepo => deadlocks
* -514 + drbd 8.4.9 module from elrepo => ok
* -693 without drbd => ok

As it was clear that the root cause was -693 + 8.4.10, we rolled back to -514 + 8.4.9. The issue is now gone; load never exceeds 1 or 2 under heavy I/O.

I would recommend removing the kmod-drbd84-8.4.10 package until this behaviour is fixed. |
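The report identifies stuck tasks by grepping `ps -efl` output for the D (uninterruptible sleep) state. A small sketch of the same check that matches on the state column itself rather than a substring (the `-eo` column selection here is my own choice, not part of the original report):

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D), the state the
# xfsaild threads above were stuck in. Matching on the dedicated
# state column avoids false positives from "grep ' D'" hitting
# other fields that happen to contain " D".
ps -eo state,pid,ppid,comm,wchan | awk 'NR == 1 || $1 ~ /^D/'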
Tags | No tags attached. | ||||
Attached Files | Centreon.png |
Reported upstream | |||||
Note 0005525 | pperry | 2017-10-01 08:13 |
Acknowledged. Are you aware of any similar issues with drbd-8.4.10 reported upstream? |
Note 0005526 | slyce | 2017-10-01 08:29 |
I checked and found almost nothing. They are more focused on 9.0, and as 8.4.10 is recent and was only just patched for -693 (RHEL/CentOS 7.4), I don't think there is much field feedback yet. |
Note 0005527 | pperry | 2017-10-01 09:21 |
Acknowledged. I just had a quick google and didn't find anything that looked related. I'll consult with my colleague, and we will most likely demote the package back to the testing repo. You should probably report the issue upstream to Linbit. |
Note 0005528 | slyce | 2017-10-01 10:33 |
FYI, this is the single command line I used to run my tests: `dd if=/dev/urandom of=sample.txt bs=64M count=100` |
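A scaled-down sketch of that test (the byte counts here are reduced for illustration; the original run wrote roughly 6.4 GB, and `conv=fsync` is my addition, not part of the original command):

```shell
#!/bin/sh
# Reduced version of the reproduction test. The original was
#   dd if=/dev/urandom of=sample.txt bs=64M count=100
# i.e. ~6.4 GB of random data written onto the DRBD-backed
# filesystem. bs/count are shrunk here so the sketch runs anywhere;
# conv=fsync (an addition) makes dd flush to disk before exiting.
dd if=/dev/urandom of=sample.txt bs=1M count=16 conv=fsync
rm -f sample.txt
```

On the affected -693 + 8.4.10 setup, the reporter observed the full-size run driving the load average past 100 and pushing xfsaild threads into the D state.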
Note 0005529 | toracat | 2017-10-01 10:40 |
We are waiting to hear from other drbd84 + el7.4 users. In the meantime, please do report this issue to the drbd developers. |
Note 0005563 | slyce | 2017-10-26 12:42 |
We finally downgraded all our clusters to kernel -514 and the drbd 8.4.9 module, and all our issues have gone away... No more deadlocks. |
Date Modified | Username | Field | Change |
---|---|---|---|
2017-10-01 08:03 | slyce | New Issue | |
2017-10-01 08:03 | slyce | Status | new => assigned |
2017-10-01 08:03 | slyce | Assigned To | => pperry |
2017-10-01 08:04 | slyce | File Added: Centreon.png | |
2017-10-01 08:13 | pperry | Note Added: 0005525 | |
2017-10-01 08:29 | slyce | Note Added: 0005526 | |
2017-10-01 09:21 | pperry | Note Added: 0005527 | |
2017-10-01 10:33 | slyce | Note Added: 0005528 | |
2017-10-01 10:40 | toracat | Note Added: 0005529 | |
2017-10-26 12:42 | slyce | Note Added: 0005563 |