View Issue Details

ID: 0000788
Project: channel: elrepo/el7
Category: kmod-drbd84
View Status: public
Last Update: 2017-10-26 12:42
Reporter: slyce
Assigned To: pperry
Priority: normal
Severity: block
Reproducibility: always
Status: assigned
Resolution: open
Summary: 0000788: Deadlocks on all filesystems using kmod-drbd84 8.4.10 with kernel 3.10.0-693-el7
Description

Hi,

I'm maintaining two NFS-HA clusters based on pacemaker, drbd84 and xfs filesystems.
Report http://elrepo.org/bugs/view.php?id=781 says that drbd 8.4.10 is available for that kernel, but even though that ticket is marked "fixed", there are a LOT of issues with this module.


We upgraded the kernel, and with it the drbd module, for security reasons on the first cluster (LVM over drbd with 300 logical volumes, xfs and ext4, pacemaker with hundreds of resources), and very quickly hit a blocking issue: any I/O on a filesystem caused a huge number of deadlocks across the whole system. All filesystem I/O was stuck until the primary writer/reader finished its task. The biggest consequence for us was the loss of the exportfs (NFS export) resources on our pacemaker cluster, because the monitoring of the NFS exports timed out (300 shares, monitored every 30s, with a timeout of 10s).

Note that our servers have 192 GB of RAM, 32 cores and 6 TB of RAID5 storage... I have attached a picture of the monitoring (Centreon). The 100+ load average was the consequence of a single tar czf command that ran for less than 10 minutes... During this time, we lost nine NFS exports (pacemaker timeouts), and the remaining ones were blocked on the client side, with some bad consequences. Every one or two hours, cron jobs run backups. Each time, we had load average peaks with deadlocks, and we lost between one and five NFS exports...

Pacemaker samples :
* exportfs_nfsv4_pv-nfsha-prod-00279_monitor_30000 on ipe 'unknown error' (1): call=10620, status=Timed Out, exitreason='none',
    last-rc-change='Tue Sep 26 18:01:21 2017', queued=0ms, exec=0ms
* exportfs_nfsv4_pv-nfsha-prod-00216_monitor_30000 on ipe 'unknown error' (1): call=11203, status=Timed Out, exitreason='none',
    last-rc-change='Tue Sep 26 18:03:25 2017', queued=0ms, exec=0ms

ps -efl | grep ' D' samples :
1 D root 449061 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-221]
1 D root 449132 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-222]
1 D root 449631 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-223]
1 D root 449827 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-224]
1 D root 449834 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-198]
1 D root 450384 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-225]
1 D root 450717 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-227]
1 D root 451633 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-230]
1 D root 452237 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-232]
1 D root 452954 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-200]
1 D root 453222 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-201]
1 D root 453325 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-202]
...
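
For anyone who wants to summarise this kind of listing rather than read it by eye, here is a minimal Python sketch. It assumes the standard `ps -efl` column order (F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD); the function name `blocked_tasks` is made up for illustration and is not part of any tool mentioned in this report.

```python
from collections import Counter

def blocked_tasks(ps_lines):
    """Return (pid, wchan, cmd) for each `ps -efl` line whose state column is D."""
    tasks = []
    for line in ps_lines:
        # Split into at most 15 fields so that a CMD containing spaces stays whole.
        fields = line.split(None, 14)
        if len(fields) < 15 or fields[1] != "D":
            continue  # header line, short line, or task not in uninterruptible sleep
        tasks.append((int(fields[3]), fields[10], fields[14].strip()))
    return tasks

# Two lines copied from the listing above.
sample = [
    "1 D root 449061 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-221]",
    "1 D root 449132 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-222]",
]
tasks = blocked_tasks(sample)

# Group the blocked tasks by the kernel wait channel they are sleeping in.
by_wchan = Counter(wchan for _, wchan, _ in tasks)
for wchan, n in by_wchan.items():
    print(f"{n} task(s) blocked in {wchan}")
```

Feeding it the full output of `ps -efl` (for example via `subprocess.run(["ps", "-efl"], capture_output=True, text=True).stdout.splitlines()`) should give a quick count of D-state tasks per wait channel.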

We were able to reproduce the issue in another environment (two VirtualBox VMs) using our test protocol with only one filesystem:
* -693 + drbd 8.4.10 module from elrepo => deadlocks
* -514 + drbd 8.4.9 module from elrepo => ok
* -693 without drbd => ok

As it was clear that the root cause was -693 + 8.4.10, we rolled back to -514 + 8.4.9. The issue is now gone, and the load never goes over 1 or 2 under heavy I/O...

I would recommend that you remove the kmod-drbd84-8.4.10 package until this behaviour is fixed.
Tags: No tags attached.
Attached Files
Centreon.png (24,607 bytes)   
Reported upstream

Activities

pperry

2017-10-01 08:13

administrator   ~0005525

Acknowledged.

Are you aware of any similar issues with drbd-8.4.10 reported upstream?

slyce

2017-10-01 08:29

reporter   ~0005526

I checked and found almost nothing. They are more focused on 9.0, and as 8.4.10 is recent and was patched for -693 (RHEL/CentOS 7.4), I don't think there is much field feedback yet.

pperry

2017-10-01 09:21

administrator   ~0005527

Acknowledged. I just had a quick google and didn't find anything that looked related.

I'll consult with my colleague, and we will most likely demote the package back to the testing repo. I'm guessing you should probably report the issue upstream to Linbit.

slyce

2017-10-01 10:33

reporter   ~0005528

FYI, this single command line was used to run my tests :

dd if=/dev/urandom of=sample.txt bs=64M count=100
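
If it helps whoever picks this up, below is a rough Python sketch of a similar write load that also records how long each chunk takes, so stalls show up as outliers. It is not what was run for this report: the function name `timed_write` is hypothetical, and unlike the dd command above it calls fsync after every chunk (an assumption made so that each timing reflects data actually reaching the block device).

```python
import os
import tempfile
import time

def timed_write(path, chunk_bytes, count):
    """Write `count` chunks of random data to `path`; return per-chunk latencies in seconds."""
    latencies = []
    with open(path, "wb") as f:
        for _ in range(count):
            data = os.urandom(chunk_bytes)  # generated outside the timed region
            start = time.monotonic()
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # assumption: force the chunk down to the device
            latencies.append(time.monotonic() - start)
    return latencies

# Small demo in a temporary directory. For a load comparable to the dd command,
# the values would be chunk_bytes=64 * 2**20 and count=100 on the drbd-backed filesystem.
with tempfile.TemporaryDirectory() as tmp:
    lat = timed_write(os.path.join(tmp, "sample.bin"), 2**20, 4)
    print(f"chunks={len(lat)} max={max(lat):.4f}s mean={sum(lat) / len(lat):.4f}s")
```

A maximum latency far above the mean on the affected kernel/module combination, and not on the working one, would be consistent with the stalls described in this report.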

toracat

2017-10-01 10:40

administrator   ~0005529

We are waiting to hear from other drbd84 + el7.4 users. In the meantime, please do report this issue to the drbd developers.

slyce

2017-10-26 12:42

reporter   ~0005563

We finally downgraded all our clusters to kernel -514 and the drbd 8.4.9 module, and all our issues have gone away... No more deadlocks.

Issue History

Date Modified Username Field Change
2017-10-01 08:03 slyce New Issue
2017-10-01 08:03 slyce Status new => assigned
2017-10-01 08:03 slyce Assigned To => pperry
2017-10-01 08:04 slyce File Added: Centreon.png
2017-10-01 08:13 pperry Note Added: 0005525
2017-10-01 08:29 slyce Note Added: 0005526
2017-10-01 09:21 pperry Note Added: 0005527
2017-10-01 10:33 slyce Note Added: 0005528
2017-10-01 10:40 toracat Note Added: 0005529
2017-10-26 12:42 slyce Note Added: 0005563