View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000788 | channel: elrepo/el7 | kmod-drbd84 | public | 2017-10-01 08:03 | 2017-10-26 12:42 |
Reporter | slyce | Assigned To | pperry | ||
Priority | normal | Severity | block | Reproducibility | always |
Status | assigned | Resolution | open | ||
Summary | 0000788: Deadlocks on all filesystems using kmod-drbd84 8.4.10 with kernel 3.10.0-693.el7 | ||||
Description |

Hi,

I'm maintaining two NFS-HA clusters based on Pacemaker, drbd84 and XFS filesystems. Report http://elrepo.org/bugs/view.php?id=781 says that drbd 8.4.10 is available for that kernel, but even though that ticket is marked "fixed", there are a LOT of issues with this module.

We upgraded the kernel, and therefore the drbd module (for security reasons), on the first cluster (LVM over DRBD + 300 logical volumes, XFS and ext4, Pacemaker with hundreds of resources), and very quickly we found a blocking issue: every time we did some I/O on a filesystem, it caused a huge number of deadlocks across the whole system. All filesystem I/O was stuck, blocked until the primary writer/reader finished its task.

The biggest consequence for us was the loss of the exportfs (NFS export) resources on our Pacemaker cluster, as the monitoring of the NFS exports was running into timeouts (300 shares, monitored every 30s, with a 10s timeout). Note that our servers have 192 GB RAM, 32 cores and 6 TB of RAID5 storage.

I attached a picture of the monitoring (Centreon). The 100+ load average was the consequence of a tar czf command running for less than 10 minutes. During this time, we lost nine NFS exports (Pacemaker timeouts), and the remaining ones were blocked on the client side, with some bad consequences. Every hour or two, some cronjobs run our backups; each time we had load-average peaks with deadlocks, and we lost between one and five NFS exports.

Pacemaker samples:

    * exportfs_nfsv4_pv-nfsha-prod-00279_monitor_30000 on ipe 'unknown error' (1): call=10620, status=Timed Out, exitreason='none', last-rc-change='Tue Sep 26 18:01:21 2017', queued=0ms, exec=0ms
    * exportfs_nfsv4_pv-nfsha-prod-00216_monitor_30000 on ipe 'unknown error' (1): call=11203, status=Timed Out, exitreason='none', last-rc-change='Tue Sep 26 18:03:25 2017', queued=0ms, exec=0ms

`ps -efl | grep ' D'` samples:

    1 D root 449061 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-221]
    1 D root 449132 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-222]
    1 D root 449631 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-223]
    1 D root 449827 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-224]
    1 D root 449834 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-198]
    1 D root 450384 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-225]
    1 D root 450717 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-227]
    1 D root 451633 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-230]
    1 D root 452237 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-232]
    1 D root 452954 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-200]
    1 D root 453222 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-201]
    1 D root 453325 2 0 80 0 - 0 xfs_lo 00:02 ? 00:00:11 [xfsaild/dm-202]
    ...

We were able to reproduce the issue in another environment (two VirtualBox VMs) using the same testing protocol with only one filesystem:

* -693 + drbd 8.4.10 module from elrepo => deadlocks
* -514 + drbd 8.4.9 module from elrepo => ok
* -693 without drbd => ok

As it was clear that the root cause was -693 + 8.4.10, we rolled back to -514 + 8.4.9. The issue is now gone; load never exceeds 1 or 2 under heavy I/O.

I would recommend removing the kmod-drbd84-8.4.10 package until this behaviour is fixed. |
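The report identifies stuck tasks by grepping `ps -efl` output for the D (uninterruptible sleep) state. A small sketch of the same check that matches on the state column itself rather than a substring (the `-eo` column selection here is my own choice, not part of the original report):

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D), the state the
# xfsaild threads above were stuck in. Matching on the dedicated
# state column avoids false positives from "grep ' D'" hitting
# other fields that happen to contain " D".
ps -eo state,pid,ppid,comm,wchan | awk 'NR == 1 || $1 ~ /^D/'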
Tags | No tags attached. | ||||
Attached Files | Centreon.png |
Reported upstream | |||||
Note 0005525 | pperry | 2017-10-01 08:13 |
Acknowledged. Are you aware of any similar issues with drbd-8.4.10 reported upstream? |
Note 0005526 | slyce | 2017-10-01 08:29 |
I checked and found almost nothing. They are more focused on 9.0, and as 8.4.10 is recent and was only just patched for -693 (RHEL/CentOS 7.4), I don't think there is much field feedback yet. |
Note 0005527 | pperry | 2017-10-01 09:21 |
Acknowledged. I just had a quick google and didn't find anything that looked related. I'll consult with my colleague, and we will most likely demote the package back to the testing repo. You should probably report the issue upstream to Linbit. |
Note 0005528 | slyce | 2017-10-01 10:33 |
FYI, this is the single command line I used to run my tests: `dd if=/dev/urandom of=sample.txt bs=64M count=100` |
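A scaled-down sketch of that test (the byte counts here are reduced for illustration; the original run wrote roughly 6.4 GB, and `conv=fsync` is my addition, not part of the original command):

```shell
#!/bin/sh
# Reduced version of the reproduction test. The original was
#   dd if=/dev/urandom of=sample.txt bs=64M count=100
# i.e. ~6.4 GB of random data written onto the DRBD-backed
# filesystem. bs/count are shrunk here so the sketch runs anywhere;
# conv=fsync (an addition) makes dd flush to disk before exiting.
dd if=/dev/urandom of=sample.txt bs=1M count=16 conv=fsync
rm -f sample.txt
```

On the affected -693 + 8.4.10 setup, the reporter observed the full-size run driving the load average past 100 and pushing xfsaild threads into the D state.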
Note 0005529 | toracat | 2017-10-01 10:40 |
We are waiting to hear from other drbd84 + el7.4 users. In the meantime, please do report this issue to the drbd developers. |
Note 0005563 | slyce | 2017-10-26 12:42 |
We finally downgraded all our clusters to kernel -514 and the drbd 8.4.9 module, and all our issues have gone away... No more deadlocks. |
Date Modified | Username | Field | Change |
---|---|---|---|
2017-10-01 08:03 | slyce | New Issue | |
2017-10-01 08:03 | slyce | Status | new => assigned |
2017-10-01 08:03 | slyce | Assigned To | => pperry |
2017-10-01 08:04 | slyce | File Added: Centreon.png | |
2017-10-01 08:13 | pperry | Note Added: 0005525 | |
2017-10-01 08:29 | slyce | Note Added: 0005526 | |
2017-10-01 09:21 | pperry | Note Added: 0005527 | |
2017-10-01 10:33 | slyce | Note Added: 0005528 | |
2017-10-01 10:40 | toracat | Note Added: 0005529 | |
2017-10-26 12:42 | slyce | Note Added: 0005563 |