hey dchinner... I'm following up on https://www.spinics.net/lists/linux-xfs/msg13544.html
We were finally able to run xfs_repair on one of the bad volumes and got this .... http://paste.ubuntu.com/26315462/
I'll follow up in the thread as well.
Is there an xfs off-by-one bug that was fixed in the mainline/stable kernels since 4.10.10 that resolves "freeblk count 3 != flcount 4 in ag 1012"?
What typically causes those errors?
fyi... I've looked but haven't seen anything obvious.. I was hoping you guys had an idea.
v5 fs? possibly created before 4.5?
if so, then possibly possible
... can you elaborate?
<-- yocum (~yocum@70.59.184.10) has quit (Quit: Remote host closed the connection)
it was running 4.10.10 when we hit this.
chiluk: where did the filesystem come from?
how do I check fs version?
what xfsprogs version are you using?
dchinner: I'm not sure what you are asking... when the machine was deployed it was deployed on a 2.5 GB lvm volume, and then grown..
whatever comes with cent7
yeah, there's your problem
dchinner: is there something you can point me to that outlines "my problem", commit id ...etc?
if you are running centos, you need to use centos kernels and xfsprogs
if you are running a mainline kernel, you need to use mainline xfsprogs
this is the problematic issue: commit 96f859d52bcb ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
that commit went into the 4.5 kernels
(fubar'd structure padding means the agfl size changed in 4.5 on .... 64bit kernels?)
alright ... so that explains all the xfs_repair complaints
but that doesn't explain the "XFS (dm-4): Internal error XFS_WANT_CORRUPTED_GOTO at line 3505 of file fs/xfs/libxfs/xfs_btree.c. Caller xfs_free_ag_extent+0x35d/0x7a0 [xfs]"
it hasn't been backported to RHEL/centos kernels for compatibility reasons
that could be anything
so the thought is that the filesystem gets created with the default 3.10 cent7 kernel + cent7 mkfs... then upgraded to 4.10... and we start hitting this issue? Am I understanding this?
that's the vector that can cause it
you need to run xfs_repair from xfsprogs >= 4.7.0 to fix it up
dchinner: we are hitting this in our production cluster once a week or so.
well... mounted with the default 3.10 kernel, then later remounted on 4.5+
and, trickily, only if the active part of the agfl goes near the end
ok let me check to see if we are formatting using 3.10 first... we might be deploying with 4.10 out of the gate.
(just in case you're making xfs images with a system having a 3.10 kernel and then deploying them to machines that boot 4.10)
we are not... deploys are done using puppet + chef + kickstart.
yeah, it has nothing to do with the deployed kernel - it's about where the filesystem image being deployed was made in the first place
-*- dchinner needs to resurrect the old patches he had that automatically detected this condition and fixed it.
djwong: I suspect this is a good case for agfl scrub + repair at mount time :P
yeah I'd second that.
dchinner, urk sorry missed this, now need to run an errand, I'll try to catch you tonight. mostly just wanted to coordinate on changes to your small mkfs series
(i.e. after the second phase of journal recovery, before EFIs and intents are processed)
sandeen: no worries
Alright, so the recommendation would be to upgrade xfsprogs, and run xfs_repair after the kernel upgrade before the mount under the new kernel.
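
For context on commit 96f859d52bcb mentioned above, here is a minimal, self-contained C sketch of the sizing bug it fixed. The structs below only mirror the v5 on-disk AGFL header layout for illustration and are not the kernel's own definitions; the 512-byte sector size is an assumption to make the arithmetic concrete.

    /*
     * Sketch of the AGFL sizing off-by-one fixed by commit 96f859d52bcb.
     * The field layout mirrors the v5 AGFL header (magic, seqno, uuid,
     * lsn, crc, then free-list slots), purely as an illustration.
     */
    #include <stdio.h>
    #include <stdint.h>

    /*
     * Unpacked: on LP64 the 64-bit lsn gives the struct 8-byte alignment,
     * so its size is padded from 36 up to 40 bytes.  On typical 32-bit x86
     * builds uint64_t is only 4-byte aligned, so the same struct stays at
     * 36 bytes there, which is how 32-bit and 64-bit kernels could end up
     * disagreeing about the slot count.
     */
    struct agfl_unpacked {
        uint32_t magicnum;
        uint32_t seqno;
        uint8_t  uuid[16];
        uint64_t lsn;
        uint32_t crc;
        uint32_t bno[];     /* free-list slots fill the rest of the sector */
    };

    /* Packed: always 36 bytes, which is what is actually on disk. */
    struct agfl_packed {
        uint32_t magicnum;
        uint32_t seqno;
        uint8_t  uuid[16];
        uint64_t lsn;
        uint32_t crc;
        uint32_t bno[];
    } __attribute__((packed));

    int main(void)
    {
        size_t sectsize = 512;  /* assumed sector size */

        /* XFS_AGFL_SIZE-style arithmetic: slots = space after header / 4 */
        printf("unpacked header: %zu bytes, %zu slots\n",
               sizeof(struct agfl_unpacked),
               (sectsize - sizeof(struct agfl_unpacked)) / sizeof(uint32_t));
        printf("packed header:   %zu bytes, %zu slots\n",
               sizeof(struct agfl_packed),
               (sectsize - sizeof(struct agfl_packed)) / sizeof(uint32_t));
        return 0;
    }

On an LP64 build this prints 40 bytes / 118 slots for the unpacked layout versus 36 bytes / 119 slots for the packed one, so a pre-4.5 64-bit kernel and a 4.5+ kernel (or newer xfsprogs) disagree by exactly one AGFL slot per sector.
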
--> yocum (~yocum@2607:fb90:4b14:fba0:10b:5576:706e:1fa4) has joined #xfs
you don't need to upgrade the kernel to run the newer xfs_repair
just don't mount it on an older kernel after running the newer repair.
--> navidr (uid112413@gateway/web/irccloud.com/x-wxhyyksnbqkzqeia) has joined #xfs
djwong: I really need to go back and update and test those AGFL patches again :/
yeah I'm not even sure the approach I took in that patch set is the right way to do it anymore, either....
at this point i suspect it might be easier to stuff it in scrub/agheader.c as one of the repair functions tbh
as i read the other functions i started wondering if i ought to just shove it in the online repair patch set for 4.17
then the only problem is, do we read/fix every agfl on every mount?
(i guess at this point we dig through every AG's refcountbt on mount...)
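
To make dchinner's earlier "only if the active part of the agfl goes near the end" remark concrete, here is a small illustrative C check in the spirit of the detect-and-fix patches being discussed. The flfirst/fllast/flcount names echo the AGF fields, but the function and the sample numbers are invented for illustration; this is not the actual scrub or repair code.

    #include <stdio.h>

    /*
     * Active slots between flfirst and fllast on a circular free list
     * with 'size' slots in the AGFL block.
     */
    static unsigned int active_slots(unsigned int flfirst, unsigned int fllast,
                                     unsigned int size)
    {
        if (fllast >= flfirst)
            return fllast - flfirst + 1;
        return size - flfirst + fllast + 1;   /* list wraps past the end */
    }

    /* Compare the computed active count against the recorded flcount under
     * both the unpacked (118-slot) and packed (119-slot) sizing. */
    static void check(unsigned int flfirst, unsigned int fllast,
                      unsigned int flcount)
    {
        unsigned int sizes[] = { 118, 119 };

        for (int i = 0; i < 2; i++) {
            unsigned int active = active_slots(flfirst, fllast, sizes[i]);

            printf("%u slots: freeblk count %u %s flcount %u\n",
                   sizes[i], active, active == flcount ? "==" : "!=",
                   flcount);
        }
    }

    int main(void)
    {
        /* Free list nowhere near the end of the AGFL: both sizings agree. */
        check(10, 13, 4);

        /* Free list wrapped around the end, as written by a 119-slot
         * (packed) kernel: a 118-slot reader computes one slot too few. */
        check(117, 1, 4);
        return 0;
    }

In the wrapped case the 118-slot side reports freeblk count 3 != flcount 4, the same shape of complaint as the xfs_repair output linked above, while the unwrapped case looks clean under either sizing, which is why the mismatch only surfaces intermittently.
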