View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001450 | channel: kernel/el8 | --kernel--OTHER-- | public | 2024-05-12 21:26 | 2024-08-07 13:16 |
Reporter | T.Kabu | Assigned To | toracat | ||
Priority | normal | Severity | minor | Reproducibility | sometimes |
Status | resolved | Resolution | no change required | ||
Summary | 0001450: I'm having trouble booting a Linux VM on top of a Linux KVM. | ||||
Description | Hello, everyone, and thank you for always providing us with the latest kernel.

We recently updated the kernel on the HOST and all VMs to 6.8.x. Since then, whenever five or more VMs are running, the VMs deadlock during or after startup. Up to four VMs start and keep running without any problems. If the fifth and later VMs boot kernel 6.7 or 6.6 instead, there is no problem.

Did something change in the kernel config between 6.7 and 6.8? Or is there a bug in kernel 6.8?

HOST 6.8.9-1.el8.elrepo.x86_64
VM001 6.8.9-1.el8.elrepo.x86_64
VM002 6.8.9-1.el9.elrepo.x86_64
VM003 6.8.9-1.el7.elrepo.x86_64
VM004 6.8.9-1.el7.elrepo.x86_64
VM005 6.8.9-1.el8.elrepo.x86_64 DeadLock!!
VM006 6.8.9-1.el8.elrepo.x86_64 DeadLock!!

Since we have no choice, we are currently...

VM005 6.7.9-1.el8.elrepo.x86_64
VM006 6.6.1-1.el8.elrepo.x86_64

Regards.
T.Kabu | ||||
Tags | centos, crash, kernel-ml | ||||
|
You can see the config changes going from 6.7 to 6.8 in our git.
el7: https://github.com/elrepo/kernel/commit/15291c5242d5885fa788de42300e0acb2d571a65
el8: https://github.com/elrepo/kernel/commit/6e68d511d96f80086d64b57d7fcb5fd21785afc8
el9: https://github.com/elrepo/kernel/commit/ead38c65089d27e637f981c52d4e11f127849fad |
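
For comparing two kernel config files locally, a minimal Python sketch follows. It is not part of ELRepo's tooling. The file names `config-6.7` and `config-6.8` in the usage comment are hypothetical, and the parser assumes the usual `.config` layout (`CONFIG_X=value` lines, with disabled options written as `# CONFIG_X is not set`).

```python
import sys


def parse_kconfig(text):
    """Return {option: value} parsed from the text of a kernel .config file."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            # Disabled options are recorded as comments; treat them as "n".
            name = line[2:-len(" is not set")]
            options[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(old_text, new_text):
    """Return sorted (option, old_value, new_value) tuples for options that differ.

    An option missing from one side is reported with None on that side.
    """
    old, new = parse_kconfig(old_text), parse_kconfig(new_text)
    changes = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes.append((name, before, after))
    return changes


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python kconfig_diff.py config-6.7 config-6.8
    with open(sys.argv[1]) as f_old, open(sys.argv[2]) as f_new:
        for name, before, after in diff_kconfig(f_old.read(), f_new.read()):
            print(f"{name}: {before} -> {after}")
```

Filtering the result for names containing `KVM`, `VIRTIO`, or `DRM` may help narrow the list to virtualization-related options.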
|
Thanks for the info. https://github.com/elrepo/kernel/commit/6e68d511d96f80086d64b57d7fcb5fd21785afc8 I will take a look at the information here and make a kernel with modified VM-related and KVM-related parameters. Hopefully I can figure something out and report back. T.Kabu |
|
I built kernel version 6.8.9 using the config from kernel version 6.7.9. When I installed this kernel in VM005 and booted it, it did not deadlock!! :-) I am still trying to determine the cause, but clearly one of the kernel options seems to be causing the deadlock. T.Kabu |
|
Can you by any chance test the latest kernel 6.9.0? It is currently in the elrepo-testing repository. |
|
Ok, I'll give it a try. |
|
I said that rebuilding kernel version 6.8.9 with the 6.7.9 config avoided the deadlock, but that was a mistake. It still deadlocks once every few boots. I found this out after running DESTROY and START dozens of times. I still have not found the cause... T.Kabu |
|
Hmmm...
> error: Failed build dependencies:
> gcc-toolset-9 is needed by kernel-ml-6.9.0-1.el8.x86_64
> gcc-toolset-9-binutils is needed by kernel-ml-6.9.0-1.el8.x86_64
> gcc-toolset-9-runtime is needed by kernel-ml-6.9.0-1.el8.x86_64
dnf -y install gcc-toolset-9 gcc-toolset-9-binutils gcc-toolset-9-runtime
rpmbuild --rebuild kernel-ml-6.9.0-1.el8.elrepo.nosrc.rpm
> Warning: Kernel ABI header differences:
> diff -u tools/include/uapi/linux/vhost.h include/uapi/linux/vhost.h
> diff -u tools/include/linux/bits.h include/linux/bits.h
> diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
> diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h
> Makefile.config:448: *** No gnu/libc-version.h found, please install glibc-dev[el]. Stop.
> make[1]: *** [Makefile.perf:264: sub-make] Error 2
> make: *** [Makefile:70: all] Error 2
> error: Bad exit status from /var/tmp/rpm-tmp.q6oH2D (%build) |
|
You can ignore the warning. And do as the error message tells you to do. Try installing the glibc-devel package and see how it goes. |
|
Thanks for your follow-up. But ...
# dnf -y install glibc-devel
Last metadata expiration check: 2:08:42 ago on Fri 17 May 2024 07:30:34 AM JST.
Package glibc-devel-2.28-251.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
Is it Old!? |
|
You seem to be running CentOS Stream. I suggest you use mock to build the kernel. |
|
I have been able to identify the cause of my problem, thanks to advice and follow-up from various people.

I mainly run a number of VMs on a Dell R420 or R610 with CentOS Stream + KVM installed. For some time now, my VMs have occasionally deadlocked while booting after being updated to the latest kernel (kernel-ml). I was not too concerned about it, because a VM would boot without any problems after a few attempts. For the past month, however, my VMs have been deadlocking about once every two boots with the latest kernel (6.8.x), and that became a real problem.

So I asked the question here, and while diffing the configs and rebuilding the kernel, I noticed something. Every once in a while, the following message appeared while a VM was booting.

> [ 7.962698] ------------[ cut here ]------------
> [ 7.962704] workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]
> [ 7.962822] WARNING: CPU: 0 PID: 471 at kernel/workqueue.c:2970 check_flush_dependency+0x10c/0x130
> [ 7.962850] Modules linked in: rfkill(E) ip6t_REJECT(E) nf_reject_ipv6(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_compat(E) nf_tables(E) libcrc32c(E) nfnetlink(E) kvm_intel(E) kvm(E) iTCO_wdt(E) irqbypass(E) intel_pmc_bxt(E) crct10dif_pclmul(E) crc32_pclmul(E) iTCO_vendor_support(E) polyval_generic(E) ghash_clmulni_intel(E) sha512_ssse3(E) joydev(E) i2c_i801(E) pcspkr(E) i2c_smbus(E) virtio_balloon(E) lpc_ich(E) ext4(E) mbcache(E) jbd2(E) qxl(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) ahci(E) libahci(E) libata(E) drm(E) crc32c_intel(E) serio_raw(E) virtio_blk(E) virtio_net(E) net_failover(E) failover(E) virtio_console(E) fuse(E)
> [ 7.962898] CPU: 0 PID: 471 Comm: kworker/u8:2 Tainted: G E 6.8.9-1.el8.x86_64 #1
> [ 7.962901] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module_el8+603+e0ca2c01 04/01/2014
> [ 7.962904] Workqueue: ttm ttm_bo_delayed_delete [ttm]
> [ 7.962917] RIP: 0010:check_flush_dependency+0x10c/0x130
> [ 7.962920] Code: ff ff 49 8b 55 18 48 8d 8b b0 00 00 00 49 89 e8 48 81 c6 b0 00 00 00 48 c7 c7 58 44 c8 90 c6 05 e9 2f c0 01 01 e8 64 ab fd ff <0f> 0b e9 0c ff ff ff 80 3d d7 2f c0 01 00 75 95 e9 46 ff ff ff 66
> [ 7.962922] RSP: 0000:ffffad9cc057bcf0 EFLAGS: 00010082
> [ 7.962924] RAX: 0000000000000000 RBX: ffff94708004ee00 RCX: 0000000000000027
> [ 7.962926] RDX: 0000000000000027 RSI: 0000000000000002 RDI: ffff9470fbc216c8
> [ 7.962927] RBP: ffffffffc04e45d0 R08: ffffffff91061ca0 R09: 0000000000000000
> [ 7.962928] R10: ffffffff91e5952f R11: 000000000000016c R12: ffff947082d76100
> [ 7.962929] R13: ffff947085d23b40 R14: ffff9470fbc33080 R15: ffff947085d23b01
> [ 7.962930] FS: 0000000000000000(0000) GS:ffff9470fbc00000(0000) knlGS:0000000000000000
> [ 7.962932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7.962933] CR2: 00007fc791f4bcb0 CR3: 0000000109fa4006 CR4: 0000000000020ef0
> [ 7.962937] Call Trace:
> [ 7.962949] <TASK>
> [ 7.962952] ? __warn+0x81/0x140
> [ 7.962957] ? check_flush_dependency+0x10c/0x130
> [ 7.962960] ? report_bug+0xf8/0x1e0
> [ 7.962964] ? _raw_spin_unlock_irqrestore+0xa/0x40
> [ 7.962967] ? handle_bug+0x44/0x70
> [ 7.962970] ? exc_invalid_op+0x13/0x60
> [ 7.962973] ? asm_exc_invalid_op+0x16/0x20
> [ 7.962977] ? __pfx_qxl_gc_work+0x10/0x10 [qxl]
> [ 7.962985] ? check_flush_dependency+0x10c/0x130
> [ 7.962987] __flush_work+0x9a/0x250
> [ 7.962990] ? _raw_spin_unlock+0xa/0x30
> [ 7.962992] ? __queue_work+0x100/0x3f0
> [ 7.962995] qxl_queue_garbage_collect+0x54/0x60 [qxl]
> [ 7.963009] qxl_fence_wait+0x9a/0x170 [qxl]
> [ 7.963020] dma_fence_wait_timeout+0x4a/0x110
> [ 7.963031] dma_resv_wait_timeout+0x67/0xd0
> [ 7.963037] ttm_bo_delayed_delete+0x2e/0x90 [ttm]
> [ 7.963056] process_scheduled_works+0x22e/0x360
> [ 7.963059] worker_thread+0x143/0x2a0
> [ 7.963061] ? __pfx_worker_thread+0x10/0x10
> [ 7.963064] kthread+0xe4/0x110
> [ 7.963067] ? __pfx_kthread+0x10/0x10
> [ 7.963068] ret_from_fork+0x30/0x40
> [ 7.963075] ? __pfx_kthread+0x10/0x10
> [ 7.963077] ret_from_fork_asm+0x1b/0x30
> [ 7.963083] </TASK>
> [ 7.963084] ---[ end trace 0000000000000000 ]---

> [ 6.077835] ------------[ cut here ]------------
> [ 6.078356] workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]
> [ 6.078411] WARNING: CPU: 0 PID: 468 at kernel/workqueue.c:2970 check_flush_dependency+0x10c/0x130
> [ 6.079386] Modules linked in: ipt_REJECT(E) ip6t_REJECT(E) nf_reject_ipv6(E) nf_reject_ipv4(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_compat(E) nf_tables(E) libcrc32c(E) nfnetlink(E) kvm_intel(E) kvm(E) irqbypass(E) iTCO_wdt(E) crct10dif_pclmul(E) crc32_pclmul(E) intel_pmc_bxt(E) iTCO_vendor_support(E) polyval_generic(E) ghash_clmulni_intel(E) sha512_ssse3(E) joydev(E) pcspkr(E) lpc_ich(E) i2c_i801(E) virtio_balloon(E) i2c_smbus(E) ext4(E) mbcache(E) jbd2(E) qxl(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) ahci(E) libahci(E) drm(E) libata(E) crc32c_intel(E) virtio_net(E) serio_raw(E) net_failover(E) virtio_console(E) virtio_blk(E) failover(E) fuse(E)
> [ 6.082615] CPU: 0 PID: 468 Comm: kworker/u8:1 Tainted: G E 6.8.9-1.el8.x86_64 #1
> [ 6.083212] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module_el8+603+e0ca2c01 04/01/2014
> [ 6.083778] Workqueue: ttm ttm_bo_delayed_delete [ttm]
> [ 6.084393] RIP: 0010:check_flush_dependency+0x10c/0x130
> [ 6.084398] Code: ff ff 49 8b 55 18 48 8d 8b b0 00 00 00 49 89 e8 48 81 c6 b0 00 00 00 48 c7 c7 58 44 68 a9 c6 05 e9 2f c0 01 01 e8 64 ab fd ff <0f> 0b e9 0c ff ff ff 80 3d d7 2f c0 01 00 75 95 e9 46 ff ff ff 66
> [ 6.084399] RSP: 0000:ffffc0c3806f7cf0 EFLAGS: 00010082
> [ 6.084401] RAX: 0000000000000000 RBX: ffff9ffdc004ee00 RCX: 0000000000000027
> [ 6.084403] RDX: 0000000000000027 RSI: 00000000ffff7fff RDI: ffff9ffe3bc216c8
> [ 6.084404] RBP: ffffffffc051d5d0 R08: 0000000000000000 R09: c0000000ffff7fff
> [ 6.084405] R10: 0000000000000001 R11: ffffc0c3806f7b08 R12: ffff9ffdc8ce9840
> [ 6.084406] R13: ffff9ffe27390d80 R14: ffff9ffe3bc33080 R15: ffff9ffe27390d01
> [ 6.084407] FS: 0000000000000000(0000) GS:ffff9ffe3bc00000(0000) knlGS:0000000000000000
> [ 6.084409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6.084410] CR2: 00005591a8aaafa0 CR3: 000000010472a005 CR4: 0000000000020ef0
> [ 6.084414] Call Trace:
> [ 6.084423] <TASK>
> [ 6.084426] ? __warn+0x81/0x140
> [ 6.084431] ? check_flush_dependency+0x10c/0x130
> [ 6.084433] ? report_bug+0xf8/0x1e0
> [ 6.084439] ? console_unlock+0x51/0xf0
> [ 6.095799] ? handle_bug+0x44/0x70
> [ 6.095804] ? exc_invalid_op+0x13/0x60
> [ 6.095807] ? asm_exc_invalid_op+0x16/0x20
> [ 6.095812] ? __pfx_qxl_gc_work+0x10/0x10 [qxl]
> [ 6.095819] ? check_flush_dependency+0x10c/0x130
> [ 6.095822] __flush_work+0x9a/0x250
> [ 6.095825] ? _raw_spin_unlock+0xa/0x30
> [ 6.095827] ? __queue_work+0x100/0x3f0
> [ 6.095829] qxl_queue_garbage_collect+0x54/0x60 [qxl]
> [ 6.095838] qxl_fence_wait+0x9a/0x170 [qxl]
> [ 6.095845] dma_fence_wait_timeout+0x4a/0x110
> [ 6.095851] dma_resv_wait_timeout+0x67/0xd0
> [ 6.095853] ttm_bo_delayed_delete+0x2e/0x90 [ttm]
> [ 6.095864] process_scheduled_works+0x22e/0x360
> [ 6.095866] worker_thread+0x143/0x2a0
> [ 6.095868] ? __pfx_worker_thread+0x10/0x10
> [ 6.095870] kthread+0xe4/0x110
> [ 6.095873] ? __pfx_kthread+0x10/0x10
> [ 6.095875] ret_from_fork+0x30/0x40
> [ 6.095878] ? __pfx_kthread+0x10/0x10
> [ 6.095879] ret_from_fork_asm+0x1b/0x30
> [ 6.095883] </TASK>
> [ 6.095884] ---[ end trace 0000000000000000 ]---

Searching for “DRM”, “TTM”, and “QXL” from this message, I found that others were also having trouble with the (nosy) DRM feature. So I rebuilt the kernel with the option “CONFIG_DRM=n” and installed it in my VM. To my surprise, it did not deadlock after dozens of starts. (But my VM's VNC screen became narrower :-)

I have blacklisted the kernel module from loading, because my VMs (WEB, DB and DNS) will not need a wide framebuffer screen in the future.

Thanks, everyone. I hope DRM and its related code will improve.

(BTW, VMs with “type=‘vga’” in KVM do not have the problem, but those with 'qxl' or 'cirrus' sometimes deadlock.)

T.Kabu

P.S. I'll try MOCK later!! |
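
To check whether a guest's boot log contains the same kind of workqueue warning quoted above, a small Python sketch like the following could be used. It only matches the `workqueue: WQ_MEM_RECLAIM ... is flushing !WQ_MEM_RECLAIM ...` line as it appears in this note; the function and variable names are hypothetical, and how the log text is obtained (for example, saved `dmesg` output) is left to the reader.

```python
import re
from collections import Counter

# Matches the warning line as quoted in this note, e.g.:
#   workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm]
#   is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]
# The "[module]" suffix is treated as optional.
FLUSH_WARNING = re.compile(
    r"workqueue: WQ_MEM_RECLAIM (?P<src_wq>[^\s:]+):(?P<src_fn>\S+)"
    r"(?: \[(?P<src_mod>\w+)\])?"
    r" is flushing !WQ_MEM_RECLAIM (?P<dst_wq>[^\s:]+):(?P<dst_fn>\S+)"
    r"(?: \[(?P<dst_mod>\w+)\])?"
)


def find_flush_warnings(log_text):
    """Count each (flusher, flushed) pair found in the log text.

    Each side is a (workqueue, function, module) tuple; module is None
    when the log line carries no "[module]" suffix.
    """
    counts = Counter()
    for m in FLUSH_WARNING.finditer(log_text):
        flusher = (m["src_wq"], m["src_fn"], m["src_mod"])
        flushed = (m["dst_wq"], m["dst_fn"], m["dst_mod"])
        counts[(flusher, flushed)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[ 7.962704] workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] "
        "is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]\n"
    )
    for (flusher, flushed), n in find_flush_warnings(sample).items():
        print(n, flusher, "->", flushed)
```

Counting the pairs across several boots may help confirm whether the warning always involves the same two modules.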
|
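
The note above mentions that guests with video `type='vga'` were not affected, while 'qxl' and 'cirrus' guests sometimes deadlocked. As a rough aid, the Python sketch below lists the video model types found in a libvirt domain XML document (for instance, text saved from `virsh dumpxml`). The `SUSPECT_MODELS` set is an assumption taken only from the reporter's observation in this thread, not a confirmed list, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: based solely on the reporter's observation in this issue.
SUSPECT_MODELS = {"qxl", "cirrus"}


def video_models(domain_xml):
    """Return the 'type' attribute of every <video><model> in a domain XML string."""
    root = ET.fromstring(domain_xml)
    return [model.get("type") for model in root.findall("./devices/video/model")]


def suspect_video_models(domain_xml):
    """Return only the video model types that the reporter saw deadlocks with."""
    return [t for t in video_models(domain_xml) if t in SUSPECT_MODELS]


if __name__ == "__main__":
    sample = (
        "<domain type='kvm'><name>VM005</name><devices>"
        "<video><model type='qxl'/></video>"
        "</devices></domain>"
    )
    print(video_models(sample), suspect_video_models(sample))
```

An empty result from `suspect_video_models` only means the guest does not use one of the two models named in this report; it does not prove the guest is unaffected.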
Closing as 'resolved/no change required'. |
Date Modified | Username | Field | Change |
---|---|---|---|
2024-05-12 21:26 | T.Kabu | New Issue | |
2024-05-12 21:26 | T.Kabu | Status | new => assigned |
2024-05-12 21:26 | T.Kabu | Assigned To | => toracat |
2024-05-12 21:26 | T.Kabu | Tag Attached: centos | |
2024-05-12 21:26 | T.Kabu | Tag Attached: crash | |
2024-05-12 21:26 | T.Kabu | Tag Attached: kernel-ml | |
2024-05-13 11:49 | toracat | Note Added: 0009727 | |
2024-05-13 21:47 | T.Kabu | Note Added: 0009735 | |
2024-05-14 22:15 | T.Kabu | Note Added: 0009741 | |
2024-05-15 14:47 | toracat | Project | channel: elrepo/el8 => channel: kernel/el8 |
2024-05-15 14:47 | toracat | Category | --elrepo--OTHER-- => General |
2024-05-15 14:48 | toracat | Note Added: 0009744 | |
2024-05-15 23:45 | T.Kabu | Note Added: 0009745 | |
2024-05-15 23:56 | T.Kabu | Note Added: 0009746 | |
2024-05-16 03:57 | T.Kabu | Note Added: 0009748 | |
2024-05-16 12:16 | toracat | Note Added: 0009749 | |
2024-05-16 20:43 | T.Kabu | Note Added: 0009751 | |
2024-05-17 13:45 | toracat | Note Added: 0009752 | |
2024-05-19 22:46 | T.Kabu | Note Added: 0009775 | |
2024-08-07 13:16 | toracat | Status | assigned => resolved |
2024-08-07 13:16 | toracat | Resolution | open => no change required |
2024-08-07 13:16 | toracat | Category | General => --kernel--OTHER-- |
2024-08-07 13:16 | toracat | Note Added: 0010011 |