View Issue Details

ID: 0001450
Project: channel: kernel/el8
Category: --kernel--OTHER--
View Status: public
Last Update: 2024-08-07 13:16
Reporter: T.Kabu
Assigned To: toracat
Priority: normal
Severity: minor
Reproducibility: sometimes
Status: resolved
Resolution: no change required
Summary: 0001450: I'm having trouble booting a Linux VM on top of a Linux KVM.
Description:
Hello, everyone.

And thank you for always providing us with the latest kernel.

We recently updated the kernel on the host and the VMs to 6.8.x.
Since then, when five or more VMs are running, the fifth and later VMs deadlock during or after startup.
Up to four VMs start and keep running without any problems.

By the way, if the fifth and later VMs are booted with kernel 6.7 or 6.6, there is no problem.

Did something change in the kernel config between 6.7 and 6.8?
Or is there a bug in kernel 6.8?

HOST 6.8.9-1.el8.elrepo.x86_64

VM001 6.8.9-1.el8.elrepo.x86_64
VM002 6.8.9-1.el9.elrepo.x86_64
VM003 6.8.9-1.el7.elrepo.x86_64
VM004 6.8.9-1.el7.elrepo.x86_64

VM005 6.8.9-1.el8.elrepo.x86_64 DeadLock!!
VM006 6.8.9-1.el8.elrepo.x86_64 DeadLock!!

Since we have no choice, we are currently...

VM005 6.7.9-1.el8.elrepo.x86_64
VM006 6.6.1-1.el8.elrepo.x86_64


Regards.

T.Kabu
Tags: centos, crash, kernel-ml

Activities

toracat

2024-05-13 11:49

administrator   ~0009727

You can see the config changes going from 6.7 to 6.8 in our git.

el7: https://github.com/elrepo/kernel/commit/15291c5242d5885fa788de42300e0acb2d571a65
el8: https://github.com/elrepo/kernel/commit/6e68d511d96f80086d64b57d7fcb5fd21785afc8
el9: https://github.com/elrepo/kernel/commit/ead38c65089d27e637f981c52d4e11f127849fad

T.Kabu

2024-05-13 21:47

reporter   ~0009735

Thanks for the info.

https://github.com/elrepo/kernel/commit/6e68d511d96f80086d64b57d7fcb5fd21785afc8

I will take a look at the information here and make a kernel with modified VM-related and KVM-related parameters.
Hopefully I can figure something out and report back.

T.Kabu

T.Kabu

2024-05-14 22:15

reporter   ~0009741

I built kernel 6.8.9 with the config from kernel 6.7.9.
When I installed this kernel in VM005 and booted it, it did not deadlock!! :-)
I am still trying to determine the cause, but clearly one of the kernel options seems to be the cause of the deadlock.
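
One way to narrow down which option is involved is to compare the two config files option by option. The following is a minimal sketch, assuming both configs are available as plain text (for example the config files installed under /boot); the function names `parse_config` and `diff_configs` are illustrative and not part of any kernel tooling. The kernel source tree also ships `scripts/diffconfig` for the same purpose.

```python
import re

# "CONFIG_FOO=value" lines and "# CONFIG_FOO is not set" lines.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value} for a kernel .config; unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return sorted (option, old_value, new_value) tuples for options that differ.

    An option that is absent from one side is reported with the value None.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes.append((name, old.get(name), new.get(name)))
    return changes
```

Reading the 6.7.9 and 6.8.9 config files and passing their contents to `diff_configs` gives a list of candidates that can then be reverted one group at a time.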

T.Kabu

toracat

2024-05-15 14:48

administrator   ~0009744

Can you by any chance test the latest kernel 6.9.0? It is currently in the elrepo-testing repository.

T.Kabu

2024-05-15 23:45

reporter   ~0009745

Ok, I'll give it a try.

T.Kabu

2024-05-15 23:56

reporter   ~0009746

I said that kernel 6.8.9 rebuilt with the 6.7.9 config would not deadlock, but that was a mistake.
It still deadlocks once every few boots.
I found this out after running DESTROY and START dozens of times.
I still have not found the cause of this...
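
Because the deadlock is intermittent, a single successful boot proves little, so the destroy/start cycle is worth automating. The following is a rough sketch, assuming the guests are managed with libvirt's `virsh` and that a guest counts as "up" when a TCP connection to it succeeds; `guest_reachable`, `cycle_vm`, the port, and the wait time are all hypothetical choices, not something taken from this report.

```python
import socket
import subprocess
import time


def guest_reachable(host, port=22, timeout=3.0):
    """Hypothetical health check: True if a TCP connection to the guest succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def cycle_vm(domain, attempts, is_up, boot_wait=60.0,
             run=subprocess.run, sleep=time.sleep):
    """Destroy and start `domain` repeatedly; return the attempt numbers that failed."""
    failures = []
    for attempt in range(1, attempts + 1):
        # "virsh destroy" is a hard power-off; the domain may already be
        # stopped, so a non-zero exit status is tolerated here.
        run(["virsh", "destroy", domain], check=False)
        run(["virsh", "start", domain], check=True)
        sleep(boot_wait)  # give the guest time to boot before probing it
        if not is_up():
            failures.append(attempt)
    return failures
```

For example, `cycle_vm("VM005", 30, lambda: guest_reachable("192.0.2.15"))` would report which of 30 boots did not come up; the address is a placeholder.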

T.Kabu

T.Kabu

2024-05-16 03:57

reporter   ~0009748

Hmmm...

> error: Failed build dependencies:
> gcc-toolset-9 is needed by kernel-ml-6.9.0-1.el8.x86_64
> gcc-toolset-9-binutils is needed by kernel-ml-6.9.0-1.el8.x86_64
> gcc-toolset-9-runtime is needed by kernel-ml-6.9.0-1.el8.x86_64

dnf -y install gcc-toolset-9 gcc-toolset-9-binutils gcc-toolset-9-runtime

rpmbuild --rebuild kernel-ml-6.9.0-1.el8.elrepo.nosrc.rpm

> Warning: Kernel ABI header differences:
> diff -u tools/include/uapi/linux/vhost.h include/uapi/linux/vhost.h
> diff -u tools/include/linux/bits.h include/linux/bits.h
> diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
> diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h
> Makefile.config:448: *** No gnu/libc-version.h found, please install glibc-dev[el]. Stop.
> make[1]: *** [Makefile.perf:264: sub-make] Error 2
> make: *** [Makefile:70: all] Error 2
> error: Bad exit status from /var/tmp/rpm-tmp.q6oH2D (%build)

toracat

2024-05-16 12:16

administrator   ~0009749

You can ignore the warning.

And do as the error message tells you to do. Try installing the glibc-devel package and see how it goes.

T.Kabu

2024-05-16 20:43

reporter   ~0009751

Thanks for your follow-up. But ...

# dnf -y install glibc-devel
Last metadata expiration check: 2:08:42 ago on Fri 17 May 2024 07:30:34 AM JST.
Package glibc-devel-2.28-251.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!

Is it too old!?

toracat

2024-05-17 13:45

administrator   ~0009752

You seem to be running CentOS Stream. I suggest you use mock to build the kernel.

T.Kabu

2024-05-19 22:46

reporter   ~0009775

I have been able to identify the cause of the problems I am having thanks to various advice and follow up from different people.

I am mainly running a number of VMs on a Dell R420 or R610 with CentOS Stream+KVM installed.

For some time now, I have been encountering a problem where VMs updated to the latest kernel (kernel-ml) deadlock while booting.

However, I was not too concerned about it, because after a few boot attempts the VM would start without any problems.

For the past month, though, my VMs have been deadlocking on roughly every other boot with the latest kernel (6.8.x), and that was causing me trouble.

So I asked the question here, and while diffing the configs and rebuilding the kernel, I noticed something.

Every once in a while, the following message appears while the VM is booting.

> [ 7.962698] ------------[ cut here ]------------
> [ 7.962704] workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]
> [ 7.962822] WARNING: CPU: 0 PID: 471 at kernel/workqueue.c:2970 check_flush_dependency+0x10c/0x130
> [ 7.962850] Modules linked in: rfkill(E) ip6t_REJECT(E) nf_reject_ipv6(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_compat(E) nf_tables(E) libcrc32c(E) nfnetlink(E) kvm_intel(E) kvm(E) iTCO_wdt(E) irqbypass(E) intel_pmc_bxt(E) crct10dif_pclmul(E) crc32_pclmul(E) iTCO_vendor_support(E) polyval_generic(E) ghash_clmulni_intel(E) sha512_ssse3(E) joydev(E) i2c_i801(E) pcspkr(E) i2c_smbus(E) virtio_balloon(E) lpc_ich(E) ext4(E) mbcache(E) jbd2(E) qxl(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) ahci(E) libahci(E) libata(E) drm(E) crc32c_intel(E) serio_raw(E) virtio_blk(E) virtio_net(E) net_failover(E) failover(E) virtio_console(E) fuse(E)
> [ 7.962898] CPU: 0 PID: 471 Comm: kworker/u8:2 Tainted: G E 6.8.9-1.el8.x86_64 #1
> [ 7.962901] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module_el8+603+e0ca2c01 04/01/2014
> [ 7.962904] Workqueue: ttm ttm_bo_delayed_delete [ttm]
> [ 7.962917] RIP: 0010:check_flush_dependency+0x10c/0x130
> [ 7.962920] Code: ff ff 49 8b 55 18 48 8d 8b b0 00 00 00 49 89 e8 48 81 c6 b0 00 00 00 48 c7 c7 58 44 c8 90 c6 05 e9 2f c0 01 01 e8 64 ab fd ff <0f> 0b e9 0c ff ff ff 80 3d d7 2f c0 01 00 75 95 e9 46 ff ff ff 66
> [ 7.962922] RSP: 0000:ffffad9cc057bcf0 EFLAGS: 00010082
> [ 7.962924] RAX: 0000000000000000 RBX: ffff94708004ee00 RCX: 0000000000000027
> [ 7.962926] RDX: 0000000000000027 RSI: 0000000000000002 RDI: ffff9470fbc216c8
> [ 7.962927] RBP: ffffffffc04e45d0 R08: ffffffff91061ca0 R09: 0000000000000000
> [ 7.962928] R10: ffffffff91e5952f R11: 000000000000016c R12: ffff947082d76100
> [ 7.962929] R13: ffff947085d23b40 R14: ffff9470fbc33080 R15: ffff947085d23b01
> [ 7.962930] FS: 0000000000000000(0000) GS:ffff9470fbc00000(0000) knlGS:0000000000000000
> [ 7.962932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7.962933] CR2: 00007fc791f4bcb0 CR3: 0000000109fa4006 CR4: 0000000000020ef0
> [ 7.962937] Call Trace:
> [ 7.962949] <TASK>
> [ 7.962952] ? __warn+0x81/0x140
> [ 7.962957] ? check_flush_dependency+0x10c/0x130
> [ 7.962960] ? report_bug+0xf8/0x1e0
> [ 7.962964] ? _raw_spin_unlock_irqrestore+0xa/0x40
> [ 7.962967] ? handle_bug+0x44/0x70
> [ 7.962970] ? exc_invalid_op+0x13/0x60
> [ 7.962973] ? asm_exc_invalid_op+0x16/0x20
> [ 7.962977] ? __pfx_qxl_gc_work+0x10/0x10 [qxl]
> [ 7.962985] ? check_flush_dependency+0x10c/0x130
> [ 7.962987] __flush_work+0x9a/0x250
> [ 7.962990] ? _raw_spin_unlock+0xa/0x30
> [ 7.962992] ? __queue_work+0x100/0x3f0
> [ 7.962995] qxl_queue_garbage_collect+0x54/0x60 [qxl]
> [ 7.963009] qxl_fence_wait+0x9a/0x170 [qxl]
> [ 7.963020] dma_fence_wait_timeout+0x4a/0x110
> [ 7.963031] dma_resv_wait_timeout+0x67/0xd0
> [ 7.963037] ttm_bo_delayed_delete+0x2e/0x90 [ttm]
> [ 7.963056] process_scheduled_works+0x22e/0x360
> [ 7.963059] worker_thread+0x143/0x2a0
> [ 7.963061] ? __pfx_worker_thread+0x10/0x10
> [ 7.963064] kthread+0xe4/0x110
> [ 7.963067] ? __pfx_kthread+0x10/0x10
> [ 7.963068] ret_from_fork+0x30/0x40
> [ 7.963075] ? __pfx_kthread+0x10/0x10
> [ 7.963077] ret_from_fork_asm+0x1b/0x30
> [ 7.963083] </TASK>
> [ 7.963084] ---[ end trace 0000000000000000 ]---


> [ 6.077835] ------------[ cut here ]------------
> [ 6.078356] workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]
> [ 6.078411] WARNING: CPU: 0 PID: 468 at kernel/workqueue.c:2970 check_flush_dependency+0x10c/0x130
> [ 6.079386] Modules linked in: ipt_REJECT(E) ip6t_REJECT(E) nf_reject_ipv6(E) nf_reject_ipv4(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_compat(E) nf_tables(E) libcrc32c(E) nfnetlink(E) kvm_intel(E) kvm(E) irqbypass(E) iTCO_wdt(E) crct10dif_pclmul(E) crc32_pclmul(E) intel_pmc_bxt(E) iTCO_vendor_support(E) polyval_generic(E) ghash_clmulni_intel(E) sha512_ssse3(E) joydev(E) pcspkr(E) lpc_ich(E) i2c_i801(E) virtio_balloon(E) i2c_smbus(E) ext4(E) mbcache(E) jbd2(E) qxl(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) ahci(E) libahci(E) drm(E) libata(E) crc32c_intel(E) virtio_net(E) serio_raw(E) net_failover(E) virtio_console(E) virtio_blk(E) failover(E) fuse(E)
> [ 6.082615] CPU: 0 PID: 468 Comm: kworker/u8:1 Tainted: G E 6.8.9-1.el8.x86_64 #1
> [ 6.083212] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module_el8+603+e0ca2c01 04/01/2014
> [ 6.083778] Workqueue: ttm ttm_bo_delayed_delete [ttm]
> [ 6.084393] RIP: 0010:check_flush_dependency+0x10c/0x130
> [ 6.084398] Code: ff ff 49 8b 55 18 48 8d 8b b0 00 00 00 49 89 e8 48 81 c6 b0 00 00 00 48 c7 c7 58 44 68 a9 c6 05 e9 2f c0 01 01 e8 64 ab fd ff <0f> 0b e9 0c ff ff ff 80 3d d7 2f c0 01 00 75 95 e9 46 ff ff ff 66
> [ 6.084399] RSP: 0000:ffffc0c3806f7cf0 EFLAGS: 00010082
> [ 6.084401] RAX: 0000000000000000 RBX: ffff9ffdc004ee00 RCX: 0000000000000027
> [ 6.084403] RDX: 0000000000000027 RSI: 00000000ffff7fff RDI: ffff9ffe3bc216c8
> [ 6.084404] RBP: ffffffffc051d5d0 R08: 0000000000000000 R09: c0000000ffff7fff
> [ 6.084405] R10: 0000000000000001 R11: ffffc0c3806f7b08 R12: ffff9ffdc8ce9840
> [ 6.084406] R13: ffff9ffe27390d80 R14: ffff9ffe3bc33080 R15: ffff9ffe27390d01
> [ 6.084407] FS: 0000000000000000(0000) GS:ffff9ffe3bc00000(0000) knlGS:0000000000000000
> [ 6.084409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6.084410] CR2: 00005591a8aaafa0 CR3: 000000010472a005 CR4: 0000000000020ef0
> [ 6.084414] Call Trace:
> [ 6.084423] <TASK>
> [ 6.084426] ? __warn+0x81/0x140
> [ 6.084431] ? check_flush_dependency+0x10c/0x130
> [ 6.084433] ? report_bug+0xf8/0x1e0
> [ 6.084439] ? console_unlock+0x51/0xf0
> [ 6.095799] ? handle_bug+0x44/0x70
> [ 6.095804] ? exc_invalid_op+0x13/0x60
> [ 6.095807] ? asm_exc_invalid_op+0x16/0x20
> [ 6.095812] ? __pfx_qxl_gc_work+0x10/0x10 [qxl]
> [ 6.095819] ? check_flush_dependency+0x10c/0x130
> [ 6.095822] __flush_work+0x9a/0x250
> [ 6.095825] ? _raw_spin_unlock+0xa/0x30
> [ 6.095827] ? __queue_work+0x100/0x3f0
> [ 6.095829] qxl_queue_garbage_collect+0x54/0x60 [qxl]
> [ 6.095838] qxl_fence_wait+0x9a/0x170 [qxl]
> [ 6.095845] dma_fence_wait_timeout+0x4a/0x110
> [ 6.095851] dma_resv_wait_timeout+0x67/0xd0
> [ 6.095853] ttm_bo_delayed_delete+0x2e/0x90 [ttm]
> [ 6.095864] process_scheduled_works+0x22e/0x360
> [ 6.095866] worker_thread+0x143/0x2a0
> [ 6.095868] ? __pfx_worker_thread+0x10/0x10
> [ 6.095870] kthread+0xe4/0x110
> [ 6.095873] ? __pfx_kthread+0x10/0x10
> [ 6.095875] ret_from_fork+0x30/0x40
> [ 6.095878] ? __pfx_kthread+0x10/0x10
> [ 6.095879] ret_from_fork_asm+0x1b/0x30
> [ 6.095883] </TASK>
> [ 6.095884] ---[ end trace 0000000000000000 ]---
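
The second line of each trace names the two work items involved: `ttm_bo_delayed_delete` from the `ttm` module is flushing `qxl_gc_work` from the `qxl` module. To check many boot logs for the same warning, the line can be matched with a regular expression. This is a minimal sketch based only on the message format seen in the two traces above; the function name `find_flush_warnings` and the group names are illustrative.

```python
import re

# Message format as it appears in the traces above:
# workqueue: WQ_MEM_RECLAIM <wq>:<func> [<mod>] is flushing !WQ_MEM_RECLAIM <wq>:<func> [<mod>]
_PATTERN = re.compile(
    r"workqueue: WQ_MEM_RECLAIM (?P<src_wq>[^\s:]+):(?P<src_fn>\S+)"
    r"(?: \[(?P<src_mod>\w+)\])?"
    r" is flushing !WQ_MEM_RECLAIM (?P<dst_wq>[^\s:]+):(?P<dst_fn>\S+)"
    r"(?: \[(?P<dst_mod>\w+)\])?"
)


def find_flush_warnings(log_text):
    """Return one dict per matching warning line found in a dmesg-style log."""
    return [match.groupdict() for match in _PATTERN.finditer(log_text)]
```

Applied to saved console or `dmesg` output, each hit shows which workqueue and module is flushing which, so boots that hit the warning can be told apart from boots that did not.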

Searching for “DRM”, “TTM”, and “QXL” in this message, I found that there were others who were having trouble with the (nosy) DRM feature.

So I rebuilt the kernel with the option “CONFIG_DRM=n” and installed it in my VM.

To my surprise, it did not deadlock after dozens of starts. (But my VM's VNC screen became narrower :-)

I have blacklisted the kernel module so that it does not load, because my VMs (web, DB and DNS) do not need a wide framebuffer screen.


Thanks everyone.

I hope the DRM code and its related sources will improve.

(BTW, VMs with type='vga' in KVM do not have the problem, but VMs with 'qxl' or 'cirrus' sometimes deadlock.)
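
If switching the video model is the preferred workaround, the change can be made in the libvirt domain XML. The following is a sketch, assuming the guest definition uses the usual `<devices><video><model type='...'/>` layout; `set_video_model` is a hypothetical helper. It also drops the `ram` and `vgamem` attributes, which libvirt documents as applying to the qxl model only.

```python
import xml.etree.ElementTree as ET

# Attributes that libvirt documents as valid for the qxl model only.
QXL_ONLY_ATTRS = ("ram", "vgamem")


def set_video_model(domain_xml, model="vga"):
    """Return the domain XML with every <video><model> switched to `model`."""
    root = ET.fromstring(domain_xml)
    for node in root.findall("./devices/video/model"):
        node.set("type", model)
        if model != "qxl":
            for attr in QXL_ONLY_ATTRS:
                node.attrib.pop(attr, None)
    return ET.tostring(root, encoding="unicode")
```

The input would typically come from `virsh dumpxml <domain>`, and the result can be loaded back with `virsh define` while the guest is shut off.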


T.Kabu

P.S. I'll try MOCK later!!

toracat

2024-08-07 13:16

administrator   ~0010011

Closing as 'resolved/no change required'.

Issue History

Date Modified Username Field Change
2024-05-12 21:26 T.Kabu New Issue
2024-05-12 21:26 T.Kabu Status new => assigned
2024-05-12 21:26 T.Kabu Assigned To => toracat
2024-05-12 21:26 T.Kabu Tag Attached: centos
2024-05-12 21:26 T.Kabu Tag Attached: crash
2024-05-12 21:26 T.Kabu Tag Attached: kernel-ml
2024-05-13 11:49 toracat Note Added: 0009727
2024-05-13 21:47 T.Kabu Note Added: 0009735
2024-05-14 22:15 T.Kabu Note Added: 0009741
2024-05-15 14:47 toracat Project channel: elrepo/el8 => channel: kernel/el8
2024-05-15 14:47 toracat Category --elrepo--OTHER-- => General
2024-05-15 14:48 toracat Note Added: 0009744
2024-05-15 23:45 T.Kabu Note Added: 0009745
2024-05-15 23:56 T.Kabu Note Added: 0009746
2024-05-16 03:57 T.Kabu Note Added: 0009748
2024-05-16 12:16 toracat Note Added: 0009749
2024-05-16 20:43 T.Kabu Note Added: 0009751
2024-05-17 13:45 toracat Note Added: 0009752
2024-05-19 22:46 T.Kabu Note Added: 0009775
2024-08-07 13:16 toracat Status assigned => resolved
2024-08-07 13:16 toracat Resolution open => no change required
2024-08-07 13:16 toracat Category General => --kernel--OTHER--
2024-08-07 13:16 toracat Note Added: 0010011