View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000885 | channel: kernel/el7 | kernel-lt | public | 2018-12-18 09:23 | 2019-07-28 08:07 |
Reporter | pjwelsh | Assigned To | burakkucat | ||
Priority | normal | Severity | major | Reproducibility | always |
Status | closed | Resolution | not fixable | ||
Summary | 0000885: kernel-lt-4.4.167-1.el7.elrepo.x86_64 and kernel-lt-4.4.168-1 fails to boot on Dell Poweredge R640 | ||||
Description | A Dell R640 with CentOS 7.6 installed fails to boot after installing and selecting kernel-lt-4.4.167-1.el7.elrepo.x86_64 or kernel-lt-4.4.168-1. However, the stock kernel, kernel-ml-4.19.9-1.el7.elrepo.x86_64 and kernel-ml-4.19.10 all boot without issue on this hardware setup. Last lines: OK reached target basic system dracut-initqueue: Warning: dracut-initqueue timeout - starting timeout scripts ... | ||||
Additional Information | This system is currently being used to evaluate disk, SSD and NVME performance in select combinations. None of the testing disk combinations are included in fstab or needed for system operation. Not sure how to provide additional information; please be gentle with me ;) System: Host: tshh241.XX.com Kernel: 4.19.10-1.el7.elrepo.x86_64 x86_64 bits: 64 Console: N/A Distro: CentOS Linux release 7.6.1810 (Core) Machine: Type: Server System: Dell product: PowerEdge R640 v: N/A serial: <filter> Mobo: Dell model: 0XFK4K v: A05 serial: <filter> BIOS: Dell v: 1.6.12 date: 11/20/2018 CPU: Topology: 2x 12-Core model: Intel Xeon Gold 5118 bits: 64 type: MT MCP SMP L2 cache: 33.0 MiB Speed: 1000 MHz min/max: 1000/3200 MHz Core speeds (MHz): 1: 1001 2: 1001 3: 1001 4: 1001 5: 1000 6: 1000 7: 1000 8: 1000 9: 1092 10: 1000 11: 1001 12: 1001 13: 1000 14: 1000 15: 1001 16: 1001 17: 1001 18: 1000 19: 1001 20: 1000 21: 1001 22: 1000 23: 1000 24: 1001 25: 1000 26: 1000 27: 1000 28: 1000 29: 1001 30: 1001 31: 1000 32: 1001 33: 1000 34: 1001 35: 1001 36: 1119 37: 1001 38: 1001 39: 1000 40: 1001 41: 1001 42: 1001 43: 1001 44: 1001 45: 1001 46: 1001 47: 1001 48: 1001 Graphics: Device-1: Matrox Systems Integrated Matrox G200eW3 Graphics driver: mgag200 v: kernel Display: server: No display server data found. Headless machine? tty: N/A Message: Unable to show advanced data. Required tool glxinfo missing. Audio: Message: No Device data found. 
Network: Device-1: Intel I350 Gigabit Network driver: igb IF: em3 state: up speed: 1000 Mbps duplex: full mac: <filter> Device-2: Intel I350 Gigabit Network driver: igb IF: em4 state: down mac: <filter> Device-3: Intel Ethernet 10-Gigabit X540-AT2 driver: ixgbe IF: em1 state: down mac: <filter> Device-4: Intel Ethernet 10-Gigabit X540-AT2 driver: ixgbe IF: em2 state: down mac: <filter> Drives: Local Storage: total: 2.55 TiB used: 3.48 GiB (0.1%) ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO 500GB size: 465.76 GiB ID-2: /dev/nvme1n1 vendor: Samsung model: SSD 970 EVO 500GB size: 465.76 GiB ID-3: /dev/nvme2n1 vendor: Samsung model: SSD 970 EVO 500GB size: 465.76 GiB ID-4: /dev/sda model: PERC H740P Mini size: 278.88 GiB ID-5: /dev/sdb model: PERC H740P Mini size: 1.82 TiB RAID: Hardware-1: LSI Logic / Symbios Logic MegaRAID Tri-Mode SAS3508 driver: megaraid_sas Device-1: md127 type: mdraid status: active raid: raid-0 report: N/A Components: online: nvme2n1p1~c1 nvme0n1p1~c0 Partition: ID-1: / size: 254.69 GiB used: 3.20 GiB (1.3%) fs: xfs dev: /dev/dm-0 ID-2: /boot size: 896.7 MiB used: 289.7 MiB (32.3%) fs: xfs dev: /dev/sda1 ID-3: swap-1 size: 9.77 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-1 Sensors: System Temperatures: ipmi cpu: N/A mobo: N/A Fan Speeds (RPM): ipmi cpu: 5400 fan-26: fan-27: 3960 fan-42: 5280 fan-43: 4080 fan-58: 5280 fan-59: 3960 fan-74: 5280 fan-75: 3960 fan-90: 5160 fan-91: 3840 fan-106: 5400 fan-107: 3960 fan-122: 5280 fan-123: 4200 fan-138: 5400 fan-139: 4200 System Temperatures: lm-sensors cpu: 26.0 C mobo: N/A Fan Speeds (RPM): lm-sensors N/A Info: Processes: 485 Uptime: 7m Memory: 15.29 GiB used: 860.9 MiB (5.5%) Init: systemd runlevel: 5 Client: Unknown Client: sshd inxi: 3.0.26 | ||||
Tags | No tags attached. | ||||
Attached Files | only_in_config-4.4.168.txt (2,735 bytes)
CONFIG_RCU_NOCB_CPU_ALL=y CONFIG_HAVE_DMA_ATTRS=y CONFIG_UNINLINE_SPIN_UNLOCK=y CONFIG_X86_FAST_FEATURE_TESTS=y CONFIG_HOTPLUG_PCI_SHPC=m CONFIG_NF_CT_PROTO_DCCP=m CONFIG_NF_CT_PROTO_SCTP=m CONFIG_NF_CT_PROTO_UDPLITE=m CONFIG_NF_NAT_PROTO_DCCP=m CONFIG_NF_NAT_PROTO_UDPLITE=m CONFIG_NF_NAT_PROTO_SCTP=m CONFIG_NF_TABLES_NETDEV=m CONFIG_BT_QCA=m CONFIG_BT_HCIUART_BCM=y CONFIG_BT_HCIUART_QCA=y CONFIG_BT_HCIBTUART=m CONFIG_HAVE_BPF_JIT=y CONFIG_REGMAP_MMIO=m CONFIG_ZRAM_LZ4_COMPRESS=y CONFIG_BLK_CPQ_CISS_DA=m CONFIG_CISS_SCSI_TAPE=y CONFIG_SENSORS_BH1780=m CONFIG_SCSI_EATA=m CONFIG_SCSI_EATA_TAGGED_QUEUE=y CONFIG_SCSI_EATA_LINKED_COMMANDS=y CONFIG_SCSI_EATA_MAX_TAGS=16 CONFIG_SCSI_FUTURE_DOMAIN=m CONFIG_DM_CACHE_MQ=m CONFIG_DM_CACHE_CLEANER=m CONFIG_BE2NET_VXLAN=y CONFIG_NET_VENDOR_EXAR=y CONFIG_FM10K_VXLAN=y CONFIG_MLX4_EN_VXLAN=y CONFIG_QLCNIC_VXLAN=y CONFIG_MDIO_OCTEON=m CONFIG_ATH_CARDS=m CONFIG_WL_TI=y CONFIG_WLCORE_SPI=m CONFIG_NVM_GENNVM=m CONFIG_NVM_RRPC=m CONFIG_TOUCHSCREEN_FT6236=m CONFIG_INPUT_MPU3050=m CONFIG_SERIAL_8250_FINTEK=m CONFIG_HW_RANDOM_TPM=m CONFIG_I2C_MUX_PINCTRL=m CONFIG_SPI_PXA2XX_DMA=y CONFIG_PINCTRL_AMD=y CONFIG_GPIO_104_IDIO_16=m CONFIG_GPIO_SX150X=y CONFIG_GPIO_INTEL_MID=y CONFIG_GPIO_MCP23S08=m CONFIG_BATTERY_BQ27XXX_I2C=y CONFIG_BATTERY_BQ27XXX_PLATFORM=y CONFIG_CHARGER_TPS65217=m CONFIG_SENSORS_GPIO_FAN=m CONFIG_SENSORS_HTU21=m CONFIG_BCM7038_WDT=m CONFIG_MFD_TPS65217=m CONFIG_MFD_TPS65218=m CONFIG_REGULATOR_MAX8973=m CONFIG_REGULATOR_TPS65217=m CONFIG_IR_HIX5HD2=m CONFIG_VIDEO_ZORAN=m CONFIG_VIDEO_ZORAN_DC30=m CONFIG_VIDEO_ZORAN_ZR36060=m CONFIG_VIDEO_ZORAN_BUZ=m CONFIG_VIDEO_ZORAN_DC10=m CONFIG_VIDEO_ZORAN_LML33=m CONFIG_VIDEO_ZORAN_LML33R10=m CONFIG_VIDEO_ZORAN_AVS6EYES=m CONFIG_RADIO_SI470X=y CONFIG_VIDEO_BT819=m CONFIG_VIDEO_BT856=m CONFIG_VIDEO_BT866=m CONFIG_VIDEO_KS0127=m CONFIG_VIDEO_SAA7110=m CONFIG_VIDEO_VPX3220=m CONFIG_VIDEO_SAA7185=m CONFIG_VIDEO_ADV7170=m CONFIG_VIDEO_ADV7175=m CONFIG_DRM_I2C_ADV7511=m CONFIG_DRM_TDFX=m 
CONFIG_DRM_R128=m CONFIG_DRM_I810=m CONFIG_DRM_MGA=m CONFIG_DRM_SIS=m CONFIG_DRM_VIA=m CONFIG_DRM_SAVAGE=m CONFIG_BACKLIGHT_TPS65217=m CONFIG_SND_SEQUENCER_OSS=y CONFIG_SND_RAWMIDI_SEQ=m CONFIG_USB_ISP1362_HCD=m CONFIG_USB_LED=m CONFIG_USB_LIBCOMPOSITE=m CONFIG_USB_F_MASS_STORAGE=m CONFIG_USB_MASS_STORAGE=m CONFIG_MMC_BLOCK_BOUNCE=y CONFIG_LEDS_LP8860=m CONFIG_EDAC_MM_EDAC=m CONFIG_RTC_DRV_ISL12057=m CONFIG_RTC_DRV_DS3234=m CONFIG_DW_DMAC_PCI=m CONFIG_R8723AU=m CONFIG_8723AU_AP_MODE=y CONFIG_8723AU_BT_COEXIST=y CONFIG_STAGING_RDMA=m CONFIG_AMD_IOMMU_STATS=y CONFIG_NVMEM=m CONFIG_CIFS_SMB2=y CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y CONFIG_BUILD_DOCSRC=y CONFIG_DEBUG_NX_TEST=m CONFIG_CRYPTO_ABLK_HELPER=m CONFIG_PKCS7_MESSAGE_PARSER=m | ||||
|
Not sure it is relevant, but the "dracut-initqueue timeout" caught my eye, given that this is likely an el 7.6 system. There was an EL 7.5 > 7.6 upgrade issue with device-mapper\* and lvm2\* thought to be related to mdraid metadata 0.90 that hit me and, on a stock system, it would give lots of "dracut-initqueue timeout" errors and then fail with missing LV errors. I see md127 in your hardware output. Using kernel-4.14 (don't ask) the same system would boot, but would show really screwed up LVM stats. Basically, it was booting using mdraid members instead of the mdraids themselves as the PVs. Running pvscan gave strange results (missing UUID members, disk partition used as PV, etc.). The simple diagnostic is to run pvscan on the kernel-ml system and look for strange output. The fix would be to downgrade device-mapper\* and lvm2\* to the el 7.5 versions and to exclude= the 7.6 updates in yum.conf (and there were many releases of them). May be completely unrelated to your issues. |
|
From your report, I understand it to be a new hardware configuration that has never been used with any version of kernel-lt earlier than the two you have mentioned. I do not suspect a "missing item" in the kernel-lt configuration and so suspect that if I were to offer you, say, kernel-lt-4.4.150-1.el7.elrepo (from the archive) the end result would still be a failure to boot. Please try the test documented in paragraph four of note 6065 . . . . for I currently have no other idea what could be causing the problem. |
|
Yes, this was updated from 7.5 to 7.6. However, the md127 did not have a filesystem or label on it during any of the boot attempts for any of the kernels. The md127 is the NVME testing I'm doing at this time. That grouping of /dev/nvme* and md* has been rebuilt/destroyed many times (wipefs, vgremove etc). Any idea why the kernel-ml is not identically affected? Currently running a long duration Phoronix test. It may be a while before I can test according to paragraph four of note 6065. |
|
The system will eventually drop me to a dracut:/# prompt after telling me that my "slash" filesystem and swap are not found. Even more interesting is the result when pvscan is run from the dracut:/# prompt: sh: pvscan: command not found. I am remote to that system and will see about getting the rdsosreport.txt that the prompt mentions. |
|
Uploaded an initial image of the virtual console showing the issue. "lvm_scan" when run from the dracut prompt produces no results. Also, when at the dracut prompt, there are no /dev/sd* device files. I'd expect to see at least the sdaX for the standard /boot partition on the PERC H740P RAID card? |
|
When I ran into the 7.5 > 7.6 dm/lvm problem, I discovered on some systems that the dm and lvm dracut modules were not installed in the initramfs image. I think that would trigger your symptoms in posts 0006069 and 0006070. You can check the image currently in use by: # lsinitrd -k $(uname -r)|less (or ...-k <kver>... for other images) The installed dracut modules are in a list at the start of the output. You can force the inclusion during manual dracut runs (dracut --add "dm lvm"...) or, as I did, just modify dracut.conf to add the modules. # grep ^add_dracutmodules /etc/dracut.conf add_dracutmodules+="dm lvm" Hope that helps. |
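A minimal sketch of the check described above, assuming `lsinitrd` prints a `dracut modules:` header followed by the module names, ending at the next `=====` separator line. The function names `list_dracut_modules` and `missing_dracut_modules` are made up for illustration and are not part of dracut.

```shell
# List the dracut modules recorded in lsinitrd output read from stdin.
# Assumption: names follow a "dracut modules:" header (on the same line or
# one per line below it) and stop at the next line made only of "=".
list_dracut_modules() {
    awk '
        /^dracut modules:/ { collecting = 1; sub(/^dracut modules:/, "") }
        /^=+$/             { collecting = 0 }
        collecting         { for (i = 1; i <= NF; i++) print $i }
    '
}

# Print every wanted module (given as arguments) that is absent from the
# lsinitrd output on stdin. Prints nothing when all of them are present.
missing_dracut_modules() {
    local found wanted
    found=$(list_dracut_modules)
    for wanted in "$@"; do
        printf '%s\n' "$found" | grep -qxF -- "$wanted" || printf '%s\n' "$wanted"
    done
}
```

For example, `lsinitrd -k "$(uname -r)" | missing_dracut_modules dm lvm` should print nothing if both modules are already in the running kernel's image.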
|
FYI: blkid when run from the dracut prompt, not surprisingly, does not show any of the /dev/sdX. It did show the nvme and lvm partitions, however. How can it show the LVM partitions but not any /dev/sdX when the VGs are built on the sdX devices? Here is the module section from kernel-lt and stock (there are no differences in the modules group - only the "Early CPIO image"): # lsinitrd -k 4.4.168-1.el7.elrepo.x86_64 Image: /boot/initramfs-4.4.168-1.el7.elrepo.x86_64.img: 21M ======================================================================== Version: dracut-033-554.el7 Arguments: -f dracut modules: bash nss-softokn i18n network ifcfg drm plymouth dm kernel-modules lvm resume rootfs-block terminfo udev-rules biosdevname systemd usrmount base fs-lib microcode_ctl-fw_dir_override shutdown # lsinitrd -k 3.10.0-957.1.3.el7.x86_64 Image: /boot/initramfs-3.10.0-957.1.3.el7.x86_64.img: 21M ======================================================================== Early CPIO image ======================================================================== drwxr-xr-x 3 root root 0 Dec 4 07:52 . -rw-r--r-- 1 root root 2 Dec 4 07:52 early_cpio drwxr-xr-x 3 root root 0 Dec 4 07:52 kernel drwxr-xr-x 3 root root 0 Dec 4 07:52 kernel/x86 drwxr-xr-x 2 root root 0 Dec 4 07:52 kernel/x86/microcode -rw-r--r-- 1 root root 31744 Dec 4 07:52 kernel/x86/microcode/GenuineIntel.bin ======================================================================== Version: dracut-033-554.el7 Arguments: -f dracut modules: bash nss-softokn i18n network ifcfg drm plymouth dm kernel-modules lvm resume rootfs-block terminfo udev-rules biosdevname systemd usrmount base fs-lib microcode_ctl-fw_dir_override shutdown |
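With the dracut module lists matching, one further thing that could be looked at is whether the kernel driver for the RAID controller (megaraid_sas, according to the inxi output) is packed into the kernel-lt image at all. A rough sketch, assuming each line of the `lsinitrd` file listing ends with the file path; `initramfs_has_driver` is a hypothetical helper name. A driver file being present does not show that this driver version supports the controller.

```shell
# Succeed if the file listing on stdin (for example from lsinitrd) contains
# a kernel module file for the named driver: name.ko, optionally followed
# by a compression suffix such as .xz or .gz.
initramfs_has_driver() {
    grep -Eq "/$1\.ko(\.[a-z0-9]+)?\$"
}
```

Possible use: `lsinitrd -k 4.4.168-1.el7.elrepo.x86_64 | initramfs_has_driver megaraid_sas && echo present`.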
|
Yes, something strange is going on. If it is related to the dm/lvm2 updates, then you could revert to 7.5 dm/lvm2, remake initramfs images and retest. I am not clear on the relationship of the dracut dm/lvm2 modules to the installed dm/lvm2, so it might be a waste of time to downgrade. On the other hand, it is something to try. On a downgraded system: # rpm -q device-mapper device-mapper-event device-mapper-event-libs device-mapper-libs lvm2 lvm2-libs device-mapper-1.02.146-4.el7.x86_64 device-mapper-event-1.02.146-4.el7.x86_64 device-mapper-event-libs-1.02.146-4.el7.x86_64 device-mapper-libs-1.02.146-4.el7.x86_64 lvm2-2.02.177-4.el7.x86_64 lvm2-libs-2.02.177-4.el7.x86_64 and yum.conf includes: exclude=device-mapper-1.02.149-8.el7 device-mapper-event-1.02.149-8.el7 device-mapper-event-libs-1.02.149-8.el7 device-mapper-libs-1.02.149-8.el7 lvm2-2.02.180-8.el7 lvm2-libs-2.02.180-8.el7 device-mapper-1.02.149-10.el7_6 device-mapper-event-1.02.149-10.el7_6 device-mapper-event-libs-1.02.149-10.el7_6 device-mapper-libs-1.02.149-10.el7_6 lvm2-2.02.180-10.el7_6 lvm2-libs-2.02.180-10.el7_6 device-mapper-1.02.149-10.el7_6.1 device-mapper-event-1.02.149-10.el7_6.1 device-mapper-event-libs-1.02.149-10.el7_6.1 device-mapper-libs-1.02.149-10.el7_6.1 lvm2-2.02.180-10.el7_6.1 lvm2-libs-2.02.180-10.el7_6.1 device-mapper-1.02.149-10.el7_6.2 device-mapper-event-1.02.149-10.el7_6.2 device-mapper-event-libs-1.02.149-10.el7_6.2 device-mapper-libs-1.02.149-10.el7_6.2 lvm2-2.02.180-10.el7_6.2 lvm2-libs-2.02.180-10.el7_6.2 |
|
Bad news... the dm and lvm2 downgrade is not producing any different results. I removed all kernel-lt: yum remove kernel-lt\* I had to have the following in /etc/yum.conf: exclude=device-mapper-1.02.149-8.el7 device-mapper-event-1.02.149-8.el7 device-mapper-event-libs-1.02.149-8.el7 device-mapper-libs-1.02.149-8.el7 lvm2-2.02.180-8.el7 lvm2-libs-2.02.180-8.el7 device-mapper-1.02.149-10.el7_6 device-mapper-event-1.02.149-10.el7_6 device-mapper-event-libs-1.02.149-10.el7_6 device-mapper-libs-1.02.149-10.el7_6 lvm2-2.02.180-10.el7_6 lvm2-libs-2.02.180-10.el7_6 device-mapper-1.02.149-10.el7_6.1 device-mapper-event-1.02.149-10.el7_6.1 device-mapper-event-libs-1.02.149-10.el7_6.1 device-mapper-libs-1.02.149-10.el7_6.1 lvm2-2.02.180-10.el7_6.1 lvm2-libs-2.02.180-10.el7_6.1 device-mapper-1.02.149-10.el7_6.2 device-mapper-event-1.02.149-10.el7_6.2 device-mapper-event-libs-1.02.149-10.el7_6.2 device-mapper-libs-1.02.149-10.el7_6.2 lvm2-2.02.180-10.el7_6.2 lvm2-libs-2.02.180-10.el7_6.2 lvm2-python-libs-2.02.180-8.el7 lvm2-python-libs-2.02.180-10.el7_6.1 lvm2-python-libs-2.02.180-10.el7_6.2 device-mapper-event-1.02.149-10.el7_6.2 kpartx-0.4.9-123.el7 and then issue the downgrade command: yum downgrade kpartx\* device-manager\* lvm2\* device-mapper\* --enablerepo=C7.5.1804-updates,C7.5.1804-base The current package list: # rpm -qa device-manager\* lvm2\* device-mapper\* device-mapper-event-1.02.146-4.el7.x86_64 device-mapper-libs-1.02.146-4.el7.x86_64 lvm2-2.02.177-4.el7.x86_64 device-mapper-persistent-data-0.7.3-3.el7.x86_64 device-mapper-1.02.146-4.el7.x86_64 lvm2-python-libs-2.02.177-4.el7.x86_64 device-mapper-event-libs-1.02.146-4.el7.x86_64 device-mapper-multipath-libs-0.4.9-119.el7_5.1.x86_64 device-mapper-multipath-0.4.9-119.el7_5.1.x86_64 lvm2-libs-2.02.177-4.el7.x86_64 Installed the kernel-lt: yum -y install kernel-lt elrepo-kernel is enabled by default on this system. 
End result: kernel-lt-4.4.168-1.el7.elrepo.x86_64 still does not boot for some reason (and neither did at least the previous release)... |
|
Downgrade produced no positive results for the kernel-lt. Any other ideas? Keep in mind that neither the kernel-ml nor the stock C7.x kernels are afflicted. On a side note, the nvme md performance on the kernel-ml is much better than stock for my first round of testing. |
|
The attached file, only_in_config-4.4.168.txt, contains 114 lines from the configuration file of kernel-lt that are not present in either the distro or the kernel-ml configuration files. Perhaps its contents will provide inspiration? |
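A comparison of this kind could be reproduced roughly as follows. This is only a sketch under assumptions: it compares whole `CONFIG_...=value` lines (the method actually used to build the attachment is not stated), it relies on GNU grep accepting `-f -`, and `only_in_first_config` is a hypothetical name.

```shell
# Print the set options (CONFIG_FOO=value lines) of the first config file
# that do not appear verbatim in any of the remaining config files.
only_in_first_config() {
    local first=$1
    shift
    # Pattern list: every set option from the other files, read via "-f -".
    # -x matches whole lines, -F treats patterns as fixed strings, -v inverts.
    grep -h '^CONFIG_' -- "$@" | grep -vxFf - -- "$first" | grep '^CONFIG_'
}
```

On a system where the kernel packages install their configuration under /boot, the call might look like `only_in_first_config /boot/config-4.4.168-1.el7.elrepo.x86_64 /boot/config-3.10.0-957.1.3.el7.x86_64 /boot/config-4.19.10-1.el7.elrepo.x86_64`. Because whole lines are compared, an option built as a module in one kernel and built in (=y) in another also shows up in the output.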
|
I know almost nothing about the internals of the kernel. Just scanning the listed differences has me wondering about the CONFIG_NVMEM=m as that is not seen as a module in either stock C7 or 4.19. It seems more like something else is borked... why/how could I see the LVMs with the blkid command but not see the underlying /dev/sd* files? |
|
Sorry I could not be of more help. One thing to think about is that kernel-lt (kernel-4.4) is getting somewhat (cough) old (cough) in all the stacks. In theory, only security fixes are patched into the stable kernel.org kernels, unlike the backporting that goes on in the stock el7 kernel. I use kernel-lt on a 10-year-old HP laptop with an early AMD dual-core processor, because that laptop stopped booting the stock kernels around el 7.2. Thank you RHEL for disabling a critical module in addition to all the old NICs. :-) Also, I needed to build the Broadcom wl module for it and discovered that the Broadcom hybrid driver, which itself is getting old (10/01/2015), built and functioned as-is without modification for kernel-lt. In contrast, the wireless stack of the standard el7 kernels has been backported so many times that the Broadcom hybrid driver needs patches related to the wireless stacks of kernel-4.7, kernel-4.8, kernel-4.11 and kernel-4.12 before the wl driver will build. Wireless stack changes are also the reason so many elrepo wireless drivers need a rebuild at some point releases. It is my assumption that similar things are happening in the other stacks, too. It could be that your hardware needs drivers not present in kernel-4.4. I see that you are using m.2 nvme ssd drives. Did those even exist in the time frame of kernel-4.4, which was initially released on 2016-01-10? Of course, there should be fallback to some level of PCIe, I would think. Yeah, like that really fixes your problem. :-) |
|
In the past, I've tried using the kernel-ml on some systems and found it, frankly, unstable, to a large degree in the wifi and similar items you've mentioned. Hardware would just stop working for some time; I'd wait for another release or two and it would work again. Though not used for any OS file system, the nvme drives did show up in /dev/ on the 4.4 kernel. Thanks for taking the time to reflect on this. I at least have the option to continue to test in both the stock and -ml kernels to maximize IO for a couple of database projects. |
Date Modified | Username | Field | Change |
---|---|---|---|
2018-12-18 09:23 | pjwelsh | New Issue | |
2018-12-18 09:23 | pjwelsh | Status | new => assigned |
2018-12-18 09:23 | pjwelsh | Assigned To | => burakkucat |
2018-12-18 10:49 | stindall | Note Added: 0006065 | |
2018-12-18 11:13 | burakkucat | Note Added: 0006066 | |
2018-12-18 11:55 | pjwelsh | Note Added: 0006068 | |
2018-12-19 08:03 | pjwelsh | Note Added: 0006069 | |
2018-12-19 08:04 | pjwelsh | File Added: elrepo-bug-885-1.jpg | |
2018-12-19 08:15 | pjwelsh | Note Added: 0006070 | |
2018-12-19 09:53 | stindall | Note Added: 0006071 | |
2018-12-19 10:08 | pjwelsh | Note Added: 0006072 | |
2018-12-19 10:30 | stindall | Note Added: 0006073 | |
2018-12-19 19:02 | pjwelsh | Note Added: 0006074 | |
2018-12-20 13:27 | pjwelsh | Note Added: 0006075 | |
2018-12-20 14:57 | burakkucat | File Added: only_in_config-4.4.168.txt | |
2018-12-20 14:58 | burakkucat | Note Added: 0006076 | |
2018-12-20 19:39 | pjwelsh | Note Added: 0006077 | |
2018-12-20 21:20 | stindall | Note Added: 0006078 | |
2018-12-20 22:06 | pjwelsh | Note Added: 0006079 | |
2019-07-28 08:07 | burakkucat | Status | assigned => closed |
2019-07-28 08:07 | burakkucat | Resolution | open => not fixable |