Encrypted XFS volume fails to mount, pod stuck at ContainerCreating #923

Open · bheisig opened this issue Apr 2, 2025 · 7 comments
Labels: bug (Something isn't working)


bheisig commented Apr 2, 2025

TL;DR

When a pod is moved from one worker node to another, the attached volume fails to mount. The volume is encrypted with LUKS and formatted with XFS.

Expected behavior

When a pod is moved to another node, the attached volume should be detached from the old node, attached to the new one, and mounted properly.

Observed behavior

It worked fine a couple of months ago, but since then this behavior has occurred on 3 different k8s clusters.

The pod gets stuck in the ContainerCreating state.

k8s event log
 Warning  FailedMount             33s (x14 over 12m)  kubelet                  MountVolume.SetUp failed for volume "pvc-f478066b-1e87-4572-9dd3-32da9985f75d" : rpc error: code = Internal desc = failed to publish volume: format of disk "/dev/mapper/scsi-0HC_Volume_100327067" failed: type:("xfs") target:("/var/lib/kubelet/pods/b8b76bc0-285c-4ee8-ae66-672f38b80ce9/volumes/kubernetes.io~csi/pvc-f478066b-1e87-4572-9dd3-32da9985f75d/mount") options:("defaults") errcode:(exit status 1) output:(Parameters parsed from config file /usr/share/xfsprogs/mkfs/lts_4.19.conf successfully
error reading existing superblock: I/O error
mkfs.xfs: pwrite failed: I/O error
libxfs_bwrite: write failed on (unknown) bno 0x27fef00/0x100, err=5
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: I/O error
libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: I/O error
libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5
mkfs.xfs: Releasing dirty buffer to free list!
mkfs.xfs: libxfs_device_zero write failed: I/O error
meta-data=/dev/mapper/scsi-0HC_Volume_100327067 isize=512    agcount=4, agsize=1310592 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
         =                       exchange=0  
data     =                       bsize=4096   blocks=5242368, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
)
dmesg on k8s worker node (source)
[  702.744852] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
[  702.745000] sd 0:0:0:1: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
[  702.745138] sd 0:0:0:1: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[  702.745142] sd 0:0:0:1: [sdb] Sense Key : Illegal Request [current] 
[  702.745144] sd 0:0:0:1: [sdb] Add. Sense: Logical unit not supported
[  703.218767] dmcrypt_write/2: attempt to access beyond end of device
               sdb: rw=169985, sector=10505187, nr_sectors = 7 limit=0
[  703.218791] XFS (dm-12): log I/O error -5
[  703.220142] XFS (dm-12): Filesystem has been shut down due to log error (0x2).
[  703.223894] XFS (dm-12): Please unmount the filesystem and rectify the problem(s).
[  703.232395] XFS (dm-12): Unmounting Filesystem 0a9c39c8-6311-4de9-aa29-c80d72678f8a
dmesg on k8s worker node (destination)
[ 8558.563139] sd 0:0:0:2: Power-on or device reset occurred
[ 8558.563283] sd 0:0:0:2: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
[ 8558.563726] sd 0:0:0:2: [sdb] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[ 8558.563781] sd 0:0:0:2: [sdb] Write Protect is off
[ 8558.563785] sd 0:0:0:2: [sdb] Mode Sense: 63 00 00 08
[ 8558.563867] sd 0:0:0:2: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8558.566654] sd 0:0:0:2: [sdb] Attached SCSI disk
[ 8561.101394] scsi host7: iSCSI Initiator over TCP/IP
[ 8561.103179] scsi 7:0:0:0: RAID              IET      Controller       0001 PQ: 0 ANSI: 5
[ 8561.104758] scsi 7:0:0:0: Attached scsi generic sg3 type 12
[ 8561.105203] scsi 7:0:0:1: Direct-Access     IET      VIRTUAL-DISK     0001 PQ: 0 ANSI: 5
[ 8561.108397] sd 7:0:0:1: Attached scsi generic sg4 type 0
[ 8561.108491] sd 7:0:0:1: Power-on or device reset occurred
[ 8561.108795] sd 7:0:0:1: [sdc] 4194304 512-byte logical blocks: (2.15 GB/2.00 GiB)
[ 8561.108913] sd 7:0:0:1: [sdc] Write Protect is off
[ 8561.108916] sd 7:0:0:1: [sdc] Mode Sense: 69 00 10 08
[ 8561.109245] sd 7:0:0:1: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 8561.126059] sd 7:0:0:1: [sdc] Attached SCSI disk
[ 8562.074292] blkid: attempt to access beyond end of device
               sdk: rw=524288, sector=41942912, nr_sectors = 8 limit=0
[ 8562.074305] blkid: attempt to access beyond end of device
               sdk: rw=0, sector=41942912, nr_sectors = 8 limit=0
[ 8562.074308] Buffer I/O error on dev dm-9, logical block 5242352, async page read
[ 8562.078707] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=18435, sector=4096, nr_sectors = 2097152 limit=0
[ 8562.078721] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=2051, sector=2101248, nr_sectors = 2097152 limit=0
[ 8562.078874] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=2048, sector=4096, nr_sectors = 1 limit=0
[ 8562.079351] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=41942784, nr_sectors = 256 limit=0
[ 8562.079703] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=4096, nr_sectors = 256 limit=0
[ 8562.079876] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=4096, nr_sectors = 1 limit=0
[ 8562.087313] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=1, sector=20973608, nr_sectors = 2048 limit=0
[ 8562.087351] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=1, sector=20975656, nr_sectors = 2048 limit=0
[ 8562.675559] Buffer I/O error on dev dm-9, logical block 5242352, async page read
[ 8563.786083] Buffer I/O error on dev dm-9, logical block 5242352, async page read
[ 8565.794174] eth0: renamed from tmpb18e5
[ 8565.907520] Buffer I/O error on dev dm-9, logical block 5242352, async page read
[ 8570.047102] bio_check_eod: 282 callbacks suppressed
[ 8570.047109] blkid: attempt to access beyond end of device
               sdk: rw=524288, sector=41942912, nr_sectors = 8 limit=0
[ 8570.047130] blkid: attempt to access beyond end of device
               sdk: rw=0, sector=41942912, nr_sectors = 8 limit=0
[ 8570.047137] Buffer I/O error on dev dm-9, logical block 5242352, async page read
[ 8570.055588] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=18435, sector=4096, nr_sectors = 2097152 limit=0
[ 8570.055601] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=2051, sector=2101248, nr_sectors = 2097152 limit=0
[ 8570.055729] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=2048, sector=4096, nr_sectors = 1 limit=0
[ 8570.056255] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=41942784, nr_sectors = 256 limit=0
[ 8570.056726] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=4096, nr_sectors = 256 limit=0
[ 8570.057054] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=4096, nr_sectors = 1 limit=0
[ 8570.060980] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=1, sector=20975656, nr_sectors = 2048 limit=0
[ 8570.060997] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=1, sector=20977704, nr_sectors = 2048 limit=0
[ 8578.217568] bio_check_eod: 63 callbacks suppressed
[ 8578.217573] blkid: attempt to access beyond end of device
               sdk: rw=524288, sector=41942912, nr_sectors = 8 limit=0
[ 8578.217584] blkid: attempt to access beyond end of device
               sdk: rw=0, sector=41942912, nr_sectors = 8 limit=0
[ 8578.217587] Buffer I/O error on dev dm-9, logical block 5242352, async page read
[ 8578.221335] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=18435, sector=4096, nr_sectors = 2097152 limit=0
[ 8578.221341] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=2051, sector=2101248, nr_sectors = 2097152 limit=0
[ 8578.221391] mkfs.xfs: attempt to access beyond end of device
               sdk: rw=2048, sector=4096, nr_sectors = 1 limit=0
[ 8578.221618] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=41942784, nr_sectors = 256 limit=0
[ 8578.221837] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=4096, nr_sectors = 256 limit=0
[ 8578.221958] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=34817, sector=4096, nr_sectors = 1 limit=0
[ 8578.223096] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=1, sector=20973608, nr_sectors = 2048 limit=0
[ 8578.223118] dmcrypt_write/2: attempt to access beyond end of device
               sdk: rw=1, sector=20975656, nr_sectors = 2048 limit=0
[ 8594.370868] bio_check_eod: 63 callbacks suppressed
[ 8594.370873] blkid: attempt to access beyond end of device
               sdk: rw=524288, sector=41942912, nr_sectors = 8 limit=0
[ 8594.370887] blkid: attempt to access beyond end of device
               sdk: rw=0, sector=41942912, nr_sectors = 8 limit=0
[ 8594.370891] Buffer I/O error on dev dm-9, logical block 5242352, async page read
journal on k8s worker node (destination)
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: scsi 0:0:0:1: Direct-Access     HC       Volume           2.5+ PQ: 0 ANSI: 5
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: sd 0:0:0:1: Attached scsi generic sg2 type 0
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: sd 0:0:0:1: Power-on or device reset occurred
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: sd 0:0:0:1: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: sd 0:0:0:1: [sdk] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: sd 0:0:0:1: [sdk] Write Protect is off
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: sd 0:0:0:1: [sdk] Mode Sense: 63 00 00 08
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: sd 0:0:0:1: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b kernel: sd 0:0:0:1: [sdk] Attached SCSI disk
Apr 02 11:45:59 k8s-worker-2d67845a20e2271b (udev-worker)[213707]: sdk: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/sdk' failed with exit code 1.
Apr 02 11:46:00 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:00.438875727Z" level=info msg="TaskExit event in podsandbox handler container_id:\"ec43f24ae3d627e43567d72ba1a9ea8acf368f3b4ee697868ae914ea12fd2e3f\"  id:\"f0914f34a06f78b8c7ba83a2d77775b6c703396a858343a131932f31166b74a4\"  pid:213794  exited_at:{seconds:1743594360  nanos:438488333}"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.302967412Z" level=info msg="TaskExit event in podsandbox handler container_id:\"6363f2cb7e0c6ffa001f39952881cae5544f95f83d7c406dd3a3808dffeb822f\"  id:\"696bc5287d8978036999bbf4827e2841965bd796cf41c89f6dc1a38632d2c56c\"  pid:213848  exited_at:{seconds:1743594361  nanos:302623017}"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.305870435Z" level=info msg="TaskExit event in podsandbox handler container_id:\"6363f2cb7e0c6ffa001f39952881cae5544f95f83d7c406dd3a3808dffeb822f\"  id:\"4d7962cc892b52f42a2a506b598067c6f670aee3c8630558bb181ed395c9b7bf\"  pid:213850  exited_at:{seconds:1743594361  nanos:305650753}"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.601855000Z" level=info msg="TaskExit event in podsandbox handler container_id:\"a100e4f6459bb5e5e657f0aa8d2c069c9622bf6033318416061d795783c42e40\"  id:\"81c77cc709c1f8adb57f0c99aec31cf6f28f361fecb9a9ae8afb6330490bb88a\"  pid:213911  exited_at:{seconds:1743594361  nanos:601477162}"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.604140997Z" level=info msg="TaskExit event in podsandbox handler container_id:\"a100e4f6459bb5e5e657f0aa8d2c069c9622bf6033318416061d795783c42e40\"  id:\"636e8924f14b147d213e982ca70c8a551994de896202510587d639503cd7841b\"  pid:213913  exited_at:{seconds:1743594361  nanos:603797614}"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.637333740Z" level=warning msg="container event discarded" container=dc93210697a2d9a4bbcb68ca3ced36a0abeae5ed44970fa062029f48fe7d170a type=CONTAINER_DELETED_EVENT
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:01.747587    1480 scope.go:117] "RemoveContainer" containerID="f8e7dfe737fe57e6904763db97ec09e01d71703bb1589a0d85beecdaea6f7f18"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.749742607Z" level=info msg="RemoveContainer for \"f8e7dfe737fe57e6904763db97ec09e01d71703bb1589a0d85beecdaea6f7f18\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.755661032Z" level=info msg="RemoveContainer for \"f8e7dfe737fe57e6904763db97ec09e01d71703bb1589a0d85beecdaea6f7f18\" returns successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.757763345Z" level=info msg="StopPodSandbox for \"10819328a07f4ec6773ec091e7fa308318fd6e4122e4741d68f088f7c78ddc95\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.796210690Z" level=info msg="TearDown network for sandbox \"10819328a07f4ec6773ec091e7fa308318fd6e4122e4741d68f088f7c78ddc95\" successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.796249022Z" level=info msg="StopPodSandbox for \"10819328a07f4ec6773ec091e7fa308318fd6e4122e4741d68f088f7c78ddc95\" returns successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.796709053Z" level=info msg="RemovePodSandbox for \"10819328a07f4ec6773ec091e7fa308318fd6e4122e4741d68f088f7c78ddc95\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.796751633Z" level=info msg="Forcibly stopping sandbox \"10819328a07f4ec6773ec091e7fa308318fd6e4122e4741d68f088f7c78ddc95\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.830034965Z" level=info msg="TearDown network for sandbox \"10819328a07f4ec6773ec091e7fa308318fd6e4122e4741d68f088f7c78ddc95\" successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.832517320Z" level=info msg="Ensure that sandbox 10819328a07f4ec6773ec091e7fa308318fd6e4122e4741d68f088f7c78ddc95 in task-service has been cleanup successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.837023632Z" level=info msg="RemovePodSandbox \"10819328a07f4ec6773ec091e7fa308318fd6e4122e4741d68f088f7c78ddc95\" returns successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.837501908Z" level=info msg="StopPodSandbox for \"4e1688ab2cfe9da243e4118c0e92adbddf99cb4d39d1a1d2fb50d534b316d8ca\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.878625230Z" level=info msg="TearDown network for sandbox \"4e1688ab2cfe9da243e4118c0e92adbddf99cb4d39d1a1d2fb50d534b316d8ca\" successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.878668180Z" level=info msg="StopPodSandbox for \"4e1688ab2cfe9da243e4118c0e92adbddf99cb4d39d1a1d2fb50d534b316d8ca\" returns successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.879228579Z" level=info msg="RemovePodSandbox for \"4e1688ab2cfe9da243e4118c0e92adbddf99cb4d39d1a1d2fb50d534b316d8ca\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.879302488Z" level=info msg="Forcibly stopping sandbox \"4e1688ab2cfe9da243e4118c0e92adbddf99cb4d39d1a1d2fb50d534b316d8ca\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.912867356Z" level=info msg="TearDown network for sandbox \"4e1688ab2cfe9da243e4118c0e92adbddf99cb4d39d1a1d2fb50d534b316d8ca\" successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.916560006Z" level=info msg="Ensure that sandbox 4e1688ab2cfe9da243e4118c0e92adbddf99cb4d39d1a1d2fb50d534b316d8ca in task-service has been cleanup successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.920025029Z" level=info msg="RemovePodSandbox \"4e1688ab2cfe9da243e4118c0e92adbddf99cb4d39d1a1d2fb50d534b316d8ca\" returns successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.920597912Z" level=info msg="StopPodSandbox for \"2399fe1e38c7930ecd9d80d14d14d8fa73f7e562f75c89d8b52c8d4cf23ba237\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:01.947678    1480 reconciler_common.go:245] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-f478066b-1e87-4572-9dd3-32da9985f75d\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^100327067\") pod \"db-57fd8cc95d-ppdhr\" (UID: \"9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b\") " pod="test/db-57fd8cc95d-ppdhr"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:01.964200    1480 operation_generator.go:1491] "Controller attach succeeded for volume \"pvc-f478066b-1e87-4572-9dd3-32da9985f75d\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^100327067\") pod \"db-57fd8cc95d-ppdhr\" (UID: \"9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b\") device path: \"\"" pod="test/db-57fd8cc95d-ppdhr"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.967464021Z" level=info msg="TearDown network for sandbox \"2399fe1e38c7930ecd9d80d14d14d8fa73f7e562f75c89d8b52c8d4cf23ba237\" successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.967509305Z" level=info msg="StopPodSandbox for \"2399fe1e38c7930ecd9d80d14d14d8fa73f7e562f75c89d8b52c8d4cf23ba237\" returns successfully"
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.968038616Z" level=info msg="RemovePodSandbox for \"2399fe1e38c7930ecd9d80d14d14d8fa73f7e562f75c89d8b52c8d4cf23ba237\""
Apr 02 11:46:01 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:01.968094540Z" level=info msg="Forcibly stopping sandbox \"2399fe1e38c7930ecd9d80d14d14d8fa73f7e562f75c89d8b52c8d4cf23ba237\""
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:02.007108085Z" level=info msg="TearDown network for sandbox \"2399fe1e38c7930ecd9d80d14d14d8fa73f7e562f75c89d8b52c8d4cf23ba237\" successfully"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:02.011727128Z" level=info msg="Ensure that sandbox 2399fe1e38c7930ecd9d80d14d14d8fa73f7e562f75c89d8b52c8d4cf23ba237 in task-service has been cleanup successfully"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:02.015867647Z" level=info msg="RemovePodSandbox \"2399fe1e38c7930ecd9d80d14d14d8fa73f7e562f75c89d8b52c8d4cf23ba237\" returns successfully"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b containerd[1173]: time="2025-04-02T11:46:02.021943506Z" level=info msg="TaskExit event in podsandbox handler container_id:\"e9d4f67c03d04bd0e12dfdc922576f1e229600a5f20c67438bd8050915cf825e\"  id:\"6701d8e407af9e61fd40efba1a4120bbf02674b1e28bf1fd3c286d6a05201d67\"  pid:214037  exited_at:{seconds:1743594362  nanos:21457707}"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:02.049158    1480 operation_generator.go:538] "MountVolume.WaitForAttach entering for volume \"pvc-f478066b-1e87-4572-9dd3-32da9985f75d\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^100327067\") pod \"db-57fd8cc95d-ppdhr\" (UID: \"9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b\") DevicePath \"\"" pod="test/db-57fd8cc95d-ppdhr"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:02.056010    1480 operation_generator.go:548] "MountVolume.WaitForAttach succeeded for volume \"pvc-f478066b-1e87-4572-9dd3-32da9985f75d\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^100327067\") pod \"db-57fd8cc95d-ppdhr\" (UID: \"9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b\") DevicePath \"csi-4d085733a3b94c6963c384fc65fe08188edb788e9e3518f8cbd5e4a04516b313\"" pod="test/db-57fd8cc95d-ppdhr"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:02.063375    1480 csi_attacher.go:380] kubernetes.io/csi: attacher.MountDevice STAGE_UNSTAGE_VOLUME capability not set. Skipping MountDevice...
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:02.063413    1480 operation_generator.go:580] "MountVolume.MountDevice succeeded for volume \"pvc-f478066b-1e87-4572-9dd3-32da9985f75d\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^100327067\") pod \"db-57fd8cc95d-ppdhr\" (UID: \"9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b\") device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/csi.hetzner.cloud/690cdbea35506e948320f285c6424c7b8d3de5876983b647558104acf864be34/globalmount\"" pod="test/db-57fd8cc95d-ppdhr"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: bio_check_eod: 63 callbacks suppressed
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: blkid: attempt to access beyond end of device
                                                    sdb: rw=524288, sector=41942912, nr_sectors = 8 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: blkid: attempt to access beyond end of device
                                                    sdb: rw=0, sector=41942912, nr_sectors = 8 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: Buffer I/O error on dev dm-0, logical block 5242352, async page read
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: mkfs.xfs: attempt to access beyond end of device
                                                    sdb: rw=18435, sector=4096, nr_sectors = 2097152 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: mkfs.xfs: attempt to access beyond end of device
                                                    sdb: rw=2051, sector=2101248, nr_sectors = 2097152 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: mkfs.xfs: attempt to access beyond end of device
                                                    sdb: rw=2048, sector=4096, nr_sectors = 1 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: dmcrypt_write/2: attempt to access beyond end of device
                                                    sdb: rw=34817, sector=41942784, nr_sectors = 256 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: dmcrypt_write/2: attempt to access beyond end of device
                                                    sdb: rw=34817, sector=4096, nr_sectors = 256 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: dmcrypt_write/2: attempt to access beyond end of device
                                                    sdb: rw=34817, sector=4096, nr_sectors = 1 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: dmcrypt_write/2: attempt to access beyond end of device
                                                    sdb: rw=1, sector=20973608, nr_sectors = 2048 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: dmcrypt_write/2: attempt to access beyond end of device
                                                    sdb: rw=1, sector=20975656, nr_sectors = 2048 limit=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: E0402 11:46:02.116819    1480 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.hetzner.cloud^100327067 podName: nodeName:}" failed. No retries permitted until 2025-04-02 11:46:02.616793886 +0000 UTC m=+4926.399117456 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "pvc-f478066b-1e87-4572-9dd3-32da9985f75d" (UniqueName: "kubernetes.io/csi/csi.hetzner.cloud^100327067") pod "db-57fd8cc95d-ppdhr" (UID: "9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b") : rpc error: code = Internal desc = failed to publish volume: format of disk "/dev/mapper/scsi-0HC_Volume_100327067" failed: type:("xfs") target:("/var/lib/kubelet/pods/9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b/volumes/kubernetes.io~csi/pvc-f478066b-1e87-4572-9dd3-32da9985f75d/mount") options:("defaults") errcode:(exit status 1) output:(Parameters parsed from config file /usr/share/xfsprogs/mkfs/lts_4.19.conf successfully
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: error reading existing superblock: I/O error
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: mkfs.xfs: pwrite failed: I/O error
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: libxfs_bwrite: write failed on (unknown) bno 0x27fef00/0x100, err=5
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: mkfs.xfs: Releasing dirty buffer to free list!
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: found dirty buffer (bulk) on free list!
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: mkfs.xfs: pwrite failed: I/O error
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: mkfs.xfs: Releasing dirty buffer to free list!
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: found dirty buffer (bulk) on free list!
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: mkfs.xfs: pwrite failed: I/O error
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: mkfs.xfs: Releasing dirty buffer to free list!
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: mkfs.xfs: libxfs_device_zero write failed: I/O error
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: meta-data=/dev/mapper/scsi-0HC_Volume_100327067 isize=512    agcount=4, agsize=1310592 blks
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]:          =                       sectsz=512   attr=2, projid32bit=1
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]:          =                       crc=1        finobt=1, sparse=1, rmapbt=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]:          =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]:          =                       exchange=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: data     =                       bsize=4096   blocks=5242368, imaxpct=25
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]:          =                       sunit=0      swidth=0 blks
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: log      =internal log           bsize=4096   blocks=16384, version=2
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]:          =                       sectsz=512   sunit=0 blks, lazy-count=1
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: realtime =none                   extsz=4096   blocks=0, rtextents=0
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: )
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:02.657138    1480 operation_generator.go:538] "MountVolume.WaitForAttach entering for volume \"pvc-f478066b-1e87-4572-9dd3-32da9985f75d\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^100327067\") pod \"db-57fd8cc95d-ppdhr\" (UID: \"9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b\") DevicePath \"csi-4d085733a3b94c6963c384fc65fe08188edb788e9e3518f8cbd5e4a04516b313\"" pod="test/db-57fd8cc95d-ppdhr"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: I0402 11:46:02.664760    1480 operation_generator.go:548] "MountVolume.WaitForAttach succeeded for volume \"pvc-f478066b-1e87-4572-9dd3-32da9985f75d\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^100327067\") pod \"db-57fd8cc95d-ppdhr\" (UID: \"9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b\") DevicePath \"csi-4d085733a3b94c6963c384fc65fe08188edb788e9e3518f8cbd5e4a04516b313\"" pod="test/db-57fd8cc95d-ppdhr"
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kernel: Buffer I/O error on dev dm-0, logical block 5242352, async page read
Apr 02 11:46:02 k8s-worker-2d67845a20e2271b kubelet[1480]: E0402 11:46:02.717019    1480 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.hetzner.cloud^100327067 podName: nodeName:}" failed. No retries permitted until 2025-04-02 11:46:03.716995522 +0000 UTC m=+4927.499319092 (durationBeforeRetry 1s). Error: MountVolume.SetUp failed for volume "pvc-f478066b-1e87-4572-9dd3-32da9985f75d" (UniqueName: "kubernetes.io/csi/csi.hetzner.cloud^100327067") pod "db-57fd8cc95d-ppdhr" (UID: "9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b") : rpc error: code = Internal desc = failed to publish volume: format of disk "/dev/mapper/scsi-0HC_Volume_100327067" failed: type:("xfs") target:("/var/lib/kubelet/pods/9c69a551-ba03-4bd6-ae7e-5cbe7adf5d7b/volumes/kubernetes.io~csi/pvc-f478066b-1e87-4572-9dd3-32da9985f75d/mount") options:("defaults") errcode:(exit status 1) output:(Parameters parsed from config file /usr/share/xfsprogs/mkfs/lts_4.19.conf successfully
hcloud csi node (destination node)
time=2025-04-02T11:15:53.995Z level=INFO source=/home/runner/work/csi-driver/csi-driver/internal/volumes/mount.go:125 msg="publishing volume" component=linux-mount-service target-path=/var/lib/kubelet/pods/ec66598d-e63d-4f2b-9631-ae29db466128/volumes/kubernetes.io~csi/pvc-f478066b-1e87-4572-9dd3-32da9985f75d/mount device-path=/dev/mapper/scsi-0HC_Volume_100327067 fs-type=xfs block-volume=false readonly=false mount-options="" encrypted=true
I0402 11:15:53.998597       1 mount_linux.go:618] Disk "/dev/mapper/scsi-0HC_Volume_100327067" appears to be unformatted, attempting to format as type: "xfs" with options: [-c options=/usr/share/xfsprogs/mkfs/lts_4.19.conf -f /dev/mapper/scsi-0HC_Volume_100327067]
E0402 11:15:54.014943       1 mount_linux.go:625] format of disk "/dev/mapper/scsi-0HC_Volume_100327067" failed: type:("xfs") target:("/var/lib/kubelet/pods/ec66598d-e63d-4f2b-9631-ae29db466128/volumes/kubernetes.io~csi/pvc-f478066b-1e87-4572-9dd3-32da9985f75d/mount") options:("defaults") errcode:(exit status 1) output:(Parameters parsed from config file /usr/share/xfsprogs/mkfs/lts_4.19.conf successfully
error reading existing superblock: I/O error
mkfs.xfs: pwrite failed: I/O error
libxfs_bwrite: write failed on (unknown) bno 0x27fef00/0x100, err=5
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: I/O error
libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: I/O error
libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5
mkfs.xfs: Releasing dirty buffer to free list!
mkfs.xfs: libxfs_device_zero write failed: I/O error
time=2025-04-02T11:15:54.014Z level=ERROR source=/home/runner/work/csi-driver/csi-driver/internal/app/app.go:276 msg="handler failed" err="rpc error: code = Internal desc = failed to publish volume: format of disk \"/dev/mapper/scsi-0HC_Volume_100327067\" failed: type:(\"xfs\") target:(\"/var/lib/kubelet/pods/ec66598d-e63d-4f2b-9631-ae29db466128/volumes/kubernetes.io~csi/pvc-f478066b-1e87-4572-9dd3-32da9985f75d/mount\") options:(\"defaults\") errcode:(exit status 1) output:(Parameters parsed from config file /usr/share/xfsprogs/mkfs/lts_4.19.conf successfully\nerror reading existing superblock: I/O error\nmkfs.xfs: pwrite failed: I/O error\nlibxfs_bwrite: write failed on (unknown) bno 0x27fef00/0x100, err=5\nmkfs.xfs: Releasing dirty buffer to free list!\nfound dirty buffer (bulk) on free list!\nmkfs.xfs: pwrite failed: I/O error\nlibxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5\nmkfs.xfs: Releasing dirty buffer to free list!\nfound dirty buffer (bulk) on free list!\nmkfs.xfs: pwrite failed: I/O error\nlibxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5\nmkfs.xfs: Releasing dirty buffer to free list!\nmkfs.xfs: libxfs_device_zero write failed: I/O error\nmeta-data=/dev/mapper/scsi-0HC_Volume_100327067 isize=512    agcount=4, agsize=1310592 blks\n         =                       sectsz=512   attr=2, projid32bit=1\n         =                       crc=1        finobt=1, sparse=1, rmapbt=0\n         =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0\n         =                       exchange=0  \ndata     =                       bsize=4096   blocks=5242368, imaxpct=25\n         =                       sunit=0      swidth=0 blks\nnaming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0\nlog      =internal log           bsize=4096   blocks=16384, version=2\n         =                       sectsz=512   sunit=0 blks, lazy-count=1\nrealtime =none                   extsz=4096   blocks=0, rtextents=0\n) "
meta-data=/dev/mapper/scsi-0HC_Volume_100327067 isize=512    agcount=4, agsize=1310592 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
         =                       exchange=0  
data     =                       bsize=4096   blocks=5242368, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
) 

Minimal working example

  • Vanilla k8s 1.31.7 (reproducible with 1.30.x, too)
  • At least 2 worker nodes
  • Worker nodes running Ubuntu 24.04.2 LTS (reproducible with Ubuntu 22.04 and Kernel 5.15.0-136-generic, too)
  • Linux Kernel 6.8.0-57-generic
  • containerd 2.0.4
  • hcloud-csi 2.13.0 Helm Chart
  • LUKS-encrypted volume formatted with XFS

For example, deploy the official MongoDB image v7 from Docker Hub:

k8s manifests
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hcloud-encrypted-volume-xfs
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: csi.hetzner.cloud
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/node-publish-secret-name: hcloud-csi-encryption
  csi.storage.k8s.io/node-publish-secret-namespace: kube-system
  fsType: xfs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: test
  name: db
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: hcloud-encrypted-volume-xfs
  volumeMode: Filesystem
---
kind: Deployment
apiVersion: apps/v1
metadata:
  namespace: test
  name: db
  labels:
    app.kubernetes.io/name: db
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app.kubernetes.io/name: db
  template:
    metadata:
      labels:
        app.kubernetes.io/name: db
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: db
      containers:
        - name: db
          image: mongo:7.0.18
          imagePullPolicy: Always
          args:
            - mongod
            - --auth
            - --wiredTigerCacheSizeGB
            - '0.25'
          volumeMounts:
            - name: data
              mountPath: /data/db
          env:
            - name: MONGO_INITDB_ROOT_USERNAME
              value: root
            - name: MONGO_INITDB_ROOT_PASSWORD
              value: secret
            - name: MONGO_INITDB_DATABASE
              value: test
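
Note that the StorageClass above references a node-publish secret named hcloud-csi-encryption in kube-system, which is not part of these manifests. A minimal sketch for creating it and deploying the example follows; the secret key name encryption-passphrase is an assumption (check the csi-driver encryption docs for your version), and manifests.yaml is a placeholder file name:

kubectl create namespace test
kubectl -n kube-system create secret generic hcloud-csi-encryption \
  --from-literal=encryption-passphrase='<your-passphrase>'
kubectl apply -f manifests.yaml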

Additional information

Dirty workaround: rebooting the worker node resolves the issue.

bheisig added the bug (Something isn't working) label Apr 2, 2025
lukasmetzner (Contributor) commented:

Hey,
thank you for the detailed report of your issue. The people mainly responsible for this project are at KubeCon this week. I will investigate this issue next week.

Best Regards,
Lukas

lukasmetzner (Contributor) commented:

Hey,
I currently cannot reproduce your issue. Do you have some more information about your environment that you could share with me? Did you also try to reproduce the issue in a fresh testing cluster? How was the pod moved? Did you drain the node?

Could you provide some details about the xfs configuration of your volume? You can get this information by running the following command on the node where your volume is successfully mounted:

xfs_info <target-path> (target-path can be found in the logs of the DaemonSet)
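
For example, something like the following should surface the target path (a sketch; the namespace, DaemonSet, and container names are taken from the logs and commands elsewhere in this issue, adjust them to your install):

kubectl -n hcloud-csi logs daemonset/hcloud-csi-node -c hcloud-csi-driver | grep target-path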

Best Regards
Lukas


bheisig commented Apr 11, 2025

Hi.

This is directly from the affected worker node after a reboot (see my first post about the "dirty workaround"):

# xfs_info /var/lib/kubelet/pods/1898c74f-6cd1-4859-815a-e0d2fc3e5cc7/volumes/kubernetes.io~csi/pvc-9a591f1f-c60f-4a93-8fbe-91a8196f2e17/mount
meta-data=/dev/mapper/scsi-0HC_Volume_102191764 isize=512    agcount=4, agsize=655232 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0 nrext64=0
data     =                       bsize=4096   blocks=2620928, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I don't know why, but I can't run it inside the DaemonSet pod:

# kubectl -n hcloud-csi exec -it hcloud-csi-node-rcrw9 --container=hcloud-csi-driver -- xfs_info /var/lib/kubelet/pods/1898c74f-6cd1-4859-815a-e0d2fc3e5cc7/volumes/kubernetes.io~csi/pvc-9a591f1f-c60f-4a93-8fbe-91a8196f2e17/mount
xfs_info: cannot open /var/lib/kubelet/pods/1898c74f-6cd1-4859-815a-e0d2fc3e5cc7/volumes/kubernetes.io~csi/pvc-9a591f1f-c60f-4a93-8fbe-91a8196f2e17/mount: Is a directory
command terminated with exit code 1

It's sufficient to cordon a worker node and delete a pod to reproduce the issue. There's no need to drain a node.
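
In other words, something like this is enough (a sketch; node and pod names are placeholders):

kubectl cordon <node-currently-running-the-pod>
kubectl -n test delete pod <db-pod>
# the replacement pod is scheduled on another node and hangs in ContainerCreating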

Sorry, I haven't found the time to reproduce this issue on a fresh cluster. Is there any other specific information I can provide?


lukasmetzner commented Apr 14, 2025

Hey,

does this only occur with a specific volume? If not, does it also occur with newly created volumes? Does this issue only happen when the pod/volume is scheduled on a specific node of yours, or does it happen regardless of the node?

Is your volume functioning normally after applying the workaround? Your dmesg (source) logs also don't appear to indicate a healthy state.

If this only affects a specific volume, we could try to investigate any file system damage. To do this, delete the pod and wait for the volume to be detached from any server. In the Cloud Console, mount the volume to a server of your choice and choose manual as the mount option. Run fdisk -l on the node, and you should be able to see your volume. Now open the LUKS device via cryptsetup open /dev/xxx volume-debugging (note: you need to enter your passphrase here) and check the XFS file system for damage via xfs_repair -n /dev/mapper/volume-debugging (-n is a no-modify mode that only checks the file system for damage).
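
Condensed into commands (a sketch; /dev/sdc is only an example, take the actual device name from the fdisk output):

fdisk -l                                    # identify the newly attached volume, e.g. /dev/sdc
cryptsetup open /dev/sdc volume-debugging   # prompts for the LUKS passphrase
xfs_repair -n /dev/mapper/volume-debugging  # -n: no-modify mode, only reports damage
cryptsetup close volume-debugging           # close the mapping again when done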


bheisig commented Apr 28, 2025

Hey,

A bunch of questions… I'll try my best to answer them:

  • The problem occurs with a lot of volumes, not just one specific volume. What they have in common is that they are encrypted with LUKS and formatted with XFS.
  • The problem doesn't occur on newly created volumes, but it does occur when we try to reproduce it following the steps mentioned above.
  • It happens on all nodes (Ubuntu 22.04/24.04, k8s 1.30/1.31).
  • After the workaround the volume can be mounted successfully, but it fails again whenever I try to reproduce it.
dmesg from worker node during a failed volume state
[279286.471494] Memory cgroup out of memory: Killed process 1739884 (mongod) total-vm:1013180kB, anon-rss:519196kB, file-rss:85504kB, shmem-rss:0kB, UID:999 pgtables:1880kB oom_score_adj:992
[279286.476209] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod18a7e0f3_f227_4c2f_92ca_79f9f6e47f7c.slice/cri-containerd-5415c4ab7f221d06753bd074fa4b97ab25765032f9357a48da6549f788b013df.scope are going to be killed due to memory.oom.group set
[279286.476219] Memory cgroup out of memory: Killed process 1739884 (mongod) total-vm:1013180kB, anon-rss:519196kB, file-rss:85504kB, shmem-rss:0kB, UID:999 pgtables:1880kB oom_score_adj:992
[281221.024566] XFS (dm-7): Unmounting Filesystem 6e95865a-4710-40a9-a3a6-803eb9e1fe25
[281226.264006] sd 0:0:0:5: [sdl] Synchronizing SCSI cache
[281226.264176] sd 0:0:0:5: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
[281226.264428] sd 0:0:0:5: [sdl] Synchronize Cache(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[281226.264435] sd 0:0:0:5: [sdl] Sense Key : Illegal Request [current] 
[281226.264440] sd 0:0:0:5: [sdl] Add. Sense: Logical unit not supported
[281762.204260] scsi 0:0:0:1: Direct-Access     HC       Volume           2.5+ PQ: 0 ANSI: 5
[281762.206121] sd 0:0:0:1: Attached scsi generic sg14 type 0
[281762.206223] sd 0:0:0:1: Power-on or device reset occurred
[281762.206500] sd 0:0:0:1: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
[281762.206859] sd 0:0:0:1: [sdh] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
[281762.206959] sd 0:0:0:1: [sdh] Write Protect is off
[281762.206968] sd 0:0:0:1: [sdh] Mode Sense: 63 00 00 08
[281762.207880] sd 0:0:0:1: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[281762.213200] sd 0:0:0:1: [sdh] Attached SCSI disk
[281765.485123] blkid: attempt to access beyond end of device
                sdl: rw=524288, sector=20971392, nr_sectors = 8 limit=0
[281765.485145] blkid: attempt to access beyond end of device
                sdl: rw=0, sector=20971392, nr_sectors = 8 limit=0
[281765.485152] Buffer I/O error on dev dm-7, logical block 2620912, async page read
[281765.502166] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=18435, sector=4096, nr_sectors = 2097152 limit=0
[281765.502183] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=2051, sector=2101248, nr_sectors = 2097152 limit=0
[281765.502316] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=2048, sector=4096, nr_sectors = 1 limit=0
[281765.502837] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=20971264, nr_sectors = 256 limit=0
[281765.503317] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=4096, nr_sectors = 256 limit=0
[281765.503564] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=4096, nr_sectors = 1 limit=0
[281765.507210] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=1, sector=10487848, nr_sectors = 2048 limit=0
[281765.509805] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=1, sector=10489896, nr_sectors = 2048 limit=0
[281766.193344] Buffer I/O error on dev dm-7, logical block 2620912, async page read
[281767.301946] Buffer I/O error on dev dm-7, logical block 2620912, async page read
[281769.439216] Buffer I/O error on dev dm-7, logical block 2620912, async page read
[281773.583443] bio_check_eod: 282 callbacks suppressed
[281773.583451] blkid: attempt to access beyond end of device
                sdl: rw=524288, sector=20971392, nr_sectors = 8 limit=0
[281773.583470] blkid: attempt to access beyond end of device
                sdl: rw=0, sector=20971392, nr_sectors = 8 limit=0
[281773.583475] Buffer I/O error on dev dm-7, logical block 2620912, async page read
[281773.593047] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=18435, sector=4096, nr_sectors = 2097152 limit=0
[281773.593061] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=2051, sector=2101248, nr_sectors = 2097152 limit=0
[281773.593187] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=2048, sector=4096, nr_sectors = 1 limit=0
[281773.593657] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=20971264, nr_sectors = 256 limit=0
[281773.594129] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=4096, nr_sectors = 256 limit=0
[281773.594373] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=4096, nr_sectors = 1 limit=0
[281773.598316] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=1, sector=10502184, nr_sectors = 2048 limit=0
[281773.598345] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=1, sector=10504232, nr_sectors = 2048 limit=0
[281781.692859] bio_check_eod: 63 callbacks suppressed
[281781.692866] blkid: attempt to access beyond end of device
                sdl: rw=524288, sector=20971392, nr_sectors = 8 limit=0
[281781.692884] blkid: attempt to access beyond end of device
                sdl: rw=0, sector=20971392, nr_sectors = 8 limit=0
[281781.692890] Buffer I/O error on dev dm-7, logical block 2620912, async page read
[281781.702395] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=18435, sector=4096, nr_sectors = 2097152 limit=0
[281781.702410] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=2051, sector=2101248, nr_sectors = 2097152 limit=0
[281781.702540] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=2048, sector=4096, nr_sectors = 1 limit=0
[281781.702994] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=20971264, nr_sectors = 256 limit=0
[281781.703525] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=4096, nr_sectors = 256 limit=0
[281781.703840] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=4096, nr_sectors = 1 limit=0
[281781.708961] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=1, sector=10491944, nr_sectors = 2048 limit=0
[281781.708990] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=1, sector=10496040, nr_sectors = 2048 limit=0
[281785.385529] eth0: renamed from tmp94819
[281795.388446] eth0: renamed from tmpae1df
[281797.820564] bio_check_eod: 63 callbacks suppressed
[281797.820571] blkid: attempt to access beyond end of device
                sdl: rw=524288, sector=20971392, nr_sectors = 8 limit=0
[281797.820590] blkid: attempt to access beyond end of device
                sdl: rw=0, sector=20971392, nr_sectors = 8 limit=0
[281797.820596] Buffer I/O error on dev dm-7, logical block 2620912, async page read
[281797.829756] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=18435, sector=4096, nr_sectors = 2097152 limit=0
[281797.829801] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=2051, sector=2101248, nr_sectors = 2097152 limit=0
[281797.829929] mkfs.xfs: attempt to access beyond end of device
                sdl: rw=2048, sector=4096, nr_sectors = 1 limit=0
[281797.830381] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=20971264, nr_sectors = 256 limit=0
[281797.830856] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=4096, nr_sectors = 256 limit=0
[281797.831089] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=34817, sector=4096, nr_sectors = 1 limit=0
[281797.833340] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=1, sector=10487848, nr_sectors = 2048 limit=0
[281797.834339] dmcrypt_write/2: attempt to access beyond end of device
                sdl: rw=1, sector=10493992, nr_sectors = 2048 limit=0

We mounted the volume on another non-k8s host and opened it with cryptsetup:

fdisk -l
Disk /dev/sdc: 10 GiB, 10737418240 bytes, 20971520 sectors
Disk model: Volume          
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
xfs_repair -n /dev/mapper/volume-debugging
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 0
        - agno = 1
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.


jooola commented Apr 29, 2025

Found a similar issue with longhorn: longhorn/longhorn#8604

lukasmetzner (Contributor) commented:

Hey,

We have tried to investigate this issue further and have also started discussions with internal teams. So far we still could not reproduce the issue, which makes it quite tough to debug.

Could you maybe provide some details on the history of the affected volumes? Did you do any major upgrades (dist-upgrade) since the day you created them? Did you ever observe any outages of the underlying hosts?

Could you possibly provide some logs from before and after the reboot on the node where the affected volume is currently scheduled?

Are there details of the CSI driver configuration (e.g. Helm values) that you could provide?

If the data in question is non-critical and its loss would not affect you or your organization, you could try using tools like xfs_repair or fsck to repair the filesystem.
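
That would be the same flow as the earlier read-only check, just without -n (a sketch; only attempt this if losing the data is acceptable):

cryptsetup open /dev/<volume-device> volume-debugging
xfs_repair /dev/mapper/volume-debugging     # write mode: actually repairs the file system
cryptsetup close volume-debugging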

Discussions with internal teams are also ongoing; I might come back with some more details if we have any substantial findings.

I sincerely apologize for the inconvenience caused.

Best Regards,
Lukas
