Encrypted XFS volume fails to be mounted, pod stuck at ContainerCreating
#923
Comments
Hey, Best Regards,
Hey,
Could you provide some details about the XFS configuration of your volume? You can get this information by running the following command on the node where your volume is successfully mounted:
Best Regards
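The command itself is not shown above; for an XFS volume it is presumably something along the lines of `xfs_info`. A minimal sketch, assuming the volume is mounted at a known mount point (the path below is a placeholder):

```sh
# Placeholder mount point; substitute the path where the volume is actually mounted.
MOUNTPOINT=/mnt/encrypted-volume

# Print the XFS configuration (block size, CRC/reflink flags, log size, ...).
xfs_info "$MOUNTPOINT"
```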
Hi. This is directly from the affected worker node after a reboot (see my first post about the "dirty workaround"):
I don't know why, but I can't run it on the
It's sufficient to cordon a worker node and delete a pod to reproduce the issue. There's no need to drain a node. Sorry, I haven't found the time to reproduce this issue on a fresh cluster. Is there any other specific information I can provide?
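A minimal reproduction along those lines might look like this (node, pod, and namespace names are placeholders):

```sh
# Keep new pods off the node that currently hosts the workload.
kubectl cordon <worker-node>

# Delete the pod so it gets rescheduled onto another node;
# the volume then has to be detached and re-attached there.
kubectl delete pod <pod-name> -n <namespace>

# Watch the replacement pod; in the failing case it stays in ContainerCreating.
kubectl get pods -n <namespace> -w
```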
Hey,
Does this only occur with a specific volume? If not, does it also occur with newly created volumes? Does this issue only happen if the pod/volume are being scheduled on a specific node of yours, or does it happen regardless of the node? Is your volume functioning normally after applying the workaround?
Your dmesg (source) logs also don't appear to indicate a healthy state. If this only affects a specific volume, we could try to investigate any file system damage. In order to do this, delete the pod and wait for the volume to be detached from any server. In the Cloud Console, mount the volume to a server of your choice and choose manual as a mount option. Run
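The exact command after "Run" is not shown above; for a LUKS-encrypted XFS volume, a plausible read-only check (an assumption, not necessarily the exact command the maintainer had in mind) would be:

```sh
# Open the LUKS container on the server the volume was manually attached to
# (the device path and mapper name are placeholders).
cryptsetup luksOpen /dev/disk/by-id/scsi-0HC_Volume_<id> volume-debugging

# Check the XFS file system without modifying it (-n = no-modify mode).
xfs_repair -n /dev/mapper/volume-debugging

# Close the mapping again when done.
cryptsetup luksClose volume-debugging
```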
Hey, a bunch of questions… I'll try my best to answer them:
dmesg from worker node during a failed volume state
We mounted the volume on another non-k8s host and opened it with cryptsetup:
`fdisk -l`
`xfs_repair -n /dev/mapper/volume-debugging`
Found a similar issue with Longhorn: longhorn/longhorn#8604
Hey,
We have tried to investigate this issue further and also started discussions with internal teams. So far we still could not reproduce the issue, which makes it quite tough to debug.
Could you maybe provide some details on the history of the affected volumes? Did you do any major upgrades (dist-upgrade) since the day you created them? Did you ever observe any outages of the underlying hosts? Could you possibly provide some logs from before and after the reboot on the node where the affected volume is currently scheduled? Are there details of the CSI driver configuration (e.g. Helm values) that you could provide?
If the data in question is non-critical and its loss would not affect you or your organization, you could try using tools like
Discussions with internal teams are also ongoing; I might come back with some more details if we have any substantial findings.
I sincerely apologize for the inconvenience caused.
Best Regards,
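For anyone following along, gathering the requested configuration and reboot-window logs could look roughly like this (the release name and namespace are assumptions and depend on how the chart was installed):

```sh
# User-supplied Helm values of the CSI driver release.
helm get values hcloud-csi -n kube-system

# On the affected worker node: journal of the boot before the reboot
# and of the current boot.
journalctl -b -1 > journal-before-reboot.txt
journalctl -b 0  > journal-after-reboot.txt
```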
TL;DR
When moving a pod from one worker node to another, the attached volume fails to be mounted. The volume is encrypted and formatted with XFS.
Expected behavior
When moving a pod to another node, the attached volume should also be moved, attached, and mounted properly.
Observed behavior
It worked fine a couple of months ago, but since then this behavior has occurred on 3 different k8s clusters.
The pod gets stuck in state `ContainerCreating`.
k8s event log
dmesg on k8s worker node (source)
dmesg on k8s worker node (destination)
journal on k8s worker node (destination)
hcloud csi node (destination node)
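For reference, the kinds of logs listed above can typically be collected like this (the label selector for the CSI node pods is an assumption and may differ per installation):

```sh
# Events for the stuck pod and its namespace.
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp

# Kernel and kubelet logs on the source/destination worker nodes.
dmesg -T
journalctl -u kubelet --since "1 hour ago"

# Logs of the hcloud CSI node plugin pods (label selector is an assumption).
kubectl logs -n kube-system -l app=hcloud-csi --all-containers=true
```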
Minimal working example
For example, deploy the official MongoDB image v7 from Docker Hub:
k8s manifests
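A minimal illustrative sketch of such manifests (the StorageClass name `hcloud-volumes-encrypted` and all resource names are assumptions, not the reporter's actual manifests) could look like this:

```sh
# Create a PVC backed by an encrypted, XFS-formatted StorageClass of the
# hcloud CSI driver, plus a single-replica MongoDB 7 deployment using it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: hcloud-volumes-encrypted   # assumed encrypted/XFS StorageClass
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mongodb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongodb
          image: mongo:7          # official MongoDB image v7 from Docker Hub
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: data
              mountPath: /data/db
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: mongodb-data
EOF
```

Moving this pod to another node (e.g. by cordoning its current node and deleting it) then triggers the described mount failure.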
Additional information
Dirty workaround: rebooting the worker node resolves the issue.
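Roughly, the workaround amounts to the following (the node name is a placeholder; cordoning first is optional but avoids churn while the node goes down):

```sh
# Keep new pods off the node, reboot it, then allow scheduling again.
kubectl cordon <worker-node>
ssh <worker-node> sudo reboot
# ...wait until the node reports Ready again...
kubectl uncordon <worker-node>
```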