CSI driver doesn't process requests after 1.29 -> 1.30 -> 1.31 upgrade #330

Open
themightychris opened this issue Dec 2, 2024 · 7 comments

@themightychris
Contributor

I've seen this exact behavior on three clusters now, each time after rapidly upgrading from 1.29 to 1.30 to 1.31.

Existing volumes don't get attached and new PVCs don't get fulfilled:

Waiting for a volume to be created either by the external provisioner 'linodebs.csi.linode.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

csi-node-driver-registrar output looks normal:

I1202 02:26:18.510103       1 main.go:150] "Version" version="v1.12.0"
I1202 02:26:18.510609       1 main.go:151] "Running node-driver-registrar" mode=""
I1202 02:26:18.510617       1 main.go:172] "Attempting to open a gRPC connection" csiAddress="/csi/csi.sock"
I1202 02:26:28.511558       1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
I1202 02:26:30.377351       1 main.go:180] "Calling CSI driver to discover driver name"
I1202 02:26:30.416617       1 main.go:189] "CSI driver name" csiDriverName="linodebs.csi.linode.com"
I1202 02:26:30.416821       1 node_register.go:56] "Starting Registration Server" socketPath="/registration/linodebs.csi.linode.com-reg.sock"
I1202 02:26:30.417065       1 node_register.go:66] "Registration Server started" socketPath="/registration/linodebs.csi.linode.com-reg.sock"
I1202 02:26:30.417367       1 node_register.go:96] "Skipping HTTP server"
I1202 02:26:30.757273       1 main.go:96] "Received GetInfo call" request="&InfoRequest{}"
I1202 02:26:33.917898       1 main.go:108] "Received NotifyRegistrationStatus call" status="&RegistrationStatus{PluginRegistered:true,Error:,}"

And so does csi-linode-plugin (other than that it's not doing anything):

I1202 02:26:29.216741       1 main.go:80] "maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined" component="maxprocs" version="1.5.2"
I1202 02:26:29.216908       1 driver.go:65] "Creating LinodeDriver" method="GetLinodeDriver" traceID="21427321-1031-440a-a49d-df86f7127b00"
I1202 02:26:29.216945       1 driver.go:71] "LinodeDriver created successfully" method="GetLinodeDriver" traceID="21427321-1031-440a-a49d-df86f7127b00"
I1202 02:26:29.218367       1 metadata.go:32] "Processing request"
I1202 02:26:29.723572       1 metadata.go:61] "Successfully completed"
I1202 02:26:29.723608       1 driver.go:89] "Setting up LinodeDriver" method="SetupLinodeDriver" traceID="dc7b1bdd-b539-414f-b6aa-37abccc06e48"
I1202 02:26:29.723662       1 driver.go:111] "Setting up RPC Servers" method="SetupLinodeDriver" traceID="dc7b1bdd-b539-414f-b6aa-37abccc06e48"
I1202 02:26:29.723677       1 driver.go:128] "LinodeDriver setup completed successfully" method="SetupLinodeDriver" traceID="dc7b1bdd-b539-414f-b6aa-37abccc06e48"
I1202 02:26:29.723693       1 driver.go:157] "Starting LinodeDriver" method="Run" traceID="caf7e37a-b589-4591-b28c-ab2b0243e427" name="linodebs.csi.linode.com"
I1202 02:26:29.723701       1 driver.go:166] "Starting non-blocking GRPC server" method="Run" traceID="caf7e37a-b589-4591-b28c-ab2b0243e427"
I1202 02:26:29.723717       1 driver.go:169] "GRPC server started successfully" method="Run" traceID="caf7e37a-b589-4591-b28c-ab2b0243e427"
I1202 02:26:30.378766       1 identityserver.go:62] "Processing request" method="GetPluginInfo" traceID="fc8e6860-be40-4b1f-8671-354adfcb96f5"
I1202 02:26:30.762217       1 nodeserver.go:325] "Processing request" method="NodeGetInfo" traceID="ba55b961-3877-496e-8c1a-171c62d2bb9f" req=""
I1202 02:26:30.885347       1 nodeserver.go:342] "Successfully completed" method="NodeGetInfo" traceID="ba55b961-3877-496e-8c1a-171c62d2bb9f"
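
For anyone comparing their own logs: a minimal Python sketch that tallies the `method="..."` fields in output like the above, to confirm that only startup and info calls arrive and no volume RPCs do. The RPC names in `VOLUME_RPCS` come from the CSI spec; that the driver logs them in this same `method="..."` form is an assumption, and `count_methods` / `saw_volume_requests` are hypothetical helper names.

```python
import re
from collections import Counter

# Matches the method="..." key/value pair seen in the plugin log lines.
METHOD_RE = re.compile(r'\bmethod="([^"]+)"')

# CSI spec RPC names that indicate actual volume work.
# Assumption: the driver logs these under the same method="..." key.
VOLUME_RPCS = {
    "CreateVolume",
    "DeleteVolume",
    "ControllerPublishVolume",
    "ControllerUnpublishVolume",
    "NodeStageVolume",
    "NodePublishVolume",
}


def count_methods(log_text: str) -> Counter:
    """Count how often each method="..." value appears in the log text."""
    return Counter(METHOD_RE.findall(log_text))


def saw_volume_requests(log_text: str) -> bool:
    """Return True if any volume-related CSI RPC shows up in the log text."""
    return any(name in VOLUME_RPCS for name in count_methods(log_text))


# Three lines taken from the log output in this issue.
sample = '''
I1202 02:26:30.378766       1 identityserver.go:62] "Processing request" method="GetPluginInfo" traceID="fc8e6860-be40-4b1f-8671-354adfcb96f5"
I1202 02:26:30.762217       1 nodeserver.go:325] "Processing request" method="NodeGetInfo" traceID="ba55b961-3877-496e-8c1a-171c62d2bb9f" req=""
I1202 02:26:30.885347       1 nodeserver.go:342] "Successfully completed" method="NodeGetInfo" traceID="ba55b961-3877-496e-8c1a-171c62d2bb9f"
'''

counts = count_methods(sample)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
print("volume requests seen:", saw_volume_requests(sample))
```

If this reports no volume requests while PVCs are pending, the node plugin itself may be fine and the problem may sit upstream of it.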

@komer3
Contributor

komer3 commented Dec 3, 2024

Hey @themightychris! Thanks for bringing this issue to our attention.

Could you provide some more information about what you are experiencing? Are you unable to provision new volumes at all when creating new PVCs? Also, were the existing volumes created by the CSI driver?

Could you also provide the event logs for when you try to use existing volumes or create new PVCs? This information would help in debugging this.

@themightychris
Contributor Author

themightychris commented Dec 3, 2024

I found the issue: the upgrade process added a csi-linode-plugin container spec to the csi-linode-controller StatefulSet but failed to fully remove what I'm guessing is the deprecated linode-csi-plugin container spec. The old container spec was left behind with no image defined, creating an invalid spec that blocked the whole StatefulSet from coming online.

I've seen this on three clusters now that I've rapidly upgraded from 1.29 to 1.30 to 1.31. That's every cluster I've upgraded this way, so it certainly seems to be a flaw in the upgrade process.

Healthy 1.31 clusters don't have this linode-csi-plugin container in this StatefulSet. I've manually deleted it and everything is coming back online now.
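
For anyone who wants to check their own cluster for the same leftover, here is a minimal Python sketch that scans a StatefulSet manifest (as parsed JSON) for container specs with no image. The function name `containers_without_image` is hypothetical, and the sample manifest is an illustrative reconstruction: the image value is a placeholder, not the real image.

```python
import json


def containers_without_image(workload: dict) -> list[str]:
    """Return names of containers in the pod template that have no image set."""
    pod_spec = workload.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key) or []:
            image = container.get("image") or ""
            if not image.strip():
                missing.append(container.get("name", "<unnamed>"))
    return missing


# Illustrative manifest mirroring the situation described above.
# The image string is a placeholder, not the actual image reference.
sample_manifest = json.loads("""
{
  "kind": "StatefulSet",
  "metadata": {"name": "csi-linode-controller"},
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {"name": "csi-linode-plugin", "image": "example.invalid/placeholder:tag"},
          {"name": "linode-csi-plugin"}
        ]
      }
    }
  }
}
""")

leftovers = containers_without_image(sample_manifest)
print("containers with no image:", leftovers)
```

To run it against a real cluster, the manifest could come from `kubectl get statefulset csi-linode-controller -n kube-system -o json`; the `kube-system` namespace is an assumption here, so adjust it to wherever the controller lives in your cluster.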

@komer3
Contributor

komer3 commented Dec 3, 2024

Btw, are you using an LKE cluster?

@themightychris
Contributor Author

Yes

@komer3
Contributor

komer3 commented Dec 4, 2024

We found what was causing the issue and have a fix for it. Thank you for bringing this up. Going forward, you shouldn't see this once the fix is rolled out!

@komer3 komer3 closed this as completed Dec 4, 2024
@komer3 komer3 reopened this Dec 4, 2024
@themightychris
Contributor Author

@komer3 awesome, thanks! When might the fix be rolled out? I have a bunch more clusters to upgrade down the same path, so I can wait until it's rolled out and then verify for you.

@komer3
Contributor

komer3 commented Dec 5, 2024

It's scheduled for rollout on Monday (Dec 9). Please let us know if you still see the invalid container spec issue after the rollout.
