Skip to content

Fix panic when OpenStack server is deleted by an external agent #2475

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

mdbooth
Copy link
Contributor

@mdbooth mdbooth commented Mar 11, 2025

The panic happened because we did not correctly handle that GetInstanceStatus returns a nil server if the server does not exist, rather than a 404 error.

We had the same oversight in the OpenStackServer controller. It looks like this would have resulted in recreating the server.

Fixes: #2474

/hold

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Mar 11, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 11, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 11, 2025
Copy link

netlify bot commented Mar 11, 2025

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
🔨 Latest commit 4a2a8b7
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/67d2024290f06b0008748a2d
😎 Deploy Preview https://deploy-preview-2475--kubernetes-sigs-cluster-api-openstack.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mdbooth
Copy link
Contributor Author

mdbooth commented Mar 11, 2025

@EmilienM Can you take a look at this? I'm slightly uncomfortable with the interactions between the machine and server controllers here. Specifically the machine controller is re-litigating the value of openStackServer.Status.InstanceState in reconcileMachineState. The problem with that is that in this specific case that value is meaningless1 because it refers to the last observed state before it was deleted. We should probably just copy the relevant conditions from server to machine, and let the server controller decide what the status means.

I was also wondering if there's an easy way to add an e2e test for this. What do you think?

Footnotes

  1. Incidentally, if we set the status in an SSA transaction InstanceState would have become unset, which would make more sense.

@mdbooth
Copy link
Contributor Author

mdbooth commented Mar 11, 2025

/test all

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 11, 2025
@EmilienM
Copy link
Contributor

/test all

@EmilienM EmilienM force-pushed the panic-on-server-deletion branch from 4e4ae5b to 22c9b36 Compare March 11, 2025 19:47
@EmilienM
Copy link
Contributor

/test all

@EmilienM
Copy link
Contributor

@EmilienM Can you take a look at this? I'm slightly uncomfortable with the interactions between the machine and server controllers here. Specifically the machine controller is re-litigating the value of openStackServer.Status.InstanceState in reconcileMachineState. The problem with that is that in this specific case that value is meaningless1 because it refers to the last observed state before it was deleted. We should probably just copy the relevant conditions from server to machine, and let the server controller decide what the status means.

I was also wondering if there's an easy way to add an e2e test for this. What do you think?

Footnotes

1. Incidentally, if we set the status in an SSA transaction InstanceState would have become unset, which would make more sense. [↩](#user-content-fnref-1-1bbddb0f33f84398ef2d178bc40de289)

as I said on Slack, this PR looks good to me, we probably want to sync the conditions at some point.

wantErr: false,
wantServer: &servers.Server{
ID: instanceUUID,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The asymmetry of not setting InstanceReadyCondition here is weird, but it's weirdly asymmetric in the function under test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know! The condition is set later after getOrCreateServer is called 😢
I'll propose a move of that in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mhh actually we reconcile until the server has been created and then we check its state later in the switch:

https://github.com/shiftstack/cluster-api-provider-openstack/blob/22c9b36db47a84b4aa9a0a0c1d1d90f8f7418287/controllers/openstackserver_controller.go#L367-L392

@EmilienM EmilienM force-pushed the panic-on-server-deletion branch from 22c9b36 to 281f5b2 Compare March 12, 2025 18:45
@EmilienM EmilienM marked this pull request as ready for review March 12, 2025 18:45
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 12, 2025
@EmilienM
Copy link
Contributor

I had a look at copying Conditions from OpenStackServer and it's a can of 🪱 to be honest, I suggest we tackle this later.

@k8s-ci-robot
Copy link
Contributor

@EmilienM: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-openstack-build
/test pull-cluster-api-provider-openstack-e2e-test
/test pull-cluster-api-provider-openstack-test

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-openstack-conformance-test
/test pull-cluster-api-provider-openstack-e2e-full-test

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-openstack-build
pull-cluster-api-provider-openstack-e2e-test
pull-cluster-api-provider-openstack-test

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@EmilienM
Copy link
Contributor

/test pull-cluster-api-provider-openstack-e2e-full-test

@EmilienM EmilienM force-pushed the panic-on-server-deletion branch 3 times, most recently from 2049225 to 2bfe7a2 Compare March 12, 2025 19:33
The panic happened because we did not correctly handle that
GetInstanceStatus returns a nil server if the server does not exist,
rather than a 404 error.

We had the same oversight in the OpenStackServer controller. It looks
like this would have resulted in recreating the server.
@EmilienM EmilienM force-pushed the panic-on-server-deletion branch from 2bfe7a2 to 4a2a8b7 Compare March 12, 2025 21:53
@EmilienM
Copy link
Contributor

@mdbooth I think it worked: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/2475/pull-cluster-api-provider-openstack-e2e-test/1899941891034583040/artifacts/clusters/bootstrap/resources/e2e-4g1zo5/OpenStackMachine/cluster-e2e-4g1zo5-control-plane-9g925.yaml

status:
  addresses:
  - address: 10.6.0.211
    type: InternalIP
  - address: cluster-e2e-4g1zo5-control-plane-9g925
    type: InternalDNS
  conditions:
  - lastTransitionTime: "2025-03-12T22:51:08Z"
    message: server has been unexpectedly deleted
    reason: InstanceDeleted
    severity: Error
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-03-12T22:47:14Z"
    status: "True"
    type: APIServerIngressReadyCondition
  - lastTransitionTime: "2025-03-12T22:51:08Z"
    message: server has been unexpectedly deleted
    reason: InstanceDeleted
    severity: Error
    status: "False"
    type: InstanceReady
  failureMessage: server has been unexpectedly deleted
  failureReason: UpdateError
  instanceID: 25be8125-214b-473a-8f02-9b3ad44d424b
  ready: true

@EmilienM
Copy link
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 12, 2025
@EmilienM
Copy link
Contributor

/test pull-cluster-api-provider-openstack-e2e-full-test
just in case

Copy link
Contributor

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2025
@EmilienM
Copy link
Contributor

/cherry-pick release-0.12

@k8s-infra-cherrypick-robot

@EmilienM: once the present PR merges, I will cherry-pick it on top of release-0.12 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@EmilienM
Copy link
Contributor

/lgtm

I'll let Matt approve it.

@mdbooth
Copy link
Contributor Author

mdbooth commented Mar 13, 2025

Thanks!

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 170dc4c into kubernetes-sigs:main Mar 13, 2025
10 checks passed
@github-project-automation github-project-automation bot moved this from Inbox to Done in CAPO Roadmap Mar 13, 2025
@k8s-infra-cherrypick-robot

@EmilienM: new pull request created: #2477

In response to this:

/cherry-pick release-0.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mnaser
Copy link
Contributor

mnaser commented Apr 7, 2025

/cherry-pick release-0.11

@k8s-infra-cherrypick-robot

@mnaser: new pull request created: #2507

In response to this:

/cherry-pick release-0.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Panic after Deleting VM from Horizon
6 participants