MGMT-19878: Update control plane status for upgrade #90
base: master
Conversation
Modifies status.ready to false when the upgrade starts and changes it back to true when the upgrade completes. Sets updated replicas to the number of control plane nodes that are ready once the upgrade completes. These statuses are checked by Sylva when running an upgrade.
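A minimal sketch of that flow, using a stand-in struct since the real status type lives in the provider's v1alpha2 API (the field names are taken from the description; everything else is illustrative):

```go
package main

import "fmt"

// Stand-in for the control plane status fields discussed in this PR; the
// real fields live on the v1alpha2 OpenshiftAssistedControlPlane type.
type controlPlaneStatus struct {
	Ready           bool
	UpdatedReplicas int32
}

// setUpgradeStatus mirrors the described behavior: Ready flips to false
// while the upgrade runs, then back to true with UpdatedReplicas set to
// the number of control plane nodes that are ready afterwards.
func setUpgradeStatus(status *controlPlaneStatus, upgradeInProgress bool, readyNodes int32) {
	if upgradeInProgress {
		status.Ready = false // Sylva reads this to detect that the upgrade started
		return
	}
	status.Ready = true // ...and this to detect that it finished
	status.UpdatedReplicas = readyNodes
}

func main() {
	s := controlPlaneStatus{Ready: true, UpdatedReplicas: 3}
	setUpgradeStatus(&s, true, 0)
	fmt.Printf("during upgrade: %+v\n", s)
	setUpgradeStatus(&s, false, 3)
	fmt.Printf("after upgrade:  %+v\n", s)
}
```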
@CrystalChun: This pull request references MGMT-19878, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: expected the sub-task to target the "4.19.0" version, but no target version was set.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: CrystalChun.
Do we really want to set the ready status to false? The ready status should inform about the control plane's readiness to receive requests, which should be unaffected during upgrades (and if it is affected, it shouldn't be solely because an upgrade is running, but because some component is down).
```diff
@@ -246,6 +247,7 @@ func (r *OpenshiftAssistedControlPlaneReconciler) upgradeWorkloadCluster(ctx con
 	oacp.Status.DistributionVersion, err = upgrader.GetCurrentVersion(ctx)
 	if err != nil {
 		log.V(logutil.WarningLevel).Info("failed to get OpenShift version from ClusterVersion", "error", err.Error())
+		return ctrl.Result{}, err
```
Why return here? Even without the current version we should be able to perform an upgrade.
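A sketch of the non-fatal handling this comment suggests, reusing the identifiers from the diff above rather than a self-contained program:

```go
// Reviewer's suggestion: log the failure but keep going, since the
// upgrade itself does not need the current version to proceed.
version, err := upgrader.GetCurrentVersion(ctx)
if err != nil {
	// Non-fatal: continue the upgrade without the current version.
	log.V(logutil.WarningLevel).Info("failed to get OpenShift version from ClusterVersion", "error", err.Error())
} else {
	oacp.Status.DistributionVersion = version
}
```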
```diff
@@ -151,6 +155,34 @@ func (u *OpenshiftUpgrader) UpdateClusterVersionDesiredUpdate(ctx context.Contex
 	return nil
 }
 
+func (u *OpenshiftUpgrader) VerifyUpgradedNodes(ctx context.Context, oacp *controlplanev1alpha2.OpenshiftAssistedControlPlane) error {
```
Ideally it would be, but we can't actually update the machines, so this is a work-around.
I mean that even if we want to run the workaround, I think we should keep the logic updating those fields together. We should make sure that part of reconciliation runs during upgrades, but I still don't think we should split the logic handling those fields.
```go
for _, node := range nodes.Items {
	// It's difficult to determine which OCP version is running on this Node, so we will
	// just ensure that the Node is ready
	for _, condition := range node.Status.Conditions {
```
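Putting the signature and the loop together, a plausible shape for the workaround; only the readiness loop is from the PR diff, while the client field on OpenshiftUpgrader, the node listing, and the error text are assumptions (with context, fmt, and corev1 "k8s.io/api/core/v1" imported):

```go
func (u *OpenshiftUpgrader) VerifyUpgradedNodes(ctx context.Context, oacp *controlplanev1alpha2.OpenshiftAssistedControlPlane) error {
	// List the workload cluster's nodes (assumed client field).
	var nodes corev1.NodeList
	if err := u.client.List(ctx, &nodes); err != nil {
		return err
	}
	for _, node := range nodes.Items {
		// It's difficult to determine which OCP version runs on a Node,
		// so only verify that every Node reports Ready.
		for _, condition := range node.Status.Conditions {
			if condition.Type == corev1.NodeReady && condition.Status != corev1.ConditionTrue {
				return fmt.Errorf("node %s is not ready after upgrade", node.Name)
			}
		}
	}
	return nil
}
```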
We might want to add this logic for ready replicas; however, this should not affect updated replicas.
UpdatedReplicas is a field that informs whether the specs are up to date or not.
Usually this would cover upgrades too, as one of the spec fields would be Version. However, we are using DistributionVersion, which is not (yet) on the machine spec.
This is a field that Sylva checks, and if it's required for them to detect that the upgrade has finished, then we should keep it unless Jianzhu says otherwise.
I agree that we should find a way; however, if the specs are the same and we are declaring the machine not "Updated", that could be a problem elsewhere.
See for example the MachineDeployment upgrade: we will not know from the core CAPI controller whether the upgrade is in progress, as the CAPI controller will not change the updated machines value (the specs do not technically change). Should we maintain the same logic as the core CAPI components and not update the updated machines counter when specs don't change, or should we try to reflect upgrades and change its value, even though it's not technically true that the machine is out of date?
Personally I believe that basing the check on a condition would have better semantics than hacking our way through updated replicas. Any thoughts on this @jianzzha?
Even though the machine is immutable, its status can change, right? I assume for an MD, if an owned machine's status.conditions.Ready changes to false during the upgrade, the MD status will be updated to reflect that, right? Same logic for a control plane machine?
Does the OpenShift upgrade provide the granularity to tell which node is being updated @CrystalChun?
I guess what I'm trying to say is that the agent CAPI provider controller is aware of the upgrade and should change the machine status, not the core CAPI provider.
The new MachinesUpToDate condition for MD and control plane in v1beta2 might work? https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md
For control plane Machines we can set the status. However, for worker Machines we should not: they are owned by a MachineSet, and the MachineSet reconciler will try to reconcile their status: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machinedeployment/machinedeployment_status.go#L69
We can and should add the UpToDate condition on the machines we control; however, please keep in mind its meaning: UpToDate is true if the Machine spec matches the spec of the Machine's owner resource, e.g. KubeadmControlPlane or MachineDeployment
(https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md#machine-newconditions).
This will not help in the case of upgrades.
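If we do add such a condition on the machines we own, CAPI's v1beta1 conditions helpers could do the stamping; a sketch where the condition type, reason, and message are placeholders:

```go
import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// Placeholder condition type; the v1beta2 proposal calls it UpToDate.
const machineUpToDateCondition clusterv1.ConditionType = "UpToDate"

// markMachineUpToDate sketches stamping the condition on a control plane
// Machine the provider owns.
func markMachineUpToDate(machine *clusterv1.Machine, upToDate bool) {
	if upToDate {
		conditions.MarkTrue(machine, machineUpToDateCondition)
		return
	}
	conditions.MarkFalse(machine, machineUpToDateCondition,
		"UpgradeInProgress", clusterv1.ConditionSeverityInfo,
		"in-place OpenShift upgrade is running")
}
```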
The UpToDate semantics seem to change with this in-place upgrade proposal: https://github.com/kubernetes-sigs/cluster-api/blob/ad3312b627195d43434946ae7557fd12985ed084/docs/proposals/20240807-in-place-updates.md#high-level-flow
It sets the machine's UpToDate condition to false when the upgrade begins and back to true when the upgrade is complete. I'm guessing this needs to work together with the pending-hooks: ExternalUpdate annotation to carry the "upgrade" semantics (a sketch combining the two follows this comment).
I think we need to consider whether we have a "CAPI compliance" path. Right now we might be in a non-compliant position for upgrades: the MD-owned machines are being upgraded, but the MD and machine controllers have no visibility into what's going on and therefore cannot update status. When distributionVersion is accepted upstream, I think the situation won't change, because we still cannot have distributionVersion under the MD; otherwise changing distributionVersion under the MD would trigger the MD to execute a rolling upgrade, right?
For the OpenShift in-place upgrade to be "CAPI compliant", I guess we will need an "OpenShift-friendly in-place upgrade scheme" accepted upstream.
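A rough sketch of the proposal's flow, building on markMachineUpToDate from the sketch above; the annotation key and value are taken from the comment, not a confirmed final API:

```go
// Assumed annotation key/value from the in-place updates proposal.
const pendingHooksAnnotation = "runtime.cluster.x-k8s.io/pending-hooks"

// beginInPlaceUpdate records the pending external update and marks the
// Machine not up to date while the upgrade runs.
func beginInPlaceUpdate(machine *clusterv1.Machine) {
	if machine.Annotations == nil {
		machine.Annotations = map[string]string{}
	}
	machine.Annotations[pendingHooksAnnotation] = "ExternalUpdate"
	markMachineUpToDate(machine, false)
}

// finishInPlaceUpdate clears the pending hook and marks the Machine up
// to date again once the upgrade completes.
func finishInPlaceUpdate(machine *clusterv1.Machine) {
	delete(machine.Annotations, pendingHooksAnnotation)
	markMachineUpToDate(machine, true)
}
```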
> The UpToDate semantics seems to change with this inplace upgrade proposal: https://github.com/kubernetes-sigs/cluster-api/blob/ad3312b627195d43434946ae7557fd12985ed084/docs/proposals/20240807-in-place-updates.md#high-level-flow

Are you suggesting we should already anticipate that? I think we might get around the issue of the owner changing the status back by pausing the machines for the upgrade duration (see the sketch after this comment). I think we can explore this path.

> Right now we might be in a non-compliance position for upgrade

Yes, until both distributionVersion and in-place updates are implemented, we are not 100% complying with the CAPI contract/way of doing things.

> I think the situation won't change because we still can not have the distributionversion under MD, otherwise changing distributionversion under MD would trigger MD to execute rolling upgrade, right?

We cannot have distributionVersion under the MD because we do not control that API at the moment. And yes, even if we could, changing such a field without in-place upgrade support would trigger a rolling upgrade.
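If the pausing idea is explored, CAPI's standard paused annotation is the natural lever; a sketch assuming a controller-runtime client:

```go
import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setMachinePaused pauses or unpauses a Machine for the upgrade window,
// so its owning controller stops reconciling it in the meantime.
func setMachinePaused(ctx context.Context, c client.Client, machine *clusterv1.Machine, paused bool) error {
	patch := client.MergeFrom(machine.DeepCopy())
	if paused {
		if machine.Annotations == nil {
			machine.Annotations = map[string]string{}
		}
		// clusterv1.PausedAnnotation is "cluster.x-k8s.io/paused".
		machine.Annotations[clusterv1.PausedAnnotation] = "true"
	} else {
		delete(machine.Annotations, clusterv1.PausedAnnotation)
	}
	return c.Patch(ctx, machine, patch)
}
```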
As a generic remark, we should understand why this check is performed: is it making sure we are holding stuff during an upgrade? Is it just for validation? In any case, I believe this should be reflected either in "UpdatedMachines" or in some other condition.
This is a field that Sylva checks to detect that an upgrade has started and that it has finished (once set back to true). Should we have Sylva change their checks?
On upgrades (especially rolling upgrades) it can happen that a node is restarting some components and the cluster control plane becomes unavailable. Checking ready makes sense; however, we should report the real status: if the control plane is ready, we should report ready. IMO we should not fake the ready status to report an upgrade in progress.
PR needs rebase.
Riccardo led a discussion offline with Jianzhu and me. Jianzhu will ask Sylva whether they're OK with creating a separate check unit for OpenShift upgrades, which would verify that the control plane's upgrade-completed condition is true to signal a completed upgrade. We'll wait for that decision; if they don't accept it, we'll need to pivot and either continue down this hacking path or find a different solution that satisfies both the CAPI standards and Sylva's upgrade expectations.
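If Sylva accepts, their check unit could reduce to reading a single condition off the control plane object; a sketch using apimachinery's condition helpers, with the condition name a placeholder pending that decision:

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// upgradeCompleted reports whether an upgrade-completed style condition
// is true on the control plane status. "UpgradeCompleted" is a placeholder.
func upgradeCompleted(conds []metav1.Condition) bool {
	c := meta.FindStatusCondition(conds, "UpgradeCompleted")
	return c != nil && c.Status == metav1.ConditionTrue
}
```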
The Sylva discussion: https://gitlab.com/sylva-projects/sylva-core/-/issues/2191#note_2390882900. Please check and make sure what I said regarding the "status" in that thread accurately reflects what we discussed. I also have the concern that OpenShift behaves differently than other k8s distros used in Sylva; I understand the reasoning behind it, but we do need to avoid that as much as possible.
@CrystalChun: The following tests failed.
@CrystalChun can this be closed?