MGMT-19878: Update control plane status for upgrade #90
base: master
Conversation
Modifies status.ready to false when the upgrade starts and changes it back to true when the upgrade completes. Sets updated replicas to the number of control plane nodes that are ready once the upgrade completes. These statuses are checked by Sylva when running an upgrade.
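A minimal sketch of that flow, using a stand-in struct since the real status type lives in the provider's v1alpha2 API (the field names are taken from the description; everything else is illustrative):

```go
package main

import "fmt"

// Stand-in for the control plane status fields discussed in this PR; the
// real fields live on the v1alpha2 OpenshiftAssistedControlPlane type.
type controlPlaneStatus struct {
	Ready           bool
	UpdatedReplicas int32
}

// setUpgradeStatus mirrors the described behavior: Ready flips to false
// while the upgrade runs, then back to true with UpdatedReplicas set to
// the number of control plane nodes that are ready afterwards.
func setUpgradeStatus(status *controlPlaneStatus, upgradeInProgress bool, readyNodes int32) {
	if upgradeInProgress {
		status.Ready = false // Sylva reads this to detect that the upgrade started
		return
	}
	status.Ready = true // ...and this to detect that it finished
	status.UpdatedReplicas = readyNodes
}

func main() {
	s := controlPlaneStatus{Ready: true, UpdatedReplicas: 3}
	setUpgradeStatus(&s, true, 0)
	fmt.Printf("during upgrade: %+v\n", s)
	setUpgradeStatus(&s, false, 3)
	fmt.Printf("after upgrade:  %+v\n", s)
}
```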
@CrystalChun: This pull request references MGMT-19878, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: expected the sub-task to target the "4.19.0" version, but no target version was set.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: CrystalChun.
Do we really want to set the ready status to false? The ready status should inform about the control plane's readiness to receive requests, which should be unaffected during upgrades (and if it is affected, it shouldn't be solely because an upgrade is running, but because some component is down).
```diff
@@ -246,6 +247,7 @@ func (r *OpenshiftAssistedControlPlaneReconciler) upgradeWorkloadCluster(ctx con
 	oacp.Status.DistributionVersion, err = upgrader.GetCurrentVersion(ctx)
 	if err != nil {
 		log.V(logutil.WarningLevel).Info("failed to get OpenShift version from ClusterVersion", "error", err.Error())
+		return ctrl.Result{}, err
```
Why return here? Even without the current version we should be able to perform an upgrade.
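A sketch of the non-fatal handling this comment suggests, reusing the identifiers from the diff above rather than a self-contained program:

```go
// Reviewer's suggestion: log the failure but keep going, since the
// upgrade itself does not need the current version to proceed.
version, err := upgrader.GetCurrentVersion(ctx)
if err != nil {
	// Non-fatal: continue the upgrade without the current version.
	log.V(logutil.WarningLevel).Info("failed to get OpenShift version from ClusterVersion", "error", err.Error())
} else {
	oacp.Status.DistributionVersion = version
}
```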
```diff
@@ -151,6 +155,34 @@ func (u *OpenshiftUpgrader) UpdateClusterVersionDesiredUpdate(ctx context.Contex
 	return nil
 }
 
+func (u *OpenshiftUpgrader) VerifyUpgradedNodes(ctx context.Context, oacp *controlplanev1alpha2.OpenshiftAssistedControlPlane) error {
```
Ideally it would be, but we can't actually update the machines, so this is a work-around.
I mean that even if we want to run the workaround, I think we should keep the logic updating those fields together. We should make sure that part of reconciliation runs during upgrades, but I still don't think we should split the logic handling those fields.
```go
for _, node := range nodes.Items {
	// It's difficult to determine which OCP version is running on this Node, so we will
	// just ensure that the Node is ready
	for _, condition := range node.Status.Conditions {
```
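Putting the signature and the loop together, a plausible shape for the workaround; only the readiness loop is from the PR diff, while the client field on OpenshiftUpgrader, the node listing, and the error text are assumptions (with context, fmt, and corev1 "k8s.io/api/core/v1" imported):

```go
func (u *OpenshiftUpgrader) VerifyUpgradedNodes(ctx context.Context, oacp *controlplanev1alpha2.OpenshiftAssistedControlPlane) error {
	// List the workload cluster's nodes (assumed client field).
	var nodes corev1.NodeList
	if err := u.client.List(ctx, &nodes); err != nil {
		return err
	}
	for _, node := range nodes.Items {
		// It's difficult to determine which OCP version runs on a Node,
		// so only verify that every Node reports Ready.
		for _, condition := range node.Status.Conditions {
			if condition.Type == corev1.NodeReady && condition.Status != corev1.ConditionTrue {
				return fmt.Errorf("node %s is not ready after upgrade", node.Name)
			}
		}
	}
	return nil
}
```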
We might want to add this logic for ready replicas; however, this should not affect updated replicas.
UpdatedReplicas is a field that informs whether the specs are up to date or not.
Usually this would cover upgrades too, as one of the spec fields would be Version. However, we are using DistributionVersion, which is not (yet) on the machine spec.
This is a field that Sylva checks, and if it's required for them to detect that the upgrade has finished, then we should keep it unless Jianzhu says otherwise.
I agree that we should find a way; however, if the specs are the same and we are declaring the machine not "Updated", that could be a problem elsewhere.
See for example the MachineDeployment upgrade: we will not know from the core CAPI controller whether the upgrade is in progress, as the CAPI controller will not change the updated machines value (the specs do not technically change). Should we maintain the same logic as the core CAPI components and not update the updated machines counter when specs don't change, or should we try to reflect upgrades and change its value, even though it's not technically true that the machine is out of date?
Personally I believe that basing the check on a condition would have better semantics than hacking our way through updated replicas. Any thoughts on this @jianzzha?
Even though the machine is immutable, its status can change, right? I assume for an MD, if an owned machine's status.conditions.Ready changes to false during the upgrade, the MD status will be updated to reflect that, right? Same logic for a control plane machine?
Does the OpenShift upgrade provide the granularity to tell which node is being updated @CrystalChun?
I guess what I'm trying to say is that the agent CAPI provider controller is aware of the upgrade and should change the machine status, not the core CAPI provider.
The new MachinesUpToDate condition for MD and control plane in v1beta2 might work? https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md
For control plane Machines we can set the status. However, for worker Machines we should not: they are owned by a MachineSet, and the MachineSet reconciler will try to reconcile their status: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machinedeployment/machinedeployment_status.go#L69
We can and should add the UpToDate condition on the machines we control; however, please keep in mind its meaning: UpToDate is true if the Machine spec matches the spec of the Machine's owner resource, e.g. KubeadmControlPlane or MachineDeployment
(https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md#machine-newconditions).
This will not help in the case of upgrades.
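If we do add such a condition on the machines we own, CAPI's v1beta1 conditions helpers could do the stamping; a sketch where the condition type, reason, and message are placeholders:

```go
import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// Placeholder condition type; the v1beta2 proposal calls it UpToDate.
const machineUpToDateCondition clusterv1.ConditionType = "UpToDate"

// markMachineUpToDate sketches stamping the condition on a control plane
// Machine the provider owns.
func markMachineUpToDate(machine *clusterv1.Machine, upToDate bool) {
	if upToDate {
		conditions.MarkTrue(machine, machineUpToDateCondition)
		return
	}
	conditions.MarkFalse(machine, machineUpToDateCondition,
		"UpgradeInProgress", clusterv1.ConditionSeverityInfo,
		"in-place OpenShift upgrade is running")
}
```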
The UpToDate semantics seem to change with this in-place upgrade proposal: https://github.com/kubernetes-sigs/cluster-api/blob/ad3312b627195d43434946ae7557fd12985ed084/docs/proposals/20240807-in-place-updates.md#high-level-flow
It sets the machine's UpToDate condition to false when the upgrade begins and back to true when the upgrade is complete. I'm guessing this needs to work together with the pending-hooks: ExternalUpdate annotation to carry the "upgrade" semantics (a sketch combining the two follows this comment).
I think we need to consider whether we have a "CAPI compliance" path. Right now we might be in a non-compliant position for upgrades: the MD-owned machines are being upgraded, but the MD and machine controllers have no visibility into what's going on and therefore cannot update status. When distributionVersion is accepted upstream, I think the situation won't change, because we still cannot have distributionVersion under the MD; otherwise changing distributionVersion under the MD would trigger the MD to execute a rolling upgrade, right?
For the OpenShift in-place upgrade to be "CAPI compliant", I guess we will need an "OpenShift-friendly in-place upgrade scheme" accepted upstream.
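A rough sketch of the proposal's flow, building on markMachineUpToDate from the sketch above; the annotation key and value are taken from the comment, not a confirmed final API:

```go
// Assumed annotation key/value from the in-place updates proposal.
const pendingHooksAnnotation = "runtime.cluster.x-k8s.io/pending-hooks"

// beginInPlaceUpdate records the pending external update and marks the
// Machine not up to date while the upgrade runs.
func beginInPlaceUpdate(machine *clusterv1.Machine) {
	if machine.Annotations == nil {
		machine.Annotations = map[string]string{}
	}
	machine.Annotations[pendingHooksAnnotation] = "ExternalUpdate"
	markMachineUpToDate(machine, false)
}

// finishInPlaceUpdate clears the pending hook and marks the Machine up
// to date again once the upgrade completes.
func finishInPlaceUpdate(machine *clusterv1.Machine) {
	delete(machine.Annotations, pendingHooksAnnotation)
	markMachineUpToDate(machine, true)
}
```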
> The UpToDate semantics seems to change with this inplace upgrade proposal: https://github.com/kubernetes-sigs/cluster-api/blob/ad3312b627195d43434946ae7557fd12985ed084/docs/proposals/20240807-in-place-updates.md#high-level-flow

Are you suggesting we should already anticipate that? I think we might get around the issue of the owner changing the status back by pausing the machines for the upgrade duration (see the sketch after this comment). I think we can explore this path.

> Right now we might be in a non-compliance position for upgrade

Yes, until both distributionVersion and in-place updates are implemented, we are not 100% complying with the CAPI contract/way of doing things.

> I think the situation won't change because we still can not have the distributionversion under MD, otherwise changing distributionversion under MD would trigger MD to execute rolling upgrade, right?

We cannot have distributionVersion under the MD because we do not control that API at the moment. And yes, even if we could, changing such a field without in-place upgrade support would trigger a rolling upgrade.
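If the pausing idea is explored, CAPI's standard paused annotation is the natural lever; a sketch assuming a controller-runtime client:

```go
import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setMachinePaused pauses or unpauses a Machine for the upgrade window,
// so its owning controller stops reconciling it in the meantime.
func setMachinePaused(ctx context.Context, c client.Client, machine *clusterv1.Machine, paused bool) error {
	patch := client.MergeFrom(machine.DeepCopy())
	if paused {
		if machine.Annotations == nil {
			machine.Annotations = map[string]string{}
		}
		// clusterv1.PausedAnnotation is "cluster.x-k8s.io/paused".
		machine.Annotations[clusterv1.PausedAnnotation] = "true"
	} else {
		delete(machine.Annotations, clusterv1.PausedAnnotation)
	}
	return c.Patch(ctx, machine, patch)
}
```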
As a generic remark, we should understand why this check is performed: is it making sure we are holding stuff during an upgrade? Is it just for validation? In any case, I believe this should be reflected either in "UpdatedMachines" or in some other condition.
This is a field that Sylva checks to detect that an upgrade has started and that it has finished (once set back to true). Should we have Sylva change their checks?
On upgrades (especially rolling upgrades) it can happen that a node is restarting some components and the cluster control plane becomes unavailable. Checking ready makes sense; however, we should report the real status: if the control plane is ready, we should report ready. IMO we should not fake the ready status to report an upgrade in progress.
PR needs rebase.
Riccardo led a discussion offline with Jianzhu and me. Jianzhu will ask Sylva whether they're OK with creating a separate check unit for OpenShift upgrades, which would verify that the control plane's upgrade-completed condition is true to signal a completed upgrade. We'll wait for that decision; if they don't accept it, we'll need to pivot and either continue down this hacking path or find a different solution that satisfies both the CAPI standards and Sylva's upgrade expectations.
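If Sylva accepts, their check unit could reduce to reading a single condition off the control plane object; a sketch using apimachinery's condition helpers, with the condition name a placeholder pending that decision:

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// upgradeCompleted reports whether an upgrade-completed style condition
// is true on the control plane status. "UpgradeCompleted" is a placeholder.
func upgradeCompleted(conds []metav1.Condition) bool {
	c := meta.FindStatusCondition(conds, "UpgradeCompleted")
	return c != nil && c.Status == metav1.ConditionTrue
}
```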
The Sylva discussion: https://gitlab.com/sylva-projects/sylva-core/-/issues/2191#note_2390882900. Please check and make sure what I said regarding the "status" in that thread accurately reflects what we discussed. I also have the concern that OpenShift behaves differently than other k8s distros used in Sylva; I understand the reasoning behind it, but we do need to avoid that as much as possible.
@CrystalChun: The following tests failed.
@CrystalChun can this be closed?