Skip to content

Adding progress updates for Cruise Control rebalances #11348

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

kyguy
Copy link
Member

@kyguy kyguy commented Apr 11, 2025

Type of change

  • Enhancement / new feature

Description

This PR introduces a new feature to monitor the progression of an ongoing partition rebalance executed by a Strimzi-managed Cruise Control instance via a KafkaRebalance custom resource.

Implementation of this proposal: strimzi/proposals#140

Addresses the issue raised here: #10278

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@kyguy kyguy requested review from ppatierno and tomncooper April 11, 2025 15:31
@kyguy kyguy force-pushed the kr-execution-status branch from 249050d to 50954e0 Compare April 11, 2025 20:15
@kyguy kyguy added this to the 0.46.0 milestone Apr 11, 2025
@kyguy kyguy force-pushed the kr-execution-status branch 5 times, most recently from 042a4b6 to 95325db Compare April 14, 2025 00:43
@kyguy kyguy changed the title WIP: Adding progress updates for Cruise Control rebalances Adding progress updates for Cruise Control rebalances Apr 14, 2025
Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a first pass leaving some comments after our offline call showing the code. I will come back to this by checking the tests as well. It's a good start I guess we are in the right direction.

@ppatierno ppatierno modified the milestones: 0.46.0, 0.47.0 Apr 15, 2025
Copy link
Contributor

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a first pass. My main comment is how you are handling time. You should really be using the tools java.time gives you and only dealing with longs/ints at the very end.

@kyguy kyguy force-pushed the kr-execution-status branch 3 times, most recently from c51fa82 to 6291bd3 Compare April 23, 2025 21:17
@kyguy
Copy link
Member Author

kyguy commented Apr 28, 2025

Update: Built and deployed on Minikube to test. There were a couple of new bugs that surfaced during the Rebalancing KafkaRebalance state. Currently investigating.

(1) The REST API call for fetching the executor state during the Rebalancing state can push Cruise Control over it's default max.active.user.tasks limit of 5. Not sure what other user tasks are in the queue when this call is made other than the /rebalance user task and the periodic health check user tasks which are removed from queue around ~5 seconds. Currently investigating to see what other tasks are in the queue when this happens to see if they are necessary or can be avoided. I would prefer to not increase the max.active.user.tasks limit.

(2) The first REST API call for fetching the executor state once a rebalance has just started hits Cruise Control before it has any of the movement information required for estimating the rate of data movement. Not sure if it would be better if we (a) placed some artificial delay on this first REST API call or (b) update the KafkaRebalance with a Warning condition that that some of the progress fields cannot be calculated yet. Then, users can wait for the next API call/reconciliation for this progress information.

@ppatierno
Copy link
Member

(2) The first REST API call for fetching the executor state once a rebalance has just started hits Cruise Control before it has any of the movement information required for estimating the rate of data movement.

I was wondering what CC is returning to you in this case and how is it possible? I mean, we asked CC to run a proposal it provided to us already, so it should know what are the movements to apply, or?

@tomncooper
Copy link
Contributor

(2) The first REST API call for fetching the executor state once a rebalance has just started hits Cruise Control before it has any of the movement information required for estimating the rate of data movement.

I was wondering what CC is returning to you in this case and how is it possible? I mean, we asked CC to run a proposal it provided to us already, so it should know what are the movements to apply, or?

Not necessarily. When you run a rebalance it could take the stored proposal (which could be the same as the proposal we presented to the user) or it might (if the cluster has changed significantly) calculate a brand new rebalance. So there could be significant work CC has to do once you trigger the rebalance.

Copy link
Contributor

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had another pass. Nice to see you using the java.time classes to deal with time and durations. However, you are using ZonedDateTime objects and creating current time instances using the system's local time zone. This is a big no-no. All times should be in UTC (zero offset / Zulu / Miliary timezone). The reason is that the Timezone on the OS running Strimzi could be different to the one CC is using, so it is just safer to do everything in UTC.

If the example timestamp you use in your tests is representative, then the CC timestamps are all in UTC (you can tell because they end in a Z - Zulu). You should double check that on the CC side and if true switch to using UTC everywhere.

I also think you need to decide if a failure to calculate the progress update is worth failing the whole KafkaRebalance reconciliation. Personally I think not, you should just log warnings and try again next time in most cases.

Finally, on the ConditionType enum I think we either need to use it everywhere and integrate it into the other Condition logic (probably should be in another PR) or just use a variable within your new class.

@kyguy kyguy force-pushed the kr-execution-status branch 4 times, most recently from 9a0824f to 3b0a363 Compare May 5, 2025 20:10
@kyguy kyguy self-assigned this May 8, 2025
@kyguy
Copy link
Member Author

kyguy commented May 8, 2025

The first REST API call for fetching the executor state once a rebalance has just started hits Cruise Control before it has any of the movement information required for estimating the rate of data movement.

I was wondering what CC is returning to you in this case and how is it possible? I mean, we asked CC to run a proposal it provided to us already, so it should know what are the movements to apply, or?

Since the request to fetch the executor state is made immediately after triggering the partition rebalance, Kafka has likely not had enough time to start moving partitions. As a result, key fields like finishedDataMovement in the Cruise Control executor state response will most likely be "0":

{
  "abortingPartitions": 0,
  "averageConcurrentPartitionMovementsPerBroker": 5,
  "finishedDataMovement": 0,
  "maximumConcurrentPartitionMovementsPerBroker": 5,
  "minimumConcurrentPartitionMovementsPerBroker": 5,
  "numFinishedPartitionMovements": 0,
  "numInProgressPartitionMovements": 0,
  "numPendingPartitionMovements": 20,
  "numTotalPartitionMovements": 20,
  "state": "INTER_BROKER_REPLICA_MOVEMENT_TASK_IN_PROGRESS",
  "totalDataToMove": 0,
  "triggeredSelfHealingTaskId": "",
  "triggeredTaskReason": "No reason provided (Client: 172.17.0.1, Date: 2025-05-08T19:41:27Z)",
  "triggeredUserTaskId": "0230d401-6a36-430e-9858-fac8f2edde93"
}

This behavior is expected. To accurately estimate progress fields like estimatedTimeToCompletionInMinutes, which are dependent on metrics like average movement rate (which relies on fields like finishedDataMovement to have non-zero values), we need to allow some time after rebalance is triggered for partition movement to begin and for Cruise Control to accumulate meaningful movement data.

To make this more transparent, I have added some logic to create a warning condition in the KafkaRebalance resource and log: "Partition movement information unavailable; executor is in <_EXECUTOR_STATE_> state, progress estimation will be updated on next reconciliation." whenever we request progress fields and finishedDataMovement is zero or the executor state is in a non-rebalancing state.

@kyguy
Copy link
Member Author

kyguy commented May 8, 2025

Most (if not all) of the initial comments should be addressed, please have another pass when you get the chance @ppatierno @tomncooper. In the meantime, I am going to take this PR out of "Draft" mode and request reviews from other stakeholders and community members.

@kyguy kyguy marked this pull request as ready for review May 8, 2025 23:05
Copy link
Contributor

@mimaison mimaison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I can't really judge the overall logic but I made a pass from a pure Java point of view.

Copy link
Contributor

@tinaselenge tinaselenge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kyguy. Overall it looks to me, I spotted just a few nits that others have also commented.

@kyguy kyguy force-pushed the kr-execution-status branch from 6e710a5 to 7d1fea5 Compare May 14, 2025 16:46
@kyguy
Copy link
Member Author

kyguy commented May 19, 2025

PR has been updated to address the feedback of all reviewers, @mimaison, @tomncooper, @scholzj, @ppatierno, @see-quick, @tinaselenge could you have another pass when you get a chance?

Once it looks good to you I'll squash and rebase the commits on main.

Copy link
Contributor

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my previous comments. There are just a few nits for you to look at then I will be happy to approve.

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had another pass but no more comments from me. Looking forward to address Tom's comments I agree with before approving ;-)

@kyguy kyguy force-pushed the kr-execution-status branch 2 times, most recently from e27587d to 8eb58cf Compare May 29, 2025 16:34
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
@kyguy kyguy force-pushed the kr-execution-status branch from 8eb58cf to 2ec3dc2 Compare May 29, 2025 20:47
@ppatierno
Copy link
Member

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the regression tests pass this LGTM!

Copy link
Contributor

@tinaselenge tinaselenge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kyguy . Looks good to me overall with one minor comment.

* @param status The {@link KafkaRebalanceStatus} object whose conditions will be updated.
* @param exception The {@link Throwable} containing the reason and message for the Warning condition.
*/
public static void addWarningCondition(KafkaRebalanceStatus status, Throwable exception) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you use the existing method from StatusUtils?

Also, do we really want to add the condition for the error? I know it is in the proposal. But it seems like something what can easily lead to tight loops. I think log warning would be safer and sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you use the existing method from StatusUtils?

This method has a little extra logic that is ensures that the first occurrence of a Warning is not overwritten by another Warning with the same reason and message. This way users can more easily trace when an ongoing issue first occurred instead of when it last occurred.

Also, do we really want to add the condition for the error? I know it is in the proposal. But it seems like something what can easily lead to tight loops. I think log warning would be safer and sufficient.

The inclusion of the condition for errors was asked for by one or some of the proposal reviewers for better UX. With the way it is written, it shouldn't be causing tight reconciliation loops. That being said, it wouldn't be hard to strip it out now or later if it is seen as unnecessarily risky.

Should I raise it with the reviewers on the community channel?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method has a little extra logic that is ensures that the first occurrence of a Warning is not overwritten by another Warning with the same reason and message. This way users can more easily trace when an ongoing issue first occurred instead of when it last occurred.

Why do you think the methods in StatusUtils override it? Or do you expect multiple identical warnings there? Why do you want them multiple times?

If the CC does not work, failing seems like a better idea than trying to continue with some warning conditions and uncertainty about what is really going on. That is how the operator behaves in general.

The inclusion of the condition for errors was asked for by one or some of the proposal reviewers for better UX. With the way it is written, it shouldn't be causing tight reconciliation loops. That being said, it wouldn't be hard to strip it out now or later if it is seen as unnecessarily risky.

So, when exactly does it happen? And why do you expect errors affecting only the progress and not the whole rebalance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think the methods in StatusUtils override it?

What I mean to say is that we want to keep the timestamp of the Warning of the first occurrence of an error. If an error with the same message and reason happens on the next reconciliation, we want to make sure we keep the Warning that has the same message and reason from the last reconciliation.

Or do you expect multiple identical warnings there? Why do you want them multiple times?

No no, we just want one Warning here.

If the CC does not work, failing seems like a better idea than trying to continue with some warning conditions and uncertainty about what is really going on. That is how the operator behaves in general.

IIRC from the proposal discussions, the concern was that we didn't want a temporary issue of not being able to access the /kafkacruisecontrol/state?substates=executor endpoint for the progress information to interrupt/break an ongoing rebalance.

So, when exactly does it happen? And why do you expect errors affecting only the progress and not the whole rebalance?

If the Cruise Control REST API were to return an error or fail to respond to the operator when querying the /kafkacruisecontrol/state?substates=executor endpoint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think the methods in StatusUtils override it?

What I mean to say is that we want to keep the timestamp of the Warning of the first occurrence of an error. If an error with the same message and reason happens on the next reconciliation, we want to make sure we keep the Warning that has the same message and reason from the last reconciliation.

That should be done when updating the status during the status diffing. See the StatusDiff class. If the only difference is the transition timestamp, the resource is not updated. That is what should be used by the KafkaRebalanceAssemblyOperator and you should not need any special diffing. Assuming you want to keep the timestamp between reconciliations.

If you need to keep the timestamp first timestamp from warnings that happened within the same reconciliation, you are doing something wrong. You should either have two different warnings because they indicate something else or you should not set the first warning if you for example retry and set the warning only if it fails after all retries.

If the CC does not work, failing seems like a better idea than trying to continue with some warning conditions and uncertainty about what is really going on. That is how the operator behaves in general.

IIRC from the proposal discussions, the concern was that we didn't want a temporary issue of not being able to access the /kafkacruisecontrol/state?substates=executor endpoint for the progress information to interrupt/break an ongoing rebalance.

So, when exactly does it happen? And why do you expect errors affecting only the progress and not the whole rebalance?

If the Cruise Control REST API were to return an error or fail to respond to the operator when querying the /kafkacruisecontrol/state?substates=executor endpoint.

But isn't that an indication of some deeper issues? Or why do you expect that the whole CC will work fine and just this endpoint will be failing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the StatusDiff class. If the only difference is the transition timestamp, the resource is not updated.

Ah I see, I misunderstood the functionality. I assumed this code would update the resource with a new timestamp. If it doesn't then I can just reuse this code.

That is what should be used by the KafkaRebalanceAssemblyOperator and you should not need any special diffing. Assuming you want to keep the timestamp between reconciliations.

Yes, this is exactly what I need. Thank you for pointing this out.

But isn't that an indication of some deeper issues? Or why do you expect that the whole CC will work fine and just this endpoint will be failing?

Forgive me, I was mistaken with my reasoning, I think the original reasoning was not about the server failing to serve requests but about the request not reaching the endpoint due to a transient network issue.

Of course, if Cruise Control is failing to serve requests for this endpoint it will fail for the other endpoints and stop the KafkaRebalance anyway.

Copy link
Member Author

@kyguy kyguy Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what should be used by the KafkaRebalanceAssemblyOperator and you should not need any special diffing. Assuming you want to keep the timestamp between reconciliations.

Yes, this is exactly what I need. Thank you for pointing this out.

Just fixed this in the latest commit.

If the CC does not work, failing seems like a better idea than trying to continue with some warning conditions and uncertainty about what is really going on. That is how the operator behaves in general.

IIRC from the proposal discussions, the concern was that we didn't want a temporary issue of not being able to access the /kafkacruisecontrol/state?substates=executor endpoint for the progress information to interrupt/break an ongoing rebalance.

So, when exactly does it happen? And why do you expect errors affecting only the progress and not the whole rebalance?

If the Cruise Control REST API were to return an error or fail to respond to the operator when querying the /kafkacruisecontrol/state?substates=executor endpoint.

But isn't that an indication of some deeper issues? Or why do you expect that the whole CC will work fine and just this endpoint will be failing?

Forgive me, I was mistaken with my reasoning, I think the original reasoning was not about the server failing to serve requests but about the request not reaching the endpoint due to a transient network issue.

Of course, if Cruise Control is failing to serve requests for this endpoint it will fail for the other endpoints and stop the KafkaRebalance anyway.

Additionally, the Warning condition is used to indicate when some progress estimation fields cannot be calculated e.g, when a rebalance has just started and no data has been moved yet, making it impossible to calculate the data movement rate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, the Warning condition is used to indicate when some progress estimation fields cannot be calculated e.g, when a rebalance has just started and no data has been moved yet, making it impossible to calculate the data movement rate.

That does not seem like it deserves a warning because it is an expected situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I do not see any blockers in the PR. But it should be reviewed by Paolo and Kate.

kyguy added 2 commits May 30, 2025 18:01
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
@kyguy
Copy link
Member Author

kyguy commented Jun 2, 2025

@ppatierno @katheris Could you have another pass when you get a chance?

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @kyguy!

Copy link
Contributor

@katheris katheris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments with nits around context in comments, indentation in the test classes and an info message that could be a warning. But happy to approve as otherwise the changes look good.

Signed-off-by: Kyle Liberti <kliberti@redhat.com>
@kyguy
Copy link
Member Author

kyguy commented Jun 5, 2025

Thanks for the final reviews @katheris @ppatierno! Could you run the regressions tests one last time?

@ppatierno
Copy link
Member

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants