-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Adding progress updates for Cruise Control rebalances #11348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
249050d
to
50954e0
Compare
042a4b6
to
95325db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a first pass leaving some comments after our offline call showing the code. I will come back to this by checking the tests as well. It's a good start I guess we are in the right direction.
api/src/main/java/io/strimzi/api/kafka/model/rebalance/KafkaRebalanceStatus.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtils.java
Outdated
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtils.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a first pass. My main comment is how you are handling time. You should really be using the tools java.time
gives you and only dealing with longs/ints at the very end.
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...perator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceUtils.java
Outdated
Show resolved
Hide resolved
...perator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceUtils.java
Outdated
Show resolved
Hide resolved
...perator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceUtils.java
Outdated
Show resolved
Hide resolved
...perator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceUtils.java
Outdated
Show resolved
Hide resolved
...perator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceUtils.java
Outdated
Show resolved
Hide resolved
c51fa82
to
6291bd3
Compare
Update: Built and deployed on Minikube to test. There were a couple of new bugs that surfaced during the (1) The REST API call for fetching the executor state during the (2) The first REST API call for fetching the executor state once a rebalance has just started hits Cruise Control before it has any of the movement information required for estimating the rate of data movement. Not sure if it would be better if we (a) placed some artificial delay on this first REST API call or (b) update the |
I was wondering what CC is returning to you in this case and how is it possible? I mean, we asked CC to run a proposal it provided to us already, so it should know what are the movements to apply, or? |
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Show resolved
Hide resolved
Not necessarily. When you run a rebalance it could take the stored proposal (which could be the same as the proposal we presented to the user) or it might (if the cluster has changed significantly) calculate a brand new rebalance. So there could be significant work CC has to do once you trigger the rebalance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have had another pass. Nice to see you using the java.time
classes to deal with time and durations. However, you are using ZonedDateTime
objects and creating current time instances using the system's local time zone. This is a big no-no. All times should be in UTC (zero offset / Zulu / Miliary timezone). The reason is that the Timezone on the OS running Strimzi could be different to the one CC is using, so it is just safer to do everything in UTC.
If the example timestamp you use in your tests is representative, then the CC timestamps are all in UTC (you can tell because they end in a Z
- Zulu). You should double check that on the CC side and if true switch to using UTC everywhere.
I also think you need to decide if a failure to calculate the progress update is worth failing the whole KafkaRebalance
reconciliation. Personally I think not, you should just log warnings and try again next time in most cases.
Finally, on the ConditionType
enum I think we either need to use it everywhere and integrate it into the other Condition
logic (probably should be in another PR) or just use a variable within your new class.
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
...or/src/main/java/io/strimzi/operator/cluster/model/cruisecontrol/ExecutorStateProcessor.java
Outdated
Show resolved
Hide resolved
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Outdated
Show resolved
Hide resolved
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Outdated
Show resolved
Hide resolved
...strimzi/operator/cluster/operator/resource/cruisecontrol/CruiseControlUserTasksResponse.java
Outdated
Show resolved
Hide resolved
...est/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceConfigMapUtilsTest.java
Outdated
Show resolved
Hide resolved
...tor/src/test/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceUtilsTest.java
Outdated
Show resolved
Hide resolved
...tor/src/test/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceUtilsTest.java
Outdated
Show resolved
Hide resolved
...tor/src/test/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceUtilsTest.java
Outdated
Show resolved
Hide resolved
...test/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/MockCruiseControl.java
Outdated
Show resolved
Hide resolved
9a0824f
to
3b0a363
Compare
Since the request to fetch the executor state is made immediately after triggering the partition rebalance, Kafka has likely not had enough time to start moving partitions. As a result, key fields like
This behavior is expected. To accurately estimate progress fields like To make this more transparent, I have added some logic to create a warning condition in the |
Most (if not all) of the initial comments should be addressed, please have another pass when you get the chance @ppatierno @tomncooper. In the meantime, I am going to take this PR out of "Draft" mode and request reviews from other stakeholders and community members. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I can't really judge the overall logic but I made a pass from a pure Java point of view.
...rc/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceConfigMapUtils.java
Outdated
Show resolved
Hide resolved
...rc/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceConfigMapUtils.java
Outdated
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtils.java
Outdated
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtils.java
Outdated
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtils.java
Outdated
Show resolved
Hide resolved
...est/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceConfigMapUtilsTest.java
Outdated
Show resolved
Hide resolved
...test/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtilsTest.java
Outdated
Show resolved
Hide resolved
...est/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/ExecutorStatusTest.java
Outdated
Show resolved
Hide resolved
...est/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/ExecutorStatusTest.java
Outdated
Show resolved
Hide resolved
...est/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/ExecutorStatusTest.java
Outdated
Show resolved
Hide resolved
api/src/main/java/io/strimzi/api/kafka/model/rebalance/KafkaRebalanceStatus.java
Outdated
Show resolved
Hide resolved
...test/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/MockCruiseControl.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kyguy. Overall it looks to me, I spotted just a few nits that others have also commented.
...o/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperatorProgressTest.java
Outdated
Show resolved
Hide resolved
...est/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceConfigMapUtilsTest.java
Outdated
Show resolved
Hide resolved
6e710a5
to
7d1fea5
Compare
PR has been updated to address the feedback of all reviewers, @mimaison, @tomncooper, @scholzj, @ppatierno, @see-quick, @tinaselenge could you have another pass when you get a chance? Once it looks good to you I'll squash and rebase the commits on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my previous comments. There are just a few nits for you to look at then I will be happy to approve.
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Outdated
Show resolved
Hide resolved
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Outdated
Show resolved
Hide resolved
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Outdated
Show resolved
Hide resolved
...rc/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceConfigMapUtils.java
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtils.java
Outdated
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtils.java
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtils.java
Show resolved
Hide resolved
...rc/main/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/ExecutorStatus.java
Outdated
Show resolved
Hide resolved
...test/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtilsTest.java
Outdated
Show resolved
Hide resolved
...test/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceProgressUtilsTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had another pass but no more comments from me. Looking forward to address Tom's comments I agree with before approving ;-)
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Outdated
Show resolved
Hide resolved
e27587d
to
8eb58cf
Compare
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
8eb58cf
to
2ec3dc2
Compare
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the regression tests pass this LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kyguy . Looks good to me overall with one minor comment.
...rc/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceConfigMapUtils.java
Show resolved
Hide resolved
api/src/main/java/io/strimzi/api/kafka/model/rebalance/KafkaRebalanceStatus.java
Outdated
Show resolved
Hide resolved
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Outdated
Show resolved
Hide resolved
* @param status The {@link KafkaRebalanceStatus} object whose conditions will be updated. | ||
* @param exception The {@link Throwable} containing the reason and message for the Warning condition. | ||
*/ | ||
public static void addWarningCondition(KafkaRebalanceStatus status, Throwable exception) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't you use the existing method from StatusUtils?
Also, do we really want to add the condition for the error? I know it is in the proposal. But it seems like something what can easily lead to tight loops. I think log warning would be safer and sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't you use the existing method from StatusUtils?
This method has a little extra logic that is ensures that the first occurrence of a Warning is not overwritten by another Warning with the same reason and message. This way users can more easily trace when an ongoing issue first occurred instead of when it last occurred.
Also, do we really want to add the condition for the error? I know it is in the proposal. But it seems like something what can easily lead to tight loops. I think log warning would be safer and sufficient.
The inclusion of the condition for errors was asked for by one or some of the proposal reviewers for better UX. With the way it is written, it shouldn't be causing tight reconciliation loops. That being said, it wouldn't be hard to strip it out now or later if it is seen as unnecessarily risky.
Should I raise it with the reviewers on the community channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has a little extra logic that is ensures that the first occurrence of a Warning is not overwritten by another Warning with the same reason and message. This way users can more easily trace when an ongoing issue first occurred instead of when it last occurred.
Why do you think the methods in StatusUtils
override it? Or do you expect multiple identical warnings there? Why do you want them multiple times?
If the CC does not work, failing seems like a better idea than trying to continue with some warning conditions and uncertainty about what is really going on. That is how the operator behaves in general.
The inclusion of the condition for errors was asked for by one or some of the proposal reviewers for better UX. With the way it is written, it shouldn't be causing tight reconciliation loops. That being said, it wouldn't be hard to strip it out now or later if it is seen as unnecessarily risky.
So, when exactly does it happen? And why do you expect errors affecting only the progress and not the whole rebalance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you think the methods in StatusUtils override it?
What I mean to say is that we want to keep the timestamp of the Warning of the first occurrence of an error. If an error with the same message and reason happens on the next reconciliation, we want to make sure we keep the Warning that has the same message and reason from the last reconciliation.
Or do you expect multiple identical warnings there? Why do you want them multiple times?
No no, we just want one Warning here.
If the CC does not work, failing seems like a better idea than trying to continue with some warning conditions and uncertainty about what is really going on. That is how the operator behaves in general.
IIRC from the proposal discussions, the concern was that we didn't want a temporary issue of not being able to access the /kafkacruisecontrol/state?substates=executor endpoint for the progress information to interrupt/break an ongoing rebalance.
So, when exactly does it happen? And why do you expect errors affecting only the progress and not the whole rebalance?
If the Cruise Control REST API were to return an error or fail to respond to the operator when querying the /kafkacruisecontrol/state?substates=executor endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you think the methods in StatusUtils override it?
What I mean to say is that we want to keep the timestamp of the Warning of the first occurrence of an error. If an error with the same message and reason happens on the next reconciliation, we want to make sure we keep the Warning that has the same message and reason from the last reconciliation.
That should be done when updating the status during the status diffing. See the StatusDiff
class. If the only difference is the transition timestamp, the resource is not updated. That is what should be used by the KafkaRebalanceAssemblyOperator
and you should not need any special diffing. Assuming you want to keep the timestamp between reconciliations.
If you need to keep the timestamp first timestamp from warnings that happened within the same reconciliation, you are doing something wrong. You should either have two different warnings because they indicate something else or you should not set the first warning if you for example retry and set the warning only if it fails after all retries.
If the CC does not work, failing seems like a better idea than trying to continue with some warning conditions and uncertainty about what is really going on. That is how the operator behaves in general.
IIRC from the proposal discussions, the concern was that we didn't want a temporary issue of not being able to access the /kafkacruisecontrol/state?substates=executor endpoint for the progress information to interrupt/break an ongoing rebalance.
So, when exactly does it happen? And why do you expect errors affecting only the progress and not the whole rebalance?
If the Cruise Control REST API were to return an error or fail to respond to the operator when querying the /kafkacruisecontrol/state?substates=executor endpoint.
But isn't that an indication of some deeper issues? Or why do you expect that the whole CC will work fine and just this endpoint will be failing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the StatusDiff class. If the only difference is the transition timestamp, the resource is not updated.
Ah I see, I misunderstood the functionality. I assumed this code would update the resource with a new timestamp. If it doesn't then I can just reuse this code.
That is what should be used by the KafkaRebalanceAssemblyOperator and you should not need any special diffing. Assuming you want to keep the timestamp between reconciliations.
Yes, this is exactly what I need. Thank you for pointing this out.
But isn't that an indication of some deeper issues? Or why do you expect that the whole CC will work fine and just this endpoint will be failing?
Forgive me, I was mistaken with my reasoning, I think the original reasoning was not about the server failing to serve requests but about the request not reaching the endpoint due to a transient network issue.
Of course, if Cruise Control is failing to serve requests for this endpoint it will fail for the other endpoints and stop the KafkaRebalance
anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is what should be used by the KafkaRebalanceAssemblyOperator and you should not need any special diffing. Assuming you want to keep the timestamp between reconciliations.
Yes, this is exactly what I need. Thank you for pointing this out.
Just fixed this in the latest commit.
If the CC does not work, failing seems like a better idea than trying to continue with some warning conditions and uncertainty about what is really going on. That is how the operator behaves in general.
IIRC from the proposal discussions, the concern was that we didn't want a temporary issue of not being able to access the /kafkacruisecontrol/state?substates=executor endpoint for the progress information to interrupt/break an ongoing rebalance.
So, when exactly does it happen? And why do you expect errors affecting only the progress and not the whole rebalance?
If the Cruise Control REST API were to return an error or fail to respond to the operator when querying the /kafkacruisecontrol/state?substates=executor endpoint.
But isn't that an indication of some deeper issues? Or why do you expect that the whole CC will work fine and just this endpoint will be failing?
Forgive me, I was mistaken with my reasoning, I think the original reasoning was not about the server failing to serve requests but about the request not reaching the endpoint due to a transient network issue.
Of course, if Cruise Control is failing to serve requests for this endpoint it will fail for the other endpoints and stop the KafkaRebalance anyway.
Additionally, the Warning condition is used to indicate when some progress estimation fields cannot be calculated e.g, when a rebalance has just started and no data has been moved yet, making it impossible to calculate the data movement rate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, the Warning condition is used to indicate when some progress estimation fields cannot be calculated e.g, when a rebalance has just started and no data has been moved yet, making it impossible to calculate the data movement rate.
That does not seem like it deserves a warning because it is an expected situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, I do not see any blockers in the PR. But it should be reviewed by Paolo and Kate.
...o/strimzi/operator/cluster/operator/assembly/AbstractKafkaRebalanceAssemblyOperatorTest.java
Show resolved
Hide resolved
...src/main/java/io/strimzi/operator/common/model/cruisecontrol/CruiseControlExecutorState.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
@ppatierno @katheris Could you have another pass when you get a chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @kyguy!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments with nits around context in comments, indentation in the test classes and an info message that could be a warning. But happy to approve as otherwise the changes look good.
.../main/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceAssemblyOperator.java
Outdated
Show resolved
Hide resolved
...rc/main/java/io/strimzi/operator/cluster/model/cruisecontrol/CruiseControlExecutorState.java
Show resolved
Hide resolved
...rc/main/java/io/strimzi/operator/cluster/operator/resource/cruisecontrol/ExecutorStatus.java
Outdated
Show resolved
Hide resolved
...o/strimzi/operator/cluster/operator/assembly/AbstractKafkaRebalanceAssemblyOperatorTest.java
Outdated
Show resolved
Hide resolved
...o/strimzi/operator/cluster/operator/assembly/AbstractKafkaRebalanceAssemblyOperatorTest.java
Outdated
Show resolved
Hide resolved
...o/strimzi/operator/cluster/operator/assembly/AbstractKafkaRebalanceAssemblyOperatorTest.java
Outdated
Show resolved
Hide resolved
...o/strimzi/operator/cluster/operator/assembly/AbstractKafkaRebalanceAssemblyOperatorTest.java
Outdated
Show resolved
Hide resolved
...o/strimzi/operator/cluster/operator/assembly/AbstractKafkaRebalanceAssemblyOperatorTest.java
Outdated
Show resolved
Hide resolved
...est/java/io/strimzi/operator/cluster/operator/assembly/KafkaRebalanceConfigMapUtilsTest.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Kyle Liberti <kliberti@redhat.com>
Thanks for the final reviews @katheris @ppatierno! Could you run the regressions tests one last time? |
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
Type of change
Description
This PR introduces a new feature to monitor the progression of an ongoing partition rebalance executed by a Strimzi-managed Cruise Control instance via a
KafkaRebalance
custom resource.Implementation of this proposal: strimzi/proposals#140
Addresses the issue raised here: #10278
Checklist
Please go through this checklist and make sure all applicable tasks have been done