✨ Operator managed quota. #199
One thing that I would be concerned about, though I think there are options around this, is this scenario:
I think one way to get around this is to have the hub/UI pods use a higher priority class and all the task pods use another priority class, then apply the resource quota to just the task pods using the resource quota scope of priority class.
Just a thought, and it may be a premature optimization, but it is a concern.
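For illustration only (not from this PR), here is a minimal sketch of that scoped-quota idea, assuming a hypothetical `tackle-task` priority class and made-up limits: a dedicated PriorityClass is assigned to task pods, and a ResourceQuota uses a scope selector that matches only that class, so hub/UI pods are not counted against it.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// taskPriorityClass is a hypothetical, cluster-scoped class assigned only to task pods.
var taskPriorityClass = schedulingv1.PriorityClass{
	ObjectMeta: metav1.ObjectMeta{Name: "tackle-task"},
	Value:      0, // no higher than the implicit default, so task pods never preempt controllers
}

// taskQuota counts only pods created with priorityClassName "tackle-task".
var taskQuota = corev1.ResourceQuota{
	ObjectMeta: metav1.ObjectMeta{Name: "tackle-task-quota", Namespace: "konveyor-tackle"},
	Spec: corev1.ResourceQuotaSpec{
		// Hard limits here are placeholders.
		Hard: corev1.ResourceList{
			corev1.ResourceRequestsCPU:    resource.MustParse("8"),
			corev1.ResourceRequestsMemory: resource.MustParse("16Gi"),
		},
		// Scope the quota to the task priority class so hub/UI pods are exempt.
		ScopeSelector: &corev1.ScopeSelector{
			MatchExpressions: []corev1.ScopedResourceSelectorRequirement{{
				ScopeName: corev1.ResourceQuotaScopePriorityClass,
				Operator:  corev1.ScopeSelectorOpIn,
				Values:    []string{"tackle-task"},
			}},
		},
	},
}

func main() {
	fmt.Println(taskPriorityClass.Name, taskQuota.Name)
}
```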
The quota limits the creation of the Pod resource. So, if the hub crashes or is OOM killed, the quota would not impact restarting the container. I think the potentially concerning use case would be related to the operator needing to restart pods, such as an upgrade while tasks are in flight, or an edit of the Tackle CR causing the operator to update a deployment.
Some options (not in any order). Though, I think #4 is the simplest/cleanest.
Option 1
Perhaps this could be mitigated by having the operator reconcile in the following order:
This seems fragile and does not address all of the use cases.
Option 2
Priority classes (suggested by @shawn-hurley)
Unfortunately, priority classes are cluster-scoped, and without a default priority class (none exists by default), pods with no assigned class default to a priority of zero (0). IIUC, this means that Tackle pods would run at a higher priority than all other pods in the cluster (including controllers). Also, our operator would need permissions to create the cluster-scoped priority classes, which I highly doubt would be acceptable. All of this adds up to: we can't use or rely on priority classes.
Option 3
It is also unfortunate that quotas don't support label selectors.
The only option I have considered to address this concern is for task pods (and the quota) to be created in another namespace. Example: konveyor-task. This would have the benefits of:
But with the following drawbacks:
Option 4
Perhaps a better option would be to implement the quota in the hub. The hub would list the nodes and calculate the allocatable CPU and memory. Then it would (optionally) create task pods within the requested quota, expressed as a percentage passed to the hub in an ENVAR. I had a PR for this at one time but discarded it in favor of ResourceQuota. The algorithm/calculation is straightforward. Given the limitations of ResourceQuota, this may be the better way to go.
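For illustration only (not the hub's actual code), a minimal sketch of that calculation, assuming a hypothetical `QUOTA_PERCENT` environment variable and in-cluster client-go access: list the nodes, sum allocatable CPU and memory, and derive the task-pod budget as a percentage of the total.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// clusterQuota returns the CPU (millicores) and memory (bytes) budget for task pods.
func clusterQuota(ctx context.Context, client kubernetes.Interface) (cpuMilli, memBytes int64, err error) {
	// QUOTA_PERCENT is a hypothetical ENVAR name used for this sketch.
	pct, err := strconv.ParseFloat(os.Getenv("QUOTA_PERCENT"), 64)
	if err != nil {
		return 0, 0, err
	}
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, 0, err
	}
	// Sum allocatable CPU and memory across all nodes.
	for _, node := range nodes.Items {
		cpu := node.Status.Allocatable[corev1.ResourceCPU]
		mem := node.Status.Allocatable[corev1.ResourceMemory]
		cpuMilli += cpu.MilliValue()
		memBytes += mem.Value()
	}
	// Apply the requested percentage.
	cpuMilli = int64(float64(cpuMilli) * pct / 100)
	memBytes = int64(float64(memBytes) * pct / 100)
	return cpuMilli, memBytes, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	cpu, mem, err := clusterQuota(context.Background(), client)
	if err != nil {
		panic(err)
	}
	fmt.Printf("task pod budget: %dm CPU, %d bytes memory\n", cpu, mem)
	// Before creating a task pod, the hub would compare the summed requests of
	// running task pods against this budget and defer creation when exceeded.
}
```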
I think that if the other suggestions don't work, then what you described makes sense; sorry for the wild goose chase. I just thought there was maybe something we could do.
With that said, the use case I am worried about is an edge case, and barring good options (I agree 1-3 are out, and I don't want to re-create a scheduler for option 4), I think what we have is sufficient.
Well, we already have a scheduler. Option 4 only proposed to make it a little smarter.
Some kind of scale test should probably be defined to prove the quota helps with handling higher loads.
Also, do we have any upper/lower bounds? 99% or 100% would probably give basically no protection, while 1%, 2%, etc. might make it just about impossible to schedule anything.
Perhaps, but it's hard to say. 2% of a huge cluster may be enough. On the flip side, 100% is essentially worthless but not broken.
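To make that concrete with hypothetical numbers: 2% of a cluster with 1,000 allocatable cores is still 20 cores for task pods, while 2% of a 16-core single-node install is only about 0.3 cores.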
This would be very complicated.