e2e test for pod completion and next pod start #458
We test that we do not ungate more pods than we can fit on the available GPUs by launching 8 long-running 1g pods and checking that exactly 7 are running (in a single-GPU setup). We should extend such a test to:

- check that the remaining pod stays in the SchedulingGated state while the other pods run, and
- check that, once one of the running pods completes, the gated pod is ungated and eventually starts running, as sketched below.
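A minimal sketch of what the extended check could look like, assuming a Ginkgo/Gomega e2e test built on client-go in the spirit of the existing test/e2e/e2e_test.go; the namespace, label selector, kubeconfig lookup, and timeouts below are placeholders, not the repository's actual helpers:

```go
// Illustrative sketch only: namespace, label selector, kubeconfig path, and
// timeouts are assumptions, not the helpers actually used in e2e_test.go.
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

var _ = It("runs the gated pod once a running pod completes", func() {
	ctx := context.Background()

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	Expect(err).NotTo(HaveOccurred())
	clientSet, err := kubernetes.NewForConfig(cfg)
	Expect(err).NotTo(HaveOccurred())

	namespace := "default"         // placeholder
	selector := "app=vectoradd-1g" // placeholder label shared by the 8 test pods

	listPods := func() ([]corev1.Pod, error) {
		pods, err := clientSet.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return nil, err
		}
		return pods.Items, nil
	}

	// Count running pods and pods that still carry a scheduling gate.
	count := func() (running, gated int, err error) {
		pods, err := listPods()
		if err != nil {
			return 0, 0, err
		}
		for _, p := range pods {
			if p.Status.Phase == corev1.PodRunning {
				running++
			}
			if len(p.Spec.SchedulingGates) > 0 {
				gated++
			}
		}
		return running, gated, nil
	}

	// First point: with 8 pods and 7 slices available, exactly 7 run and 1 stays gated.
	Eventually(func() bool {
		running, gated, err := count()
		return err == nil && running == 7 && gated == 1
	}, 5*time.Minute, 10*time.Second).Should(BeTrue())

	// Simulate completion by deleting one running pod.
	pods, err := listPods()
	Expect(err).NotTo(HaveOccurred())
	for _, p := range pods {
		if p.Status.Phase == corev1.PodRunning {
			Expect(clientSet.CoreV1().Pods(namespace).Delete(ctx, p.Name, metav1.DeleteOptions{})).To(Succeed())
			break
		}
	}

	// Second point: the gated pod is ungated and eventually runs, restoring 7 running pods.
	Eventually(func() bool {
		running, gated, err := count()
		return err == nil && running == 7 && gated == 0
	}, 5*time.Minute, 10*time.Second).Should(BeTrue())
})
```

Whether completion is simulated by deleting a pod or by using short-lived pods that exit on their own is a design choice; the latter is closer to the "pod completion" scenario in the issue title.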
Comments
Thanks, this test case on KinD does what we ask in the first sub-bullet point across two GPUs; one pod remains SchedulingGated: see instaslice-operator/test/e2e/e2e_test.go, line 692 at ce1f522.

AFAIK this test only addresses the first point (one pod remains gated), not the second (the gated pod eventually runs).
Hello @asm582, I am a newbie currently testing InstaSlice for dynamic MIG allocation, and I am facing the same issue where pods stay in the SchedulingGated state indefinitely. I am using an NVIDIA A30. At first it worked well when I deployed two pods with 1g.6gb requests, but when I later tried to deploy a third pod with 2g.12gb as its limit, it stayed SchedulingGated even though there is enough free capacity for that allocation. What I noticed is the following error: [error output did not load]. I appreciate your help on this topic. UPDATE: apparently InstaSlice is creating a GPU instance of profile 2g.12gb+me when a pod requests 2g.12gb.
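For reproduction, a minimal sketch of the kind of pod the report describes, assuming the profile is requested as the extended resource nvidia.com/mig-2g.12gb (the name used by the NVIDIA device plugin's mixed MIG strategy); the namespace, image, and resource name are assumptions and may not match what InstaSlice expects:

```go
// Illustrative only: the extended resource name "nvidia.com/mig-2g.12gb" and
// the image are assumptions; adjust them to what InstaSlice actually expects.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientSet, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "mig-2g-12gb-test"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "cuda-sample",
				Image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Assumed resource name for the 2g.12gb MIG profile.
						"nvidia.com/mig-2g.12gb": resource.MustParse("1"),
					},
				},
			}},
		},
	}

	created, err := clientSet.CoreV1().Pods("default").Create(context.Background(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created pod:", created.Name)
}
```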
Hi @KM3dd, thanks for reporting this issue. InstaSlice is designed to work on A100s and H100s (the instances most cloud providers offer), so it may not work with A30s. To enable it on the A30 you may need to change https://github.com/openshift/instaslice-operator/blob/main/internal/controller/capacity.go ; we are happy to take this contribution from you on a separate issue.
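Purely illustrative, and not the actual layout of capacity.go: a sketch of the A30 MIG profile data (per NVIDIA's MIG documentation for the 24 GB A30) that a contribution along these lines would need to encode somewhere:

```go
// Hypothetical sketch: the real capacity.go likely uses different types and
// field names; this only illustrates the A30 MIG profile data to be captured.
package capacity

// MIGProfile describes one placeable MIG profile on a GPU model.
type MIGProfile struct {
	Name         string // e.g. "2g.12gb"
	MemoryGB     int    // memory footprint of the profile in GB
	ComputeUnits int    // GPU compute slices consumed
	MaxInstances int    // maximum concurrent instances of this profile
}

// a30Profiles lists the MIG profiles supported by the NVIDIA A30 (24 GB, 4 compute slices).
var a30Profiles = []MIGProfile{
	{Name: "1g.6gb", MemoryGB: 6, ComputeUnits: 1, MaxInstances: 4},
	{Name: "1g.6gb+me", MemoryGB: 6, ComputeUnits: 1, MaxInstances: 1},
	{Name: "2g.12gb", MemoryGB: 12, ComputeUnits: 2, MaxInstances: 2},
	{Name: "2g.12gb+me", MemoryGB: 12, ComputeUnits: 2, MaxInstances: 1},
	{Name: "4g.24gb", MemoryGB: 24, ComputeUnits: 4, MaxInstances: 1},
}
```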