e2e test for pod completion and next pod start #458


Open
tardieu opened this issue Feb 24, 2025 · 4 comments


tardieu commented Feb 24, 2025

We test that we do not ungate more pods than can fit on the available GPUs by launching 8 long-running 1g pods and checking that exactly 7 are running (in a single-GPU setup). We should extend this test to:

  1. confirm that when one of the running pods completes, the pending pod starts running;
  2. verify the transition latency, i.e., that the pending pod starts running without delay.
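The invariant the extended test should observe is easy to state on its own. Here is a stdlib-only Go sketch — a toy model, not InstaSlice code; the `state` type and its methods are made up for illustration — of the two checks above:

```go
package main

import "fmt"

// Toy model of the ungating invariant: with capacity for C pods and N
// pods submitted, exactly min(C, N) run and the rest stay gated until a
// running pod completes. Names below are illustrative, not InstaSlice APIs.

type state struct {
	capacity int // 1g slices available
	running  int
	gated    int
}

// ungate admits gated pods while capacity remains.
func (s *state) ungate() {
	for s.gated > 0 && s.running < s.capacity {
		s.gated--
		s.running++
	}
}

// complete finishes one running pod; the pending pod should then be
// ungated immediately (the transition-latency check in point 2).
func (s *state) complete() {
	if s.running > 0 {
		s.running--
		s.ungate()
	}
}

func main() {
	s := &state{capacity: 7, gated: 8}
	s.ungate()
	fmt.Println(s.running, s.gated) // prints: 7 1

	s.complete()
	fmt.Println(s.running, s.gated) // prints: 7 0
}
```

In the real e2e test, `complete()` corresponds to one of the 8 pods exiting, and the latency check in point 2 would bound the time between that completion and the gated pod's transition to Running.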
@harche self-assigned this Feb 24, 2025

asm582 commented Feb 24, 2025

Thanks. This test case on KinD does what the first numbered point asks, across two GPUs; one pod remains scheduling gated:

It("should verify all 1g profiles of GPUs are consumed", func() { /* ... */ })


tardieu commented Feb 24, 2025

AFAIK this test only addresses point 0 (one pod remains gated), not point 1 (the gated pod eventually runs).


KM3dd commented Mar 13, 2025

Hello @asm582, I am a newbie currently testing InstaSlice for dynamic MIG allocation, and I am facing the same issue: pods stay in the SchedulingGated state indefinitely. I am using an NVIDIA A30.

It worked well the first time, when I deployed two pods requesting 1g.6gb. But when I later deployed a third pod with 2g.12gb as its limit, it stayed in SchedulingGated even though there is enough free capacity for that allocation.

What I noticed is this error: error getting compute instance profile info, : Invalid Argument. It has persisted since it first appeared and has prevented me from deploying any other pod with InstaSlice.

I appreciate your help on this topic.

UPDATE: apparently InstaSlice creates a GPU instance with the 2g.12gb+me profile when a pod requests 2g.12gb.



asm582 commented Mar 13, 2025

Hi @KM3dd, thanks for reporting this issue. InstaSlice is designed to work on A100s and H100s (the instances that most cloud providers provide), so it may not work with A30s. To enable it on the A30 you may need to change https://github.com/openshift/instaslice-operator/blob/main/internal/controller/capacity.go; we are happy to take this contribution from you on a separate issue.
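For reference, the A30 is a 24 GB GPU with 4 compute slices, and its MIG profiles (including the +me media-extension variants KM3dd observed) are published in NVIDIA's MIG documentation. A hypothetical sketch of the kind of data capacity.go would need — the map layout and type names here are illustrative, not the file's actual structure:

```go
package main

import "fmt"

// Hypothetical sketch of A30 capacity data. Profile names and slice
// counts follow NVIDIA's published A30 MIG profile table; the map shape
// is illustrative, not InstaSlice's actual capacity.go structure.

type migProfile struct {
	slices   int // compute slices consumed (A30 has 4 total)
	memoryGB int
}

var a30Profiles = map[string]migProfile{
	"1g.6gb":     {slices: 1, memoryGB: 6},
	"1g.6gb+me":  {slices: 1, memoryGB: 6},  // media-extension variant
	"2g.12gb":    {slices: 2, memoryGB: 12},
	"2g.12gb+me": {slices: 2, memoryGB: 12}, // media-extension variant
	"4g.24gb":    {slices: 4, memoryGB: 24},
}

func main() {
	p := a30Profiles["2g.12gb"]
	fmt.Printf("2g.12gb uses %d slices, %d GB\n", p.slices, p.memoryGB)
}
```

Note that 2g.12gb and 2g.12gb+me are distinct profiles, which is consistent with the mismatch KM3dd reported: creating the +me variant when the plain profile was requested would make any lookup keyed on the requested name fail.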
