e2e test for pod completion and next pod start #458


Open
tardieu opened this issue Feb 24, 2025 · 4 comments


tardieu commented Feb 24, 2025

We test that we do not ungate more pods than can fit on the available GPUs by launching 8 long-running 1g pods and checking that exactly 7 are running (in a single-GPU setup). We should extend this test to:

  1. confirm that when one of the running pods completes, the pending pod starts running;
  2. verify the transition latency, i.e., that the pending pod starts running without delay.
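The invariant the extended test should observe is easy to state on its own. Here is a stdlib-only Go sketch — a toy model, not InstaSlice code; the `state` type and its methods are made up for illustration — of the two checks above:

```go
package main

import "fmt"

// Toy model of the ungating invariant: with capacity for C pods and N
// pods submitted, exactly min(C, N) run and the rest stay gated until a
// running pod completes. Names below are illustrative, not InstaSlice APIs.

type state struct {
	capacity int // 1g slices available
	running  int
	gated    int
}

// ungate admits gated pods while capacity remains.
func (s *state) ungate() {
	for s.gated > 0 && s.running < s.capacity {
		s.gated--
		s.running++
	}
}

// complete finishes one running pod; the pending pod should then be
// ungated immediately (the transition-latency check in point 2).
func (s *state) complete() {
	if s.running > 0 {
		s.running--
		s.ungate()
	}
}

func main() {
	s := &state{capacity: 7, gated: 8}
	s.ungate()
	fmt.Println(s.running, s.gated) // prints: 7 1

	s.complete()
	fmt.Println(s.running, s.gated) // prints: 7 0
}
```

In the real e2e test, `complete()` corresponds to one of the 8 pods exiting, and the latency check in point 2 would bound the time between that completion and the gated pod's transition to Running.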
@harche self-assigned this Feb 24, 2025

asm582 commented Feb 24, 2025

Thanks. This test case on KinD does what the first numbered point asks, across two GPUs; one pod remains scheduling gated:

It("should verify all 1g profiles of GPUs are consumed", func() { /* ... */ })


tardieu commented Feb 24, 2025

AFAIK this test only addresses point 0 (one pod remains gated), not point 1 (the gated pod eventually runs).


KM3dd commented Mar 13, 2025

Hello @asm582, I am a newbie currently testing InstaSlice for dynamic MIG allocation, and I am facing the same issue: pods stay in the SchedulingGated state indefinitely. I am using an NVIDIA A30.

It worked well the first time, when I deployed two pods requesting 1g.6gb. But when I later deployed a third pod with 2g.12gb as its limit, it stayed in SchedulingGated even though there is enough free capacity for that allocation.

What I noticed is this error: error getting compute instance profile info, : Invalid Argument. It has persisted since it first appeared and has prevented me from deploying any other pod with InstaSlice.

I appreciate your help on this topic.

UPDATE: apparently InstaSlice creates a GPU instance with the 2g.12gb+me profile when a pod requests 2g.12gb.



asm582 commented Mar 13, 2025

Hi @KM3dd, thanks for reporting this issue. InstaSlice is designed to work on A100s and H100s (the instances that most cloud providers provide), so it may not work with A30s. To enable it on the A30 you may need to change https://github.com/openshift/instaslice-operator/blob/main/internal/controller/capacity.go; we are happy to take this contribution from you on a separate issue.
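For reference, the A30 is a 24 GB GPU with 4 compute slices, and its MIG profiles (including the +me media-extension variants KM3dd observed) are published in NVIDIA's MIG documentation. A hypothetical sketch of the kind of data capacity.go would need — the map layout and type names here are illustrative, not the file's actual structure:

```go
package main

import "fmt"

// Hypothetical sketch of A30 capacity data. Profile names and slice
// counts follow NVIDIA's published A30 MIG profile table; the map shape
// is illustrative, not InstaSlice's actual capacity.go structure.

type migProfile struct {
	slices   int // compute slices consumed (A30 has 4 total)
	memoryGB int
}

var a30Profiles = map[string]migProfile{
	"1g.6gb":     {slices: 1, memoryGB: 6},
	"1g.6gb+me":  {slices: 1, memoryGB: 6},  // media-extension variant
	"2g.12gb":    {slices: 2, memoryGB: 12},
	"2g.12gb+me": {slices: 2, memoryGB: 12}, // media-extension variant
	"4g.24gb":    {slices: 4, memoryGB: 24},
}

func main() {
	p := a30Profiles["2g.12gb"]
	fmt.Printf("2g.12gb uses %d slices, %d GB\n", p.slices, p.memoryGB)
}
```

Note that 2g.12gb and 2g.12gb+me are distinct profiles, which is consistent with the mismatch KM3dd reported: creating the +me variant when the plain profile was requested would make any lookup keyed on the requested name fail.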
