Feature/vllm/input embedding completion api #17590

Nan2018 · 2025-05-02T14:49:16Z

Adds support for passing prompt_embeds as base64-encoded bytes to the Completions API.

Start the server with:

VLLM_USE_V1=0 vllm serve HuggingFaceH4/zephyr-7b-beta

Query example:

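import io
from base64 import b64encode

import requests
import torch

url = "http://localhost:8000/v1/completions"

# inputs_embeds is a list of 2D tensors, each of shape (seq_len, embed_dim)
prompt_embeds = []
for input_embeds in inputs_embeds:
    # Serialize each tensor with torch.save, then base64-encode the raw bytes
    buff = io.BytesIO()
    torch.save(input_embeds.detach().cpu(), buff)
    prompt_embeds.append(b64encode(buff.getvalue()).decode("utf-8"))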

resps = requests.post(
    url,
    json={
        "model": request_model,
        "prompt_embeds": prompt_embeds,
    },
)

# Or with the OpenAI client (`client` is assumed to be an openai.OpenAI
# instance pointed at the server started above)

completions = client.completions.create(
    model=request_model,
    prompt="",
    extra_body={"prompt_embeds": prompt_embeds},
)
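
The encoding step above can be checked in isolation, without a running server. The following is a minimal sketch, not part of this PR: the helper names (encode_embeds, decode_embeds, build_payload) are hypothetical, and the byte strings stand in for the output of torch.save so the example runs with the standard library only. It shows that the base64 strings survive a JSON round trip and decode back to the original bytes.

```python
import base64
import json


def encode_embeds(serialized: bytes) -> str:
    """Base64-encode serialized tensor bytes as a UTF-8 string."""
    return base64.b64encode(serialized).decode("utf-8")


def decode_embeds(encoded: str) -> bytes:
    """Reverse encode_embeds; validate=True rejects non-base64 characters."""
    return base64.b64decode(encoded, validate=True)


def build_payload(model: str, serialized_embeds: list[bytes]) -> dict:
    """Build a completions request body with one encoded entry per prompt."""
    return {
        "model": model,
        "prompt_embeds": [encode_embeds(raw) for raw in serialized_embeds],
    }


# Stand-in bytes (assumption): in practice these would be produced by
# torch.save(tensor, buffer) followed by buffer.getvalue().
fake_tensors = [b"\x00\x01binary-tensor-1", b"\xff\xfebinary-tensor-2"]
payload = build_payload("HuggingFaceH4/zephyr-7b-beta", fake_tensors)

# Simulate sending the request: serialize to JSON, parse it, decode each entry.
body = json.dumps(payload)
restored = [decode_embeds(s) for s in json.loads(body)["prompt_embeds"]]
```

Base64 is used because JSON cannot carry raw bytes; the decoded bytes would then be handed to a deserializer such as torch.load on the receiving side.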

Note

This does not work with LoRA or prompt adapters.

github-actions · 2025-05-02T14:49:25Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

CandiedCode · 2025-05-02T20:58:23Z

Now that #15428 has landed, it will be great to be able to use this functionality 🙏🏽. @DarkLight1337, I'm curious what your thoughts are on the issues @Nan2018 is having.

临景 and others added 30 commits April 2, 2025 14:37

(vllm) add input embedding

cef6894

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

improve embedding input

c51d8fb

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

(vllm) fix import error

9564b40

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

(vllm) fix pre commit error

c60298a

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

apply ruff and isort fixes

0c24a82

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

apply ruff and isort fixes

403a165

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

styling

b1ac072

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix missing imports from rebase

0390c33

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

typing fixes

0ca4dae

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

type fix

35320fe

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

type fix

0a77630

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

remove unnecessary changes

11b6c02

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

remove unnecessary changes

cb92a3d

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

re-add deleted whitespace

375bd5b

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

Include unit tests from vllm-project#6869.

c9d8024

Co-authored-by: Nan2018 <nan@protopia.ai> Signed-off-by: Andrew Sansom <andrew@protopia.ai>

remove unrelated qwen2 changes

a64e627

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

guard clause around fully consumed prompt embeds to avoid returning e…

6ab349e

…mpty tensors instead of none Signed-off-by: Andrew Sansom <andrew@protopia.ai>

use v0 for prompt embeds model runner tests

26c8784

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix batching of input embeddings

b71a13c

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

style formatting

4aa9ade

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

remove incorrect overload

e2c4c26

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

remove incorrect overload

26d108a

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

Update representations

af20435

Signed-off-by: Andrew Sansom <qthequartermasterman@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

remove unrelated changes to docs

25aaf3f

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

remove unrelated typing change

bc05860

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix missing syntax

b55800d

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

do not schedule prompt embeds and non-prompt embeds in the same batch

be42a17

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix style linelength

c8fcfe4

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

Merge branch 'main' into feature/vllm/add-input-embedding

b21688f

propogate embeddings for sampled output tokens for decoding

1e359ae

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

qthequartermasterman and others added 27 commits April 18, 2025 16:30

preprocess tensors to handle batched/misshaped prompt embeds to avoid…

9a57aca

… handling shape mismatches in the engine and model runner Signed-off-by: Andrew Sansom <andrew@protopia.ai>

use seperate Embedsprompt class for preprocessing inputs embeddings

bbfb0f0

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix typing

933e567

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix type errors

4e0d12f

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

Merge branch 'vllm-project:main' into feature/vllm/add-input-embedding

164aeb5

fix mistaken type change

9e6909e

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

add missing type hint

90b950a

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

add spaces for style

01d83f4

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

seperate EmbedsInputs from TokenInputs and embeds_inputs from token_i…

6985452

…nputs to have separate structs for handling input embeddings Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix docstrings for EmbedsInputs

e916551

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix typing for token_type_ids

69f8725

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix typing for embeds_tokens in InputRegistry and InputsAdapter

9c2c89f

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

remove prompts and prompt_token_ids from EmbedsPrompts

499dc6a

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

Merge branch 'main' into feature/vllm/add-input-embedding

20668ca

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fight mypy to get correct typing for not embeds prompts

6712ba6

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

remove incorrect call to embeds_inputs

740b290

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

wrestle with mypy and typeddict type narrowing

8f9bd51

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

wrestle with mypy and typeddict type narrowing

b8d36c6

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

support indexing graph runners that with inputs_embeds

b764c19

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

feat: completions using embeddings

0e75db4

Signed-off-by: Nan2018 <nan@protopia.ai>

Merge branch 'main' into feature/vllm/add-input-embedding

cb6ff22

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

support encoder decoder models with inputs_embeds

85642d0

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

simplify redundant ternary statement

b226fd6

Signed-off-by: Andrew Sansom <andrew@protopia.ai>

explicitly remove support for inputs embeds with speculative decoding…

b738d3f

… models Signed-off-by: Andrew Sansom <andrew@protopia.ai>

fix occasional device mismatch errors when appending output tokens to…

2340119

… a sample outputs Signed-off-by: Andrew Sansom <andrew@protopia.ai>

Merge remote-tracking branch 'andrew/feature/vllm/add-input-embedding…

6a3173a

…' into feature/vllm/input-embedding-completion-api Signed-off-by: Nan2018 <nan@protopia.ai>

Merge remote-tracking branch 'nan/main' into feature/vllm/input-embed…

06215c0

…ding-completion-api Signed-off-by: Nan2018 <nan@protopia.ai>

mergify bot added the frontend label May 2, 2025
