
Commit 30e1c6d

DarkLight1337 and ywang96 authored and committed
[Model] PP support for embedding models and update docs (vllm-project#9090)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>
1 parent 70599d4 commit 30e1c6d
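
In practical terms, this change means an embedding model can now be split across pipeline stages and queried through the OpenAI-compatible Embeddings API. The snippet below is a rough client-side sketch rather than part of this commit; the model name, pipeline-parallel size, port, and API key are illustrative assumptions, and the server would be started separately (for example with vllm serve intfloat/e5-mistral-7b-instruct --pipeline-parallel-size 2).

# Rough client-side sketch (not from this commit): query an embedding model
# served with pipeline parallelism, e.g. started via
#   vllm serve intfloat/e5-mistral-7b-instruct --pipeline-parallel-size 2
# The base URL and API key below are the usual local-server defaults.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["Pipeline parallelism now also covers embedding models."],
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector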

File tree

12 files changed, +610 -449 lines changed


docs/source/models/supported_models.rst

Lines changed: 56 additions & 4 deletions
@@ -7,10 +7,12 @@ vLLM supports a variety of generative Transformer models in `HuggingFace Transfo
 The following is the list of model architectures that are currently supported by vLLM.
 Alongside each architecture, we include some popular models that use it.
 
-----
+Text-only Language Models
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Text Generation
+---------------
 
-Decoder-only Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. list-table::
   :widths: 25 25 50 5 5
   :header-rows: 1
@@ -40,6 +42,11 @@ Decoder-only Language Models
     - :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
     -
     - ✅︎
+  * - :code:`BartForConditionalGeneration`
+    - BART
+    - :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc.
+    -
+    -
   * - :code:`ChatGLMModel`
     - ChatGLM
     - :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
@@ -259,11 +266,55 @@ Decoder-only Language Models
 .. note::
   Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
 
-.. _supported_vlms:
+Text Embedding
+--------------
+
+.. list-table::
+  :widths: 25 25 50 5 5
+  :header-rows: 1
+
+  * - Architecture
+    - Models
+    - Example HuggingFace Models
+    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
+  * - :code:`Gemma2Model`
+    - Gemma2-based
+    - :code:`BAAI/bge-multilingual-gemma2`, etc.
+    -
+    - ✅︎
+  * - :code:`MistralModel`
+    - Mistral-based
+    - :code:`intfloat/e5-mistral-7b-instruct`, etc.
+    -
+    - ✅︎
+
+Reward Modeling
+---------------
+
+.. list-table::
+  :widths: 25 25 50 5 5
+  :header-rows: 1
+
+  * - Architecture
+    - Models
+    - Example HuggingFace Models
+    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
+  * - :code:`Qwen2ForRewardModel`
+    - Qwen2-based
+    - :code:`Qwen/Qwen2.5-Math-RM-72B`, etc.
+    -
+    - ✅︎
+
+.. note::
+  As an interim measure, these models are supported via Embeddings API. See `this RFC <https://github.com/vllm-project/vllm/issues/8967>`_ for upcoming changes.
 
 Multimodal Language Models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+.. _supported_vlms:
+
 .. list-table::
   :widths: 25 25 25 25 5 5
   :header-rows: 1
@@ -378,6 +429,7 @@ Multimodal Language Models
   For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
   For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
 
+----
 
 If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
 Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
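
For reference, the embedding architectures documented above are served through vLLM's encode path. The snippet below is a minimal offline sketch, not taken from the docs in this diff; it assumes a single GPU large enough for the model, and the model name is just one of the examples listed in the table.

# Minimal offline sketch for one of the embedding models listed above.
# Assumes the model fits on the available GPU; enforce_eager only shortens startup.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)
outputs = llm.encode(["Hello, my name is", "The capital of France is"])

for output in outputs:
    # Each result carries the pooled embedding vector for one prompt.
    print(len(output.outputs.embedding))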

docs/source/models/vlm.rst

Lines changed: 3 additions & 4 deletions
@@ -6,10 +6,9 @@ Using VLMs
 vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
 This document shows you how to run and serve these models using vLLM.
 
-.. important::
-    We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
-
-    We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
+.. note::
+    We are actively iterating on VLM support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
+    and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
 
 Offline Inference
 -----------------

tests/distributed/test_pipeline_parallel.py

Lines changed: 114 additions & 32 deletions
@@ -7,7 +7,7 @@
 """
 import os
 from dataclasses import dataclass
-from typing import List, NamedTuple, Optional
+from typing import List, Literal, NamedTuple, Optional
 
 import pytest
 
@@ -97,22 +97,23 @@ def iter_params(self, model_name: str):
                    self.trust_remote_code, self.tokenizer_mode)
 
 
+# NOTE: You can adjust tp_base and/or pp_base locally to fit the model in GPU
+# The values displayed here are only a rough indicator of the size of the model
+
 # yapf: disable
 GENERATION_MODEL_SETTINGS = {
     # [DETAILED TESTS]
     "meta-llama/Meta-Llama-3-8B": PPTestSettings.detailed(),
     # [FAST TESTS]
     # Uses Llama
     # "BAAI/AquilaChat-7B": PPTestSettings.fast(),
-    # TODO: Test on larger GPU
-    # "Snowflake/snowflake-arctic-instruct": PPTestSettings.fast(trust_remote_code=True),  # noqa: E501
+    "Snowflake/snowflake-arctic-instruct": PPTestSettings.fast(tp_base=8, trust_remote_code=True),  # noqa: E501
     "baichuan-inc/Baichuan-7B": PPTestSettings.fast(trust_remote_code=True),
     "baichuan-inc/Baichuan2-13B-Chat": PPTestSettings.fast(trust_remote_code=True),  # noqa: E501
     "bigscience/bloomz-1b1": PPTestSettings.fast(),
     "THUDM/chatglm3-6b": PPTestSettings.fast(trust_remote_code=True),
     "CohereForAI/c4ai-command-r-v01": PPTestSettings.fast(tp_base=2, trust_remote_code=True),  # noqa: E501
-    # TODO: Test on larger GPU
-    # "databricks/dbrx-instruct": PPTestSettings.fast(),
+    "databricks/dbrx-instruct": PPTestSettings.fast(tp_base=8),
    "Deci/DeciLM-7B-instruct": PPTestSettings.fast(trust_remote_code=True),
     "deepseek-ai/deepseek-llm-7b-chat": PPTestSettings.fast(),
     "deepseek-ai/DeepSeek-V2-Lite-Chat": PPTestSettings.fast(trust_remote_code=True),  # noqa: E501
@@ -161,8 +162,9 @@ def iter_params(self, model_name: str):
 
 EMBEDDING_MODEL_SETTINGS = {  # type: ignore[var-annotated]
     # [FAST TESTS]
-    # Uses Llama
-    # "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(),
+    "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(),
+    "BAAI/bge-multilingual-gemma2": PPTestSettings.fast(),
+    "Qwen/Qwen2.5-Math-RM-72B": PPTestSettings.fast(tp_base=4, trust_remote_code=True),  # noqa: E501
 }
 
 MULTIMODAL_MODEL_SETTINGS = {
@@ -192,40 +194,35 @@ def iter_params(self, model_name: str):
 }
 # yapf: enable
 
-MODEL_SETTINGS = {
-    **GENERATION_MODEL_SETTINGS,
-    **EMBEDDING_MODEL_SETTINGS,
-    **MULTIMODAL_MODEL_SETTINGS,
-}
-
-# You can update this on your local machine to run specific tests
+# NOTE: You can update this on your local machine to run specific tests
 TEST_MODELS = [
+    # [LANGUAGE GENERATION]
     "meta-llama/Meta-Llama-3-8B",
-    "facebook/chameleon-7b",
+    "ibm/PowerLM-3b",
+    # [LANGUAGE EMBEDDING]
+    "intfloat/e5-mistral-7b-instruct",
+    "BAAI/bge-multilingual-gemma2",
+    # [MULTIMODAL GENERATION]
     "OpenGVLab/InternVL2-1B",
     "microsoft/Phi-3-vision-128k-instruct",
-    "mistralai/Pixtral-12B-2409",
     "fixie-ai/ultravox-v0_3",
 ]
 
 
-@pytest.mark.parametrize(
-    ("model_name", "parallel_setup", "distributed_backend",
-     "trust_remote_code", "tokenizer_mode"),
-    [
-        params for model_name, settings in MODEL_SETTINGS.items()
-        for params in settings.iter_params(model_name)
-        if model_name in TEST_MODELS
-    ],
-)
-@fork_new_process_for_each_test
-def test_compare_tp(model_name: str, parallel_setup: ParallelSetup,
-                    distributed_backend: str, trust_remote_code: bool,
-                    tokenizer_mode: Optional[str], num_gpus_available):
+def _compare_tp(
+    model_name: str,
+    parallel_setup: ParallelSetup,
+    distributed_backend: str,
+    trust_remote_code: bool,
+    tokenizer_mode: Optional[str],
+    num_gpus_available: int,
+    *,
+    method: Literal["generate", "encode"] = "encode",
+):
     tp_size, pp_size, eager_mode, chunked_prefill = parallel_setup
 
-    if num_gpus_available < tp_size:
-        pytest.skip(f"Need at least {tp_size} GPUs to run the test")
+    if num_gpus_available < tp_size * pp_size:
+        pytest.skip(f"Need at least {tp_size} x {pp_size} GPUs")
     if VLLM_MULTI_NODE and distributed_backend == "mp":
         pytest.skip("Skipping multi-node pipeline parallel test for "
                     "multiprocessing distributed backend")
@@ -286,10 +283,95 @@ def test_compare_tp(model_name: str, parallel_setup: ParallelSetup,
     ]
 
     try:
-        compare_two_settings(model_name, pp_args, tp_args, pp_env)
+        compare_two_settings(model_name,
+                             pp_args,
+                             tp_args,
+                             pp_env,
+                             method=method)
     except Exception:
         if pp_env is None:
             raise
         else:
             # Ray ADAG tests are flaky, so we don't want to fail the test
             logger.exception("Ray ADAG tests failed")
+
+
+@pytest.mark.parametrize(
+    ("model_name", "parallel_setup", "distributed_backend",
+     "trust_remote_code", "tokenizer_mode"),
+    [
+        params for model_name, settings in GENERATION_MODEL_SETTINGS.items()
+        for params in settings.iter_params(model_name)
+        if model_name in TEST_MODELS
+    ],
+)
+@fork_new_process_for_each_test
+def test_tp_language_generation(
+    model_name: str,
+    parallel_setup: ParallelSetup,
+    distributed_backend: str,
+    trust_remote_code: bool,
+    tokenizer_mode: Optional[str],
+    num_gpus_available,
+):
+    _compare_tp(model_name,
+                parallel_setup,
+                distributed_backend,
+                trust_remote_code,
+                tokenizer_mode,
+                num_gpus_available,
+                method="generate")
+
+
+@pytest.mark.parametrize(
+    ("model_name", "parallel_setup", "distributed_backend",
+     "trust_remote_code", "tokenizer_mode"),
+    [
+        params for model_name, settings in EMBEDDING_MODEL_SETTINGS.items()
+        for params in settings.iter_params(model_name)
+        if model_name in TEST_MODELS
+    ],
+)
+@fork_new_process_for_each_test
+def test_tp_language_embedding(
+    model_name: str,
+    parallel_setup: ParallelSetup,
+    distributed_backend: str,
+    trust_remote_code: bool,
+    tokenizer_mode: Optional[str],
+    num_gpus_available,
+):
+    _compare_tp(model_name,
+                parallel_setup,
+                distributed_backend,
+                trust_remote_code,
+                tokenizer_mode,
+                num_gpus_available,
+                method="encode")
+
+
+@pytest.mark.parametrize(
+    ("model_name", "parallel_setup", "distributed_backend",
+     "trust_remote_code", "tokenizer_mode"),
+    [
+        params for model_name, settings in MULTIMODAL_MODEL_SETTINGS.items()
+        for params in settings.iter_params(model_name)
+        if model_name in TEST_MODELS
+    ],
+)
+@fork_new_process_for_each_test
+def test_tp_multimodal_generation(
+    model_name: str,
+    parallel_setup: ParallelSetup,
+    distributed_backend: str,
+    trust_remote_code: bool,
+    tokenizer_mode: Optional[str],
+    num_gpus_available,
+):
+    _compare_tp(model_name,
+                parallel_setup,
+                distributed_backend,
+                trust_remote_code,
+                tokenizer_mode,
+                num_gpus_available,
+                method="generate")
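
To exercise just one of the new test groups locally (for example the embedding comparison), an invocation along the following lines should work; the -k expression is an illustrative assumption, and you need enough visible GPUs for the selected tp_base x pp_base settings noted in the diff.

# Illustrative local run of only the embedding PP tests; not part of this commit.
# Assumes a working vLLM install and enough GPUs for the chosen settings.
import sys

import pytest

sys.exit(pytest.main([
    "tests/distributed/test_pipeline_parallel.py",
    "-k", "test_tp_language_embedding",
    "-x",  # stop at the first failure
]))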
