Commit 1c6c748

Merge remote-tracking branch 'upstream/main' into kuntai-fix-a100-perf
2 parents 2ed6ffc + 22481fb commit 1c6c748

30 files changed (+466, −322 lines)

Lines changed: 56 additions & 0 deletions

@@ -0,0 +1,56 @@
(deployment-dify)=

# Dify

[Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.

It supports vLLM as a model provider to efficiently serve large language models.

This guide walks you through deploying Dify using a vLLM backend.

## Prerequisites

- Set up the vLLM environment
- Install [Docker](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/)

## Deploy

- Start the vLLM server with a supported chat completion model, e.g.

```console
vllm serve Qwen/Qwen1.5-7B-Chat
```

- Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):

```console
git clone https://github.com/langgenius/dify.git
cd dify
cd docker
cp .env.example .env
docker compose up -d
```

- Open your browser at `http://localhost/install`, configure the basic login information, and log in.

- In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it.

- Fill in the model provider details as follows:
  - **Model Type**: `LLM`
  - **Model Name**: `Qwen/Qwen1.5-7B-Chat`
  - **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1`
  - **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
  - **Completion Mode**: `Completion`

:::{image} /assets/deployment/dify-settings.png
:::

- To create a test chatbot, go to `Studio → Chatbot → Create from Blank`, then select Chatbot as the type:

:::{image} /assets/deployment/dify-create-chatbot.png
:::

- Click the chatbot you just created to open the chat interface and start interacting with the model:

:::{image} /assets/deployment/dify-chat.png
:::
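
Before pointing Dify at the endpoint configured above, it can help to confirm that the vLLM server answers OpenAI-compatible chat requests. Below is a minimal sketch, not part of the committed guide, assuming the server started with `vllm serve Qwen/Qwen1.5-7B-Chat` is reachable at `http://localhost:8000` (the host, port, and placeholder API key are assumptions):

```python
# Minimal connectivity check against the vLLM OpenAI-compatible API.
# Assumes `vllm serve Qwen/Qwen1.5-7B-Chat` is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

If this prints a reply, the same `/v1` base URL can be entered as the API Endpoint URL in the Dify model provider form.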

docs/source/deployment/frameworks/index.md

Lines changed: 1 addition & 0 deletions

@@ -7,6 +7,7 @@ anything-llm
 bentoml
 cerebrium
 chatbox
+dify
 dstack
 helm
 lws

docs/source/design/v1/prefix_caching.md

Lines changed: 1 addition & 1 deletion

@@ -86,7 +86,7 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache
     {"role": "user", "content": "Here is a document with details about the world series: ..."},
     {"role": "user", "content": "Who won the world series in 2020?"}
   ],
-  "cache_salt": "Z3V2bmV3aGxza3ZubGFoZ3Zud3V3ZWZ2bmd0b3V2bnZmc2xpZ3RoZ2x2aQ=="
+  "cache_salt": "your-cache-salt"
 }
 ```
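
The `cache_salt` field shown in this hunk is sent as an extra field in the chat completion request body. A minimal sketch of one way to pass it from the OpenAI Python client, with the server address, model name, and salt value as illustrative assumptions:

```python
# Sketch: send a chat request with a per-tenant cache_salt so its prefix
# cache entries are isolated from other callers. Values are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    messages=[
        {"role": "user", "content": "Here is a document with details about the world series: ..."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    # Extra, non-standard request fields are forwarded in the JSON body.
    extra_body={"cache_salt": "your-cache-salt"},
)
print(response.choices[0].message.content)
```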

docs/source/features/reasoning_outputs.md

Lines changed: 8 additions & 6 deletions

@@ -17,7 +17,9 @@ vLLM currently supports the following reasoning models:
 | [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` |||
 | [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` ||
 
-- IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
+:::{note}
+IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
+:::
 
 ## Quickstart
 
@@ -83,7 +85,7 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
 }
 ```
 
-OpenAI Python client library does not officially support `reasoning_content` attribute for streaming output. But the client support extra attributes in the response. You can use `hasattr` to check if the `reasoning_content` attribute is present in the response. For example:
+OpenAI Python client library does not officially support `reasoning_content` attribute for streaming output. But the client supports extra attributes in the response. You can use `hasattr` to check if the `reasoning_content` attribute is present in the response. For example:
 
 ```python
 from openai import OpenAI
@@ -221,15 +223,15 @@ print(f"Function called: {tool_call.name}")
 print(f"Arguments: {tool_call.arguments}")
 ```
 
-For more examples, please refer to <gh-file:examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py> .
+For more examples, please refer to <gh-file:examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py>.
 
 ## Limitations
 
 - The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
 
 ## How to support a new reasoning model
 
-You can add a new `ReasoningParser` similar to `vllm/entrypoints/openai/reasoning_parsers/deepseek_r1_reasoning_parser.py`.
+You can add a new `ReasoningParser` similar to <gh-file:vllm/entrypoints/openai/reasoning_parsers/deepseek_r1_reasoning_parser.py>.
 
 ```python
 # import the required packages
@@ -286,7 +288,7 @@ class ExampleParser(ReasoningParser):
 """
 ```
 
-Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in `vllm/model_executor/guided_decoding/reasoner/deepseek_reasoner.py`.
+Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/model_executor/guided_decoding/reasoner/deepseek_reasoner.py>.
 
 ```python
 @dataclass
@@ -312,7 +314,7 @@ class DeepSeekReasoner(Reasoner):
 ...
 ```
 
-The structured output engine like `xgrammar` will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case.
+The structured output engine like [xgrammar](https://github.com/mlc-ai/xgrammar) will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case.
 
 Finally, you can enable reasoning for the model by using the `--reasoning-parser` flags.
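
The doc's streaming example is only partially visible in this diff (the `from openai import OpenAI` context line). A minimal sketch of the `hasattr` pattern the changed paragraph describes, with the server address and model name as assumptions rather than values from the commit:

```python
# Sketch of the hasattr() check for streamed reasoning output.
# Assumes a reasoning model is served with a matching --reasoning-parser.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    # reasoning_content is an extra attribute, so guard with hasattr().
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        print(delta.reasoning_content, end="", flush=True)
    elif delta.content:
        print(delta.content, end="", flush=True)
```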

examples/online_serving/ray_serve_deepseek.py

Lines changed: 3 additions & 1 deletion

@@ -1,7 +1,9 @@
 # SPDX-License-Identifier: Apache-2.0
 """
 Example to deploy DeepSeek R1 or V3 with Ray Serve LLM.
-See Ray Serve LLM documentation at:
+See more details at:
+https://docs.ray.io/en/latest/serve/tutorials/serve-deepseek.html
+And see Ray Serve LLM documentation at:
 https://docs.ray.io/en/latest/serve/llm/serving-llms.html
 
 Run `python3 ray_serve_deepseek.py` to deploy the model.

tests/kernels/moe/test_moe.py

Lines changed: 1 addition & 0 deletions

@@ -286,6 +286,7 @@ def test_mixtral_moe(dtype: torch.dtype, padding: bool, use_rocm_aiter: bool,
                                atol=mixtral_moe_tol[dtype])
 
 
+@pytest.mark.flaky(reruns=2)
 @pytest.mark.parametrize("m", [1, 123, 666])
 @pytest.mark.parametrize("n", [128, 1024])
 @pytest.mark.parametrize("k", [256, 2048])

tests/samplers/test_sampler.py

Lines changed: 1 addition & 1 deletion

@@ -478,7 +478,7 @@ def test_sampler_mixed(seed: int, device: str):
     sampling_params = SamplingParams(
         temperature=random.random() + 0.1,
         top_p=min(random.random() + 0.1, 1),
-        top_k=random.randint(0, 10) or -1,
+        top_k=random.randint(0, 10),
         n=n,
         presence_penalty=random.randint(0, 1),
     )

tests/tensorizer_loader/conftest.py

Lines changed: 0 additions & 33 deletions

@@ -1,12 +1,5 @@
 # SPDX-License-Identifier: Apache-2.0
-
-import functools
-import gc
-from typing import Callable, TypeVar
-
 import pytest
-import torch
-from typing_extensions import ParamSpec
 
 from vllm.distributed import cleanup_dist_env_and_memory
 from vllm.model_executor.model_loader.tensorizer import TensorizerConfig
@@ -25,32 +18,6 @@ def cleanup():
     cleanup_dist_env_and_memory(shutdown_ray=True)
 
 
-_P = ParamSpec("_P")
-_R = TypeVar("_R")
-
-
-def retry_until_skip(n: int):
-
-    def decorator_retry(func: Callable[_P, _R]) -> Callable[_P, _R]:
-
-        @functools.wraps(func)
-        def wrapper_retry(*args: _P.args, **kwargs: _P.kwargs) -> _R:
-            for i in range(n):
-                try:
-                    return func(*args, **kwargs)
-                except AssertionError:
-                    gc.collect()
-                    torch.cuda.empty_cache()
-                    if i == n - 1:
-                        pytest.skip(f"Skipping test after {n} attempts.")
-
-            raise AssertionError("Code should not be reached")
-
-        return wrapper_retry
-
-    return decorator_retry
-
-
 @pytest.fixture(autouse=True)
 def tensorizer_config():
     config = TensorizerConfig(tensorizer_uri="vllm")

tests/tensorizer_loader/test_tensorizer.py

Lines changed: 1 addition & 2 deletions

@@ -28,7 +28,6 @@
 from vllm.utils import PlaceholderModule, import_from_path
 
 from ..utils import VLLM_PATH, RemoteOpenAIServer
-from .conftest import retry_until_skip
 
 try:
     from tensorizer import EncryptionParams
@@ -325,7 +324,7 @@ def test_deserialized_encrypted_vllm_model_with_tp_has_same_outputs(
     assert outputs == deserialized_outputs
 
 
-@retry_until_skip(3)
+@pytest.mark.flaky(reruns=3)
 def test_vllm_tensorized_model_has_same_outputs(vllm_runner, tmp_path):
     gc.collect()
     torch.cuda.empty_cache()
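
The two test changes above replace the hand-rolled `retry_until_skip` decorator (removed from `conftest.py`) with the `pytest.mark.flaky` marker from the pytest-rerunfailures plugin. A minimal, self-contained sketch of that pattern; the test body is hypothetical and only illustrates the marker:

```python
# Sketch of the pytest-rerunfailures pattern adopted in this commit.
# Requires the pytest-rerunfailures plugin; the test body is hypothetical.
import random

import pytest


@pytest.mark.flaky(reruns=3)
def test_occasionally_flaky():
    # The marker re-runs a failing test up to `reruns` extra times before
    # reporting it as failed, replacing the manual retry/skip loop.
    assert random.random() < 0.9
```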

tests/v1/core/test_kv_cache_utils.py

Lines changed: 3 additions & 3 deletions

@@ -539,7 +539,7 @@ def test_allocate_with_lookahead():
                                         max_model_len=100)
     blocks = kv_cache_manager.allocate_slots(
         request,
-        num_tokens=3,
+        num_new_tokens=3,
         num_lookahead_tokens=2,  # Total required: 3+2=5 tokens
     )
     assert len(blocks.blocks) == 2  # ceil(5/4)=2 blocks
@@ -550,7 +550,7 @@ def test_allocate_with_lookahead():
     # required_blocks = ceil((3 + 2) /4) = 2
     blocks = kv_cache_manager.allocate_slots(
         request,
-        num_tokens=3,
+        num_new_tokens=3,
        num_lookahead_tokens=2,
     )
     assert len(blocks.blocks) == 2
@@ -561,7 +561,7 @@ def test_allocate_with_lookahead():
                                         max_model_len=100)
     blocks = kv_cache_manager.allocate_slots(
         request,
-        num_tokens=3,
+        num_new_tokens=3,
         num_lookahead_tokens=4,
     )
     assert len(blocks.blocks) == 2

tests/v1/core/test_prefix_caching.py

Lines changed: 8 additions & 5 deletions

@@ -299,7 +299,8 @@ def test_decode():
     req0.append_output_token_ids(8)
     new_blocks = manager.allocate_slots(req0, 4)
     assert new_blocks is not None and len(new_blocks.blocks) == 0
-    assert manager.req_to_blocks[req0.request_id][-1].block_hash is None
+    assert manager.single_type_manager.req_to_blocks[
+        req0.request_id][-1].block_hash is None
 
     # Append slots with allocating a new block.
     req0.num_computed_tokens = 59
@@ -309,8 +310,10 @@ def test_decode():
     req0.append_output_token_ids(7)
     new_blocks = manager.allocate_slots(req0, 19)
     assert new_blocks is not None and len(new_blocks.blocks) == 1
-    assert manager.req_to_blocks[req0.request_id][-2].block_hash is not None
-    assert manager.req_to_blocks[req0.request_id][-1].block_hash is None
+    assert manager.single_type_manager.req_to_blocks[
+        req0.request_id][-2].block_hash is not None
+    assert manager.single_type_manager.req_to_blocks[
+        req0.request_id][-1].block_hash is None
 
 
 def test_evict():
@@ -689,15 +692,15 @@ def test_prefill_not_enough_free_blocks_with_computed_blocks():
     assert not computed_blocks.blocks
     assert num_computed_tokens == 0
     manager.allocate_slots(req0, 48, computed_blocks)
-    block_part0 = manager.req_to_blocks[req0.request_id]
+    block_part0 = manager.single_type_manager.req_to_blocks[req0.request_id]
 
     # | Common-0 | Common-1 | Common-2 | Req1-3 | Req1-4 | Req1-5 | ... |
     req1 = make_request("1", common_token_ids * 2)
     computed_blocks, num_computed_tokens = manager.get_computed_blocks(req1)
     assert computed_blocks.blocks == block_part0
     assert num_computed_tokens == 3 * 16
     manager.allocate_slots(req1, 48, computed_blocks)
-    block_part1 = manager.req_to_blocks[req1.request_id]
+    block_part1 = manager.single_type_manager.req_to_blocks[req1.request_id]
     # | Common-0 | Common-1 | Common-2 | Req1-3 (F) | Req1-4 (F) |
     # | Req1-5(F)| ... |
     manager.free(req1)

tests/v1/core/test_scheduler.py

Lines changed: 8 additions & 5 deletions

@@ -812,10 +812,11 @@ def _assert_right_kv_cache_manager(
     # Make sure the request stats are right.
     EXPECTED_TOTAL_BLOCKS = num_tokens // block_size
     for req_id in req_ids:
-        blocks = scheduler.kv_cache_manager.req_to_blocks[req_id]
+        blocks = (scheduler.kv_cache_manager.single_type_manager.
+                  req_to_blocks[req_id])
         hashes = scheduler.kv_cache_manager.req_to_block_hashes[req_id]
-        assert (scheduler.kv_cache_manager.num_cached_block[req_id] ==
-                EXPECTED_TOTAL_BLOCKS)
+        assert (scheduler.kv_cache_manager.single_type_manager.
+                num_cached_block[req_id] == EXPECTED_TOTAL_BLOCKS)
         assert len(blocks) == EXPECTED_TOTAL_BLOCKS
         assert len(hashes) == EXPECTED_TOTAL_BLOCKS
 
@@ -1195,9 +1196,11 @@ def assert_scheduler_empty(scheduler: Scheduler):
     assert len(scheduler.encoder_cache_manager.cached) == 0
 
     # KVCache Manager.
-    assert len(scheduler.kv_cache_manager.req_to_blocks) == 0
+    assert len(
+        scheduler.kv_cache_manager.single_type_manager.req_to_blocks) == 0
     assert len(scheduler.kv_cache_manager.req_to_block_hashes) == 0
-    assert len(scheduler.kv_cache_manager.num_cached_block) == 0
+    assert len(
+        scheduler.kv_cache_manager.single_type_manager.num_cached_block) == 0
     num_free_blocks = (
         scheduler.kv_cache_manager.block_pool.free_block_queue.num_free_blocks)
     assert num_free_blocks == (
