[V1] Structured Outputs + Thinking compatibility #16577

Merged

Changes from all commits (52 commits)
7f9b174
chore: migrate tokenizer init to manager only
aarnphm Apr 14, 2025
023807d
chore: init reasoning_parser on manager
aarnphm Apr 14, 2025
92527f6
feat: support parsing thinking tokens
aarnphm Apr 14, 2025
e50ea40
chore: add a check to make sure that the reasoning token is not being…
aarnphm Apr 14, 2025
7542582
chore: update docs
aarnphm Apr 14, 2025
fa6da3f
chore: move reasoning_ended to so_request
aarnphm Apr 17, 2025
061ee09
chore: reduce diff
aarnphm Apr 17, 2025
5eecdbb
chore: move up checker logics
aarnphm Apr 17, 2025
873b08b
chore: update correct function imports
aarnphm Apr 26, 2025
218ad9c
chore: remove incorrect function
aarnphm Apr 26, 2025
1ec8928
fix: make sure to reset the bitmask before update
aarnphm Apr 26, 2025
9b6f4e8
chore: make sure non reasoning case works
aarnphm Apr 26, 2025
63eecbf
fix: remove unused check
aarnphm Apr 26, 2025
ea1487f
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm Apr 29, 2025
910ee0c
chore: fix pre-comimt
aarnphm Apr 29, 2025
e220eac
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm Apr 29, 2025
327a0d0
revert: bad merge and remove inlines
aarnphm Apr 29, 2025
6d26942
fix: make sure to initialize DecodingConfig by default, and fix types
aarnphm Apr 29, 2025
7635f17
merge: with upstream and add compatibility with thinking cases
aarnphm Apr 30, 2025
5a708aa
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm Apr 30, 2025
fcfef12
--wip--
aarnphm Apr 30, 2025
1e828bd
chore: move logic to manager
aarnphm Apr 30, 2025
ce1fddc
chore: update notes
aarnphm Apr 30, 2025
c211110
fix: make sure works with both thinking, spec and struct matrixes
aarnphm May 1, 2025
97d1d4e
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm May 1, 2025
b89662a
chore: cleanup logics
aarnphm May 1, 2025
8627691
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm May 1, 2025
27817a0
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm May 5, 2025
591da8e
fix: update to newer logics
aarnphm May 5, 2025
c41e80c
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm May 7, 2025
5cf804d
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm May 8, 2025
a807bee
chore: revert whitespace changes
aarnphm May 9, 2025
6c2b9df
fix(tests): ignore runaway properties
aarnphm May 9, 2025
fb92d8a
fix: broken tests
aarnphm May 9, 2025
174e7e8
Update tests/v1/entrypoints/llm/test_struct_output_generate.py
aarnphm May 9, 2025
42671cf
revert: update noqa changes
aarnphm May 9, 2025
9c364d0
chore: add a notes about bitmask reset
aarnphm May 9, 2025
ffd3fa1
fix: initialize default decoding_config
aarnphm May 13, 2025
b64f5f5
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm May 13, 2025
ddc9c47
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm May 13, 2025
edd235b
chore(test): use deepseek_r1 parser for qwen3
aarnphm May 13, 2025
3cbbd8c
chore: separate out reasoning tests
aarnphm May 13, 2025
a559b72
fix: reasoning tests to parse it
aarnphm May 13, 2025
1f3c369
chore: replicate duplicate thinking budget
aarnphm May 13, 2025
d5574be
revert: remove duplications
aarnphm May 13, 2025
59f2aa7
chore: reorder test logs
aarnphm May 13, 2025
ded3890
chore: keep main change to reduce diff
aarnphm May 13, 2025
0fb92a5
fix: use deepseek_r1 parser for tests
aarnphm May 13, 2025
7ace2cb
chore: use a slightly larger models for smarter cot
aarnphm May 13, 2025
1816b3b
fix: support for qwen3 prompts
aarnphm May 13, 2025
91058ba
merge: branch 'main' of github.com:vllm-project/vllm into feat/suppor…
aarnphm May 14, 2025
d96fa45
chore: make it more clear
aarnphm May 14, 2025
4 changes: 2 additions & 2 deletions docs/source/features/reasoning_outputs.md
@@ -141,10 +141,10 @@ Remember to check whether the `reasoning_content` exists in the response before
The reasoning content is also available in the structured output. The structured output engine like `xgrammar` will use the reasoning content to generate structured output. It is only supported in v0 engine now.

```bash
VLLM_USE_V1=0 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
```

Please note that the `VLLM_USE_V1` environment variable must be set to `0` to use the v0 engine.
The following is an example client:

```python
from openai import OpenAI
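# --- Illustrative sketch only: the rest of the original example client is
# collapsed in this diff view. This continuation assumes a server started with
# the `vllm serve ... --reasoning-parser deepseek_r1` command above and uses
# the `guided_json` extra-body field to request structured output.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"result": {"type": "integer"}},
    "required": ["result"],
}

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{
        "role": "user",
        "content": "What is 5 * 8 + 2? Answer as a JSON object with key 'result'.",
    }],
    extra_body={"guided_json": schema},
)

message = completion.choices[0].message
# With a reasoning parser enabled, the thinking text is returned separately
# from the schema-constrained answer.
print("reasoning_content:", getattr(message, "reasoning_content", None))
print("content:", message.content)
```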
96 changes: 92 additions & 4 deletions tests/v1/entrypoints/llm/test_struct_output_generate.py
@@ -1,21 +1,27 @@
# ruff: noqa: E501
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

import json
import re
from enum import Enum
from typing import Any
from typing import TYPE_CHECKING, Any

import jsonschema
import pytest
from pydantic import BaseModel

from tests.reasoning.utils import run_reasoning_extraction
from vllm.entrypoints.llm import LLM
from vllm.outputs import RequestOutput
from vllm.platforms import current_platform
from vllm.reasoning.abs_reasoning_parsers import ReasoningParserManager
from vllm.sampling_params import GuidedDecodingParams, SamplingParams

if TYPE_CHECKING:
from vllm.config import TokenizerMode

NGRAM_SPEC_CONFIG = {
"model": "[ngram]",
"num_speculative_tokens": 5,
@@ -444,7 +450,7 @@ def test_structured_output(

prompt = """
You have access to the following function to retrieve the weather in a city:

{
"name": "get_weather",
"parameters": {
@@ -455,7 +461,7 @@ }
}
}
}

If a you choose to call a function ONLY reply in the following format:
<{start_tag}={function_name}>{parameters}{end_tag}
where
@@ -476,7 +482,7 @@ def test_structured_output(
- Always add your sources when using search results to answer the user query

You are a helpful assistant.

Given the previous instructions, what is the weather in New York City? \
Make the response as short as possible.
"""
@@ -514,6 +520,88 @@ def test_structured_output(
f"{generated_text!r}\nError: {str(e)}")


@pytest.mark.skip_global_cleanup
@pytest.mark.parametrize(
"model_name, guided_decoding_backend, tokenizer_mode, reasoning_parser, speculative_config", # noqa: E501
[
("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", "xgrammar", "auto",
"deepseek_r1", NGRAM_SPEC_CONFIG),
("Qwen/Qwen3-1.7B", "xgrammar", "auto", "deepseek_r1", None),
],
)
def test_structured_output_with_reasoning_matrices(
monkeypatch: pytest.MonkeyPatch,
guided_decoding_backend: str,
tokenizer_mode: TokenizerMode,
reasoning_parser: str,
model_name: str,
speculative_config: dict[str, Any] | None,
):
monkeypatch.setenv("VLLM_USE_V1", "1")

if current_platform.is_tpu() and speculative_config:
pytest.skip("TPU does not support speculative decoding")

# Use a single LLM instance for several scenarios to
# speed up the test suite.
llm = LLM(
model=model_name,
# Don't use eager execution on TPUs because we want to test for no
# recompilation at runtime
enforce_eager=bool(not current_platform.is_tpu()),
max_model_len=1024,
max_num_seqs=16,
guided_decoding_backend=guided_decoding_backend,
guided_decoding_disable_any_whitespace=True,
tokenizer_mode=tokenizer_mode,
reasoning_parser=reasoning_parser,
speculative_config=speculative_config,
)
tokenizer = llm.get_tokenizer(None)
reasoner = ReasoningParserManager.get_reasoning_parser(reasoning_parser)(
tokenizer=tokenizer)

reasoning_prompt = "Solve the following math problem step-by-step, then provide the final answer as JSON object with a single key 'result'. Make sure to correct your reasoning if there are any issue should it arise.\nProblem: What is 5 * 8 + 2?" # noqa: E501
reasoning_schema = {
"type": "object",
"properties": {
"result": {
"type": "integer"
}
},
"required": ["result"],
"additionalProperties": False
}
if "Qwen3" in model_name:
reasoning_prompt += "<think>\n"

sampling_params = SamplingParams(
temperature=0.1,
max_tokens=8192,
guided_decoding=GuidedDecodingParams(json=reasoning_schema),
)
outputs = llm.generate(
[reasoning_prompt],
sampling_params=sampling_params,
use_tqdm=True,
)

assert outputs is not None
output = outputs[0]
assert output is not None and isinstance(output, RequestOutput)
prompt = output.prompt
generated_text = output.outputs[0].text
reasoning_content, content = run_reasoning_extraction(
reasoner, [generated_text])
print(
f"Prompt: {prompt!r}\nReasoning: {reasoning_content!r}\nContent: {content!r}"
)

assert content is not None and reasoning_content is not None
output_json = json.loads(content)
jsonschema.validate(instance=output_json, schema=reasoning_schema)


@pytest.mark.skip_global_cleanup
@pytest.mark.parametrize("model_name, tokenizer_mode",
PARAMS_MODELS_TOKENIZER_MODE)
4 changes: 2 additions & 2 deletions vllm/config.py
@@ -2325,7 +2325,7 @@ class SpeculativeConfig:
`TypicalAcceptanceSampler`."""

speculative_token_tree: Optional[str] = None
"""Specifies the tree structure for speculative token generation.
"""Specifies the tree structure for speculative token generation.
"""
# required configuration params passed from engine
target_model_config: ModelConfig = field(default=None,
@@ -4017,7 +4017,7 @@ class VllmConfig:
"""LoRA configuration."""
speculative_config: Optional[SpeculativeConfig] = None
"""Speculative decoding configuration."""
decoding_config: Optional[DecodingConfig] = None
decoding_config: DecodingConfig = field(default_factory=DecodingConfig)
"""Decoding configuration."""
observability_config: Optional[ObservabilityConfig] = None
"""Observability configuration."""
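For readers unfamiliar with the pattern, switching `decoding_config` from `Optional[DecodingConfig] = None` to `field(default_factory=DecodingConfig)` means downstream code can rely on `vllm_config.decoding_config` always being an instance rather than `None`. A minimal, self-contained sketch of the idiom (field names simplified for illustration, not vLLM's real config classes):

```python
from dataclasses import dataclass, field

@dataclass
class DecodingConfig:
    # Simplified stand-in for vLLM's real DecodingConfig.
    guided_decoding_backend: str = "auto"

@dataclass
class VllmConfig:
    # default_factory builds a fresh DecodingConfig per VllmConfig, so the
    # field is never None and instances are not shared between configs.
    decoding_config: DecodingConfig = field(default_factory=DecodingConfig)

cfg = VllmConfig()
assert cfg.decoding_config is not VllmConfig().decoding_config  # not shared
print(cfg.decoding_config.guided_decoding_backend)
```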
6 changes: 4 additions & 2 deletions vllm/reasoning/abs_reasoning_parsers.py
@@ -1,5 +1,7 @@
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

import os
from abc import abstractmethod
from collections.abc import Sequence
@@ -33,7 +35,7 @@ def vocab(self) -> dict[str, int]:
return self.model_tokenizer.get_vocab()

@abstractmethod
def is_reasoning_end(self, input_ids: list[int]) -> bool:
def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
"""
Check if the reasoning content ends in the input_ids.

@@ -106,7 +108,7 @@ class ReasoningParserManager:
reasoning_parsers: dict[str, type] = {}

@classmethod
def get_reasoning_parser(cls, name) -> type:
def get_reasoning_parser(cls, name: str | None) -> type[ReasoningParser]:
"""
Get reasoning parser by name which is registered by `register_module`.

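For context, `is_reasoning_end` is the hook the structured-output code uses to decide when the thinking section is over; widening its signature to `Sequence[int]` lets callers pass read-only token-id views as well as lists. A minimal sketch of a concrete parser is shown below; it assumes a `</think>` end marker (as DeepSeek-R1-style parsers use) and is illustrative rather than vLLM's actual implementation:

```python
from collections.abc import Sequence


class SimpleThinkReasoningParser:
    """Illustrative parser: reasoning ends once '</think>' has been emitted."""

    def __init__(self, tokenizer):
        # Assumes a Hugging Face-style tokenizer with get_vocab().
        self.model_tokenizer = tokenizer
        end_id = self.model_tokenizer.get_vocab().get("</think>")
        if end_id is None:
            raise RuntimeError("tokenizer has no '</think>' token")
        self.think_end_token_id = end_id

    def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
        # Any Sequence[int] works here, e.g. a tuple or a read-only view of
        # the generated token ids, not just a list.
        return self.think_end_token_id in input_ids
```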
8 changes: 4 additions & 4 deletions vllm/v1/core/sched/scheduler.py
@@ -749,7 +749,8 @@ def update_from_output(
# the outer lists can be of length > 1.
new_logprobs = logprobs.slice(req_index, req_index + 1)

if new_token_ids and request.use_structured_output:
if new_token_ids and self.structured_output_manager.should_advance(
request):
# NOTE: structured_output_request
# should not be None if use_structured_output, we have
# check above, so safe to ignore type warning
@@ -758,11 +759,10 @@

# Add newly generated spec token ids to the request.
if spec_token_ids is not None:
if request.use_structured_output:
if self.structured_output_manager.should_advance(request):
metadata = request.structured_output_request
assert metadata is not None and metadata.grammar is not None
# Needs to happen after new_token_ids are accepted.
request.spec_token_ids = metadata.grammar.validate_tokens(
request.spec_token_ids = metadata.grammar.validate_tokens( # type: ignore[union-attr]
spec_token_ids[req_index])
else:
request.spec_token_ids = spec_token_ids[req_index]
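The scheduler now asks the structured-output manager whether to advance grammar state instead of checking `request.use_structured_output` directly, so thinking tokens are not forced through the grammar. A rough sketch of the decision `should_advance` is expected to make, with assumed attribute names for illustration rather than vLLM's exact internals:

```python
def should_advance(manager, request) -> bool:
    """Sketch of the per-request check the structured-output manager performs.

    `manager.reasoner` is an optional reasoning parser; `request` attribute
    names here are assumptions for illustration.
    """
    # No structured output requested: nothing to advance.
    if not request.use_structured_output:
        return False

    so_request = request.structured_output_request
    if manager.reasoner is not None and not so_request.reasoning_ended:
        # Still inside the thinking section: watch for the end-of-reasoning
        # marker, but do not constrain or advance the grammar yet.
        so_request.reasoning_ended = manager.reasoner.is_reasoning_end(
            request.all_token_ids)
        return False

    # Reasoning is over (or no reasoning parser is configured): grammar state
    # should track newly accepted tokens, including validated spec tokens.
    return True
```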