
Commit 5bf35a9

[Doc][CI/Build] Update docs and tests to use vllm serve (#6431)
1 parent a19e8d3 commit 5bf35a9

23 files changed (+155, -175 lines)

docs/source/getting_started/quickstart.rst

Lines changed: 2 additions & 5 deletions

@@ -73,16 +73,13 @@ Start the server:
 
 .. code-block:: console
 
-    $ python -m vllm.entrypoints.openai.api_server \
-    $ --model facebook/opt-125m
+    $ vllm serve facebook/opt-125m
 
 By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:
 
 .. code-block:: console
 
-    $ python -m vllm.entrypoints.openai.api_server \
-    $ --model facebook/opt-125m \
-    $ --chat-template ./examples/template_chatml.jinja
+    $ vllm serve facebook/opt-125m --chat-template ./examples/template_chatml.jinja
 
 This server can be queried in the same format as OpenAI API. For example, list the models:
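
For context on the new command: once `vllm serve facebook/opt-125m` is running, the quickstart's "list the models" step can be exercised with the official OpenAI Python client. A minimal sketch, assuming the server is on its default address `http://localhost:8000` and no `--api-key` was set (the key string is then arbitrary):

```python
# Minimal sketch: list models from a locally running `vllm serve facebook/opt-125m`.
# Assumptions: default host/port (http://localhost:8000) and no --api-key configured,
# so any placeholder key is accepted.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for model in client.models.list().data:
    print(model.id)  # expected to include "facebook/opt-125m"
```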

docs/source/models/adding_model.rst

Lines changed: 2 additions & 2 deletions

@@ -114,7 +114,7 @@ Just add the following lines in your code:
     from your_code import YourModelForCausalLM
     ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
 
-If you are running api server with `python -m vllm.entrypoints.openai.api_server args`, you can wrap the entrypoint with the following code:
+If you are running api server with :code:`vllm serve <args>`, you can wrap the entrypoint with the following code:
 
 .. code-block:: python
 
@@ -124,4 +124,4 @@ If you are running api server with `python -m vllm.entrypoints.openai.api_server
     import runpy
     runpy.run_module('vllm.entrypoints.openai.api_server', run_name='__main__')
 
-Save the above code in a file and run it with `python your_file.py args`.
+Save the above code in a file and run it with :code:`python your_file.py <args>`.
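
Putting the two snippets from this file together, here is a hedged sketch of the wrapper script the doc describes. `your_code` and `YourModelForCausalLM` are the doc's placeholders, not a real module:

```python
# your_file.py -- sketch of the wrapper described above.
# `your_code` / `YourModelForCausalLM` are placeholders from the docs, not real imports.
import runpy

from vllm import ModelRegistry
from your_code import YourModelForCausalLM

# Register the out-of-tree model before the server parses its CLI args.
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)

# Re-run the OpenAI-compatible server entrypoint as __main__ so that
# `python your_file.py <args>` starts the server with the model registered.
runpy.run_module('vllm.entrypoints.openai.api_server', run_name='__main__')
```

Run as `python your_file.py <args>`, which mirrors what the doc text describes.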

docs/source/models/engine_args.rst

Lines changed: 2 additions & 2 deletions

@@ -8,7 +8,7 @@ Below, you can find an explanation of every engine argument for vLLM:
 .. argparse::
     :module: vllm.engine.arg_utils
     :func: _engine_args_parser
-    :prog: -m vllm.entrypoints.openai.api_server
+    :prog: vllm serve
     :nodefaultconst:
 
 Async Engine Arguments
@@ -19,5 +19,5 @@ Below are the additional arguments related to the asynchronous engine:
 .. argparse::
     :module: vllm.engine.arg_utils
     :func: _async_engine_args_parser
-    :prog: -m vllm.entrypoints.openai.api_server
+    :prog: vllm serve
     :nodefaultconst:

docs/source/models/lora.rst

Lines changed: 1 addition & 2 deletions

@@ -61,8 +61,7 @@ LoRA adapted models can also be served with the Open-AI compatible vLLM server.
 
 .. code-block:: bash
 
-    python -m vllm.entrypoints.openai.api_server \
-    --model meta-llama/Llama-2-7b-hf \
+    vllm serve meta-llama/Llama-2-7b-hf \
     --enable-lora \
     --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
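
As a usage note for the command above: the adapter registered via `--lora-modules sql-lora=...` is exposed through the OpenAI API under the name `sql-lora`. A hedged sketch of a request against it, assuming the default local address, no API key, and an illustrative prompt:

```python
# Sketch: query the LoRA adapter by the name given to --lora-modules.
# Assumptions: server at http://localhost:8000, no API key, placeholder prompt.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="sql-lora",               # adapter name, not the base model
    prompt="SELECT count(*) FROM",  # illustrative prompt only
    max_tokens=32,
)
print(completion.choices[0].text)
```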

docs/source/models/vlm.rst

Lines changed: 1 addition & 3 deletions

@@ -94,9 +94,7 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with
 
 .. code-block:: bash
 
-    python -m vllm.entrypoints.openai.api_server \
-    --model llava-hf/llava-1.5-7b-hf \
-    --chat-template template_llava.jinja
+    vllm serve llava-hf/llava-1.5-7b-hf --chat-template template_llava.jinja
 
 .. important::
     We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
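
For reference, the server launched with the command above accepts OpenAI-style chat requests whose content includes an `image_url` part, as in `examples/openai_vision_api_client.py` further down this diff. A hedged sketch, assuming a local server, no API key, and an arbitrary example image URL:

```python
# Sketch: one multimodal chat request against the llava server started above.
# Assumptions: default local address, no API key; the image URL is an arbitrary example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/duck.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```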

docs/source/serving/deploying_with_dstack.rst

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7
     gpu: 24GB
     commands:
       - pip install vllm
-      - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
+      - vllm serve $MODEL --port 8000
     model:
       format: openai
       type: chat

docs/source/serving/distributed_serving.rst

Lines changed: 2 additions & 4 deletions

@@ -35,16 +35,14 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh
 
 .. code-block:: console
 
-    $ python -m vllm.entrypoints.openai.api_server \
-    $ --model facebook/opt-13b \
+    $ vllm serve facebook/opt-13b \
     $ --tensor-parallel-size 4
 
 You can also additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
 
 .. code-block:: console
 
-    $ python -m vllm.entrypoints.openai.api_server \
-    $ --model gpt2 \
+    $ vllm serve gpt2 \
     $ --tensor-parallel-size 4 \
     $ --pipeline-parallel-size 2 \
     $ --distributed-executor-backend ray
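
As a brief aside, the same tensor-parallel degree can be used for offline inference. A hedged sketch mirroring only the first command's `--tensor-parallel-size 4` flag (assumes four GPUs are visible to the process):

```python
# Sketch: offline inference with the same tensor-parallel degree as the
# `vllm serve facebook/opt-13b --tensor-parallel-size 4` command above.
# Assumption: 4 GPUs are visible to the process.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```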

docs/source/serving/openai_compatible_server.md

Lines changed: 3 additions & 5 deletions

@@ -4,7 +4,7 @@ vLLM provides an HTTP server that implements OpenAI's [Completions](https://plat
 
 You can start the server using Python, or using [Docker](deploying_with_docker.rst):
 ```bash
-python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
+vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
 ```
 
 To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
@@ -97,9 +97,7 @@ template, or the template in string form. Without a chat template, the server wi
 and all chat requests will error.
 
 ```bash
-python -m vllm.entrypoints.openai.api_server \
-  --model ... \
-  --chat-template ./path-to-chat-template.jinja
+vllm serve <model> --chat-template ./path-to-chat-template.jinja
 ```
 
 vLLM community provides a set of chat templates for popular models. You can find them in the examples
@@ -110,7 +108,7 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)
 ```{argparse}
 :module: vllm.entrypoints.openai.cli_args
 :func: create_parser_for_docs
-:prog: -m vllm.entrypoints.openai.api_server
+:prog: vllm serve
 ```
 
 ## Tool calling in the chat completion API
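
To connect the two halves of this file's changes: the server started with `--api-key token-abc123` expects that token from clients. A hedged sketch using the official OpenAI client, assuming the default local address:

```python
# Sketch: call the server started above. The api_key must match --api-key.
# Assumption: default host/port http://localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```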

examples/api_client.py

Lines changed: 2 additions & 3 deletions

@@ -1,8 +1,7 @@
-"""Example Python client for vllm.entrypoints.api_server
+"""Example Python client for `vllm.entrypoints.api_server`
 NOTE: The API server is used only for demonstration and simple performance
 benchmarks. It is not intended for production use.
-For production use, we recommend vllm.entrypoints.openai.api_server
-and the OpenAI client API
+For production use, we recommend `vllm serve` and the OpenAI client API.
 """
 
 import argparse

examples/logging_configuration.md

Lines changed: 3 additions & 9 deletions

@@ -95,9 +95,7 @@ to the path of the custom logging configuration JSON file:
 
 ```bash
 VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \
-    python3 -m vllm.entrypoints.openai.api_server \
-    --max-model-len 2048 \
-    --model mistralai/Mistral-7B-v0.1
+    vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
 ```
 
 
@@ -152,9 +150,7 @@ to the path of the custom logging configuration JSON file:
 
 ```bash
 VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \
-    python3 -m vllm.entrypoints.openai.api_server \
-    --max-model-len 2048 \
-    --model mistralai/Mistral-7B-v0.1
+    vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
 ```
 
 
@@ -167,9 +163,7 @@ loggers.
 
 ```bash
 VLLM_CONFIGURE_LOGGING=0 \
-    python3 -m vllm.entrypoints.openai.api_server \
-    --max-model-len 2048 \
-    --model mistralai/Mistral-7B-v0.1
+    vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
 ```
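
As background for the commands above, the JSON file pointed to by `VLLM_LOGGING_CONFIG_PATH` follows the standard `logging.config.dictConfig` schema. A hedged sketch that builds and validates such a config before writing it out; the formatter, handler, and level choices here are illustrative assumptions, not vLLM defaults:

```python
# Sketch: produce a dictConfig-style logging_config.json for VLLM_LOGGING_CONFIG_PATH.
# Assumption: the file follows the standard logging.config.dictConfig schema.
import json
import logging.config

logging_config = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "plain",
            "level": "INFO",
        },
    },
    "loggers": {
        "vllm": {"handlers": ["console"], "level": "INFO", "propagate": False},
    },
}

# Sanity-check the schema locally, then dump it to the file the server will read.
logging.config.dictConfig(logging_config)
with open("logging_config.json", "w") as f:
    json.dump(logging_config, f, indent=2)
```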

examples/openai_vision_api_client.py

Lines changed: 1 addition & 3 deletions

@@ -1,9 +1,7 @@
 """An example showing how to use vLLM to serve VLMs.
 
 Launch the vLLM server with the following command:
-python -m vllm.entrypoints.openai.api_server \
-    --model llava-hf/llava-1.5-7b-hf \
-    --chat-template template_llava.jinja
+vllm serve llava-hf/llava-1.5-7b-hf --chat-template template_llava.jinja
 """
 import base64

examples/production_monitoring/Otel.md

Lines changed: 3 additions & 3 deletions

@@ -36,7 +36,7 @@
 ```
 export OTEL_SERVICE_NAME="vllm-server"
 export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
-python -m vllm.entrypoints.openai.api_server --model="facebook/opt-125m" --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
+vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
 ```
 
 1. In a new shell, send requests with trace context from a dummy client
@@ -62,7 +62,7 @@ By default, `grpc` is used. To set `http/protobuf` as the protocol, configure th
 ```
 export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
 export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
-python -m vllm.entrypoints.openai.api_server --model="facebook/opt-125m" --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
+vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
 ```
 
 ## Instrumentation of FastAPI
@@ -74,7 +74,7 @@ OpenTelemetry allows automatic instrumentation of FastAPI.
 
 1. Run vLLM with `opentelemetry-instrument`
 ```
-opentelemetry-instrument python -m vllm.entrypoints.openai.api_server --model="facebook/opt-125m"
+opentelemetry-instrument vllm serve facebook/opt-125m
 ```
 
 1. Send a request to vLLM and find its trace in Jaeger. It should contain spans from FastAPI.
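
For the "dummy client" step mentioned above, here is a hedged sketch of sending one request with trace context. It assumes `opentelemetry-sdk`, the OTLP gRPC exporter, and `requests` are installed, a collector listening on `localhost:4317`, and the vLLM server on `localhost:8000`; none of these values come from this diff:

```python
# Sketch: send one traced request to the vLLM OpenAI-compatible server.
# Assumptions: opentelemetry-sdk + opentelemetry-exporter-otlp and requests installed;
# OTLP collector at localhost:4317; vLLM server at localhost:8000.
import requests
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("dummy-client")

with tracer.start_as_current_span("client-request"):
    headers = {}
    inject(headers)  # adds the W3C traceparent header for vLLM to pick up
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        headers=headers,
        json={"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 8},
    )
    print(resp.json())
```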

examples/production_monitoring/README.md

Lines changed: 1 addition & 2 deletions

@@ -10,8 +10,7 @@ Install:
 
 Prometheus metric logging is enabled by default in the OpenAI-compatible server. Launch via the entrypoint:
 ```bash
-python3 -m vllm.entrypoints.openai.api_server \
-    --model mistralai/Mistral-7B-v0.1 \
+vllm serve mistralai/Mistral-7B-v0.1 \
     --max-model-len 2048 \
     --disable-log-requests
 ```
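
A quick way to confirm metrics are being exported once the server above is up, as a hedged sketch (assumes the default port 8000 and that vLLM's metric names carry the `vllm:` prefix):

```python
# Sketch: fetch the Prometheus scrape endpoint of a local vLLM server.
# Assumptions: default port 8000; vLLM metrics are prefixed with "vllm:".
import requests

resp = requests.get("http://localhost:8000/metrics")
resp.raise_for_status()

for line in resp.text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```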

tests/async_engine/test_openapi_server_ray.py

Lines changed: 11 additions & 11 deletions

@@ -9,17 +9,17 @@
 
 @pytest.fixture(scope="module")
 def server():
-    with RemoteOpenAIServer([
-        "--model",
-        MODEL_NAME,
-        # use half precision for speed and memory savings in CI environment
-        "--dtype",
-        "float16",
-        "--max-model-len",
-        "2048",
-        "--enforce-eager",
-        "--engine-use-ray"
-    ]) as remote_server:
+    args = [
+        # use half precision for speed and memory savings in CI environment
+        "--dtype",
+        "float16",
+        "--max-model-len",
+        "2048",
+        "--enforce-eager",
+        "--engine-use-ray"
+    ]
+
+    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
         yield remote_server

tests/distributed/test_pipeline_parallel.py

Lines changed: 1 addition & 5 deletions

@@ -15,8 +15,6 @@
 ])
 def test_compare_tp(TP_SIZE, PP_SIZE, EAGER_MODE, CHUNKED_PREFILL, MODEL_NAME):
     pp_args = [
-        "--model",
-        MODEL_NAME,
         # use half precision for speed and memory savings in CI environment
         "--dtype",
         "bfloat16",
@@ -34,8 +32,6 @@ def test_compare_tp(TP_SIZE, PP_SIZE, EAGER_MODE, CHUNKED_PREFILL, MODEL_NAME):
     # schedule all workers in a node other than the head node,
     # which can cause the test to fail.
     tp_args = [
-        "--model",
-        MODEL_NAME,
         # use half precision for speed and memory savings in CI environment
         "--dtype",
         "bfloat16",
@@ -53,7 +49,7 @@ def test_compare_tp(TP_SIZE, PP_SIZE, EAGER_MODE, CHUNKED_PREFILL, MODEL_NAME):
 
     results = []
     for args in [pp_args, tp_args]:
-        with RemoteOpenAIServer(args) as server:
+        with RemoteOpenAIServer(MODEL_NAME, args) as server:
             client = server.get_client()
 
             # test models list
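
For readers following the `RemoteOpenAIServer(MODEL_NAME, args)` signature change across these test files, a hedged sketch of how the updated helper is consumed. `RemoteOpenAIServer` and `MODEL_NAME` come from the test module's existing scope (their import paths are not shown in this diff), and the assertion mirrors the "test models list" step visible above:

```python
# Sketch of the new usage pattern: model name first, remaining CLI args second.
# Assumption: RemoteOpenAIServer and MODEL_NAME are in scope as in the hunk above.
args = ["--dtype", "bfloat16", "--enforce-eager"]

with RemoteOpenAIServer(MODEL_NAME, args) as server:
    client = server.get_client()  # OpenAI-compatible client
    served = [m.id for m in client.models.list().data]
    assert MODEL_NAME in served   # the served model keeps its HF name
```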

tests/entrypoints/openai/test_chat.py

Lines changed: 21 additions & 21 deletions

@@ -27,27 +27,27 @@ def zephyr_lora_files():
 
 @pytest.fixture(scope="module")
 def server(zephyr_lora_files):
-    with RemoteOpenAIServer([
-        "--model",
-        MODEL_NAME,
-        # use half precision for speed and memory savings in CI environment
-        "--dtype",
-        "bfloat16",
-        "--max-model-len",
-        "8192",
-        "--enforce-eager",
-        # lora config below
-        "--enable-lora",
-        "--lora-modules",
-        f"zephyr-lora={zephyr_lora_files}",
-        f"zephyr-lora2={zephyr_lora_files}",
-        "--max-lora-rank",
-        "64",
-        "--max-cpu-loras",
-        "2",
-        "--max-num-seqs",
-        "128",
-    ]) as remote_server:
+    args = [
+        # use half precision for speed and memory savings in CI environment
+        "--dtype",
+        "bfloat16",
+        "--max-model-len",
+        "8192",
+        "--enforce-eager",
+        # lora config below
+        "--enable-lora",
+        "--lora-modules",
+        f"zephyr-lora={zephyr_lora_files}",
+        f"zephyr-lora2={zephyr_lora_files}",
+        "--max-lora-rank",
+        "64",
+        "--max-cpu-loras",
+        "2",
+        "--max-num-seqs",
+        "128",
+    ]
+
+    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
         yield remote_server

tests/entrypoints/openai/test_completion.py

Lines changed: 30 additions & 30 deletions

@@ -37,36 +37,36 @@ def zephyr_pa_files():
 
 @pytest.fixture(scope="module")
 def server(zephyr_lora_files, zephyr_pa_files):
-    with RemoteOpenAIServer([
-        "--model",
-        MODEL_NAME,
-        # use half precision for speed and memory savings in CI environment
-        "--dtype",
-        "bfloat16",
-        "--max-model-len",
-        "8192",
-        "--max-num-seqs",
-        "128",
-        "--enforce-eager",
-        # lora config
-        "--enable-lora",
-        "--lora-modules",
-        f"zephyr-lora={zephyr_lora_files}",
-        f"zephyr-lora2={zephyr_lora_files}",
-        "--max-lora-rank",
-        "64",
-        "--max-cpu-loras",
-        "2",
-        # pa config
-        "--enable-prompt-adapter",
-        "--prompt-adapters",
-        f"zephyr-pa={zephyr_pa_files}",
-        f"zephyr-pa2={zephyr_pa_files}",
-        "--max-prompt-adapters",
-        "2",
-        "--max-prompt-adapter-token",
-        "128",
-    ]) as remote_server:
+    args = [
+        # use half precision for speed and memory savings in CI environment
+        "--dtype",
+        "bfloat16",
+        "--max-model-len",
+        "8192",
+        "--max-num-seqs",
+        "128",
+        "--enforce-eager",
+        # lora config
+        "--enable-lora",
+        "--lora-modules",
+        f"zephyr-lora={zephyr_lora_files}",
+        f"zephyr-lora2={zephyr_lora_files}",
+        "--max-lora-rank",
+        "64",
+        "--max-cpu-loras",
+        "2",
+        # pa config
+        "--enable-prompt-adapter",
+        "--prompt-adapters",
+        f"zephyr-pa={zephyr_pa_files}",
+        f"zephyr-pa2={zephyr_pa_files}",
+        "--max-prompt-adapters",
+        "2",
+        "--max-prompt-adapter-token",
+        "128",
+    ]
+
+    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server
