Commit ceaf4ed: [Doc] Update the SkyPilot doc with serving and Llama-3 (#4276)
1 parent ad8d696 commit ceaf4ed

1 file changed: docs/source/serving/run_on_sky.rst (264 additions, 23 deletions)
.. _on_cloud:

Deploying and scaling up with SkyPilot
======================================

.. raw:: html

  <p align="center">
    <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
  </p>

vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.

Prerequisites
-------------

- Go to the `HuggingFace model page <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__ and request access to the model :code:`meta-llama/Meta-Llama-3-8B-Instruct`.
- Check that you have installed SkyPilot (`docs <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__).
- Check that :code:`sky check` shows clouds or Kubernetes are enabled.

.. code-block:: console

  pip install skypilot-nightly
  sky check


Run on a single instance
------------------------

See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml>`__.

2333
.. code-block:: yaml
2434
2535
resources:
26-
accelerators: A100
36+
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
37+
use_spot: True
38+
disk_size: 512 # Ensure model checkpoints can fit.
39+
disk_tier: best
40+
ports: 8081 # Expose to internet traffic.
2741
2842
envs:
29-
MODEL_NAME: decapoda-research/llama-13b-hf
30-
TOKENIZER: hf-internal-testing/llama-tokenizer
43+
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
44+
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.
3145
3246
setup: |
33-
conda create -n vllm python=3.9 -y
47+
conda create -n vllm python=3.10 -y
3448
conda activate vllm
35-
git clone https://github.com/vllm-project/vllm.git
36-
cd vllm
37-
pip install .
38-
pip install gradio
49+
50+
pip install vllm==0.4.0.post1
51+
# Install Gradio for web UI.
52+
pip install gradio openai
53+
pip install flash-attn==2.5.7
3954
4055
run: |
4156
conda activate vllm
4257
echo 'Starting vllm api server...'
43-
python -u -m vllm.entrypoints.api_server \
44-
--model $MODEL_NAME \
45-
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
46-
--tokenizer $TOKENIZER 2>&1 | tee api_server.log &
58+
python -u -m vllm.entrypoints.openai.api_server \
59+
--port 8081 \
60+
--model $MODEL_NAME \
61+
--trust-remote-code \
62+
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
63+
2>&1 | tee api_server.log &
64+
4765
echo 'Waiting for vllm api server to start...'
4866
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
67+
4968
echo 'Starting gradio server...'
50-
python vllm/examples/gradio_webserver.py
69+
git clone https://github.com/vllm-project/vllm.git || true
70+
python vllm/examples/gradio_openai_chatbot_webserver.py \
71+
-m $MODEL_NAME \
72+
--port 8811 \
73+
--model-url http://localhost:8081/v1 \
74+
--stop-token-ids 128009,128001
5175
52-
Start the serving the LLaMA-13B model on an A100 GPU:
76+
Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
5377

5478
.. code-block:: console
5579
56-
$ sky launch serving.yaml
80+
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
5781
5882
Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the LLaMA model to do the text completion.
5983

6084
.. code-block:: console
6185
6286
(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
6387
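
The recipe above also exposes the OpenAI-compatible API on port 8081, so you can query the server directly instead of going through the Gradio UI. A minimal sketch, assuming you gave the cluster a name at launch time (e.g. ``sky launch -c vllm-serve serving.yaml --env HF_TOKEN``); the cluster name and the ``sky status --ip`` lookup are illustrative and not part of the original recipe:

.. code-block:: console

  # Look up the head-node IP of the (hypothetical) cluster "vllm-serve".
  IP=$(sky status --ip vllm-serve)
  # Sanity check: list the models served by the OpenAI-compatible server on port 8081.
  curl http://$IP:8081/v1/models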

**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:

.. code-block:: console

  HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct


Scale up to multiple replicas
-----------------------------

SkyPilot can scale the service up to multiple replicas with built-in autoscaling, load balancing and fault tolerance. You can do so by adding a ``service`` section to the YAML file.

.. code-block:: yaml

  service:
    replicas: 2
    # An actual request for readiness probe.
    readiness_probe:
      path: /v1/chat/completions
      post_data:
        model: $MODEL_NAME
        messages:
          - role: user
            content: Hello! What is your name?
        max_tokens: 1

.. raw:: html

  <details>
  <summary>Click to see the full recipe YAML</summary>

.. code-block:: yaml

  service:
    replicas: 2
    # An actual request for readiness probe.
    readiness_probe:
      path: /v1/chat/completions
      post_data:
        model: $MODEL_NAME
        messages:
          - role: user
            content: Hello! What is your name?
        max_tokens: 1

  resources:
    accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for the 8B model.
    use_spot: True
    disk_size: 512  # Ensure model checkpoints can fit.
    disk_tier: best
    ports: 8081  # Expose to internet traffic.

  envs:
    MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
    HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

  setup: |
    conda create -n vllm python=3.10 -y
    conda activate vllm

    pip install vllm==0.4.0.post1
    # Install Gradio for web UI.
    pip install gradio openai
    pip install flash-attn==2.5.7

  run: |
    conda activate vllm
    echo 'Starting vllm api server...'
    python -u -m vllm.entrypoints.openai.api_server \
      --port 8081 \
      --model $MODEL_NAME \
      --trust-remote-code \
      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
      2>&1 | tee api_server.log &

    echo 'Waiting for vllm api server to start...'
    while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

    echo 'Starting gradio server...'
    git clone https://github.com/vllm-project/vllm.git || true
    python vllm/examples/gradio_openai_chatbot_webserver.py \
      -m $MODEL_NAME \
      --port 8811 \
      --model-url http://localhost:8081/v1 \
      --stop-token-ids 128009,128001

.. raw:: html

  </details>

Start serving the Llama-3 8B model on multiple replicas:

.. code-block:: console

  HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN

Wait until the service is ready:

.. code-block:: console

  watch -n10 sky serve status vllm

.. raw:: html

  <details>
  <summary>Example outputs:</summary>

.. code-block:: console

  Services
  NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
  vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

  Service Replicas
  SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES          STATUS  REGION
  vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
  vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})  READY   us-east4

.. raw:: html

  </details>

After the service is READY, you can find a single endpoint for the service and access it with that endpoint:

.. code-block:: console

  ENDPOINT=$(sky serve status --endpoint 8081 vllm)
  curl -L http://$ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "Who are you?"
        }
      ],
      "stop_token_ids": [128009, 128001]
    }'
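
The same endpoint exposes the rest of vLLM's OpenAI-compatible API as well. As a quick illustration (the prompt and ``max_tokens`` values below are just examples, not part of the original recipe), a plain completions request looks like:

.. code-block:: console

  # Query the non-chat completions route on the same load-balanced endpoint.
  curl -L http://$ENDPOINT/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "prompt": "San Francisco is a",
      "max_tokens": 7
    }'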

To enable autoscaling, you can specify additional configs in the ``service`` section:

.. code-block:: yaml

  service:
    replica_policy:
      min_replicas: 0
      max_replicas: 3
      target_qps_per_replica: 2

This will scale the service up when the per-replica QPS exceeds 2.
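
To roll a change like this out to a service that is already running, you can update it in place rather than tearing it down. A minimal sketch, assuming the service name ``vllm`` used above (the exact flags may vary by SkyPilot version; see ``sky serve update --help``):

.. code-block:: console

  # Apply the updated YAML (now including replica_policy) to the running service.
  HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN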

**Optional**: Connect a GUI to the endpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across the replicas.

.. raw:: html

  <details>
  <summary>Click to see the full GUI YAML</summary>

.. code-block:: yaml

  envs:
    MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
    ENDPOINT: x.x.x.x:3031  # Address of the API server running vllm.

  resources:
    cpus: 2

  setup: |
    conda activate vllm
    if [ $? -ne 0 ]; then
      conda create -n vllm python=3.10 -y
      conda activate vllm
    fi

    # Install Gradio for web UI.
    pip install gradio openai

  run: |
    conda activate vllm
    export PATH=$PATH:/sbin
    WORKER_IP=$(hostname -I | cut -d' ' -f1)
    CONTROLLER_PORT=21001
    WORKER_PORT=21002

    echo 'Starting gradio server...'
    git clone https://github.com/vllm-project/vllm.git || true
    python vllm/examples/gradio_openai_chatbot_webserver.py \
      -m $MODEL_NAME \
      --port 8811 \
      --model-url http://$ENDPOINT/v1 \
      --stop-token-ids 128009,128001 | tee ~/gradio.log

.. raw:: html

  </details>

1. Start the chat web UI:

   .. code-block:: console

      sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)

2. Then, we can access the GUI at the returned gradio link:

   .. code-block:: console

      | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
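
When you are done, you can tear everything down to stop incurring cloud costs. A minimal sketch, assuming the service name ``vllm`` and the GUI cluster name ``gui`` used above (any single-instance cluster from the first section would be terminated under its own name):

.. code-block:: console

  # Shut down the replicated service and all of its replicas.
  sky serve down vllm
  # Terminate the GUI cluster.
  sky down gui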
