.. _on_cloud:

Deploying and scaling up with SkyPilot
========================================

.. raw:: html

    <p align="center">
        <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
    </p>

vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in the `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.


Prerequisites
-------------

- Go to the `HuggingFace model page <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__ and request access to the model :code:`meta-llama/Meta-Llama-3-8B-Instruct`.
- Check that you have installed SkyPilot (`docs <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__).
- Check that :code:`sky check` shows clouds or Kubernetes are enabled.

.. code-block:: console

    pip install skypilot-nightly
    sky check


Run on a single instance
------------------------

See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml>`__.

.. code-block:: yaml

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

      echo 'Waiting for vllm api server to start...'
      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN

Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the Llama-3 model to do the text completion.

.. code-block:: console

    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live

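
Because the task opens port 8081 to internet traffic, you can also query the OpenAI-compatible API on the instance directly, in addition to using the gradio UI. The following is a minimal sketch, not part of the original recipe: it assumes you gave the cluster a name with :code:`-c vllm-serve` when launching, and uses :code:`sky status --ip` to look up the head IP.

.. code-block:: console

    # Sketch (assumes the cluster was launched with `-c vllm-serve`).
    IP=$(sky status --ip vllm-serve)
    # List the models served by the vLLM OpenAI-compatible server.
    curl http://$IP:8081/v1/models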

**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct


Scale up to multiple replicas
-----------------------------

SkyPilot can scale the service up to multiple replicas with built-in autoscaling, load balancing, and fault tolerance. You can do so by adding a :code:`service` section to the YAML file.

.. code-block:: yaml

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1

.. raw:: html

    <details>
    <summary>Click to see the full recipe YAML</summary>

.. code-block:: yaml

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

      echo 'Waiting for vllm api server to start...'
      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001

.. raw:: html

    </details>

Start serving the Llama-3 8B model on multiple replicas:

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN


Wait until the service is ready:

.. code-block:: console

    watch -n10 sky serve status vllm


.. raw:: html

    <details>
    <summary>Example outputs:</summary>

.. code-block:: console

    Services
    NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
    vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

    Service Replicas
    SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES          STATUS  REGION
    vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
    vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})  READY   us-east4

.. raw:: html

    </details>

After the service is READY, you can find a single endpoint for the service and access it with the endpoint:

.. code-block:: console

    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
    curl -L http://$ENDPOINT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant."
                },
                {
                    "role": "user",
                    "content": "Who are you?"
                }
            ],
            "stop_token_ids": [128009, 128001]
        }'

To enable autoscaling, you can specify additional configs in the :code:`service` section:

.. code-block:: yaml

    service:
      replica_policy:
        min_replicas: 0
        max_replicas: 3
        target_qps_per_replica: 2

This will scale the service up when the QPS per replica exceeds 2.
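
To apply a changed :code:`service` section to a running service, you can tear it down and run :code:`sky serve up` again, or update it in place. The snippet below is a sketch rather than part of the original recipe; :code:`sky serve update` is assumed to be available in your SkyPilot version (check :code:`sky serve --help`).

.. code-block:: console

    # Sketch: push the updated YAML (including the new replica_policy) to the running service.
    # If `sky serve update` is unavailable in your SkyPilot version, re-run `sky serve up`.
    HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN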


**Optional**: Connect a GUI to the endpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.

.. raw:: html

    <details>
    <summary>Click to see the full GUI YAML</summary>

.. code-block:: yaml

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
      ENDPOINT: x.x.x.x:3031  # Address of the API server running vllm.

    resources:
      cpus: 2

    setup: |
      conda activate vllm
      if [ $? -ne 0 ]; then
        conda create -n vllm python=3.10 -y
        conda activate vllm
      fi

      # Install Gradio for web UI.
      pip install gradio openai

    run: |
      conda activate vllm
      export PATH=$PATH:/sbin
      WORKER_IP=$(hostname -I | cut -d' ' -f1)
      CONTROLLER_PORT=21001
      WORKER_PORT=21002

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://$ENDPOINT/v1 \
        --stop-token-ids 128009,128001 | tee ~/gradio.log

.. raw:: html

    </details>

1. Start the chat web UI:

.. code-block:: console

    sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)


2. Then, we can access the GUI at the returned gradio link:

.. code-block:: console

    | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live

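When you are done, you can tear everything down; a short housekeeping sketch (the names :code:`gui` and :code:`vllm` match the commands above):

.. code-block:: console

    # Tear down the GUI cluster launched with `sky launch -c gui`.
    sky down gui
    # Tear down the vLLM service and all of its replicas.
    sky serve down vllm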