.. _on_cloud:

Deploying and scaling up with SkyPilot
========================================

.. raw:: html

    <p align="center">
        <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
    </p>

vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in the `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.


Prerequisites
-------------

- Go to the `HuggingFace model page <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__ and request access to the model :code:`meta-llama/Meta-Llama-3-8B-Instruct`.
- Check that you have installed SkyPilot (`docs <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__).
- Check that :code:`sky check` shows clouds or Kubernetes are enabled.

.. code-block:: console

    pip install skypilot-nightly
    sky check


Run on a single instance
------------------------

See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml>`__.

.. code-block:: yaml

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

      echo 'Waiting for vllm api server to start...'
      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN

Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the Llama-3 model to do the text completion.

.. code-block:: console

    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live

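
Because the task opens port 8081 to internet traffic, you can also query the OpenAI-compatible API on the instance directly, in addition to using the gradio UI. The following is a minimal sketch, not part of the original recipe: it assumes you gave the cluster a name with :code:`-c vllm-serve` when launching, and uses :code:`sky status --ip` to look up the head IP.

.. code-block:: console

    # Sketch (assumes the cluster was launched with `-c vllm-serve`).
    IP=$(sky status --ip vllm-serve)
    # List the models served by the vLLM OpenAI-compatible server.
    curl http://$IP:8081/v1/models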

**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct


Scale up to multiple replicas
-----------------------------

SkyPilot can scale the service up to multiple replicas with built-in autoscaling, load balancing, and fault tolerance. You can do so by adding a :code:`service` section to the YAML file.

.. code-block:: yaml

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1

.. raw:: html

    <details>
    <summary>Click to see the full recipe YAML</summary>

.. code-block:: yaml

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

      echo 'Waiting for vllm api server to start...'
      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001

.. raw:: html

    </details>

Start serving the Llama-3 8B model on multiple replicas:

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN


Wait until the service is ready:

.. code-block:: console

    watch -n10 sky serve status vllm


.. raw:: html

    <details>
    <summary>Example outputs:</summary>

.. code-block:: console

    Services
    NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
    vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

    Service Replicas
    SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES          STATUS  REGION
    vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
    vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})  READY   us-east4

.. raw:: html

    </details>

After the service is READY, you can find a single endpoint for the service and access it with the endpoint:

.. code-block:: console

    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
    curl -L http://$ENDPOINT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant."
                },
                {
                    "role": "user",
                    "content": "Who are you?"
                }
            ],
            "stop_token_ids": [128009, 128001]
        }'

To enable autoscaling, you can specify additional configs in the :code:`service` section:

.. code-block:: yaml

    service:
      replica_policy:
        min_replicas: 0
        max_replicas: 3
        target_qps_per_replica: 2

This will scale the service up when the QPS per replica exceeds 2.
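
To apply a changed :code:`service` section to a running service, you can tear it down and run :code:`sky serve up` again, or update it in place. The snippet below is a sketch rather than part of the original recipe; :code:`sky serve update` is assumed to be available in your SkyPilot version (check :code:`sky serve --help`).

.. code-block:: console

    # Sketch: push the updated YAML (including the new replica_policy) to the running service.
    # If `sky serve update` is unavailable in your SkyPilot version, re-run `sky serve up`.
    HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN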


**Optional**: Connect a GUI to the endpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.

.. raw:: html

    <details>
    <summary>Click to see the full GUI YAML</summary>

.. code-block:: yaml

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
      ENDPOINT: x.x.x.x:3031  # Address of the API server running vllm.

    resources:
      cpus: 2

    setup: |
      conda activate vllm
      if [ $? -ne 0 ]; then
        conda create -n vllm python=3.10 -y
        conda activate vllm
      fi

      # Install Gradio for web UI.
      pip install gradio openai

    run: |
      conda activate vllm
      export PATH=$PATH:/sbin
      WORKER_IP=$(hostname -I | cut -d' ' -f1)
      CONTROLLER_PORT=21001
      WORKER_PORT=21002

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://$ENDPOINT/v1 \
        --stop-token-ids 128009,128001 | tee ~/gradio.log

.. raw:: html

    </details>

1. Start the chat web UI:

.. code-block:: console

    sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)


2. Then, we can access the GUI at the returned gradio link:

.. code-block:: console

    | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live

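When you are done, you can tear everything down; a short housekeeping sketch (the names :code:`gui` and :code:`vllm` match the commands above):

.. code-block:: console

    # Tear down the GUI cluster launched with `sky launch -c gui`.
    sky down gui
    # Tear down the vLLM service and all of its replicas.
    sky serve down vllm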