Welcome to vLLM Windows Home!

This repository contains a Docker Compose setup for running vLLM on Windows. With it, you can easily run and experiment with vLLM on Windows Home.

Enjoy state-of-the-art LLM serving throughput on your Windows Home PC, with efficient paged attention, continuous batching, fast inference, and quantization.

[Image: vllm-windows-home]

Once the following is set up, you can start vLLM at the click of a button in Docker Desktop, or configure it to start automatically at Windows startup.
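To have vLLM come up automatically with Windows, one approach (assuming Docker Desktop is set to start when you sign in) is to add a restart policy to the vLLM service in docker-compose.yml:

  restart: unless-stopped

Docker Desktop will then bring the container back up whenever it launches, unless you stopped the container manually.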

Getting Started

Prerequisites

Docker Desktop: Install Docker Desktop from https://www.docker.com/products/docker-desktop

Note: On Windows Home, Docker Desktop uses WSL 2 as its backend engine.
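Before continuing, it is worth confirming that WSL 2 and GPU passthrough are working. A quick check (the CUDA image tag below is only illustrative):

  wsl --status
  docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi

If nvidia-smi prints your GPU from inside the container, Docker can reach it and vLLM should be able to as well.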

Steps

  1. Clone the Repository
git clone https://github.com/aneeshjoy/vllm-windows.git
cd vllm-windows
  2. Update Hugging Face Token. Open docker-compose.yml and replace <hugging_face_token> with your own Hugging Face token. The entry should look like this:
  environment:
    - HUGGING_FACE_HUB_TOKEN=<hugging_face_token>
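For orientation, that environment entry sits inside the vLLM service definition. A minimal sketch of what the relevant part of docker-compose.yml typically looks like (the image name, port, and volume paths here are assumptions; your cloned file is the source of truth):

  services:
    vllm:
      image: vllm/vllm-openai:latest
      ports:
        - "8000:8000"
      volumes:
        - ./models:/models
      environment:
        - HUGGING_FACE_HUB_TOKEN=<hugging_face_token>
      command: --model /models/mistral-7b
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: all
                capabilities: [gpu]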
  3. Copy Model Weights

Download or copy the desired LLM model weights into the models directory within the cloned repository, and update the model path in the command line of docker-compose.yml:

  command: --model /models/mistral-7b
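If you do not already have the weights locally, one way to fetch them is with the Hugging Face CLI (the repo id below is just an example; install huggingface_hub first if needed):

  huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir models/mistral-7b

Whatever directory name you choose under models/ must match the --model path above.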
  4. Execute the following command at the root of the project:
docker-compose up
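To run it in the background instead, and to follow the server logs while the model loads:

  docker-compose up -d
  docker-compose logs -f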
  5. Test by accessing the /models endpoint:

http://127.0.0.1:8000/v1/models
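You can also send a quick completion request to the OpenAI-compatible API (run this from WSL or Git Bash; adjust the model path if yours differs):

  curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/mistral-7b", "prompt": "Hello, my name is", "max_tokens": 32}'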

  6. Check throughput (I am running on an RTX 3090):

http://127.0.0.1:8000/metrics
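To pull just the throughput gauges out of the raw Prometheus-format output (grep assumed available, e.g. inside WSL):

  curl -s http://127.0.0.1:8000/metrics | grep throughput

The full output looks like this: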

# HELP exceptions_total_counter Total number of requested which generated an exception
# TYPE exceptions_total_counter counter
# HELP requests_total_counter Total number of requests received
# TYPE requests_total_counter counter
requests_total_counter{method="POST",path="/v1/completions"} 24
# HELP responses_total_counter Total number of responses sent
# TYPE responses_total_counter counter
responses_total_counter{method="POST",path="/v1/completions"} 24
# HELP status_codes_counter Total number of response status codes
# TYPE status_codes_counter counter
status_codes_counter{method="POST",path="/v1/completions",status_code="200"} 24
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="/models/mistral-7b"} 842.7750196184555
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="/models/mistral-7b"} 1211.5997677115236
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="/models/mistral-7b"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="/models/mistral-7b"} 0.38849487785658
# HELP vllm:num_requests_running Number of requests that is currently running for inference.