Welcome to vLLM Windows Home!

This repository contains a Docker Compose setup for running vLLM on Windows. With it, you can easily run and experiment with vLLM on Windows Home.

Enjoy state-of-the-art LLM serving throughput on your Windows Home PC, with efficient paged attention, continuous batching, fast inference, and quantization.

[Image: vllm-windows-home]

Once the following is set up, you can start vLLM at the click of a button in Docker Desktop, or configure it to start automatically at Windows startup.
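To have vLLM come up automatically with Windows, one approach (assuming Docker Desktop is set to start when you sign in) is to add a restart policy to the vLLM service in docker-compose.yml:

  restart: unless-stopped

Docker Desktop will then bring the container back up whenever it launches, unless you stopped the container manually.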

Getting Started

Prerequisites

Docker Desktop: Install Docker Desktop from https://www.docker.com/products/docker-desktop

Note: On Windows Home, Docker Desktop uses WSL 2 as its backend engine.
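Before continuing, it is worth confirming that WSL 2 and GPU passthrough are working. A quick check (the CUDA image tag below is only illustrative):

  wsl --status
  docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi

If nvidia-smi prints your GPU from inside the container, Docker can reach it and vLLM should be able to as well.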

Steps

  1. Clone the Repository
git clone https://github.com/aneeshjoy/vllm-windows.git
cd vllm-windows
  2. Update Hugging Face Token. Open docker-compose.yml and replace <hugging_face_token> with your own Hugging Face token. The entry should look like this:
  environment:
    - HUGGING_FACE_HUB_TOKEN=<hugging_face_token>
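For orientation, that environment entry sits inside the vLLM service definition. A minimal sketch of what the relevant part of docker-compose.yml typically looks like (the image name, port, and volume paths here are assumptions; your cloned file is the source of truth):

  services:
    vllm:
      image: vllm/vllm-openai:latest
      ports:
        - "8000:8000"
      volumes:
        - ./models:/models
      environment:
        - HUGGING_FACE_HUB_TOKEN=<hugging_face_token>
      command: --model /models/mistral-7b
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: all
                capabilities: [gpu]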
  3. Copy Model Weights

Download or copy the desired LLM model weights into the models directory within the cloned repository, and update the model path in the command line of docker-compose.yml:

  command: --model /models/mistral-7b
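If you do not already have the weights locally, one way to fetch them is with the Hugging Face CLI (the repo id below is just an example; install huggingface_hub first if needed):

  huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir models/mistral-7b

Whatever directory name you choose under models/ must match the --model path above.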
  4. Execute the following command at the root of the project:
docker-compose up
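To run it in the background instead, and to follow the server logs while the model loads:

  docker-compose up -d
  docker-compose logs -f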
  5. Test by accessing the /models endpoint:

http://127.0.0.1:8000/v1/models
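You can also send a quick completion request to the OpenAI-compatible API (run this from WSL or Git Bash; adjust the model path if yours differs):

  curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/mistral-7b", "prompt": "Hello, my name is", "max_tokens": 32}'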

  6. Check throughput (I am running on an RTX 3090):

http://127.0.0.1:8000/metrics
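To pull just the throughput gauges out of the raw Prometheus-format output (grep assumed available, e.g. inside WSL):

  curl -s http://127.0.0.1:8000/metrics | grep throughput

The full output looks like this: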

# HELP exceptions_total_counter Total number of requested which generated an exception
# TYPE exceptions_total_counter counter
# HELP requests_total_counter Total number of requests received
# TYPE requests_total_counter counter
requests_total_counter{method="POST",path="/v1/completions"} 24
# HELP responses_total_counter Total number of responses sent
# TYPE responses_total_counter counter
responses_total_counter{method="POST",path="/v1/completions"} 24
# HELP status_codes_counter Total number of response status codes
# TYPE status_codes_counter counter
status_codes_counter{method="POST",path="/v1/completions",status_code="200"} 24
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="/models/mistral-7b"} 842.7750196184555
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="/models/mistral-7b"} 1211.5997677115236
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="/models/mistral-7b"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="/models/mistral-7b"} 0.38849487785658
# HELP vllm:num_requests_running Number of requests that is currently running for inference.