vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
This README provides instructions on how to install and use vLLM version 0.6.6 on Polaris.
To install vLLM on Polaris, first request an interactive session on a compute node:

qsub -I -A <project> -q debug -l select=1 -l walltime=01:00:00 -l filesystems=home:eagle

Then run the following on the compute node:
module use /soft/modulefiles/
module load conda
conda create -n vllm_v071_env python==3.11.9 -y
conda activate vllm_v071_env
module use /soft/spack/base/0.8.1/install/modulefiles/Core
module load gcc
pip install vllm
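As an optional sanity check (not part of the original instructions), you can confirm that the package imports and report the installed version:

```bash
# Optional: confirm vLLM imports correctly and print the installed version
python -c "import vllm; print(vllm.__version__)"
```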
To use vLLM, you can do one of the following.

To serve a model interactively, run the following commands on a compute node:
module use /soft/modulefiles
module load conda
conda activate vllm_v071_env # change this if your environment has a different name
module use /soft/spack/base/0.8.1/install/modulefiles/Core
module load gcc
export HF_DATASETS_CACHE="/eagle/argonne_tpc/model_weights/"
export HF_HOME="/eagle/argonne_tpc/model_weights/"
export RAY_TMPDIR="/tmp"
export RAYON_NUM_THREADS=4
export RUST_BACKTRACE=1
export PROMETHEUS_MULTIPROC_DIR="/tmp"
export VLLM_RPC_BASE_PATH="/tmp"
export HF_TOKEN="" #Add your token
export no_proxy="127.0.0.1,localhost"
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 127.0.0.1 --tensor-parallel-size 4 --gpu-memory-utilization 0.98 --enforce-eager #For online serving
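Once the server logs show it is ready, you can verify it from a second shell on the same compute node. A minimal check, assuming the default vLLM port of 8000 (no `--port` was passed above):

```bash
# List the models the server is exposing (default port 8000 assumed)
curl http://127.0.0.1:8000/v1/models
```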
An alternative is to run vLLM in the background on the compute node. Check nohup.out for logs and ensure the model is up and running:
nohup vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 127.0.0.1 --tensor-parallel-size 4 --gpu-memory-utilization 0.98 --enforce-eager &
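Since the server now starts asynchronously, you may want to wait until it responds before sending requests. A minimal sketch, again assuming the default port 8000:

```bash
# Poll the OpenAI-compatible endpoint until the background server is ready
until curl -s http://127.0.0.1:8000/v1/models > /dev/null; do
    echo "Waiting for vLLM server to come up..."
    sleep 15
done
tail -n 20 nohup.out  # confirm the model loaded without errors
```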
Now, to interact with the model, either run openai_client.py or set up an SSH tunnel from a login node as follows:

bash tunnel.sh
python3 vllm_client.py # or use curl; see `curl.sh`
💡 Note: You can run `python3 vllm_client.py -h` to view all available options.
💡 Note: Ensure you `chmod +x` all the bash scripts.
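For reference, a chat-completion request of the kind `curl.sh` issues might look like the following. This is only an illustrative sketch: it assumes the server started above is listening on the default port 8000 and serving meta-llama/Meta-Llama-3-8B-Instruct; the script in this repo is authoritative.

```bash
# Illustrative OpenAI-compatible chat-completion request (default port 8000 assumed)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "What is PagedAttention?"}],
        "max_tokens": 128
      }'
```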
See multi_node_inference_job_submission.sh for running Llama3.1-405B on 8 Polaris nodes. If needed, reduce the context size, etc., so the model fits in memory on a smaller number of nodes (the relevant flags are sketched below). Modify the setup_ray_and_vllm.sh file with appropriate values if needed, e.g., set HF_TOKEN to your Hugging Face token.
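The vLLM flags that most directly control the memory footprint are sketched below. The flags themselves are standard vLLM options, but the model name and the specific values are illustrative assumptions, not the settings used in multi_node_inference_job_submission.sh:

```bash
# Illustrative only: flags commonly tuned to make a large model fit in memory
#   --tensor-parallel-size    GPUs per tensor-parallel group (4 GPUs per Polaris node)
#   --pipeline-parallel-size  number of pipeline stages (e.g., one per node)
#   --max-model-len           shrink the context length to reduce KV-cache memory
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager
```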
The instructions in the vLLM_Inference.ipynb notebook will guide you through triggering vLLM inference runs remotely from your local machine using Globus Compute.
Use the job_submission.sh file to submit a batch job that runs vLLM on a compute node:
qsub job_submission.sh
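You can then monitor the batch job with the usual PBS commands, for example:

```bash
# Check the status of your jobs in the Polaris (PBS) queue
qstat -u $USER
```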