
Commit d7fefdf

Authored by tjtanaa, maleksan85 (Aleksandr Malyshev), hmellor, and Isotr0py
[MFM-2025-02-21] Merge main to llama fp8, DeepSeekV3 and PTPC-FP8 (#445)
* [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (vllm-project#12713) Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> * Refactor `Linear` handling in `TransformersModel` (vllm-project#12727) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [VLM] Add MLA with pure RoPE support for deepseek-vl2 models (vllm-project#12729) * [Misc] Bump the compressed-tensors version (vllm-project#12736) * [Model][Quant] Fix GLM, Fix fused module mappings for quantization (vllm-project#12634) Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: Kyle Sayers <kylesayrs@gmail.com> Co-authored-by: mgoin <michael@neuralmagic.com> * [Doc] Update PR Reminder with link to Developer Slack (vllm-project#12748) * [Bugfix] Fix OpenVINO model runner (vllm-project#12750) * [V1][Misc] Shorten `FinishReason` enum and use constant strings (vllm-project#12760) * [Doc] Remove performance warning for auto_awq.md (vllm-project#12743) * [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (vllm-project#12546) * [core][distributed] exact ray placement control (vllm-project#12732) Signed-off-by: youkaichao <youkaichao@gmail.com> * The code assumes WARP_SIZE to be equal to 32, which is not the case on ROCm (#406) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * Merging PR vllm-project#12536 Merged via CLI script * [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) * Add: Support for Sparse24Bitmask Compressed Models * [VLM] Use shared field to pass token ids to model * [Docs] Drop duplicate [source] links * [VLM] Qwen2.5-VL * [VLM] Update compatibility with transformers 4.49 * [ROCm][Kernel] Using the correct warp_size value * [Bugfix] Better FP8 supported defaults * [Misc][Easy] Remove the space from the file name * [Model] LoRA Support for Ultravox model (vllm-project#11253) * [Bugfix] Fix the test_ultravox.py's license (vllm-project#12806) Signed-off-by: Lu Fang <lufang@fb.com> * Improve `TransformersModel` UX (vllm-project#12785) * [Misc] Remove duplicated DeepSeek V2/V3 model definition (vllm-project#12793) * [Misc] Improve error message for incorrect pynvml (vllm-project#12809) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Misc] Update w2 scale loading for GPTQMarlinMoE (vllm-project#12757) * [Docs] Add Google Cloud Slides (vllm-project#12814) * [Attention] Use FA3 for MLA on Hopper (vllm-project#12807) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * [misc] Reduce number of config file requests to HuggingFace (vllm-project#12797) Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> * Update README.md 20250205_aiter (#407) * Update README.md 20250205_aiter * whitespace * adding VLLM_USE_AITER=0 advice * [Misc] Remove unnecessary decode call (vllm-project#12833) * [Kernel] Make rotary_embedding ops more flexible with input shape (vllm-project#12777) * [torch.compile] PyTorch 2.6 and nightly compatibility (vllm-project#12393) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Doc] double quote cmake package in build.inc.md (vllm-project#12840) * [Bugfix] Fix unsupported FA version check for Turing GPU (vllm-project#12828) * [V1] LoRA Support (vllm-project#10957) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun 
Sundar Rabindranath <varun@neuralmagic.com> * Add Bamba Model (vllm-project#10909) Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * [MISC] Check space in the file names in the pre commit checks (vllm-project#12804) Signed-off-by: Lu Fang <lufang@fb.com> * [misc] Revert # 12833 (vllm-project#12857) Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> * [Bugfix] FA2 illegal memory access (vllm-project#12848) * Make vllm compatible with verl (vllm-project#12824) Co-authored-by: zhangshulai <zhangshulai@bytedance.com> * [Bugfix] Missing quant_config in deepseek embedding layer (vllm-project#12836) * Prevent unecessary requests to huggingface hub (vllm-project#12837) * [MISC][EASY] Break check file names into entry and args in the pre-commit hooks (vllm-project#12880) Signed-off-by: Lu Fang <lufang@fb.com> * [Misc] Remove unnecessary detokenization in multimodal processing (vllm-project#12868) * PR vllm-project#12718 (vllm-project#12718) * [V1] Logprobs and prompt logprobs support (vllm-project#9880) This PR is adding support for sample logprobs & prompt logprobs to vLLM v1. New behavior: - During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order. - In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized. - During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprobs values, and token ranks into the OpenAI-compatible List[Dict[token id,Logprob]] format (for sample and prompt logprobs respectively.) - Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer. 
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com> * [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (vllm-project#12501) * fix rocm get_device name for moe configs (#359) * fix rocm get_device name use 'market_name' hard-code names for mi308 & mi300 * use gfx and num_CU for device name * using market_name * rename MI325_OAM to MI325X * rm (duplicate) MI300X_OAM * rename mi308 * [V1] LM Eval With Streaming Integration Tests (vllm-project#11590) * [Bugfix] Fix disagg hang caused by the prefill and decode communication issues (vllm-project#12723) Signed-off-by: Lu Fang <lufang@fb.com> * [V1][Minor] Remove outdated comment (vllm-project#12928) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [V1] Move KV block hashes from Request to KVCacheManager (vllm-project#12922) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping (vllm-project#12905) * [Misc] Fix typo in the example file (vllm-project#12896) Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com> * [Bugfix] Fix multi-round chat error when mistral tokenizer is used (vllm-project#12859) Signed-off-by: Zifei Tong <zifeitong@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> * [bugfix] respect distributed_executor_backend in world_size=1 (vllm-project#12934) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Misc] Add offline test for disaggregated prefill (vllm-project#12418) * [V1][Minor] Move cascade attn logic outside _prepare_inputs (vllm-project#12943) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Build] Make pypi install work on CPU platform (vllm-project#12874) * [Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi (vllm-project#12812) Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai> * [misc] Add LoRA to benchmark_serving (vllm-project#12898) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * [Misc] Log time consumption on weight downloading (vllm-project#12926) * [CI] Resolve transformers-neuronx version conflict (vllm-project#12925) * [Doc] Correct HF repository for TeleChat2 models (vllm-project#12949) * [Misc] Add qwen2.5-vl BNB support (vllm-project#12944) * [CI/Build] Auto-fix Markdown files (vllm-project#12941) * [Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU (vllm-project#12935) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> * [bugfix] fix early import of flash attention (vllm-project#12959) Signed-off-by: youkaichao <youkaichao@gmail.com> * [VLM] Merged multi-modal processor for GLM4V (vllm-project#12449) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [V1][Minor] Remove outdated comment (vllm-project#12968) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [RFC] [Mistral] FP8 format (vllm-project#10130) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> * [V1] Cache `uses_mrope` in GPUModelRunner (vllm-project#12969) * [core] port pynvml into vllm codebase (vllm-project#12963) Signed-off-by: youkaichao <youkaichao@gmail.com> * [MISC] Always import version library first in the vllm package (vllm-project#12979) Signed-off-by: Lu Fang 
<lufang@fb.com> * [core] improve error handling when wake up from sleep mode (vllm-project#12981) Signed-off-by: youkaichao <youkaichao@gmail.com> * [core][rlhf] add colocate example for RLHF (vllm-project#12984) Signed-off-by: youkaichao <youkaichao@gmail.com> * [V1] Use msgpack for core request serialization (vllm-project#12918) Signed-off-by: Nick Hill <nhill@redhat.com> * Check if selected backend is None in get_attn_backend_cls() (vllm-project#12975) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [core] fix sleep mode and pytorch checkpoint compatibility (vllm-project#13001) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Doc] Add link to tool_choice tracking issue in tool_calling.md (vllm-project#13003) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [misc] Add retries with exponential backoff for HF file existence check (vllm-project#13008) * [Bugfix] Clean up and fix multi-modal processors (vllm-project#13012) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * Fix seed parameter behavior in vLLM (vllm-project#13007) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * Fixing the output formatting (#414) * [Model] Ultravox Model: Support v0.5 Release (vllm-project#12912) Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai> * [misc] Fix setup.py condition to avoid AMD from being mistaken with CPU (vllm-project#13022) Signed-off-by: kevin <kevin@anyscale.com> * [V1][Minor] Move scheduler outputs to a separate file (vllm-project#13062) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Docs] Annouce Meta Meetup (vllm-project#13065) Signed-off-by: simon-mo <simon.mo@hey.com> * [Bugfix] Support missing tool parameters in mistral tokenizer (vllm-project#12884) Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com> * [Benchmark] Add BurstGPT to benchmark_serving (vllm-project#13063) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> * [Core] Don't do platform detection at import time (vllm-project#12933) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [Misc] LoRA - Refactor Punica ops tests (vllm-project#12970) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * [Bugfix]: Reasoning output bug according to the chat template change (vllm-project#13025) Signed-off-by: Ce Gao <cegao@tensorchord.ai> * [V1][Metrics] Add GPU prefix cache hit rate % gauge (vllm-project#12592) * [executor] init `local_rank` as device index (vllm-project#13027) Signed-off-by: Mengqing Cao <cmq0113@163.com> * [ROCm] Using a more precise memory profiling (vllm-project#12624) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * [Build] Fix cuda link target of cumem_allocator in CPU env (vllm-project#12863) Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Platform] add pre_register_and_update function (vllm-project#12432) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [Bugfix] fix flaky test (vllm-project#13089) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * [V1][Metrics] Add several request timing histograms (vllm-project#12644) Signed-off-by: Mark McLoughlin <markmc@redhat.com> * Set `torch_dtype` in `TransformersModel` (vllm-project#13088) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Misc] Fix typo at comments at metrics.py (vllm-project#13024) * [Bugfix] Do not use 
resource module on Windows (vllm-project#12858) (vllm-project#13029) * [BugFix] Pop instead of del CUDA_VISIBLE_DEVICES (vllm-project#12962) Signed-off-by: Hollow Man <hollowman@opensuse.org> * Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 (vllm-project#13023) * Add tuned moe config for qwen1.5_moe_A2.7B (#398) * Add tuned moe config for qwen1.5_moe_A2.7B * Add more sweep parameters on qwen2_moe * Add tp = 1,2,4,8 after applying PR12838 * Rename config name by deleting "_OAM" --------- Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> * [CI/Build][Bugfix] Fix CPU backend default threads num (vllm-project#13077) * Removing non-existent parameter * [Doc] Improve OpenVINO installation doc (vllm-project#13102) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix] Guided decoding falls back to outlines when fails to import xgrammar (vllm-project#12976) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [Misc] Move pre-commit suggestion back to the end (vllm-project#13114) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM (vllm-project#12518) Signed-off-by: Keyun Tong <tongkeyun@gmail.com> * [Model] IBM/NASA Prithvi Geospatial model (vllm-project#12830) * [ci] Add more source file dependencies for some tests (vllm-project#13123) Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> * [Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (vllm-project#12921) Signed-off-by: Lingfan Yu <lingfany@amazon.com> * Bump helm/kind-action from 1.10.0 to 1.12.0 (vllm-project#11612) * Bump actions/stale from 9.0.0 to 9.1.0 (vllm-project#12462) * Bump helm/chart-testing-action from 2.6.1 to 2.7.0 (vllm-project#12463) * Bump actions/setup-python from 5.3.0 to 5.4.0 (vllm-project#12672) * Further reduce the HTTP calls to huggingface.co (vllm-project#13107) * [Misc] AMD Build Improvements (vllm-project#12923) * [Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request (vllm-project#13108) * [Bugfix] Fix num video tokens calculation for Qwen2-VL (vllm-project#13148) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Frontend] Generate valid tool call IDs when using `tokenizer-mode=mistral` (vllm-project#12332) * [Misc] Delete unused LoRA modules (vllm-project#13151) * Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path (vllm-project#12998) Signed-off-by: Lu Fang <lufang@fb.com> * [CI/Build] Use mypy matcher for pre-commit CI job (vllm-project#13162) Signed-off-by: Russell Bryant <rbryant@redhat.com> * Update Benchmark Profiling Scripts (#417) * Update profiling benchmarks * Fix linter errors --------- Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com> * [CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per module override/control (vllm-project#7086) * [Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity (vllm-project#13119) * DS V2V3 fix for same file * Lint * updating manfiest (#416) * [CI] Fix failing FP8 cpu offload test (vllm-project#13170) Signed-off-by: mgoin <mgoin64@gmail.com> * Aiter base (#419) * Using upstream FA repo. 
Building aiter in the base docker image * Renaming the file to match upstream naming * [V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort (vllm-project#13173) Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com> * [CI/Build] Ignore ruff warning up007 (vllm-project#13182) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance (vllm-project#12706) * [NVIDIA] Support nvfp4 quantization (vllm-project#12784) * [Bugfix][Example] Fix GCed profiling server for TPU (vllm-project#12792) Signed-off-by: mgoin <michael@neuralmagic.com> * [VLM] Implement merged multimodal processor for Mllama (vllm-project#11427) * Simplify logic of locating CUDART so file path (vllm-project#13203) Signed-off-by: Lu Fang <lufang@fb.com> * [Build] Automatically use the wheel of the base commit with Python-only build (vllm-project#13178) * [Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case (vllm-project#13097) * [Frontend] Move CLI code into vllm.cmd package (vllm-project#12971) * Allow Unsloth Dynamic 4bit BnB quants to work (vllm-project#12974) * [CI/Build] Allow ruff to auto-fix some issues (vllm-project#13180) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [V1][core] Implement pipeline parallel on Ray (vllm-project#12996) * [VLM] Remove input processor from clip and siglip (vllm-project#13165) * [Frontend] Pass pre-created socket to uvicorn (vllm-project#13113) * [V1] Clarify input processing and multimodal feature caching logic (vllm-project#13211) * [VLM] Merged multi-modal processor for Molmo (vllm-project#12966) * [V1][Core] Add worker_base for v1 worker (vllm-project#12816) Signed-off-by: Aoyu <aoyuzhan@amazon.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Aoyu <aoyuzhan@amazon.com> Co-authored-by: youkaichao <youkaichao@gmail.com> * [Misc] Qwen2.5-VL Optimization (vllm-project#13155) * [VLM] Separate text-only and vision variants of the same model architecture (vllm-project#13157) * [Bugfix] Missing Content Type returns 500 Internal Server Error (vllm-project#13193) * [Frontend] Add `/v1/audio/transcriptions` OpenAI API endpoint (vllm-project#12909) * Initial attempt to adjust codeowners to the ROCm fork (#420) * Applying weight padding to deepseek (#421) * Add label if pre-commit passes (vllm-project#12527) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Model] DeepSeek Tunings (#423) * fused_moe config for DSv3 on MI300X updated * Add tuning script and post processing script Signed-off-by: Randall Smith <Randall.Smith@amd.com> * Add modification to fp8_utils for tuning Signed-off-by: Randall Smith <Randall.Smith@amd.com> * update tuning script and add the configs Signed-off-by: Randall Smith <Randall.Smith@amd.com> * slightly better tunings Signed-off-by: Randall Smith <Randall.Smith@amd.com> * benchmark_moe.py is updated to generate more accurate MoE configs and a specific MoE config for DSv3 is added * Bug in sgl_moe_align_block_size() is fixed by Greg * Generate fp8_w8a8 config for MI300XHF * tunings that don't give garbage output Signed-off-by: Randall Smith <Randall.Smith@amd.com> * More accurate tunings Signed-off-by: Randall Smith <Randall.Smith@amd.com> * More accurate tunings and reject inaccurate configs Signed-off-by: Randall Smith <Randall.Smith@amd.com> * add new tunings Signed-off-by: Randall Smith <Randall.Smith@amd.com> * rename tuning script and add benchmark script to 
use for optimizing blockwise quant Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove white space from file names Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove white space from file names Signed-off-by: Randall Smith <Randall.Smith@amd.com> * Remove some unnecessary changes Signed-off-by: Randall Smith <Randall.Smith@amd.com> * don't use space in file names Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove XHF tunings Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove OAM from file name Signed-off-by: Randall Smith <Randall.Smith@amd.com> * rmeove OAM from file names Signed-off-by: Randall Smith <Randall.Smith@amd.com> * yapf Signed-off-by: Randall Smith <Randall.Smith@amd.com> * update config name Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove benchmark_moe.py changes Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove is_contiguous Signed-off-by: Randall Smith <Randall.Smith@amd.com> * use more recent fp8_utils.py Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove is_contiguous Signed-off-by: Randall Smith <Randall.Smith@amd.com> --------- Signed-off-by: Randall Smith <Randall.Smith@amd.com> Co-authored-by: qli88 <qiang.li2@amd.com> * Optimize moe_align_block_size for deepseek_v3 (vllm-project#12850) Signed-off-by: mgoin <mgoin64@gmail.com> * [Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (vllm-project#13198) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Revert "Add label if pre-commit passes" (vllm-project#13242) * [ROCm] Avoid using the default stream on ROCm (vllm-project#13238) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * [Kernel] Fix awq error when n is not divisable by 128 (vllm-project#13227) * [V1] Consolidate MM cache size to vllm.envs (vllm-project#13239) * [Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (vllm-project#13250) * [Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config (vllm-project#13237) * [Bugfix] Offline example of disaggregated prefill (vllm-project#13214) * [Misc] Remove redundant statements in scheduler.py (vllm-project#13229) * Consolidate Llama model usage in tests (vllm-project#13094) * Expand MLA to support most types of quantization (vllm-project#13181) * [V1] LoRA - Enable Serving Usecase (vllm-project#12883) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * [ROCm][V1] Add intial ROCm support to V1 (vllm-project#12790) * [Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (vllm-project#13126) * [WIP] TPU V1 Support Refactored (vllm-project#13049) * [Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch (vllm-project#12927) Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io> * [Bugfix] Fix missing parentheses (vllm-project#13263) * [Misc] Log time consumption of sleep and wake-up (vllm-project#13115) Signed-off-by: Jun Duan <jun.duan.phd@outlook.com> * [VLM] Keep track of whether prompt replacements have been applied (vllm-project#13215) * [V1] Simplify GPUModelRunner._update_states check (vllm-project#13265) * Support logit_bias in v1 Sampler (vllm-project#13079) * [Core] choice-based structured output with xgrammar (vllm-project#12632) * [Hardware][Gaudi][Bugfix] Fix error for guided decoding (vllm-project#12317) * Removing bad config (#425) * The order in the file is important. 
One needs to be explicitly be added to each following path for their ownership to apply (#427) * [Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (vllm-project#13236) Signed-off-by: mgoin <mgoin64@gmail.com> * [Core] Reduce TTFT with concurrent partial prefills (vllm-project#10235) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> * [V1][Core] min_p sampling support (vllm-project#13191) Signed-off-by: Aoyu <aoyuzhan@amazon.com> Co-authored-by: Aoyu <aoyuzhan@amazon.com> * [V1][CI] Fix failed v1-test because of min_p (vllm-project#13316) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [V1][Sampler] Don't apply temp for greedy-only (vllm-project#13311) Signed-off-by: Nick Hill <nhill@redhat.com> * [V1][PP] Fix memory profiling in PP (vllm-project#13315) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't build on ROCm (vllm-project#13235) * [Bugfix][Docs] Fix offline Whisper (vllm-project#13274) * [Bugfix] Massage MLA's usage of flash attn for RoCM (vllm-project#13310) * [BugFix] Don't scan entire cache dir when loading model (vllm-project#13302) * [Bugfix]Fix search start_index of stop_checker (vllm-project#13280) * [Bugfix] Fix qwen2.5-vl image processor (vllm-project#13286) * [V1][Metrics] Add iteration_tokens_total histogram from V0 (vllm-project#13288) * [AMD] [Model] DeepSeek tunings (vllm-project#13199) * [V1][PP] Run engine busy loop with batch queue (vllm-project#13064) * [ci/build] update flashinfer (vllm-project#13323) * [Doc] [2/N] Add Fuyu E2E example for multimodal processor (vllm-project#13331) * [V1][Spec Decode] Ngram Spec Decode (vllm-project#12193) Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> * [Quant] Add `SupportsQuant` to phi3 and clip (vllm-project#13104) * [Bugfix] Pin xgrammar to 0.1.11 (vllm-project#13338) * avoid calling hf_list_repo_files for local model Signed-off-by: isotr0py <2037008807@qq.com> * annotation Signed-off-by: isotr0py <2037008807@qq.com> * [BugFix] Enhance test_pos_encoding to support execution on multi-devices (vllm-project#13187) Signed-off-by: wchen61 <wchen61@foxmail.com> * [V1] Update doc and examples for H2O-VL (vllm-project#13349) Signed-off-by: Roger Wang <ywang@roblox.com> * [ci] skip failed tests for flashinfer (vllm-project#13352) Signed-off-by: youkaichao <youkaichao@gmail.com> * [platform] add base class for communicators (vllm-project#13208) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Bugfix] Fix 2 Node and Spec Decode tests (vllm-project#13341) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Docs] Change myenv to vllm. 
Update python_env_setup.inc.md (vllm-project#13325) * [V1][BugFix] Add __init__.py to v1/spec_decode/ (vllm-project#13359) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [V1][PP] Cache Intermediate Tensors (vllm-project#13353) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case (vllm-project#13358) Signed-off-by: Isotr0py <2037008807@qq.com> * [V1][BugFix] Clean up rejection sampler & Fix warning msg (vllm-project#13362) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [V1][Misc] Avoid unnecessary log output (vllm-project#13289) * [Feature][Spec Decode] Simplify the use of Eagle Spec Decode (vllm-project#12304) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> * Fix spelling error in index.md (vllm-project#13369) * Run v1 benchmark and integrate with PyTorch OSS benchmark database (vllm-project#13068) Signed-off-by: Huy Do <huydhn@gmail.com> * [MISC] tiny fixes (vllm-project#13378) * [VLM] Check required fields before initializing field config in `DictEmbeddingItems` (vllm-project#13380) * [Model] Support Mamba2 (Codestral Mamba) (vllm-project#9292) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com> * [Bugfix] fix xpu communicator (vllm-project#13368) Signed-off-by: yan ma <yan.ma@intel.com> * [Bugfix] Fix VLLM_USE_MODELSCOPE issue (vllm-project#13384) * Updating PR template to point people to the upstream repo. Updating codeowners (#431) * Enabling the ROCm-vLLM CI on MI250 machines (#432) * Enabling ROCm CI on MI250 machines: - correct build target - correct queue Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com> --------- Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com> * Optimization for quantized gemm skinny sizes (#411) * Optimization for quantized gemm skinny sizes * lint fix * Add support for bf16/fp16 * code cleanup * code cleanup * lint fix2 * cleanup * Moved the logic into tuned gemm to preserve API compatibility --------- Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * Restricting FP8 wvSplitk to MI300x (#439) * Remove mi300a (#440) * Removing gfx940 and gfx941 targets. 
These have been deprecated in favor of gfx942 for MI300X Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * Remove from custom kernels as well --------- Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * resolve diff for mixtral8x7B configs (#437) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * Torch version bump to fix tunable ops (#442) * Advance torch commit to be past pytorch/pytorch#144942 to fix tunable ops * Make sure to use the submodule commit compatible with the main aiter commit * bugfix: remove unused argument passed to the forward pass of ReplicatedLinear layer Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> --------- Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Lu Fang <lufang@fb.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> Signed-off-by: <> Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com> Signed-off-by: Zifei Tong <zifeitong@gmail.com> Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai> Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Ce Gao <cegao@tensorchord.ai> Signed-off-by: Mengqing Cao <cmq0113@163.com> Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: Keyun Tong <tongkeyun@gmail.com> Signed-off-by: Lingfan Yu <lingfany@amazon.com> Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com> Signed-off-by: Aoyu <aoyuzhan@amazon.com> Signed-off-by: Randall Smith <Randall.Smith@amd.com> Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io> Signed-off-by: Jun Duan <jun.duan.phd@outlook.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Signed-off-by: isotr0py <2037008807@qq.com> Signed-off-by: wchen61 <wchen61@foxmail.com> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Huy Do <huydhn@gmail.com> Signed-off-by: yan ma <yan.ma@intel.com> Signed-off-by: Alexei V. 
Ivanov <alexei.ivanov@amd.com> Signed-off-by: Divakar Verma <divakar.verma@amd.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com> Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Akash kaothalkar <61960177+Akashcodes732@users.noreply.github.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai> Co-authored-by: Rahul Tuli <rahul@neuralmagic.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> Co-authored-by: Sumit Vij <sumitvij11+github@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Kevin H. Luu <kevin@anyscale.com> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com> Co-authored-by: Jitse Klomp <jitse@jitseklomp.nl> Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: ZSL98 <36250440+ZSL98@users.noreply.github.com> Co-authored-by: zhangshulai <zhangshulai@bytedance.com> Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com> Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com> Co-authored-by: Amit Garg <mitgarg17495@gmail.com> Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Ke Zhao <yingxiongraomingzk@gmail.com> Co-authored-by: zifeitong <zifeitong@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Shaoting <shaotingf@uchicago.edu> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Jun Duan <jun.duan.phd@outlook.com> Co-authored-by: Liangfu Chen <liangfc@amazon.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Co-authored-by: Farzad Abdolhosseini <farzad.abdolhosseini@gmail.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Florian Greinacher <florian.greinacher@siemens.com> Co-authored-by: 
Ce Gao <cegao@tensorchord.ai> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Jewon Lee <105219284+je1lee@users.noreply.github.com> Co-authored-by: MoonRide303 <130458190+MoonRide303@users.noreply.github.com> Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org> Co-authored-by: sky0530 <weiching0530@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Keyun Tong <tongkeyun@gmail.com> Co-authored-by: Christian Pinto <chrpinto@gmail.com> Co-authored-by: Lingfan Yu <lingfany@amazon.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Shiyan Deng <842974287@qq.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Adrian Abeyta <adabeyta@amd.com> Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com> Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai> Co-authored-by: Yida Wu <yida.wu@amd.com> Co-authored-by: Murali Andoorveedu <37849411+andoorve@users.noreply.github.com> Co-authored-by: Kaixi Hou <kaixih@nvidia.com> Co-authored-by: LikeSundayLikeRain <monsoon1013@gmail.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Co-authored-by: Aoyu <aoyuzhang1989@gmail.com> Co-authored-by: Aoyu <aoyuzhan@amazon.com> Co-authored-by: 燃 <wulipc@163.com> Co-authored-by: Vaibhav Jain <vajain@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: rasmith <Randall.Smith@amd.com> Co-authored-by: qli88 <qiang.li2@amd.com> Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com> Co-authored-by: Wang Ran (汪然) <wrran@outlook.com> Co-authored-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: Kero Liang <kerorek@outlook.com> Co-authored-by: Alexander Matveev <59768536+alexm-redhat@users.noreply.github.com> Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io> Co-authored-by: Xu Song <xusong.vip@gmail.com> Co-authored-by: Yu-Zhou <yu.zhou@intel.com> Co-authored-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Lily Liu <lilyliupku@gmail.com> Co-authored-by: isotr0py <2037008807@qq.com> Co-authored-by: wchen61 <wchen61@foxmail.com> Co-authored-by: 凌 <i@ioioi.cn> Co-authored-by: yankooo <948162199@qq.com> Co-authored-by: Huy Do <huydhn@gmail.com> Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: r.4ntix <antix.blue@gmail.com> Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com> Co-authored-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
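Among the changes above, the V1 logprobs work describes an OpenAI-compatible output in which each generated position maps token ids to a log-probability, a 1-indexed vocabulary rank, and a detokenized string. A minimal sketch of that shape, with hypothetical values and a stand-in `Logprob` record rather than the exact vLLM classes:

```python
# Sketch of the per-position logprobs mapping described above:
# token id -> (log-probability, 1-indexed vocab rank, detokenized string).
# Illustrative only; the real vLLM v1 code builds these from the engine-core
# token-id / logprob-value / rank vectors inside its LogprobsProcessor.
from dataclasses import dataclass


@dataclass
class Logprob:  # hypothetical stand-in for vLLM's Logprob record
    logprob: float
    rank: int
    decoded_token: str


# One dict per generated position, keyed by token id (top-k candidates).
position_logprobs: dict[int, Logprob] = {
    791: Logprob(logprob=-0.12, rank=1, decoded_token="The"),
    362: Logprob(logprob=-2.73, rank=2, decoded_token="A"),
}

# A full sample-logprobs result is then a list of such dicts,
# one entry per decoded token position.
sample_logprobs: list[dict[int, Logprob]] = [position_logprobs]
print(sample_logprobs[0][791].rank)  # -> 1
```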
1 parent 4c8c86d · commit d7fefdf

1,420 files changed: +67,095 additions, -16,753 deletions

Some content is hidden: large commits have some content hidden by default, so only a subset of the 1,420 changed files is shown below.

.buildkite/check-wheel-size.py

Lines changed: 4 additions & 2 deletions
@@ -1,12 +1,14 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import os
 import sys
 import zipfile
 
-# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
 # Note that we have 400 MiB quota, please use it wisely.
 # See https://github.com/pypi/support/issues/3792 .
 # Please also sync the value with the one in Dockerfile.
-VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
 
 
 def print_top_10_largest_files(zip_file):
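The limit above feeds a straightforward size gate. A rough sketch of that pattern, assuming the script walks a directory of built wheels passed as its first argument (the real check-wheel-size.py may differ in its details):

```python
# Rough sketch of a wheel-size gate like the one configured above.
# Assumes the wheel directory is passed as argv[1]; illustrative only.
import os
import sys

# Same default/override pattern as the diff above: 400 MiB unless overridden.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))


def wheel_size_mb(path: str) -> float:
    return os.path.getsize(path) / (1024 * 1024)


def main(directory: str) -> int:
    for root, _, files in os.walk(directory):
        for name in files:
            if name.endswith(".whl"):
                size = wheel_size_mb(os.path.join(root, name))
                if size > VLLM_MAX_SIZE_MB:
                    print(f"{name}: {size:.1f} MiB exceeds {VLLM_MAX_SIZE_MB} MiB")
                    return 1
                print(f"{name}: {size:.1f} MiB within the {VLLM_MAX_SIZE_MB} MiB limit")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```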

.buildkite/generate_index.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 import os
 

(New lm-eval-harness config file; its path is hidden in the rendered diff.)

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
+model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.6353
+  - name: "exact_match,flexible-extract"
+    value: 0.637
+limit: null
+num_fewshot: null

.buildkite/lm-eval-harness/test_lm_eval_correctness.py

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: Apache-2.0
 """
 LM eval harness on model to compare vs HF baseline computed offline.
 Configs are found in configs/$MODEL.yaml
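The docstring above describes comparing measured lm-eval scores against the per-model YAML configs, such as the gsm8k file shown earlier. A minimal sketch of that comparison, assuming a relative tolerance and a hypothetical config path (the actual test's tolerance, key layout, and file names may differ):

```python
# Sketch of an lm-eval correctness check against a YAML config like the
# gsm8k one above. The RTOL value, key format, and config path are assumptions.
import yaml

RTOL = 0.05  # hypothetical relative tolerance


def check_results(measured: dict[str, float], config_path: str) -> bool:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    ok = True
    for task in config["tasks"]:
        for metric in task["metrics"]:
            expected = metric["value"]
            got = measured[f'{task["name"]}/{metric["name"]}']
            if abs(got - expected) > RTOL * expected:
                print(f'{metric["name"]}: got {got}, expected {expected}')
                ok = False
    return ok


# Example: measured scores keyed by "task/metric-name" (hypothetical format).
measured = {
    "gsm8k/exact_match,strict-match": 0.62,
    "gsm8k/exact_match,flexible-extract": 0.64,
}
print(check_results(measured, "configs/SparseLlama-3.1-8B.yaml"))  # hypothetical path
```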

.buildkite/nightly-benchmarks/README.md

Lines changed: 18 additions & 28 deletions
@@ -1,15 +1,13 @@
 # vLLM benchmark suite
 
-
 ## Introduction
 
 This directory contains two sets of benchmark for vllm.
+
 - Performance benchmark: benchmark vllm's performance under various workload, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
 - Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.
 
-
-See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
-
+See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
 
 ## Performance benchmark quick overview
 
@@ -19,17 +17,14 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
 
 **For benchmarking developers**: please try your best to constraint the duration of benchmarking to about 1 hr so that it won't take forever to run.
 
-
 ## Nightly benchmark quick overview
 
-**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
+**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
 
 **Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
 
 **Benchmarking Duration**: about 3.5hrs.
 
-
-
 ## Trigger the benchmark
 
 Performance benchmark will be triggered when:
@@ -39,16 +34,11 @@ Performance benchmark will be triggered when:
 Nightly benchmark will be triggered when:
 - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
 
-
-
-
 ## Performance benchmark details
 
-
 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 
-
-#### Latency test
+### Latency test
 
 Here is an example of one test inside `latency-tests.json`:
 
@@ -68,23 +58,25 @@ Here is an example of one test inside `latency-tests.json`:
 ```
 
 In this example:
-- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+
+- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
 
+### Throughput test
 
-#### Throughput test
 The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
 
 The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.
 
-#### Serving test
+### Serving test
+
 We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
 
-```
+```json
 [
     {
         "test_name": "serving_llama8B_tp1_sharegpt",
@@ -109,6 +101,7 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
 ```
 
 Inside this example:
+
 - The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
 - The `server-parameters` includes the command line arguments for vLLM server.
 - The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
@@ -118,36 +111,33 @@ The number of this test is less stable compared to the delay and latency benchma
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
 
-#### Visualizing the results
+### Visualizing the results
+
 The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
 You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
 
-
-
 ## Nightly test details
 
 See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
 
+### Workflow
 
-#### Workflow
-
-- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
+- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
 - Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
 - The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
 - At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
 
-#### Nightly tests
+### Nightly tests
 
 In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for benchmarking commands, together with the benchmarking test cases. The format is highly similar to performance benchmark.
 
-#### Docker containers
+### Docker containers
 
 The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
 
 WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
 
 WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
-
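To make the `parameters`-to-CLI mapping described in this README concrete, here is a small Python sketch of a hypothetical `latency-tests.json` entry and the underscore-to-dash conversion it mentions (illustrative only; the exact key spelling in the shipped JSON files may differ):

```python
# Hypothetical latency-tests.json entry, following the README's rules:
# the test name starts with "latency_", and parameter keys use underscores
# that run-performance-benchmarks.sh converts to dashes.
import json

latency_test = {
    "test_name": "latency_llama8B_tp1",  # hypothetical name
    "parameters": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "tensor_parallel_size": 1,
        "load_format": "dummy",
        "num_iters_warmup": 5,
        "num_iters": 15,
    },
}


def to_cli_args(parameters: dict) -> str:
    # Mirror of the underline -> dash conversion described above.
    return " ".join(f"--{key.replace('_', '-')} {value}"
                    for key, value in parameters.items())


print(json.dumps([latency_test], indent=4))
print("benchmark_latency.py " + to_cli_args(latency_test["parameters"]))
# -> --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy ...
```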

.buildkite/nightly-benchmarks/benchmark-pipeline.yaml

Lines changed: 6 additions & 0 deletions
@@ -70,6 +70,12 @@ steps:
     #key: block-h100
     #depends_on: ~
 
+  - label: "Cleanup H100"
+    agents:
+      queue: H100
+    depends_on: ~
+    command: docker system prune -a --volumes --force
+
   - label: "H100"
     # skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
     agents:

.buildkite/nightly-benchmarks/nightly-annotation.md

Lines changed: 10 additions & 11 deletions
@@ -9,20 +9,19 @@ This file contains the downloading link for benchmarking results.
 
 Please download the visualization scripts in the post
 
-
 ## Results reproduction
 
 - Find the docker we use in `benchmarking pipeline`
 - Deploy the docker, and inside the docker:
-  - Download `nightly-benchmarks.zip`.
-  - In the same folder, run the following code
-  ```
-  export HF_TOKEN=<your HF token>
-  apt update
-  apt install -y git
-  unzip nightly-benchmarks.zip
-  VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
-  ```
+  - Download `nightly-benchmarks.zip`.
+  - In the same folder, run the following code:
 
-And the results will be inside `./benchmarks/results`.
+  ```console
+  export HF_TOKEN=<your HF token>
+  apt update
+  apt install -y git
+  unzip nightly-benchmarks.zip
+  VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
+  ```
 
+And the results will be inside `./benchmarks/results`.

.buildkite/nightly-benchmarks/nightly-descriptions.md

Lines changed: 3 additions & 3 deletions
@@ -2,14 +2,14 @@
 # Nightly benchmark
 
 This benchmark aims to:
+
 - Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
 - Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.
 
 Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
 
 Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
 
-
 ## Setup
 
 - Docker images:
@@ -33,7 +33,7 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
 - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
 - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
 
-# Known issues
+## Known issues
 
 - TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
-- TGI does not support `ignore-eos` flag.
+- TGI does not support `ignore-eos` flag.

.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md

Lines changed: 2 additions & 8 deletions
@@ -7,10 +7,8 @@
 - Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).
 
-
 {latency_tests_markdown_table}
 
-
 ## Throughput tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@@ -19,10 +17,8 @@
 - Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput.
 
-
 {throughput_tests_markdown_table}
 
-
 ## Serving tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@@ -33,13 +29,11 @@
 - We also added a speculative decoding test for llama-3 70B, under QPS 2
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
 
-
 {serving_tests_markdown_table}
 
-
 ## json version of the benchmarking tables
 
-This section contains the data of the markdown tables above in JSON format.
+This section contains the data of the markdown tables above in JSON format.
 You can load the benchmarking tables into pandas dataframes as follows:
 
 ```python
@@ -54,9 +48,9 @@ serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
 ```
 
 The json string for all benchmarking tables:
+
 ```json
 {benchmarking_results_in_json_string}
 ```
 
 You can also check the raw experiment data in the Artifact tab of the Buildkite page.
-

.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import json
 import os
 from pathlib import Path

.buildkite/nightly-benchmarks/scripts/download-tokenizer.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 
 from transformers import AutoTokenizer

.buildkite/nightly-benchmarks/scripts/generate-nightly-markdown.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 import json
 from pathlib import Path

.buildkite/nightly-benchmarks/scripts/get-lmdeploy-modelname.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 from lmdeploy.serve.openai.api_client import APIClient
 
 api_client = APIClient("http://localhost:8000")

.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh

Lines changed: 5 additions & 0 deletions
@@ -345,6 +345,11 @@ main() {
   check_gpus
   check_hf_token
 
+  # Set to v1 to run v1 benchmark
+  if [[ "${ENGINE_VERSION:-v0}" == "v1" ]]; then
+    export VLLM_USE_V1=1
+  fi
+
   # dependencies
   (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
   (which jq) || (apt-get update && apt-get -y install jq)
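The gate added above selects the V1 benchmark run purely through environment variables: setting `ENGINE_VERSION=v1` when invoking the script exports `VLLM_USE_V1=1` for the rest of the run. A small Python sketch of the same gating logic, for illustration only:

```python
# Illustrative mirror of the shell gate added above; the real selection
# happens in run-performance-benchmarks.sh via ENGINE_VERSION=v1.
import os

engine_version = os.environ.get("ENGINE_VERSION", "v0")  # same default as the shell
if engine_version == "v1":
    os.environ["VLLM_USE_V1"] = "1"  # opts the benchmark into the V1 engine

print(f"ENGINE_VERSION={engine_version}, "
      f"VLLM_USE_V1={os.environ.get('VLLM_USE_V1', '<unset>')}")
```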

.buildkite/nightly-benchmarks/scripts/summary-nightly-results.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import datetime
 import json
 import os

.buildkite/nightly-benchmarks/tests/latency-tests.json

Lines changed: 1 addition & 1 deletion
@@ -29,4 +29,4 @@
             "num-iters": 15
         }
     }
-]
+]

.buildkite/release-pipeline.yaml

Lines changed: 7 additions & 2 deletions
@@ -56,6 +56,11 @@ steps:
     env:
       DOCKER_BUILDKIT: "1"
 
+  - input: "Provide Release version here"
+    fields:
+      - text: "What is the release version?"
+        key: "release-version"
+
   - block: "Build CPU release image"
     key: block-cpu-release-image-build
     depends_on: ~
@@ -66,7 +71,7 @@
       queue: cpu_queue_postmerge
     commands:
       - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION --progress plain -f Dockerfile.cpu ."
-      - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION"
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --progress plain -f Dockerfile.cpu ."
+      - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
     env:
       DOCKER_BUILDKIT: "1"

.buildkite/run-gh200-test.sh

Lines changed: 2 additions & 2 deletions
@@ -23,6 +23,6 @@ trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image and test offline inference
-docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
-    python3 examples/offline_inference/basic.py
+docker run -e HF_TOKEN -v /root/.cache/huggingface:/root/.cache/huggingface --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
+    python3 examples/offline_inference/cli.py --model meta-llama/Llama-3.2-1B
 '

.buildkite/run-neuron-test.sh

Lines changed: 1 addition & 4 deletions
@@ -29,9 +29,6 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
         docker image prune -f
         # Remove unused volumes / force the system prune for old images as well.
         docker volume prune -f && docker system prune -f
-        # Remove huggingface model artifacts and compiler cache
-        rm -rf "${HF_MOUNT:?}/*"
-        rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
         echo "$current_time" > /tmp/neuron-docker-build-timestamp
     fi
 else
@@ -54,4 +51,4 @@ docker run --rm -it --device=/dev/neuron0 --device=/dev/neuron1 --network host \
        -e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
        --name "${container_name}" \
        ${image_name} \
-       /bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py"
+       /bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py && python3 -m pytest /workspace/vllm/tests/neuron/ -v --capture=tee-sys"

.buildkite/run-tpu-test.sh

File mode changed: 100644 → 100755.
