[WIP][Metrics] Re-work approach to LoRA metrics #13303

Closed

Conversation

markmc
Member

@markmc markmc commented Feb 14, 2025

Part of #10582 and discussed in #12745

The current vllm:lora_requests_info Gauge is somewhat similar to an Info metric (like cache_config_info), except that the value is the current wall-clock time and it is updated every iteration.

The label names used are:

  • running_lora_adapters: a list of adapters with running requests, formatted as a comma-separated string.
  • waiting_lora_adapters: similar, except listing adapters with requests waiting to be scheduled.
  • max_lora: the static "max number of LoRAs in a single batch" configuration.

It looks like this:

```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

I can't really make much sense of this. Encoding a running/waiting status for multiple adapters in a comma-separated string seems quite misguided - we should use labels to distinguish between per-adapter counts instead:

```
vllm:num_lora_requests_running{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
vllm:num_lora_requests_waiting{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
```
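
For illustration only, a minimal sketch (not the actual vLLM implementation) of how such per-adapter gauges could be published with prometheus_client; the `record_lora_stats` helper and its per-adapter count inputs are assumptions about where the data would come from:

```python
# Minimal sketch only -- not the actual vLLM implementation.
# Assumes per-adapter running/waiting counts are already available from
# scheduler state; the helper and its inputs are illustrative.
from prometheus_client import Gauge

gauge_lora_running = Gauge(
    "vllm:num_lora_requests_running",
    "Number of running requests per LoRA adapter.",
    labelnames=["model_name", "lora_name"],
)
gauge_lora_waiting = Gauge(
    "vllm:num_lora_requests_waiting",
    "Number of waiting requests per LoRA adapter.",
    labelnames=["model_name", "lora_name"],
)

def record_lora_stats(model_name: str,
                      running: dict[str, int],
                      waiting: dict[str, int]) -> None:
    # One sample per adapter; labels replace the comma-separated encoding.
    for lora_name, count in running.items():
        gauge_lora_running.labels(model_name=model_name,
                                  lora_name=lora_name).set(count)
    for lora_name, count in waiting.items():
        gauge_lora_waiting.labels(model_name=model_name,
                                  lora_name=lora_name).set(count)

# Would produce the two samples shown above.
record_lora_stats("meta-llama/Llama-3.1-8B-Instruct",
                  running={"test-lora": 8}, waiting={"test-lora": 7})
```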

This was added in #9477 and there is at least one known user. If we revisit this design and deprecate the old metric, we can reduce the need for a long deprecation period by making the change in V0 as well and asking that project to move to the new metric.

TODO:

  • Add a lora config info gauge (max_loras, max_lora_rank) - see the sketch after this list
  • Add more unit test coverage of the new metrics
  • Add the new metrics to V1
  • Add the old metric to V0 (to ease the transition to V1)
  • Deprecate the old metrics
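
As a rough illustration of the first TODO item, here is a sketch of what a LoRA config info gauge could look like if it follows the cache_config_info pattern of encoding static configuration as labels on a constant-valued gauge; the metric and label names are assumptions, not a settled design:

```python
# Sketch of an Info-style gauge for static LoRA configuration, following
# the cache_config_info pattern (configuration as labels, constant value 1).
# The metric and label names here are assumptions, not a settled design.
from prometheus_client import Gauge

lora_config_info = Gauge(
    "vllm:lora_config_info",
    "Information about the LoRA configuration.",
    labelnames=["max_loras", "max_lora_rank"],
)

def record_lora_config(max_loras: int, max_lora_rank: int) -> None:
    # Info-style metrics carry their payload in labels; the value is always 1.
    lora_config_info.labels(max_loras=str(max_loras),
                            max_lora_rank=str(max_lora_rank)).set(1)

record_lora_config(max_loras=1, max_lora_rank=16)
```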


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@markmc markmc force-pushed the metrics-v1-lora-metrics branch from 3382c66 to afd51dc on February 14, 2025 20:05
@markmc
Member Author

markmc commented Feb 14, 2025

You could argue either:

  1. We don't need per-adapter counts at all, just an info metric (like cache_config_info) that lists the configured adapters, or

  2. Most of our metrics should be per-adapter ... just break them down per-adapter and label with lora_name=

@markmc
Member Author

markmc commented Feb 15, 2025

See also #6275

@markmc
Member Author

markmc commented Feb 18, 2025

ok, I took a closer look at what the Gateway API Inference Extension is doing with this metric. I've filed kubernetes-sigs/gateway-api-inference-extension#354 to invite feedback from that project.

The premise is this:

route to a model server that has the adapter already loaded, so long as there is batch capacity

and the way the metric is described:

Metric name implemented in vLLM: vllm:lora_requests_info
Metric type: Gauge
Metric value: The last updated timestamp - so the Endpoint Picker (EPP) can find the latest
Metric labels:
max_lora: The maximum number of adapters that can be loaded to GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the requested adapter. Example: "max_lora": "8".
running_lora_adapters: A comma separated list of adapters that are currently loaded in GPU memory and ready to serve requests. Example: "running_lora_adapters": "adapter1, adapter2"
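
To make the consumption model concrete, here is a rough sketch (not the actual Endpoint Picker code) of what a consumer has to do with this encoding: pick the freshest sample by its timestamp value, then split the comma-separated label back into adapter names:

```python
# Rough sketch (not the actual EPP code) of consuming the current metric:
# pick the sample with the newest timestamp value, then split the
# comma-separated adapter list apart again.
from prometheus_client.parser import text_string_to_metric_families

def running_adapters(metrics_text: str) -> set[str]:
    latest = None
    for family in text_string_to_metric_families(metrics_text):
        if family.name != "vllm:lora_requests_info":
            continue
        for sample in family.samples:
            # The gauge value is a wall-clock timestamp; the newest sample wins.
            if latest is None or sample.value > latest.value:
                latest = sample
    if latest is None:
        return set()
    raw = latest.labels.get("running_lora_adapters", "")
    return {name.strip() for name in raw.split(",") if name.strip()}
```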


@markmc
Member Author

markmc commented Feb 18, 2025

Given the way LRUCacheWorkerLoRAManager works, the current V0 metric implementation and what's proposed here in V1 both miss an important point: even if there were no requests for a given LoRA included in the most recent batch, that LoRA's weights could still be loaded on GPUs and it would still be efficient to route requests for it to this vLLM instance.
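
For illustration, a hypothetical sketch of a metric that would capture this: a per-adapter gauge reflecting whether an adapter's weights are currently resident on the GPU, fed from whatever the LoRA manager reports as loaded. The metric name and the loaded/known adapter inputs are made up for this sketch, not existing vLLM or LRUCacheWorkerLoRAManager APIs:

```python
# Hypothetical sketch: expose which adapters are resident on the GPU,
# independent of whether the last batch contained requests for them.
# The metric name and the loaded/known adapter inputs are stand-ins,
# not existing vLLM or LRUCacheWorkerLoRAManager APIs.
from prometheus_client import Gauge

lora_adapter_loaded = Gauge(
    "vllm:lora_adapter_loaded",
    "1 if the adapter's weights are currently resident on the GPU.",
    labelnames=["model_name", "lora_name"],
)

def publish_loaded_adapters(model_name: str,
                            loaded_adapter_names: set[str],
                            known_adapter_names: set[str]) -> None:
    for name in known_adapter_names:
        lora_adapter_loaded.labels(model_name=model_name, lora_name=name).set(
            1 if name in loaded_adapter_names else 0)
```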

@varun-sundar-rabindranath
Contributor

varun-sundar-rabindranath commented Feb 18, 2025

Hi @markmc! Thanks for doing this!
I looked through #6275. On top of what you propose (adapters + counts), the metrics proposed there look very informative.

From the RFC:

 - Loading and unloading times for LoRA adapters.
 - Memory and compute resource usage by LoRA adapters.
 - Performance impact on base models when using LoRA adapters.

About,

Loading and unloading times for LoRA adapters.

I think this is good information. The max_loras value, combined with information about the number of running/waiting LoRAs, will inform users about the dynamics of LoRA loads. I believe the load times could serve as good secondary information.
Side note: in V1, the LoRA adapter loads are triggered under the `if self.lora_config:` branch during input preparation. In this vein, i.e. informing users about the input preparation time, I think we should consider exposing the run time of the `_prepare_inputs` function in general.
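
A rough sketch of the kind of instrumentation being suggested here, i.e. timing input preparation with a histogram; the metric name, buckets, and wrapper are illustrative assumptions, not a concrete vLLM proposal:

```python
# Rough sketch of the suggestion above: time input preparation with a
# histogram. The metric name, buckets, and wrapper are illustrative only.
import time

from prometheus_client import Histogram

prepare_inputs_seconds = Histogram(
    "vllm:prepare_inputs_seconds",
    "Time spent preparing model inputs (including any LoRA adapter loads).",
    buckets=[0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
)

def timed_prepare_inputs(prepare_inputs_fn, *args, **kwargs):
    start = time.perf_counter()
    try:
        return prepare_inputs_fn(*args, **kwargs)
    finally:
        prepare_inputs_seconds.observe(time.perf_counter() - start)
```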

About,

Memory and compute resource usage by LoRA adapters.

For memory usage, we already profile the memory usage of a forward pass when determining the available memory for the KV cache. However, that is not granular enough to inform about the memory usage of particular LoRA adapters. Also, since the max_loras engine argument effectively limits the number of LoRA adapters in use, I believe the memory usage by LoRA adapters would be constant and wouldn't change at runtime.
For compute usage by LoRA adapters, I am not sure how we would do this at this granularity. Perhaps we can add a "gpu-utilization" metric to Stats for users to infer from.
I believe this set of metrics is less important, and we can perhaps tackle the memory and compute usage metrics after the first cut.

About,

even if there were no requests for a given LoRA included in the most recent batch, that LoRA's weights could still be loaded on GPUs and it would still be efficient to route requests for it to this vLLM instance.
and,
Performance impact on base models when using LoRA adapters.

I believe the existing iteration-level metrics (the `# Iteration stats` block) and what you proposed (adapters + counts), combined, should inform the user of this.

@markmc
Member Author

markmc commented Feb 18, 2025

Implemented the V0 metric in V1 in #13504

markmc added a commit to markmc/vllm that referenced this pull request Feb 24, 2025
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

As discussed in vllm-project#13303, this metric perhaps isn't the most ideal
solution for the use case but, given there is an existing user,
we should retain compatibility in V1 and deprecate it if we
replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

mergify bot commented Feb 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 25, 2025
@markmc
Member Author

markmc commented Apr 4, 2025

I've filed kubernetes-sigs/gateway-api-inference-extension#354 to invite feedback from that project.

Deferring for now, based on the feedback above.

@markmc markmc closed this Apr 4, 2025