[WIP][Metrics] Re-work approach to LoRA metrics #13303

Closed

Conversation

markmc
Member

@markmc markmc commented Feb 14, 2025

Part of #10582 and discussed in #12745

The current vllm:lora_requests_info Gauge is somewhat similar to an Info metric (like cache_config_info), except that the value is the current wall-clock time and it is updated every iteration.

The label names used are:

  • running_lora_adapters: a list of adapters with running requests, formatted as a comma-separated string.
  • waiting_lora_adapters: similar, except listing adapters with requests waiting to be scheduled.
  • max_lora: the static "max number of LoRAs in a single batch" configuration.

It looks like this:

```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

I can't really make much sense of this. Encoding a running/waiting status for multiple adapters in a comma-separated string seems quite misguided - we should use labels to distinguish between per-adapter counts instead:

```
vllm:num_lora_requests_running{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
vllm:num_lora_requests_waiting{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
```
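
For illustration only, a minimal sketch (not the actual vLLM implementation) of how such per-adapter gauges could be published with prometheus_client; the `record_lora_stats` helper and its per-adapter count inputs are assumptions about where the data would come from:

```python
# Minimal sketch only -- not the actual vLLM implementation.
# Assumes per-adapter running/waiting counts are already available from
# scheduler state; the helper and its inputs are illustrative.
from prometheus_client import Gauge

gauge_lora_running = Gauge(
    "vllm:num_lora_requests_running",
    "Number of running requests per LoRA adapter.",
    labelnames=["model_name", "lora_name"],
)
gauge_lora_waiting = Gauge(
    "vllm:num_lora_requests_waiting",
    "Number of waiting requests per LoRA adapter.",
    labelnames=["model_name", "lora_name"],
)

def record_lora_stats(model_name: str,
                      running: dict[str, int],
                      waiting: dict[str, int]) -> None:
    # One sample per adapter; labels replace the comma-separated encoding.
    for lora_name, count in running.items():
        gauge_lora_running.labels(model_name=model_name,
                                  lora_name=lora_name).set(count)
    for lora_name, count in waiting.items():
        gauge_lora_waiting.labels(model_name=model_name,
                                  lora_name=lora_name).set(count)

# Would produce the two samples shown above.
record_lora_stats("meta-llama/Llama-3.1-8B-Instruct",
                  running={"test-lora": 8}, waiting={"test-lora": 7})
```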

This was added in #9477 and there is at least one known user. If we revisit this design and deprecate the old metric, we can reduce the need for a long deprecation period by making the change in V0 as well and asking that project to move to the new metric.

TODO:

  • Add a lora config info gauge (max_loras, max_lora_rank) - see the sketch after this list
  • Add more unit test coverage of the new metrics
  • Add the new metrics to V1
  • Add the old metric to V0 (to ease the transition to V1)
  • Deprecate the old metrics
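
As a rough illustration of the first TODO item, here is a sketch of what a LoRA config info gauge could look like if it follows the cache_config_info pattern of encoding static configuration as labels on a constant-valued gauge; the metric and label names are assumptions, not a settled design:

```python
# Sketch of an Info-style gauge for static LoRA configuration, following
# the cache_config_info pattern (configuration as labels, constant value 1).
# The metric and label names here are assumptions, not a settled design.
from prometheus_client import Gauge

lora_config_info = Gauge(
    "vllm:lora_config_info",
    "Information about the LoRA configuration.",
    labelnames=["max_loras", "max_lora_rank"],
)

def record_lora_config(max_loras: int, max_lora_rank: int) -> None:
    # Info-style metrics carry their payload in labels; the value is always 1.
    lora_config_info.labels(max_loras=str(max_loras),
                            max_lora_rank=str(max_lora_rank)).set(1)

record_lora_config(max_loras=1, max_lora_rank=16)
```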


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@markmc markmc force-pushed the metrics-v1-lora-metrics branch from 3382c66 to afd51dc on February 14, 2025 20:05
@markmc
Member Author

markmc commented Feb 14, 2025

You could argue either:

  1. We don't need per-adapter counts at all, just an info metric (like cache_config_info) that lists the configured adapters, or

  2. Most of our metrics should be per-adapter ... just break them down per-adapter and label with lora_name=

@markmc
Member Author

markmc commented Feb 15, 2025

See also #6275

@markmc
Member Author

markmc commented Feb 18, 2025

ok, I took a closer look at what the Gateway API Inference Extension is doing with this metric. I've filed kubernetes-sigs/gateway-api-inference-extension#354 to invite feedback from that project.

The premise is this:

route to a model server that has the adapter already loaded, so long as there is batch capacity

and the way the metric is described:

Metric name implemented in vLLM: vllm:lora_requests_info
Metric type: Gauge
Metric value: The last updated timestamp - so the Endpoint Picker (EPP) can find the latest
Metric labels:
max_lora: The maximum number of adapters that can be loaded to GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the requested adapter. Example: "max_lora": "8".
running_lora_adapters: A comma separated list of adapters that are currently loaded in GPU memory and ready to serve requests. Example: "running_lora_adapters": "adapter1, adapter2"
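
To make the consumption model concrete, here is a rough sketch (not the actual Endpoint Picker code) of what a consumer has to do with this encoding: pick the freshest sample by its timestamp value, then split the comma-separated label back into adapter names:

```python
# Rough sketch (not the actual EPP code) of consuming the current metric:
# pick the sample with the newest timestamp value, then split the
# comma-separated adapter list apart again.
from prometheus_client.parser import text_string_to_metric_families

def running_adapters(metrics_text: str) -> set[str]:
    latest = None
    for family in text_string_to_metric_families(metrics_text):
        if family.name != "vllm:lora_requests_info":
            continue
        for sample in family.samples:
            # The gauge value is a wall-clock timestamp; the newest sample wins.
            if latest is None or sample.value > latest.value:
                latest = sample
    if latest is None:
        return set()
    raw = latest.labels.get("running_lora_adapters", "")
    return {name.strip() for name in raw.split(",") if name.strip()}
```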


@markmc
Member Author

markmc commented Feb 18, 2025

Given the way LRUCacheWorkerLoRAManager works, the current V0 metric implementation and what's proposed here in V1 both miss an important point: even if there were no requests for a given LoRA included in the most recent batch, that LoRA's weights could still be loaded on GPUs and it would still be efficient to route requests for it to this vLLM instance.
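
For illustration, a hypothetical sketch of a metric that would capture this: a per-adapter gauge reflecting whether an adapter's weights are currently resident on the GPU, fed from whatever the LoRA manager reports as loaded. The metric name and the loaded/known adapter inputs are made up for this sketch, not existing vLLM or LRUCacheWorkerLoRAManager APIs:

```python
# Hypothetical sketch: expose which adapters are resident on the GPU,
# independent of whether the last batch contained requests for them.
# The metric name and the loaded/known adapter inputs are stand-ins,
# not existing vLLM or LRUCacheWorkerLoRAManager APIs.
from prometheus_client import Gauge

lora_adapter_loaded = Gauge(
    "vllm:lora_adapter_loaded",
    "1 if the adapter's weights are currently resident on the GPU.",
    labelnames=["model_name", "lora_name"],
)

def publish_loaded_adapters(model_name: str,
                            loaded_adapter_names: set[str],
                            known_adapter_names: set[str]) -> None:
    for name in known_adapter_names:
        lora_adapter_loaded.labels(model_name=model_name, lora_name=name).set(
            1 if name in loaded_adapter_names else 0)
```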

@varun-sundar-rabindranath
Contributor

varun-sundar-rabindranath commented Feb 18, 2025

Hi @markmc! Thanks for doing this!
I looked through #6275. On top of what you propose (adapters + counts), the metrics proposed there look very informative.

From the RFC:

 - Loading and unloading times for LoRA adapters.
 - Memory and compute resource usage by LoRA adapters.
 - Performance impact on base models when using LoRA adapters.

About,

Loading and unloading times for LoRA adapters.

I think this is good information. The max_loras value, combined with information about the number of running/waiting LoRAs, will inform users about the dynamics of LoRA loads. I believe the load times could serve as good secondary information.
Side note: in V1, the LoRA adapter loads are triggered under the `if self.lora_config:` branch during input preparation. In this vein, i.e. informing users about the input preparation time, I think we should consider exposing the run time of the `_prepare_inputs` function in general.
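
A rough sketch of the kind of instrumentation being suggested here, i.e. timing input preparation with a histogram; the metric name, buckets, and wrapper are illustrative assumptions, not a concrete vLLM proposal:

```python
# Rough sketch of the suggestion above: time input preparation with a
# histogram. The metric name, buckets, and wrapper are illustrative only.
import time

from prometheus_client import Histogram

prepare_inputs_seconds = Histogram(
    "vllm:prepare_inputs_seconds",
    "Time spent preparing model inputs (including any LoRA adapter loads).",
    buckets=[0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
)

def timed_prepare_inputs(prepare_inputs_fn, *args, **kwargs):
    start = time.perf_counter()
    try:
        return prepare_inputs_fn(*args, **kwargs)
    finally:
        prepare_inputs_seconds.observe(time.perf_counter() - start)
```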

About,

Memory and compute resource usage by LoRA adapters.

For memory usage, we already profile the memory usage of a forward pass when determining the available memory for the KV cache. However, that is not granular enough to inform about the memory usage of particular LoRA adapters. Also, since the max_loras engine argument effectively limits the number of LoRA adapters in use, I believe the memory usage by LoRA adapters would be constant and wouldn't change at runtime.
For compute usage by LoRA adapters, I am not sure how we would do this at this granularity. Perhaps we can add a "gpu-utilization" metric to Stats for users to infer from.
I believe this set of metrics is less important, and we can perhaps tackle the memory and compute usage metrics after the first cut.

About,

even if there were no requests for a given LoRA included in the most recent batch, that LoRA's weights could still be loaded on GPUs and it would still be efficient to route requests for it to this vLLM instance.
and,
Performance impact on base models when using LoRA adapters.

I believe the existing iteration-level metrics (the `# Iteration stats` block) and what you proposed (adapters + counts), combined, should inform the user of this.

@markmc
Member Author

markmc commented Feb 18, 2025

Implemented the V0 metric in V1 in #13504

markmc added a commit to markmc/vllm that referenced this pull request Feb 24, 2025
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

As discussed in vllm-project#13303, this metric perhaps isn't the most ideal
solution for the use case but, given there is an existing user,
we should retain compatibility in V1 and deprecate it if we
replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

mergify bot commented Feb 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 25, 2025
@markmc
Member Author

markmc commented Apr 4, 2025

I've filed kubernetes-sigs/gateway-api-inference-extension#354 to invite feedback from that project.

Deferring for now, based on the feedback above.

@markmc markmc closed this Apr 4, 2025