[WIP][Metrics] Re-work approach to LoRA metrics #13303
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
The current `vllm:lora_requests_info` Gauge is somewhat similar to an Info metric (like `cache_config_info`) except the value is the current wall-clock time, and is updated every iteration.

The label names used are:

- `running_lora_adapters`: a list of adapters with running requests, formatted as a comma-separated string.
- `waiting_lora_adapters`: similar, except listing adapters with requests waiting to be scheduled.
- `max_lora`: the static "max number of LoRAs in a single batch" configuration.

It looks like this:

```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

I can't really make much sense of this. Encoding a running/waiting status for multiple adapters in a comma-separated string seems quite misguided - we should use labels to distinguish between per-adapter counts instead:

```
vllm:num_lora_requests_running{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
vllm:num_lora_requests_waiting{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
```

This was added in vllm-project#9477 and there is at least one known user. If we revisit this design and deprecate the old metric, we should reduce the need for a significant deprecation period by making the change in v0 also and asking this project to move to the new metric.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
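For illustration, a minimal sketch of how the proposed per-adapter gauges could be emitted with `prometheus_client`; the metric and label names are taken from the example above, while `record_lora_stats` and its per-adapter count dicts are hypothetical plumbing:

```python
# Sketch only: metric and label names are taken from the proposal above;
# record_lora_stats and its count dicts are hypothetical plumbing.
from prometheus_client import Gauge

gauge_lora_running = Gauge(
    "vllm:num_lora_requests_running",
    "Number of running requests per LoRA adapter.",
    labelnames=["lora_name", "model_name"],
)
gauge_lora_waiting = Gauge(
    "vllm:num_lora_requests_waiting",
    "Number of waiting requests per LoRA adapter.",
    labelnames=["lora_name", "model_name"],
)

def record_lora_stats(model_name: str,
                      running_counts: dict[str, int],
                      waiting_counts: dict[str, int]) -> None:
    # One labelled sample per adapter, instead of packing every adapter
    # name into a single comma-separated label value.
    for lora_name, count in running_counts.items():
        gauge_lora_running.labels(lora_name, model_name).set(count)
    for lora_name, count in waiting_counts.items():
        gauge_lora_waiting.labels(lora_name, model_name).set(count)
```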
Force-pushed from `3382c66` to `afd51dc`
You could argue either:
See also #6275
ok, I took a closer look at what the Gateway API Inference Extension is doing with this metric. I've filed kubernetes-sigs/gateway-api-inference-extension#354 to invite feedback from that project. The premise is this:
and the way the metric is described:
References:
Given the way
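As a side note on the encoding itself, a minimal sketch (illustrative Python, not the Gateway API Inference Extension's actual code) of what consuming the comma-separated label values involves:

```python
# Sketch only: illustrates the special-casing the comma-separated encoding
# forces on every consumer of this metric.
def parse_adapters(label_value: str) -> list[str]:
    # "".split(",") returns [""], so an empty label must be special-cased.
    return label_value.split(",") if label_value else []

sample = {"running_lora_adapters": "test-lora", "waiting_lora_adapters": ""}
running = parse_adapters(sample["running_lora_adapters"])  # ["test-lora"]
waiting = parse_adapters(sample["waiting_lora_adapters"])  # []
```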
Hi @markmc! Thanks for doing this! From the RFC:
About,
I think this is good information. The `max_lora` value, combined with information about the number of running/waiting LoRAs, will inform users about the dynamics of LoRA loads. I believe the load times could serve as good secondary information (see `vllm/v1/worker/gpu_model_runner.py`, line 581 at 3809458, and the `_prepare_inputs` function in general).
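If load times were exposed as a metric, it could look something like the following sketch; the metric name, buckets, and `load_adapter` callable are assumptions, not existing vLLM code:

```python
# Sketch only: vllm:lora_load_seconds and load_adapter() are hypothetical
# names, not existing vLLM code; the buckets are illustrative.
import time
from typing import Callable

from prometheus_client import Histogram

histogram_lora_load = Histogram(
    "vllm:lora_load_seconds",
    "Time spent loading a LoRA adapter onto the device.",
    labelnames=["lora_name"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
)

def timed_load(lora_name: str, load_adapter: Callable[[str], None]) -> None:
    # Time the adapter-load step (e.g. around the point where adapters are
    # activated in _prepare_inputs) and record it per adapter.
    start = time.perf_counter()
    load_adapter(lora_name)
    histogram_lora_load.labels(lora_name).observe(time.perf_counter() - start)
```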
About,
For memory usage, we already profile the memory usage for a forward pass when determining the available memory for the KV cache. However, it is not granular enough to inform about the memory usage of particular LoRA adapters.
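For what a more granular per-adapter measurement might involve, a hedged sketch; `load_adapter` is a hypothetical callable, and this only captures the allocation delta on a single CUDA device:

```python
# Sketch only: load_adapter is a hypothetical callable; this captures the
# CUDA allocation delta around a single adapter load, ignoring
# fragmentation and host-side memory.
from typing import Callable

import torch

def adapter_memory_delta(load_adapter: Callable[[str], None],
                         lora_name: str) -> int:
    torch.cuda.synchronize()
    before = torch.cuda.memory_allocated()
    load_adapter(lora_name)
    torch.cuda.synchronize()
    # Bytes newly allocated while loading this particular adapter.
    return torch.cuda.memory_allocated() - before
```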
About,

I believe the existing iteration-level metrics (see `vllm/engine/llm_engine.py`, line 1600 at 3809458)
Implemented the V0 metric in V1 in #13504
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

As discussed in vllm-project#13303, this metric perhaps isn't the most ideal solution for the use case but, given there is an existing user, we should retain compatibility in V1 and deprecate it if we replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
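For reference, a minimal sketch of how the retained V0-style gauge could be emitted; the label layout (comma-separated adapter lists) and the wall-clock value match the samples shown above, while the helper name and its inputs are assumptions:

```python
# Sketch only: emit_lora_requests_info is a hypothetical helper; the label
# layout and wall-clock value match the metric samples shown above.
import time

from prometheus_client import Gauge

gauge_lora_info = Gauge(
    "vllm:lora_requests_info",
    "Running/waiting LoRA adapters, V0-compatible info-style gauge.",
    labelnames=["max_lora", "running_lora_adapters", "waiting_lora_adapters"],
)

def emit_lora_requests_info(max_lora: int,
                            running: list[str],
                            waiting: list[str]) -> None:
    # The gauge value is the current wall-clock time, updated every iteration.
    gauge_lora_info.labels(
        str(max_lora),
        ",".join(running),
        ",".join(waiting),
    ).set(time.time())
```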
This pull request has merge conflicts that must be resolved before it can be merged.
Deferring for now, based on the feedback above.
Part of #10582 and discussed in #12745
TODO: