[V1][Metrics] Implement vllm:lora_requests_info metric #13504
Conversation
vllm/v1/engine/output_processor.py (outdated)

```diff
@@ -241,11 +255,22 @@ def _update_stats_from_finished(self, req_state: RequestState,
         if iteration_stats is None:
             return
 
+        lora_stats: Optional[LoRAStats] = None
+        if req_state.lora_name:
```
This can just be `lora_stats = self.lora_stats.get(req_state.lora_name, None)`.
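For illustration, a runnable toy version of the suggested simplification; `LoRAStats` and the dict are stand-ins, and the body of the `if` branch is an assumption since the diff above truncates it:

```python
from typing import Optional


class LoRAStats:  # stand-in for the real stats class
    pass


lora_stats_by_name: dict[str, LoRAStats] = {"test-lora": LoRAStats()}
lora_name: Optional[str] = None

# Before: explicit None default plus a conditional lookup.
lora_stats: Optional[LoRAStats] = None
if lora_name:
    lora_stats = lora_stats_by_name[lora_name]

# After: dict.get collapses this to one line, and returns None both
# when lora_name is None and when the adapter has no entry yet.
lora_stats = lora_stats_by_name.get(lora_name, None)
```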
vllm/v1/metrics/stats.py (outdated)

```diff
@@ -151,3 +168,11 @@ def update_from_finished_request(self, finish_reason: "FinishReason",
             inference_time=inference_time,
             decode_time=decode_time)
         self.finished_requests.append(finished_req)
+
+        if lora_stats is not None:
```
One state transition which may not be accounted for here is when a request is aborted. Aborts occur when the client disconnects from the server: the `AsyncLLM` sends an abort message to the `OutputProcessor` and to the `EngineCore`. This may cause the request to be finished, but I am not 100% sure that it does, which could cause the request to never be removed from `running_requests`. We should double check this.

Additionally, if a client aborts while the request is in `lora_stats.waiting_requests`, it will never be removed from the waiting set, since the request will never be scheduled.
Great point, take a look now - basically I've tried to couple this closely with the `RequestState` lifecycle.
Generally looks good. Left a small nit that will help clean up the API a bit: `EngineCoreRequest` already contains `lora_request`, so we do not need to pass it around as a separate arg.

The one functionality issue I see is that we do not handle request aborting properly. We can connect offline about this, but I left a detailed comment below.
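A hypothetical sketch of that API cleanup, with stand-in dataclasses rather than vLLM's real types (the field names are assumptions based on the discussion):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRARequest:          # stand-in for vllm.lora.request.LoRARequest
    lora_name: str


@dataclass
class EngineCoreRequest:    # stand-in; the real class carries many more fields
    request_id: str
    lora_request: Optional[LoRARequest] = None


def lora_name_of(request: EngineCoreRequest) -> Optional[str]:
    # No separate lora_name argument needed: it rides along on the request.
    return (request.lora_request.lora_name
            if request.lora_request is not None else None)
```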
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

As discussed in vllm-project#13303, this metric perhaps isn't the most ideal solution for the use case but, given there is an existing user, we should retain compatibility in V1 and deprecate it if we replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
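For readers unfamiliar with the pattern in the samples above: the labels carry the adapter state and the sample value is a unix timestamp (the `e+09` numbers), so consumers can pick the freshest label set. A minimal sketch of publishing such a gauge with `prometheus_client`, for illustration only, not vLLM's actual logger code:

```python
import time

from prometheus_client import Gauge

lora_requests_info = Gauge(
    "vllm:lora_requests_info",
    "Running stats on LoRA requests.",
    labelnames=["max_lora", "running_lora_adapters", "waiting_lora_adapters"],
)


def publish(max_loras: int, running: set[str], waiting: set[str]) -> None:
    # Comma-joined adapter names in the labels; timestamp as the value.
    lora_requests_info.labels(
        max_lora=str(max_loras),
        running_lora_adapters=",".join(sorted(running)),
        waiting_lora_adapters=",".join(sorted(waiting)),
    ).set(time.time())


publish(1, running={"test-lora"}, waiting=set())
```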
The LoRA name is already available via the EngineCoreRequest arg. PR feedback from Rob. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Both are global state. PR feedback from Rob. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Tie LoRAStats lifecycle more closely to RequestStats:
1. Add request to waiting_requests when we create a RequestStats in add_request()
2. Remove request from LoRAStats whenever we remove a RequestStats (both finish and abort)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Encapsulate all of this into a LoRARequestStates class in the v1.metrics.stats module, since its sole purpose is to gather stats. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
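Based on the two fixup commits above, a rough sketch of what this lifecycle coupling could look like; names and method shapes are assumptions drawn from the commit messages, not the merged vLLM code:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoRAStats:
    waiting_requests: set[str] = field(default_factory=set)
    running_requests: set[str] = field(default_factory=set)


class LoRARequestStates:
    """Tracks per-adapter request sets, mirroring the RequestState lifecycle."""

    def __init__(self) -> None:
        self.stats_by_adapter: dict[str, LoRAStats] = {}

    def add_request(self, request_id: str, lora_name: Optional[str]) -> None:
        # Mirrors RequestState creation: a new request starts out waiting.
        if lora_name is not None:
            stats = self.stats_by_adapter.setdefault(lora_name, LoRAStats())
            stats.waiting_requests.add(request_id)

    def scheduled(self, request_id: str, lora_name: Optional[str]) -> None:
        # Waiting -> running when the scheduler picks the request up.
        if lora_name is not None and lora_name in self.stats_by_adapter:
            stats = self.stats_by_adapter[lora_name]
            stats.waiting_requests.discard(request_id)
            stats.running_requests.add(request_id)

    def finish_request(self, request_id: str, lora_name: Optional[str]) -> None:
        # Mirrors RequestState removal, for both finish and abort, so an
        # aborted request cannot linger in either set.
        if lora_name is not None and lora_name in self.stats_by_adapter:
            stats = self.stats_by_adapter[lora_name]
            stats.waiting_requests.discard(request_id)
            stats.running_requests.discard(request_id)
```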
Force-pushed from 15b9326 to b78a6f1.
Thanks @robertgshaw2-redhat. Great feedback. This definitely warranted a bunch of cleanup. See the 5 fixup commits I pushed 👍
LGTM, I like the abstraction and lifecycle. Just one nit on the typing. Ping when ready for automerge. Do you know if we have any test coverage for this metric?
PR feedback from Rob. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Thanks! Fixed
No, we don't, and I did consider addressing that (instead of manually testing), but I was loath to invest more time in it because I'm only adding it for compatibility reasons ... I'm pretty dubious that it's the right long-term solution, so I'd rather spend that time e.g. capturing all of the input on the topic in the design doc.
```diff
@@ -246,6 +260,7 @@ def _update_stats_from_finished(self, req_state: RequestState,
         iteration_stats.update_from_finished_request(finish_reason,
                                                      request_output,
                                                      req_state.stats)
+        self.lora_states.finish_request(req_state)
```
If `iteration_stats` is None, the execution of `self.lora_states.finish_request(req_state)` will be skipped. Could this lead to a `lora_states` leak?
You are correct. This would cause a leak in the case that:
- stat logging is disabled
- LoRA is enabled

Thanks for your detailed review!
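One possible shape of the fix, sketched against the method shown in the diff above; this is a hedged illustration assuming the same attributes, not the merged change. The key point is ordering the LoRA cleanup before the early return:

```python
def _update_stats_from_finished(self, req_state, request_output,
                                finish_reason, iteration_stats) -> None:
    # Release the LoRA bookkeeping unconditionally, so that running with
    # stat logging disabled (iteration_stats is None) cannot leak entries
    # from lora_states.
    self.lora_states.finish_request(req_state)

    if iteration_stats is None:
        return
    iteration_stats.update_from_finished_request(finish_reason,
                                                 request_output,
                                                 req_state.stats)
```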
…13504) Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>