[V1][Metrics] Implement vllm:lora_requests_info metric #13504
Conversation
vllm/v1/engine/output_processor.py (outdated)

```diff
@@ -241,11 +255,22 @@ def _update_stats_from_finished(self, req_state: RequestState,
         if iteration_stats is None:
             return
 
+        lora_stats: Optional[LoRAStats] = None
+        if req_state.lora_name:
```
This can just be `lora_stats = self.lora_stats.get(req_state.lora_name, None)`.
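For illustration, a runnable toy version of the suggested simplification; `LoRAStats` and the dict are stand-ins, and the body of the `if` branch is an assumption since the diff above truncates it:

```python
from typing import Optional


class LoRAStats:  # stand-in for the real stats class
    pass


lora_stats_by_name: dict[str, LoRAStats] = {"test-lora": LoRAStats()}
lora_name: Optional[str] = None

# Before: explicit None default plus a conditional lookup.
lora_stats: Optional[LoRAStats] = None
if lora_name:
    lora_stats = lora_stats_by_name[lora_name]

# After: dict.get collapses this to one line, and returns None both
# when lora_name is None and when the adapter has no entry yet.
lora_stats = lora_stats_by_name.get(lora_name, None)
```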
vllm/v1/metrics/stats.py (outdated)

```diff
@@ -151,3 +168,11 @@ def update_from_finished_request(self, finish_reason: "FinishReason",
             inference_time=inference_time,
             decode_time=decode_time)
         self.finished_requests.append(finished_req)
+
+        if lora_stats is not None:
```
One state transition which may not be accounted for here is when a request is aborted. Aborts occur when the client disconnects from the server: the `AsyncLLM` sends an abort message to the `OutputProcessor` and to the `EngineCore`. This may cause the request to be finished, but I am not 100% sure that it does, which could cause the request to never be removed from `running_requests`. We should double check this.

Additionally, if a client aborts while the request is in `lora_stats.waiting_requests`, it will never be removed from the waiting set, since the request will never be scheduled.
Great point, take a look now - basically I've tried to couple this closely with the `RequestState` lifecycle.
Generally looks good. Left a small nit that will help clean up the API a bit: `EngineCoreRequest` already contains `lora_request`, so we do not need to pass it around as a separate arg.

The one functionality issue I see is that we do not handle request aborting properly. We can connect offline about this, but I left a detailed comment below.
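A hypothetical sketch of that API cleanup, with stand-in dataclasses rather than vLLM's real types (the field names are assumptions based on the discussion):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRARequest:          # stand-in for vllm.lora.request.LoRARequest
    lora_name: str


@dataclass
class EngineCoreRequest:    # stand-in; the real class carries many more fields
    request_id: str
    lora_request: Optional[LoRARequest] = None


def lora_name_of(request: EngineCoreRequest) -> Optional[str]:
    # No separate lora_name argument needed: it rides along on the request.
    return (request.lora_request.lora_name
            if request.lora_request is not None else None)
```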
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

As discussed in vllm-project#13303, this metric perhaps isn't the most ideal solution for the use case but, given there is an existing user, we should retain compatibility in V1 and deprecate it if we replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
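For readers unfamiliar with the pattern in the samples above: the labels carry the adapter state and the sample value is a unix timestamp (the `e+09` numbers), so consumers can pick the freshest label set. A minimal sketch of publishing such a gauge with `prometheus_client`, for illustration only, not vLLM's actual logger code:

```python
import time

from prometheus_client import Gauge

lora_requests_info = Gauge(
    "vllm:lora_requests_info",
    "Running stats on LoRA requests.",
    labelnames=["max_lora", "running_lora_adapters", "waiting_lora_adapters"],
)


def publish(max_loras: int, running: set[str], waiting: set[str]) -> None:
    # Comma-joined adapter names in the labels; timestamp as the value.
    lora_requests_info.labels(
        max_lora=str(max_loras),
        running_lora_adapters=",".join(sorted(running)),
        waiting_lora_adapters=",".join(sorted(waiting)),
    ).set(time.time())


publish(1, running={"test-lora"}, waiting=set())
```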
The LoRA name is already available via the EngineCoreRequest arg. PR feedback from Rob. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Both are global state. PR feedback from Rob. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Tie LoRAStats lifecycle more closely to RequestStats:
1. Add request to waiting_requests when we create a RequestStats in add_request()
2. Remove request from LoRAStats whenever we remove a RequestStats (both finish and abort)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Encapsulate all of this into a LoRARequestStates class in the v1.metrics.stats module, since its sole purpose is to gather stats. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
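Based on the two fixup commits above, a rough sketch of what this lifecycle coupling could look like; names and method shapes are assumptions drawn from the commit messages, not the merged vLLM code:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoRAStats:
    waiting_requests: set[str] = field(default_factory=set)
    running_requests: set[str] = field(default_factory=set)


class LoRARequestStates:
    """Tracks per-adapter request sets, mirroring the RequestState lifecycle."""

    def __init__(self) -> None:
        self.stats_by_adapter: dict[str, LoRAStats] = {}

    def add_request(self, request_id: str, lora_name: Optional[str]) -> None:
        # Mirrors RequestState creation: a new request starts out waiting.
        if lora_name is not None:
            stats = self.stats_by_adapter.setdefault(lora_name, LoRAStats())
            stats.waiting_requests.add(request_id)

    def scheduled(self, request_id: str, lora_name: Optional[str]) -> None:
        # Waiting -> running when the scheduler picks the request up.
        if lora_name is not None and lora_name in self.stats_by_adapter:
            stats = self.stats_by_adapter[lora_name]
            stats.waiting_requests.discard(request_id)
            stats.running_requests.add(request_id)

    def finish_request(self, request_id: str, lora_name: Optional[str]) -> None:
        # Mirrors RequestState removal, for both finish and abort, so an
        # aborted request cannot linger in either set.
        if lora_name is not None and lora_name in self.stats_by_adapter:
            stats = self.stats_by_adapter[lora_name]
            stats.waiting_requests.discard(request_id)
            stats.running_requests.discard(request_id)
```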
Force-pushed from 15b9326 to b78a6f1.
Thanks @robertgshaw2-redhat. Great feedback. This definitely warranted a bunch of cleanup. See the 5 fixup commits I pushed 👍
LGTM, I like the abstraction and lifecycle. Just one nit on the typing. Ping when ready for automerge. Do you know if we have any test coverage for this metric?
PR feedback from Rob. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Thanks! Fixed
No, we don't, and I did consider addressing that (instead of manually testing), but I was loath to invest more time in it because I'm only adding it for compatibility reasons ... I'm pretty dubious that it's the right long-term solution, so I'd rather spend that time e.g. capturing all of the input on the topic in the design doc.
```diff
@@ -246,6 +260,7 @@ def _update_stats_from_finished(self, req_state: RequestState,
         iteration_stats.update_from_finished_request(finish_reason,
                                                      request_output,
                                                      req_state.stats)
+        self.lora_states.finish_request(req_state)
```
If `iteration_stats` is None, the execution of `self.lora_states.finish_request(req_state)` will be skipped. Could this lead to a `lora_states` leak?
You are correct. This would cause a leak in the case that:
- stat logging is disabled
- LoRA is enabled

Thanks for your detailed review!
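One possible shape of the fix, sketched against the method shown in the diff above; this is a hedged illustration assuming the same attributes, not the merged change. The key point is ordering the LoRA cleanup before the early return:

```python
def _update_stats_from_finished(self, req_state, request_output,
                                finish_reason, iteration_stats) -> None:
    # Release the LoRA bookkeeping unconditionally, so that running with
    # stat logging disabled (iteration_stats is None) cannot leak entries
    # from lora_states.
    self.lora_states.finish_request(req_state)

    if iteration_stats is None:
        return
    iteration_stats.update_from_finished_request(finish_reason,
                                                 request_output,
                                                 req_state.stats)
```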
…13504) Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>