[Feature][v1]: Add metrics support #10582


Open
1 task done
rickyyx opened this issue Nov 22, 2024 · 7 comments
Labels
feature request New feature or request

Comments

@rickyyx
Contributor

rickyyx commented Nov 22, 2024

🚀 The feature, motivation and pitch

We should also reach feature parity on metrics, covering most of the available stats where possible. At a high level:

  1. [P0] Support system and request stats logging
  2. [P0] Support metric export to Prometheus.
  3. [P1] Support or deprecate all metrics from V0
  4. [P1] Allow users to define their own Prometheus client and other arbitrary loggers (see the sketch after this list).
  5. [P2] Make it work with tracing too (there are some request-level stats that tracing needs, like queue time and TTFT). It should be possible to surface these request-level metrics in v1 as well.
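To make item 4 a bit more concrete, here is a minimal sketch of what a pluggable stats-logger interface could look like. The `StatsLogger` protocol, the `IterationStats` fields, and `StdoutStatsLogger` are hypothetical names for illustration only, not the actual vLLM API; the idea is that a Prometheus exporter would simply be one more entry in the `loggers` list.

```python
# Hypothetical sketch of a pluggable stats-logger interface (not the actual vLLM API).
from dataclasses import dataclass
from typing import Protocol


@dataclass
class IterationStats:
    # Illustrative fields only; real stats would mirror the V0 metrics discussed below.
    num_running_reqs: int
    num_waiting_reqs: int
    prompt_tokens: int
    generation_tokens: int


class StatsLogger(Protocol):
    def log(self, stats: IterationStats) -> None: ...


class StdoutStatsLogger:
    """Arbitrary user-defined logger (item 4): just prints the stats."""

    def log(self, stats: IterationStats) -> None:
        print(f"running={stats.num_running_reqs} waiting={stats.num_waiting_reqs} "
              f"prompt_tokens={stats.prompt_tokens} gen_tokens={stats.generation_tokens}")


# The engine frontend would hold a list of loggers and fan each stats update out to them.
loggers: list[StatsLogger] = [StdoutStatsLogger()]
for logger in loggers:
    logger.log(IterationStats(num_running_reqs=2, num_waiting_reqs=5,
                              prompt_tokens=128, generation_tokens=64))
```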

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@rickyyx added the feature request label Nov 22, 2024
@rickyyx
Contributor Author

rickyyx commented Nov 22, 2024

Opening the issue to track and collab - in case someone else is already looking into this.

@rickyyx
Contributor Author

rickyyx commented Nov 26, 2024

Prototype in #10651

markmc added a commit to markmc/vllm that referenced this issue Jan 26, 2025
Part of vllm-project#10582

Implement the vllm:num_requests_running and vllm:num_requests_waiting
gauges from V0. This is a simple starting point from which to iterate
towards parity with V0.

There's no need to use prometheus_client's "multi-processing mode"
(at least at this stage) because these metrics all exist within the
API server process.

Note this restores the following metrics - these were lost when we
started using multi-processing mode:

- python_gc_objects_collected_total
- python_gc_objects_uncollectable_total
- python_gc_collections_total
- python_info
- process_virtual_memory_bytes
- process_resident_memory_bytes
- process_start_time_seconds
- process_cpu_seconds_total
- process_open_fds
- process_max_fds

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
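To illustrate the single-process approach described in the commit message above, here is a minimal standalone sketch using `prometheus_client` directly. The port, update loop, and random values are made up for the example; this is not vLLM's actual code.

```python
# Minimal single-process sketch with prometheus_client: no multi-processing mode needed,
# because the gauges live in the same process that serves /metrics. Not vLLM's code.
import random
import time

from prometheus_client import Gauge, start_http_server

num_requests_running = Gauge(
    "vllm:num_requests_running", "Number of requests currently running on GPU.")
num_requests_waiting = Gauge(
    "vllm:num_requests_waiting", "Number of requests waiting to be processed.")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics on port 8000 (arbitrary choice)
    while True:
        # In vLLM these would come from scheduler stats; random values here.
        num_requests_running.set(random.randint(0, 8))
        num_requests_waiting.set(random.randint(0, 32))
        time.sleep(1)
```

Because everything lives in one process, the client library's default collectors (the `python_gc_*` and `process_*` metrics listed above) are exported for free, which is exactly what multi-processing mode had been dropping.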
@markmc
Member

markmc commented Feb 4, 2025

I thought it was about time to give an update on the latest status of this and note some TODOs.

Status

The v1 engine frontend API server now has a Prometheus-compatible `/metrics` endpoint.

A few in-flight PRs should merge soon, which will mean we support the following metrics:

  • vllm:num_requests_running (Gauge)
  • vllm:num_requests_waiting (Gauge)
  • vllm:gpu_cache_usage_perc (Gauge)
  • vllm:prompt_tokens_total (Counter)
  • vllm:generation_tokens_total (Counter)
  • vllm:request_success_total (Counter)
  • vllm:request_prompt_tokens (Histogram)
  • vllm:request_generation_tokens (Histogram)
  • vllm:time_to_first_token_seconds (Histogram)
  • vllm:time_per_output_token_seconds (Histogram)
  • vllm:e2e_request_latency_seconds (Histogram)
  • vllm:request_queue_time_seconds (Histogram)
  • vllm:request_inference_time_seconds (Histogram)
  • vllm:request_prefill_time_seconds (Histogram)
  • vllm:request_decode_time_seconds (Histogram)

Also, note that vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits (Counters) replace vllm:gpu_prefix_cache_hit_rate (Gauge).
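For reference, the hit rate the old gauge reported can still be derived on the consumer side from the two counters. A minimal sketch, assuming a server on localhost:8000 and a single served model (labels ignored); depending on the client version the counter samples may carry a `_total` suffix, so both forms are checked:

```python
# Derive a prefix-cache hit rate from the two counters exposed on /metrics.
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

# Assumed local endpoint; adjust host/port for your deployment.
text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()

values: dict[str, float] = {}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        values[sample.name] = sample.value  # assumes one model, so labels can be ignored


def counter(name: str) -> float:
    # Counters may be exposed as "name" or "name_total" depending on the client version.
    return values.get(name, values.get(name + "_total", 0.0))


queries = counter("vllm:gpu_prefix_cache_queries")
hits = counter("vllm:gpu_prefix_cache_hits")
print(f"prefix cache hit rate: {hits / queries:.2%}" if queries else "no queries yet")
```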

These are most of the metrics used by the example Grafana dashboard, with the exception of:

  • vllm:num_requests_swapped (Gauge)
  • vllm:cpu_cache_usage_perc (Gauge)
  • vllm:request_max_num_generation_tokens (Histogram)

Additionally, these are other metrics supported by v0, but not yet by v1:

  • vllm:num_preemptions_total (Counter)
  • vllm:cache_config_info (Gauge)
  • vllm:lora_requests_info (Gauge)
  • vllm:cpu_prefix_cache_hit_rate (Gauge)
  • vllm:tokens_total (Counter)
  • vllm:iteration_tokens_total (Histogram)
  • vllm:time_in_queue_requests (Histogram)
  • vllm:model_forward_time_milliseconds (Histogram)
  • vllm:model_execute_time_milliseconds (Histogram)
  • vllm:request_params_n (Histogram)
  • vllm:request_params_max_tokens (Histogram)
  • vllm:spec_decode_draft_acceptance_rate (Gauge)
  • vllm:spec_decode_efficiency (Gauge)
  • vllm:spec_decode_num_accepted_tokens_total (Counter)
  • vllm:spec_decode_num_draft_tokens_total (Counter)
  • vllm:spec_decode_num_emitted_tokens_total (Counter)

Next Steps

markmc added a commit to markmc/vllm that referenced this issue Feb 11, 2025
Follow on from vllm-project#12579, part of vllm-project#10582.

Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time
timestamp recorded by the frontend.

For the rest ... we want to capture (in histograms) precise
per-request timing intervals between certain events in the engine
core:

```
  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to first token time) ]
    [ inference interval (relative to scheduled time) ]
  << new token timestamp (FINISHED) >>
```
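As a rough sketch of the interval arithmetic implied by the diagram (the timestamps are made-up floats and the variable names are illustrative, not the actual EngineCoreOutput schema; the real code would observe these values into the histograms listed above):

```python
# Hypothetical sketch of interval calculation in the frontend; event names and
# timestamp plumbing are illustrative, not the actual EngineCoreOutput fields.
arrival_ts = 99.90      # arrival_time recorded by the frontend
queued_ts = 100.00      # QUEUED: recorded in scheduler add_request()
scheduled_ts = 100.25   # SCHEDULED: recorded in scheduler schedule()
first_token_ts = 100.75 # first new-token timestamp
last_token_ts = 103.50  # new-token timestamp of the FINISHED output

queue_interval = scheduled_ts - queued_ts          # vllm:request_queue_time_seconds
prefill_interval = first_token_ts - scheduled_ts   # vllm:request_prefill_time_seconds
decode_interval = last_token_ts - first_token_ts   # vllm:request_decode_time_seconds
inference_interval = last_token_ts - scheduled_ts  # vllm:request_inference_time_seconds
e2e_latency = last_token_ts - arrival_ts           # vllm:e2e_request_latency_seconds

for name, value in [("queue", queue_interval), ("prefill", prefill_interval),
                    ("decode", decode_interval), ("inference", inference_interval),
                    ("e2e", e2e_latency)]:
    print(f"{name}: {value:.2f}s")  # in vLLM these would be Histogram.observe() calls
```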

We want to collect these metrics in the frontend process, to keep the
engine core freed up as much as possible. We need to calculate these
intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per
request) as a sequence of timestamped events, and the frontend will
calculate intervals and log them. Where we record these timestamped
events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization
timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc
Member

markmc commented Feb 27, 2025

As a bit of a status update, here's how the example Grafana dashboard currently looks with a serving benchmark run like this:

$ python3 ./benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-8B-Instruct --tokenizer meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 3.0 --num-prompts 200
...
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  71.53     
Total input tokens:                      42659     
Total generated tokens:                  43516     
Request throughput (req/s):              2.80      
Output token throughput (tok/s):         608.37    
Total Token throughput (tok/s):          1204.76   
---------------Time to First Token----------------
Mean TTFT (ms):                          24.47     
Median TTFT (ms):                        24.67     
P99 TTFT (ms):                           31.27     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.16     
Median TPOT (ms):                        13.20     
P99 TPOT (ms):                           13.64     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.14     
Median ITL (ms):                         13.16     
P99 ITL (ms):                            14.78     
==================================================

[Grafana dashboard screenshots]

@markmc
Member

markmc commented Feb 27, 2025

What's nice about the above is that even though V1 does not have vllm:num_requests_swapped and vllm:cpu_cache_usage_perc (because V1 doesn't have swap-to-CPU preemption mode), it doesn't impact the user experience of the dashboard - i.e. they just don't show up in the Scheduler State and Cache Utilization panels 👍

@markmc
Member

markmc commented Mar 3, 2025

Here's the latest on what's in V0 versus V1:

| In Both | In V0 Only | In V1 Only |
| --- | --- | --- |
| vllm:cache_config_info | vllm:cpu_cache_usage_perc #14136 | vllm:gpu_prefix_cache_hits #12592 |
| vllm:e2e_request_latency_seconds | vllm:cpu_prefix_cache_hit_rate #14136 | vllm:gpu_prefix_cache_queries #12592 |
| vllm:generation_tokens_total | vllm:gpu_prefix_cache_hit_rate #14136 | |
| vllm:gpu_cache_usage_perc | vllm:model_execute_time_milliseconds #14135 | |
| vllm:iteration_tokens_total | vllm:model_forward_time_milliseconds #14135 | |
| vllm:lora_requests_info | vllm:num_requests_swapped #14136 | |
| vllm:num_preemptions_total | vllm:request_max_num_generation_tokens #14055 | |
| vllm:num_requests_running | vllm:request_params_max_tokens #14055 | |
| vllm:num_requests_waiting | vllm:request_params_n #14055 | |
| vllm:prompt_tokens_total | vllm:spec_decode_draft_acceptance_rate | |
| vllm:request_decode_time_seconds | vllm:spec_decode_efficiency | |
| vllm:request_generation_tokens | vllm:spec_decode_num_accepted_tokens_total | |
| vllm:request_inference_time_seconds | vllm:spec_decode_num_draft_tokens_total | |
| vllm:request_prefill_time_seconds | vllm:spec_decode_num_emitted_tokens_total | |
| vllm:request_prompt_tokens | vllm:time_in_queue_requests #14135 | |
| vllm:request_queue_time_seconds | vllm:tokens_total #14134 | |
| vllm:request_success_total | | |
| vllm:time_per_output_token_seconds | | |
| vllm:time_to_first_token_seconds | | |

Next Steps

@liuzijing2014
Collaborator

Hi, just wanted to check in to see if there is a plan to support per-request-level stats logging? For example: {request_1: {ttit: 10, e2e_latency: 200}}.

markmc added a commit to markmc/vllm that referenced this issue Mar 19, 2025
Fixes vllm-project#13990, part of vllm-project#10582

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this issue Mar 24, 2025
Fixes vllm-project#13990, part of vllm-project#10582

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this issue Mar 28, 2025
markmc added a commit to markmc/vllm that referenced this issue Mar 28, 2025
Fixes vllm-project#13990, part of vllm-project#10582

Omitting system efficiency for now.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this issue Mar 31, 2025
Fixes vllm-project#13990, part of vllm-project#10582

Omitting system efficiency for now.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>