[Feature][v1]: Add metrics support #10582


Open
1 task done
rickyyx opened this issue Nov 22, 2024 · 7 comments
Labels
feature request New feature or request

Comments

@rickyyx
Contributor

rickyyx commented Nov 22, 2024

🚀 The feature, motivation and pitch

We should also reach feature parity on metrics, covering most of the available stats where possible. At a high level:

  1. [P0] Support system and request stats logging
  2. [P0] Support metric export to Prometheus.
  3. [P1] Support or deprecate all metrics from V0
  4. [P1] Allow users to define their own Prometheus client and other arbitrary loggers (see the sketch after this list).
  5. [P2] Make it work with tracing too (there are some request-level stats that tracing needs, like queue time and TTFT). It should be possible to surface these request-level metrics in v1 as well.
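To make item 4 a bit more concrete, here is a minimal sketch of what a pluggable stats-logger interface could look like. The `StatsLogger` protocol, the `IterationStats` fields, and `StdoutStatsLogger` are hypothetical names for illustration only, not the actual vLLM API; the idea is that a Prometheus exporter would simply be one more entry in the `loggers` list.

```python
# Hypothetical sketch of a pluggable stats-logger interface (not the actual vLLM API).
from dataclasses import dataclass
from typing import Protocol


@dataclass
class IterationStats:
    # Illustrative fields only; real stats would mirror the V0 metrics discussed below.
    num_running_reqs: int
    num_waiting_reqs: int
    prompt_tokens: int
    generation_tokens: int


class StatsLogger(Protocol):
    def log(self, stats: IterationStats) -> None: ...


class StdoutStatsLogger:
    """Arbitrary user-defined logger (item 4): just prints the stats."""

    def log(self, stats: IterationStats) -> None:
        print(f"running={stats.num_running_reqs} waiting={stats.num_waiting_reqs} "
              f"prompt_tokens={stats.prompt_tokens} gen_tokens={stats.generation_tokens}")


# The engine frontend would hold a list of loggers and fan each stats update out to them.
loggers: list[StatsLogger] = [StdoutStatsLogger()]
for logger in loggers:
    logger.log(IterationStats(num_running_reqs=2, num_waiting_reqs=5,
                              prompt_tokens=128, generation_tokens=64))
```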

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@rickyyx added the feature request label Nov 22, 2024
@rickyyx
Contributor Author

rickyyx commented Nov 22, 2024

Opening the issue to track and collab - in case someone else is already looking into this.

@rickyyx
Contributor Author

rickyyx commented Nov 26, 2024

Prototype in #10651

markmc added a commit to markmc/vllm that referenced this issue Jan 26, 2025
Part of vllm-project#10582

Implement the vllm:num_requests_running and vllm:num_requests_waiting
gauges from V0. This is a simple starting point from which to iterate
towards parity with V0.

There's no need to use prometheus_client's "multi-processing mode"
(at least at this stage) because these metrics all exist within the
API server process.

Note this restores the following metrics - these were lost when we
started using multi-processing mode:

- python_gc_objects_collected_total
- python_gc_objects_uncollectable_total
- python_gc_collections_total
- python_info
- process_virtual_memory_bytes
- process_resident_memory_bytes
- process_start_time_seconds
- process_cpu_seconds_total
- process_open_fds
- process_max_fds

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
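To illustrate the single-process approach described in the commit message above, here is a minimal standalone sketch using `prometheus_client` directly. The port, update loop, and random values are made up for the example; this is not vLLM's actual code.

```python
# Minimal single-process sketch with prometheus_client: no multi-processing mode needed,
# because the gauges live in the same process that serves /metrics. Not vLLM's code.
import random
import time

from prometheus_client import Gauge, start_http_server

num_requests_running = Gauge(
    "vllm:num_requests_running", "Number of requests currently running on GPU.")
num_requests_waiting = Gauge(
    "vllm:num_requests_waiting", "Number of requests waiting to be processed.")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics on port 8000 (arbitrary choice)
    while True:
        # In vLLM these would come from scheduler stats; random values here.
        num_requests_running.set(random.randint(0, 8))
        num_requests_waiting.set(random.randint(0, 32))
        time.sleep(1)
```

Because everything lives in one process, the client library's default collectors (the `python_gc_*` and `process_*` metrics listed above) are exported for free, which is exactly what multi-processing mode had been dropping.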
@markmc
Member

markmc commented Feb 4, 2025

I thought it was about time to give an update on the latest status of this and note some TODOs.

Status

The v1 engine frontend API server now has a Prometheus-compatible `/metrics` endpoint.

A few in-flight PRs should merge soon, which will mean we support the following metrics:

  • vllm:num_requests_running (Gauge)
  • vllm:num_requests_waiting (Gauge)
  • vllm:gpu_cache_usage_perc (Gauge)
  • vllm:prompt_tokens_total (Counter)
  • vllm:generation_tokens_total (Counter)
  • vllm:request_success_total (Counter)
  • vllm:request_prompt_tokens (Histogram)
  • vllm:request_generation_tokens (Histogram)
  • vllm:time_to_first_token_seconds (Histogram)
  • vllm:time_per_output_token_seconds (Histogram)
  • vllm:e2e_request_latency_seconds (Histogram)
  • vllm:request_queue_time_seconds (Histogram)
  • vllm:request_inference_time_seconds (Histogram)
  • vllm:request_prefill_time_seconds (Histogram)
  • vllm:request_decode_time_seconds (Histogram)

Also, note that vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits (Counters) replace vllm:gpu_prefix_cache_hit_rate (Gauge).
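For reference, the hit rate the old gauge reported can still be derived on the consumer side from the two counters. A minimal sketch, assuming a server on localhost:8000 and a single served model (labels ignored); depending on the client version the counter samples may carry a `_total` suffix, so both forms are checked:

```python
# Derive a prefix-cache hit rate from the two counters exposed on /metrics.
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

# Assumed local endpoint; adjust host/port for your deployment.
text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()

values: dict[str, float] = {}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        values[sample.name] = sample.value  # assumes one model, so labels can be ignored


def counter(name: str) -> float:
    # Counters may be exposed as "name" or "name_total" depending on the client version.
    return values.get(name, values.get(name + "_total", 0.0))


queries = counter("vllm:gpu_prefix_cache_queries")
hits = counter("vllm:gpu_prefix_cache_hits")
print(f"prefix cache hit rate: {hits / queries:.2%}" if queries else "no queries yet")
```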

These are most of the metrics used by the example Grafana dashboard, with the exception of:

  • vllm:num_requests_swapped (Gauge)
  • vllm:cpu_cache_usage_perc (Gauge)
  • vllm:request_max_num_generation_tokens (Histogram)

Additionally, these are other metrics supported by v0, but not yet by v1:

  • vllm:num_preemptions_total (Counter)
  • vllm:cache_config_info (Gauge)
  • vllm:lora_requests_info (Gauge)
  • vllm:cpu_prefix_cache_hit_rate (Gauge)
  • vllm:tokens_total (Counter)
  • vllm:iteration_tokens_total (Histogram)
  • vllm:time_in_queue_requests (Histogram)
  • vllm:model_forward_time_milliseconds (Histogram)
  • vllm:model_execute_time_milliseconds (Histogram)
  • vllm:request_params_n (Histogram)
  • vllm:request_params_max_tokens (Histogram)
  • vllm:spec_decode_draft_acceptance_rate (Gauge)
  • vllm:spec_decode_efficiency (Gauge)
  • vllm:spec_decode_num_accepted_tokens_total (Counter)
  • vllm:spec_decode_num_draft_tokens_total (Counter)
  • vllm:spec_decode_num_emitted_tokens_total (Counter)

Next Steps

markmc added a commit to markmc/vllm that referenced this issue Feb 11, 2025
Follow on from vllm-project#12579, part of vllm-project#10582.

Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time
timestamp recorded by the frontend.

For the rest ... we want to capture (in histograms) precise
per-request timing intervals between certain events in the engine
core:

```
  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to first token time) ]
    [ inference interval (relative to scheduled time) ]
  << new token timestamp (FINISHED) >>
```
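As a rough sketch of the interval arithmetic implied by the diagram (the timestamps are made-up floats and the variable names are illustrative, not the actual EngineCoreOutput schema; the real code would observe these values into the histograms listed above):

```python
# Hypothetical sketch of interval calculation in the frontend; event names and
# timestamp plumbing are illustrative, not the actual EngineCoreOutput fields.
arrival_ts = 99.90      # arrival_time recorded by the frontend
queued_ts = 100.00      # QUEUED: recorded in scheduler add_request()
scheduled_ts = 100.25   # SCHEDULED: recorded in scheduler schedule()
first_token_ts = 100.75 # first new-token timestamp
last_token_ts = 103.50  # new-token timestamp of the FINISHED output

queue_interval = scheduled_ts - queued_ts          # vllm:request_queue_time_seconds
prefill_interval = first_token_ts - scheduled_ts   # vllm:request_prefill_time_seconds
decode_interval = last_token_ts - first_token_ts   # vllm:request_decode_time_seconds
inference_interval = last_token_ts - scheduled_ts  # vllm:request_inference_time_seconds
e2e_latency = last_token_ts - arrival_ts           # vllm:e2e_request_latency_seconds

for name, value in [("queue", queue_interval), ("prefill", prefill_interval),
                    ("decode", decode_interval), ("inference", inference_interval),
                    ("e2e", e2e_latency)]:
    print(f"{name}: {value:.2f}s")  # in vLLM these would be Histogram.observe() calls
```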

We want to collect these metrics in the frontend process, to keep the
engine core freed up as much as possible. We need to calculate these
intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per
request) as a sequence of timestamped events, and the frontend will
calculate intervals and log them. Where we record these timestamped
events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization
timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc
Member

markmc commented Feb 27, 2025

As a bit of a status update, here's how the example Grafana dashboard currently looks with a serving benchmark run like this:

$ python3 ./benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-8B-Instruct --tokenizer meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 3.0 --num-prompts 200
...
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  71.53     
Total input tokens:                      42659     
Total generated tokens:                  43516     
Request throughput (req/s):              2.80      
Output token throughput (tok/s):         608.37    
Total Token throughput (tok/s):          1204.76   
---------------Time to First Token----------------
Mean TTFT (ms):                          24.47     
Median TTFT (ms):                        24.67     
P99 TTFT (ms):                           31.27     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.16     
Median TPOT (ms):                        13.20     
P99 TPOT (ms):                           13.64     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.14     
Median ITL (ms):                         13.16     
P99 ITL (ms):                            14.78     
==================================================

[Grafana dashboard screenshots]

@markmc
Member

markmc commented Feb 27, 2025

What's nice about the above is that even though V1 does not have vllm:num_requests_swapped and vllm:cpu_cache_usage_perc (because V1 doesn't have swap-to-CPU preemption mode), it doesn't impact the user experience of the dashboard - i.e. they just don't show up in the Scheduler State and Cache Utilization panels 👍

@markmc
Member

markmc commented Mar 3, 2025

Here's the latest on what's in V0 versus V1:

| In Both | In V0 Only | In V1 Only |
| --- | --- | --- |
| vllm:cache_config_info | vllm:cpu_cache_usage_perc #14136 | vllm:gpu_prefix_cache_hits #12592 |
| vllm:e2e_request_latency_seconds | vllm:cpu_prefix_cache_hit_rate #14136 | vllm:gpu_prefix_cache_queries #12592 |
| vllm:generation_tokens_total | vllm:gpu_prefix_cache_hit_rate #14136 | |
| vllm:gpu_cache_usage_perc | vllm:model_execute_time_milliseconds #14135 | |
| vllm:iteration_tokens_total | vllm:model_forward_time_milliseconds #14135 | |
| vllm:lora_requests_info | vllm:num_requests_swapped #14136 | |
| vllm:num_preemptions_total | vllm:request_max_num_generation_tokens #14055 | |
| vllm:num_requests_running | vllm:request_params_max_tokens #14055 | |
| vllm:num_requests_waiting | vllm:request_params_n #14055 | |
| vllm:prompt_tokens_total | vllm:spec_decode_draft_acceptance_rate | |
| vllm:request_decode_time_seconds | vllm:spec_decode_efficiency | |
| vllm:request_generation_tokens | vllm:spec_decode_num_accepted_tokens_total | |
| vllm:request_inference_time_seconds | vllm:spec_decode_num_draft_tokens_total | |
| vllm:request_prefill_time_seconds | vllm:spec_decode_num_emitted_tokens_total | |
| vllm:request_prompt_tokens | vllm:time_in_queue_requests #14135 | |
| vllm:request_queue_time_seconds | vllm:tokens_total #14134 | |
| vllm:request_success_total | | |
| vllm:time_per_output_token_seconds | | |
| vllm:time_to_first_token_seconds | | |

Next Steps

@liuzijing2014
Collaborator

Hi, just wanted to check in to see if there is a plan to support per-request-level stats logging? For example: {request_1: {ttit: 10, e2e_latency: 200}}.

markmc added a commit to markmc/vllm that referenced this issue Mar 19, 2025
Fixes vllm-project#13990, part of vllm-project#10582

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this issue Mar 24, 2025
Fixes vllm-project#13990, part of vllm-project#10582

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this issue Mar 28, 2025
markmc added a commit to markmc/vllm that referenced this issue Mar 28, 2025
Fixes vllm-project#13990, part of vllm-project#10582

Omitting system efficiency for now.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added a commit to markmc/vllm that referenced this issue Mar 31, 2025
Fixes vllm-project#13990, part of vllm-project#10582

Omitting system efficiency for now.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>