
[ModelRunner]Add profile execute duration observation #1013


Open
wants to merge 8 commits into main

Conversation

depeng1994

What this PR does / why we need it?

We need to observe the time consumed in each stage of inference (pre-processing, model forward, etc.) without any performance loss.
Therefore, we use the NPU's event timestamp mechanism to mark any stage during execution on the NPU device (the marking operation runs asynchronously, so it adds no overhead).
Additionally, we provide a blocking synchronization API, pop_captured_sync, to be called at an appropriate time to print the time consumed in all observed stages.
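As a rough illustration of the mechanism described above, here is a CPU-only sketch: the real feature records NPU event timestamps asynchronously on the device, whereas `time.perf_counter()` here is a synchronous stand-in. The class and method names follow the PR description, but this is not the actual implementation.

```python
import time
from contextlib import contextmanager

class ProfileExecuteDuration:
    """Simplified stand-in: record cheap (tag, start, end) marks during
    execution, and resolve them into durations only on explicit request."""

    _captured = []  # shared, so every ProfileExecuteDuration() sees the same marks

    @contextmanager
    def capture_async(self, tag):
        # Real feature: enqueue NPU event timestamps without syncing the device.
        start = time.perf_counter()
        yield
        self._captured.append((tag, start, time.perf_counter()))

    def pop_captured_sync(self):
        # Real feature: a blocking sync that reads back all event timestamps.
        durations = {tag: (end - start) * 1e3 for tag, start, end in self._captured}
        self._captured.clear()
        return durations

prof = ProfileExecuteDuration()
with prof.capture_async("forward"):
    time.sleep(0.005)  # pretend model forward
for tag, ms in prof.pop_captured_sync().items():
    print(f"[{tag}]:{ms:.2f}ms")
```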

Only 5 lines of model_runner_v1.py were actually changed, all of them ProfileExecuteDuration() calls; the diff looks larger only because of re-indentation.

Does this PR introduce any user-facing change?

Use env VLLM_MODEL_EXECUTE_TIME_OBSERVE to enable this feature

How was this patch tested?

Tested with a DeepSeek model; the output looks like this:

5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms

@MengqingCao
Collaborator

Overall LGTM, just some suggestions:

  • Thanks for the detailed PR description. Could you please convert it into a tutorial for this feature? We can put it in docs/source/developer_guide/profile_execute_duration_observation.md
  • Could you add a unit test for this feature?
  • Please run bash format.sh locally to fix the lint failures, and sign off your commits with git commit -sm "your commit message" to fix the DCO check

@depeng1994
Author

@wangxiyuan @Yikun @ganyi1996ppo please take a look, thanks

Signed-off-by: depeng1994 <depengzhang@foxmail.com>
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
@github-actions github-actions bot added documentation Improvements or additions to documentation merge-conflicts labels May 30, 2025

github-actions bot commented Jun 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: depeng1994 <depengzhang@foxmail.com>
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
@depeng1994 depeng1994 force-pushed the main branch 3 times, most recently from 25ee163 to 9967032 Compare June 3, 2025 12:49
Signed-off-by: depeng1994 <depengzhang@foxmail.com>

github-actions bot commented Jun 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.


@@ -36,6 +36,8 @@
lambda: bool(int(os.getenv("COMPILE_CUSTOM_KERNELS", "1"))),
"VLLM_ENABLE_MC2":
lambda: bool(int(os.getenv("VLLM_ENABLE_MC2", '0'))),
"VLLM_MODEL_EXECUTE_TIME_OBSERVE":
Collaborator


Suggested change
"VLLM_MODEL_EXECUTE_TIME_OBSERVE":
"VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE":

Author


fixed

Signed-off-by: depeng1994 <depengzhang@foxmail.com>

github-actions bot commented Jun 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Collaborator

@Yikun Yikun left a comment


The commit message should also be updated.

The PR is good enough, just some nits see comments inline.

You can choose to address them in a separate PR.

* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
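The two bullets above combine into a pattern like the following self-contained sketch. `capture_async` and `pop_captured_sync` are mocked here with wall-clock timers and module-level state rather than the real vllm-ascend class, the env-var gating mirrors the flag added in this PR (enabled by default in this sketch so it prints something, unlike the real feature), and the stage names are taken from the example output, not the actual call sites.

```python
import os
import time
from contextlib import contextmanager

# Gate observation on the env var, as the feature describes.
OBSERVE = bool(int(os.getenv("VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE", "1")))

_captured = []  # (tag, duration_ms) pairs collected between pops

@contextmanager
def capture_async(tag):
    # Non-blocking observation point wrapped around one stage.
    start = time.perf_counter()
    yield
    if OBSERVE:
        _captured.append((tag, (time.perf_counter() - start) * 1e3))

def pop_captured_sync():
    # Blocking call: return and clear all captured durations (in ms).
    durations = dict(_captured)
    _captured.clear()
    return durations

with capture_async("prepare input and forward"):
    with capture_async("forward"):
        time.sleep(0.002)  # pretend model forward
with capture_async("post process"):
    time.sleep(0.001)  # pretend sampling / detokenization

print("Profile execute duration [Decode]:",
      " ".join(f"[{tag}]:{ms:.2f}ms" for tag, ms in pop_captured_sync().items()))
```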

## Example Output
Collaborator


The doc is good, but we could provide an e2e guide to help devs understand. Such as:


We have already added observation points for the key stages of inference (including pre-processing, model forward, etc.); you can run an inference script:

VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py

for tag, duration in durations.items()
]
captured_name = "Decode" if self.attn_state == AscendAttentionState.DecodeOnly else "Prefill"
print(f"Profile execute duration [{captured_name}]:",
Collaborator


print or log?

@Yikun
Copy link
Collaborator

Yikun commented Jun 5, 2025

@ganyi1996ppo @wangxiyuan please take a look

@Yikun Yikun added the ready read for review label Jun 5, 2025
Labels
documentation Improvements or additions to documentation module:core module:tests ready read for review
4 participants