# [ModelRunner] Add profile execute duration observation #1013
base: main
## Conversation
Overall lgtm, just some suggestions:
@wangxiyuan @Yikun @ganyi1996ppo please take a look, tks
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
Force-pushed from 25ee163 to 9967032.
vllm_ascend/envs.py (Outdated)
@@ -36,6 +36,8 @@
    lambda: bool(int(os.getenv("COMPILE_CUSTOM_KERNELS", "1"))),
    "VLLM_ENABLE_MC2":
    lambda: bool(int(os.getenv("VLLM_ENABLE_MC2", '0'))),
    "VLLM_MODEL_EXECUTE_TIME_OBSERVE":
"VLLM_MODEL_EXECUTE_TIME_OBSERVE": | |
"VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE": |
fixed
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
The commit message should also be updated.
The PR is good enough, just some nits; see the comments inline.
You can choose to address them in a separate PR.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously when you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
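A minimal usage sketch of the two APIs above; the import path, the context-manager form of `capture_async`, and the helper `preprocess` are assumptions based on this PR's description rather than its final code:

```python
from vllm_ascend.utils import ProfileExecuteDuration  # assumed import path

def execute_once(model, raw_inputs):
    # Non-blocking: enqueue NPU event timestamps around each stage.
    with ProfileExecuteDuration().capture_async("prepare input"):
        batch = preprocess(raw_inputs)  # hypothetical pre-processing step
    with ProfileExecuteDuration().capture_async("forward"):
        hidden_states = model(batch)

    # Blocking: synchronize events and collect {tag: duration_ms}.
    durations = ProfileExecuteDuration().pop_captured_sync()
    for tag, duration in durations.items():
        print(f"[{tag}]:{duration:.2f}ms")
    return hidden_states
```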
## Example Output
The doc is good, but we could provide an e2e guide to help devs understand. For example:
We already added the key stages of inference (including pre-processing, model forward, etc.), so you can run the inference script:
`VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py`
    for tag, duration in durations.items()
]
captured_name = "Decode" if self.attn_state == AscendAttentionState.DecodeOnly else "Prefill"
print(f"Profile execute duration [{captured_name}]:",
print or log?
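For comparison, a logger-based variant of the `print` call above; this is a standalone sketch with made-up sample durations, not the PR's code:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Made-up captured data in the same {tag: duration_ms} shape as above.
durations = {"prepare input": 1.23, "forward": 4.56}
captured_name = "Prefill"

# Routed through logging instead of print(), so output honors the
# process's configured log level and handlers.
logger.info("Profile execute duration [%s]: %s", captured_name,
            " ".join(f"[{tag}]:{d:.2f}ms" for tag, d in durations.items()))
```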
### What this PR does / why we need it?
We need to observe the time consumed in each stage of inference (including pre-processing, model forward, etc.) without any performance loss. Therefore, we use the NPU's event timestamp mechanism to mark any stage during execution on the NPU device (this marking operation is executed asynchronously, with no performance loss). Additionally, we provide a blocking synchronization API `pop_captured_sync` to be called at an appropriate time, which prints the time consumed in all observed stages.
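As a rough illustration of that event-timestamp mechanism, here is a sketch assuming `torch_npu` is installed and that `torch.npu.Event` mirrors the `torch.cuda.Event` timing API (`model` and `batch` are hypothetical):

```python
import torch
import torch_npu  # noqa: F401  # assumed: registers the torch.npu backend

start = torch.npu.Event(enable_timing=True)
end = torch.npu.Event(enable_timing=True)

start.record()         # enqueued asynchronously on the NPU stream: no sync
output = model(batch)  # hypothetical forward pass being observed
end.record()

end.synchronize()      # the only blocking step: wait for the end event
print(f"forward: {start.elapsed_time(end):.2f} ms")
```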
The model_runner_v1.py file changed only 5 lines, all of which are `ProfileExecuteDuration()` calls; nothing else was changed. The diff appears larger than that due to re-indentation.
### Does this PR introduce any user-facing change?
Use the env variable `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
### How was this patch tested?
Tested with the DeepSeek model; it prints output like this: