[Bugfix][Frontend] Fixed issue where requests with duplicate request IDs might be sent to EngineCore simultaneously #15326

hidva · 2025-03-22T09:08:18Z

Currently, vLLM allows users to send duplicate request IDs. At the same time, many modules in EngineCore use request IDs as dictionary keys, such as KVCacheManager.req_to_blocks. These modules rely on the assumption that the Frontend always aborts a request before adding a new one with the same request ID, so that EngineCore only ever sees sequences like:

# req1, req2 have the same request_id.
(EngineCoreRequestType.ADD, req1(request_id=RequestId))
(EngineCoreRequestType.ABORT, req1)
(EngineCoreRequestType.ADD, req2(request_id=RequestId))
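To make the assumption concrete, here is a minimal toy model of EngineCore state keyed by request ID, in the spirit of KVCacheManager.req_to_blocks. It is hypothetical, not vLLM code: ToyEngineCore and replay are illustrative names, and each ADD simply allocates one fake block number.

```python
class ToyEngineCore:
    """Hypothetical stand-in for EngineCore state keyed by request ID."""

    def __init__(self) -> None:
        self.req_to_blocks: dict[str, list[int]] = {}
        self._next_block = 0

    def add(self, request_id: str) -> None:
        # Allocate one fake block. A duplicate ID silently overwrites the
        # previous entry, so the old request's block is leaked.
        self._next_block += 1
        self.req_to_blocks[request_id] = [self._next_block]

    def abort(self, request_id: str) -> None:
        # Frees whatever is currently stored under this ID.
        self.req_to_blocks.pop(request_id, None)


def replay(messages: list[tuple[str, str]]) -> dict[str, list[int]]:
    """Feed (kind, request_id) messages to a fresh engine and return its state."""
    engine = ToyEngineCore()
    for kind, request_id in messages:
        if kind == "ADD":
            engine.add(request_id)
        elif kind == "ABORT":
            engine.abort(request_id)
        else:
            raise ValueError(f"unknown message type {kind!r}")
    return engine.req_to_blocks


# Ordering EngineCore expects: abort req1, then add req2.
expected = [("ADD", "r"), ("ABORT", "r"), ("ADD", "r")]
# Racy ordering: req2's ADD overtakes req1's ABORT.
racy = [("ADD", "r"), ("ADD", "r"), ("ABORT", "r")]
```

replay(expected) returns {'r': [2]}, whereas replay(racy) returns {}: the ABORT intended for req1 removes req2's entry, and block 1, which belonged to req1, is never freed.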

AsyncLLM enforces this ordering through AsyncLLM._add_request -> OutputProcessor.add_request, which rejects a request whose ID is still tracked in OutputProcessor.request_states; the old request must first be removed by AsyncLLM.abort:

# OutputProcessor.add_request
request_id = request.request_id
if request_id in self.request_states:
    raise ValueError(f"Request id {request_id} already running.")

# AsyncLLM.abort
async def abort(self, request_id: str) -> None:
    """Abort RequestId in OutputProcessor and EngineCore."""

    request_ids = self.output_processor.abort_requests((request_id,))
    # BUG!
    # This operation is not atomic, and there might be a time window during which
    # the request has already been removed from OutputProcessor.request_states,
    # but the corresponding ABORT has not yet been issued to EngineCore.
    await self.engine_core.abort_requests_async(request_ids)

    if self.log_requests:
        logger.info("Aborted request %s.", request_id)

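The bug is easy to reproduce by widening the window, e.g. by inserting await asyncio.sleep(13) at the BUG point: a new request with the same ID then passes the request_states check, and its ADD reaches EngineCore before the ABORT for the old request.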
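The race can also be sketched without vLLM. The hypothetical asyncio model below (ToyFrontend and its fields are illustrative assumptions, not the real AsyncLLM) forgets the request ID before sending the ABORT, and the sleep inside abort() plays the role of the widened window.

```python
import asyncio


class ToyFrontend:
    """Hypothetical model of AsyncLLM + OutputProcessor, not vLLM code."""

    def __init__(self, abort_delay: float) -> None:
        self.request_states: set[str] = set()
        self.engine_log: list[tuple[str, str]] = []  # messages sent to EngineCore
        self.abort_delay = abort_delay

    async def add_request(self, request_id: str) -> None:
        # Mirrors the duplicate check in OutputProcessor.add_request.
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
        self.request_states.add(request_id)
        self.engine_log.append(("ADD", request_id))

    async def abort(self, request_id: str) -> None:
        # Buggy order: forget the request first, then tell the engine.
        self.request_states.discard(request_id)
        await asyncio.sleep(self.abort_delay)  # the window marked BUG above
        self.engine_log.append(("ABORT", request_id))


async def main() -> list[tuple[str, str]]:
    frontend = ToyFrontend(abort_delay=0.05)
    await frontend.add_request("r")
    abort_task = asyncio.create_task(frontend.abort("r"))
    await asyncio.sleep(0)  # let abort() run up to its sleep
    await frontend.add_request("r")  # passes the check inside the window
    await abort_task
    return frontend.engine_log


log = asyncio.run(main())
```

Running it yields [('ADD', 'r'), ('ADD', 'r'), ('ABORT', 'r')]: EngineCore receives the second ADD while the first request is still live.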

To fix this issue, we split completed requests into two categories:

aborted requests, cleaned up by handle_abort_reqs
finished requests, cleaned up by _handle_finished_reqs

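We then ensure that the scope of request visibility in the Frontend always contains the scope of request visibility in EngineCore: a request ID stays tracked in the Frontend at least until the corresponding ABORT has been issued to EngineCore (or the request has finished there), so a new request reusing that ID cannot be admitted while the old one may still be live in EngineCore.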
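A minimal sketch of this invariant, using the same kind of hypothetical toy model: the Frontend drops the ID only after the ABORT has been sent. It illustrates the ordering constraint only; FixedToyFrontend is an illustrative name, and this is not the PR's actual code, which also distinguishes aborted from finished requests.

```python
import asyncio


class FixedToyFrontend:
    """Hypothetical model of the visibility invariant, not vLLM code."""

    def __init__(self, send_delay: float) -> None:
        self.request_states: set[str] = set()
        self.engine_log: list[tuple[str, str]] = []  # messages sent to EngineCore
        self.send_delay = send_delay

    async def add_request(self, request_id: str) -> None:
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
        self.request_states.add(request_id)
        self.engine_log.append(("ADD", request_id))

    async def abort(self, request_id: str) -> None:
        # Send the ABORT first (the sleep models IPC latency) ...
        await asyncio.sleep(self.send_delay)
        self.engine_log.append(("ABORT", request_id))
        # ... and only then let the Frontend forget the ID.
        self.request_states.discard(request_id)


async def main() -> tuple[bool, list[tuple[str, str]]]:
    frontend = FixedToyFrontend(send_delay=0.05)
    await frontend.add_request("r")
    abort_task = asyncio.create_task(frontend.abort("r"))
    await asyncio.sleep(0)  # let abort() start sending
    try:
        await frontend.add_request("r")  # ID still visible, so this is rejected
        rejected = False
    except ValueError:
        rejected = True
    await abort_task
    await frontend.add_request("r")  # allowed once the ABORT has been sent
    return rejected, frontend.engine_log


rejected, log = asyncio.run(main())
```

Here the add_request inside the window is rejected (rejected is True), and the engine log becomes [('ADD', 'r'), ('ABORT', 'r'), ('ADD', 'r')], which is the sequence EngineCore expects.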

robertgshaw2-redhat · 2025-03-22T12:25:34Z

Thanks for your contribution! I agree that this is a race condition. Appreciate you digging in.

robertgshaw2-redhat · 2025-03-22T12:54:09Z

vllm/v1/engine/output_processor.py

+        self.handle_abort_reqs(request_ids_to_abort)
+        return request_ids_to_abort
+
+    def flatten_req_to_abort(self, req_ids: Iterable[str]) -> list[str]:


Can we call this something more descriptive? get_parent_and_children_reqs?

It should probably also reflect the fact that the parent request is being removed.

> the fact that the parent request is being removed

Yes. Do you have any good suggestions? How about try_pop_parent?
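The naming concern is about a side effect. Below is a hypothetical sketch of what such a helper might do, assuming parent requests (e.g. for parallel sampling with n > 1) are tracked in a dict mapping a parent ID to its child IDs: parent IDs are expanded into child IDs and popped from the dict, while other IDs pass through unchanged. The dict layout is an assumption, and the name get_parent_and_children_reqs is taken from the review suggestion, not from the merged code.

```python
from collections.abc import Iterable


def get_parent_and_children_reqs(
    parent_requests: dict[str, list[str]], req_ids: Iterable[str]
) -> list[str]:
    """Hypothetical: expand parent IDs into child IDs, removing each parent.

    parent_requests maps a parent request ID to the IDs of its children
    (an assumed layout, not vLLM's actual data structure).
    """
    flattened: list[str] = []
    for req_id in req_ids:
        # Side effect: a parent entry is removed as it is expanded.
        children = parent_requests.pop(req_id, None)
        if children is None:
            flattened.append(req_id)  # not a parent: abort it as-is
        else:
            flattened.extend(children)
    return flattened
```

With parents = {'p': ['p_0', 'p_1']}, calling the helper on ['p', 'x'] returns ['p_0', 'p_1', 'x'] and leaves parents empty; that removal is what the reviewer wanted the name to convey.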

robertgshaw2-redhat · 2025-03-22T13:05:34Z

Thanks a ton! I reviewed the implementation in detail and you have fixed the problem! Just left some minor comments about naming the functions and comments. Ping me on slack when this is ready!

njhill · 2025-03-26T22:07:01Z

Thanks for this @hidva, I agree with @robertgshaw2-redhat's comments.

However, I was already thinking it might be more robust to have the engine return finished notifications for all requests, including those whose abort is initiated from the front-end process. Currently it just stops sending any outputs for these, but we could change it so that there is a terminating RequestOutput with "aborted" finish_reason in these cases.

Then we can clean up the output processor request states based on these responses rather than the current logic that's a bit disjoint.

Another reason to do this is that in addition to the leak that you pointed out, there may still be a bug where such aborted requests aren't captured properly in the metrics, because _update_stats_from_finished never gets called for them.
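A hypothetical sketch of this proposal, assuming the engine emits one final output per request, with finish_reason set to "aborted" for front-end-initiated aborts. ToyEngineOutput and EngineDrivenOutputProcessor are illustrative names, not vLLM's API. Front-end state is freed in exactly one place, process_outputs, so metrics see aborted requests too.

```python
from collections import Counter
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyEngineOutput:
    """Hypothetical stand-in for a per-request engine output."""
    request_id: str
    finish_reason: Optional[str] = None  # e.g. "stop", "length", or "aborted"


class EngineDrivenOutputProcessor:
    """Hypothetical: request state is freed only on the engine's final output."""

    def __init__(self) -> None:
        self.request_states: dict[str, list[str]] = {}
        self.finished_by_reason: Counter[str] = Counter()

    def add_request(self, request_id: str) -> None:
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
        self.request_states[request_id] = []

    def abort_requests(self, request_ids: Iterable[str]) -> list[str]:
        # The front-end still initiates the abort but keeps the state;
        # the returned IDs would be forwarded to EngineCore.
        return [rid for rid in request_ids if rid in self.request_states]

    def process_outputs(self, outputs: Iterable[ToyEngineOutput]) -> None:
        for out in outputs:
            if out.request_id not in self.request_states:
                continue
            if out.finish_reason is not None:
                # Single cleanup path for every kind of termination, so the
                # metrics also count aborted requests.
                del self.request_states[out.request_id]
                self.finished_by_reason[out.finish_reason] += 1
```

Because abort_requests keeps the state, add_request with the same ID keeps raising ValueError until the engine's final "aborted" output has been processed, which would also close the race described in this PR.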

hidva · 2025-03-27T06:32:37Z

Apologies for the delay; I was on vacation until now. I will continue to follow up on this PR.

hidva · 2025-03-27T07:25:45Z

> the engine return finished notifications for all requests,

However, there are scenarios where only the frontend can tell the engine to stop producing output, such as when a stop string is matched or when the client disconnects. If the engine returns finished notifications for all requests, how would it learn about external conditions such as a client disconnect?

> _update_stats_from_finished never gets called for them.

Yes, we should add a call to _update_stats_from_finished within handle_abort_reqs, and at the same time, ensure that _update_stats_from_finished is idempotent. This way, requests that are aborted due to client disconnection can also be captured properly in the metrics.

In other words, having introduced the concepts of aborted and finished requests, we also introduced two interfaces: finish_request() (renamed to free_finished_reqs) and handle_abort_reqs() (renamed to free_aborted_reqs). Every finished request must ultimately go through free_finished_reqs(), and every aborted request through free_aborted_reqs(), to complete resource cleanup. All resource cleanup must be idempotent. See commit: Unified the resource cleanup for aborted and finished requests
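A minimal sketch of idempotent cleanup as described, with free_finished_reqs and free_aborted_reqs sharing one code path. _record_stats is a hypothetical stand-in for _update_stats_from_finished, and the counters and class name are illustrative; this is not the PR's code.

```python
from collections.abc import Iterable


class IdempotentToyOutputProcessor:
    """Hypothetical sketch of two idempotent cleanup entry points."""

    def __init__(self) -> None:
        self.request_states: dict[str, object] = {}
        self.num_finished = 0
        self.num_aborted = 0

    def add_request(self, request_id: str) -> None:
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
        self.request_states[request_id] = object()

    def _record_stats(self, aborted: bool) -> None:
        # Stand-in for _update_stats_from_finished.
        if aborted:
            self.num_aborted += 1
        else:
            self.num_finished += 1

    def _free(self, request_ids: Iterable[str], aborted: bool) -> list[str]:
        freed: list[str] = []
        for request_id in request_ids:
            # pop() with a default turns repeated calls into no-ops, so
            # stats are recorded at most once per request.
            if self.request_states.pop(request_id, None) is not None:
                self._record_stats(aborted)
                freed.append(request_id)
        return freed

    def free_finished_reqs(self, request_ids: Iterable[str]) -> list[str]:
        return self._free(request_ids, aborted=False)

    def free_aborted_reqs(self, request_ids: Iterable[str]) -> list[str]:
        return self._free(request_ids, aborted=True)
```

Freeing the same ID twice returns an empty list on the second call, so the counters never double-count a request.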

njhill · 2025-03-27T17:12:55Z

Thanks @hidva, just to be clear: I think this PR would be good to merge in its current form, but we should consider a follow-on to address the other things I mentioned.

> the engine return finished notifications for all requests,

The front-end would still initiate the aborts in the same way, i.e. for client disconnection and stop strings. It's just that the engine would now be guaranteed to subsequently return a final RequestOutput for these with aborted finish reason (this will require a change in the engine of course).

> _update_stats_from_finished never gets called for them.

Regardless of idempotence, I think it would be nice to always do the cleanup when we receive the final response for a given request, irrespective of how it was terminated.

hidva · 2025-04-01T01:50:00Z

@njhill Is there anything else that needs to be done for this PR? Also, I'm not sure why the two tests are failing.

njhill · 2025-04-02T03:50:45Z

@hidva it seems that the test is hanging. Could you try merging in the latest main again? It's possible that it's a side-effect of the changes.

hidva · 2025-04-15T14:06:23Z

@njhill Could you help me rerun the Entrypoints test? It seems like a fluke, and I don't have the necessary permissions.

hidva requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners March 22, 2025 09:08

mergify bot added the v1 label Mar 22, 2025

hidva commented Mar 22, 2025

vllm/v1/engine/output_processor.py (resolved)

robertgshaw2-redhat reviewed Mar 22, 2025

vllm/v1/engine/output_processor.py (outdated, resolved)

robertgshaw2-redhat reviewed Mar 22, 2025

vllm/v1/engine/output_processor.py (outdated, resolved)

robertgshaw2-redhat reviewed Mar 22, 2025

vllm/v1/engine/output_processor.py (resolved)

robertgshaw2-redhat reviewed Mar 22, 2025

vllm/v1/engine/async_llm.py (resolved)

mergify bot added the tpu label (Related to Google TPUs) Mar 27, 2025

hidva force-pushed the main branch from 03867b2 to 3e849bf, March 27, 2025 07:09

njhill reviewed Mar 27, 2025

vllm/v1/engine/output_processor.py (outdated, resolved)

mergify bot removed the tpu label (Related to Google TPUs) Mar 28, 2025

hidva force-pushed the main branch from 909955b to 817c4d4, March 28, 2025 01:58

hidva force-pushed the main branch from 817c4d4 to 2b3befc, April 2, 2025 05:39

mergify bot added and removed the tpu label (Related to Google TPUs) Apr 9, 2025

hidva added 4 commits April 15, 2025 15:54

[Bugfix][Frontend] Fixed issue where requests with duplicate request IDs might be sent to EngineCore simultaneously (93634c3)

Signed-off-by: 盏一 <zhanyi.ww@alibaba-inc.com>

Addressed feedback from pull request (cbf6c03)

Signed-off-by: 盏一 <zhanyi.ww@alibaba-inc.com>

Unified the resource cleanup for aborted and finished requests (e22447a)

Signed-off-by: 盏一 <zhanyi.ww@alibaba-inc.com>

fix typo (cf1cb47)

Signed-off-by: 盏一 <zhanyi.ww@alibaba-inc.com>

hidva force-pushed the main branch from 2b3befc to cf1cb47, April 15, 2025 07:54
