
perf: Python binding inference performance improvement #426

Merged
kthui merged 27 commits into main from jacky-py-res-callback on Feb 8, 2025

Conversation

kthui
Contributor

@kthui kthui commented Jan 11, 2025

What does the PR do?

Refactor infer() and async_infer() APIs to handle memory allocation and callbacks internally in C++, and only expose the basic interface for the Python iterator to fetch responses.
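For orientation, a minimal sketch of the narrowed binding surface (the GetNextResponse()/AddNextResponse() names follow the functions referenced in the review below; the actual declarations in tritonserver_pybind.cc may differ):

#include <pybind11/pybind11.h>
namespace py = pybind11;

// Sketch only: allocation and the Triton callbacks live entirely inside the
// C++ wrapper; Python only hands in an asyncio future to receive the next
// response.
class PyInferenceRequest {
 public:
  // Called from Python by the response iterator: register a future that will
  // be fulfilled with the next response.
  void GetNextResponse(const py::object& py_future);

  // Called from the C++ response callback: fulfill the pending future.
  void AddNextResponse(py::object response);
};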

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

triton-inference-server/server#7949

Where should the reviewer start?

Start with tritonserver_pybind.cc for the interface change, then move on to _model.py to see how the Python iterator interacts with the interface. Finally, look at _request.py and _response.py to see how they support the Python iterator.

For testing, start with test_binding.py and test_api.py, and then _tensor.py for the DLPack limitation regarding bytes.

Test plan:

The existing L0_python_api test is sufficient for catching any regression from this performance improvement. It is modified to exercise the new interface.

  • CI Pipeline ID: 23514700

Caveats:

Users are no longer able to specify a custom:

  • request release callback
  • response allocator
  • response callback

Currently, only CPU memory output is supported at the binding level, so GPU memory output will involve an extra D2H copy at the backend and an H2D copy at the frontend. This will be resolved as a follow-up.

The test_stop failure will have to be triaged and fixed as a follow-up.

Background

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

* fix request tensor lifecycle

* remove prints for benchmarking

* remove one h2h response copy

* add allocator delete

* schedule future.set_result to be called in event loop
@kthui kthui added the PR: perf A code change that improves performance label Jan 11, 2025
@kthui kthui self-assigned this Jan 11, 2025
* Use tensor object in output

* Enable infer() to use the improved binding

* Pass the C pointer directly to Python

* Move output device if differ from request setting

* Copyright and pre-commit
@kthui kthui force-pushed the jacky-py-res-callback branch from 300b0b6 to 131078a Compare January 11, 2025 02:52
@kthui kthui marked this pull request as ready for review January 15, 2025 19:43
@@ -137,6 +137,7 @@ def test_memory_fallback_to_cpu(self, server_options):

tritonserver.default_memory_allocators[tritonserver.MemoryType.GPU] = allocator
Contributor

should this be removed - or changed in some way to indicate the allocator is internal ....

Contributor Author

yes, updated the test to indicate the allocator is internal, and it will always use CPU memory regardless of the backend memory preference.

@@ -164,6 +165,7 @@ def test_memory_allocator_exception(self, server_options):
):
pass

@pytest.mark.skip(reason="Skipping test, infer no longer use allocator")
Contributor

we should keep this but refactor it - if the user requests output memory type GPU and that is not supported by the internal allocator, we would still want to raise an exception during inference

Contributor Author

yes, we should raise an exception if the output memory type specified on the request is not supported, but currently the bindings do not accept a requested output memory type, so I think we can skip this test for now and add proper testing after adding support for allocating GPU memory.

Contributor

hmm, but what if we fail because, say, cupy is not available? Can we still make sure the right error gets propagated?

Contributor Author

yes, the "test_unsupported_memory_type" is repurposed for testing moving outputs to unsupported memory type

@@ -418,6 +420,9 @@ def test_ready(self, server_options):
server = tritonserver.Server(server_options).start()
assert server.ready()

@pytest.mark.skip(
reason="Skipping test, some request/response object may not be released which may cause server stop to fail"
Contributor

Can we use xfail instead of skip?

@rmccorm4 - what do you think - as we had spent time on fixing this, how much of an issue is this?

Contributor Author

yes, switched to xfail.

* Indicate allocator is internal on test_memory_fallback_to_cpu()

* Remove test_memory_allocator_exception() as infer no longer use custom allocators

* Update reason for skipping test_unsupported_memory_type()

* Mark test_stop() with xfail instead of skip
TRITONSERVER_MemoryType* actual_memory_type,
int64_t* actual_memory_type_id)
{
*buffer = malloc(byte_size * sizeof(uint8_t));
Contributor

are there tritonserver apis we should be using for memory allocation instead of directly calling malloc?

Contributor

@GuanLuo to double check - should we be using the backend_memory_manager ? Reason I ask is that these / and the server already do support the different types of memory gpu / cpu / cpu pinned - and we would want that kind of support here if we can get it - and we should reuse it ...

Contributor

What is in the server certainly won't work; it is tightly tied to the server frontend for optimization, which is why the allocator is left for the user to define.

You probably can reuse the one from the backend, but that is kind of strange. Plus the one in the backend may not be as well crafted as you expect. Please evaluate whether it makes sense to proceed in this way.

Contributor

Let's make that a todo as we bring in support for more memory types - we should look to see if we can reuse a common memory allocator in the core - of the ones, backend_memory_manager seems the closest - but agreed, it is odd to reuse here directly.

*actual_memory_type = TRITONSERVER_MEMORY_CPU;
*actual_memory_type_id = 0;
// The response allocator needs to be kept alive until the allocated memory
// is released via the release function.
Contributor

is this true? - question: why not an allocator singleton for cpu - and reuse it for all requests?

Contributor Author
@kthui kthui Jan 17, 2025

yes, the core needs the allocator opaque object to find the release function that was set. The allocator opaque object is basically this class, which stores the release function pointer.

yes, updated to use singleton allocator object for all instances of the request wrapper.
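
A minimal sketch of that singleton approach, assuming ResponseAlloc/ResponseRelease callbacks with the standard TRITONSERVER allocator signatures and the ThrowIfError helper used elsewhere in this binding (the actual code may differ):

#include "triton/core/tritonserver.h"

// Callback declarations following TRITONSERVER_ResponseAllocatorAllocFn_t and
// TRITONSERVER_ResponseAllocatorReleaseFn_t (names are illustrative).
TRITONSERVER_Error* ResponseAlloc(
    TRITONSERVER_ResponseAllocator* allocator, const char* tensor_name,
    size_t byte_size, TRITONSERVER_MemoryType memory_type,
    int64_t memory_type_id, void* userp, void** buffer, void** buffer_userp,
    TRITONSERVER_MemoryType* actual_memory_type,
    int64_t* actual_memory_type_id);
TRITONSERVER_Error* ResponseRelease(
    TRITONSERVER_ResponseAllocator* allocator, void* buffer,
    void* buffer_userp, size_t byte_size, TRITONSERVER_MemoryType memory_type,
    int64_t memory_type_id);

// One process-wide allocator, created on first use and shared by every request
// wrapper; it is never deleted, so no per-buffer lifetime tracking is needed.
TRITONSERVER_ResponseAllocator* SingletonResponseAllocator()
{
  static TRITONSERVER_ResponseAllocator* allocator = [] {
    TRITONSERVER_ResponseAllocator* a = nullptr;
    ThrowIfError(TRITONSERVER_ResponseAllocatorNew(
        &a, ResponseAlloc, ResponseRelease, nullptr /* start_fn */));
    return a;
  }();
  return allocator;
}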

int64_t memory_type_id)
{
free(buffer);
// Release ownership of the response allocator.
Contributor

thinking a singleton could be simpler and avoid alloc / dealloc of the allocator object itself - if possible.

Contributor Author

yes, updated to use a singleton response allocator.

// Release ownership of the response allocator.
std::unique_ptr<std::shared_ptr<struct TRITONSERVER_ResponseAllocator>>
allocator_ptr(reinterpret_cast<
std::shared_ptr<struct TRITONSERVER_ResponseAllocator>*>(
Contributor

actually - what do we use the buffer_userp for?

Contributor Author

it was used to store the shared pointer to the response allocator object and was passed to the release function, so the reference count could be incremented on allocation and decremented on free. That ensured the allocator was destructed only after the last allocated response buffer was freed.

but it is no longer necessary with the singleton allocator.
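
Under the singleton scheme the release callback no longer needs buffer_userp at all; a sketch (CPU-only buffers allocated with malloc, as in this PR; the function name matches the ResponseRelease declaration in the singleton sketch above):

#include <cstdlib>

TRITONSERVER_Error*
ResponseRelease(
    TRITONSERVER_ResponseAllocator* allocator, void* buffer,
    void* buffer_userp, size_t byte_size, TRITONSERVER_MemoryType memory_type,
    int64_t memory_type_id)
{
  // With a process-wide singleton allocator there is no per-buffer ownership
  // to drop, so buffer_userp is unused; just free the CPU buffer.
  free(buffer);
  return nullptr;  // success
}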

if (response_future_.get() != nullptr) {
throw AlreadyExistsError("cannot call GetNextResponse concurrently");
}
response_future_.reset(new py::object(py_future));
Contributor

not sure I follow the logic here - why do we create a new object based on the passed in object? why not a reference on the original object?

second: how does the swap below work?

Contributor Author
@kthui kthui Jan 17, 2025

I tried holding a raw pointer to the original object, but the pointer can no longer be dereferenced (segmentation fault) after GetNextResponse() returns, so the only safe way of holding on to the future object after GetNextResponse() returns is to increment the reference count on the future object, which is achieved by "copying" it into a new object.

the reason for using a unique pointer is to have a mechanism for checking whether a future object is pending to be set with the next response without accessing any Python variable, to avoid holding the GIL for that simple check. It could be achieved with a simple bool flag, but I think it is safer to have one variable than two, to avoid setting one variable but forgetting about the other.
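
A minimal sketch of those two points (pybind11; this reflects the unique_ptr-based version discussed at this point in the thread — a later commit stores the py::object without the pointer wrapper, and HasPendingFuture is an illustrative name for the GIL-free check):

#include <memory>
#include <pybind11/pybind11.h>
namespace py = pybind11;

class PyInferenceRequest {
 public:
  // Called from Python with the GIL held.
  void GetNextResponse(const py::object& py_future)
  {
    // Copy-constructing a py::object increments the refcount of the Python
    // future, so it remains valid after this call returns.
    response_future_.reset(new py::object(py_future));
  }

  bool HasPendingFuture() const
  {
    // Plain pointer check: no Python object is touched, so this does not
    // require holding the GIL.
    return response_future_ != nullptr;
  }

 private:
  std::unique_ptr<py::object> response_future_;
};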

Contributor Author
@kthui kthui Jan 27, 2025

Experiment: Increment the reference count on the py_future object, and see if the object remains valid after the function returns.

TODO: new will always allocate on the heap; can we keep the variable on the stack to avoid a new/delete for every response?

Contributor

Can we have some kind of small memory leak/growth pytest test that checks memory before, does some inferences, and checks memory after?

Contributor Author

The implementation is updated to remove the pointer wrapper around response_future_, with more comments clarifying the flow between GetNextResponse() and AddNextResponse(): Avoid extra malloc when tracking response_future_
cc @nnshah1 @rmccorm4

Contributor Author
@kthui kthui Jan 28, 2025

Can we have some kind of small memory leak/growth pytest test that checks memory before, does some inferences, and checks memory after?

Do you think something like this makes sense? I think there will always be some small growth, but we can set the acceptable growth on the test case and have the test fail if it grows beyond what we set.

More discussion here: #426 (comment)

Contributor
@rmccorm4 rmccorm4 Jan 28, 2025

Nice find! Looks worth a try to me - I see it's used by quite a few big projects - https://github.com/search?q=pytest-memray&type=code

  • pydantic
  • datadog
  • huggingface
  • ...

I'm curious how much time overhead it adds to run.

Contributor

but we can set the acceptable growth on the test case and have the test fail if it grows beyond what we set.

Let's make sure whatever threshold we set isn't tailored to the specific test case or number of iterations, and doesn't get hit if we increase the iterations on the test case or something

{
py::gil_scoped_acquire gil;
std::unique_ptr<py::object> response_future_local(nullptr);
response_future_.swap(response_future_local);
Contributor

not sure I follow this - we create a new py object from null, swap it with the stored one (releasing it?) - then set result on the nullptr based object?...

Or ... I guess swap works the opposite way :)

we swap the null ptr for the one stored in the future ....

Contributor Author

the guarantee that GetNextResponse() will not be called with a new future object (replacing this one) is that the current future object is not yet set with a result. So as soon as this future object is set with a result, we need to be ready for the next future object from GetNextResponse() to take its place. Thus, the solution here is to move the current future object from the class member into a local variable, making the class member ready for the next future object before the current one is set with a result.
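
A minimal sketch of that swap-then-fulfill ordering (simplified: the real code schedules future.set_result onto the asyncio event loop rather than calling it inline, per the "schedule future.set_result to be called in event loop" commit):

void PyInferenceRequest::AddNextResponse(py::object response)
{
  py::gil_scoped_acquire gil;
  // Move the pending future into a local so the member is already empty for
  // the next GetNextResponse() call before the result is delivered.
  std::unique_ptr<py::object> response_future_local(nullptr);
  response_future_.swap(response_future_local);
  // Fulfilling the future may immediately let the iterator register a new
  // future; that is safe because the member slot was cleared above.
  (*response_future_local).attr("set_result")(response);
}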

inference_request.response_queue,
raise_on_error,
)
self._server.infer_async(request)
Contributor

how do we pass in the requested memory type to the C++ side?

Contributor Author

I think we can add a new binding function to the request wrapper that allows setting a requested output memory type to the request wrapper object, then the response allocation function can access the requested memory type from the request wrapper object and allocate the correct memory type.
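
A hypothetical sketch of that follow-up (not part of this PR; the SetRequestedOutputMemoryType name and members are made up for illustration):

class PyInferenceRequest {
 public:
  // Hypothetical binding: Python records the requested output memory type on
  // the request wrapper, and the allocation callback reads it back when
  // allocating the response buffer.
  void SetRequestedOutputMemoryType(
      TRITONSERVER_MemoryType memory_type, int64_t memory_type_id)
  {
    requested_memory_type_ = memory_type;
    requested_memory_type_id_ = memory_type_id;
  }

 private:
  TRITONSERVER_MemoryType requested_memory_type_{TRITONSERVER_MEMORY_CPU};
  int64_t requested_memory_type_id_{0};
};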

Contributor

let's make sure the stories to follow up on customizing allocation are in place and scheduled. Priority would be to specify the memory type (device id, etc.)

Contributor Author

yes, DLIS-7824.

// The request wrapper needs to be kept alive until the release callback is
// invoked, so the Triton request can be deleted after the core/backend is
// done with it.
std::unique_ptr<std::shared_ptr<PyInferenceRequest>> request_ptr(
Contributor

IIUC, the unique pointer is used here to make sure the resource is cleaned up if an exception is thrown below. But if that is the intention, I don't think the current code does the job, as ownership is released before calling into TRITONSERVER_InferenceRequestSetReleaseCallback, which leaves the exception path unguarded.
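
A sketch of the ordering being suggested (a fragment; names such as triton_request and InferRequestComplete are placeholders for the actual request pointer and release callback):

// Keep ownership in the unique_ptr until registration succeeds; if
// ThrowIfError throws, the keep-alive handle is still freed.
auto request_ptr =
    std::make_unique<std::shared_ptr<PyInferenceRequest>>(request);
ThrowIfError(TRITONSERVER_InferenceRequestSetReleaseCallback(
    triton_request, InferRequestComplete, request_ptr.get()));
// Only now does the release callback own the handle, so give up local
// ownership here.
request_ptr.release();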

struct TRITONSERVER_InferenceRequest* request, const uint32_t flags,
void* userp)
{
std::unique_ptr<std::shared_ptr<PyInferenceRequest>> request_ptr(
Contributor

Yea, at this point just reinterpret cast and delete should be fine.

ThrowIfError(TRITONSERVER_ResponseAllocatorSetBufferAttributesFunction(
allocator_.get(), ResponseAllocatorBufferAttributesFn));
});
}
Contributor

In this case, I think you can move the allocator initialization and the related callback functions out to a separate class and make it a singleton, so that this SetResponseAllocator function simply grabs the Triton allocator pointer from the singleton.

Contributor Author

Yes, this is a good idea. I think we can proceed with the refactoring when we add GPU memory support to the allocators, where the class abstraction will be necessary to keep the function block short. DLIS-7824

auto cr = reinterpret_cast<CallbackResource*>(userp);
cr->release_fn(cr->request, flags, cr->user_object);
delete cr;
ThrowIfError(TRITONSERVER_BufferAttributesSetMemoryType(
Contributor

Should return error as that is what the caller expects
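
A sketch of returning the error instead of throwing, following the TRITONSERVER_ResponseAllocatorBufferAttributesFn_t signature (CPU-only, as in this PR; simplified relative to the actual binding code):

TRITONSERVER_Error*
ResponseAllocatorBufferAttributesFn(
    TRITONSERVER_ResponseAllocator* allocator, const char* tensor_name,
    TRITONSERVER_BufferAttributes* buffer_attributes, void* userp,
    void* buffer_userp)
{
  // Propagate failures to the caller as TRITONSERVER_Error* rather than
  // throwing a C++ exception across the C callback boundary.
  TRITONSERVER_Error* err = TRITONSERVER_BufferAttributesSetMemoryType(
      buffer_attributes, TRITONSERVER_MEMORY_CPU);
  if (err != nullptr) {
    return err;
  }
  err = TRITONSERVER_BufferAttributesSetMemoryTypeId(buffer_attributes, 0);
  if (err != nullptr) {
    return err;
  }
  return nullptr;  // success
}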

@kthui
Contributor Author

kthui commented Feb 4, 2025

What is in the server certainly won't work; it is tightly tied to the server frontend for optimization, which is why the allocator is left for the user to define.

You probably can reuse the one from the backend, but that is kind of strange. Plus the one in the backend may not be as well crafted as you expect. Please evaluate whether it makes sense to proceed in this way.

I think we will proceed with a static class in the binding that is a singleton and exclusive to the binding, which has support for both CPU and GPU allocation and is optimized for the binding use case. DLIS-7824

@kthui kthui requested review from GuanLuo, rmccorm4 and nnshah1 February 6, 2025 18:00
# increase is not proportional to the increase in inference requests.
@pytest.mark.parametrize("infer_repeat", [200, 500, 1000, 8000])
def test_inference_leak(self, server_options, infer_repeat):
min_mem_inc, max_mem_inc = 20000, 120000
Contributor

are these values in bytes? 20KB and 120KB?

Contributor Author

yes, those are in bytes, so 20KB and 120KB

Contributor

Can you add _bytes suffix to the var names?

Contributor Author

Sure, added the _bytes suffix. Add _bytes suffix to mem size vars

Comment on lines 664 to 665
# max_infer_repeat / min_infer_repeat >> max_mem_inc / min_mem_inc, so the memory
# increase is not proportional to the increase in inference requests.
Contributor

If there is no per-request/response memory leaks, shouldn't sending ~200 requests and ~8000 requests with the same concurrency/batching see roughly the same amount of memory usage before/after?

Do you have more detailed results/findings from local experiments that led you to these numbers and methodology?

Contributor Author

I found that the memory usage increment starts higher and converges to around 40KB locally on my machine as the test progresses, regardless of the request count, i.e.

  1. requests = 200; memory increment = 71193
  2. requests = 500; memory increment = 60405
  3. requests = 1000; memory increment = 40133
  4. requests = 8000; memory increment = 38519

and in the opposite order:

  1. requests = 8000; memory increment = 83592
  2. requests = 1000; memory increment = 47596
  3. requests = 500; memory increment = 41406
  4. requests = 200; memory increment = 40853

requests = 1000 repeat 6x:

  1. memory increment = 84436
  2. memory increment = 41967
  3. memory increment = 40166
  4. memory increment = 45353
  5. memory increment = 41281
  6. memory increment = 34740

Comment on lines +295 to +298
if self._iterator_queue is None:
response = self._get_next_response()
else:
response = self._iterator_queue.get()
Contributor

Why is iterator_queue only used when a user_queue is provided? Would it make sense to always use iterator_queue so there is no if/else here?

One major difference currently seems to be getting the next response on demand from iteration calls to __next__ vs. pre-fetching the responses into the iterator queue in a background thread, even if the response iterator isn't iterated on.

Can you elaborate on your thoughts/decision here?

Contributor Author

Why is iterator_queue only used when a user_queue is provided? Would it make sense to always use iterator_queue for no if/else here?

The code would be a bit simpler if the iterator_queue were always used, but it is not needed if the user did not provide an additional response queue to be fed. In that case the iterator can pull directly from the binding without the intermediate iterator queue, which in theory allows a higher maximum output throughput / lower latency than going through the intermediate iterator queue (and its thread/coroutine).

In other words, we can bypass the intermediate queue when there is only a single consumer of the producer at the binding. If there is more than one consumer, an additional queue per consumer is needed to allow the consumers to consume at different rates without blocking each other. A key part of the test is to make sure the user-provided queue is fed with responses without having to iterate over the iterator.

Contributor
@rmccorm4 rmccorm4 left a comment

Majority of changes LGTM - but left a few questions to wrap up tomorrow

@kthui kthui requested a review from rmccorm4 February 7, 2025 21:57
@rmccorm4
Contributor

rmccorm4 commented Feb 8, 2025

Let's prioritize the follow-ups for consistent tests (grpc fork errors), GPU memory support, etc.

@rmccorm4
Contributor

rmccorm4 commented Feb 8, 2025

Nice work Jacky! Thanks for pushing this through!

@kthui kthui merged commit d53c1f7 into main Feb 8, 2025
2 checks passed
@kthui kthui deleted the jacky-py-res-callback branch February 8, 2025 01:28