Concurrent requests throw a memory exception #1033
-
Creating a new context doesn't load the entire model again.
Re-using the same context for multiple conversations is slightly better (there is some overhead per-context). The problem with your implementation is that you're never clearing the context, so everything that has previously been said is still in the KV cache. Adding in …
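A minimal sketch of that approach, assuming LLamaSharp's `LLamaWeights`/`LLamaContext` API and the `NativeHandle.KvCacheClear()` call mentioned later in this thread; the model path and parameter values are placeholders:

```csharp
using System;
using LLama;
using LLama.Common;

// Placeholder model path and parameters (assumptions, adjust as needed).
var parameters = new ModelParams("phi-3.5-mini.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 32,
};

// Load the weights once and create a single context to reuse.
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

foreach (var prompt in new[] { "First conversation ...", "Second conversation ..." })
{
    var executor = new InteractiveExecutor(context);
    await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
        Console.Write(token);

    // Without this, everything said so far stays in the KV cache and
    // leaks into (and eventually crowds out) the next conversation.
    context.NativeHandle.KvCacheClear();
}
```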
-
Thanks for the clarification, so I have two options:
For the second case, I have made the necessary changes: I keep the weights static and create a new context each time. But I observed the following: the server starts with 0.0 GB of GPU memory and 17.0 GB of RAM in use. Of course, when a request finishes, the GPU and RAM reserved for it are released, by calling context_llama.NativeHandle.KvCacheClear() at the end as you said. According to my results the overhead is too big, which means we can serve only very limited parallel requests given the available GPU and RAM, unless I am doing something wrong.
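For reference, a minimal sketch of this second option (static weights, per-request context); the `HandleRequestAsync` entry point and the model path are assumptions:

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

public static class ModelHost
{
    // Placeholder path and parameters (assumptions).
    private static readonly ModelParams Params = new("phi-3.5-mini.gguf")
    {
        ContextSize = 4096,
        GpuLayerCount = 32,
    };

    // Loaded once and shared by all requests: the expensive part.
    private static readonly LLamaWeights Weights = LLamaWeights.LoadFromFile(Params);

    // Hypothetical per-request entry point.
    public static async Task<string> HandleRequestAsync(string prompt)
    {
        // Per-request context; its KV cache is the per-context overhead
        // discussed above, and it is released again on dispose.
        using var context = Weights.CreateContext(Params);
        var executor = new InteractiveExecutor(context);

        var response = new StringBuilder();
        await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
            response.Append(token);

        return response.ToString();
    }
}
```

One relevant knob here: the per-context overhead is dominated by the KV cache, whose size scales with `ContextSize`, so a smaller per-request `ContextSize` should shrink the GPU/RAM reservation each parallel request needs.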
-
I have a Blazor Hosted WebAssembly app in .NET 8 and I am testing LLamaSharp with Phi-3.5-mini. The first results seem fine: a question gets its answer. But then some problems occur in more complex scenarios:
Attempted to read or write protected memory. This is often an indication that other memory is corrupt
I use the same static context, because each new context seems to load the model again and multiplies the consumption of RAM and GPU memory. Below is a sample of the setup:
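(A minimal sketch of the shared-static-context pattern described, not the original sample; the class and method names are assumptions.)

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

public static class SharedModel
{
    // Placeholder path (assumption).
    private static readonly ModelParams Params = new("phi-3.5-mini.gguf");
    private static readonly LLamaWeights Weights = LLamaWeights.LoadFromFile(Params);

    // One static context reused by every request, so the model is only
    // loaded once; but nothing ever clears it, so each request's tokens
    // accumulate in the shared KV cache.
    public static readonly LLamaContext Context = Weights.CreateContext(Params);
}

public class ChatService
{
    // Called per incoming request, e.g. from a Blazor API endpoint.
    public async Task<string> AskAsync(string question)
    {
        var executor = new InteractiveExecutor(SharedModel.Context);
        var sb = new StringBuilder();
        await foreach (var token in executor.InferAsync(question, new InferenceParams { MaxTokens = 256 }))
            sb.Append(token);
        return sb.ToString();
    }
}
```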